# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.
## Startup Checks
Before starting the process, confirm:
- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
six Lobby-related streams.
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
the network the Lobby pods run in. Lobby does not ping these at boot,
but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
`stub` value is only for unit tests.
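
Part of the checklist above can be run as a pre-start check. The following is a minimal sketch, assuming the variables are exported in the shell that launches Lobby; `lobby_preflight` is a hypothetical helper name, not something Lobby ships. It only checks that the address variables are non-empty and that the directory backend is not `stub`; it does not test network reachability.

```bash
# Hypothetical pre-start check; not part of Lobby itself.
# Prints one line per problem and returns non-zero if any were found.
lobby_preflight() {
  rc=0
  # Addresses and base URLs the checklist requires to be set.
  for name in LOBBY_REDIS_ADDR LOBBY_USER_SERVICE_BASE_URL LOBBY_GM_BASE_URL; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      rc=1
    fi
  done
  # The stub backend is only meant for unit tests.
  if [ "${LOBBY_RACE_NAME_DIRECTORY_BACKEND:-}" = "stub" ]; then
    echo "unexpected: LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub"
    rc=1
  fi
  return "$rc"
}
```

Running `lobby_preflight && echo "env ok"` in the deployment shell gives a quick yes/no before the process is started.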
At startup the process performs a bounded `PING` against Redis and fails
fast if the ping does not succeed. There are no liveness checks against User
Service or Game Master at boot; failures against those services surface at
request time.
Expected listener state after a healthy start:
- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
Expected log lines:
- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
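
The expected log lines can be checked mechanically against a captured startup log. This is a sketch under two assumptions: the log covers a single process start, and each message appears verbatim somewhere on its line (the actual log format is not specified here). `lobby_check_startup_log` is a hypothetical helper.

```bash
# Hypothetical check: read a startup log on stdin and compare line counts
# with the expectations listed above.
lobby_check_startup_log() {
  log=$(cat)
  rc=0
  while read -r want msg; do
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    got=$(printf '%s\n' "$log" | grep -c -F -- "$msg" || true)
    if [ "$got" -ne "$want" ]; then
      echo "expected $want x '$msg', found $got"
      rc=1
    fi
  done <<EOF
1 lobby starting
1 redis ping ok
1 public http listening
1 internal http listening
6 worker started
EOF
  return "$rc"
}
```

Pipe the pod or service log into the function; any mismatch is printed with the expected and observed counts.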
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
at boot.
`/readyz` is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify that the `lobby.active_games` gauge is non-zero in the metrics
backend after the first traffic;
4. verify that `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero
after GM starts emitting events.
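
Step 2 of the check above can be scripted with `curl`. The sketch below assumes plain HTTP, the default ports `8094` and `8095`, and a 2-second per-probe timeout; the helper names are hypothetical. As noted earlier, a passing `/readyz` still says nothing about ongoing Redis health or worker liveness.

```bash
# Hypothetical helpers: list and probe /healthz and /readyz on both listeners.
# Arguments: host (default localhost), public port (8094), internal port (8095).
lobby_probe_urls() {
  host=${1:-localhost}
  for port in "${2:-8094}" "${3:-8095}"; do
    for path in healthz readyz; do
      echo "http://$host:$port/$path"
    done
  done
}

lobby_probe() {
  rc=0
  for url in $(lobby_probe_urls "$@"); do
    # -f: fail on HTTP >= 400; -sS: quiet but show errors; --max-time bounds each probe.
    if curl -fsS --max-time 2 -o /dev/null "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}
```

`lobby_probe` with no arguments checks a local process on the default ports; pass a host and ports to check a remote pod.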
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
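
The planned-restart steps can be wrapped in a small helper that sends `SIGTERM` and waits for the process to exit. This is a sketch, not Lobby tooling: `lobby_stop_and_wait` is a hypothetical name, and the budget in seconds is chosen by the operator. Because `LOBBY_SHUTDOWN_TIMEOUT` is a per-component budget, the total wait may reasonably be set somewhat higher than that value.

```bash
# Hypothetical wrapper: send SIGTERM and wait up to a budget (seconds) for exit.
# Returns 0 if the process exited in time, 1 if the budget ran out.
lobby_stop_and_wait() {
  pid=$1
  budget=$2
  # If the signal cannot be delivered, treat the process as already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  waited=0
  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$budget" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A non-zero return is the cue to start investigating, for example `lobby_stop_and_wait "$LOBBY_PID" 30 || echo "shutdown exceeded budget"`, where `LOBBY_PID` is whatever PID the supervisor reports.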
## Stuck `starting` Recovery
A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.
Symptoms:
- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
`register_runtime_outcome` follow-up.
Recovery:
1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
Lobby. If absent, Runtime Manager never produced a result; resolve at
the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
`start_failed` path and use `lobby.game.cancel` or `retry_start` once
the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
stop-job and moved the game to `start_failed`. Confirm the orphan
container was removed by Runtime Manager.
Lobby always re-accepts a `start` command for a game that is stuck in
`starting`. The command's first action is a CAS attempt, so a second `start`
from a re-issued admin command will progress the state machine.
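
For step 2 of the recovery, the job-results stream can be scanned for the `runtime_job_id`. The sketch below assumes the ID appears verbatim as a field value in the stream entry; the entry layout is not specified in this runbook, so the four lines of context around a match are only a guess at what is useful. `lobby_find_job_result` and the `REDIS_CLI` override are hypothetical.

```bash
# Hypothetical helper: scan the job-results stream for a runtime_job_id.
# Arguments: job id, optional maximum number of entries to scan (default 1000).
lobby_find_job_result() {
  job_id=$1
  stream=${LOBBY_RUNTIME_JOB_RESULTS_STREAM:-runtime:job_results}
  # XREVRANGE walks newest-first; COUNT bounds the scan on a busy stream.
  ${REDIS_CLI:-redis-cli} XREVRANGE "$stream" + - COUNT "${2:-1000}" |
    grep -F -B 4 -A 4 -- "$job_id"
}
```

The function returns non-zero when nothing matches within the scanned window, which corresponds to the "result absent" branch; widen the window before concluding that Runtime Manager never produced a result.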
## Stuck Stream Offsets
Three stream-lag gauges describe the consumer health:
- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`
A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:
1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
and advances the offset; this should not stall the stream. If the gauge
keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
retries on every cycle. Inspect the latest log lines to identify the
error class (Redis transient, RND store error, RuntimeManager publish
failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
advance progress. Check pod restart counts and `cmd/lobby` panics.
After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention to the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
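
Before skipping an entry, it helps to see the persisted offset and the entry that follows it. The sketch below assumes the offset key holds a plain stream ID string (this runbook reads it with `GET`) and that Redis is 6.2 or newer, which is required for the exclusive `(` range prefix. `lobby_next_after_offset` and the `REDIS_CLI` override are hypothetical.

```bash
# Hypothetical helper: show the persisted offset and the first entry after it.
# Arguments: offset label (e.g. gm_events), stream name (e.g. gm:lobby_events).
lobby_next_after_offset() {
  label=$1
  stream=$2
  offset=$(${REDIS_CLI:-redis-cli} GET "lobby:stream_offsets:$label")
  if [ -z "$offset" ]; then
    echo "no offset stored for label '$label'" >&2
    return 1
  fi
  echo "persisted offset: $offset"
  # The "(" prefix makes the lower bound exclusive (Redis 6.2+).
  ${REDIS_CLI:-redis-cli} XRANGE "$stream" "($offset" + COUNT 1
}
```

Whether the persisted value is the last processed ID or the ID being retried is not stated here, so compare the output with the consumer's latest error log before writing a new value to the key.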
## Pending Registration Window Expiry
The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.
The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the constant is final.
The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level and the worker waits for the next tick. Because this is
a periodic worker rather than a stream consumer, there is no offset to hold
or advance. A backlog of expirable entries is therefore self-healing once the
directory is reachable again.
To inspect the backlog:
```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
```
Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
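
To get a count instead of the full listing, `ZCOUNT` can be bounded by the current time. A minimal sketch, assuming scores are Unix milliseconds as stated above; `lobby_expirable_count` and the `REDIS_CLI` override are hypothetical.

```bash
# Hypothetical helper: count pending entries whose score is already in the past.
lobby_expirable_count() {
  # Scores are Unix milliseconds; date +%s yields seconds.
  now_ms=$(( $(date +%s) * 1000 ))
  # "(" makes the upper bound exclusive, matching score < now.
  ${REDIS_CLI:-redis-cli} ZCOUNT lobby:race_names:pending_index -inf "($now_ms"
}
```

A count that stays non-zero across several ticks suggests the `Expire` calls are still failing; check the warn-level logs.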
## Cascade Release Operator Notes
The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:
1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
user holds, with a `lobby.membership.blocked` notification per
third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
`external_block` trigger. In-flight statuses (`starting`, `running`,
`paused`) get a stop-job published to Runtime Manager before the game
record is updated.
The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.
## Diagnostic Queries
A handful of Redis CLI snippets help during incidents:
```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running
# Inspect a specific game record
redis-cli GET lobby:games:<game_id>
# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>
# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
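
The per-status counts can be collected in one pass. This sketch assumes every status uses the same `lobby:games_by_status:<status>` key pattern as the two shown above; `lobby_game_counts` and the `REDIS_CLI` override are hypothetical.

```bash
# Hypothetical helper: print "<status> <count>" for each status given as an argument.
lobby_game_counts() {
  for status in "$@"; do
    printf '%s ' "$status"
    ${REDIS_CLI:-redis-cli} ZCARD "lobby:games_by_status:$status"
  done
}
```

For example, `lobby_game_counts enrollment_open starting running paused` prints one line per status.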
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.