# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.
## Startup Checks
Before starting the process, confirm:
- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis
deployment used for the runtime-coordination state that intentionally
stays on Redis: stream consumers/publishers, stream offsets, per-game
turn-stats aggregates, gap-activation timestamps, and the
capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`,
`LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were
retired in PG_PLAN.md §6A; setting either of the latter two now fails
fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that
hosts the `lobby` schema. The DSN must include `search_path=lobby` and
`sslmode=disable`. Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. After PG_PLAN.md §6A the schema holds
`games`, `applications`, `invites`, `memberships`; after §6B it also
holds `race_names`. The schema and the `lobbyservice` role are
provisioned externally (operator init script in production, the
testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
the network the Lobby pods run in. Lobby does not ping these at boot,
but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
- `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
- `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
- `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
- `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
- `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production
(the default after PG_PLAN.md §6B); the `stub` value selects the
in-memory adapter at `lobby/internal/adapters/racenameinmem/`,
intended for unit tests and small local deployments without
PostgreSQL. The config token name is kept as `stub` for backward
compatibility.
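The required/retired variable checks above can be run as a preflight before launching the process. The `require`/`forbid` helpers and the script shape are illustrative; the variable names and the retired list follow PG_PLAN.md §6A as quoted above.

```bash
# Preflight: fail before launch if a required variable is missing or a
# retired variable is still set (the process would fail fast anyway, but
# this gives a clearer error at deploy time).
require() {
  [ -n "${!1:-}" ] || { echo "missing required env var: $1" >&2; return 1; }
}
forbid() {
  [ -z "${!1:-}" ] || { echo "retired env var is set: $1" >&2; return 1; }
}
preflight() {
  local rc=0 v
  for v in LOBBY_REDIS_MASTER_ADDR LOBBY_REDIS_PASSWORD \
           LOBBY_POSTGRES_PRIMARY_DSN; do
    require "$v" || rc=1
  done
  for v in LOBBY_REDIS_USERNAME LOBBY_REDIS_TLS_ENABLED; do
    forbid "$v" || rc=1
  done
  return "$rc"
}
```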
At startup the process opens the PostgreSQL pool, applies migrations,
pings PostgreSQL, then opens the Redis client and pings Redis. Startup
fails fast if any step fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
Expected listener state after a healthy start:
- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
Expected log lines:
- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
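The expected-log list can be checked mechanically against a captured log. The `check_startup_log` function is illustrative; the marker strings and the six-worker count are the ones listed above.

```bash
# Check a captured startup log for the expected marker lines.
check_startup_log() {
  local log=$1 workers marker
  workers=$(grep -c 'worker started' "$log" || true)
  if [ "$workers" -ne 6 ]; then
    echo "expected 6 'worker started' lines, saw $workers" >&2
    return 1
  fi
  for marker in 'lobby starting' 'redis ping ok' \
                'public http listening' 'internal http listening'; do
    grep -q "$marker" "$log" || { echo "missing marker: $marker" >&2; return 1; }
  done
  echo "startup markers present"
}
```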
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
at boot.
`/readyz` is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify the `lobby.active_games` gauge is non-zero in the metrics backend after
the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
GM starts emitting events.
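Steps 3 and 4 can be spot-checked from a metrics scrape. This sketch assumes a Prometheus-style text exposition in which dots in metric names render as underscores; both the helper and that rendering are assumptions to adjust for your backend.

```bash
# Print the value(s) of a gauge from a Prometheus-style metrics dump.
# Matches on metric-name prefix, so labeled series are included.
gauge_value() {
  local metrics_file=$1 metric=$2
  awk -v m="$metric" '$1 ~ "^"m { print $NF; found = 1 } END { exit !found }' \
    "$metrics_file"
}

# Example (scrape URL is illustrative):
#   curl -s http://lobby:8095/metrics > /tmp/lobby.metrics
#   gauge_value /tmp/lobby.metrics lobby_active_games
```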
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
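A sketch of steps 1, 2 and 4 for a single process, assuming the operator knows the configured shutdown budget in seconds (the helper name and the one-second polling interval are illustrative):

```bash
# Send SIGTERM and poll for exit, allowing roughly the configured
# LOBBY_SHUTDOWN_TIMEOUT (in seconds) before flagging the restart.
graceful_stop() {
  local pid=$1 budget=${2:-30} i
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  for i in $(seq "$budget"); do
    kill -0 "$pid" 2>/dev/null || return 0    # exited within budget
    sleep 1
  done
  echo "shutdown exceeded ${budget}s: investigate" >&2
  return 1
}
```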
## Stuck `starting` Recovery
A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.
Symptoms:
- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
`register_runtime_outcome` follow-up.
Recovery:
1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
Lobby. If absent, Runtime Manager never produced a result; resolve at
the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
`start_failed` path and use `lobby.game.cancel` or `retry_start` once
the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
stop-job and moved the game to `start_failed`. Confirm the orphan
container was removed by Runtime Manager.
Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the first action is a CAS attempt, and a second `start` from a
re-issued admin command will progress the state machine.
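For step 2 of the recovery, one low-tech approach is to capture the stream once and search the capture. The `find_job_result` helper and the capture path are illustrative; the stream name is the one configured above.

```bash
# Search a captured `XRANGE runtime:job_results - +` dump for a job id.
find_job_result() {
  local capture=$1 job_id=$2
  if grep -n -- "$job_id" "$capture"; then
    return 0
  fi
  echo "no result for $job_id: resolve at the runtime layer" >&2
  return 1
}

# Example:
#   redis-cli XRANGE runtime:job_results - + > /tmp/job_results.txt
#   find_job_result /tmp/job_results.txt '<runtime_job_id>'
```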
## Stuck Stream Offsets
Three stream-lag gauges describe the consumer health:
- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`
A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:
1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
and advances the offset; this should not stall the stream. If the gauge
keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
retries on every cycle. Inspect the latest log lines to identify the
error class (Redis transient, RND store error, RuntimeManager publish
failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
advance progress. Check pod restart counts and `cmd/lobby` panics.
After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention to the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
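Skipping a corrupt entry means writing the ID just past it in `<ms>-<seq>` form. A small helper for computing that successor follows; whether the consumer treats the stored offset as inclusive or exclusive is not specified here, so verify against the consumer's behavior before writing the key.

```bash
# Compute the Redis stream ID immediately after a given `<ms>-<seq>` ID.
next_stream_id() {
  local id=$1
  local ms=${id%-*} seq=${id#*-}
  echo "${ms}-$((seq + 1))"
}

# Example: skip a corrupt gm_events entry, then restart the process.
#   redis-cli SET lobby:stream_offsets:gm_events \
#     "$(next_stream_id 1714300000000-4)"
```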
## Pending Registration Window Expiry
The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.
The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override via the env var `LOBBY_PENDING_REGISTRATION_TTL_HOURS` is reserved
for a future change; for now the constant is fixed.
The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.
To inspect the backlog:
```bash
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
```
Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000`
are expirable on the next tick. The partial index
`race_names_pending_eligible_idx` keeps this scan cheap.
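When reading `eligible_until_ms` values from these rows, a quick epoch-milliseconds to UTC conversion helps (assumes GNU `date`):

```bash
# Convert an epoch-milliseconds value (the eligible_until_ms column)
# to a human-readable UTC timestamp. Requires GNU date.
ms_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}
```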
## Cascade Release Operator Notes
The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:
1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
user holds, with a `lobby.membership.blocked` notification per
third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
`external_block` trigger. In-flight statuses (`starting`, `running`,
`paused`) get a stop-job published to Runtime Manager before the game
record is updated.
The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.
## Diagnostic Queries
Durable enrollment state and Race Name Directory bindings live in
PostgreSQL; runtime coordination state stays in Redis. A handful of CLI
snippets help during incidents:
```bash
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"
# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"
# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
FROM lobby.memberships
WHERE game_id = '<game_id>'
ORDER BY joined_at"
# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
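The millisecond prefix of a `<ms>-<seq>` stream ID is the entry's creation time, so the gap between the newest entry (`last-entry` in `XINFO STREAM` output) and the persisted offset can be estimated by subtracting the prefixes. A sketch; the helper name is illustrative.

```bash
# Rough lag in milliseconds between the newest stream entry and the
# persisted consumer offset, from the `<ms>-` prefixes of the two IDs.
offset_lag_ms() {
  local newest=$1 offset=$2
  echo $(( ${newest%-*} - ${offset%-*} ))
}

# Example, with IDs read from the two redis-cli commands above:
#   offset_lag_ms 1714300005000-0 1714300000000-2   # prints 5000
```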
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw PostgreSQL and Redis access is for last-resort
triage.