# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm:

- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis
  deployment used for the runtime-coordination state that intentionally
  stays on Redis: stream consumers/publishers, stream offsets, per-game
  turn-stats aggregates, gap-activation timestamps, and the
  capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`,
  `LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were
  retired in PG_PLAN.md §6A; setting `LOBBY_REDIS_USERNAME` or
  `LOBBY_REDIS_TLS_ENABLED` now fails fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that
  hosts the `lobby` schema. The DSN must include `search_path=lobby` and
  `sslmode=disable`. Embedded goose migrations apply at startup before
  any HTTP listener opens; a migration or ping failure terminates the
  process with a non-zero exit. After PG_PLAN.md §6A the schema holds
  `games`, `applications`, `invites`, `memberships`; after §6B it also
  holds `race_names`. The schema and the `lobbyservice` role are
  provisioned externally (operator init script in production, the
  testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
  the network the Lobby pods run in. Lobby does not ping these at boot,
  but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production
  (the default after PG_PLAN.md §6B); the `stub` value is only for
  unit tests that do not need a real PostgreSQL.

At startup the process opens the PostgreSQL pool, applies migrations,
pings PostgreSQL, then opens the Redis client and pings Redis. Startup
fails fast if any step fails. There are no liveness checks against User
Service or Game Master at boot; those failures surface at request time.
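
The env contract above can be sketched as a pre-start shell check. The variable names come from this runbook; the `lobby_preflight` function itself is a hypothetical operator convenience, not part of the service:

```shell
# Hypothetical pre-start check for the env contract described above.
set -u

lobby_preflight() {
  local ok=0 var
  # Required connection settings must be present.
  for var in LOBBY_REDIS_MASTER_ADDR LOBBY_REDIS_PASSWORD LOBBY_POSTGRES_PRIMARY_DSN; do
    if [ -z "${!var:-}" ]; then
      echo "missing required env var: $var" >&2
      ok=1
    fi
  done
  # Retired vars make the process fail fast; catch them before it does.
  for var in LOBBY_REDIS_USERNAME LOBBY_REDIS_TLS_ENABLED; do
    if [ -n "${!var:-}" ]; then
      echo "retired env var is set: $var" >&2
      ok=1
    fi
  done
  return "$ok"
}
```

Run `lobby_preflight && echo "preflight ok"` in the pod's environment before starting the binary.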

Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
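
A quick probe of both listeners, assuming the default ports above. `check_probes` is a hypothetical helper; `curl -fsS` makes any non-2xx answer fail the check:

```shell
# Hypothetical probe helper for the listener ports above.
check_probes() {
  local port path
  for port in "$@"; do
    for path in /healthz /readyz; do
      if ! curl -fsS --max-time 2 "http://localhost:${port}${path}" >/dev/null; then
        echo "probe failed: :${port}${path}" >&2
        return 1
      fi
    done
  done
  echo "all probes ok"
}
```

Call it as `check_probes 8094 8095` from a host that can reach the pod network.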

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
  at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify `lobby.active_games` gauge is non-zero in the metrics backend after
   the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
   GM starts emitting events.
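
Step 1 can be automated against a captured log; six `worker started` lines are expected per the startup section. `workers_started` is a hypothetical helper operating on a saved log file, not a Lobby command:

```shell
# Hypothetical check: did all six background workers log their start line?
workers_started() {
  local log=$1 expected=${2:-6}
  local got
  got=$(grep -c 'worker started' "$log")
  if [ "$got" -ne "$expected" ]; then
    echo "expected $expected workers, saw $got" >&2
    return 1
  fi
}
```

Typical use: capture the pod's log to a file, then run `workers_started /tmp/lobby.log`.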

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
   on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
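
The restart steps can be wrapped in a small helper. `graceful_stop` is a hypothetical operator function, and the default budget below is an assumption — align it with your `LOBBY_SHUTDOWN_TIMEOUT`:

```shell
# Hypothetical helper: SIGTERM, then wait up to a budget for the process to exit.
graceful_stop() {
  local pid=$1 budget=${2:-30}
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  local i=0
  while kill -0 "$pid" 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -gt "$budget" ]; then
      echo "shutdown exceeded ${budget}s; investigate before escalating" >&2
      return 1
    fi
    sleep 1
  done
}
```

A non-zero return maps to step 4 above: the process outlived its budget and deserves a look before any harder kill.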

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.

Symptoms:

- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
  expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
  `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
   Lobby. If absent, Runtime Manager never produced a result; resolve at
   the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
   with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
   `start_failed` path and use `lobby.game.cancel` or `retry_start` once
   the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
   stop-job and moved the game to `start_failed`. Confirm the orphan
   container was removed by Runtime Manager.

Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the first action is a CAS attempt, and a second `start` from a
re-issued admin command will progress the state machine.
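
Recovery step 2 can be done by dumping the stream and searching for the job id. The dump-and-grep helper is a hypothetical sketch; `runtime:job_results` and `runtime_job_id` are the names used above:

```shell
# Hypothetical triage helper: does a runtime:job_results dump contain a
# result for the runtime_job_id Lobby published?
job_result_present() {
  local job_id=$1 dump=$2
  grep -qF "$job_id" "$dump"
}

# Dump the stream once, then search it for the job id from the game's logs.
redis-cli XRANGE runtime:job_results - + > /tmp/job_results.txt
if job_result_present "<runtime_job_id>" /tmp/job_results.txt; then
  echo "result exists; inspect the success flag, then retry_start or cancel"
else
  echo "no result; resolve at the Runtime Manager layer"
fi
```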

## Stuck Stream Offsets

Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
   and advances the offset; this should not stall the stream. If the gauge
   keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
   retries on every cycle. Inspect the latest log lines to identify the
   error class (Redis transient, RND store error, RuntimeManager publish
   failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
   advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention on the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
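
Redis stream IDs have the form `<ms>-<seq>`, so the smallest ID strictly greater than a given one bumps the sequence part. The helper below is a hypothetical sketch for computing that ID; confirm it against the actual next entry from `XRANGE` before writing the key:

```shell
# Hypothetical helper: smallest stream ID strictly greater than the given one.
next_stream_id() {
  local id=$1
  printf '%s-%s\n' "${id%-*}" "$(( ${id##*-} + 1 ))"
}

# Example: skip a corrupt gm_events entry, then restart the process.
bad_id="1700000000000-5"
redis-cli SET "lobby:stream_offsets:gm_events" "$(next_stream_id "$bad_id")"
```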

## Pending Registration Window Expiry

The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the constant is final.

The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.

To inspect the backlog:

```bash
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"
```

Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000`
are expirable on the next tick. The partial index
`race_names_pending_eligible_idx` keeps this scan cheap.
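
The cutoff can also be computed client-side; this sketch derives the same millisecond epoch as the SQL expression and narrows the query to only the expirable rows:

```shell
# Same cutoff as extract(epoch from now()) * 1000, computed in the shell.
cutoff_ms=$(( $(date +%s) * 1000 ))

psql -c "SELECT COUNT(*)
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
           AND eligible_until_ms <= ${cutoff_ms}"
```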

## Cascade Release Operator Notes

The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:

1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
   user holds, with a `lobby.membership.blocked` notification per
   third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
   inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
   `external_block` trigger. In-flight statuses (`starting`, `running`,
   `paused`) get a stop-job published to Runtime Manager before the game
   record is updated.

The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.
## Diagnostic Queries

Durable enrollment state and Race Name Directory bindings live in
PostgreSQL; runtime coordination state stays in Redis. A handful of CLI
snippets help during incidents:

```bash
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"

# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"

# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
         FROM lobby.memberships
         WHERE game_id = '<game_id>'
         ORDER BY joined_at"

# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"

# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```

The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw PostgreSQL and Redis access is for last-resort
triage.