# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Lobby.
## Startup Checks
Before starting the process, confirm:
- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis deployment used for the runtime-coordination state that intentionally stays on Redis: stream consumers/publishers, stream offsets, per-game turn-stats aggregates, gap-activation timestamps, and the capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`, `LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were retired in PG_PLAN.md §6A; setting either of the latter two now fails fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that hosts the `lobby` schema. The DSN must include `search_path=lobby` and `sslmode=disable`. Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. After PG_PLAN.md §6A the schema holds `games`, `applications`, `invites`, `memberships`; after §6B it also holds `race_names`. The schema and the `lobbyservice` role are provisioned externally (operator init script in production, the testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from the network the Lobby pods run in. Lobby does not ping these at boot, but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production (the default after PG_PLAN.md §6B); the `stub` value is only for unit tests that do not need a real PostgreSQL.
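The checks above can be captured in a hedged example environment file. Hosts and credentials are placeholders, not real endpoints; the stream names shown are the documented defaults, stated explicitly:

```shell
# Example Lobby environment (placeholder hosts/credentials; adjust per deployment).
export LOBBY_REDIS_MASTER_ADDR="redis-master.internal:6379"
export LOBBY_REDIS_PASSWORD="change-me"   # secret-managed in production
export LOBBY_POSTGRES_PRIMARY_DSN="postgres://lobbyservice:change-me@pg-primary.internal:5432/app?search_path=lobby&sslmode=disable"
export LOBBY_USER_SERVICE_BASE_URL="http://user-service.internal:8080"
export LOBBY_GM_BASE_URL="http://game-master.internal:8080"
export LOBBY_RACE_NAME_DIRECTORY_BACKEND="postgres"
# Stream names: the documented defaults, pinned explicitly.
export LOBBY_GM_EVENTS_STREAM="gm:lobby_events"
export LOBBY_RUNTIME_START_JOBS_STREAM="runtime:start_jobs"
export LOBBY_RUNTIME_STOP_JOBS_STREAM="runtime:stop_jobs"
export LOBBY_RUNTIME_JOB_RESULTS_STREAM="runtime:job_results"
export LOBBY_USER_LIFECYCLE_STREAM="user:lifecycle_events"
export LOBBY_NOTIFICATION_INTENTS_STREAM="notification:intents"
```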
At startup the process opens the PostgreSQL pool, applies migrations, pings PostgreSQL, then opens the Redis client and pings Redis. Startup fails fast if any step fails. There are no liveness checks against User Service or Game Master at boot; those are surfaced at request time.
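A pre-start connectivity check that mirrors that boot order (PostgreSQL first, then Redis) can be sketched as a shell helper. The `PG_HOST`/`REDIS_HOST` variables and default hostnames are illustrative assumptions; `pg_isready` and `redis-cli` must be on the PATH:

```shell
# Preflight mirroring Lobby's boot order: PostgreSQL, then Redis.
preflight() {
  # PostgreSQL reachability (host/port are placeholders).
  pg_isready -h "${PG_HOST:-pg-primary.internal}" -p 5432 || return 1
  # Redis reachability with the runtime password.
  redis-cli -h "${REDIS_HOST:-redis-master.internal}" \
            -a "$LOBBY_REDIS_PASSWORD" PING | grep -q PONG || return 1
  echo "preflight ok"
}
```

Run it before starting the process; a failure here would fail the real startup fast anyway, but catching it first avoids a crash-loop.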
Expected listener state after a healthy start:
- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable at boot.

`/readyz` is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
- confirm the process emitted the listener and worker startup logs;
- check `GET /healthz` and `GET /readyz` on both ports;
- verify the `lobby.active_games` gauge is non-zero in the metrics backend after the first traffic;
- verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after GM starts emitting events.
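The probe portion of that checklist can be scripted. The ports are the documented defaults and `localhost` is an assumption (substitute the pod or service address):

```shell
# Probe /healthz and /readyz on both listeners (defaults :8094 public, :8095 internal).
probe_lobby() {
  local host="${1:-localhost}" port path
  for port in 8094 8095; do
    for path in healthz readyz; do
      if curl -fsS "http://${host}:${port}/${path}" >/dev/null; then
        echo "ok   ${port}/${path}"
      else
        echo "FAIL ${port}/${path}"
      fi
    done
  done
}
```

The metric checks remain manual: they depend on your metrics backend, not on the process itself.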
## Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
- send `SIGTERM`;
- wait for the listener and component-stop logs;
- expect any worker that was mid-cycle to retry from the persisted offset on the next process start;
- investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
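The planned-restart procedure can be sketched as a shell helper. PID discovery via `pgrep -f` and the `LOBBY_SHUTDOWN_TIMEOUT_S` variable are assumptions for illustration; adapt both to your supervisor (systemd, Kubernetes, etc.):

```shell
# Send SIGTERM and wait up to the shutdown budget for a clean exit.
graceful_stop() {
  local pid budget="${LOBBY_SHUTDOWN_TIMEOUT_S:-30}"
  pid="$(pgrep -f cmd/lobby)" || { echo "lobby not running"; return 0; }
  kill -TERM "$pid"
  for _ in $(seq "$budget"); do
    # kill -0 probes for existence without sending a signal.
    kill -0 "$pid" 2>/dev/null || { echo "stopped cleanly"; return 0; }
    sleep 1
  done
  echo "still running after ${budget}s; investigate before escalating"
  return 1
}
```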
## Stuck `starting` Recovery
A game that flips to `starting` but never completes one of the post-start steps will stay in `starting` until manual recovery.
Symptoms:
- the `lobby.active_games{status="starting"}` gauge is non-zero for longer than the expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or `register_runtime_outcome` follow-up.
Recovery:
1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by Lobby. If absent, Runtime Manager never produced a result; resolve at the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the `start_failed` path and use `lobby.game.cancel` or `retry_start` once the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a stop-job and moved the game to `start_failed`. Confirm the orphan container was removed by Runtime Manager.
Lobby always re-accepts a start command on a game that is stuck in `starting`: the first action is a CAS attempt, and a second start from a re-issued admin command will progress the state machine.
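The `runtime:job_results` inspection above can be done with a raw stream scan. A sketch, assuming triage-sized streams (for large streams, page with `COUNT`-batched `XRANGE` calls instead); the context-line counts are arbitrary:

```shell
# Scan runtime:job_results for entries mentioning a given runtime_job_id.
# $1 = runtime_job_id to look for.
find_job_result() {
  redis-cli XRANGE runtime:job_results - + | grep -B2 -A4 "$1"
}
```

An empty result means Runtime Manager never produced a result for that job, which points the triage at the runtime layer per step 2.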
## Stuck Stream Offsets
Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`
A persistently increasing gauge means the consumer is unable to advance. Causes and triage:
- Decoder rejects a malformed entry. The consumer logs `malformed_event` and advances the offset; this should not stall the stream. If the gauge keeps climbing, there is a real handler error.
- Handler returns a non-nil error. The consumer holds the offset and retries on every cycle. Inspect the latest log lines to identify the error class (Redis transient, RND store error, RuntimeManager publish failure for cascade events).
- Process restart loop. A crash before persisting the offset does not advance progress. Check pod restart counts and `cmd/lobby` panics.
After the underlying cause is fixed, the consumer resumes from the persisted offset; no manual intervention to the offset key is required in normal operation. If a corrupt entry must be skipped, advance `lobby:stream_offsets:<label>` to the next valid stream ID and restart the process.
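Skipping a corrupt entry can be sketched as follows. The offset key naming follows this runbook; the `seq + 1` step relies on Redis stream IDs being ordered `ms-seq`, and whether the persisted offset is consumed inclusively or exclusively is not specified here — confirm the consumer's resume semantics before running the `SET`:

```shell
# Compute the stream ID immediately after a given ID (ms part kept, seq + 1).
next_stream_id() {
  local ms="${1%%-*}" seq="${1##*-}"
  echo "${ms}-$((seq + 1))"
}

# Example use, only after verifying the entry is truly unprocessable:
#   redis-cli SET "lobby:stream_offsets:gm_events" "$(next_stream_id 1700000000000-0)"
#   ...then restart the process so the consumer re-reads the key.
```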
## Pending Registration Window Expiry
The pending-registration expirer ticks every `LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default 1h) and releases `pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant `service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable override is reserved for a future change under the env var `LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the constant is final.
The worker absorbs Race Name Directory failures: a failing Expire call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.
To inspect the backlog:
```shell
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"
```

Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000` are expirable on the next tick. The partial index `race_names_pending_eligible_idx` keeps this scan cheap.
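Counting the expirable backlog can be scripted; `now_ms` mirrors the `extract(epoch from now()) * 1000` comparison, and `-Atc` makes `psql` print a bare number:

```shell
# Current wall clock in epoch milliseconds (second precision is enough here).
now_ms() { echo "$(( $(date +%s) * 1000 ))"; }

# Rows the expirer will release on its next tick.
count_expirable() {
  psql -Atc "SELECT COUNT(*) FROM lobby.race_names
             WHERE binding_kind = 'pending_registration'
               AND eligible_until_ms <= $(now_ms)"
}
```

A non-zero count that persists across several tick intervals suggests the directory-failure path described above, not normal backlog.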
## Cascade Release Operator Notes
The `user:lifecycle_events` consumer fans out a single user-lifecycle event into many actions:

- Race Name Directory release (`RND.ReleaseAllByUser`).
- Membership status flips (`active` → `blocked`) on every membership the user holds, with a `lobby.membership.blocked` notification per third-party private game.
- Application status flips (`submitted` → `rejected`).
- Invite status flips (`created` → `revoked`) on both addressed and inviter-side invites.
- Owned non-terminal games transition to `cancelled` via the `external_block` trigger. In-flight statuses (`starting`, `running`, `paused`) get a stop-job published to Runtime Manager before the game record is updated.
The cascade is idempotent: every store mutation uses CAS, and `ErrConflict` is treated as "already done". A retry on the next consumer cycle will re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish failure) leaves the offset on the current entry. The next cycle retries the full cascade. Do not advance the offset manually unless you have first verified that the cascade actions for the current entry have been completed out-of-band.
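The out-of-band verification before any manual offset advance can be sketched as queries against the enrollment tables. The column names used here (`user_id`, `invitee_user_id`, `inviter_user_id`) are assumptions for illustration, not confirmed schema; each count should be zero if the cascade completed:

```shell
# Check that no cascade target for a user is still in its pre-cascade status.
# Column names are hypothetical; verify against the actual lobby schema first.
verify_cascade() {
  local user="$1"
  psql -Atc "SELECT COUNT(*) FROM lobby.memberships
             WHERE user_id = '${user}' AND status = 'active'"    # expect 0
  psql -Atc "SELECT COUNT(*) FROM lobby.applications
             WHERE user_id = '${user}' AND status = 'submitted'" # expect 0
  psql -Atc "SELECT COUNT(*) FROM lobby.invites
             WHERE (invitee_user_id = '${user}' OR inviter_user_id = '${user}')
               AND status = 'created'"                            # expect 0
}
```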
## Diagnostic Queries
Durable enrollment state and Race Name Directory bindings live in PostgreSQL; runtime coordination state stays in Redis. A handful of CLI snippets help during incidents:
```shell
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"

# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"

# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
         FROM lobby.memberships
         WHERE game_id = '<game_id>'
         ORDER BY joined_at"

# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"

# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
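As a rough triage aid when the metrics backend is unavailable, the ms prefix of a persisted offset ID tells you when the last processed entry was appended; the gap to now approximates consumer staleness. A sketch, assuming the standard `ms-seq` Redis stream ID format:

```shell
# How long ago (in ms) the entry at a given stream ID was appended.
offset_age_ms() {
  local id="$1" now_ms
  now_ms=$(( $(date +%s) * 1000 ))
  echo $(( now_ms - ${id%%-*} ))   # ID's ms prefix is its insertion timestamp
}

# Example: offset_age_ms "$(redis-cli GET lobby:stream_offsets:gm_events)"
```

This is only an approximation of the `oldest_unprocessed_age_ms` gauges; prefer the gauges when they are available.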
The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw PostgreSQL and Redis access is for last-resort triage.