Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Lobby.
Startup Checks
Before starting the process, confirm:
- LOBBY_REDIS_ADDR points to the Redis deployment used for state and the six Lobby-related streams.
- LOBBY_USER_SERVICE_BASE_URL and LOBBY_GM_BASE_URL are reachable from the network the Lobby pods run in. Lobby does not ping these at boot, but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - LOBBY_GM_EVENTS_STREAM (default gm:lobby_events)
  - LOBBY_RUNTIME_START_JOBS_STREAM (default runtime:start_jobs)
  - LOBBY_RUNTIME_STOP_JOBS_STREAM (default runtime:stop_jobs)
  - LOBBY_RUNTIME_JOB_RESULTS_STREAM (default runtime:job_results)
  - LOBBY_USER_LIFECYCLE_STREAM (default user:lifecycle_events)
  - LOBBY_NOTIFICATION_INTENTS_STREAM (default notification:intents)
- LOBBY_RACE_NAME_DIRECTORY_BACKEND is redis for production; the stub value is only for unit tests.
At startup the process performs a bounded PING against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
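The checklist above can be scripted as a preflight. A minimal sketch, assuming standard redis-cli and coreutils timeout are available; the 5-second budget and the redis:// URI form are illustrative, not values the service itself uses:

# Verify the critical env vars are set before launch
for v in LOBBY_REDIS_ADDR LOBBY_USER_SERVICE_BASE_URL LOBBY_GM_BASE_URL; do
  [ -n "$(printenv "$v")" ] || echo "missing $v"
done
# Mirror the bounded boot-time PING from the operator side
timeout 5 redis-cli -u "redis://$LOBBY_REDIS_ADDR" PING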
Expected listener state after a healthy start:
- public HTTP is enabled on LOBBY_PUBLIC_HTTP_ADDR (default :8094);
- internal HTTP is enabled on LOBBY_INTERNAL_HTTP_ADDR (default :8095);
- both ports answer GET /healthz and GET /readyz.
Expected log lines:
- lobby starting from cmd/lobby;
- one redis ping ok line;
- one public http listening and one internal http listening line;
- one worker started line per background worker (six expected).
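One way to confirm all of these at once, assuming a Kubernetes deployment named lobby and the literal log lines listed above:

# Count the expected startup lines; a healthy start yields 10 (1 + 1 + 2 + 6)
kubectl logs deploy/lobby --tail=200 \
  | grep -cE 'lobby starting|redis ping ok|http listening|worker started'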
Readiness
Use the probes according to what they actually guarantee:
- GET /healthz confirms the listener is alive;
- GET /readyz confirms the runtime wiring completed and Redis was reachable at boot.
/readyz is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
- confirm the process emitted the listener and worker startup logs;
- check GET /healthz and GET /readyz on both ports;
- verify the lobby.active_games gauge is non-zero in the metrics backend after the first traffic;
- verify lobby.gm_events.oldest_unprocessed_age_ms is small or zero after GM starts emitting events.
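A minimal probe loop over both default ports, assuming you are on the pod network (adjust host and ports to your deployment):

# Probe both listeners; -f makes curl fail on non-2xx responses
for port in 8094 8095; do
  curl -fsS "http://localhost:$port/healthz" > /dev/null || echo "healthz failed on $port"
  curl -fsS "http://localhost:$port/readyz"  > /dev/null || echo "readyz failed on $port"
done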
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by LOBBY_SHUTDOWN_TIMEOUT;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their XREAD loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
- send SIGTERM;
- wait for the listener and component-stop logs;
- expect any worker that was mid-cycle to retry from the persisted offset on the next process start;
- investigate only if shutdown exceeds LOBBY_SHUTDOWN_TIMEOUT.
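Under Kubernetes this maps onto an ordinary rolling restart, sketched below; the deployment name is an assumption, and terminationGracePeriodSeconds must exceed LOBBY_SHUTDOWN_TIMEOUT or the kubelet will SIGKILL a mid-drain process:

kubectl rollout restart deploy/lobby
kubectl rollout status deploy/lobby --timeout=120s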
Stuck "starting" Recovery
A game that flips to starting but never completes one of the post-start
steps will stay in starting until manual recovery.
Symptoms:
- lobby.active_games{status="starting"} gauge non-zero for longer than the expected start budget (Runtime Manager start time + GM register call);
- per-game logs show start_job_published but no runtime_job_result or register_runtime_outcome follow-up.
Recovery:
- Identify the affected game_id from the gauge labels or logs.
- Inspect runtime:job_results for the runtime_job_id published by Lobby (a scan sketch follows this list). If absent, Runtime Manager never produced a result; resolve at the runtime layer.
- If the result exists with success=true but no GM call was made, retry with the admin or owner command lobby.game.retry_start.
- If the result exists with success=false, transition through the start_failed path and use lobby.game.cancel or retry_start once the underlying issue is resolved.
- If the metadata persistence step failed, Lobby has already published a stop-job and moved the game to start_failed. Confirm the orphan container was removed by Runtime Manager.
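To check whether a result exists before choosing between retry_start and cancel, a scan like the following can help; <runtime_job_id> is a placeholder, and the assumption that the id appears verbatim in the entry fields should be verified against the actual schema:

# Find the result entry (if any) carrying the job id; -B 2 shows the stream ID above it
redis-cli XRANGE runtime:job_results - + | grep -B 2 '<runtime_job_id>'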
Lobby always re-accepts a start command on a game that is stuck in
starting: the first action is a CAS attempt, and a second start from a
re-issued admin command will progress the state machine.
Stuck Stream Offsets
Three stream-lag gauges describe consumer health:
- lobby.gm_events.oldest_unprocessed_age_ms
- lobby.runtime_results.oldest_unprocessed_age_ms
- lobby.user_lifecycle.oldest_unprocessed_age_ms
A persistently increasing gauge means the consumer is unable to advance. Causes and triage:
- Decoder rejects a malformed entry. The consumer logs malformed_event and advances the offset; this should not stall the stream. If the gauge keeps climbing, there is a real handler error.
- Handler returns a non-nil error. The consumer holds the offset and retries on every cycle. Inspect the latest log lines to identify the error class (Redis transient, RND store error, RuntimeManager publish failure for cascade events).
- Process restart loop. A crash before persisting the offset does not advance progress. Check pod restart counts and cmd/lobby panics.
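A quick way to tell a held offset from a dead producer, using the gm_events label as the example (the other labels are analogous):

# Compare the persisted offset with the stream head; a growing gap behind a
# stable offset means the consumer is holding (or crashing on) one entry
redis-cli GET lobby:stream_offsets:gm_events
redis-cli XINFO STREAM gm:lobby_events | grep -A 1 last-generated-id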
After the underlying cause is fixed, the consumer resumes from the persisted offset; no manual intervention on the offset key is required in normal operation. If a corrupt entry must be skipped, advance lobby:stream_offsets:<label> to the next valid stream ID and restart the process.
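A sketch of that last-resort skip, again using the gm_events label; <stuck_id> and <next_id> are placeholders, and the exclusive XRANGE start requires Redis 6.2+:

# Find the first entry after the stuck one, then point the offset at it
redis-cli XRANGE gm:lobby_events '(<stuck_id>' + COUNT 1
redis-cli SET lobby:stream_offsets:gm_events '<next_id>'
# Restart the process afterwards so the consumer picks up the new offset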
Pending Registration Window Expiry
The pending-registration expirer ticks every
LOBBY_RACE_NAME_EXPIRATION_INTERVAL (default 1h) and releases
pending_registration entries past their eligible_until timestamp.
The 30-day window length is the in-process constant
service/capabilityevaluation.PendingRegistrationWindow. An operator-tunable
override is reserved for a future change under the env var
LOBBY_PENDING_REGISTRATION_TTL_HOURS; for now the constant is not configurable.
The worker absorbs Race Name Directory failures: a failing Expire call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.
To inspect the backlog:
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
Entries with score < now() (Unix milliseconds) are expirable on the next
tick.
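To count rather than list the expirable backlog, a ZCOUNT over the past-due score range works (assumes a shell with GNU date):

# Entries whose eligible_until is already in the past
redis-cli ZCOUNT lobby:race_names:pending_index 0 "$(( $(date +%s) * 1000 ))"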
Cascade Release Operator Notes
The user:lifecycle_events consumer fans out a single user-lifecycle event
into many actions:
- Race Name Directory release (RND.ReleaseAllByUser).
- Membership status flips (active → blocked) on every membership the user holds, with a lobby.membership.blocked notification per third-party private game.
- Application status flips (submitted → rejected).
- Invite status flips (created → revoked) on both addressed and inviter-side invites.
- Owned non-terminal games transition to cancelled via the external_block trigger. In-flight statuses (starting, running, paused) get a stop-job published to Runtime Manager before the game record is updated.
The cascade is idempotent: every store mutation uses CAS, and ErrConflict
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish failure) leaves the offset on the current entry. The next cycle retries the full cascade. Do not advance the offset manually unless you have first verified that the cascade actions for the current entry have been completed out-of-band.
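Before considering any manual skip, reading the held entry shows which user the cascade was processing. The offset label user_lifecycle is an assumption derived from the gauge name; confirm it against the actual key set:

# The offset points at the current (unacknowledged) entry, so an exact-range
# XRANGE returns exactly that entry
off="$(redis-cli GET lobby:stream_offsets:user_lifecycle)"
redis-cli XRANGE user:lifecycle_events "$off" "$off"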
Diagnostic Queries
A handful of Redis CLI snippets help during incidents:
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running
# Inspect a specific game record
redis-cli GET lobby:games:<game_id>
# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>
# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw Redis access is for last-resort triage.