# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm:

- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis deployment used for the runtime-coordination state that intentionally stays on Redis: stream consumers/publishers, stream offsets, per-game turn-stats aggregates, gap-activation timestamps, and the capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`, `LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were retired in PG_PLAN.md §6A; setting either of the latter two now fails fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that hosts the `lobby` schema. The DSN must include `search_path=lobby` and `sslmode=disable`. Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. After PG_PLAN.md §6A the schema holds `games`, `applications`, `invites`, and `memberships`; after §6B it also holds `race_names`. The schema and the `lobbyservice` role are provisioned externally (operator init script in production, the testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from the network the Lobby pods run in. Lobby does not ping these at boot, but transport failures against them will surface as request errors.
- Stream names match the producers and consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production (the default after PG_PLAN.md §6B); the `stub` value selects the in-memory adapter at `lobby/internal/adapters/racenameinmem/`, intended for unit tests and small local deployments without PostgreSQL. The config token name is kept as `stub` for backward compatibility.

At startup the process opens the PostgreSQL pool, applies migrations, pings PostgreSQL, then opens the Redis client and pings Redis. Startup fails fast if any step fails. There are no liveness checks against User Service or Game Master at boot; failures against those services surface at request time.
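The fail-fast order above can be read as the minimal sketch below. It is an illustration only, assuming the pgx stdlib driver, `pressly/goose` for the embedded migrations, go-redis, and a hypothetical `migrations/` embed path; the real wiring in `cmd/lobby` may differ in detail.

```go
// Sketch of the startup order: open PostgreSQL, migrate, ping, then Redis.
// Any failure exits non-zero before the HTTP listeners open.
package main

import (
	"context"
	"database/sql"
	"embed"
	"log"
	"os"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib"
	"github.com/pressly/goose/v3"
	"github.com/redis/go-redis/v9"
)

//go:embed migrations/*.sql
var migrations embed.FS // hypothetical location of the embedded goose migrations

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// 1. Open the PostgreSQL pool from LOBBY_POSTGRES_PRIMARY_DSN
	//    (the DSN carries search_path=lobby and sslmode=disable).
	db, err := sql.Open("pgx", os.Getenv("LOBBY_POSTGRES_PRIMARY_DSN"))
	if err != nil {
		log.Fatalf("open postgres: %v", err)
	}

	// 2. Apply the embedded goose migrations before any listener opens.
	goose.SetBaseFS(migrations)
	if err := goose.SetDialect("postgres"); err != nil {
		log.Fatalf("goose dialect: %v", err)
	}
	if err := goose.Up(db, "migrations"); err != nil {
		log.Fatalf("migrations: %v", err)
	}

	// 3. Ping PostgreSQL.
	if err := db.PingContext(ctx); err != nil {
		log.Fatalf("postgres ping: %v", err)
	}

	// 4. Open the Redis client and ping it.
	rdb := redis.NewClient(&redis.Options{
		Addr:     os.Getenv("LOBBY_REDIS_MASTER_ADDR"),
		Password: os.Getenv("LOBBY_REDIS_PASSWORD"),
	})
	if err := rdb.Ping(ctx).Err(); err != nil {
		log.Fatalf("redis ping: %v", err)
	}
	log.Println("redis ping ok")

	// Only after this point would the public/internal HTTP listeners
	// and the background workers start.
}
```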
Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify the `lobby.active_games` gauge is non-zero in the metrics backend after the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after GM starts emitting events.

## Shutdown

The process handles `SIGINT` and `SIGTERM`. Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start steps will stay in `starting` until manual recovery.

Symptoms:

- the `lobby.active_games{status="starting"}` gauge is non-zero for longer than the expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by Lobby (a go-redis inspection sketch is included at the end of this runbook). If it is absent, Runtime Manager never produced a result; resolve at the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the `start_failed` path and use `lobby.game.cancel` or `retry_start` once the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a stop-job and moved the game to `start_failed`. Confirm the orphan container was removed by Runtime Manager.

Lobby always re-accepts a `start` command on a game that is stuck in `starting`: the first action is a CAS attempt, and a second `start` from a re-issued admin command will progress the state machine.

## Stuck Stream Offsets

Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance. Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event` and advances the offset; this should not stall the stream. If the gauge keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and retries on every cycle. Inspect the latest log lines to identify the error class (Redis transient, RND store error, Runtime Manager publish failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted offset; no manual intervention on the offset key is required in normal operation. If a corrupt entry must be skipped, advance `lobby:stream_offsets:
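For reference while triaging the stream-offset gauges above, the sketch below shows the consumer pattern those triage points assume: read from the persisted offset, skip (and advance past) malformed entries, hold the offset when the handler errors. It uses go-redis; the `decode`/`handle` signatures and the exact offset-key layout under `lobby:stream_offsets:` are illustrative, not the actual Lobby implementation.

```go
// Package lobbyops holds operator-facing sketches; not the service code.
package lobbyops

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// consume reads a stream from the persisted offset, handles each entry, and
// persists the new offset after each successfully processed (or skipped) entry.
func consume(
	ctx context.Context,
	rdb *redis.Client,
	stream, offsetKey string,
	decode func(redis.XMessage) error, // rejects malformed entries
	handle func(context.Context, redis.XMessage) error,
) error {
	// Resume from the persisted offset; "0-0" means start of the stream.
	lastID, err := rdb.Get(ctx, offsetKey).Result()
	if err == redis.Nil {
		lastID = "0-0"
	} else if err != nil {
		return err
	}

	for ctx.Err() == nil {
		res, err := rdb.XRead(ctx, &redis.XReadArgs{
			Streams: []string{stream, lastID},
			Count:   100,
			Block:   5 * time.Second,
		}).Result()
		if err == redis.Nil {
			continue // nothing new within the block window
		}
		if err != nil {
			return err
		}
		for _, msg := range res[0].Messages {
			if derr := decode(msg); derr != nil {
				// Malformed entry: log it and fall through so the offset advances.
				log.Printf("malformed_event id=%s err=%v", msg.ID, derr)
			} else if herr := handle(ctx, msg); herr != nil {
				// Handler error: return without advancing; the caller retries,
				// so the offset stays put and the lag gauge keeps growing.
				return herr
			}
			lastID = msg.ID
			if err := rdb.Set(ctx, offsetKey, lastID, 0).Err(); err != nil {
				return err
			}
		}
	}
	return ctx.Err()
}
```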
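For step 2 of the stuck-`starting` recovery, a one-off inspection of `runtime:job_results` can look like the helper below (same go-redis import as the previous sketch). The entry field names used here (`runtime_job_id`, `success`) follow the wording of this runbook and are an assumption; verify them against the producer's actual payload.

```go
// findJobResult scans runtime:job_results for the given runtime_job_id.
// A false result means Runtime Manager never produced a result for that job,
// which points the recovery at the runtime layer.
func findJobResult(ctx context.Context, rdb *redis.Client, jobID string) (redis.XMessage, bool, error) {
	msgs, err := rdb.XRange(ctx, "runtime:job_results", "-", "+").Result()
	if err != nil {
		return redis.XMessage{}, false, err
	}
	for _, msg := range msgs {
		if msg.Values["runtime_job_id"] == jobID {
			// Inspect msg.Values["success"] to pick the recovery path.
			return msg, true, nil
		}
	}
	return redis.XMessage{}, false, nil
}
```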