# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm the following (a preflight sketch
follows the list):

- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
  Lobby-related streams listed below.
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
  the network the Lobby pods run in. Lobby does not ping these at boot,
  but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
  `stub` value is only for unit tests.
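
A minimal preflight sketch, assuming `LOBBY_REDIS_ADDR` is a `redis://` URI
(use `-h`/`-p` otherwise) and that User Service and Game Master expose a
`/healthz` endpoint — that path is an assumption; substitute their real
health routes:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Lobby fails fast at boot if this PING fails.
redis-cli -u "$LOBBY_REDIS_ADDR" PING

# Lobby does not check these at boot; failures would otherwise only
# surface as request errors later.
curl -fsS --max-time 5 "$LOBBY_USER_SERVICE_BASE_URL/healthz"  # assumed path
curl -fsS --max-time 5 "$LOBBY_GM_BASE_URL/healthz"            # assumed path

# Confirm the streams exist (default names shown); an XLEN of 0 may
# simply mean no traffic yet.
for s in gm:lobby_events runtime:start_jobs runtime:stop_jobs \
         runtime:job_results user:lifecycle_events notification:intents; do
  echo -n "$s: "
  redis-cli -u "$LOBBY_REDIS_ADDR" XLEN "$s"
done
```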

At startup the process performs a bounded `PING` against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.

Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
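
A quick way to scan for those lines after a deploy, assuming the pods run
under Kubernetes (the `app=lobby` selector is hypothetical; substitute your
own):

```bash
kubectl logs -l app=lobby --since=5m --tail=200 \
  | grep -E 'lobby starting|redis ping ok|http listening|worker started'
```

Two `http listening` lines and six `worker started` lines indicate a
healthy start.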

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
  at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports (see the probe
   sketch after this list);
3. verify the `lobby.active_games` gauge is non-zero in the metrics backend
   after the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
   GM starts emitting events.
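
A probe sketch covering step 2, assuming the default ports and that the
probes are reachable from where you run it:

```bash
for port in 8094 8095; do
  for path in healthz readyz; do
    echo -n "localhost:$port/$path -> "
    curl -fsS --max-time 2 "http://localhost:$port/$path" && echo " ok"
  done
done
```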

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
   on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT` (see the
   timing sketch after this list).
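
A timing sketch for a planned restart of a single process, assuming you can
signal it directly (under Kubernetes the kubelet sends the `SIGTERM` for
you). The pidfile location is hypothetical; substitute however you track the
process:

```bash
pid=$(cat /var/run/lobby.pid)  # hypothetical pidfile

kill -TERM "$pid"
start=$(date +%s)
while kill -0 "$pid" 2>/dev/null; do
  sleep 1
done
echo "shutdown took $(( $(date +%s) - start ))s"
# Compare against LOBBY_SHUTDOWN_TIMEOUT; investigate only if exceeded.
```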

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.

Symptoms:

- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
  expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
  `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
   Lobby (see the sketch after this list). If absent, Runtime Manager never
   produced a result; resolve at the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
   with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
   `start_failed` path and use `lobby.game.cancel` or `retry_start` once
   the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
   stop-job and moved the game to `start_failed`. Confirm the orphan
   container was removed by Runtime Manager.
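
A sketch for step 2. It assumes result entries carry the `runtime_job_id`
as a field value; check the actual entry layout with a plain `XRANGE` first:

```bash
# The job id comes from the start_job_published log line.
job_id="<runtime_job_id>"

# Dump the results stream and filter around the job id; bound the range
# (stream IDs are time-based) if the stream is large.
redis-cli XRANGE runtime:job_results - + | grep -n -B2 -A4 "$job_id"
```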

Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the handler's first action is a CAS attempt, so a second `start`
re-issued by an admin is safe and will progress the state machine.

## Stuck Stream Offsets

Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
   and advances the offset; this should not stall the stream. If the gauge
   keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
   retries on every cycle. Inspect the latest log lines to identify the
   error class (Redis transient, RND store error, Runtime Manager publish
   failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
   advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention on the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process (a sketch of this follows).
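
A sketch of the skip procedure, using the GM events consumer. Whether the
stored offset is read inclusively or exclusively determines the exact ID to
write; the commands below follow the "next valid stream ID" rule above, and
the exclusive `XRANGE` bound needs Redis 6.2+:

```bash
# Current persisted offset.
redis-cli GET lobby:stream_offsets:gm_events

# First valid entry after the corrupt one.
redis-cli XRANGE gm:lobby_events "(<corrupt_entry_id>" + COUNT 1

# Advance the offset to that ID, then restart the process.
redis-cli SET lobby:stream_offsets:gm_events "<next_valid_id>"
```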

## Pending Registration Window Expiry

The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; for now the constant is fixed.

The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.

To inspect the backlog:

```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
```

Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
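
To count only the entries that are already expirable, compare scores against
the current time in milliseconds (`%3N` needs GNU `date`; on macOS use
`$(($(date +%s) * 1000))` instead):

```bash
# The exclusive "(" bound matches the strict `score < now()` rule.
redis-cli ZCOUNT lobby:race_names:pending_index -inf "($(date +%s%3N)"
```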

## Cascade Release Operator Notes

The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:

1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
   user holds, with a `lobby.membership.blocked` notification per
   third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
   inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
   `external_block` trigger. In-flight statuses (`starting`, `running`,
   `paused`) get a stop-job published to Runtime Manager before the game
   record is updated.

The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band; the sketch below helps locate that entry.
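
A sketch for locating the entry the cascade keeps retrying. The offset label
`user_lifecycle` is inferred from the gauge name above; confirm the exact
key before relying on it:

```bash
# List the offset keys to confirm the label.
redis-cli --scan --pattern 'lobby:stream_offsets:*'

# First unprocessed entry: the one the cascade retries each cycle.
offset=$(redis-cli GET lobby:stream_offsets:user_lifecycle)
redis-cli XRANGE user:lifecycle_events "($offset" + COUNT 1
```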

## Diagnostic Queries

A handful of Redis CLI snippets help during incidents:

```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running

# Inspect a specific game record
redis-cli GET lobby:games:<game_id>

# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>

# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES

# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```

The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.