# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm the following (a preflight sketch
follows the list):

- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
  Lobby-related streams listed below.
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
  the network the Lobby pods run in. Lobby does not ping these at boot,
  but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
  `stub` value is only for unit tests.
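
A minimal preflight sketch, assuming `LOBBY_REDIS_ADDR` is a `redis://` URI
(use `-h`/`-p` otherwise) and that User Service and Game Master expose a
`/healthz` endpoint — that path is an assumption; substitute their real
health routes:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Lobby fails fast at boot if this PING fails.
redis-cli -u "$LOBBY_REDIS_ADDR" PING

# Lobby does not check these at boot; failures would otherwise only
# surface as request errors later.
curl -fsS --max-time 5 "$LOBBY_USER_SERVICE_BASE_URL/healthz"  # assumed path
curl -fsS --max-time 5 "$LOBBY_GM_BASE_URL/healthz"            # assumed path

# Confirm the streams exist (default names shown); an XLEN of 0 may
# simply mean no traffic yet.
for s in gm:lobby_events runtime:start_jobs runtime:stop_jobs \
         runtime:job_results user:lifecycle_events notification:intents; do
  echo -n "$s: "
  redis-cli -u "$LOBBY_REDIS_ADDR" XLEN "$s"
done
```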

At startup the process performs a bounded `PING` against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.

Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
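
A quick way to scan for those lines after a deploy, assuming the pods run
under Kubernetes (the `app=lobby` selector is hypothetical; substitute your
own):

```bash
kubectl logs -l app=lobby --since=5m --tail=200 \
  | grep -E 'lobby starting|redis ping ok|http listening|worker started'
```

Two `http listening` lines and six `worker started` lines indicate a
healthy start.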

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
  at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports (see the probe
   sketch after this list);
3. verify the `lobby.active_games` gauge is non-zero in the metrics backend
   after the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
   GM starts emitting events.
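
A probe sketch covering step 2, assuming the default ports and that the
probes are reachable from where you run it:

```bash
for port in 8094 8095; do
  for path in healthz readyz; do
    echo -n "localhost:$port/$path -> "
    curl -fsS --max-time 2 "http://localhost:$port/$path" && echo " ok"
  done
done
```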

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
   on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT` (see the
   timing sketch after this list).
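
A timing sketch for a planned restart of a single process, assuming you can
signal it directly (under Kubernetes the kubelet sends the `SIGTERM` for
you). The pidfile location is hypothetical; substitute however you track the
process:

```bash
pid=$(cat /var/run/lobby.pid)  # hypothetical pidfile

kill -TERM "$pid"
start=$(date +%s)
while kill -0 "$pid" 2>/dev/null; do
  sleep 1
done
echo "shutdown took $(( $(date +%s) - start ))s"
# Compare against LOBBY_SHUTDOWN_TIMEOUT; investigate only if exceeded.
```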

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.

Symptoms:

- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
  expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
  `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
   Lobby (see the sketch after this list). If absent, Runtime Manager never
   produced a result; resolve at the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
   with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
   `start_failed` path and use `lobby.game.cancel` or `retry_start` once
   the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
   stop-job and moved the game to `start_failed`. Confirm the orphan
   container was removed by Runtime Manager.
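
A sketch for step 2. It assumes result entries carry the `runtime_job_id`
as a field value; check the actual entry layout with a plain `XRANGE` first:

```bash
# The job id comes from the start_job_published log line.
job_id="<runtime_job_id>"

# Dump the results stream and filter around the job id; bound the range
# (stream IDs are time-based) if the stream is large.
redis-cli XRANGE runtime:job_results - + | grep -n -B2 -A4 "$job_id"
```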

Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the handler's first action is a CAS attempt, so a second `start`
re-issued by an admin is safe and will progress the state machine.

## Stuck Stream Offsets

Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
   and advances the offset; this should not stall the stream. If the gauge
   keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
   retries on every cycle. Inspect the latest log lines to identify the
   error class (Redis transient, RND store error, Runtime Manager publish
   failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
   advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention on the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process (a sketch of this follows).
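
A sketch of the skip procedure, using the GM events consumer. Whether the
stored offset is read inclusively or exclusively determines the exact ID to
write; the commands below follow the "next valid stream ID" rule above, and
the exclusive `XRANGE` bound needs Redis 6.2+:

```bash
# Current persisted offset.
redis-cli GET lobby:stream_offsets:gm_events

# First valid entry after the corrupt one.
redis-cli XRANGE gm:lobby_events "(<corrupt_entry_id>" + COUNT 1

# Advance the offset to that ID, then restart the process.
redis-cli SET lobby:stream_offsets:gm_events "<next_valid_id>"
```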

## Pending Registration Window Expiry

The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; for now the constant is fixed.

The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.

To inspect the backlog:

```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
```

Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
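
To count only the entries that are already expirable, compare scores against
the current time in milliseconds (`%3N` needs GNU `date`; on macOS use
`$(($(date +%s) * 1000))` instead):

```bash
# The exclusive "(" bound matches the strict `score < now()` rule.
redis-cli ZCOUNT lobby:race_names:pending_index -inf "($(date +%s%3N)"
```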

## Cascade Release Operator Notes

The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:

1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
   user holds, with a `lobby.membership.blocked` notification per
   third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
   inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
   `external_block` trigger. In-flight statuses (`starting`, `running`,
   `paused`) get a stop-job published to Runtime Manager before the game
   record is updated.

The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band; the sketch below helps locate that entry.
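
A sketch for locating the entry the cascade keeps retrying. The offset label
`user_lifecycle` is inferred from the gauge name above; confirm the exact
key before relying on it:

```bash
# List the offset keys to confirm the label.
redis-cli --scan --pattern 'lobby:stream_offsets:*'

# First unprocessed entry: the one the cascade retries each cycle.
offset=$(redis-cli GET lobby:stream_offsets:user_lifecycle)
redis-cli XRANGE user:lifecycle_events "($offset" + COUNT 1
```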

## Diagnostic Queries

A handful of Redis CLI snippets help during incidents:

```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running

# Inspect a specific game record
redis-cli GET lobby:games:<game_id>

# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>

# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES

# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```

The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.