# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm:

- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis
  deployment used for the runtime-coordination state that intentionally
  stays on Redis: stream consumers/publishers, stream offsets, per-game
  turn-stats aggregates, gap-activation timestamps, and the
  capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`,
  `LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were
  retired in PG_PLAN.md §6A; setting `LOBBY_REDIS_USERNAME` or
  `LOBBY_REDIS_TLS_ENABLED` now fails fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that
  hosts the `lobby` schema. The DSN must include `search_path=lobby` and
  `sslmode=disable`. Embedded goose migrations apply at startup before
  any HTTP listener opens; a migration or ping failure terminates the
  process with a non-zero exit. After PG_PLAN.md §6A the schema holds
  `games`, `applications`, `invites`, `memberships`; after §6B it also
  holds `race_names`. The schema and the `lobbyservice` role are
  provisioned externally (operator init script in production, the
  testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
  the network the Lobby pods run in. Lobby does not ping these at boot,
  but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production
  (the default after PG_PLAN.md §6B); the `stub` value is only for
  unit tests that do not need a real PostgreSQL.

At startup the process opens the PostgreSQL pool, applies migrations,
pings PostgreSQL, then opens the Redis client and pings Redis. Startup
fails fast if any step fails. There are no liveness checks against User
Service or Game Master at boot; those failures surface at request time.
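
The env contract above can be sketched as a pre-start shell check. The variable names come from this runbook; the `lobby_preflight` function itself is a hypothetical operator convenience, not part of the service:

```shell
# Hypothetical pre-start check for the env contract described above.
set -u

lobby_preflight() {
  local ok=0 var
  # Required connection settings must be present.
  for var in LOBBY_REDIS_MASTER_ADDR LOBBY_REDIS_PASSWORD LOBBY_POSTGRES_PRIMARY_DSN; do
    if [ -z "${!var:-}" ]; then
      echo "missing required env var: $var" >&2
      ok=1
    fi
  done
  # Retired vars make the process fail fast; catch them before it does.
  for var in LOBBY_REDIS_USERNAME LOBBY_REDIS_TLS_ENABLED; do
    if [ -n "${!var:-}" ]; then
      echo "retired env var is set: $var" >&2
      ok=1
    fi
  done
  return "$ok"
}
```

Run `lobby_preflight && echo "preflight ok"` in the pod's environment before starting the binary.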

Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
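
A quick probe of both listeners, assuming the default ports above. `check_probes` is a hypothetical helper; `curl -fsS` makes any non-2xx answer fail the check:

```shell
# Hypothetical probe helper for the listener ports above.
check_probes() {
  local port path
  for port in "$@"; do
    for path in /healthz /readyz; do
      if ! curl -fsS --max-time 2 "http://localhost:${port}${path}" >/dev/null; then
        echo "probe failed: :${port}${path}" >&2
        return 1
      fi
    done
  done
  echo "all probes ok"
}
```

Call it as `check_probes 8094 8095` from a host that can reach the pod network.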

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
  at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify `lobby.active_games` gauge is non-zero in the metrics backend after
   the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
   GM starts emitting events.
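
Step 1 can be automated against a captured log; six `worker started` lines are expected per the startup section. `workers_started` is a hypothetical helper operating on a saved log file, not a Lobby command:

```shell
# Hypothetical check: did all six background workers log their start line?
workers_started() {
  local log=$1 expected=${2:-6}
  local got
  got=$(grep -c 'worker started' "$log")
  if [ "$got" -ne "$expected" ]; then
    echo "expected $expected workers, saw $got" >&2
    return 1
  fi
}
```

Typical use: capture the pod's log to a file, then run `workers_started /tmp/lobby.log`.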

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
   on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
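
The restart steps can be wrapped in a small helper. `graceful_stop` is a hypothetical operator function, and the default budget below is an assumption — align it with your `LOBBY_SHUTDOWN_TIMEOUT`:

```shell
# Hypothetical helper: SIGTERM, then wait up to a budget for the process to exit.
graceful_stop() {
  local pid=$1 budget=${2:-30}
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  local i=0
  while kill -0 "$pid" 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -gt "$budget" ]; then
      echo "shutdown exceeded ${budget}s; investigate before escalating" >&2
      return 1
    fi
    sleep 1
  done
}
```

A non-zero return maps to step 4 above: the process outlived its budget and deserves a look before any harder kill.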

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.

Symptoms:

- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
  expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
  `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
   Lobby. If absent, Runtime Manager never produced a result; resolve at
   the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
   with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
   `start_failed` path and use `lobby.game.cancel` or `retry_start` once
   the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
   stop-job and moved the game to `start_failed`. Confirm the orphan
   container was removed by Runtime Manager.

Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the first action is a CAS attempt, and a second `start` from a
re-issued admin command will progress the state machine.
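
Recovery step 2 can be done by dumping the stream and searching for the job id. The dump-and-grep helper is a hypothetical sketch; `runtime:job_results` and `runtime_job_id` are the names used above:

```shell
# Hypothetical triage helper: does a runtime:job_results dump contain a
# result for the runtime_job_id Lobby published?
job_result_present() {
  local job_id=$1 dump=$2
  grep -qF "$job_id" "$dump"
}

# Dump the stream once, then search it for the job id from the game's logs.
redis-cli XRANGE runtime:job_results - + > /tmp/job_results.txt
if job_result_present "<runtime_job_id>" /tmp/job_results.txt; then
  echo "result exists; inspect the success flag, then retry_start or cancel"
else
  echo "no result; resolve at the Runtime Manager layer"
fi
```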

## Stuck Stream Offsets

Three stream-lag gauges describe consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
   and advances the offset; this should not stall the stream. If the gauge
   keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
   retries on every cycle. Inspect the latest log lines to identify the
   error class (Redis transient, RND store error, RuntimeManager publish
   failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
   advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention on the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
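
Redis stream IDs have the form `<ms>-<seq>`, so the smallest ID strictly greater than a given one bumps the sequence part. The helper below is a hypothetical sketch for computing that ID; confirm it against the actual next entry from `XRANGE` before writing the key:

```shell
# Hypothetical helper: smallest stream ID strictly greater than the given one.
next_stream_id() {
  local id=$1
  printf '%s-%s\n' "${id%-*}" "$(( ${id##*-} + 1 ))"
}

# Example: skip a corrupt gm_events entry, then restart the process.
bad_id="1700000000000-5"
redis-cli SET "lobby:stream_offsets:gm_events" "$(next_stream_id "$bad_id")"
```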

## Pending Registration Window Expiry

The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the constant is final.

The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.

To inspect the backlog:

```bash
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"
```

Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000`
are expirable on the next tick. The partial index
`race_names_pending_eligible_idx` keeps this scan cheap.
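
The cutoff can also be computed client-side; this sketch derives the same millisecond epoch as the SQL expression and narrows the query to only the expirable rows:

```shell
# Same cutoff as extract(epoch from now()) * 1000, computed in the shell.
cutoff_ms=$(( $(date +%s) * 1000 ))

psql -c "SELECT COUNT(*)
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
           AND eligible_until_ms <= ${cutoff_ms}"
```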

## Cascade Release Operator Notes

The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:

1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
   user holds, with a `lobby.membership.blocked` notification per
   third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
   inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
   `external_block` trigger. In-flight statuses (`starting`, `running`,
   `paused`) get a stop-job published to Runtime Manager before the game
   record is updated.

The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.
## Diagnostic Queries

Durable enrollment state and Race Name Directory bindings live in
PostgreSQL; runtime coordination state stays in Redis. A handful of CLI
snippets help during incidents:

```bash
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"

# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"

# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
         FROM lobby.memberships
         WHERE game_id = '<game_id>'
         ORDER BY joined_at"

# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"

# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```

The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw PostgreSQL and Redis access is for last-resort
triage.