Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Lobby.

Startup Checks

Before starting the process, confirm:

  • LOBBY_REDIS_MASTER_ADDR and LOBBY_REDIS_PASSWORD point to the Redis deployment used for the runtime-coordination state that intentionally stays on Redis: stream consumers/publishers, stream offsets, per-game turn-stats aggregates, gap-activation timestamps, and the capability-evaluation guard. The deprecated LOBBY_REDIS_ADDR, LOBBY_REDIS_USERNAME, and LOBBY_REDIS_TLS_ENABLED env vars were retired in PG_PLAN.md §6A; setting LOBBY_REDIS_USERNAME or LOBBY_REDIS_TLS_ENABLED now fails fast at startup.
  • LOBBY_POSTGRES_PRIMARY_DSN points to the PostgreSQL primary that hosts the lobby schema. The DSN must include search_path=lobby and sslmode=disable. Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. After PG_PLAN.md §6A the schema holds games, applications, invites, memberships; after §6B it also holds race_names. The schema and the lobbyservice role are provisioned externally (operator init script in production, the testcontainers harness in tests).
  • LOBBY_USER_SERVICE_BASE_URL and LOBBY_GM_BASE_URL are reachable from the network the Lobby pods run in. Lobby does not ping these at boot, but transport failures against them will surface as request errors.
  • Stream names match the producers/consumers Lobby integrates with:
    • LOBBY_GM_EVENTS_STREAM (default gm:lobby_events)
    • LOBBY_RUNTIME_START_JOBS_STREAM (default runtime:start_jobs)
    • LOBBY_RUNTIME_STOP_JOBS_STREAM (default runtime:stop_jobs)
    • LOBBY_RUNTIME_JOB_RESULTS_STREAM (default runtime:job_results)
    • LOBBY_USER_LIFECYCLE_STREAM (default user:lifecycle_events)
    • LOBBY_NOTIFICATION_INTENTS_STREAM (default notification:intents)
  • LOBBY_RACE_NAME_DIRECTORY_BACKEND is postgres for production (the default after PG_PLAN.md §6B); the stub value is only for unit tests that do not need a real PostgreSQL.
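Taken together, the checks above amount to an environment along these lines. This is an illustrative sketch only: the hostnames, ports, and credentials are placeholders, while the stream names and backend value repeat the documented defaults.

```shell
# Illustrative environment for a production-style start.
# Hostnames and secrets are placeholders; stream names and the
# directory backend match the documented defaults.
export LOBBY_REDIS_MASTER_ADDR="redis-master.example.internal:6379"
export LOBBY_REDIS_PASSWORD="<secret>"
# DSN must include search_path=lobby and sslmode=disable:
export LOBBY_POSTGRES_PRIMARY_DSN="postgres://lobbyservice:<secret>@pg-primary.example.internal:5432/lobby?search_path=lobby&sslmode=disable"
export LOBBY_USER_SERVICE_BASE_URL="http://user-service.example.internal:8080"
export LOBBY_GM_BASE_URL="http://game-master.example.internal:8080"
export LOBBY_GM_EVENTS_STREAM="gm:lobby_events"
export LOBBY_RUNTIME_START_JOBS_STREAM="runtime:start_jobs"
export LOBBY_RUNTIME_STOP_JOBS_STREAM="runtime:stop_jobs"
export LOBBY_RUNTIME_JOB_RESULTS_STREAM="runtime:job_results"
export LOBBY_USER_LIFECYCLE_STREAM="user:lifecycle_events"
export LOBBY_NOTIFICATION_INTENTS_STREAM="notification:intents"
export LOBBY_RACE_NAME_DIRECTORY_BACKEND="postgres"
```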

At startup the process opens the PostgreSQL pool, applies migrations, pings PostgreSQL, then opens the Redis client and pings Redis. Startup fails fast if any step fails. There are no liveness checks against User Service or Game Master at boot; those are surfaced at request time.

Expected listener state after a healthy start:

  • public HTTP is enabled on LOBBY_PUBLIC_HTTP_ADDR (default :8094);
  • internal HTTP is enabled on LOBBY_INTERNAL_HTTP_ADDR (default :8095);
  • both ports answer GET /healthz and GET /readyz.
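A quick sweep of all four probe endpoints can be scripted. This is a sketch that assumes the process runs on the local host with the default listener addresses; the actual curl call is left commented so the loop itself is side-effect free.

```shell
# Probe /healthz and /readyz on both listeners (defaults :8094 and :8095).
PUBLIC_ADDR="${LOBBY_PUBLIC_HTTP_ADDR:-:8094}"
INTERNAL_ADDR="${LOBBY_INTERNAL_HTTP_ADDR:-:8095}"
for addr in "$PUBLIC_ADDR" "$INTERNAL_ADDR"; do
  for path in /healthz /readyz; do
    url="http://localhost${addr}${path}"
    echo "probing $url"
    # Uncomment on a host that can reach the listeners:
    # curl -fsS "$url" >/dev/null && echo "ok   $url" || echo "FAIL $url"
  done
done
```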

Expected log lines:

  • lobby starting from cmd/lobby;
  • one redis ping ok line;
  • one public http listening and one internal http listening line;
  • one worker started line per background worker (six expected).

Readiness

Use the probes according to what they actually guarantee:

  • GET /healthz confirms the listener is alive;
  • GET /readyz confirms the runtime wiring completed and Redis was reachable at boot.

/readyz is process-local. It does not confirm:

  • ongoing Redis health after boot;
  • User Service reachability;
  • Game Master reachability;
  • worker liveness.

For a practical readiness check in production:

  1. confirm the process emitted the listener and worker startup logs;
  2. check GET /healthz and GET /readyz on both ports;
  3. verify lobby.active_games gauge is non-zero in the metrics backend after the first traffic;
  4. verify lobby.gm_events.oldest_unprocessed_age_ms is small or zero after GM starts emitting events.

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • the per-component shutdown budget is controlled by LOBBY_SHUTDOWN_TIMEOUT;
  • HTTP listeners drain in-flight requests before closing;
  • background workers stop their XREAD loops and persist the latest offset;
  • pending consumer offsets are flushed before exit.

During planned restarts:

  1. send SIGTERM;
  2. wait for the listener and component-stop logs;
  3. expect any worker that was mid-cycle to retry from the persisted offset on the next process start;
  4. investigate only if shutdown exceeds LOBBY_SHUTDOWN_TIMEOUT.
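The restart steps above can be wrapped in a small helper. graceful_stop is hypothetical, not part of the Lobby tooling; pick a budget that matches LOBBY_SHUTDOWN_TIMEOUT.

```shell
# Hypothetical helper for a planned restart: send SIGTERM, then wait up
# to a budget (in seconds) for the process to exit.  A non-zero return
# maps to step 4: shutdown exceeded the budget, investigate.
graceful_stop() {
  pid="$1"; budget="${2:-30}"
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  i=0
  while [ "$i" -lt "$budget" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited within budget
    sleep 1
    i=$((i + 1))
  done
  echo "shutdown exceeded ${budget}s for pid $pid" >&2
  return 1
}
# Usage: graceful_stop "$(pgrep -f cmd/lobby)" 30
```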

Stuck starting Recovery

A game that flips to starting but never completes one of the post-start steps will stay in starting until manual recovery.

Symptoms:

  • lobby.active_games{status="starting"} gauge non-zero for longer than the expected start budget (Runtime Manager start time + GM register call);
  • per-game logs show start_job_published but no runtime_job_result or register_runtime_outcome follow-up.

Recovery:

  1. Identify the affected game_id from the gauge labels or logs.
  2. Inspect runtime:job_results for the runtime_job_id published by Lobby. If absent, Runtime Manager never produced a result; resolve at the runtime layer.
  3. If the result exists with success=true but no GM call was made, retry with the admin or owner command lobby.game.retry_start.
  4. If the result exists with success=false, transition through the start_failed path and use lobby.game.cancel or retry_start once the underlying issue is resolved.
  5. If the metadata persistence step failed, Lobby has already published a stop-job and moved the game to start_failed. Confirm the orphan container was removed by Runtime Manager.
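For step 2, a small stdin filter can make the XRANGE output easier to scan. find_job_result is a hypothetical helper, not part of Lobby; it only wraps grep and prints an explicit marker when the id is absent.

```shell
# Hypothetical filter: print the job-result lines matching a
# runtime_job_id, or a "no result" marker when the id is absent.
find_job_result() {
  job="$1"
  grep -F "$job" || echo "no result for $job"
}
# Usage:
# redis-cli XRANGE "${LOBBY_RUNTIME_JOB_RESULTS_STREAM:-runtime:job_results}" - + \
#   | find_job_result "<runtime_job_id>"
```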

Lobby always re-accepts a start command on a game that is stuck in starting: the handler's first action is a CAS attempt, so a second start from a re-issued admin command safely progresses the state machine.

Stuck Stream Offsets

Three stream-lag gauges describe the consumer health:

  • lobby.gm_events.oldest_unprocessed_age_ms
  • lobby.runtime_results.oldest_unprocessed_age_ms
  • lobby.user_lifecycle.oldest_unprocessed_age_ms

A persistently increasing gauge means the consumer is unable to advance. Causes and triage:

  1. Decoder rejects a malformed entry. The consumer logs malformed_event and advances the offset; this should not stall the stream. If the gauge keeps climbing, there is a real handler error.
  2. Handler returns a non-nil error. The consumer holds the offset and retries on every cycle. Inspect the latest log lines to identify the error class (Redis transient, RND store error, RuntimeManager publish failure for cascade events).
  3. Process restart loop. A crash before persisting the offset does not advance progress. Check pod restart counts and cmd/lobby panics.

After the underlying cause is fixed, the consumer resumes from the persisted offset; no manual intervention on the offset key is required in normal operation. If a corrupt entry must be skipped, advance lobby:stream_offsets:<label> to the next valid stream ID and restart the process.
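Redis stream IDs have the form <milliseconds>-<sequence>, so the ID immediately after a given entry can be computed locally. next_stream_id below is a hypothetical helper; where possible, prefer reading the actual next entry's ID out of XRANGE, and double-check whether the consumer treats the persisted offset as inclusive or exclusive before setting it.

```shell
# The smallest stream ID strictly after <ms>-<seq> increments the
# sequence part by one.
next_stream_id() {
  ms="${1%-*}"; seq="${1##*-}"
  echo "${ms}-$((seq + 1))"
}
# Example skip (gm_events label shown; adjust the label per stream):
# redis-cli GET lobby:stream_offsets:gm_events
# redis-cli SET lobby:stream_offsets:gm_events "$(next_stream_id '<corrupt_id>')"
# ...then restart the process.
```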

Pending Registration Window Expiry

The pending-registration expirer ticks every LOBBY_RACE_NAME_EXPIRATION_INTERVAL (default 1h) and releases pending_registration entries past their eligible_until timestamp.

The 30-day window length is the in-process constant service/capabilityevaluation.PendingRegistrationWindow. Operator-tunable override is reserved for a future change under the env var LOBBY_PENDING_REGISTRATION_TTL_HOURS; today the constant is final.

The worker absorbs Race Name Directory failures: a failing Expire call is logged at warn level and the worker simply waits for the next tick. There is no offset to move, because this is a periodic worker, not a stream consumer, so a backlog of expirable entries is self-healing once the directory is reachable again.

To inspect the backlog:

psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"

Rows whose eligible_until_ms is at or below extract(epoch from now()) * 1000 are expirable on the next tick. The partial index race_names_pending_eligible_idx keeps this scan cheap.
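To spot-check a single row without re-running the query, the millisecond comparison can be done in the shell. expirable is a hypothetical helper mirroring the expirer's eligibility rule.

```shell
# Hypothetical check mirroring the expirer's rule: a row is expirable
# when its eligible_until_ms is at or below the current time in ms.
expirable() {
  now_ms=$(( $(date +%s) * 1000 ))
  if [ "$1" -le "$now_ms" ]; then echo yes; else echo no; fi
}
# Usage: expirable 1767225600000
```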

Cascade Release Operator Notes

The user:lifecycle_events consumer fans out a single user-lifecycle event into many actions:

  1. Race Name Directory release (RND.ReleaseAllByUser).
  2. Membership status flips (active → blocked) on every membership the user holds, with a lobby.membership.blocked notification per third-party private game.
  3. Application status flips (submitted → rejected).
  4. Invite status flips (created → revoked) on both addressed and inviter-side invites.
  5. Owned non-terminal games transition to cancelled via the external_block trigger. In-flight statuses (starting, running, paused) get a stop-job published to Runtime Manager before the game record is updated.

The cascade is idempotent: every store mutation uses CAS, and ErrConflict is treated as "already done". A retry on the next consumer cycle will re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish failure) leaves the offset on the current entry. The next cycle retries the full cascade. Do not advance the offset manually unless you have first verified that the cascade actions for the current entry have been completed out-of-band.

Diagnostic Queries

Durable enrollment state and Race Name Directory bindings live in PostgreSQL; runtime coordination state stays in Redis. A handful of CLI snippets help during incidents:

# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"

# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"

# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
         FROM lobby.memberships
         WHERE game_id = '<game_id>'
         ORDER BY joined_at"

# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
         FROM lobby.race_names
         WHERE binding_kind = 'pending_registration'
         ORDER BY eligible_until_ms ASC"

# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
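Because stream IDs embed a millisecond timestamp, the persisted offset's age gives a rough, out-of-band proxy for the lag gauges. offset_age_ms is a hypothetical helper; note it measures the last processed entry, so it reads high on an idle, fully caught-up stream.

```shell
# Rough lag proxy: the persisted offset is a stream ID whose first
# component is a millisecond timestamp, so its age approximates the
# oldest_unprocessed_age_ms gauges.
offset_age_ms() {
  now_ms=$(( $(date +%s) * 1000 ))
  echo $(( now_ms - ${1%-*} ))
}
# Usage: offset_age_ms "$(redis-cli GET lobby:stream_offsets:gm_events)"
```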

The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw PostgreSQL and Redis access is for last-resort triage.