Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Lobby.
Startup Checks
Before starting the process, confirm:
- LOBBY_REDIS_ADDR points to the Redis deployment used for state and the six Lobby-related streams.
- LOBBY_USER_SERVICE_BASE_URL and LOBBY_GM_BASE_URL are reachable from the network the Lobby pods run in. Lobby does not ping these at boot, but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - LOBBY_GM_EVENTS_STREAM (default gm:lobby_events)
  - LOBBY_RUNTIME_START_JOBS_STREAM (default runtime:start_jobs)
  - LOBBY_RUNTIME_STOP_JOBS_STREAM (default runtime:stop_jobs)
  - LOBBY_RUNTIME_JOB_RESULTS_STREAM (default runtime:job_results)
  - LOBBY_USER_LIFECYCLE_STREAM (default user:lifecycle_events)
  - LOBBY_NOTIFICATION_INTENTS_STREAM (default notification:intents)
- LOBBY_RACE_NAME_DIRECTORY_BACKEND is redis for production; the stub value is only for unit tests.
At startup the process performs a bounded PING against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
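The checklist above can be scripted as a preflight. A minimal sketch, assuming standard redis-cli and coreutils timeout are available; the 5-second budget and the redis:// URI form are illustrative, not values the service itself uses:

# Verify the critical env vars are set before launch
for v in LOBBY_REDIS_ADDR LOBBY_USER_SERVICE_BASE_URL LOBBY_GM_BASE_URL; do
  [ -n "$(printenv "$v")" ] || echo "missing $v"
done
# Mirror the bounded boot-time PING from the operator side
timeout 5 redis-cli -u "redis://$LOBBY_REDIS_ADDR" PING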
Expected listener state after a healthy start:
- public HTTP is enabled on LOBBY_PUBLIC_HTTP_ADDR (default :8094);
- internal HTTP is enabled on LOBBY_INTERNAL_HTTP_ADDR (default :8095);
- both ports answer GET /healthz and GET /readyz.
Expected log lines:
- lobby starting from cmd/lobby;
- one redis ping ok line;
- one public http listening and one internal http listening line;
- one worker started line per background worker (six expected).
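One way to confirm all of these at once, assuming a Kubernetes deployment named lobby and the literal log lines listed above:

# Count the expected startup lines; a healthy start yields 10 (1 + 1 + 2 + 6)
kubectl logs deploy/lobby --tail=200 \
  | grep -cE 'lobby starting|redis ping ok|http listening|worker started'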
Readiness
Use the probes according to what they actually guarantee:
- GET /healthz confirms the listener is alive;
- GET /readyz confirms the runtime wiring completed and Redis was reachable at boot.
/readyz is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
- confirm the process emitted the listener and worker startup logs;
- check GET /healthz and GET /readyz on both ports;
- verify the lobby.active_games gauge is non-zero in the metrics backend after the first traffic;
- verify lobby.gm_events.oldest_unprocessed_age_ms is small or zero after GM starts emitting events.
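A minimal probe loop over both default ports, assuming you are on the pod network (adjust host and ports to your deployment):

# Probe both listeners; -f makes curl fail on non-2xx responses
for port in 8094 8095; do
  curl -fsS "http://localhost:$port/healthz" > /dev/null || echo "healthz failed on $port"
  curl -fsS "http://localhost:$port/readyz"  > /dev/null || echo "readyz failed on $port"
done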
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by LOBBY_SHUTDOWN_TIMEOUT;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their XREAD loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
- send SIGTERM;
- wait for the listener and component-stop logs;
- expect any worker that was mid-cycle to retry from the persisted offset on the next process start;
- investigate only if shutdown exceeds LOBBY_SHUTDOWN_TIMEOUT.
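Under Kubernetes this maps onto an ordinary rolling restart, sketched below; the deployment name is an assumption, and terminationGracePeriodSeconds must exceed LOBBY_SHUTDOWN_TIMEOUT or the kubelet will SIGKILL a mid-drain process:

kubectl rollout restart deploy/lobby
kubectl rollout status deploy/lobby --timeout=120s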
Stuck "starting" Recovery
A game that flips to starting but never completes one of the post-start
steps will stay in starting until manual recovery.
Symptoms:
- lobby.active_games{status="starting"} gauge non-zero for longer than the expected start budget (Runtime Manager start time + GM register call);
- per-game logs show start_job_published but no runtime_job_result or register_runtime_outcome follow-up.
Recovery:
- Identify the affected game_id from the gauge labels or logs.
- Inspect runtime:job_results for the runtime_job_id published by Lobby (a scan sketch follows this list). If absent, Runtime Manager never produced a result; resolve at the runtime layer.
- If the result exists with success=true but no GM call was made, retry with the admin or owner command lobby.game.retry_start.
- If the result exists with success=false, transition through the start_failed path and use lobby.game.cancel or retry_start once the underlying issue is resolved.
- If the metadata persistence step failed, Lobby has already published a stop-job and moved the game to start_failed. Confirm the orphan container was removed by Runtime Manager.
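To check whether a result exists before choosing between retry_start and cancel, a scan like the following can help; <runtime_job_id> is a placeholder, and the assumption that the id appears verbatim in the entry fields should be verified against the actual schema:

# Find the result entry (if any) carrying the job id; -B 2 shows the stream ID above it
redis-cli XRANGE runtime:job_results - + | grep -B 2 '<runtime_job_id>'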
Lobby always re-accepts a start command on a game that is stuck in
starting: the first action is a CAS attempt, and a second start from a
re-issued admin command will progress the state machine.
Stuck Stream Offsets
Three stream-lag gauges describe consumer health:
- lobby.gm_events.oldest_unprocessed_age_ms
- lobby.runtime_results.oldest_unprocessed_age_ms
- lobby.user_lifecycle.oldest_unprocessed_age_ms
A persistently increasing gauge means the consumer is unable to advance. Causes and triage:
- Decoder rejects a malformed entry. The consumer logs malformed_event and advances the offset; this should not stall the stream. If the gauge keeps climbing, there is a real handler error.
- Handler returns a non-nil error. The consumer holds the offset and retries on every cycle. Inspect the latest log lines to identify the error class (Redis transient, RND store error, RuntimeManager publish failure for cascade events).
- Process restart loop. A crash before persisting the offset does not advance progress. Check pod restart counts and cmd/lobby panics.
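A quick way to tell a held offset from a dead producer, using the gm_events label as the example (the other labels are analogous):

# Compare the persisted offset with the stream head; a growing gap behind a
# stable offset means the consumer is holding (or crashing on) one entry
redis-cli GET lobby:stream_offsets:gm_events
redis-cli XINFO STREAM gm:lobby_events | grep -A 1 last-generated-id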
After the underlying cause is fixed, the consumer resumes from the persisted offset; no manual intervention on the offset key is required in normal operation. If a corrupt entry must be skipped, advance lobby:stream_offsets:<label> to the next valid stream ID and restart the process.
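A sketch of that last-resort skip, again using the gm_events label; <stuck_id> and <next_id> are placeholders, and the exclusive XRANGE start requires Redis 6.2+:

# Find the first entry after the stuck one, then point the offset at it
redis-cli XRANGE gm:lobby_events '(<stuck_id>' + COUNT 1
redis-cli SET lobby:stream_offsets:gm_events '<next_id>'
# Restart the process afterwards so the consumer picks up the new offset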
Pending Registration Window Expiry
The pending-registration expirer ticks every
LOBBY_RACE_NAME_EXPIRATION_INTERVAL (default 1h) and releases
pending_registration entries past their eligible_until timestamp.
The 30-day window length is the in-process constant
service/capabilityevaluation.PendingRegistrationWindow. An operator-tunable
override is reserved for a future change under the env var
LOBBY_PENDING_REGISTRATION_TTL_HOURS; for now the constant is not configurable.
The worker absorbs Race Name Directory failures: a failing Expire call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.
To inspect the backlog:
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
Entries with score < now() (Unix milliseconds) are expirable on the next
tick.
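To count rather than list the expirable backlog, a ZCOUNT over the past-due score range works (assumes a shell with GNU date):

# Entries whose eligible_until is already in the past
redis-cli ZCOUNT lobby:race_names:pending_index 0 "$(( $(date +%s) * 1000 ))"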
Cascade Release Operator Notes
The user:lifecycle_events consumer fans out a single user-lifecycle event
into many actions:
- Race Name Directory release (RND.ReleaseAllByUser).
- Membership status flips (active → blocked) on every membership the user holds, with a lobby.membership.blocked notification per third-party private game.
- Application status flips (submitted → rejected).
- Invite status flips (created → revoked) on both addressed and inviter-side invites.
- Owned non-terminal games transition to cancelled via the external_block trigger. In-flight statuses (starting, running, paused) get a stop-job published to Runtime Manager before the game record is updated.
The cascade is idempotent: every store mutation uses CAS, and ErrConflict
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish failure) leaves the offset on the current entry. The next cycle retries the full cascade. Do not advance the offset manually unless you have first verified that the cascade actions for the current entry have been completed out-of-band.
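Before considering any manual skip, reading the held entry shows which user the cascade was processing. The offset label user_lifecycle is an assumption derived from the gauge name; confirm it against the actual key set:

# The offset points at the current (unacknowledged) entry, so an exact-range
# XRANGE returns exactly that entry
off="$(redis-cli GET lobby:stream_offsets:user_lifecycle)"
redis-cli XRANGE user:lifecycle_events "$off" "$off"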
Diagnostic Queries
A handful of Redis CLI snippets help during incidents:
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running
# Inspect a specific game record
redis-cli GET lobby:games:<game_id>
# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>
# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw Redis access is for last-resort triage.