feat: game lobby service

Ilia Denisov
2026-04-25 23:20:55 +02:00
committed by GitHub
parent 32dc29359a
commit 48b0056b49
336 changed files with 57074 additions and 1418 deletions
+18
@@ -0,0 +1,18 @@
# Game Lobby Docs
This directory keeps service-local documentation that is too detailed for the
root architecture documents and too diagram-heavy for the module README.
Sections:
- [Runtime and components](runtime.md)
- [Flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
Primary references:
- `../README.md` — service scope, contracts, configuration, observability.
- `../api/public-openapi.yaml` — public REST contract.
- `../api/internal-openapi.yaml` — internal REST contract.
- `../../ARCHITECTURE.md` — workspace architecture (§7 Game Lobby).
- `../../notification/README.md` — notification intent catalog.
- `../../user/README.md` — User Service eligibility surface.
+195
@@ -0,0 +1,195 @@
# Configuration And Contract Examples
The examples below are illustrative. Replace `localhost`, port numbers, IDs,
and timestamps with values that match the deployment under inspection.
## Example `.env`
A minimum-viable `LOBBY_*` set for a local run against a single Redis
container. The full list with defaults lives in `../README.md` §Configuration.
```bash
LOBBY_REDIS_ADDR=127.0.0.1:6379
LOBBY_USER_SERVICE_BASE_URL=http://127.0.0.1:8083
LOBBY_GM_BASE_URL=http://127.0.0.1:8096
LOBBY_PUBLIC_HTTP_ADDR=:8094
LOBBY_INTERNAL_HTTP_ADDR=:8095
LOBBY_LOG_LEVEL=info
LOBBY_SHUTDOWN_TIMEOUT=30s
LOBBY_RACE_NAME_DIRECTORY_BACKEND=redis
LOBBY_ENROLLMENT_AUTOMATION_INTERVAL=30s
LOBBY_RACE_NAME_EXPIRATION_INTERVAL=1h
OTEL_SERVICE_NAME=galaxy-lobby
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
LOBBY_OTEL_STDOUT_TRACES_ENABLED=false
LOBBY_OTEL_STDOUT_METRICS_ENABLED=false
```
## Public HTTP Examples
The public listener trusts the `X-User-ID` header injected by Edge Gateway.
Direct calls during development can supply the header manually.
### Submit an application to a public game
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-User-ID: user-01HZ...' \
http://localhost:8094/api/v1/lobby/games/game-01HZ.../applications \
-d '{"race_name":"Aurora"}'
```
Response (`200 OK`):
```json
{
"application_id": "application-01HZ...",
"game_id": "game-01HZ...",
"user_id": "user-01HZ...",
"status": "submitted",
"created_at": 1714081234567
}
```
### List my open invites
```bash
curl -s \
-H 'X-User-ID: user-01HZ...' \
'http://localhost:8094/api/v1/lobby/my/invites?page_size=50'
```
### Register a race name from a pending entry
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-User-ID: user-01HZ...' \
http://localhost:8094/api/v1/lobby/race-names/register \
-d '{"race_name":"Aurora"}'
```
A `422` response with `error.code="race_name_pending_window_expired"`
indicates the 30-day window has elapsed and the user must enter a new game
to re-establish eligibility.
## Internal HTTP Examples
The internal listener admits the admin actor without `X-User-ID` and serves
GM-facing read paths.
### Create a public game (admin)
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
http://localhost:8095/api/v1/lobby/games \
-d '{
"game_name": "Spring Tournament",
"game_type": "public",
"min_players": 4,
"max_players": 12,
"start_gap_hours": 24,
"start_gap_players": 4,
"enrollment_ends_at": 1716673200,
"turn_schedule": "0 18 * * *",
"target_engine_version": "1.4.0"
}'
```
### Read a game record (Game Master)
```bash
curl -s http://localhost:8095/api/v1/internal/games/game-01HZ...
```
### List memberships for a running game (Game Master)
```bash
curl -s http://localhost:8095/api/v1/internal/games/game-01HZ.../memberships
```
## Redis Examples
### Inspect a game record
```bash
redis-cli GET lobby:games:game-01HZ...
```
The value is a strict JSON blob with the fields documented in
`../README.md` §Game Record Model.
### Publish a runtime job result (Runtime Manager simulation)
Runtime Manager would normally publish this. The shape matches the consumer
in `internal/worker/runtimejobresult/consumer.go`.
```bash
redis-cli XADD runtime:job_results '*' \
job_id 'runtime-job-01HZ...' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id 'container-7f...' \
engine_endpoint '127.0.0.1:9100' \
bound_at_ms 1714081239876
```
### Publish a Game Master runtime snapshot update
```bash
redis-cli XADD gm:lobby_events '*' \
kind 'runtime_snapshot_update' \
game_id 'game-01HZ...' \
current_turn '12' \
runtime_status 'healthy' \
engine_health_summary 'ok' \
player_turn_stats '[{"user_id":"user-01HZ...","planets":4,"population":900,"ships_built":17}]'
```
### Publish a game-finished event
```bash
redis-cli XADD gm:lobby_events '*' \
kind 'game_finished' \
game_id 'game-01HZ...' \
finished_at_ms 1714123456789
```
### Inspect open enrollment games (sorted by created_at)
```bash
redis-cli ZRANGE lobby:games_by_status:enrollment_open 0 -1 WITHSCORES
```
## Notification Intent Format
Lobby produces every notification through `pkg/notificationintent` and
appends to `notification:intents` with plain `XADD`. A representative
intent for `lobby.application.submitted`:
```bash
redis-cli XADD notification:intents '*' \
envelope '{
"type": "lobby.application.submitted",
"producer": "lobby",
"idempotency_key": "lobby.application.submitted:application-01HZ...",
"audience": {"kind": "admin_email", "email_address_kind": "lobby_application_submitted"},
"payload": {
"game_id": "game-01HZ...",
"game_name": "Spring Tournament",
"applicant_user_id": "user-01HZ...",
"applicant_name": "Aurora"
}
}'
```
The exact field set per type is documented in `../../notification/README.md`
and frozen by the AsyncAPI spec under
`../../notification/api/intents-asyncapi.yaml`.
+196
@@ -0,0 +1,196 @@
# Flows
This document collects the eight platform flows that span Game Lobby plus
its synchronous and asynchronous neighbours. Narrative descriptions of the
rules these flows enforce live in `../README.md`; the diagrams here focus on
the message order across the boundary.
## Public Game Application
```mermaid
sequenceDiagram
participant User
participant Gateway
participant Lobby as Lobby publichttp
participant UserSvc as User Service
participant Redis
participant Stream as notification:intents
User->>Gateway: lobby.application.submit(game_id, race_name)
Gateway->>Lobby: POST /api/v1/lobby/games/{id}/applications + X-User-ID
Lobby->>UserSvc: GetEligibility(user_id)
UserSvc-->>Lobby: snapshot (entitlement, sanctions)
Lobby->>Redis: persist Application(submitted) + indexes
Lobby->>Stream: lobby.application.submitted (admin recipients)
Lobby-->>Gateway: 200 ApplicationRecord
```
Approval and rejection follow the same pattern, mutating the application
status to `approved`/`rejected` and emitting
`lobby.membership.approved`/`lobby.membership.rejected` to the applicant.
## Private Game Invite
```mermaid
sequenceDiagram
participant Owner
participant Invitee
participant Lobby
participant Redis
participant Stream as notification:intents
Owner->>Lobby: lobby.invite.create(invitee_user_id)
Lobby->>Redis: persist Invite(created)
Lobby->>Stream: lobby.invite.created (recipient: invitee)
Invitee->>Lobby: lobby.invite.redeem(race_name)
Lobby->>Lobby: User Service guard for inviter and invitee
Lobby->>Redis: RND.Reserve + Membership(active) + Invite(redeemed)
Lobby->>Stream: lobby.invite.redeemed (recipient: owner)
```
The owner-facing decline and revoke transitions persist the invite status
update and produce no notification in v1.
## Enrollment Automation
```mermaid
sequenceDiagram
participant Tick as Worker tick
participant Lobby
participant Redis
participant Stream as notification:intents
Tick->>Lobby: enrollment automation cycle
Lobby->>Redis: load enrollment_open games + roster sizes
alt deadline reached or gap exhausted
Lobby->>Redis: status enrollment_open → ready_to_start (CAS)
Lobby->>Redis: pending invites → expired
Lobby->>Stream: lobby.invite.expired (per expired invite)
else still within window
Lobby-->>Tick: no-op
end
```
Manual `lobby.game.ready_to_start` from owner or admin runs the same close
pipeline synchronously without waiting for the next tick.
## Game Start (happy path)
```mermaid
sequenceDiagram
participant Actor as Owner or Admin
participant Lobby
participant Redis
participant RT as Runtime Manager
participant GM as Game Master
Actor->>Lobby: lobby.game.start
Lobby->>Redis: status ready_to_start → starting (CAS)
Lobby->>Redis: XADD runtime:start_jobs
RT->>Redis: XADD runtime:job_results (success + container metadata)
Lobby->>Redis: persist runtime_binding on game record
Lobby->>GM: POST /internal/games/{id}/register-runtime
GM-->>Lobby: 200 OK
Lobby->>Redis: status starting → running; set started_at
```
If runtime metadata persistence fails, Lobby publishes a stop-job to remove
the orphan container before flipping the game to `start_failed`.
## Game Start (GM unavailable)
```mermaid
sequenceDiagram
participant Lobby
participant Redis
participant GM as Game Master
participant Stream as notification:intents
Lobby->>GM: POST /internal/games/{id}/register-runtime
GM-->>Lobby: timeout / 5xx
Lobby->>Redis: status starting → paused (CAS)
Lobby->>Stream: lobby.runtime_paused_after_start (admin)
Note over Lobby,GM: Container stays alive; admin restarts GM<br/>and issues lobby.game.resume.
```
## Game Finish + Capability Evaluation
```mermaid
sequenceDiagram
participant GM as Game Master
participant Stream as gm:lobby_events
participant Lobby
participant Redis
participant Intents as notification:intents
GM->>Stream: XADD runtime_snapshot_update (player_turn_stats)
Lobby->>Redis: UpdateMax for each member's stats aggregate
GM->>Stream: XADD game_finished
Lobby->>Redis: status running/paused → finished; finished_at = event_ts
Lobby->>Redis: capability evaluator runs per active membership
alt member capable
Lobby->>Redis: RND.MarkPendingRegistration(eligible_until = finished_at + 30d)
Lobby->>Intents: lobby.race_name.registration_eligible (recipient: user)
else not capable
Lobby->>Redis: RND.ReleaseReservation
Lobby->>Intents: lobby.race_name.registration_denied (optional)
end
Lobby->>Redis: ReleaseReservation for removed/blocked memberships
Lobby->>Redis: delete per-game stats aggregate
```
The evaluation guard `lobby:capability_evaluation:done:<game_id>` makes a
replayed `game_finished` event a no-op.
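A guard of this kind can be sketched as a check, evaluate, mark sequence. This is an illustration only: the source does not say how Lobby implements the guard, `handle_game_finished` is a hypothetical name, and `REDIS_CLI` is an assumed hook for swapping the client. The check and the mark are not atomic here, so the sketch assumes a single consumer per stream.

```bash
# Hypothetical sketch of a "done" guard; the real implementation may differ.
# REDIS_CLI is an assumed hook that lets a different client be substituted.
REDIS_CLI=${REDIS_CLI:-redis-cli}

handle_game_finished() {
  game_id=$1
  key="lobby:capability_evaluation:done:$game_id"
  # EXISTS replies 1 when the key is present, 0 otherwise.
  if [ "$($REDIS_CLI EXISTS "$key")" = "1" ]; then
    echo "skipping $game_id (already evaluated)"
    return 0
  fi
  echo "evaluating $game_id"   # placeholder for the real evaluation
  # Mark completion only after the evaluation has succeeded.
  $REDIS_CLI SET "$key" 1 >/dev/null
}
```

Calling `handle_game_finished` twice for the same game evaluates on the first call and skips on the second, which is the replay behaviour described above.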
## Race Name Registration
```mermaid
sequenceDiagram
participant User
participant Lobby
participant UserSvc as User Service
participant RND as Race Name Directory
participant Stream as notification:intents
User->>Lobby: lobby.race_name.register(race_name)
Lobby->>UserSvc: GetEligibility (sanctions, max_registered_race_names)
UserSvc-->>Lobby: snapshot
Lobby->>RND: Register(game_id, user_id, race_name)
RND-->>Lobby: ok / ErrPendingExpired / ErrQuotaExceeded
alt success
Lobby->>Stream: lobby.race_name.registered (recipient: user)
Lobby-->>User: 200 RegisteredRaceName
else precondition failure
Lobby-->>User: 422 DomainPreconditionError
end
```
Registration consumes one tariff slot keyed by `(canonical_key, user_id)`;
tariff downgrade never revokes existing registrations.
## Cascade Release on User Lifecycle Event
```mermaid
sequenceDiagram
participant US as User Service
participant Stream as user:lifecycle_events
participant Lobby
participant RT as Runtime Manager
participant Intents as notification:intents
US->>Stream: XADD permanent_blocked or deleted
Lobby->>Stream: XREAD (consumer)
Lobby->>Lobby: RND.ReleaseAllByUser
Lobby->>Lobby: memberships → blocked + lobby.membership.blocked per private game
Lobby->>Lobby: applications → rejected
Lobby->>Lobby: invites (addressed and inviter-side) → revoked
Lobby->>Lobby: owned non-terminal games → cancelled (external_block trigger)
Lobby->>RT: XADD runtime:stop_jobs for in-flight owned games
Lobby->>Intents: lobby.membership.blocked per affected membership
Lobby->>Stream: advance offset
```
Every step is idempotent at the store layer (`ErrConflict` from a CAS is
treated as «already done»); the consumer only advances the offset once the
handler returns nil.
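The offset rule can be illustrated with a small simulation that involves no Redis at all. Everything here is hypothetical: entries are plain integers rather than stream IDs, `handle_entry` stands in for the real handler, and `FIXED` mimics the moment an operator resolves the underlying fault.

```bash
# Hypothetical simulation of the "advance only on success" rule.
# handle_entry fails on entry 3 until FIXED=1, mimicking a transient error.
handle_entry() {
  if [ "$1" = "3" ] && [ "${FIXED:-0}" != "1" ]; then
    return 1
  fi
  return 0
}

# Processes entries in order, starting after offset $1.
# Prints the new offset: the last entry whose handler succeeded.
run_cycle() {
  offset=$1
  shift
  for id in "$@"; do
    # Entries at or before the persisted offset were already handled.
    if [ "$id" -le "$offset" ]; then
      continue
    fi
    # Stop at the first failure so the offset stays on the last success.
    if ! handle_entry "$id"; then
      break
    fi
    offset=$id
  done
  echo "$offset"
}
```

With entries `1 2 3 4`, the first cycle from offset `0` stops at `2`; once `FIXED=1` is set, a cycle from offset `2` reaches `4`. Because earlier entries are skipped rather than replayed, the retry repeats only the failed entry.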
+220
@@ -0,0 +1,220 @@
# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.
## Startup Checks
Before starting the process, confirm:
- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
six Lobby-related streams.
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
the network the Lobby pods run in. Lobby does not ping these at boot,
but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
- `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
- `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
- `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
- `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
- `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
`stub` value is only for unit tests.
At startup the process performs a bounded `PING` against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
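Part of the pre-start checklist can be scripted. The sketch below is not part of the repository: `missing_required_vars` and `preflight_ok` are hypothetical helper names, and the three variables checked are the ones the configuration docs list as required.

```bash
# Hypothetical preflight helper; not part of the Lobby repository.
# Prints the name of every required variable that is unset or empty.
missing_required_vars() {
  for name in LOBBY_REDIS_ADDR LOBBY_USER_SERVICE_BASE_URL LOBBY_GM_BASE_URL; do
    # Names are fixed literals, so eval-based indirection is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Returns 0 when nothing is missing, 1 otherwise.
preflight_ok() {
  missing=$(missing_required_vars)
  if [ -n "$missing" ]; then
    echo "missing required variables:" >&2
    echo "$missing" >&2
    return 1
  fi
  return 0
}

# Example: mirror the boot-time ping once the variables are in place.
# preflight_ok && redis-cli -h "${LOBBY_REDIS_ADDR%:*}" -p "${LOBBY_REDIS_ADDR##*:}" PING
```

The commented `redis-cli` line assumes `LOBBY_REDIS_ADDR` has the `host:port` form shown in the example `.env`.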
Expected listener state after a healthy start:
- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.
Expected log lines:
- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
at boot.
`/readyz` is process-local. It does not confirm:
- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify that the `lobby.active_games` gauge is non-zero in the metrics
backend after the first traffic;
4. verify that `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero
after GM starts emitting events.
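Step 2 of this check lends itself to a short script. The sketch below is hypothetical: it assumes the default ports `8094` and `8095`, and it assumes both probes answer with HTTP `200` when healthy, which the source does not state explicitly.

```bash
# Hypothetical helper; ports are the documented defaults.
# Prints the four probe URLs (two routes on each listener).
probe_urls() {
  host=$1
  for port in 8094 8095; do
    for route in healthz readyz; do
      echo "http://$host:$port/$route"
    done
  done
}

# Returns 0 only if every probe answers with HTTP 200 (assumed healthy status).
check_probes() {
  failed=0
  for url in $(probe_urls "$1"); do
    # -s silences progress, -o discards the body, -w prints the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url")
    if [ "$code" != "200" ]; then
      echo "probe failed: $url (status $code)" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_probes localhost` covers the probe step only; the log, gauge, and lag checks in the other steps still depend on the logging and metrics backends.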
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.
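The restart steps can be wrapped in a small helper that sends the signal and waits for the process to exit. This is a sketch under assumptions: `stop_and_wait` is a hypothetical name, the caller is assumed to be allowed to signal the process, and the timeout argument is expected to match the configured `LOBBY_SHUTDOWN_TIMEOUT` in seconds.

```bash
# Hypothetical helper: send SIGTERM and wait up to timeout_s seconds.
# Returns 0 if the process exited in time (or was already gone), 1 otherwise.
stop_and_wait() {
  pid=$1
  timeout_s=$2
  # If the signal cannot be delivered, treat the process as already gone.
  kill -TERM "$pid" 2>/dev/null || return 0
  waited=0
  # kill -0 sends no signal; it only tests whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "pid $pid still running after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A call such as `stop_and_wait "$LOBBY_PID" 30` (where `LOBBY_PID` is a placeholder for the process ID) returns non-zero exactly in the case the last step says to investigate.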
## Stuck `starting` Recovery
A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.
Symptoms:
- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
`register_runtime_outcome` follow-up.
Recovery:
1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
Lobby. If absent, Runtime Manager never produced a result; resolve at
the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
`start_failed` path and use `lobby.game.cancel` or `retry_start` once
the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
stop-job and moved the game to `start_failed`. Confirm the orphan
container was removed by Runtime Manager.
Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the first action is a CAS attempt, and a second `start` from a
re-issued admin command will progress the state machine.
## Stuck Stream Offsets
Three stream-lag gauges describe the consumer health:
- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`
A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:
1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
and advances the offset; this should not stall the stream. If the gauge
keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
retries on every cycle. Inspect the latest log lines to identify the
error class (Redis transient, RND store error, RuntimeManager publish
failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
advance progress. Check pod restart counts and `cmd/lobby` panics.
After the underlying cause is fixed, the consumer resumes from the persisted
offset; in normal operation the offset key does not need to be edited by
hand. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
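When an offset has to be moved by hand, a guard that refuses to move it backward helps avoid replaying old entries. The sketch below is hypothetical: it assumes the offset key holds a plain stream ID string (consistent with the `GET` used elsewhere in this runbook), and it prints the `SET` command for review rather than running it. Choosing the correct target ID remains the operator's call.

```bash
# Hypothetical helper: succeeds when stream ID $1 is strictly greater than $2.
# Redis stream IDs have the form <milliseconds>-<sequence>.
stream_id_gt() {
  a_ms=${1%%-*}; a_seq=${1#*-}
  b_ms=${2%%-*}; b_seq=${2#*-}
  if [ "$a_ms" -gt "$b_ms" ]; then return 0; fi
  if [ "$a_ms" -lt "$b_ms" ]; then return 1; fi
  [ "$a_seq" -gt "$b_seq" ]
}

# Prints the SET command for review instead of running it.
plan_offset_advance() {
  label=$1; current=$2; target=$3
  if stream_id_gt "$target" "$current"; then
    echo "redis-cli SET lobby:stream_offsets:$label $target"
  else
    echo "refusing: $target is not ahead of $current" >&2
    return 1
  fi
}
```

For example, `plan_offset_advance gm_events 1714081239876-0 1714081239876-1` prints the command to run, while swapping the two IDs is rejected.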
## Pending Registration Window Expiry
The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.
The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. An operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the value is fixed in code.
The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.
To inspect the backlog:
```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
```
Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
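The expiry arithmetic can be checked by hand. The sketch below assumes that each score in `lobby:race_names:pending_index` is the entry's `eligible_until` in Unix milliseconds; the helper names are hypothetical.

```bash
# Thirty days in milliseconds; mirrors the documented 30-day window.
WINDOW_MS=$((30 * 24 * 60 * 60 * 1000))

# eligible_until for a game that finished at $1 (Unix milliseconds).
eligible_until_ms() {
  echo $(( $1 + WINDOW_MS ))
}

# Succeeds when an entry with score $1 is expirable at time $2 (both in ms).
is_expirable() {
  [ "$1" -lt "$2" ]
}

# Lists expirable members; "(" makes the upper bound exclusive (score < now).
# date +%s yields seconds, so multiply to get milliseconds.
list_expirable() {
  now_ms=$(( $(date +%s) * 1000 ))
  redis-cli ZRANGEBYSCORE lobby:race_names:pending_index -inf "($now_ms"
}
```

For a game finished at `1714123456789`, `eligible_until_ms` yields `1716715456789`; an entry with that score becomes expirable on the first tick after that instant.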
## Cascade Release Operator Notes
The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:
1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
user holds, with a `lobby.membership.blocked` notification per
third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
`external_block` trigger. In-flight statuses (`starting`, `running`,
`paused`) get a stop-job published to Runtime Manager before the game
record is updated.
The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as «already done». A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.
A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.
## Diagnostic Queries
A handful of Redis CLI snippets help during incidents:
```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running
# Inspect a specific game record
redis-cli GET lobby:games:<game_id>
# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>
# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.
+163
@@ -0,0 +1,163 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/lobby` process and its
runtime dependencies.
```mermaid
flowchart LR
subgraph Clients
Gateway["Edge Gateway"]
Admin["Admin Service"]
GM["Game Master"]
end
subgraph Lobby["Game Lobby process"]
PublicHTTP["Public HTTP listener\n:8094 /healthz /readyz"]
InternalHTTP["Internal HTTP listener\n:8095 /healthz /readyz"]
EnrollAuto["Enrollment automation worker"]
RTJobsConsumer["runtime:job_results consumer"]
GMEventsConsumer["gm:lobby_events consumer"]
PendingExpirer["Pending registration expirer"]
ULConsumer["user:lifecycle_events consumer"]
IntentPublisher["notification:intents publisher"]
Telemetry["Logs, traces, metrics"]
end
User["User Service"]
Redis["Redis\nKV + Streams"]
Gateway --> PublicHTTP
Admin --> InternalHTTP
GM --> InternalHTTP
PublicHTTP --> User
InternalHTTP --> User
PublicHTTP -. register-runtime .-> GM
InternalHTTP -. register-runtime .-> GM
EnrollAuto --> Redis
RTJobsConsumer --> Redis
GMEventsConsumer --> Redis
PendingExpirer --> Redis
ULConsumer --> Redis
IntentPublisher --> Redis
PublicHTTP --> Redis
InternalHTTP --> Redis
PublicHTTP --> Telemetry
InternalHTTP --> Telemetry
EnrollAuto --> Telemetry
RTJobsConsumer --> Telemetry
GMEventsConsumer --> Telemetry
PendingExpirer --> Telemetry
ULConsumer --> Telemetry
```
Notes:
- `cmd/lobby` refuses startup when Redis connectivity is misconfigured. User
Service and Game Master reachability are not verified at boot; transport
failures surface as request errors.
- Both HTTP listeners expose `/healthz` and `/readyz` independently so health
checks can target either port.
- `register-runtime` is an outgoing call from Lobby to Game Master after the
container start completes. Lobby does not expose an inbound endpoint of the
same name.
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Public HTTP | `:8094` | Authenticated user routes; gateway-facing |
| Internal HTTP | `:8095` | Admin-mirrored routes + Game Master read paths |
Shared listener defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`
Public-port routes carry an `X-User-ID` header injected by Edge Gateway;
internal-port routes admit the admin actor without the header.
Probe routes:
- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}` once startup wiring completes.
- Neither probe performs a live Redis ping per request.
- There is no `/metrics` route. Metrics flow through OpenTelemetry exporters.
## Background Workers
| Worker | Trigger | Function |
| --- | --- | --- |
| Enrollment automation | Periodic tick (`LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`) | Closes enrollment when the deadline or the gap window is exhausted. |
| `runtime:job_results` consumer | Redis `XREAD` | Drives `starting` to `running`/`paused`/`start_failed` based on Runtime Manager outcomes. |
| `gm:lobby_events` consumer | Redis `XREAD` | Applies runtime snapshot updates and game-finish events from Game Master; hands `game_finished` events off to capability evaluation. |
| Pending registration expirer | Periodic tick (`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`) | Releases `pending_registration` entries past their 30-day window. |
| `user:lifecycle_events` consumer | Redis `XREAD` | Fans out the cascade for `permanent_blocked` and `deleted` user events (RND release, membership block, application/invite cancel, owned-game cancel). |
| `notification:intents` publisher | Synchronous from services | Wraps every notification publish with metric instrumentation; producer-side failures degrade notifications without rolling back business state. |
## Synchronous Upstream Clients
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `User Service` eligibility | `POST {LOBBY_USER_SERVICE_BASE_URL}/api/v1/internal/users/{user_id}/lobby-eligibility` | Network or non-2xx → `503 service_unavailable`; `permanent_block` → `404 subject_not_found`. |
| `Game Master` register-runtime | `POST {LOBBY_GM_BASE_URL}/api/v1/internal/games/{game_id}/register-runtime` | Network or non-2xx → forced-pause path (`paused` + `lobby.runtime_paused_after_start`). |
| `Game Master` liveness probe | `GET {LOBBY_GM_BASE_URL}/api/v1/internal/healthz` | Used during `lobby.game.resume`; failure surfaces as `503 service_unavailable`. |
## Stream Offsets
Each consumer persists its position under a dedicated key so that a process
restart preserves stream progress.
| Stream | Offset key | Read block timeout env |
| --- | --- | --- |
| `gm:lobby_events` | `lobby:stream_offsets:gm_events` | `LOBBY_GM_EVENTS_READ_BLOCK_TIMEOUT` |
| `runtime:job_results` | `lobby:stream_offsets:runtime_results` | `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` |
| `user:lifecycle_events` | `lobby:stream_offsets:user_lifecycle` | `LOBBY_USER_LIFECYCLE_READ_BLOCK_TIMEOUT` |
Stream lag is exposed through observable gauges
`lobby.gm_events.oldest_unprocessed_age_ms`,
`lobby.runtime_results.oldest_unprocessed_age_ms`, and
`lobby.user_lifecycle.oldest_unprocessed_age_ms`. The probe samples the
oldest entry whose ID is greater than the persisted offset; when a consumer
lags or stalls, the gauge climbs and stays high.
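The same measurement can be approximated from the command line. This sketch is not the probe Lobby ships: it assumes entries carry auto-generated IDs (`XADD ... '*'`), so the millisecond prefix of an ID is the time the entry was appended, and it relies on exclusive range bounds, which need Redis 6.2 or later. Both function names are hypothetical.

```bash
# Age in ms of a stream entry, derived from the millisecond prefix of its ID.
# $1 is the entry ID, $2 is the current time in Unix milliseconds.
entry_age_ms() {
  entry_ms=${1%%-*}
  echo $(( $2 - entry_ms ))
}

# Prints the ID of the oldest entry after the persisted offset, if any.
# "(" makes the lower bound exclusive; a missing offset falls back to 0-0.
oldest_unprocessed_id() {
  stream=$1; offset_key=$2
  offset=$(redis-cli GET "$offset_key")
  redis-cli XRANGE "$stream" "(${offset:-0-0}" + COUNT 1 | head -n 1
}
```

For example, `entry_age_ms "$(oldest_unprocessed_id gm:lobby_events lobby:stream_offsets:gm_events)" "$(( $(date +%s) * 1000 ))"` gives a rough manual counterpart to `lobby.gm_events.oldest_unprocessed_age_ms` when at least one unprocessed entry exists.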
## Configuration Groups
The full env-var list with defaults lives in `../README.md` §Configuration.
The groups below summarize the structure:
- **Required** — `LOBBY_REDIS_ADDR`, `LOBBY_USER_SERVICE_BASE_URL`,
`LOBBY_GM_BASE_URL`.
- **Process and logging** — `LOBBY_SHUTDOWN_TIMEOUT`, `LOBBY_LOG_LEVEL`.
- **HTTP listeners** — `LOBBY_PUBLIC_HTTP_*`, `LOBBY_INTERNAL_HTTP_*`.
- **Redis connectivity** — `LOBBY_REDIS_USERNAME`, `LOBBY_REDIS_PASSWORD`,
`LOBBY_REDIS_DB`, `LOBBY_REDIS_TLS_ENABLED`,
`LOBBY_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `LOBBY_GM_EVENTS_STREAM`, `LOBBY_RUNTIME_START_JOBS_STREAM`,
`LOBBY_RUNTIME_STOP_JOBS_STREAM`, `LOBBY_RUNTIME_JOB_RESULTS_STREAM`,
`LOBBY_NOTIFICATION_INTENTS_STREAM`, `LOBBY_USER_LIFECYCLE_STREAM`.
- **Upstream clients** — `LOBBY_USER_SERVICE_TIMEOUT`, `LOBBY_GM_TIMEOUT`.
- **Workers** — `LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`,
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`,
`LOBBY_RACE_NAME_DIRECTORY_BACKEND`.
- **Telemetry** — standard `OTEL_*` plus
`LOBBY_OTEL_STDOUT_TRACES_ENABLED`,
`LOBBY_OTEL_STDOUT_METRICS_ENABLED`.
## Runtime Notes
- `Game Lobby` owns platform game state. Game Master may cache snapshots but
is not the source of truth.
- The Race Name Directory ships a Redis adapter and an in-process stub; the
stub is intended for unit tests and is selected via
`LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- A `permanent_blocked` or `deleted` event from User Service fans out
asynchronously through the `user:lifecycle_events` consumer; in-flight
games owned by the affected user receive a stop-job and transition to
`cancelled` via the `external_block` trigger.
- `notification:intents` publishes are best-effort: a failed publish is
logged and counted but does not roll back the committed business state.