feat: game lobby service
@@ -0,0 +1,18 @@
# Game Lobby Docs

This directory keeps service-local documentation that is too detailed for the
root architecture documents and too diagram-heavy for the module README.

Sections:

- [Runtime and components](runtime.md)
- [Flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)

Primary references:

- `../README.md`: service scope, contracts, configuration, observability.
- `../api/public-openapi.yaml`: public REST contract.
- `../api/internal-openapi.yaml`: internal REST contract.
- `../../ARCHITECTURE.md`: workspace architecture (§7 Game Lobby).
- `../../notification/README.md`: notification intent catalog.
- `../../user/README.md`: User Service eligibility surface.
@@ -0,0 +1,195 @@
# Configuration And Contract Examples

The examples below are illustrative. Replace `localhost`, port numbers, IDs,
and timestamps with values that match the deployment under inspection.

## Example `.env`

A minimum-viable `LOBBY_*` set for a local run against a single Redis
container. The full list with defaults lives in `../README.md` §Configuration.

```bash
LOBBY_REDIS_ADDR=127.0.0.1:6379
LOBBY_USER_SERVICE_BASE_URL=http://127.0.0.1:8083
LOBBY_GM_BASE_URL=http://127.0.0.1:8096

LOBBY_PUBLIC_HTTP_ADDR=:8094
LOBBY_INTERNAL_HTTP_ADDR=:8095

LOBBY_LOG_LEVEL=info
LOBBY_SHUTDOWN_TIMEOUT=30s

LOBBY_RACE_NAME_DIRECTORY_BACKEND=redis
LOBBY_ENROLLMENT_AUTOMATION_INTERVAL=30s
LOBBY_RACE_NAME_EXPIRATION_INTERVAL=1h

OTEL_SERVICE_NAME=galaxy-lobby
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
LOBBY_OTEL_STDOUT_TRACES_ENABLED=false
LOBBY_OTEL_STDOUT_METRICS_ENABLED=false
```

## Public HTTP Examples

The public listener trusts the `X-User-ID` header injected by Edge Gateway.
Direct calls during development can supply the header manually.

### Submit an application to a public game

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-User-ID: user-01HZ...' \
  http://localhost:8094/api/v1/lobby/games/game-01HZ.../applications \
  -d '{"race_name":"Aurora"}'
```

Response (`200 OK`):

```json
{
  "application_id": "application-01HZ...",
  "game_id": "game-01HZ...",
  "user_id": "user-01HZ...",
  "status": "submitted",
  "created_at": 1714081234567
}
```

### List my open invites

```bash
curl -s \
  -H 'X-User-ID: user-01HZ...' \
  'http://localhost:8094/api/v1/lobby/my/invites?page_size=50'
```

### Register a race name from a pending entry

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-User-ID: user-01HZ...' \
  http://localhost:8094/api/v1/lobby/race-names/register \
  -d '{"race_name":"Aurora"}'
```

A `422` response with `error.code="race_name_pending_window_expired"`
indicates the 30-day window has elapsed and the user must enter a new game
to re-establish eligibility.
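
When scripting against this endpoint, it can help to separate the
expired-window case from other `422` outcomes. The sketch below is a
hypothetical helper: the function name and the labels it prints are not part
of the service. It assumes the error body nests the code as
`{"error":{"code":"..."}}` (inferred from the `error.code` notation above) and
that success is reported as `200`, as drawn in `flows.md`.

```bash
# classify_register_response <http_status> <json_body>
# Hypothetical helper: prints a short label for a race-name register response.
# Assumes the error envelope looks like {"error":{"code":"..."}}.
classify_register_response() {
  local status="$1" body="$2"
  case "$status" in
    200)
      echo "registered"
      ;;
    422)
      # Match the documented error code without requiring jq.
      if printf '%s' "$body" |
        grep -Eq '"code"[[:space:]]*:[[:space:]]*"race_name_pending_window_expired"'; then
        echo "window_expired"
      else
        echo "precondition_failed"
      fi
      ;;
    *)
      echo "unexpected_status"
      ;;
  esac
}

# Example wiring (not executed here): capture body and status separately.
#   status=$(curl -s -o /tmp/register.json -w '%{http_code}' -X POST ... )
#   classify_register_response "$status" "$(cat /tmp/register.json)"
```

A plain `grep` is used so the helper works on hosts without `jq`; swap in a
JSON parser if the body shape needs stricter validation.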

## Internal HTTP Examples

The internal listener admits the admin actor without `X-User-ID` and serves
GM-facing read paths.

### Create a public game (admin)

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  http://localhost:8095/api/v1/lobby/games \
  -d '{
    "game_name": "Spring Tournament",
    "game_type": "public",
    "min_players": 4,
    "max_players": 12,
    "start_gap_hours": 24,
    "start_gap_players": 4,
    "enrollment_ends_at": 1716673200,
    "turn_schedule": "0 18 * * *",
    "target_engine_version": "1.4.0"
  }'
```

### Read a game record (Game Master)

```bash
curl -s http://localhost:8095/api/v1/internal/games/game-01HZ...
```

### List memberships for a running game (Game Master)

```bash
curl -s http://localhost:8095/api/v1/internal/games/game-01HZ.../memberships
```

## Redis Examples

### Inspect a game record

```bash
redis-cli GET lobby:games:game-01HZ...
```

The value is a strict JSON blob with the fields documented in
`../README.md` §Game Record Model.

### Publish a runtime job result (Runtime Manager simulation)

Runtime Manager would normally publish this. The shape matches the consumer
in `internal/worker/runtimejobresult/consumer.go`.

```bash
redis-cli XADD runtime:job_results '*' \
  job_id 'runtime-job-01HZ...' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id 'container-7f...' \
  engine_endpoint '127.0.0.1:9100' \
  bound_at_ms 1714081239876
```

### Publish a Game Master runtime snapshot update

```bash
redis-cli XADD gm:lobby_events '*' \
  kind 'runtime_snapshot_update' \
  game_id 'game-01HZ...' \
  current_turn '12' \
  runtime_status 'healthy' \
  engine_health_summary 'ok' \
  player_turn_stats '[{"user_id":"user-01HZ...","planets":4,"population":900,"ships_built":17}]'
```

### Publish a game-finished event

```bash
redis-cli XADD gm:lobby_events '*' \
  kind 'game_finished' \
  game_id 'game-01HZ...' \
  finished_at_ms 1714123456789
```

### Inspect open enrollment games (sorted by created_at)

```bash
redis-cli ZRANGE lobby:games_by_status:enrollment_open 0 -1 WITHSCORES
```

## Notification Intent Format

Lobby produces every notification through `pkg/notificationintent` and
appends to `notification:intents` with plain `XADD`. A representative
intent for `lobby.application.submitted`:

```bash
redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "lobby.application.submitted",
    "producer": "lobby",
    "idempotency_key": "lobby.application.submitted:application-01HZ...",
    "audience": {"kind": "admin_email", "email_address_kind": "lobby_application_submitted"},
    "payload": {
      "game_id": "game-01HZ...",
      "game_name": "Spring Tournament",
      "applicant_user_id": "user-01HZ...",
      "applicant_name": "Aurora"
    }
  }'
```

The exact field set per type is documented in `../../notification/README.md`
and frozen by the AsyncAPI spec under
`../../notification/api/intents-asyncapi.yaml`.
@@ -0,0 +1,196 @@
# Flows

This document collects the eight platform flows that span Game Lobby plus
its synchronous and asynchronous neighbours. Narrative descriptions of the
rules these flows enforce live in `../README.md`; the diagrams here focus on
the message order across the boundary.

## Public Game Application

```mermaid
sequenceDiagram
    participant User
    participant Gateway
    participant Lobby as Lobby publichttp
    participant UserSvc as User Service
    participant Redis
    participant Stream as notification:intents

    User->>Gateway: lobby.application.submit(game_id, race_name)
    Gateway->>Lobby: POST /api/v1/lobby/games/{id}/applications + X-User-ID
    Lobby->>UserSvc: GetEligibility(user_id)
    UserSvc-->>Lobby: snapshot (entitlement, sanctions)
    Lobby->>Redis: persist Application(submitted) + indexes
    Lobby->>Stream: lobby.application.submitted (admin recipients)
    Lobby-->>Gateway: 200 ApplicationRecord
```

Approval and rejection follow the same pattern, mutating the application
status to `approved`/`rejected` and emitting
`lobby.membership.approved`/`lobby.membership.rejected` to the applicant.

## Private Game Invite

```mermaid
sequenceDiagram
    participant Owner
    participant Invitee
    participant Lobby
    participant Redis
    participant Stream as notification:intents

    Owner->>Lobby: lobby.invite.create(invitee_user_id)
    Lobby->>Redis: persist Invite(created)
    Lobby->>Stream: lobby.invite.created (recipient: invitee)

    Invitee->>Lobby: lobby.invite.redeem(race_name)
    Lobby->>Lobby: User Service guard for inviter and invitee
    Lobby->>Redis: RND.Reserve + Membership(active) + Invite(redeemed)
    Lobby->>Stream: lobby.invite.redeemed (recipient: owner)
```

The owner-facing decline and revoke transitions persist the invite status
update and produce no notification in v1.

## Enrollment Automation

```mermaid
sequenceDiagram
    participant Tick as Worker tick
    participant Lobby
    participant Redis
    participant Stream as notification:intents

    Tick->>Lobby: enrollment automation cycle
    Lobby->>Redis: load enrollment_open games + roster sizes
    alt deadline reached or gap exhausted
        Lobby->>Redis: status enrollment_open → ready_to_start (CAS)
        Lobby->>Redis: pending invites → expired
        Lobby->>Stream: lobby.invite.expired (per expired invite)
    else still within window
        Lobby-->>Tick: no-op
    end
```

Manual `lobby.game.ready_to_start` from owner or admin runs the same close
pipeline synchronously without waiting for the next tick.

## Game Start (happy path)

```mermaid
sequenceDiagram
    participant Actor as Owner or Admin
    participant Lobby
    participant Redis
    participant RT as Runtime Manager
    participant GM as Game Master

    Actor->>Lobby: lobby.game.start
    Lobby->>Redis: status ready_to_start → starting (CAS)
    Lobby->>Redis: XADD runtime:start_jobs
    RT->>Redis: XADD runtime:job_results (success + container metadata)
    Lobby->>Redis: persist runtime_binding on game record
    Lobby->>GM: POST /internal/games/{id}/register-runtime
    GM-->>Lobby: 200 OK
    Lobby->>Redis: status starting → running; set started_at
```

If runtime metadata persistence fails, Lobby publishes a stop-job to remove
the orphan container before flipping the game to `start_failed`.

## Game Start (GM unavailable)

```mermaid
sequenceDiagram
    participant Lobby
    participant Redis
    participant GM as Game Master
    participant Stream as notification:intents

    Lobby->>GM: POST /internal/games/{id}/register-runtime
    GM-->>Lobby: timeout / 5xx
    Lobby->>Redis: status starting → paused (CAS)
    Lobby->>Stream: lobby.runtime_paused_after_start (admin)
    Note over Lobby,GM: Container stays alive; admin restarts GM<br/>and issues lobby.game.resume.
```

## Game Finish + Capability Evaluation

```mermaid
sequenceDiagram
    participant GM as Game Master
    participant Stream as gm:lobby_events
    participant Lobby
    participant Redis
    participant Intents as notification:intents

    GM->>Stream: XADD runtime_snapshot_update (player_turn_stats)
    Lobby->>Redis: UpdateMax for each member's stats aggregate
    GM->>Stream: XADD game_finished
    Lobby->>Redis: status running/paused → finished; finished_at = event_ts
    Lobby->>Redis: capability evaluator runs per active membership
    alt member capable
        Lobby->>Redis: RND.MarkPendingRegistration(eligible_until = finished_at + 30d)
        Lobby->>Intents: lobby.race_name.registration_eligible (recipient: user)
    else not capable
        Lobby->>Redis: RND.ReleaseReservation
        Lobby->>Intents: lobby.race_name.registration_denied (optional)
    end
    Lobby->>Redis: ReleaseReservation for removed/blocked memberships
    Lobby->>Redis: delete per-game stats aggregate
```

The evaluation guard `lobby:capability_evaluation:done:<game_id>` makes a
replayed `game_finished` event a no-op.
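
The `eligible_until = finished_at + 30d` step in the diagram can be checked by
hand. The following is a minimal sketch, assuming both timestamps are Unix
milliseconds (as the `finished_at_ms` field in `examples.md` suggests); the
function and variable names are illustrative and do not come from the service
code.

```bash
# 30 days expressed in milliseconds: 30 * 24h * 60m * 60s * 1000ms.
PENDING_WINDOW_MS=$(( 30 * 24 * 60 * 60 * 1000 ))

# eligible_until_ms <finished_at_ms>
# Hypothetical helper: prints the end of the pending-registration window.
eligible_until_ms() {
  echo $(( $1 + PENDING_WINDOW_MS ))
}

# Using the finished_at_ms value from the game-finished example:
eligible_until_ms 1714123456789   # prints 1716715456789
```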

## Race Name Registration

```mermaid
sequenceDiagram
    participant User
    participant Lobby
    participant UserSvc as User Service
    participant RND as Race Name Directory
    participant Stream as notification:intents

    User->>Lobby: lobby.race_name.register(race_name)
    Lobby->>UserSvc: GetEligibility (sanctions, max_registered_race_names)
    UserSvc-->>Lobby: snapshot
    Lobby->>RND: Register(game_id, user_id, race_name)
    RND-->>Lobby: ok / ErrPendingExpired / ErrQuotaExceeded
    alt success
        Lobby->>Stream: lobby.race_name.registered (recipient: user)
        Lobby-->>User: 200 RegisteredRaceName
    else precondition failure
        Lobby-->>User: 422 DomainPreconditionError
    end
```

Registration consumes one tariff slot keyed by `(canonical_key, user_id)`;
tariff downgrade never revokes existing registrations.

## Cascade Release on User Lifecycle Event

```mermaid
sequenceDiagram
    participant US as User Service
    participant Stream as user:lifecycle_events
    participant Lobby
    participant RT as Runtime Manager
    participant Intents as notification:intents

    US->>Stream: XADD permanent_blocked or deleted
    Lobby->>Stream: XREAD (consumer)
    Lobby->>Lobby: RND.ReleaseAllByUser
    Lobby->>Lobby: memberships → blocked + lobby.membership.blocked per private game
    Lobby->>Lobby: applications → rejected
    Lobby->>Lobby: invites (addressed and inviter-side) → revoked
    Lobby->>Lobby: owned non-terminal games → cancelled (external_block trigger)
    Lobby->>RT: XADD runtime:stop_jobs for in-flight owned games
    Lobby->>Intents: lobby.membership.blocked per affected membership
    Lobby->>Stream: advance offset
```

Every step is idempotent at the store layer (`ErrConflict` from a CAS is
treated as "already done"); the consumer only advances the offset once the
handler returns nil.
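
The hold-or-advance rule can be sketched as a small shell function. This
illustrates the semantics described above and is not the consumer's actual
code: `advance_offset` and the handler convention (exit status 0 standing in
for "the handler returned nil") are hypothetical.

```bash
# advance_offset <current_offset> <handler> <entry_id>...
# Hypothetical model of the consumer loop: each entry is passed to the
# handler in order; the offset moves only past entries the handler accepted,
# and the first failure stops the cycle so the same entry is retried later.
advance_offset() {
  local offset="$1" handler="$2"
  shift 2
  local id
  for id in "$@"; do
    if "$handler" "$id"; then
      offset="$id"   # handler succeeded: safe to persist this entry's ID
    else
      break          # handler failed: hold the offset on the previous entry
    fi
  done
  printf '%s\n' "$offset"
}
```

Because every store mutation is idempotent, re-running the handler for the
held entry on the next cycle is safe even if part of the cascade already
completed.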
@@ -0,0 +1,220 @@
# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and the handful of recovery paths specific to Lobby.

## Startup Checks

Before starting the process, confirm:

- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
  six Lobby-related streams listed below.
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
  the network the Lobby pods run in. Lobby does not ping these at boot,
  but transport failures against them will surface as request errors.
- Stream names match the producers/consumers Lobby integrates with:
  - `LOBBY_GM_EVENTS_STREAM` (default `gm:lobby_events`)
  - `LOBBY_RUNTIME_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `LOBBY_RUNTIME_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
  - `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
  `stub` value is only for unit tests.

At startup the process performs a bounded `PING` against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.

Expected listener state after a healthy start:

- public HTTP is enabled on `LOBBY_PUBLIC_HTTP_ADDR` (default `:8094`);
- internal HTTP is enabled on `LOBBY_INTERNAL_HTTP_ADDR` (default `:8095`);
- both ports answer `GET /healthz` and `GET /readyz`.

Expected log lines:

- `lobby starting` from `cmd/lobby`;
- one `redis ping ok` line;
- one `public http listening` and one `internal http listening` line;
- one `worker started` line per background worker (six expected).

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms the listener is alive;
- `GET /readyz` confirms the runtime wiring completed and Redis was reachable
  at boot.

`/readyz` is process-local. It does not confirm:

- ongoing Redis health after boot;
- User Service reachability;
- Game Master reachability;
- worker liveness.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz` on both ports;
3. verify `lobby.active_games` gauge is non-zero in the metrics backend after
   the first traffic;
4. verify `lobby.gm_events.oldest_unprocessed_age_ms` is small or zero after
   GM starts emitting events.
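
Step 2 of the list above can be scripted. The following is a rough sketch
that assumes the default ports `8094` and `8095`; the function names are
hypothetical, and the host argument should be replaced with whatever address
the listeners are actually bound to.

```bash
# lobby_probe_urls [host]
# Hypothetical helper: prints the four probe URLs (two probes on two ports),
# assuming the default listener ports.
lobby_probe_urls() {
  local host="${1:-localhost}" port path
  for port in 8094 8095; do
    for path in /healthz /readyz; do
      printf 'http://%s:%s%s\n' "$host" "$port" "$path"
    done
  done
}

# check_lobby_probes [host]
# Calls every probe URL; returns non-zero if any probe fails or times out.
check_lobby_probes() {
  local url status=0
  for url in $(lobby_probe_urls "$1"); do
    if curl -fsS --max-time 2 "$url" >/dev/null; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      status=1
    fi
  done
  return "$status"
}
```

A passing run only covers the process-local guarantees listed above; the gauge
checks in steps 3 and 4 still have to be made in the metrics backend.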

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `LOBBY_SHUTDOWN_TIMEOUT`;
- HTTP listeners drain in-flight requests before closing;
- background workers stop their `XREAD` loops and persist the latest offset;
- pending consumer offsets are flushed before exit.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any worker that was mid-cycle to retry from the persisted offset
   on the next process start;
4. investigate only if shutdown exceeds `LOBBY_SHUTDOWN_TIMEOUT`.

## Stuck `starting` Recovery

A game that flips to `starting` but never completes one of the post-start
steps will stay in `starting` until manual recovery.

Symptoms:

- `lobby.active_games{status="starting"}` gauge non-zero for longer than the
  expected start budget (Runtime Manager start time + GM register call);
- per-game logs show `start_job_published` but no `runtime_job_result` or
  `register_runtime_outcome` follow-up.

Recovery:

1. Identify the affected `game_id` from the gauge labels or logs.
2. Inspect `runtime:job_results` for the `runtime_job_id` published by
   Lobby. If absent, Runtime Manager never produced a result; resolve at
   the runtime layer.
3. If the result exists with `success=true` but no GM call was made, retry
   with the admin or owner command `lobby.game.retry_start`.
4. If the result exists with `success=false`, transition through the
   `start_failed` path and use `lobby.game.cancel` or `retry_start` once
   the underlying issue is resolved.
5. If the metadata persistence step failed, Lobby has already published a
   stop-job and moved the game to `start_failed`. Confirm the orphan
   container was removed by Runtime Manager.

Lobby always re-accepts a `start` command on a game that is stuck in
`starting`: the first action is a CAS attempt, and a second `start` from a
re-issued admin command will progress the state machine.
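
Recovery steps 2 to 4 amount to a small decision table, which the sketch below
spells out. The function name, its two arguments, and the labels it prints are
all hypothetical; the `inspect_logs` fallback for "result succeeded and GM was
called" is an assumption added here, since the steps above do not cover that
combination.

```bash
# starting_triage <job_result: absent|success|failure> <gm_called: yes|no>
# Hypothetical helper that maps the observed state to the next recovery step.
starting_triage() {
  case "$1:$2" in
    absent:*)
      # Step 2: Runtime Manager never produced a result.
      echo "check_runtime_manager"
      ;;
    success:no)
      # Step 3: container is up but register-runtime was never attempted.
      echo "retry_start"
      ;;
    failure:*)
      # Step 4: go through start_failed, then cancel or retry once fixed.
      echo "cancel_or_retry_start"
      ;;
    *)
      # Not covered by the runbook steps; fall back to per-game logs.
      echo "inspect_logs"
      ;;
  esac
}
```

For example, `starting_triage success no` prints `retry_start`, matching the
`lobby.game.retry_start` command named in step 3.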

## Stuck Stream Offsets

Three stream-lag gauges describe the consumer health:

- `lobby.gm_events.oldest_unprocessed_age_ms`
- `lobby.runtime_results.oldest_unprocessed_age_ms`
- `lobby.user_lifecycle.oldest_unprocessed_age_ms`

A persistently increasing gauge means the consumer is unable to advance.
Causes and triage:

1. **Decoder rejects a malformed entry.** The consumer logs `malformed_event`
   and advances the offset; this should not stall the stream. If the gauge
   keeps climbing, there is a real handler error.
2. **Handler returns a non-nil error.** The consumer holds the offset and
   retries on every cycle. Inspect the latest log lines to identify the
   error class (Redis transient, RND store error, RuntimeManager publish
   failure for cascade events).
3. **Process restart loop.** A crash before persisting the offset does not
   advance progress. Check pod restart counts and `cmd/lobby` panics.

After the underlying cause is fixed, the consumer resumes from the persisted
offset; no manual intervention to the offset key is required in normal
operation. If a corrupt entry must be skipped, advance
`lobby:stream_offsets:<label>` to the next valid stream ID and restart the
process.
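
To see which entry a stalled consumer is sitting on, the persisted offset can
be compared with the stream directly. The following is a sketch under stated
assumptions: the offset key holds the last processed stream ID as a plain
string, stream IDs were auto-generated (so the part before `-` is a
millisecond timestamp), and the Redis server is 6.2 or newer, which is when
`XRANGE` gained the exclusive `(` prefix. Both function names are
hypothetical.

```bash
# entry_age_ms <stream_id> <now_ms>
# Derives an entry's age from the millisecond prefix of its stream ID.
entry_age_ms() {
  local id="$1" now="$2"
  echo $(( now - ${id%%-*} ))
}

# oldest_unprocessed <stream> <offset_key>
# Prints the first entry strictly after the persisted offset, if any.
oldest_unprocessed() {
  local stream="$1" offset
  offset="$(redis-cli GET "$2")"
  # Fall back to the start of the stream when no offset has been persisted.
  redis-cli XRANGE "$stream" "(${offset:-0-0}" + COUNT 1
}

# Example (not executed here):
#   oldest_unprocessed gm:lobby_events lobby:stream_offsets:gm_events
```

An entry returned here whose age keeps growing between checks corresponds to
the climbing gauge; an empty result means the consumer has caught up.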

## Pending Registration Window Expiry

The pending-registration expirer ticks every
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL` (default `1h`) and releases
`pending_registration` entries past their `eligible_until` timestamp.

The 30-day window length is the in-process constant
`service/capabilityevaluation.PendingRegistrationWindow`. Operator-tunable
override is reserved for a future change under the env var
`LOBBY_PENDING_REGISTRATION_TTL_HOURS`; today the constant is final.

The worker absorbs Race Name Directory failures: a failing `Expire` call is
logged at warn level, the worker waits for the next tick, and no offset is
moved (there is no offset; this is a periodic worker, not a consumer). A
backlog of expirable entries is therefore self-healing once the directory
is reachable again.

To inspect the backlog:

```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
```

Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
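
Rather than scanning the full listing by eye, the expirable entries can be
counted with a score-range query. This is a minimal sketch that relies on the
scores being Unix milliseconds as stated above; `now_ms` and
`count_expirable` are hypothetical helper names, and `now_ms` has only
one-second resolution because it is built on `date +%s`.

```bash
# now_ms
# Current time in Unix milliseconds (second resolution, portable date usage).
now_ms() {
  echo $(( $(date +%s) * 1000 ))
}

# count_expirable
# Counts pending entries whose score is strictly below the current time.
# The "(" prefix makes the upper bound exclusive, matching score < now().
count_expirable() {
  redis-cli ZCOUNT lobby:race_names:pending_index -inf "($(now_ms)"
}
```

A count that stays above zero across several expiration intervals suggests
the expirer is failing on each tick; the warn-level logs mentioned above are
the next place to look.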

## Cascade Release Operator Notes

The `user:lifecycle_events` consumer fans out a single user-lifecycle event
into many actions:

1. Race Name Directory release (`RND.ReleaseAllByUser`).
2. Membership status flips (`active` → `blocked`) on every membership the
   user holds, with a `lobby.membership.blocked` notification per
   third-party private game.
3. Application status flips (`submitted` → `rejected`).
4. Invite status flips (`created` → `revoked`) on both addressed and
   inviter-side invites.
5. Owned non-terminal games transition to `cancelled` via the
   `external_block` trigger. In-flight statuses (`starting`, `running`,
   `paused`) get a stop-job published to Runtime Manager before the game
   record is updated.

The cascade is idempotent: every store mutation uses CAS, and `ErrConflict`
is treated as "already done". A retry on the next consumer cycle will
re-traverse the same set without producing duplicate side effects.

A single failing step (transient store error or runtime stop-job publish
failure) leaves the offset on the current entry. The next cycle retries the
full cascade. Do not advance the offset manually unless you have first
verified that the cascade actions for the current entry have been completed
out-of-band.

## Diagnostic Queries

A handful of Redis CLI snippets help during incidents:

```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running

# Inspect a specific game record
redis-cli GET lobby:games:<game_id>

# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>

# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES

# Stream lag inspection
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```

The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.
@@ -0,0 +1,163 @@
# Runtime and Components

The diagram below focuses on the deployed `galaxy/lobby` process and its
runtime dependencies.

```mermaid
flowchart LR
    subgraph Clients
        Gateway["Edge Gateway"]
        Admin["Admin Service"]
        GM["Game Master"]
    end

    subgraph Lobby["Game Lobby process"]
        PublicHTTP["Public HTTP listener\n:8094 /healthz /readyz"]
        InternalHTTP["Internal HTTP listener\n:8095 /healthz /readyz"]
        EnrollAuto["Enrollment automation worker"]
        RTJobsConsumer["runtime:job_results consumer"]
        GMEventsConsumer["gm:lobby_events consumer"]
        PendingExpirer["Pending registration expirer"]
        ULConsumer["user:lifecycle_events consumer"]
        IntentPublisher["notification:intents publisher"]
        Telemetry["Logs, traces, metrics"]
    end

    User["User Service"]
    Redis["Redis\nKV + Streams"]

    Gateway --> PublicHTTP
    Admin --> InternalHTTP
    GM --> InternalHTTP

    PublicHTTP --> User
    InternalHTTP --> User
    PublicHTTP -. register-runtime .-> GM
    InternalHTTP -. register-runtime .-> GM

    EnrollAuto --> Redis
    RTJobsConsumer --> Redis
    GMEventsConsumer --> Redis
    PendingExpirer --> Redis
    ULConsumer --> Redis
    IntentPublisher --> Redis

    PublicHTTP --> Redis
    InternalHTTP --> Redis

    PublicHTTP --> Telemetry
    InternalHTTP --> Telemetry
    EnrollAuto --> Telemetry
    RTJobsConsumer --> Telemetry
    GMEventsConsumer --> Telemetry
    PendingExpirer --> Telemetry
    ULConsumer --> Telemetry
```

Notes:

- `cmd/lobby` refuses startup when Redis connectivity is misconfigured. User
  Service and Game Master reachability are not verified at boot; transport
  failures surface as request errors.
- Both HTTP listeners expose `/healthz` and `/readyz` independently so health
  checks can target either port.
- `register-runtime` is an outgoing call from Lobby to Game Master after the
  container start completes. Lobby does not expose an inbound endpoint of the
  same name.

## Listeners

| Listener | Default addr | Purpose |
| --- | --- | --- |
| Public HTTP | `:8094` | Authenticated user routes; gateway-facing |
| Internal HTTP | `:8095` | Admin-mirrored routes + Game Master read paths |

Shared listener defaults:

- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`

Public-port routes carry an `X-User-ID` header injected by Edge Gateway;
internal-port routes admit the admin actor without the header.

Probe routes:

- `GET /healthz` returns `{"status":"ok"}`.
- `GET /readyz` returns `{"status":"ready"}` once startup wiring completes.
- Neither probe performs a live Redis ping per request.
- There is no `/metrics` route. Metrics flow through OpenTelemetry exporters.

## Background Workers

| Worker | Trigger | Function |
| --- | --- | --- |
| Enrollment automation | Periodic tick (`LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`) | Closes enrollment when the deadline or the gap window is exhausted. |
| `runtime:job_results` consumer | Redis `XREAD` | Drives `starting` to `running`/`paused`/`start_failed` based on Runtime Manager outcomes. |
| `gm:lobby_events` consumer | Redis `XREAD` | Applies runtime snapshot updates and game-finish events from Game Master; hands `game_finished` events off to capability evaluation. |
| Pending registration expirer | Periodic tick (`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`) | Releases `pending_registration` entries past their 30-day window. |
| `user:lifecycle_events` consumer | Redis `XREAD` | Fans out the cascade for `permanent_blocked` and `deleted` user events (RND release, membership block, application/invite cancel, owned-game cancel). |
| `notification:intents` publisher | Synchronous from services | Wraps every notification publish with metric instrumentation; producer-side failures degrade notifications without rolling back business state. |

## Synchronous Upstream Clients

| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `User Service` eligibility | `POST {LOBBY_USER_SERVICE_BASE_URL}/api/v1/internal/users/{user_id}/lobby-eligibility` | Network or non-2xx → `503 service_unavailable`; `permanent_block` → `404 subject_not_found`. |
| `Game Master` register-runtime | `POST {LOBBY_GM_BASE_URL}/api/v1/internal/games/{game_id}/register-runtime` | Network or non-2xx → forced-pause path (`paused` + `lobby.runtime_paused_after_start`). |
| `Game Master` liveness probe | `GET {LOBBY_GM_BASE_URL}/api/v1/internal/healthz` | Used during `lobby.game.resume`; failure surfaces as `503 service_unavailable`. |

## Stream Offsets

Each consumer persists its position under a dedicated key so process restart
preserves stream progress.

| Stream | Offset key | Read block timeout env |
| --- | --- | --- |
| `gm:lobby_events` | `lobby:stream_offsets:gm_events` | `LOBBY_GM_EVENTS_READ_BLOCK_TIMEOUT` |
| `runtime:job_results` | `lobby:stream_offsets:runtime_results` | `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` |
| `user:lifecycle_events` | `lobby:stream_offsets:user_lifecycle` | `LOBBY_USER_LIFECYCLE_READ_BLOCK_TIMEOUT` |

Stream lag is exposed through observable gauges
`lobby.gm_events.oldest_unprocessed_age_ms`,
`lobby.runtime_results.oldest_unprocessed_age_ms`, and
`lobby.user_lifecycle.oldest_unprocessed_age_ms`. The probe samples the
oldest entry whose ID is greater than the persisted offset; when a consumer
lags or stalls, the gauge climbs and stays high.

## Configuration Groups

The full env-var list with defaults lives in `../README.md` §Configuration.
The groups below summarize the structure:

- **Required**: `LOBBY_REDIS_ADDR`, `LOBBY_USER_SERVICE_BASE_URL`,
  `LOBBY_GM_BASE_URL`.
- **Process and logging**: `LOBBY_SHUTDOWN_TIMEOUT`, `LOBBY_LOG_LEVEL`.
- **HTTP listeners**: `LOBBY_PUBLIC_HTTP_*`, `LOBBY_INTERNAL_HTTP_*`.
- **Redis connectivity**: `LOBBY_REDIS_USERNAME`, `LOBBY_REDIS_PASSWORD`,
  `LOBBY_REDIS_DB`, `LOBBY_REDIS_TLS_ENABLED`,
  `LOBBY_REDIS_OPERATION_TIMEOUT`.
- **Streams**: `LOBBY_GM_EVENTS_STREAM`, `LOBBY_RUNTIME_START_JOBS_STREAM`,
  `LOBBY_RUNTIME_STOP_JOBS_STREAM`, `LOBBY_RUNTIME_JOB_RESULTS_STREAM`,
  `LOBBY_NOTIFICATION_INTENTS_STREAM`, `LOBBY_USER_LIFECYCLE_STREAM`.
- **Upstream clients**: `LOBBY_USER_SERVICE_TIMEOUT`, `LOBBY_GM_TIMEOUT`.
- **Workers**: `LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`,
  `LOBBY_RACE_NAME_EXPIRATION_INTERVAL`,
  `LOBBY_RACE_NAME_DIRECTORY_BACKEND`.
- **Telemetry**: standard `OTEL_*` plus
  `LOBBY_OTEL_STDOUT_TRACES_ENABLED`,
  `LOBBY_OTEL_STDOUT_METRICS_ENABLED`.

## Runtime Notes

- `Game Lobby` owns platform game state. Game Master may cache snapshots but
  is not the source of truth.
- The Race Name Directory ships a Redis adapter and an in-process stub; the
  stub is intended for unit tests and is selected via
  `LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- A `permanent_blocked` or `deleted` event from User Service fans out
  asynchronously through the `user:lifecycle_events` consumer; in-flight
  games owned by the affected user receive a stop-job and transition to
  `cancelled` via the `external_block` trigger.
- `notification:intents` publishes are best-effort: a failed publish is
  logged and counted but does not roll back the committed business state.