# Runtime and Components
The diagram below focuses on the deployed `galaxy/lobby` process and its
runtime dependencies.
```mermaid
flowchart LR
subgraph Clients
Gateway["Edge Gateway"]
Admin["Admin Service"]
GM["Game Master"]
end
subgraph Lobby["Game Lobby process"]
PublicHTTP["Public HTTP listener\n:8094 /healthz /readyz"]
InternalHTTP["Internal HTTP listener\n:8095 /healthz /readyz"]
EnrollAuto["Enrollment automation worker"]
RTJobsConsumer["runtime:job_results consumer"]
GMEventsConsumer["gm:lobby_events consumer"]
PendingExpirer["Pending registration expirer"]
ULConsumer["user:lifecycle_events consumer"]
IntentPublisher["notification:intents publisher"]
Telemetry["Logs, traces, metrics"]
end
User["User Service"]
Redis["Redis\nKV + Streams"]
Gateway --> PublicHTTP
Admin --> InternalHTTP
GM --> InternalHTTP
PublicHTTP --> User
InternalHTTP --> User
PublicHTTP -. register-runtime .-> GM
InternalHTTP -. register-runtime .-> GM
EnrollAuto --> Redis
RTJobsConsumer --> Redis
GMEventsConsumer --> Redis
PendingExpirer --> Redis
ULConsumer --> Redis
IntentPublisher --> Redis
PublicHTTP --> Redis
InternalHTTP --> Redis
PublicHTTP --> Telemetry
InternalHTTP --> Telemetry
EnrollAuto --> Telemetry
RTJobsConsumer --> Telemetry
GMEventsConsumer --> Telemetry
PendingExpirer --> Telemetry
ULConsumer --> Telemetry
```
Notes:
- `cmd/lobby` refuses startup when Redis connectivity is misconfigured, when
PostgreSQL is unreachable, or when the embedded goose migrations fail to
apply. User Service and Game Master reachability are not verified at boot;
transport failures surface as request errors.
- Both HTTP listeners expose `/healthz` and `/readyz` independently so health
checks can target either port.
- `register-runtime` is an outgoing call from Lobby to Game Master after the
container start completes. Lobby does not expose an inbound endpoint of the
same name.
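
The fail-fast boot behavior described in the first note can be sketched as an ordered list of checks where the first failure aborts startup. This is a minimal illustration, not the actual `cmd/lobby` wiring: the `startupCheck` type, `runStartupChecks`, and the check names are hypothetical, and the check bodies are stand-ins for the real Redis, PostgreSQL, and goose calls.

```go
package main

import (
	"errors"
	"fmt"
)

// startupCheck is a hypothetical named boot-time gate.
type startupCheck struct {
	name string
	run  func() error
}

// errUnreachable stands in for a real connectivity error.
var errUnreachable = errors.New("connection refused")

// runStartupChecks executes checks in order and returns the first failure,
// so the caller can refuse to start instead of serving in a degraded state.
func runStartupChecks(checks []startupCheck) error {
	for _, c := range checks {
		if err := c.run(); err != nil {
			return fmt.Errorf("startup check %q failed: %w", c.name, err)
		}
	}
	return nil
}

func main() {
	checks := []startupCheck{
		{name: "redis config", run: func() error { return nil }},
		{name: "postgres ping", run: func() error { return errUnreachable }},
		{name: "goose migrations", run: func() error { return nil }},
	}
	if err := runStartupChecks(checks); err != nil {
		fmt.Println("refusing startup:", err)
	}
}
```

User Service and Game Master are deliberately absent from the list, matching the note that their reachability is not verified at boot.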
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Public HTTP | `:8094` | Authenticated user routes; gateway-facing |
| Internal HTTP | `:8095` | Admin-mirrored routes + Game Master read paths |

Shared listener defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`

Public-port routes carry an `X-User-ID` header injected by Edge Gateway;
internal-port routes admit the admin actor without the header.

Probe routes:
- `GET /healthz` returns `{"status":"ok"}`.
- `GET /readyz` returns `{"status":"ready"}` once startup wiring completes.
- Neither probe performs a live Redis or PostgreSQL ping per request.
- There is no `/metrics` route. Metrics flow through OpenTelemetry exporters.
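
The listener defaults and probe routes in this section map naturally onto Go's `net/http`. The sketch below is an illustration under assumptions: `newProbeMux`, `newServer`, `probe`, and the `markReady` callback are hypothetical names, and the `503` returned by `/readyz` before startup wiring completes is an assumption, since the source only states the response once the service is ready.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// writeJSON sends a fixed JSON body with the given status code.
func writeJSON(w http.ResponseWriter, status int, body string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	io.WriteString(w, body)
}

// newProbeMux registers the probe routes and returns a callback that marks
// the process ready. Neither handler pings Redis or PostgreSQL.
func newProbeMux() (http.Handler, func()) {
	var ready atomic.Bool
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		writeJSON(w, http.StatusOK, `{"status":"ok"}`)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			// Assumption: not-ready is reported as 503 with an empty body.
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		writeJSON(w, http.StatusOK, `{"status":"ready"}`)
	})
	return mux, func() { ready.Store(true) }
}

// newServer applies the shared listener defaults listed above.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       10 * time.Second,
		IdleTimeout:       time.Minute,
	}
}

// probe issues an in-memory GET against the handler.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h, markReady := newProbeMux()
	markReady() // called once startup wiring completes
	srv := newServer(":8094", h)
	code, body := probe(srv.Handler, "/readyz")
	fmt.Println(srv.Addr, code, body)
}
```

Because both listeners expose the probes independently, the same handler could be mounted on a second `http.Server` bound to `:8095`.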
## Background Workers
| Worker | Trigger | Function |
| --- | --- | --- |
| Enrollment automation | Periodic tick (`LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`) | Closes enrollment when the deadline passes or the gap window is exhausted. |
| `runtime:job_results` consumer | Redis `XREAD` | Drives `starting` to `running`/`paused`/`start_failed` based on Runtime Manager outcomes. |
| `gm:lobby_events` consumer | Redis `XREAD` | Applies runtime snapshot updates and game-finish events from Game Master; hands `game_finished` events off to capability evaluation. |
| Pending registration expirer | Periodic tick (`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`) | Releases `pending_registration` entries past their 30-day window. |
| `user:lifecycle_events` consumer | Redis `XREAD` | Fans out the cascade for `permanent_blocked` and `deleted` user events (Race Name Directory release, membership block, application/invite cancel, owned-game cancel). |
| `notification:intents` publisher | Synchronous from services | Wraps every notification publish with metric instrumentation; producer-side failures degrade notifications without rolling back business state. |
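
The two tick-driven workers in the table share the same shape: run a step on a fixed interval until the process shuts down. A minimal sketch of that loop follows; `runPeriodic` and `countThreeTicks` are hypothetical names, and the step body is a placeholder for the real enrollment or expiration logic.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodic calls step on every tick until ctx is cancelled.
// interval must be positive; time.NewTicker panics otherwise.
func runPeriodic(ctx context.Context, interval time.Duration, step func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := step(ctx); err != nil {
				// A real worker would log and count the failure, then keep ticking.
				fmt.Println("tick failed:", err)
			}
		}
	}
}

// countThreeTicks runs the loop with a short interval and cancels it from
// inside the step after the third tick.
func countThreeTicks() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ticks := 0
	runPeriodic(ctx, 10*time.Millisecond, func(context.Context) error {
		ticks++
		if ticks == 3 {
			cancel()
		}
		return nil
	})
	return ticks
}

func main() {
	fmt.Println("ticks observed:", countThreeTicks())
}
```

In the service, the interval would come from `LOBBY_ENROLLMENT_AUTOMATION_INTERVAL` or `LOBBY_RACE_NAME_EXPIRATION_INTERVAL`, and cancellation from the shutdown signal.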
## Synchronous Upstream Clients
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `User Service` eligibility | `POST {LOBBY_USER_SERVICE_BASE_URL}/api/v1/internal/users/{user_id}/lobby-eligibility` | Network or non-2xx → `503 service_unavailable`; `permanent_block` → `404 subject_not_found`. |
| `Game Master` register-runtime | `POST {LOBBY_GM_BASE_URL}/api/v1/internal/games/{game_id}/register-runtime` | Network or non-2xx → forced-pause path (`paused` + `lobby.runtime_paused_after_start`). |
| `Game Master` liveness probe | `GET {LOBBY_GM_BASE_URL}/api/v1/internal/healthz` | Used during `lobby.game.resume`; failure surfaces as `503 service_unavailable`. |
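
The eligibility failure mapping in the first row can be expressed as a small pure function. This is a sketch under assumptions: `lobbyError`, `eligibilityReply`, and `mapEligibility` are hypothetical, and the source does not say how User Service signals `permanent_block`, so a boolean field on the decoded reply is assumed here.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// lobbyError is a hypothetical error carrying the HTTP status and error code.
type lobbyError struct {
	Status int
	Code   string
}

func (e *lobbyError) Error() string { return fmt.Sprintf("%d %s", e.Status, e.Code) }

// eligibilityReply is a hypothetical decoded response body.
type eligibilityReply struct {
	PermanentBlock bool // assumption: how permanent_block is reported
}

// errTransport stands in for a network-level failure.
var errTransport = errors.New("dial tcp: timeout")

// mapEligibility applies the table's mapping: transport errors and non-2xx
// become 503 service_unavailable; permanent_block becomes 404 subject_not_found.
func mapEligibility(status int, transportErr error, reply eligibilityReply) error {
	if transportErr != nil || status < 200 || status > 299 {
		return &lobbyError{Status: http.StatusServiceUnavailable, Code: "service_unavailable"}
	}
	if reply.PermanentBlock {
		return &lobbyError{Status: http.StatusNotFound, Code: "subject_not_found"}
	}
	return nil
}

func main() {
	fmt.Println(mapEligibility(0, errTransport, eligibilityReply{}))
	fmt.Println(mapEligibility(http.StatusOK, nil, eligibilityReply{PermanentBlock: true}))
	fmt.Println(mapEligibility(http.StatusOK, nil, eligibilityReply{}))
}
```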
## Stream Offsets
Each consumer persists its position under a dedicated key so process restart
preserves stream progress.
| Stream | Offset key | Read block timeout env |
| --- | --- | --- |
| `gm:lobby_events` | `lobby:stream_offsets:gm_events` | `LOBBY_GM_EVENTS_READ_BLOCK_TIMEOUT` |
| `runtime:job_results` | `lobby:stream_offsets:runtime_results` | `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` |
| `user:lifecycle_events` | `lobby:stream_offsets:user_lifecycle` | `LOBBY_USER_LIFECYCLE_READ_BLOCK_TIMEOUT` |

Stream lag is exposed through observable gauges
`lobby.gm_events.oldest_unprocessed_age_ms`,
`lobby.runtime_results.oldest_unprocessed_age_ms`, and
`lobby.user_lifecycle.oldest_unprocessed_age_ms`. The probe samples the
oldest entry whose ID is greater than the persisted offset; when a consumer
lags or stalls, the gauge climbs and stays high.
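
One way to derive such a gauge value is from the stream entry ID itself, since auto-generated Redis stream IDs have the form `<milliseconds>-<sequence>`. The sketch below assumes the streams use auto-generated IDs and that the caller has already fetched the oldest entry after the persisted offset (for example with an exclusive-start `XRANGE ... COUNT 1`); `oldestUnprocessedAgeMs` is a hypothetical helper, not the service's actual probe.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// oldestUnprocessedAgeMs returns the age in milliseconds of the oldest
// unprocessed entry. An empty ID means the consumer is caught up, which
// reports zero lag. Negative ages caused by clock skew are clamped to zero.
func oldestUnprocessedAgeMs(oldestID string, nowMs int64) (int64, error) {
	if oldestID == "" {
		return 0, nil
	}
	msPart, _, ok := strings.Cut(oldestID, "-")
	if !ok {
		return 0, fmt.Errorf("malformed stream ID %q", oldestID)
	}
	entryMs, err := strconv.ParseInt(msPart, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("malformed stream ID %q: %w", oldestID, err)
	}
	age := nowMs - entryMs
	if age < 0 {
		age = 0
	}
	return age, nil
}

func main() {
	age, err := oldestUnprocessedAgeMs("1714156479000-0", time.Now().UnixMilli())
	fmt.Println(age, err)
}
```

Because the value is measured against the oldest pending entry rather than the newest, it keeps growing while a consumer is stalled, which matches the behavior described above.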
## Configuration Groups
The full env-var list with defaults lives in `../README.md` §Configuration.
The groups below summarize the structure:
- **Required** — `LOBBY_REDIS_MASTER_ADDR`, `LOBBY_REDIS_PASSWORD`,
`LOBBY_POSTGRES_PRIMARY_DSN`, `LOBBY_USER_SERVICE_BASE_URL`,
`LOBBY_GM_BASE_URL`.
- **Process and logging** — `LOBBY_SHUTDOWN_TIMEOUT`, `LOBBY_LOG_LEVEL`.
- **HTTP listeners** — `LOBBY_PUBLIC_HTTP_*`, `LOBBY_INTERNAL_HTTP_*`.
- **Redis connectivity** — `LOBBY_REDIS_MASTER_ADDR`,
`LOBBY_REDIS_REPLICA_ADDRS`, `LOBBY_REDIS_PASSWORD`, `LOBBY_REDIS_DB`,
`LOBBY_REDIS_OPERATION_TIMEOUT`. The legacy `LOBBY_REDIS_ADDR`,
`LOBBY_REDIS_TLS_ENABLED`, and `LOBBY_REDIS_USERNAME` variables were removed in
PG_PLAN.md §6A.
- **PostgreSQL connectivity** — `LOBBY_POSTGRES_PRIMARY_DSN`,
`LOBBY_POSTGRES_REPLICA_DSNS`, `LOBBY_POSTGRES_OPERATION_TIMEOUT`,
`LOBBY_POSTGRES_MAX_OPEN_CONNS`, `LOBBY_POSTGRES_MAX_IDLE_CONNS`,
`LOBBY_POSTGRES_CONN_MAX_LIFETIME`.
- **Streams** — `LOBBY_GM_EVENTS_STREAM`, `LOBBY_RUNTIME_START_JOBS_STREAM`,
`LOBBY_RUNTIME_STOP_JOBS_STREAM`, `LOBBY_RUNTIME_JOB_RESULTS_STREAM`,
`LOBBY_NOTIFICATION_INTENTS_STREAM`, `LOBBY_USER_LIFECYCLE_STREAM`.
- **Upstream clients** — `LOBBY_USER_SERVICE_TIMEOUT`, `LOBBY_GM_TIMEOUT`.
- **Workers** — `LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`,
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`,
`LOBBY_RACE_NAME_DIRECTORY_BACKEND`.
- **Telemetry** — standard `OTEL_*` plus
`LOBBY_OTEL_STDOUT_TRACES_ENABLED`,
`LOBBY_OTEL_STDOUT_METRICS_ENABLED`.
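
A loader for the **Required** group might validate all five variables up front and report every missing one at once. The sketch below is illustrative only: `missingRequired` is a hypothetical helper, and treating an empty value as missing is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredEnv lists the variables from the Required group above.
var requiredEnv = []string{
	"LOBBY_REDIS_MASTER_ADDR",
	"LOBBY_REDIS_PASSWORD",
	"LOBBY_POSTGRES_PRIMARY_DSN",
	"LOBBY_USER_SERVICE_BASE_URL",
	"LOBBY_GM_BASE_URL",
}

// missingRequired returns the required variables that are unset or empty.
// The lookup function is injected so the check can run without touching
// the real process environment.
func missingRequired(getenv func(string) string) []string {
	var missing []string
	for _, key := range requiredEnv {
		if getenv(key) == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	if missing := missingRequired(os.Getenv); len(missing) > 0 {
		fmt.Println("missing required configuration:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("required configuration present")
}
```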
## Runtime Notes
- `Game Lobby` owns platform game state. Game Master may cache snapshots but
is not the source of truth.
- The Race Name Directory ships a PostgreSQL adapter (default after
PG_PLAN.md §6B) and an in-process stub. The stub is intended for unit
tests and is selected via `LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- A `permanent_block` or `deleted` event from User Service fans out
asynchronously through the `user:lifecycle_events` consumer; in-flight
games owned by the affected user receive a stop-job and transition to
`cancelled` via the `external_block` trigger.
- `notification:intents` publishes are best-effort: a failed publish is
logged and counted but does not roll back the committed business state.
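
The best-effort publish rule in the last note can be illustrated with a wrapper that swallows producer-side errors after counting them. Everything named here is hypothetical: `bestEffortPublisher`, the `send` function standing in for the Redis stream write, the `game` type, and the `example.intent` payload. The atomic counter is a stand-in for the real metric instrument.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// errStreamDown stands in for a failed write to the intents stream.
var errStreamDown = errors.New("stream unavailable")

// bestEffortPublisher wraps a transport and never returns its errors.
type bestEffortPublisher struct {
	send     func(intent string) error // hypothetical transport
	failures atomic.Int64              // stand-in for a metrics counter
}

// Publish attempts the send; a failure is logged and counted only.
func (p *bestEffortPublisher) Publish(intent string) {
	if err := p.send(intent); err != nil {
		p.failures.Add(1)
		log.Printf("notification publish failed: intent=%s err=%v", intent, err)
	}
}

// game is a minimal stand-in for committed business state.
type game struct{ status string }

// pauseGame commits the state change first, then notifies. Because Publish
// cannot fail from the caller's point of view, there is nothing to roll back.
func pauseGame(g *game, p *bestEffortPublisher) {
	g.status = "paused"
	p.Publish("example.intent")
}

func main() {
	p := &bestEffortPublisher{send: func(string) error { return errStreamDown }}
	g := &game{status: "running"}
	pauseGame(g, p)
	fmt.Println(g.status, p.failures.Load())
}
```

The design trades notification delivery guarantees for business-state consistency: a lost notification degrades the user experience, while a rolled-back transition would corrupt game state.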