# Runtime and Components
The diagram below focuses on the deployed `galaxy/lobby` process and its
runtime dependencies.
```mermaid
flowchart LR
subgraph Clients
Gateway["Edge Gateway"]
Admin["Admin Service"]
GM["Game Master"]
end
subgraph Lobby["Game Lobby process"]
PublicHTTP["Public HTTP listener\n:8094 /healthz /readyz"]
InternalHTTP["Internal HTTP listener\n:8095 /healthz /readyz"]
EnrollAuto["Enrollment automation worker"]
RTJobsConsumer["runtime:job_results consumer"]
GMEventsConsumer["gm:lobby_events consumer"]
PendingExpirer["Pending registration expirer"]
ULConsumer["user:lifecycle_events consumer"]
IntentPublisher["notification:intents publisher"]
Telemetry["Logs, traces, metrics"]
end
User["User Service"]
Redis["Redis\nKV + Streams"]
Gateway --> PublicHTTP
Admin --> InternalHTTP
GM --> InternalHTTP
PublicHTTP --> User
InternalHTTP --> User
PublicHTTP -. register-runtime .-> GM
InternalHTTP -. register-runtime .-> GM
EnrollAuto --> Redis
RTJobsConsumer --> Redis
GMEventsConsumer --> Redis
PendingExpirer --> Redis
ULConsumer --> Redis
IntentPublisher --> Redis
PublicHTTP --> Redis
InternalHTTP --> Redis
PublicHTTP --> Telemetry
InternalHTTP --> Telemetry
EnrollAuto --> Telemetry
RTJobsConsumer --> Telemetry
GMEventsConsumer --> Telemetry
PendingExpirer --> Telemetry
ULConsumer --> Telemetry
```
Notes:
- `cmd/lobby` refuses startup when Redis connectivity is misconfigured. User
Service and Game Master reachability are not verified at boot; transport
failures surface as request errors.
- Both HTTP listeners expose `/healthz` and `/readyz` independently so health
checks can target either port.
- `register-runtime` is an outgoing call from Lobby to Game Master after the
container start completes. Lobby does not expose an inbound endpoint of the
same name.
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Public HTTP | `:8094` | Authenticated user routes; gateway-facing |
| Internal HTTP | `:8095` | Admin-mirrored routes + Game Master read paths |

Shared listener defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`

Public-port routes carry an `X-User-ID` header injected by Edge Gateway;
internal-port routes admit the admin actor without the header.

Probe routes:
- `GET /healthz` returns `{"status":"ok"}`.
- `GET /readyz` returns `{"status":"ready"}` once startup wiring completes.
- Neither probe performs a live Redis ping per request.
- There is no `/metrics` route. Metrics flow through OpenTelemetry exporters.
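
The listener defaults and probe routes above can be sketched with the standard library. This is a minimal illustration, not the Lobby implementation: `newServer`, `newProbeMux`, `writeStatus`, and `probe` are hypothetical names, and the `503`/`not_ready` answer before startup completes is an assumption, since this document does not specify the pre-ready response.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// newServer applies the shared listener defaults listed in this section.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       10 * time.Second,
		IdleTimeout:       time.Minute,
	}
}

// newProbeMux registers the two probe routes. ready reports whether startup
// wiring has completed; neither handler touches Redis.
func newProbeMux(ready func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		writeStatus(w, r, http.StatusOK, "ok")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready() {
			// Assumption: the pre-ready response is not specified in this document.
			writeStatus(w, r, http.StatusServiceUnavailable, "not_ready")
			return
		}
		writeStatus(w, r, http.StatusOK, "ready")
	})
	return mux
}

// writeStatus answers GET requests with a one-field JSON body.
func writeStatus(w http.ResponseWriter, r *http.Request, code int, status string) {
	if r.Method != http.MethodGet {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	io.WriteString(w, `{"status":"`+status+`"}`)
}

// probe exercises a handler in memory, without binding a port.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	mux := newProbeMux(func() bool { return true })
	public := newServer(":8094", mux)
	internal := newServer(":8095", mux)
	// A real process would call ListenAndServe on each server in its own goroutine.
	fmt.Println(public.Addr, internal.Addr, public.ReadHeaderTimeout)
	fmt.Println(probe(mux, "/healthz"))
	fmt.Println(probe(mux, "/readyz"))
}
```

Passing readiness as a function keeps the handlers free of any per-request dependency check, which matches the note that neither probe pings Redis.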
## Background Workers
| Worker | Trigger | Function |
| --- | --- | --- |
| Enrollment automation | Periodic tick (`LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`) | Closes enrollment when the deadline passes or the gap window is exhausted. |
| `runtime:job_results` consumer | Redis `XREAD` | Drives `starting` to `running`/`paused`/`start_failed` based on Runtime Manager outcomes. |
| `gm:lobby_events` consumer | Redis `XREAD` | Applies runtime snapshot updates and game-finish events from Game Master; hands `game_finished` events off to capability evaluation. |
| Pending registration expirer | Periodic tick (`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`) | Releases `pending_registration` entries past their 30-day window. |
| `user:lifecycle_events` consumer | Redis `XREAD` | Fans out the cascade for `permanent_blocked` and `deleted` user events (Race Name Directory release, membership block, application/invite cancel, owned-game cancel). |
| `notification:intents` publisher | Synchronous from services | Wraps every notification publish with metric instrumentation; producer-side failures degrade notifications without rolling back business state. |
## Synchronous Upstream Clients
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `User Service` eligibility | `POST {LOBBY_USER_SERVICE_BASE_URL}/api/v1/internal/users/{user_id}/lobby-eligibility` | Network or non-2xx → `503 service_unavailable`; `permanent_block` → `404 subject_not_found`. |
| `Game Master` register-runtime | `POST {LOBBY_GM_BASE_URL}/api/v1/internal/games/{game_id}/register-runtime` | Network or non-2xx → forced-pause path (`paused` + `lobby.runtime_paused_after_start`). |
| `Game Master` liveness probe | `GET {LOBBY_GM_BASE_URL}/api/v1/internal/healthz` | Used during `lobby.game.resume`; failure surfaces as `503 service_unavailable`. |
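
The eligibility row of the failure mapping can be illustrated as a small pure function. This is a sketch under stated assumptions: `lobbyError`, `eligibility`, and `mapEligibility` are hypothetical names, and the idea that `permanent_block` arrives as a flag in a successful response body is a guess about the User Service contract, not something this document states.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// lobbyError is a hypothetical error shape: HTTP status plus a stable code.
type lobbyError struct {
	Status int
	Code   string
}

func (e *lobbyError) Error() string { return fmt.Sprintf("%d %s", e.Status, e.Code) }

// eligibility is a hypothetical, trimmed view of the User Service response.
// Assumption: permanent_block is reported as a flag in a 2xx body.
type eligibility struct {
	PermanentBlock bool
}

// mapEligibility turns the outcome of the eligibility call into a Lobby error.
// callErr is the transport error, if any; status is the upstream HTTP status.
func mapEligibility(callErr error, status int, body eligibility) error {
	switch {
	case callErr != nil, status < 200 || status > 299:
		// Network failure or non-2xx answer.
		return &lobbyError{http.StatusServiceUnavailable, "service_unavailable"}
	case body.PermanentBlock:
		return &lobbyError{http.StatusNotFound, "subject_not_found"}
	default:
		return nil
	}
}

func main() {
	fmt.Println(mapEligibility(errors.New("dial tcp: i/o timeout"), 0, eligibility{}))
	fmt.Println(mapEligibility(nil, 500, eligibility{}))
	fmt.Println(mapEligibility(nil, 200, eligibility{PermanentBlock: true}))
	fmt.Println(mapEligibility(nil, 200, eligibility{}))
}
```

Checking the transport outcome before the body keeps the `503` path independent of whatever payload a failing upstream happens to return.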
## Stream Offsets
Each consumer persists its position under a dedicated key so process restart
preserves stream progress.
| Stream | Offset key | Read block timeout env |
| --- | --- | --- |
| `gm:lobby_events` | `lobby:stream_offsets:gm_events` | `LOBBY_GM_EVENTS_READ_BLOCK_TIMEOUT` |
| `runtime:job_results` | `lobby:stream_offsets:runtime_results` | `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` |
| `user:lifecycle_events` | `lobby:stream_offsets:user_lifecycle` | `LOBBY_USER_LIFECYCLE_READ_BLOCK_TIMEOUT` |

Stream lag is exposed through observable gauges
`lobby.gm_events.oldest_unprocessed_age_ms`,
`lobby.runtime_results.oldest_unprocessed_age_ms`, and
`lobby.user_lifecycle.oldest_unprocessed_age_ms`. The probe samples the
oldest entry whose ID is greater than the persisted offset; when a consumer
lags or stalls, the gauge climbs and stays high.
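
One plausible way to compute such a gauge is sketched below. It relies on the fact that auto-generated Redis stream IDs have the form `<milliseconds>-<sequence>`, so the age of an entry can be read from its ID. Everything else is an assumption: `parseStreamID` and `oldestUnprocessedAgeMs` are hypothetical names, the slice of IDs stands in for a Redis range query, and reporting `0` for a caught-up consumer is a choice made for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// streamID is a parsed Redis stream entry ID of the form "<ms>-<seq>".
type streamID struct {
	ms  int64
	seq uint64
}

func parseStreamID(s string) (streamID, error) {
	msPart, seqPart, ok := strings.Cut(s, "-")
	if !ok {
		return streamID{}, fmt.Errorf("stream id %q: missing '-'", s)
	}
	ms, err := strconv.ParseInt(msPart, 10, 64)
	if err != nil {
		return streamID{}, fmt.Errorf("stream id %q: %w", s, err)
	}
	seq, err := strconv.ParseUint(seqPart, 10, 64)
	if err != nil {
		return streamID{}, fmt.Errorf("stream id %q: %w", s, err)
	}
	return streamID{ms: ms, seq: seq}, nil
}

// after reports whether a sorts strictly after b in stream order.
func (a streamID) after(b streamID) bool {
	return a.ms > b.ms || (a.ms == b.ms && a.seq > b.seq)
}

// oldestUnprocessedAgeMs scans ids (assumed to be in stream order) for the
// first entry after the persisted offset and returns its age in milliseconds.
// It returns 0 when no such entry exists (assumption for this sketch).
// Callers would typically pass time.Now().UnixMilli() as nowMs.
func oldestUnprocessedAgeMs(ids []string, offset string, nowMs int64) (int64, error) {
	off, err := parseStreamID(offset)
	if err != nil {
		return 0, err
	}
	for _, raw := range ids {
		id, err := parseStreamID(raw)
		if err != nil {
			return 0, err
		}
		if id.after(off) {
			return nowMs - id.ms, nil
		}
	}
	return 0, nil
}

func main() {
	ids := []string{"1700000000000-0", "1700000004000-0", "1700000009000-1"}
	age, err := oldestUnprocessedAgeMs(ids, "1700000000000-0", 1700000010000)
	fmt.Println(age, err) // 6000 <nil>
}
```

Because the age is measured from the oldest unprocessed entry rather than the newest, a stalled consumer keeps the value growing even when no new entries arrive.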
## Configuration Groups
The full env-var list with defaults lives in `../README.md` §Configuration.
The groups below summarize the structure:
- **Required** — `LOBBY_REDIS_ADDR`, `LOBBY_USER_SERVICE_BASE_URL`,
`LOBBY_GM_BASE_URL`.
- **Process and logging** — `LOBBY_SHUTDOWN_TIMEOUT`, `LOBBY_LOG_LEVEL`.
- **HTTP listeners** — `LOBBY_PUBLIC_HTTP_*`, `LOBBY_INTERNAL_HTTP_*`.
- **Redis connectivity** — `LOBBY_REDIS_USERNAME`, `LOBBY_REDIS_PASSWORD`,
`LOBBY_REDIS_DB`, `LOBBY_REDIS_TLS_ENABLED`,
`LOBBY_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `LOBBY_GM_EVENTS_STREAM`, `LOBBY_RUNTIME_START_JOBS_STREAM`,
`LOBBY_RUNTIME_STOP_JOBS_STREAM`, `LOBBY_RUNTIME_JOB_RESULTS_STREAM`,
`LOBBY_NOTIFICATION_INTENTS_STREAM`, `LOBBY_USER_LIFECYCLE_STREAM`.
- **Upstream clients** — `LOBBY_USER_SERVICE_TIMEOUT`, `LOBBY_GM_TIMEOUT`.
- **Workers** — `LOBBY_ENROLLMENT_AUTOMATION_INTERVAL`,
`LOBBY_RACE_NAME_EXPIRATION_INTERVAL`,
`LOBBY_RACE_NAME_DIRECTORY_BACKEND`.
- **Telemetry** — standard `OTEL_*` plus
`LOBBY_OTEL_STDOUT_TRACES_ENABLED`,
`LOBBY_OTEL_STDOUT_METRICS_ENABLED`.
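
The **Required** group lends itself to a fail-fast check at startup. The sketch below is illustrative only: `requiredVars` and `loadRequired` are hypothetical names, and the error wording is invented for this example. Injecting the lookup function keeps the check testable without touching the process environment.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredVars lists the variables this document marks as required.
var requiredVars = []string{
	"LOBBY_REDIS_ADDR",
	"LOBBY_USER_SERVICE_BASE_URL",
	"LOBBY_GM_BASE_URL",
}

// loadRequired reads every required variable through getenv and reports all
// missing names in a single error, so a misconfigured deploy fails once.
func loadRequired(getenv func(string) string) (map[string]string, error) {
	values := make(map[string]string, len(requiredVars))
	var missing []string
	for _, name := range requiredVars {
		v := strings.TrimSpace(getenv(name))
		if v == "" {
			missing = append(missing, name)
			continue
		}
		values[name] = v
	}
	if len(missing) > 0 {
		return nil, fmt.Errorf("missing required env vars: %s", strings.Join(missing, ", "))
	}
	return values, nil
}

func main() {
	// A fake environment keeps the example deterministic.
	env := map[string]string{"LOBBY_REDIS_ADDR": "localhost:6379"}
	_, err := loadRequired(func(k string) string { return env[k] })
	fmt.Println(err)
}
```

In a real entry point the lookup would be `os.Getenv`, and a non-nil error would abort startup before any listener or worker is created.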
## Runtime Notes
- `Game Lobby` owns platform game state. Game Master may cache snapshots but
is not the source of truth.
- The Race Name Directory ships a Redis adapter and an in-process stub; the
stub is intended for unit tests and is selected via
`LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- A `permanent_block` or `deleted` event from User Service fans out
asynchronously through the `user:lifecycle_events` consumer; in-flight
games owned by the affected user receive a stop-job and transition to
`cancelled` via the `external_block` trigger.
- `notification:intents` publishes are best-effort: a failed publish is
logged and counted but does not roll back the committed business state.
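
The best-effort publish rule in the last note can be sketched as a wrapper that swallows producer errors after logging and counting them. All names here (`intentPublisher`, `bestEffort`, `failingPublisher`, `pauseGame`) are hypothetical, the in-memory map stands in for the real game store, and treating `lobby.runtime_paused_after_start` as the published intent type is an assumption based on the forced-pause path described earlier.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync/atomic"
)

// intentPublisher is a hypothetical port over the notification:intents stream.
type intentPublisher interface {
	Publish(intent string) error
}

// bestEffort wraps a publisher: failures are logged and counted, never returned.
type bestEffort struct {
	next     intentPublisher
	failures atomic.Int64 // stand-in for a real metric counter
}

func (b *bestEffort) Publish(intent string) {
	if err := b.next.Publish(intent); err != nil {
		b.failures.Add(1)
		log.Printf("notification intent %q not published: %v", intent, err)
	}
}

// failingPublisher simulates a producer-side outage.
type failingPublisher struct{}

func (failingPublisher) Publish(string) error {
	return errors.New("redis: connection refused")
}

// pauseGame commits business state first, then publishes best-effort.
// The state change is never rolled back when the publish fails.
func pauseGame(state map[string]string, gameID string, pub *bestEffort) {
	state[gameID] = "paused"
	pub.Publish("lobby.runtime_paused_after_start")
}

func main() {
	state := map[string]string{"game-1": "starting"}
	pub := &bestEffort{next: failingPublisher{}}
	pauseGame(state, "game-1", pub)
	fmt.Println(state["game-1"], pub.failures.Load()) // paused 1
}
```

Because the wrapper's `Publish` has no error return, callers cannot accidentally tie business-state rollback to notification delivery.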