feat: gamemaster

Ilia Denisov
2026-05-03 07:59:03 +02:00
committed by GitHub
parent a7cee15115
commit 3e2622757e
229 changed files with 41521 additions and 1098 deletions
+76 -66
@@ -150,7 +150,9 @@ The service starts two HTTP listeners and one Redis Stream consumer pipeline.
- `User Service` reachable at `LOBBY_USER_SERVICE_BASE_URL` (startup check only;
runtime failures are surfaced as request errors, not boot failures)
- `Game Master` at `LOBBY_GM_BASE_URL` (same policy — startup check omitted;
unreachability at registration triggers the forced-pause path)
unreachability at image-ref resolve fails `lobby.game.start` with
`service_unavailable`, unreachability at register-runtime triggers the
forced-pause path)
### Probes
@@ -714,27 +716,55 @@ sequenceDiagram
Admin->>Lobby: lobby.game.start
Lobby->>Lobby: validate ready_to_start + roster
Lobby->>Lobby: status → starting
Lobby->>Redis: publish start job to runtime:start_jobs
Runtime->>Runtime: start container
Runtime->>Redis: publish result to runtime:job_results
Lobby->>GM: GET /internal/engine-versions/{version}/image-ref (sync)
alt GM image-ref resolve failed
GM-->>Lobby: error / timeout / not found
Lobby-->>Admin: service_unavailable (GM unreachable) or engine_version_not_found
else image_ref resolved
GM-->>Lobby: 200 OK { image_ref }
Lobby->>Lobby: status → starting
Lobby->>Redis: publish start job to runtime:start_jobs (with image_ref)
Runtime->>Runtime: start container
Runtime->>Redis: publish result to runtime:job_results
alt container start failed
Lobby->>Lobby: status → start_failed
else container started
Lobby->>Lobby: persist runtime binding
Lobby->>GM: POST /internal/games/{game_id}/register (sync)
alt GM registration success
GM-->>Lobby: 200 OK
Lobby->>Lobby: status → running; set started_at
else GM unavailable
GM-->>Lobby: error / timeout
Lobby->>Lobby: status → paused
Lobby->>Redis: publish lobby.runtime_paused_after_start intent
alt container start failed
Lobby->>Lobby: status → start_failed
else container started
Lobby->>Lobby: persist runtime binding
Lobby->>GM: POST /internal/games/{game_id}/register-runtime (sync)
alt GM registration success
GM-->>Lobby: 200 OK
Lobby->>Lobby: status → running; set started_at
else GM unavailable
GM-->>Lobby: error / timeout
Lobby->>Lobby: status → paused
Lobby->>Redis: publish lobby.runtime_paused_after_start intent
end
end
end
```
### Image-ref resolution (synchronous via Game Master)
Before publishing the start job, `Lobby` resolves the Docker `image_ref`
for `target_engine_version` by calling
`GET /api/v1/internal/engine-versions/{version}/image-ref` on `Game Master`'s
internal port. The call is synchronous and runs while the game is still
in `ready_to_start`:
- success ⇒ `Lobby` proceeds to `starting`, embeds the resolved
`image_ref` into the `runtime:start_jobs` envelope, and publishes;
- the version is missing or deprecated on GM (`engine_version_not_found`)
⇒ `lobby.game.start` returns `engine_version_not_found`; the game stays
in `ready_to_start`;
- GM is unreachable (network error, timeout, `5xx`) ⇒ `lobby.game.start`
returns `service_unavailable`; the game stays in `ready_to_start` and
the operator can retry.
Resolving against GM is the v1 contract; the legacy
`LOBBY_ENGINE_IMAGE_TEMPLATE` Go-template variable is retired together
with the inline `engineimage.Resolver`.
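The three outcomes above can be sketched as a classification of the Game Master response. This is a minimal illustration, not the Lobby implementation: `resolveOutcome`, `classifyImageRefResponse`, and `errGMTimeout` are hypothetical names, and the sketch assumes that GM reports a missing or deprecated version as HTTP `404` and that any other non-`200` status is treated like an unreachable GM. The README does not specify either detail.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Error codes named in the README.
const (
	codeEngineVersionNotFound = "engine_version_not_found"
	codeServiceUnavailable    = "service_unavailable"
)

// errGMTimeout is a hypothetical stand-in for a network error or timeout.
var errGMTimeout = errors.New("gm: request timed out")

// resolveOutcome is a hypothetical type describing what lobby.game.start does next.
type resolveOutcome struct {
	Proceed   bool   // true: move to `starting` and publish the start job
	ErrorCode string // set when Proceed is false; the game stays in `ready_to_start`
}

// classifyImageRefResponse maps the result of the image-ref call to an outcome.
// transportErr models a network error or timeout; status is the HTTP status
// of the response when one was received.
func classifyImageRefResponse(status int, transportErr error) resolveOutcome {
	switch {
	case transportErr != nil:
		return resolveOutcome{ErrorCode: codeServiceUnavailable}
	case status == http.StatusOK:
		return resolveOutcome{Proceed: true}
	case status == http.StatusNotFound:
		// ASSUMPTION: 404 is how GM signals a missing or deprecated version.
		return resolveOutcome{ErrorCode: codeEngineVersionNotFound}
	default:
		// 5xx is unavailable per the README; treating every other unexpected
		// status the same way is an assumption of this sketch.
		return resolveOutcome{ErrorCode: codeServiceUnavailable}
	}
}

func main() {
	fmt.Printf("%+v\n", classifyImageRefResponse(http.StatusOK, nil))
	fmt.Printf("%+v\n", classifyImageRefResponse(http.StatusNotFound, nil))
	fmt.Printf("%+v\n", classifyImageRefResponse(0, errGMTimeout))
}
```

In both failure branches the caller would return the error code without touching the game status, which is what keeps the game in `ready_to_start` and makes a retry safe.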
### Critical invariants
- If the container starts but `Lobby` cannot persist the runtime binding metadata,
@@ -743,6 +773,10 @@ sequenceDiagram
- If metadata is persisted but `Game Master` is unavailable, the game must be
placed in `paused`, not in `start_failed`. The container is alive; only the
platform tracking is incomplete.
- If `Game Master` is unavailable at image-ref resolve time, the start
command itself fails with `service_unavailable`. The game stays in
`ready_to_start`; no container is created and no `runtime:start_jobs`
envelope is published.
- No start job is accepted while the game is not in `ready_to_start`.
- Concurrent start attempts for the same game must be serialized; the second
attempt must fail if the first already moved the game to `starting`.
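The last two invariants amount to a conditional status transition: a start is accepted only when the stored status is `ready_to_start`, and the check and the write happen atomically. The sketch below uses an in-memory map guarded by a mutex as a stand-in; `gameStatuses`, `beginStart`, and `errNotReadyToStart` are hypothetical names, and a real store would need an equivalent atomic conditional update.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotReadyToStart is a hypothetical error for a rejected start attempt.
var errNotReadyToStart = errors.New("game is not in ready_to_start")

// gameStatuses is an in-memory stand-in for the game status store.
type gameStatuses struct {
	mu     sync.Mutex
	status map[string]string
}

// beginStart moves a game from ready_to_start to starting. The check and the
// write happen under one lock, so only one concurrent caller can succeed.
func (s *gameStatuses) beginStart(gameID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != "ready_to_start" {
		return errNotReadyToStart
	}
	s.status[gameID] = "starting"
	return nil
}

func main() {
	store := &gameStatuses{status: map[string]string{"g1": "ready_to_start"}}

	// Two concurrent start attempts for the same game.
	results := make([]error, 2)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = store.beginStart("g1")
		}(i)
	}
	wg.Wait()

	// Exactly one attempt wins; which one is not deterministic.
	fmt.Println(results)
}
```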
@@ -758,7 +792,7 @@ is no synchronous Lobby→RTM REST call in v1 or planned for v2.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference resolved from `target_engine_version` via `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `image_ref` | string | Docker reference resolved synchronously from `target_engine_version` against `Game Master`'s engine version registry; see §Game Start Flow. |
| `requested_at_ms` | int64 | UTC milliseconds; diagnostics only. |
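The envelope fields above map naturally onto a small Go struct. This is a sketch only: the `startJob` type, its helpers, and the choice of JSON as the wire encoding are assumptions, since the README does not state how the envelope is serialized onto the stream. The image reference in `main` is an illustrative value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// startJob mirrors the three documented fields of the runtime:start_jobs envelope.
type startJob struct {
	GameID        string `json:"game_id"`
	ImageRef      string `json:"image_ref"`       // resolved via Game Master before publishing
	RequestedAtMs int64  `json:"requested_at_ms"` // UTC milliseconds; diagnostics only
}

// newStartJob stamps the envelope with the current time in Unix milliseconds.
func newStartJob(gameID, imageRef string) startJob {
	return startJob{
		GameID:        gameID,
		ImageRef:      imageRef,
		RequestedAtMs: time.Now().UnixMilli(),
	}
}

// encode serializes the envelope as JSON (an assumed encoding).
func (j startJob) encode() (string, error) {
	b, err := json.Marshal(j)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	job := newStartJob("game-1", "galaxy/game:1.2.3")
	payload, err := job.encode()
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(payload)
}
```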
`runtime:stop_jobs` envelope:
@@ -803,40 +837,6 @@ Alternatives considered and rejected:
outside that package and would have to depend on a concrete adapter
for an enum value.
### Design rationale: `engineimage.Resolver` validates the template at construction
`engineimage.Resolver` stores the validated template; the per-game
`Resolve(version)` call is therefore a pure string substitution that
cannot fail except on an empty `version`.
`LOBBY_ENGINE_IMAGE_TEMPLATE` is loaded at startup. A malformed value
(missing `{engine_version}` placeholder, empty string) is an
operational misconfiguration that fails fast before any traffic arrives
— not on the first start-game request hours later. The synchronous
start handler then incurs no per-call template-shape recheck.
A stateless free function `engineimage.Resolve(template, version)` was
rejected: the only useful checkpoint for the template literal is at
startup; a free function would either re-validate on every call (waste)
or skip validation (regression).
The resolver only guards against an empty/whitespace `version`. Semver
validation lives in `lobby/internal/domain/game/model.go:validateSemver`
and runs at game-record construction time. Re-running it inside the
resolver would either duplicate the rule (drift risk) or import the
validator across package boundaries for no behavioural gain. Keeping the
resolver narrow leaves it reusable from a future producer (for example
`Game Master`, when it takes over `image_ref` resolution) without
dragging Lobby's domain rules along.
The defensive `return start game: resolve image ref: %w` in
`startgame.Service.Handle` is a guard against a future invariant
violation; it is not exercised by the service-level test suite because
the only resolver-failure mode (empty `version`) requires bypassing
`game.Validate`, which `gameinmem.Save` always runs. Adding test
scaffolding to skip validation would teach the test suite a back door
that the production code path does not have.
## Paused State
`Lobby.paused` is a platform-level pause, distinct from `Game Master` runtime
@@ -904,11 +904,12 @@ game finish.
### Per-member stats aggregate
Each `runtime_snapshot_update` carries a `player_turn_stats` array with one
entry per active member: `{user_id, planets, population, ships_built}`.
entry per active member: `{user_id, planets, population}`.
`Lobby` aggregates these in `lobby:game_turn_stats:<game_id>:<user_id>` with
the shape
`{initial_planets, initial_population, initial_ships_built, max_planets,
max_population, max_ships_built}`.
`{initial_planets, initial_population, max_planets, max_population}`.
`ships_built` is not part of the contract; the capability rule reduces to
`planets` and `population` only.
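One way to read the aggregate shape is as a fold over successive snapshots. The sketch below assumes that the `initial_*` fields are captured from the first snapshot seen for a member and never overwritten, and that the `max_*` fields are running maxima; those semantics are inferred from the field names, not stated here. The type names, `applySnapshot`, and the `int64` field types are hypothetical.

```go
package main

import "fmt"

// playerTurnStats is one entry of the player_turn_stats array.
type playerTurnStats struct {
	UserID     string
	Planets    int64
	Population int64
}

// turnStatsAggregate is the per-member aggregate shape.
type turnStatsAggregate struct {
	InitialPlanets    int64
	InitialPopulation int64
	MaxPlanets        int64
	MaxPopulation     int64
}

// aggregateKey builds the documented key lobby:game_turn_stats:<game_id>:<user_id>.
func aggregateKey(gameID, userID string) string {
	return fmt.Sprintf("lobby:game_turn_stats:%s:%s", gameID, userID)
}

// applySnapshot folds one snapshot entry into the aggregate.
// prev is nil when no aggregate exists yet for the member.
func applySnapshot(prev *turnStatsAggregate, s playerTurnStats) turnStatsAggregate {
	if prev == nil {
		// ASSUMPTION: the first snapshot seeds both the initial and the max values.
		return turnStatsAggregate{
			InitialPlanets:    s.Planets,
			InitialPopulation: s.Population,
			MaxPlanets:        s.Planets,
			MaxPopulation:     s.Population,
		}
	}
	next := *prev
	if s.Planets > next.MaxPlanets {
		next.MaxPlanets = s.Planets
	}
	if s.Population > next.MaxPopulation {
		next.MaxPopulation = s.Population
	}
	return next
}

func main() {
	agg := applySnapshot(nil, playerTurnStats{UserID: "u1", Planets: 3, Population: 100})
	agg = applySnapshot(&agg, playerTurnStats{UserID: "u1", Planets: 5, Population: 80})
	fmt.Println(aggregateKey("g1", "u1"), agg)
}
```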
Rules:
@@ -1032,11 +1033,18 @@ Key internal endpoints:
| `GET` | `/api/v1/internal/healthz` | health probe |
| `GET` | `/api/v1/internal/readyz` | readiness probe |
Note: the registration call from Lobby to Game Master after a successful
container start is **outgoing** — Lobby calls
`POST /api/v1/internal/games/{game_id}/register-runtime` on Game Master's
internal port. Lobby does not expose an inbound `register-runtime`
endpoint.
Note: every synchronous Lobby→Game Master call is **outgoing** from
Lobby to Game Master's internal port at `LOBBY_GM_BASE_URL`. Lobby does
not expose an inbound `register-runtime` endpoint or any other
GM-facing endpoint:
| Call site | Method | Path on Game Master | Purpose |
| --- | --- | --- | --- |
| `startgame` (pre-publish) | `GET` | `/api/v1/internal/engine-versions/{version}/image-ref` | Resolve the Docker `image_ref` for `target_engine_version` synchronously before publishing `runtime:start_jobs`. Failure ⇒ `service_unavailable` or `engine_version_not_found`; the game stays in `ready_to_start`. |
| `startgame` (post-container-up) | `POST` | `/api/v1/internal/games/{game_id}/register-runtime` | Register the runtime after a successful container start. Failure ⇒ forced `paused` (see §Paused State). |
| `approveapplication`, `rejectapplication`, `redeeminvite`, `removemember`, `blockmember`, user-lifecycle cascade | `POST` | `/api/v1/internal/games/{game_id}/memberships/invalidate` | Tell GM to drop its in-process membership cache for the game after a roster mutation. The call runs **post-commit** and is fail-open: a non-2xx response is logged and metered but never rolls back the Lobby commit. GM's cache TTL is the safety net: stale data expires within the next cache TTL window. |
| `removemember` (engine-side cleanup, post-commit) | `POST` | `/api/v1/internal/games/{game_id}/race/{race_name}/banish` | Ask GM to deactivate the engine-side player after a permanent removal. Fail-open in the same sense as the invalidate call. |
| `resumegame` | `GET` | `/api/v1/internal/games/{game_id}/liveness` | Check that GM has the runtime in `running` before transitioning the platform record from `paused` back to `running`. |
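The post-commit, fail-open behaviour of the `memberships/invalidate` call can be sketched as follows. The names `commitThenInvalidate`, `invalidator`, `errGMDown`, and the plain failure counter are hypothetical; the counter stands in for whatever metric Lobby actually records.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errGMDown is a hypothetical error for a failed or non-2xx GM call.
var errGMDown = errors.New("gm: invalidate returned non-2xx")

// invalidator is a hypothetical port for
// POST /api/v1/internal/games/{game_id}/memberships/invalidate.
type invalidator func(gameID string) error

// commitThenInvalidate commits the roster mutation first and only then
// notifies GM. A failed notification is logged and counted, never returned.
func commitThenInvalidate(gameID string, commit func() error, invalidate invalidator, failures *int) error {
	if err := commit(); err != nil {
		// Nothing was committed, so GM has nothing to invalidate.
		return fmt.Errorf("commit roster change: %w", err)
	}
	if err := invalidate(gameID); err != nil {
		// Fail-open: the commit stands; GM's cache TTL bounds the staleness.
		log.Printf("membership invalidate failed for game %s: %v", gameID, err)
		*failures++
	}
	return nil
}

func main() {
	failures := 0
	err := commitThenInvalidate(
		"game-1",
		func() error { return nil },          // commit succeeds
		func(string) error { return errGMDown }, // GM call fails
		&failures,
	)
	fmt.Println("result:", err, "invalidate failures:", failures)
}
```

Ordering the GM call after the commit means a GM outage can delay cache freshness but cannot block or undo a roster change.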
Admin-only operations (approve, reject, cancel, create public games, etc.) are
also exposed on the internal port and are intended to be called by `Admin Service`
@@ -1158,6 +1166,9 @@ Stable error codes:
`permanent_block` sanction
- `forbidden` — caller is not authorized for this operation on this game or
this race name
- `engine_version_not_found` — `target_engine_version` is missing or
deprecated on `Game Master`'s engine version registry (returned by
`lobby.game.start` at image-ref resolve time)
- `internal_error` — unexpected service error
- `service_unavailable` — upstream dependency unavailable
@@ -1227,13 +1238,12 @@ Stream names:
- `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` with default `2s`
- `LOBBY_NOTIFICATION_INTENTS_STREAM` with default `notification:intents`
Runtime Manager integration:
Game Master image-ref resolver:
- `LOBBY_ENGINE_IMAGE_TEMPLATE` with default `galaxy/game:{engine_version}` —
Go-style template applied to a game's `target_engine_version` to resolve
the Docker `image_ref` published on `runtime:start_jobs`. The template
must contain the literal placeholder `{engine_version}`; Lobby fails
fast at startup otherwise.
- `image_ref` is resolved synchronously by `Game Master` from
`target_engine_version` over its engine version registry; see
§Game Start Flow. The legacy `LOBBY_ENGINE_IMAGE_TEMPLATE` Go-template
variable is retired and rejected at startup if set.
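A startup guard for the retired variable could look like the sketch below. `checkRetiredEnv` is a hypothetical helper, not the Lobby config loader, and it assumes that any set value (including an empty one) counts as "set"; the lookup function is injected so the check can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// legacyImageTemplateEnv is the retired variable named in the README.
const legacyImageTemplateEnv = "LOBBY_ENGINE_IMAGE_TEMPLATE"

// checkRetiredEnv returns an error when the retired variable is present.
// lookup has the same signature as os.LookupEnv.
func checkRetiredEnv(lookup func(string) (string, bool)) error {
	if _, set := lookup(legacyImageTemplateEnv); set {
		return fmt.Errorf("%s is retired: image_ref is resolved by Game Master; unset it",
			legacyImageTemplateEnv)
	}
	return nil
}

func main() {
	// Fail fast before any listener or consumer starts.
	if err := checkRetiredEnv(os.LookupEnv); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("config ok")
}
```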
Upstream clients: