# Game Master
`Game Master` (GM) is the only Galaxy platform service permitted to talk to
running game engine containers. It owns runtime and operational state of
already-running games, the engine version registry, the platform mapping of
`(user_id ↔ race_name ↔ engine_player_uuid)`, the per-game turn scheduler,
and the synchronous and asynchronous boundaries that other services use to
interact with running games.
## References
- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §8 Game
Master.
- [`../TESTING.md`](../TESTING.md) §8 — testing matrix for GM.
- [`./PLAN.md`](./PLAN.md) — staged implementation plan.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry
point (created at PLAN stage 24).
- [`./docs/stage06-contract-files.md`](./docs/stage06-contract-files.md) —
decisions behind the OpenAPI and AsyncAPI specs frozen at PLAN stage 06.
- [`./docs/stage07-notification-catalog-audit.md`](./docs/stage07-notification-catalog-audit.md) —
notification catalog audit and producer-side freeze test added at PLAN stage 07.
- [`./docs/stage08-module-skeleton.md`](./docs/stage08-module-skeleton.md) —
module skeleton wiring decisions (config groups, telemetry instruments,
Makefile targets, deferred dependencies) recorded at PLAN stage 08.
- [`./docs/stage09-postgres-migration.md`](./docs/stage09-postgres-migration.md) —
PostgreSQL schema, embedded migration, jet generation pipeline, and
runtime wiring landed at PLAN stage 09.
- [`./docs/stage10-domain-and-ports.md`](./docs/stage10-domain-and-ports.md) —
domain types, port interfaces, and the six stage-10 decisions
(operation domain package, membership DTO placement, engine-version
options shape, schedule wrapper signature, recovery transition,
deferred mock destination) landed at PLAN stage 10.
- [`./docs/stage11-persistence-adapters.md`](./docs/stage11-persistence-adapters.md) —
PostgreSQL stores (`runtimerecordstore`, `engineversionstore`,
`playermappingstore`, `operationlog`), the Redis offset store, and
the eight stage-11 decisions (sqlx/pgtest local clones, CAS
pattern, port-level Now extension, domain conflict sentinels, jsonb
cast, idempotent Deprecate, multi-row BulkInsert, miniredis
dependency) landed at PLAN stage 11.
- [`./docs/stage12-external-clients.md`](./docs/stage12-external-clients.md) —
outbound adapters (engine, Lobby, Runtime Manager, notification
intent publisher, lobby-events publisher) and the seven stage-12
decisions (per-call engine base URL, dual engine timeout dispatch,
engine population rounding, Lobby pagination cap, no extra RTM
sentinels, AsyncAPI-aligned XADD encoding for `gm:lobby_events`,
Makefile mocks-target guard) landed at PLAN stage 12.
- [`./docs/stage13-register-runtime.md`](./docs/stage13-register-runtime.md) —
register-runtime service-layer orchestrator and the five
stage-13 decisions (`RuntimeRecordStore.Delete` extension, engine
4xx/5xx classification split, engine response validated as
`engine_protocol_violation`, initial snapshot carries `player_turn_stats`
from `/admin/init`, two-flag rollback gating) landed at PLAN
stage 13.
- [`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md) —
engine version registry service-layer orchestrator (List, Get,
Create, Update, Deprecate, Delete, ResolveImageRef) and the five
stage-14 decisions (`EngineVersionStore.Delete` port extension,
reference probe before hard delete, new `engine_version_delete`
op_kind in schema and domain, `operation_log.game_id` overloaded
as audit subject for registry entries, JSON-object validation for
`options`) landed at PLAN stage 14.
- [`./docs/stage15-scheduler-and-turn-generation.md`](./docs/stage15-scheduler-and-turn-generation.md) —
scheduler ticker, turn-generation orchestrator, and snapshot
publisher and the seven stage-15 decisions
(`LobbyClient.GetGameSummary` extension with fail-soft `game_name`
fallback, telemetry-only `Trigger` parameter, two-CAS pattern with
external-mutation conflict, single-snapshot-per-outcome cadence,
player_mappings as recipient source, stateless scheduler utility,
in-flight set on the ticker) landed at PLAN stage 15.
- [`./docs/stage16-membership-cache-and-invalidation.md`](./docs/stage16-membership-cache-and-invalidation.md) —
hot-path services (`commandexecute`, `orderput`, `reportget`),
membership cache, and the six stage-16 decisions (no
`runtime_not_running` for reports, GM-side envelope rewrite
`commands` → `cmd` with injected `actor`, hot-path skips
`operation_log`, hand-rolled per-game inflight tracker, raw status
string return, missing-mapping surfaces as `forbidden`) landed at
PLAN stage 16.
- [`./docs/stage17-admin-operations.md`](./docs/stage17-admin-operations.md) —
admin service-layer operations (`adminstop`, `adminforce`,
`adminpatch`, `adminbanish`, `livenessreply`) and the six
stage-17 decisions (`RuntimeRecordStore.UpdateImage` extension,
`adminstop` idempotent on terminal statuses and `conflict` on
`starting`, `adminforce` always sets `skip_next_tick`,
`adminbanish` without status check and missing race surfaces as
`forbidden`, `livenessreply` 200 + empty status on
`runtime_not_found`, RTM failures map to `service_unavailable`)
landed at PLAN stage 17.
- [`./docs/stage18-health-events-consumer.md`](./docs/stage18-health-events-consumer.md) —
`runtime:health_events` consumer worker and the seven stage-18
decisions (event-type taxonomy expanded to seven values with
`container_started` and `probe_recovered`, CAS-conflict fallback to
health-only update, new `RuntimeRecordStore.UpdateEngineHealth`
port method, in-memory dedupe of last-emitted summaries,
read-after-write snapshot construction, `health_events` stream
offset label, worker wiring deferred to Stage 19) landed at PLAN
stage 18.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — internal
trusted REST contract.
- [`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml) —
`gm:lobby_events` Redis Stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract
(env, ports, admin and player REST surfaces, `/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with GM.
- [`../rtmanager/README.md`](../rtmanager/README.md) — Runtime Manager
contract used synchronously by GM admin operations.
## Purpose
A running Galaxy game lives in exactly one Docker container managed by
`Runtime Manager`. The platform must:
- register a freshly started container with platform-level membership;
- initialise the engine with the agreed race roster;
- accept and forward player commands and orders to the engine;
- route per-player report reads;
- generate turns according to a schedule;
- detect game finish and propagate it back to platform-level state;
- expose runtime/operational controls (force-next-turn, stop, patch, banish);
- own the catalogue of supported engine versions and resolve `image_ref`
values for `Game Lobby`.
`Game Master` is the single component that performs these actions. It does
**not** own platform metadata of games (that is `Game Lobby`), Docker control
(that is `Runtime Manager`), or the full game state (that is the engine
container). Engine state on disk is the engine's domain; GM never reads or
writes the bind-mounted state directory.
## Scope
`Game Master` is the source of truth for:
- the runtime mapping `game_id → engine_endpoint` for every running game;
- the runtime status (`starting | running | generation_in_progress |
generation_failed | stopped | engine_unreachable | finished`);
- the current turn number and the next-tick timestamp;
- the per-game `(user_id, race_name, engine_player_uuid)` triple;
- the engine version registry: `(version, image_ref, options, status)`;
- the durable history of every operation GM performed (`operation_log`);
- the latest engine health summary per game.
`Game Master` is **not** the source of truth for:
- platform game records (created, draft, enrollment, finished metadata) —
owned by `Game Lobby`;
- container lifecycle and Docker reality — owned by `Runtime Manager`;
- in-game world state (planets, ships, science, reports) — owned by the
engine container;
- platform user identity and entitlements — owned by `User Service`;
- in-game `race_name` reservations and the Race Name Directory — owned by
`Game Lobby`.
## Non-Goals
- Multi-instance operation in v1. GM runs as a single process; the in-process
scheduler is authoritative. Multi-instance with leader election is an
explicit future iteration.
- Direct Docker access. GM never imports the Docker SDK; every container
operation goes through `Runtime Manager` over trusted internal REST.
- Player removal/block at platform level. `Game Lobby` owns that decision;
GM only performs the engine-side `banish` call when explicitly invoked.
- Pause/resume of a running game on the platform side. `Game Lobby.paused`
is a platform-only state; GM only answers a liveness probe used by
Lobby's resume flow.
- Automatic semver-patch upgrades. Patch is always an explicit admin
operation against a target engine version present in the registry.
- TLS or mTLS on the internal listener. GM trusts its network segment.
- Direct delivery of player-visible push events. `Notification Service`
owns user-targeted push delivery; GM publishes notification intents only.
- A separate Admin Service. GM exposes its trusted internal REST surface;
Admin Service will adopt it in a later iteration.
- Engine state file management. Backup, archival, and cleanup of the
bind-mounted state directories are operator concerns.
## Position in the System
```mermaid
flowchart LR
Gateway["Edge Gateway"]
Lobby["Game Lobby"]
Admin["Admin Service\n(future)"]
GM["Game Master"]
RTM["Runtime Manager"]
Notify["Notification Service"]
Engine["Game Engine container\n(galaxy/game)"]
Postgres["PostgreSQL\nschema gamemaster"]
Redis["Redis\nstreams + caches"]
Gateway -- "verified player commands\n(REST/JSON)" --> GM
Lobby -- "register-runtime,\nimage-ref resolve,\nmemberships invalidate" --> GM
Admin -- "internal REST" --> GM
GM -- "engine HTTP API" --> Engine
GM -- "stop / restart / patch" --> RTM
GM -- "notification:intents" --> Notify
GM -- "gm:lobby_events" --> Redis
Redis -- "runtime:health_events" --> GM
GM --> Postgres
```
`Edge Gateway` routes verified player message types (`game.command.execute`,
`game.order.put`, `game.report.get`) to GM as trusted REST/JSON after
transcoding from FlatBuffers. `Game Lobby` calls GM synchronously to
register runtimes after a successful container start, to resolve `image_ref`
from the engine version registry, to invalidate membership cache on roster
changes, and to verify GM liveness during platform resume. `Game Master`
calls `Runtime Manager` synchronously over REST for stop, restart, and
patch. `Runtime Manager` publishes `runtime:health_events`, which GM
consumes asynchronously. GM publishes `gm:lobby_events` consumed by
`Game Lobby`, and `notification:intents` consumed by `Notification Service`.
## Responsibility Boundaries
`Game Master` is responsible for:
- registering a freshly started container into platform-level runtime state;
- initialising the engine with the race roster received from Lobby;
- maintaining the platform mapping of `user_id`, `race_name`, and
`engine_player_uuid`;
- forwarding player commands, orders, and report reads to the engine after
authorising the actor;
- generating turns on schedule, including the force-next-turn skip rule;
- evaluating engine finish on every turn boundary;
- publishing runtime snapshot updates and the final game-finish event;
- consuming runtime health events from `Runtime Manager` and updating its
per-game health summary;
- exposing the engine version registry CRUD;
- driving admin-level runtime operations (stop, force-next-turn, patch,
banish) by calling `Runtime Manager` and the engine on demand.
`Game Master` is not responsible for:
- creating or stopping containers on Docker (that is `Runtime Manager`);
- evaluating whether a game is allowed to start (that is `Game Lobby`);
- deriving recipient user lists for non-game notifications (that is
`Notification Service`);
- verifying authenticated transport, signatures, freshness, and replay
(that is `Edge Gateway`);
- mapping `user_id` to platform-level membership (that is `Game Lobby`).
## Engine Container Contract
The engine container is `galaxy/game`. GM calls two route classes, admin and player; the probe route is listed for completeness:
| Class | Path | Purpose |
| --- | --- | --- |
| Admin (GM-only) | `POST /api/v1/admin/init` | Initialise the engine with a race roster. |
| Admin (GM-only) | `GET /api/v1/admin/status` | Read the full game state. |
| Admin (GM-only) | `PUT /api/v1/admin/turn` | Generate the next turn. |
| Admin (GM-only) | `POST /api/v1/admin/race/banish` | Deactivate a race after permanent platform removal. Body `{race_name}`. |
| Player | `PUT /api/v1/command` | Execute a batch of player commands. |
| Player | `PUT /api/v1/order` | Validate and store a batch of player orders. |
| Player | `GET /api/v1/report` | Fetch per-player turn report. |
| Probe | `GET /healthz` | Liveness probe used by `Runtime Manager` and operator tooling. |
Admin paths are unauthenticated but routed only from inside the trusted
network segment that connects GM to the engine container. The engine does
not enforce caller identity — network-level segmentation is the boundary.
`StateResponse` carries an extra boolean `finished` field. When `true` on a
turn-generation response, GM treats the game as finished and runs the
finish flow described below. The conditional logic that flips `finished`
to `true` lives in the engine's domain code and is not GM's concern.
The engine endpoint URL is the `engine_endpoint` value handed to GM by
`Game Lobby` during `register-runtime`: `http://galaxy-game-{game_id}:8080`.
The DNS name is stable across restart and patch.
## Runtime Surface
### Listeners
| Listener | Default address | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8097` (`GAMEMASTER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Edge Gateway`, `Game Lobby`, and `Admin Service`. |
There is no public listener. The internal listener is unauthenticated and
assumes a trusted network segment. Authentication of player commands has
already happened at `Edge Gateway`; GM enforces authorisation only.
### Background workers
| Worker | Driver | Description |
| --- | --- | --- |
| Scheduler ticker | 1 s loop | Scans `runtime_records` for due `next_generation_at`, runs the turn-generation service for each, recomputes `next_generation_at` from `turn_schedule` (skipping one tick when `skip_next_tick=true` is set). |
| `runtime:health_events` consumer | Redis Stream | XREADs from `runtime:health_events` (produced by RTM), updates `runtime_records.engine_health` summary, debounces `runtime_snapshot_update` publication. |
### Startup dependencies
In start order:
1. PostgreSQL primary (`GAMEMASTER_POSTGRES_PRIMARY_DSN`). Embedded goose
migrations apply synchronously before any listener opens.
2. Redis master (`GAMEMASTER_REDIS_MASTER_ADDR`).
3. Telemetry exporter (OTLP grpc/http or stdout).
4. Internal HTTP listener.
5. Health-events consumer worker.
6. Scheduler ticker worker.
A failure in any step exits the process with a non-zero status.
### Probes
`/healthz` reports liveness — the process responds when the HTTP server is
alive.
`/readyz` reports readiness — `200` only when the PostgreSQL pool can ping
the primary and the Redis client can ping the master. No deeper dependency is
checked synchronously; the engine is reached only on demand.
Both probes are documented in
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
## Lifecycles
### Register-runtime
**Triggered by:** `Game Lobby` after a successful container start, calling
`POST /api/v1/internal/games/{game_id}/register-runtime` with body
`{engine_endpoint, members:[{user_id, race_name}], target_engine_version,
turn_schedule}`.
**Flow on success:**
1. Validate request shape; reject with `invalid_request` if any required
field is missing.
2. Reject with `conflict` if `runtime_records.{game_id}` already exists.
3. Resolve `image_ref` for `target_engine_version` from `engine_versions`;
reject with `engine_version_not_found` when missing.
4. Persist `runtime_records` with `status=starting`, `engine_endpoint`,
`current_image_ref`, `current_engine_version`, `turn_schedule`, and
`created_at`.
5. Call engine `POST /api/v1/admin/init` with the race-name list derived
from `members`.
6. Read `StateResponse` and persist one `player_mappings` row per player:
`(game_id, user_id, race_name, engine_player_uuid)`.
7. CAS `runtime_records.status: starting → running`. Persist
`current_turn=0` and `next_generation_at` computed from `turn_schedule`.
8. Append `operation_log` entry (`op_kind=register_runtime`,
`outcome=success`).
9. Publish `runtime_snapshot_update` to `gm:lobby_events`.
10. Return `200` with the persisted `runtime_records` row.
**Failure paths:**
| Failure | Side effect | Outcome to caller |
| --- | --- | --- |
| Invalid envelope | None | `400 invalid_request` |
| `runtime_records` already exists | None | `409 conflict` |
| Engine `/admin/init` returns 4xx | Roll back `runtime_records`; append failure to `operation_log` | `502 engine_validation_error` |
| Engine `/admin/init` returns 5xx or fails at the transport layer | Roll back; append failure | `502 engine_unreachable` |
| Engine response missing players or contains races not in roster | Roll back; append failure | `502 engine_protocol_violation` |
| PostgreSQL transaction failure | Roll back; append failure if possible | `503 service_unavailable` |
A failed `register-runtime` leaves no `runtime_records` row and no
`player_mappings` rows. `Game Lobby` then transitions the platform game
record to `paused` (per the architecture's flow §4 forced-pause path).
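The engine-related rows of the failure table can be read as a classification function. The sketch below is illustrative only: `classifyInit`, `rosterMatches`, and `errTransport` are hypothetical names, and treating a duplicated race in the engine response as a protocol violation is an assumption.

```go
package main

import "errors"

// Error codes from the failure table above.
const (
	codeEngineValidation        = "engine_validation_error"
	codeEngineUnreachable       = "engine_unreachable"
	codeEngineProtocolViolation = "engine_protocol_violation"
)

// errTransport is a hypothetical stand-in for a dial or timeout error.
var errTransport = errors.New("engine transport failure")

// rosterMatches reports whether the engine returned exactly the races
// in the roster, each one once.
func rosterMatches(roster, returned []string) bool {
	if len(roster) != len(returned) {
		return false
	}
	want := make(map[string]bool, len(roster))
	for _, race := range roster {
		want[race] = true
	}
	for _, race := range returned {
		if !want[race] {
			return false
		}
		delete(want, race) // a second occurrence of the same race fails
	}
	return true
}

// classifyInit maps the outcome of the /admin/init call to an error
// code. An empty string means the flow may continue.
func classifyInit(status int, transportErr error, roster, returned []string) string {
	switch {
	case transportErr != nil || status >= 500:
		return codeEngineUnreachable
	case status >= 400:
		return codeEngineValidation
	case !rosterMatches(roster, returned):
		return codeEngineProtocolViolation
	default:
		return ""
	}
}
```

Every non-empty result corresponds to a `502` response and a rollback of the `runtime_records` row.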
### Turn generation
**Triggered by:** the scheduler ticker when `now >= next_generation_at`
for a game in `status=running`, or by an admin invocation of
`force-next-turn`.
**Flow on success:**
1. CAS `runtime_records.status: running → generation_in_progress`. If the
CAS fails (status changed concurrently), the tick is skipped silently.
2. Call engine `PUT /api/v1/admin/turn`. Engine returns `StateResponse`
with the new `turn` and the updated `player[]` array.
3. Persist `runtime_records.current_turn` and refresh
`runtime_records.engine_health` summary.
4. If `StateResponse.finished == true`:
   - CAS `runtime_records.status: generation_in_progress → finished`;
   - publish `game_finished` to `gm:lobby_events` with
     `{game_id, final_turn_number, finished_at_ms, player_turn_stats[]}`;
   - publish `game.finished` notification intent to all `active` members.
5. If `StateResponse.finished == false`:
   - CAS `runtime_records.status: generation_in_progress → running`;
   - recompute `next_generation_at` from `turn_schedule`. If
     `skip_next_tick=true`, advance by one extra cron step and clear the
     flag;
   - publish `runtime_snapshot_update` to `gm:lobby_events` with
     `{game_id, current_turn, runtime_status, engine_health_summary,
     player_turn_stats[]}`;
   - publish `game.turn.ready` notification intent to all `active`
     members.
6. Append `operation_log` entry (`op_kind=turn_generation`,
`outcome=success`).
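The status transitions in the steps above follow a claim-then-settle pattern. The sketch below models it with an in-memory store; `memoryStore` and `runTick` are hypothetical names. Against PostgreSQL the same compare-and-swap is commonly written as an `UPDATE ... WHERE status = <expected>` followed by a rows-affected check, but that is an assumption about the implementation, not something this README states.

```go
package main

import "sync"

// Status is the runtime status of one game.
type Status string

const (
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusFinished             Status = "finished"
)

// memoryStore is an in-memory stand-in for runtime_records.
type memoryStore struct {
	mu     sync.Mutex
	status map[string]Status
}

func newMemoryStore() *memoryStore {
	return &memoryStore{status: map[string]Status{}}
}

// Set stores a status unconditionally.
func (s *memoryStore) Set(gameID string, st Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[gameID] = st
}

// Get returns the current status, or "" when the game is unknown.
func (s *memoryStore) Get(gameID string) Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[gameID]
}

// CompareAndSwap moves the game to `to` only if it is currently `from`.
func (s *memoryStore) CompareAndSwap(gameID string, from, to Status) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.status[gameID]; !ok || cur != from {
		return false
	}
	s.status[gameID] = to
	return true
}

// runTick claims the game, calls generate, and settles the final status.
// It returns false when the claim fails and the tick is skipped.
func runTick(s *memoryStore, gameID string, generate func() (finished bool, err error)) bool {
	if !s.CompareAndSwap(gameID, StatusRunning, StatusGenerationInProgress) {
		return false
	}
	finished, err := generate()
	next := StatusRunning
	switch {
	case err != nil:
		next = StatusGenerationFailed
	case finished:
		next = StatusFinished
	}
	s.CompareAndSwap(gameID, StatusGenerationInProgress, next)
	return true
}
```

Publishing events and writing `operation_log` are left out to keep the state machine visible.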
**Failure paths:**
| Failure | Side effect | Outcome |
| --- | --- | --- |
| Engine timeout / 5xx | CAS `status: generation_in_progress → generation_failed`; publish `runtime_snapshot_update`; publish `game.generation_failed` admin notification | Logged; ticker leaves the game in `generation_failed` until manual recovery (admin issues `force-next-turn` or `stop`). |
| Persistence failure after engine success | Append failure to `operation_log`; status stays `generation_in_progress` | Health-summary update on next probe will resync. |
`player_turn_stats[]` is built from `StateResponse.player[]` by mapping
`raceName → user_id` through `player_mappings` and projecting
`{user_id, planets, population}`. `ships_built` is intentionally absent
(see [`./docs/stage01-architecture-sync.md`](./docs/stage01-architecture-sync.md)).
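A minimal sketch of that projection, assuming integer `planets` and `population` fields and a pre-loaded `race_name → user_id` map. The type names, the field types, and the choice to skip races without a mapping are all assumptions made for illustration.

```go
package main

// enginePlayer is a hypothetical view of one StateResponse.player[] entry.
type enginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// playerTurnStat is one element of player_turn_stats[].
type playerTurnStat struct {
	UserID     string
	Planets    int
	Population int
}

// projectTurnStats maps engine players to platform users through the
// race_name → user_id lookup built from player_mappings.
func projectTurnStats(players []enginePlayer, userByRace map[string]string) []playerTurnStat {
	stats := make([]playerTurnStat, 0, len(players))
	for _, p := range players {
		userID, ok := userByRace[p.RaceName]
		if !ok {
			continue // assumption: races without a mapping are skipped
		}
		stats = append(stats, playerTurnStat{
			UserID:     userID,
			Planets:    p.Planets,
			Population: p.Population,
		})
	}
	return stats
}
```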
### Force-next-turn
**Triggered by:** `Admin Service` or system-admin via
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`.
**Pre-conditions:** runtime exists, `status=running`.
**Flow:**
1. Run the turn-generation flow synchronously (the same code path the
scheduler uses).
2. After success, set `runtime_records.skip_next_tick = true`. The next
regular tick computed from `turn_schedule` is then advanced by one
extra step before being persisted as `next_generation_at`.
3. Append `operation_log` entry (`op_kind=force_next_turn`).
The skip rule guarantees that the spacing between a forced turn and the
next scheduled turn is never shorter than one schedule interval,
regardless of when the force is issued.
### Game finish
The finish flow is driven entirely by the engine signal `finished:bool`.
GM never decides finish independently. After `game_finished` is published,
`Game Lobby` transitions its platform record to `finished`, runs the
capability evaluation, and finalises Race Name Directory state. The GM
record stays in `status=finished` indefinitely; cleanup is operator-driven.
### Banish (engine-side player removal)
**Triggered by:** `Game Lobby` synchronously calling
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish` after a
permanent membership removal at platform level.
**Pre-conditions:** runtime exists; `race_name` resolves to an existing
`player_mappings` row.
**Flow:**
1. Call engine `POST /api/v1/admin/race/banish` with `{race_name}`.
2. On engine success, append `operation_log` entry (`op_kind=banish`,
`outcome=success`).
3. Return `204` to Lobby.
**Failure path:** engine error returns `502 engine_unreachable`. Lobby
treats this as a degraded state and may retry; the platform-level
membership stays `removed` regardless.
### Stop
**Triggered by:** system-admin via
`POST /api/v1/internal/runtimes/{game_id}/stop` with body `{reason}`,
where `reason ∈ {admin_request, finished, timeout}`.
**Flow:**
1. Call `Runtime Manager` `POST /api/v1/internal/runtimes/{game_id}/stop`
with the same `reason`.
2. CAS `runtime_records.status: * → stopped`.
3. Append `operation_log` entry.
4. Publish `runtime_snapshot_update` reflecting the stopped status.
### Patch
**Triggered by:** system-admin via
`POST /api/v1/internal/runtimes/{game_id}/patch` with body `{version}`.
**Pre-conditions:**
- `engine_versions.{version}` exists with `status=active`;
- the new version is a semver-patch of the current version (same major and
minor); otherwise reject with `semver_patch_only`.
**Flow:**
1. Resolve `image_ref` from `engine_versions.{version}`.
2. Call `Runtime Manager`
`POST /api/v1/internal/runtimes/{game_id}/patch` with `{image_ref}`.
3. On success, persist new `current_image_ref` and `current_engine_version`
on `runtime_records`.
4. Append `operation_log` entry.
The engine container is recreated by RTM with the same DNS name; the
`engine_endpoint` is unchanged. GM does not call `/admin/init` again —
the bind-mounted state directory is preserved and the engine resumes from
the previous turn.
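The `semver_patch_only` pre-condition can be sketched as a comparison of the major and minor components. This is a simplified illustration with hypothetical helper names: it accepts only plain `MAJOR.MINOR.PATCH` strings, ignores pre-release and build metadata, and does not decide whether downgrades or an identical version are allowed, since the text above does not say.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "MAJOR.MINOR.PATCH" into three integers.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, fmt.Errorf("version %q: bad component %q", s, p)
		}
		v[i] = n
	}
	return v, nil
}

// isSemverPatch reports whether target shares major and minor with
// current. A false result maps to the semver_patch_only rejection.
func isSemverPatch(current, target string) (bool, error) {
	cur, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return false, err
	}
	return cur[0] == tgt[0] && cur[1] == tgt[1], nil
}
```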
### Liveness reply (Lobby resume)
**Triggered by:** `Game Lobby` resuming a paused game, calling
`GET /api/v1/internal/games/{game_id}/liveness`.
**Flow:** if `runtime_records.{game_id}` exists and `status=running`,
return `200 {ready: true}`. Otherwise return `200 {ready: false, status:
"<observed status>"}`.
This endpoint never calls the engine; it reflects GM's own view only.
## Hot Path
### Player commands and orders
Both `game.command.execute` and `game.order.put` use the same FlatBuffers
schema (`pkg/schema/fbs/order.fbs` `Order{updated_at, commands:[…]}`). The
gateway transcodes the verified payload to JSON via
`pkg/transcoder/order.go` before calling GM.
**GM endpoints:**
- `POST /api/v1/internal/games/{game_id}/commands` — execute now; engine
`PUT /api/v1/command`.
- `POST /api/v1/internal/games/{game_id}/orders` — validate-and-store;
engine `PUT /api/v1/order`.
Both endpoints accept body `{commands:[{cmd_id, @type, …}, …]}` and the
`X-User-ID` header. The actor field on the engine call is **always** set
by GM from the authenticated user identity; GM never trusts a payload
field for actor identification.
**Pre-conditions:**
- `runtime_records.{game_id}` exists with `status=running`;
- the user is an `active` member of the game (cache lookup);
- `player_mappings.(game_id, user_id)` exists.
**Errors:**
- `runtime_not_found` — runtime missing.
- `runtime_not_running` — `runtime_status` is anything other than
`running`.
- `forbidden` — caller is not an active member.
- `engine_unreachable` — engine returned 5xx.
- `engine_validation_error` — engine returned 4xx; the body carries the
engine's per-command result (`cmd_applied`, `cmd_error_code`).
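The pre-conditions and the first three error codes can be sketched as one check that runs before the engine call. The struct and function names are hypothetical, and the order of the checks is an assumption. The sketch covers commands and orders only.

```go
package main

// hotPathState is a hypothetical snapshot of what GM knows about the
// caller before forwarding a command or order batch.
type hotPathState struct {
	RuntimeFound  bool
	RuntimeStatus string // runtime_records.status
	Membership    string // membership status from the cache, "" if absent
	HasMapping    bool   // player_mappings row exists for (game_id, user_id)
}

// checkHotPath returns the error code that rejects the request, or an
// empty string when the request may be forwarded to the engine.
func checkHotPath(s hotPathState) string {
	switch {
	case !s.RuntimeFound:
		return "runtime_not_found"
	case s.RuntimeStatus != "running":
		return "runtime_not_running"
	case s.Membership != "active" || !s.HasMapping:
		// A missing mapping surfaces as forbidden, per the stage-16
		// decisions listed under References.
		return "forbidden"
	default:
		return ""
	}
}
```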
### Reports
**GM endpoint:** `GET /api/v1/internal/games/{game_id}/reports/{turn}`
with the `X-User-ID` header.
**Flow:**
1. Authorise: caller must be an active member of the game.
2. Resolve `race_name` from `player_mappings`.
3. Call engine `GET /api/v1/report?player={race_name}&turn={turn}`.
4. Return the engine response verbatim. Reports are full per-player
payloads and are never cached at the platform layer; the engine remains
the source of truth.
### Membership cache and invalidation
GM holds an in-process per-game TTL cache (default 30 s) of memberships
loaded from `Lobby /api/v1/internal/games/{id}/memberships`. The cache
shape is `map[user_id]MembershipStatus` plus a load timestamp. TTL is
the safety-net fallback.
The primary invalidation mechanism is an explicit hook from Lobby:
- Endpoint: `POST /api/v1/internal/games/{game_id}/memberships/invalidate`.
- Lobby invokes it post-commit on every operation that mutates roster:
application approval, application rejection, invite redeem, member
remove, member block, user-lifecycle cascade.
- Failed invalidation does not roll back Lobby state; the TTL safety net
catches stale data within the next 30 s.
This is a deliberate tight coupling. The trade-off is recorded in
[`./PLAN.md` Stage 16](./PLAN.md).
## Engine Version Registry
The registry is the source of truth for which engine versions are
deployable. CRUD is exposed on the GM internal port; `Game Lobby`
consumes it synchronously to resolve `image_ref` for `target_engine_version`
just before publishing a `runtime:start_jobs` envelope.
| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/api/v1/internal/engine-versions` | List versions; supports `status` filter. |
| `POST` | `/api/v1/internal/engine-versions` | Create a new version with `version`, `image_ref`, optional `options`. Validates semver shape and Docker reference. |
| `GET` | `/api/v1/internal/engine-versions/{version}` | Read one version. |
| `PATCH` | `/api/v1/internal/engine-versions/{version}` | Update `image_ref`, `options`, or `status`. |
| `DELETE` | `/api/v1/internal/engine-versions/{version}` | Soft-deprecate (`status=deprecated`). Hard delete is rejected if the version is referenced by any non-finished `runtime_records` row. |
| `GET` | `/api/v1/internal/engine-versions/{version}/image-ref` | Resolve `image_ref` only. Used by Lobby's start flow. |
`options` is a free-form `jsonb` document stored verbatim. v1 does not
enforce a schema; future engine-side options follow the engine's own
contract.
`status` values: `active` (deployable), `deprecated` (rejected on new
starts; existing runtimes unaffected). Hard removal of a deprecated
version requires that no runtime references it.
Lobby resolves `image_ref` synchronously per game start. If the resolve
call fails or the version is missing, Lobby fails the start with
`engine_version_not_found` and never publishes `runtime:start_jobs`.
## Trusted Surfaces
### Internal REST
The internal REST surface is consumed by:
- `Edge Gateway` — verified player commands and report reads;
- `Game Lobby` — register-runtime, image-ref resolve, membership invalidate,
banish, liveness reply;
- `Admin Service` (future) — full administrative operations;
- platform probes — `/healthz`, `/readyz`.
The listener is unauthenticated; downstream services rely on network
segmentation. Caller identity for audit is recorded from the optional
`X-Galaxy-Caller` header (`gateway`, `lobby`, `admin`) and reflected as
`op_source` in `operation_log` (`gateway_player`, `lobby_internal`,
`admin_rest`); when missing or unrecognised, GM defaults to
`op_source=admin_rest`.
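The header-to-audit mapping can be sketched as a lookup with a default. The pairing below follows the order in which the two lists are given above, and exact, case-sensitive matching is an assumption; `opSourceFor` is a hypothetical name.

```go
package main

// opSourceFor maps the X-Galaxy-Caller header value to the op_source
// recorded in operation_log. Missing or unrecognised values fall back
// to admin_rest.
func opSourceFor(caller string) string {
	switch caller {
	case "gateway":
		return "gateway_player"
	case "lobby":
		return "lobby_internal"
	default:
		return "admin_rest"
	}
}
```

In an HTTP handler the argument would typically come from `r.Header.Get("X-Galaxy-Caller")`, which returns an empty string when the header is absent.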
For player-command endpoints, the additional `X-User-ID` header is
required and authoritative for the acting user identity.
Request and response shapes are defined in
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml). Unknown JSON
fields are rejected with `invalid_request`.
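With the standard library, rejecting unknown fields is a decoder option. The sketch below uses `json.Decoder.DisallowUnknownFields` on a hypothetical `stopRequest` struct modelled on the `{reason}` body of the stop endpoint. Rejecting trailing data after the first JSON value is an extra strictness added here, not something the contract above requires.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"strings"
)

// stopRequest mirrors the `{reason}` body of the stop endpoint.
type stopRequest struct {
	Reason string `json:"reason"`
}

// decodeStrict decodes exactly one JSON value into dst and fails on
// unknown fields or on any data after that value.
func decodeStrict(r io.Reader, dst any) error {
	dec := json.NewDecoder(r)
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return err
	}
	if err := dec.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return errors.New("unexpected data after JSON value")
	}
	return nil
}

// decodeStopRequest is a convenience wrapper for string bodies.
func decodeStopRequest(body string) (stopRequest, error) {
	var req stopRequest
	err := decodeStrict(strings.NewReader(body), &req)
	return req, err
}
```

Any error returned by `decodeStrict` would be mapped to `invalid_request` by the handler.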
## Async Stream Contracts
### `gm:lobby_events` (out)
Producer: `Game Master`. Consumer: `Game Lobby`.
Two message types share the stream, discriminated by `event_type`:
| `event_type` | Body |
| --- | --- |
| `runtime_snapshot_update` | `{game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats:[{user_id, planets, population}], occurred_at_ms}` |
| `game_finished` | `{game_id, final_turn_number, runtime_status:"finished", player_turn_stats:[…], finished_at_ms}` |
Publication cadence: events only. GM publishes a snapshot when:
- a turn was generated (success or failure);
- `runtime_status` transitioned (e.g., `running ↔ generation_in_progress`,
`running → engine_unreachable`, `* → finished`);
- `engine_health_summary` changed in response to a `runtime:health_events`
observation (debounced — duplicates are suppressed when the summary did
not change).
There is no periodic heartbeat. `Game Lobby` consumes these events to
update its denormalised runtime snapshot and to feed the per-game
`player_turn_stats` aggregate used at game finish.
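The debounce on health-driven snapshots can be sketched as an in-memory map of the last emitted summary per game. `summaryDeduper` is a hypothetical name, and recording the summary before the publish succeeds is a simplification.

```go
package main

import "sync"

// summaryDeduper remembers the last health summary emitted per game so
// that unchanged summaries do not produce a new snapshot event.
type summaryDeduper struct {
	mu   sync.Mutex
	last map[string]string
}

func newSummaryDeduper() *summaryDeduper {
	return &summaryDeduper{last: map[string]string{}}
}

// ShouldPublish reports whether summary differs from the last one
// recorded for the game, and records it when it does.
func (d *summaryDeduper) ShouldPublish(gameID, summary string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, ok := d.last[gameID]; ok && prev == summary {
		return false
	}
	d.last[gameID] = summary
	return true
}
```

Because the state lives in process memory, a restart forgets it and the first observation per game is published again.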
The first `runtime_snapshot_update` published right after a successful
`register-runtime` carries `player_turn_stats` projected from the
engine `/admin/init` response — the per-player baseline (`planets`,
`population`) at turn 0. Lobby treats this baseline as the reference
point against which subsequent turn deltas are measured. For other
status transitions that fire without a fresh engine state payload
(e.g., a pure health-summary change), `player_turn_stats` is empty.
The full schema is enforced by
[`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml).
### `runtime:health_events` (in)
Producer: `Runtime Manager`. Consumer: `Game Master`.
GM consumes the stream to update `runtime_records.engine_health` summary
per game. The schema is owned by `Runtime Manager` and documented in
[`../rtmanager/api/runtime-health-asyncapi.yaml`](../rtmanager/api/runtime-health-asyncapi.yaml).
GM never modifies `runtime:health_events`; it is read-only.
GM does not publish notifications in response to runtime health changes
in v1; the operator surface is `gm:lobby_events` plus the GM REST
inspect endpoints.
## Notification Contracts
`Game Master` publishes notification intents to `notification:intents`
using the shared `pkg/notificationintent` producer module:
| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Successful turn generation | `game.turn.ready` | active members of the game | `push+email` |
| Game finish | `game.finished` | active members of the game | `push+email` |
| Turn generation failed | `game.generation_failed` | configured admin email list | `email` |
Recipient resolution: GM materialises `recipient_user_ids` from its own
membership cache (loaded from Lobby) at publish time; admin recipients
are resolved by `Notification Service` from configuration.
A failed publication is a notification degradation and must not roll back
already committed runtime state. Failed publications are logged and
counted via `gamemaster.notification.publish_attempts`.
## Persistence Layout
### PostgreSQL durable state (schema `gamemaster`)
| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game; latest known runtime status and scheduling state. | `game_id` |
| `engine_versions` | Engine version registry. | `version` |
| `player_mappings` | `(game_id, user_id) → race_name + engine_player_uuid`. | composite `(game_id, user_id)` |
| `operation_log` | Append-only audit of every GM operation. | `id` (auto) |
`runtime_records` columns:
- `game_id` — primary key, references Lobby's identifier.
- `status` — `starting | running | generation_in_progress |
generation_failed | stopped | engine_unreachable | finished`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `current_image_ref` — Docker reference of the running image.
- `current_engine_version` — semver string registered in `engine_versions`.
- `turn_schedule` — five-field cron expression copied from Lobby.
- `current_turn` — last completed turn number; `0` until the first turn
generates.
- `next_generation_at` — UTC timestamp of the next due tick.
- `skip_next_tick` — boolean; set by `force-next-turn`, cleared after the
first cron step is skipped.
- `engine_health` — short text summary derived from
`runtime:health_events`.
- `created_at`, `updated_at`, `started_at`, `stopped_at`, `finished_at` —
lifecycle timestamps.
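As an illustration of the `status` and `engine_endpoint` columns, the documented values can be modelled as a small Go type. This is a sketch under assumptions: the type name `RuntimeStatus`, the constant names, and the helper `engineEndpoint` are hypothetical and are not taken from the service's domain package.

```go
package main

import "fmt"

// RuntimeStatus mirrors the documented values of runtime_records.status.
type RuntimeStatus string

const (
	StatusStarting             RuntimeStatus = "starting"
	StatusRunning              RuntimeStatus = "running"
	StatusGenerationInProgress RuntimeStatus = "generation_in_progress"
	StatusGenerationFailed     RuntimeStatus = "generation_failed"
	StatusStopped              RuntimeStatus = "stopped"
	StatusEngineUnreachable    RuntimeStatus = "engine_unreachable"
	StatusFinished             RuntimeStatus = "finished"
)

// Valid reports whether s is one of the documented status values.
func (s RuntimeStatus) Valid() bool {
	switch s {
	case StatusStarting, StatusRunning, StatusGenerationInProgress,
		StatusGenerationFailed, StatusStopped, StatusEngineUnreachable,
		StatusFinished:
		return true
	}
	return false
}

// engineEndpoint builds the documented endpoint shape for a game.
func engineEndpoint(gameID string) string {
	return fmt.Sprintf("http://galaxy-game-%s:8080", gameID)
}

func main() {
	fmt.Println(RuntimeStatus("running").Valid())
	fmt.Println(RuntimeStatus("paused").Valid())
	fmt.Println(engineEndpoint("g-42"))
}
```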
`engine_versions` columns:
- `version` — primary key; semver string.
- `image_ref` — non-empty Docker reference.
- `options` — `jsonb`, free-form, default `'{}'`.
- `status` — `active | deprecated`.
- `created_at`, `updated_at`.
`player_mappings` columns:
- composite primary key `(game_id, user_id)`.
- `race_name` — non-empty string; unique per `game_id`.
- `engine_player_uuid` — UUID returned by the engine `/admin/init`.
- `created_at`.
`operation_log` columns:
- `id`, `game_id`, `op_kind` (`register_runtime | turn_generation |
force_next_turn | banish | stop | patch | engine_version_create |
engine_version_update | engine_version_deprecate |
engine_version_delete`), `op_source`, `source_ref` (request id
when known), `outcome` (`success | failure`), `error_code`,
`error_message`, `started_at`, `finished_at`.
For engine-version registry entries (`op_kind` starting with
`engine_version_`), the `game_id` column doubles as the audit subject
and stores the canonical `version` string instead of a platform game
identifier; the registry is global, not per-game. The convention is
documented in
[`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md).
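A reader of `operation_log` therefore has to interpret `game_id` according to `op_kind`. The sketch below shows one way an audit reader might do that; `isRegistryOp` and `describeSubject` are hypothetical helpers, not functions from the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// isRegistryOp reports whether an op_kind belongs to the engine-version
// registry, based on the documented "engine_version_" prefix.
func isRegistryOp(opKind string) bool {
	return strings.HasPrefix(opKind, "engine_version_")
}

// describeSubject interprets the game_id column of an operation_log row:
// a version string for registry operations, a game identifier otherwise.
func describeSubject(opKind, gameIDColumn string) string {
	if isRegistryOp(opKind) {
		return "engine version " + gameIDColumn
	}
	return "game " + gameIDColumn
}

func main() {
	fmt.Println(describeSubject("engine_version_create", "1.4.2"))
	fmt.Println(describeSubject("turn_generation", "g-42"))
}
```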
Indexes:
- `runtime_records (status, next_generation_at)` — drives the scheduler
ticker scan.
- `operation_log (game_id, started_at DESC)` — drives audit reads.
- UNIQUE on `player_mappings (game_id, race_name)` —
one-race-per-game invariant.
Per-game roster reads (`WHERE game_id = $1`) are served by the
leftmost prefix of the composite primary key on
`player_mappings (game_id, user_id)`; no extra single-column index is
added.
Migrations are embedded as a single `00001_init.sql` file, following the
single-init pre-launch policy from `ARCHITECTURE.md §Persistence Backends`.
### Redis runtime-coordination state
| Key shape | Purpose |
| --- | --- |
| `gamemaster:stream_offsets:{label}` | Last processed entry id per consumer (`health_events`). Same shape as Lobby and RTM. |
GM does not persist the membership cache to Redis in v1; the cache is
in-process. This trade-off is documented in [`./PLAN.md` Stage 16](./PLAN.md).
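The offset key shape can be illustrated with a small helper. This is a sketch only: `offsetStore` is a hypothetical in-memory stand-in for the Redis reads and writes, and falling back to `0-0` (the start of a Redis stream) when no offset is stored is an assumption, not documented behaviour.

```go
package main

import "fmt"

// streamOffsetKey builds the key from the documented shape
// gamemaster:stream_offsets:{label}.
func streamOffsetKey(label string) string {
	return "gamemaster:stream_offsets:" + label
}

// offsetStore is a hypothetical in-memory stand-in for Redis.
type offsetStore map[string]string

// resumeFrom returns the last processed entry id for a consumer label,
// or "0-0" (assumed default) when nothing has been stored yet.
func (s offsetStore) resumeFrom(label string) string {
	if id, ok := s[streamOffsetKey(label)]; ok {
		return id
	}
	return "0-0"
}

// markProcessed records the entry id after it has been handled.
func (s offsetStore) markProcessed(label, entryID string) {
	s[streamOffsetKey(label)] = entryID
}

func main() {
	store := offsetStore{}
	fmt.Println(streamOffsetKey("health_events"))
	fmt.Println(store.resumeFrom("health_events"))
	store.markProcessed("health_events", "1714720000000-0")
	fmt.Println(store.resumeFrom("health_events"))
}
```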
## Error Model
Error envelope: `{ "error": { "code": "...", "message": "..." } }`,
identical to Lobby and RTM.
Stable error codes:
| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `runtime_not_found` | `runtime_records.{game_id}` does not exist. |
| `runtime_not_running` | Operation requires `status=running`. |
| `conflict` | State transition not allowed. |
| `forbidden` | Caller is not an active member or not authorised. |
| `engine_version_not_found` | `engine_versions.{version}` does not exist. |
| `engine_version_in_use` | Hard-delete attempt against a version referenced by a non-finished runtime. |
| `semver_patch_only` | Patch attempt across major/minor boundary. |
| `engine_unreachable` | Engine returned 5xx or connection error. |
| `engine_protocol_violation` | Engine response missing required fields or carries unexpected payload. |
| `engine_validation_error` | Engine returned 4xx with per-command results. |
| `service_unavailable` | Dependency (PostgreSQL, Redis, Lobby, RTM) unavailable. |
| `internal_error` | Unspecified failure. |
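The envelope and the `invalid_request` rule for unknown fields can be sketched with the standard `encoding/json` package. The type and function names here are hypothetical, and `stopRequest` is an invented request body used only to demonstrate strict decoding.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// errorBody and errorEnvelope mirror the documented envelope shape.
type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// encodeError renders the envelope for a stable error code.
func encodeError(code, message string) string {
	raw, err := json.Marshal(errorEnvelope{Error: errorBody{Code: code, Message: message}})
	if err != nil {
		return `{"error":{"code":"internal_error","message":"encoding failed"}}`
	}
	return string(raw)
}

// stopRequest is a hypothetical request body for illustration only.
type stopRequest struct {
	Reason string `json:"reason"`
}

// decodeStopRequest rejects malformed JSON and unknown fields, returning
// the error code a handler would report (empty string on success).
func decodeStopRequest(body string) (stopRequest, string) {
	var req stopRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return stopRequest{}, "invalid_request"
	}
	return req, ""
}

func main() {
	fmt.Println(encodeError("runtime_not_found", "no such runtime"))
	_, code := decodeStopRequest(`{"reason":"maintenance","extra":1}`)
	fmt.Println(code)
}
```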
## Configuration
All variables use the `GAMEMASTER_` prefix. If a required variable is
missing, the service fails fast on startup.
### Required
- `GAMEMASTER_INTERNAL_HTTP_ADDR`
- `GAMEMASTER_POSTGRES_PRIMARY_DSN`
- `GAMEMASTER_REDIS_MASTER_ADDR`
- `GAMEMASTER_REDIS_PASSWORD`
- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`
- `GAMEMASTER_RTM_INTERNAL_BASE_URL`
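A fail-fast check over the required list might look like the sketch below. `missingRequired` is a hypothetical helper, and treating an empty value as missing is an assumption. A real service would pass `os.LookupEnv` and exit with a non-zero status when the result is non-empty; the example uses a map so it can run anywhere.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredVars lists the documented required variables.
var requiredVars = []string{
	"GAMEMASTER_INTERNAL_HTTP_ADDR",
	"GAMEMASTER_POSTGRES_PRIMARY_DSN",
	"GAMEMASTER_REDIS_MASTER_ADDR",
	"GAMEMASTER_REDIS_PASSWORD",
	"GAMEMASTER_LOBBY_INTERNAL_BASE_URL",
	"GAMEMASTER_RTM_INTERNAL_BASE_URL",
}

// missingRequired returns every required variable that is unset or empty.
// The lookup parameter has the same signature as os.LookupEnv.
func missingRequired(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Sample environment with only two variables set.
	env := map[string]string{
		"GAMEMASTER_INTERNAL_HTTP_ADDR": ":8097",
		"GAMEMASTER_REDIS_MASTER_ADDR":  "localhost:6379",
	}
	lookup := func(name string) (string, bool) {
		v, ok := env[name]
		return v, ok
	}
	if missing := missingRequired(lookup); len(missing) > 0 {
		fmt.Println("missing required configuration: " + strings.Join(missing, ", "))
	}
}
```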
### Configuration groups
**Listener:**
- `GAMEMASTER_INTERNAL_HTTP_ADDR` (e.g., `:8097`).
- `GAMEMASTER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `GAMEMASTER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `30s`).
- `GAMEMASTER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).
**PostgreSQL:**
- `GAMEMASTER_POSTGRES_PRIMARY_DSN`
(`postgres://gamemaster:<pwd>@<host>:5432/galaxy?search_path=gamemaster&sslmode=disable`).
- `GAMEMASTER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used
in v1).
- `GAMEMASTER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `GAMEMASTER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `GAMEMASTER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `GAMEMASTER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).
**Redis:**
- `GAMEMASTER_REDIS_MASTER_ADDR`.
- `GAMEMASTER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `GAMEMASTER_REDIS_PASSWORD`.
- `GAMEMASTER_REDIS_DB` (default `0`).
- `GAMEMASTER_REDIS_OPERATION_TIMEOUT` (default `2s`).
**Streams:**
- `GAMEMASTER_REDIS_LOBBY_EVENTS_STREAM` (default `gm:lobby_events`).
- `GAMEMASTER_REDIS_HEALTH_EVENTS_STREAM` (default
`runtime:health_events`).
- `GAMEMASTER_REDIS_NOTIFICATION_INTENTS_STREAM` (default
`notification:intents`).
- `GAMEMASTER_STREAM_BLOCK_TIMEOUT` (default `5s`).
**Engine client:**
- `GAMEMASTER_ENGINE_CALL_TIMEOUT` (default `30s` — covers turn generation
on large games).
- `GAMEMASTER_ENGINE_PROBE_TIMEOUT` (default `5s` — for inspect-style
reads).
**Lobby internal client:**
- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`.
- `GAMEMASTER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).
**Runtime Manager internal client:**
- `GAMEMASTER_RTM_INTERNAL_BASE_URL`.
- `GAMEMASTER_RTM_INTERNAL_TIMEOUT` (default `5s`).
**Scheduler:**
- `GAMEMASTER_SCHEDULER_TICK_INTERVAL` (default `1s`).
- `GAMEMASTER_TURN_GENERATION_TIMEOUT` (default `60s`).
**Membership cache:**
- `GAMEMASTER_MEMBERSHIP_CACHE_TTL` (default `30s`).
- `GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES` (default `4096`; LRU eviction).
**Logging:**
- `GAMEMASTER_LOG_LEVEL` (default `info`).
**Lifecycle:**
- `GAMEMASTER_SHUTDOWN_TIMEOUT` (default `30s`).
**Telemetry:** uses the standard OTLP env vars
(`OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_PROTOCOL`, etc.)
shared with other Galaxy services.
## Observability
### Metrics (OpenTelemetry, low cardinality)
- `gamemaster.register_runtime.outcomes` — counter; labels `outcome`,
`error_code`.
- `gamemaster.turn_generation.outcomes` — counter; labels `outcome`,
`error_code`, `trigger` (`scheduler | force`).
- `gamemaster.command_execute.outcomes` — counter; labels `outcome`,
`error_code`.
- `gamemaster.order_put.outcomes` — counter; labels `outcome`,
`error_code`.
- `gamemaster.report_get.outcomes` — counter; labels `outcome`,
`error_code`.
- `gamemaster.banish.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.engine_call.latency` — histogram; label `op` (`init |
status | turn | banish | command | order | report`).
- `gamemaster.runtime_records_by_status` — gauge; label `status`.
- `gamemaster.scheduler.due_games` — gauge.
- `gamemaster.health_events.consumed` — counter.
- `gamemaster.lobby_events.published` — counter; label `event_type`.
- `gamemaster.notification.publish_attempts` — counter; labels
  `notification_type`, `result` (`ok | error`).
- `gamemaster.membership_cache.hits` — counter; label `result` (`hit |
  miss | invalidate`).
- `gamemaster.engine_versions_total` — gauge.
Metrics avoid high-cardinality attributes such as `game_id` and `user_id`.
### Structured logs (slog JSON to stdout)
Common fields on every entry: `service=gamemaster`, `request_id`,
`trace_id`, `span_id`, `game_id` (when known), `user_id` (when known),
`op_kind`, `op_source`, `outcome`, `error_code`.
Worker-specific fields: `event_type` (lobby-events publisher),
`stream_entry_id` (health-events consumer), `turn` (turn-generation),
`engine_endpoint` (engine calls).
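The log shape described above can be approximated with the standard `log/slog` JSON handler. This is a sketch: `newLogger` and `demoEntry` are hypothetical helpers, and the field values (including `op_source=scheduler`) are illustrative rather than taken from the service.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger that stamps service=gamemaster on every entry.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", "gamemaster")
}

// demoEntry logs one turn-generation entry into a buffer and decodes it
// back into a map so the resulting fields can be inspected.
func demoEntry() map[string]any {
	var buf bytes.Buffer
	log := newLogger(&buf)
	log.Info("turn generated",
		"game_id", "g-42",
		"op_kind", "turn_generation",
		"op_source", "scheduler", // illustrative value
		"outcome", "success",
		"turn", 7,
	)
	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return nil
	}
	return entry
}

func main() {
	// Write one entry to stdout, as the service does.
	newLogger(os.Stdout).Info("runtime registered",
		"game_id", "g-42", "op_kind", "register_runtime", "outcome", "success")
	fmt.Println(demoEntry()["service"])
}
```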
## Verification
Service-level (per [`./PLAN.md`](./PLAN.md)):
- Unit tests for every service-layer operation against mocked engine,
Lobby, RTM, notification publisher, lobby-events publisher.
- Adapter tests using `testcontainers-go` for PostgreSQL and Redis.
- Contract tests for `internal-openapi.yaml` and
`runtime-events-asyncapi.yaml`.
Service-local integration suite under `gamemaster/integration/`:
- Register-runtime + first turn happy path against the real
`galaxy/game` test image.
- Force-next-turn skip behaviour.
- Engine version registry CRUD + resolve.
- Admin stop synchronous REST.
- Banish round-trip.
- Membership invalidation hook.
- `runtime:health_events` consumption.
Inter-service suites under `integration/lobbygm/` and
`integration/lobbygmrtm/`:
- `lobbygm`: real Lobby + real GM + real engine + stub RTM. Covers
enrollment → register-runtime → first turn → finish + capability
evaluation.
- `lobbygmrtm`: full Lobby + GM + RTM + engine. Covers happy path and the
documented failure paths from `ARCHITECTURE.md` flow §4.
Manual smoke (development):
```sh
docker network create galaxy-net # once
GAMEMASTER_INTERNAL_HTTP_ADDR=:8097 \
GAMEMASTER_POSTGRES_PRIMARY_DSN='postgres://gamemaster:secret@localhost:5432/galaxy?search_path=gamemaster&sslmode=disable' \
GAMEMASTER_REDIS_MASTER_ADDR=localhost:6379 \
GAMEMASTER_REDIS_PASSWORD=secret \
GAMEMASTER_LOBBY_INTERNAL_BASE_URL=http://localhost:8095 \
GAMEMASTER_RTM_INTERNAL_BASE_URL=http://localhost:8096 \
... go run ./gamemaster/cmd/gamemaster
```
After start, `curl http://localhost:8097/readyz` returns `200`. Driving
Lobby through its public start flow brings up `galaxy-game-{game_id}`
containers, GM registers each runtime, generates turns on the configured
schedule, and propagates events to Lobby.