# Game Master

`Game Master` (GM) is the only Galaxy platform service permitted to talk to running game engine containers. It owns runtime and operational state of already-running games, the engine version registry, the platform mapping of `(user_id ↔ race_name ↔ engine_player_uuid)`, the per-game turn scheduler, and the synchronous and asynchronous boundaries that other services use to interact with running games.

## References

- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §8 Game Master.
- [`../TESTING.md`](../TESTING.md) §8 — testing matrix for GM.
- [`./PLAN.md`](./PLAN.md) — staged implementation plan.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry point (created at PLAN stage 24).
- [`./docs/stage06-contract-files.md`](./docs/stage06-contract-files.md) — decisions behind the OpenAPI and AsyncAPI specs frozen at PLAN stage 06.
- [`./docs/stage07-notification-catalog-audit.md`](./docs/stage07-notification-catalog-audit.md) — notification catalog audit and producer-side freeze test added at PLAN stage 07.
- [`./docs/stage08-module-skeleton.md`](./docs/stage08-module-skeleton.md) — module skeleton wiring decisions (config groups, telemetry instruments, Makefile targets, deferred dependencies) recorded at PLAN stage 08.
- [`./docs/stage09-postgres-migration.md`](./docs/stage09-postgres-migration.md) — PostgreSQL schema, embedded migration, jet generation pipeline, and runtime wiring landed at PLAN stage 09.
- [`./docs/stage10-domain-and-ports.md`](./docs/stage10-domain-and-ports.md) — domain types, port interfaces, and the six stage-10 decisions (operation domain package, membership DTO placement, engine-version options shape, schedule wrapper signature, recovery transition, deferred mock destination) landed at PLAN stage 10.
- [`./docs/stage11-persistence-adapters.md`](./docs/stage11-persistence-adapters.md) — PostgreSQL stores (`runtimerecordstore`, `engineversionstore`, `playermappingstore`, `operationlog`), the Redis offset store, and the eight stage-11 decisions (sqlx/pgtest local clones, CAS pattern, port-level Now extension, domain conflict sentinels, jsonb cast, idempotent Deprecate, multi-row BulkInsert, miniredis dependency) landed at PLAN stage 11.
- [`./docs/stage12-external-clients.md`](./docs/stage12-external-clients.md) — outbound adapters (engine, Lobby, Runtime Manager, notification intent publisher, lobby-events publisher) and the seven stage-12 decisions (per-call engine base URL, dual engine timeout dispatch, engine population rounding, Lobby pagination cap, no extra RTM sentinels, AsyncAPI-aligned XADD encoding for `gm:lobby_events`, Makefile mocks-target guard) landed at PLAN stage 12.
- [`./docs/stage13-register-runtime.md`](./docs/stage13-register-runtime.md) — register-runtime service-layer orchestrator and the five stage-13 decisions (`RuntimeRecordStore.Delete` extension, engine 4xx/5xx classification split, engine response validated as `engine_protocol_violation`, initial snapshot carries `player_turn_stats` from `/admin/init`, two-flag rollback gating) landed at PLAN stage 13.
- [`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md) — engine version registry service-layer orchestrator (List, Get, Create, Update, Deprecate, Delete, ResolveImageRef) and the five stage-14 decisions (`EngineVersionStore.Delete` port extension, reference probe before hard delete, new `engine_version_delete` op_kind in schema and domain, `operation_log.game_id` overloaded as audit subject for registry entries, JSON-object validation for `options`) landed at PLAN stage 14.
- [`./docs/stage15-scheduler-and-turn-generation.md`](./docs/stage15-scheduler-and-turn-generation.md) — scheduler ticker, turn-generation orchestrator, and snapshot publisher and the seven stage-15 decisions (`LobbyClient.GetGameSummary` extension with fail-soft `game_name` fallback, telemetry-only `Trigger` parameter, two-CAS pattern with external-mutation conflict, single-snapshot-per-outcome cadence, player_mappings as recipient source, stateless scheduler utility, in-flight set on the ticker) landed at PLAN stage 15.
- [`./docs/stage16-membership-cache-and-invalidation.md`](./docs/stage16-membership-cache-and-invalidation.md) — hot-path services (`commandexecute`, `orderput`, `reportget`), membership cache, and the six stage-16 decisions (no `runtime_not_running` for reports, GM-side envelope rewrite `commands`→`cmd` with injected `actor`, hot-path skips `operation_log`, hand-rolled per-game inflight tracker, raw status string return, missing-mapping surfaces as `forbidden`) landed at PLAN stage 16.
- [`./docs/stage17-admin-operations.md`](./docs/stage17-admin-operations.md) — admin service-layer operations (`adminstop`, `adminforce`, `adminpatch`, `adminbanish`, `livenessreply`) and the six stage-17 decisions (`RuntimeRecordStore.UpdateImage` extension, `adminstop` idempotent on terminal statuses and `conflict` on `starting`, `adminforce` always sets `skip_next_tick`, `adminbanish` without status check and missing race surfaces as `forbidden`, `livenessreply` 200 + empty status on `runtime_not_found`, RTM failures map to `service_unavailable`) landed at PLAN stage 17.
- [`./docs/stage18-health-events-consumer.md`](./docs/stage18-health-events-consumer.md) — `runtime:health_events` consumer worker and the seven stage-18 decisions (event-type taxonomy expanded to seven values with `container_started` and `probe_recovered`, CAS-conflict fallback to health-only update, new `RuntimeRecordStore.UpdateEngineHealth` port method, in-memory dedupe of last-emitted summaries, read-after-write snapshot construction, `health_events` stream offset label, worker wiring deferred to Stage 19) landed at PLAN stage 18.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — internal trusted REST contract.
- [`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml) — `gm:lobby_events` Redis Stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract (env, ports, admin and player REST surfaces, `/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with GM.
- [`../rtmanager/README.md`](../rtmanager/README.md) — Runtime Manager contract used synchronously by GM admin operations.

## Purpose

A running Galaxy game lives in exactly one Docker container managed by `Runtime Manager`. The platform must:

- register a freshly started container with platform-level membership;
- initialise the engine with the agreed race roster;
- accept and forward player commands and orders to the engine;
- route per-player report reads;
- generate turns according to a schedule;
- detect game finish and propagate it back to platform-level state;
- expose runtime/operational controls (force-next-turn, stop, patch, banish);
- own the catalogue of supported engine versions and resolve `image_ref` values for `Game Lobby`.

`Game Master` is the single component that performs these actions. It does **not** own platform metadata of games (that is `Game Lobby`), Docker control (that is `Runtime Manager`), or the full game state (that is the engine container).
Engine state on disk is the engine's domain; GM never reads or writes the bind-mounted state directory.

## Scope

`Game Master` is the source of truth for:

- the runtime mapping `game_id → engine_endpoint` for every running game;
- the runtime status (`starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished`);
- the current turn number and the next-tick timestamp;
- the per-game `(user_id, race_name, engine_player_uuid)` triple;
- the engine version registry: `(version, image_ref, options, status)`;
- the durable history of every operation GM performed (`operation_log`);
- the latest engine health summary per game.

`Game Master` is **not** the source of truth for:

- platform game records (created, draft, enrollment, finished metadata) — owned by `Game Lobby`;
- container lifecycle and Docker reality — owned by `Runtime Manager`;
- in-game world state (planets, ships, science, reports) — owned by the engine container;
- platform user identity and entitlements — owned by `User Service`;
- in-game `race_name` reservations and the Race Name Directory — owned by `Game Lobby`.

## Non-Goals

- Multi-instance operation in v1. GM runs as a single process; the in-process scheduler is authoritative. Multi-instance with leader election is an explicit future iteration.
- Direct Docker access. GM never imports the Docker SDK; every container operation goes through `Runtime Manager` over trusted internal REST.
- Player removal/block at platform level. `Game Lobby` owns that decision; GM only performs the engine-side `banish` call when explicitly invoked.
- Pause/resume of a running game on the platform side. `Game Lobby.paused` is a platform-only state; GM only answers a liveness probe used by Lobby's resume flow.
- Automatic semver-patch upgrades. Patch is always an explicit admin operation against a target engine version present in the registry.
- TLS or mTLS on the internal listener. GM trusts its network segment.
- Direct delivery of player-visible push events. `Notification Service` owns user-targeted push delivery; GM publishes notification intents only.
- A separate Admin Service. GM exposes its trusted internal REST surface; Admin Service will adopt it in a later iteration.
- Engine state file management. Backup, archival, and cleanup of the bind-mounted state directories are operator concerns.

## Position in the System

```mermaid
flowchart LR
    Gateway["Edge Gateway"]
    Lobby["Game Lobby"]
    Admin["Admin Service\n(future)"]
    GM["Game Master"]
    RTM["Runtime Manager"]
    Notify["Notification Service"]
    Engine["Game Engine container\n(galaxy/game)"]
    Postgres["PostgreSQL\nschema gamemaster"]
    Redis["Redis\nstreams + caches"]

    Gateway -- "verified player commands\n(REST/JSON)" --> GM
    Lobby -- "register-runtime,\nimage-ref resolve,\nmemberships invalidate" --> GM
    Admin -- "internal REST" --> GM
    GM -- "engine HTTP API" --> Engine
    GM -- "stop / restart / patch" --> RTM
    GM -- "notification:intents" --> Notify
    GM -- "gm:lobby_events" --> Redis
    Redis -- "runtime:health_events" --> GM
    GM --> Postgres
```

`Edge Gateway` routes verified player message types (`game.command.execute`, `game.order.put`, `game.report.get`) to GM as trusted REST/JSON after transcoding from FlatBuffers.

`Game Lobby` calls GM synchronously to register runtimes after a successful container start, to resolve `image_ref` from the engine version registry, to invalidate membership cache on roster changes, and to verify GM liveness during platform resume.

`Game Master` calls `Runtime Manager` synchronously over REST for stop, restart, and patch. `Runtime Manager` publishes `runtime:health_events`, which GM consumes asynchronously. GM publishes `gm:lobby_events` consumed by `Game Lobby`, and `notification:intents` consumed by `Notification Service`.
## Responsibility Boundaries

`Game Master` is responsible for:

- registering a freshly started container into platform-level runtime state;
- initialising the engine with the race roster received from Lobby;
- maintaining the platform mapping of `user_id`, `race_name`, and `engine_player_uuid`;
- forwarding player commands, orders, and report reads to the engine after authorising the actor;
- generating turns on schedule, including the force-next-turn skip rule;
- evaluating engine finish on every turn boundary;
- publishing runtime snapshot updates and the final game-finish event;
- consuming runtime health events from `Runtime Manager` and updating its per-game health summary;
- exposing the engine version registry CRUD;
- driving admin-level runtime operations (stop, force-next-turn, patch, banish) by calling `Runtime Manager` and the engine on demand.

`Game Master` is not responsible for:

- creating or stopping containers on Docker (that is `Runtime Manager`);
- evaluating whether a game is allowed to start (that is `Game Lobby`);
- deriving recipient user lists for non-game notifications (that is `Notification Service`);
- verifying authenticated transport, signatures, freshness, and replay (that is `Edge Gateway`);
- mapping `user_id` to platform-level membership (that is `Game Lobby`).

## Engine Container Contract

The engine container is `galaxy/game`. GM uses two route classes:

| Class | Path | Purpose |
| --- | --- | --- |
| Admin (GM-only) | `POST /api/v1/admin/init` | Initialise the engine with a race roster. |
| Admin (GM-only) | `GET /api/v1/admin/status` | Read the full game state. |
| Admin (GM-only) | `PUT /api/v1/admin/turn` | Generate the next turn. |
| Admin (GM-only) | `POST /api/v1/admin/race/banish` | Deactivate a race after permanent platform removal. Body `{race_name}`. |
| Player | `PUT /api/v1/command` | Execute a batch of player commands. |
| Player | `PUT /api/v1/order` | Validate and store a batch of player orders. |
| Player | `GET /api/v1/report` | Fetch per-player turn report. |
| Probe | `GET /healthz` | Liveness probe used by `Runtime Manager` and operator tooling. |

Admin paths are unauthenticated but routed only from inside the trusted network segment that connects GM to the engine container. The engine does not enforce caller identity — network-level segmentation is the boundary.

`StateResponse` carries an extra boolean `finished` field. When `true` on a turn-generation response, GM treats the game as finished and runs the finish flow described below. The conditional logic that flips `finished` to `true` lives in the engine's domain code and is not GM's concern.

The engine endpoint URL is the `engine_endpoint` value handed to GM by `Game Lobby` during `register-runtime`: `http://galaxy-game-{game_id}:8080`. The DNS name is stable across restart and patch.

## Runtime Surface

### Listeners

| Listener | Default address | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8097` (`GAMEMASTER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Edge Gateway`, `Game Lobby`, and `Admin Service`. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted network segment. Authentication of player commands has already happened at `Edge Gateway`; GM enforces authorisation only.

### Background workers

| Worker | Driver | Description |
| --- | --- | --- |
| Scheduler ticker | 1 s loop | Scans `runtime_records` for due `next_generation_at`, runs the turn-generation service for each, recomputes `next_generation_at` from `turn_schedule` (skipping one tick when `skip_next_tick=true` is set). |
| `runtime:health_events` consumer | Redis Stream | XREADs from `runtime:health_events` (produced by RTM), updates `runtime_records.engine_health` summary, debounces `runtime_snapshot_update` publication. |

### Startup dependencies

In start order:

1. PostgreSQL primary (`GAMEMASTER_POSTGRES_PRIMARY_DSN`).
   Embedded goose migrations apply synchronously before any listener opens.
2. Redis master (`GAMEMASTER_REDIS_MASTER_ADDR`).
3. Telemetry exporter (OTLP grpc/http or stdout).
4. Internal HTTP listener.
5. Health-events consumer worker.
6. Scheduler ticker worker.

A failure in any step exits the process non-zero.

### Probes

`/healthz` reports liveness — the process responds when the HTTP server is alive.

`/readyz` reports readiness — `200` only when the PostgreSQL pool can ping the primary and the Redis master client can ping. No deeper dependency is checked synchronously; the engine is reached only on demand.

Both probes are documented in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).

## Lifecycles

### Register-runtime

**Triggered by:** `Game Lobby` after a successful container start, calling `POST /api/v1/internal/games/{game_id}/register-runtime` with body `{engine_endpoint, members:[{user_id, race_name}], target_engine_version, turn_schedule}`.

**Flow on success:**

1. Validate request shape; reject with `invalid_request` if any required field is missing.
2. Reject with `conflict` if `runtime_records.{game_id}` already exists.
3. Resolve `image_ref` for `target_engine_version` from `engine_versions`; reject with `engine_version_not_found` when missing.
4. Persist `runtime_records` with `status=starting`, `engine_endpoint`, `current_image_ref`, `current_engine_version`, `turn_schedule`, and `created_at`.
5. Call engine `POST /api/v1/admin/init` with the race-name list derived from `members`.
6. Read `StateResponse` and persist one `player_mappings` row per player: `(game_id, user_id, race_name, engine_player_uuid)`.
7. CAS `runtime_records.status: starting → running`. Persist `current_turn=0` and `next_generation_at` computed from `turn_schedule`.
8. Append `operation_log` entry (`op_kind=register_runtime`, `outcome=success`).
9. Publish `runtime_snapshot_update` to `gm:lobby_events`.
10. Return `200` with the persisted `runtime_records` row.
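
Steps 5 and 6 hinge on matching the engine's players back to the roster Lobby sent. The following is a minimal sketch of that pairing, not the service's actual code: the `Member`, `EnginePlayer`, and `PlayerMapping` types and the `BuildPlayerMappings` helper are hypothetical names, and the real `StateResponse` field names may differ. It rejects a response that omits a roster race or reports a race outside the roster, which corresponds to the `engine_protocol_violation` outcome in the failure paths.

```go
package main

import (
	"errors"
	"fmt"
)

// Member mirrors one entry of the register-runtime `members` array.
type Member struct {
	UserID   string
	RaceName string
}

// EnginePlayer is a hypothetical projection of one player entry in the
// engine's StateResponse; the real field names may differ.
type EnginePlayer struct {
	RaceName string
	UUID     string
}

// PlayerMapping is one row destined for the player_mappings table.
type PlayerMapping struct {
	GameID           string
	UserID           string
	RaceName         string
	EnginePlayerUUID string
}

// ErrEngineProtocolViolation stands in for the engine_protocol_violation code.
var ErrEngineProtocolViolation = errors.New("engine_protocol_violation")

// BuildPlayerMappings pairs every roster member with the engine player that
// carries the same race name. It fails when the engine reports a race that is
// not in the roster or omits a race that is.
func BuildPlayerMappings(gameID string, members []Member, players []EnginePlayer) ([]PlayerMapping, error) {
	roster := make(map[string]struct{}, len(members))
	for _, m := range members {
		roster[m.RaceName] = struct{}{}
	}

	uuidByRace := make(map[string]string, len(players))
	for _, p := range players {
		if _, ok := roster[p.RaceName]; !ok {
			return nil, fmt.Errorf("%w: unexpected race %q", ErrEngineProtocolViolation, p.RaceName)
		}
		uuidByRace[p.RaceName] = p.UUID
	}

	// Rows are emitted in roster order, one per member.
	mappings := make([]PlayerMapping, 0, len(members))
	for _, m := range members {
		uuid, ok := uuidByRace[m.RaceName]
		if !ok {
			return nil, fmt.Errorf("%w: missing race %q", ErrEngineProtocolViolation, m.RaceName)
		}
		mappings = append(mappings, PlayerMapping{
			GameID:           gameID,
			UserID:           m.UserID,
			RaceName:         m.RaceName,
			EnginePlayerUUID: uuid,
		})
	}
	return mappings, nil
}

func main() {
	members := []Member{{UserID: "u1", RaceName: "Zorg"}, {UserID: "u2", RaceName: "Terran"}}
	players := []EnginePlayer{{RaceName: "Terran", UUID: "uuid-b"}, {RaceName: "Zorg", UUID: "uuid-a"}}

	rows, err := BuildPlayerMappings("game-1", members, players)
	fmt.Println(len(rows), err)
}
```

Validating the whole response before writing any row keeps the rollback simple: either every `player_mappings` row is produced or none is.
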

**Failure paths:**

| Failure | Side effect | Outcome to caller |
| --- | --- | --- |
| Invalid envelope | None | `400 invalid_request` |
| `runtime_records` already exists | None | `409 conflict` |
| Engine `/admin/init` returns 4xx | Roll back `runtime_records`; append failure to `operation_log` | `502 engine_validation_error` |
| Engine `/admin/init` returns 5xx or fails at the transport layer | Roll back; append failure | `502 engine_unreachable` |
| Engine response missing players or contains races not in roster | Roll back; append failure | `502 engine_protocol_violation` |
| PostgreSQL transaction failure | Roll back; append failure if possible | `503 service_unavailable` |

A failed `register-runtime` leaves no `runtime_records` row and no `player_mappings` rows. `Game Lobby` then transitions the platform game record to `paused` (per the architecture's flow §4 forced-pause path).

### Turn generation

**Triggered by:** the scheduler ticker when `now >= next_generation_at` for a game in `status=running`, or by an admin invocation of `force-next-turn`.

**Flow on success:**

1. CAS `runtime_records.status: running → generation_in_progress`. If the CAS fails (status changed concurrently), the tick is skipped silently.
2. Call engine `PUT /api/v1/admin/turn`. Engine returns `StateResponse` with the new `turn` and the updated `player[]` array.
3. Persist `runtime_records.current_turn` and refresh `runtime_records.engine_health` summary.
4. If `StateResponse.finished == true`:
   - CAS `runtime_records.status: generation_in_progress → finished`;
   - publish `game_finished` to `gm:lobby_events` with `{game_id, final_turn_number, finished_at_ms, player_turn_stats[]}`;
   - publish `game.finished` notification intent to all `active` members.
5. If `StateResponse.finished == false`:
   - CAS `runtime_records.status: generation_in_progress → running`;
   - recompute `next_generation_at` from `turn_schedule`. If `skip_next_tick=true`, advance by one extra cron step and clear the flag;
   - publish `runtime_snapshot_update` to `gm:lobby_events` with `{game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[]}`;
   - publish `game.turn.ready` notification intent to all `active` members.
6. Append `operation_log` entry (`op_kind=turn_generation`, `outcome=success`).

**Failure paths:**

| Failure | Side effect | Outcome |
| --- | --- | --- |
| Engine timeout / 5xx | CAS `status: generation_in_progress → generation_failed`; publish `runtime_snapshot_update`; publish `game.generation_failed` admin notification | Logged; ticker leaves the game in `generation_failed` until manual recovery (admin issues `force-next-turn` or `stop`). |
| Persistence failure after engine success | Append failure to `operation_log`; status stays `generation_in_progress` | Health-summary update on next probe will resync. |

`player_turn_stats[]` is built from `StateResponse.player[]` by mapping `raceName → user_id` through `player_mappings` and projecting `{user_id, planets, population}`. `ships_built` is intentionally absent (see [`./docs/stage01-architecture-sync.md`](./docs/stage01-architecture-sync.md)).

### Force-next-turn

**Triggered by:** `Admin Service` or system-admin via `POST /api/v1/internal/runtimes/{game_id}/force-next-turn`.

**Pre-conditions:** runtime exists, `status=running`.

**Flow:**

1. Run the turn-generation flow synchronously (the same code path the scheduler uses).
2. After success, set `runtime_records.skip_next_tick = true`. The next regular tick computed from `turn_schedule` is then advanced by one extra step before being persisted as `next_generation_at`.
3. Append `operation_log` entry (`op_kind=force_next_turn`).

The skip rule guarantees that the inter-turn spacing is never shorter than one schedule interval, regardless of when the force is issued.

### Game finish

The finish flow is driven entirely by the engine signal `finished:bool`.
GM never decides finish independently.

After `game_finished` is published, `Game Lobby` transitions its platform record to `finished`, runs the capability evaluation, and finalises Race Name Directory state. The GM record stays in `status=finished` indefinitely; cleanup is operator-driven.

### Banish (engine-side player removal)

**Triggered by:** `Game Lobby` synchronously calling `POST /api/v1/internal/games/{game_id}/race/{race_name}/banish` after a permanent membership removal at platform level.

**Pre-conditions:** runtime exists; `race_name` resolves to an existing `player_mappings` row.

**Flow:**

1. Call engine `POST /api/v1/admin/race/banish` with `{race_name}`.
2. On engine success, append `operation_log` entry (`op_kind=banish`, `outcome=success`).
3. Return `204` to Lobby.

**Failure path:** engine error returns `502 engine_unreachable`. Lobby treats this as a degraded state and may retry; the platform-level membership stays `removed` regardless.

### Stop

**Triggered by:** system-admin via `POST /api/v1/internal/runtimes/{game_id}/stop` with body `{reason}`, where `reason ∈ {admin_request, finished, timeout}`.

**Flow:**

1. Call `Runtime Manager` `POST /api/v1/internal/runtimes/{game_id}/stop` with the same `reason`.
2. CAS `runtime_records.status: * → stopped`.
3. Append `operation_log` entry.
4. Publish `runtime_snapshot_update` reflecting the stopped status.

### Patch

**Triggered by:** system-admin via `POST /api/v1/internal/runtimes/{game_id}/patch` with body `{version}`.

**Pre-conditions:**

- `engine_versions.{version}` exists with `status=active`;
- the new version is a semver-patch of the current version (same major and minor); otherwise reject with `semver_patch_only`.

**Flow:**

1. Resolve `image_ref` from `engine_versions.{version}`.
2. Call `Runtime Manager` `POST /api/v1/internal/runtimes/{game_id}/patch` with `{image_ref}`.
3. On success, persist new `current_image_ref` and `current_engine_version` on `runtime_records`.
4. Append `operation_log` entry.

The engine container is recreated by RTM with the same DNS name; the `engine_endpoint` is unchanged. GM does not call `/admin/init` again — the bind-mounted state directory is preserved and the engine resumes from the previous turn.

### Liveness reply (Lobby resume)

**Triggered by:** `Game Lobby` resuming a paused game, calling `GET /api/v1/internal/games/{game_id}/liveness`.

**Flow:** if `runtime_records.{game_id}` exists and `status=running`, return `200 {ready: true}`. Otherwise return `200 {ready: false, status: ""}`. This endpoint never calls the engine; it reflects GM's own view only.

## Hot Path

### Player commands and orders

Both `game.command.execute` and `game.order.put` use the same FlatBuffers schema (`pkg/schema/fbs/order.fbs` `Order{updated_at, commands:[…]}`). The gateway transcodes the verified payload to JSON via `pkg/transcoder/order.go` before calling GM.

**GM endpoints:**

- `POST /api/v1/internal/games/{game_id}/commands` — execute now; engine `PUT /api/v1/command`.
- `POST /api/v1/internal/games/{game_id}/orders` — validate-and-store; engine `PUT /api/v1/order`.

Both endpoints accept body `{commands:[{cmd_id, @type, …}, …]}` and the `X-User-ID` header. The actor field on the engine call is **always** set by GM from the authenticated user identity; GM never trusts a payload field for actor identification.

**Pre-conditions:**

- `runtime_records.{game_id}` exists with `status=running`;
- the user is an `active` member of the game (cache lookup);
- `player_mappings.(game_id, user_id)` exists.

**Errors:**

- `runtime_not_found` — runtime missing.
- `runtime_not_running` — `runtime_status` is anything other than `running`.
- `forbidden` — caller is not an active member.
- `engine_unreachable` — engine returned 5xx.
- `engine_validation_error` — engine returned 4xx; the body carries the engine's per-command result (`cmd_applied`, `cmd_error_code`).
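
The pre-conditions and error codes above can be read as two small decision functions. The sketch below is illustrative only: `Precheck`, `ClassifyEngineResult`, and the `RuntimeRecord` struct are hypothetical names, and the order in which the pre-conditions are evaluated is an assumption. It reports a missing `player_mappings` row as `forbidden` and treats transport failures the same as 5xx responses, as the error model does.

```go
package main

import "fmt"

// RuntimeRecord holds only the field the hot-path pre-check needs.
type RuntimeRecord struct {
	Status string
}

// Precheck returns the stable error code for a failed pre-condition, or ""
// when the call may proceed. rec is nil when no runtime_records row exists.
// The evaluation order is an assumption of this sketch.
func Precheck(rec *RuntimeRecord, activeMember, hasMapping bool) string {
	switch {
	case rec == nil:
		return "runtime_not_found"
	case rec.Status != "running":
		return "runtime_not_running"
	case !activeMember || !hasMapping:
		// A missing player_mappings row is reported as forbidden as well.
		return "forbidden"
	default:
		return ""
	}
}

// ClassifyEngineResult maps the outcome of an engine call to an error code.
// status is the HTTP status; transportErr is non-nil when no response arrived.
func ClassifyEngineResult(status int, transportErr error) string {
	switch {
	case transportErr != nil || status >= 500:
		return "engine_unreachable"
	case status >= 400:
		return "engine_validation_error"
	default:
		return ""
	}
}

func main() {
	fmt.Println(Precheck(nil, true, true))
	fmt.Println(Precheck(&RuntimeRecord{Status: "stopped"}, true, true))
	fmt.Println(Precheck(&RuntimeRecord{Status: "running"}, true, false))
	fmt.Println(ClassifyEngineResult(503, nil))
	fmt.Println(ClassifyEngineResult(422, nil))
}
```

Keeping both functions free of I/O makes the mapping easy to cover with table-driven tests.
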

### Reports

**GM endpoint:** `GET /api/v1/internal/games/{game_id}/reports/{turn}` with the `X-User-ID` header.

**Flow:**

1. Authorise: caller must be an active member of the game.
2. Resolve `race_name` from `player_mappings`.
3. Call engine `GET /api/v1/report?player={race_name}&turn={turn}`.
4. Return the engine response verbatim.

Reports are full per-player payloads and are never cached at the platform layer; the engine remains the source of truth.

### Membership cache and invalidation

GM holds an in-process per-game TTL cache (default 30 s) of memberships loaded from `Lobby /api/v1/internal/games/{id}/memberships`. The cache shape is `map[user_id]MembershipStatus` plus a load timestamp.

TTL is the safety-net fallback. The primary invalidation mechanism is an explicit hook from Lobby:

- Endpoint: `POST /api/v1/internal/games/{game_id}/memberships/invalidate`.
- Lobby invokes it post-commit on every operation that mutates roster: application approval, application rejection, invite redeem, member remove, member block, user-lifecycle cascade.
- Failed invalidation does not roll back Lobby state; the TTL safety net catches stale data within the next 30 s.

This is a deliberate tight coupling. The trade-off is recorded in [`./PLAN.md` Stage 16](./PLAN.md).

## Engine Version Registry

The registry is the source of truth for which engine versions are deployable. CRUD is exposed on the GM internal port; `Game Lobby` consumes it synchronously to resolve `image_ref` for `target_engine_version` just before publishing a `runtime:start_jobs` envelope.

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/api/v1/internal/engine-versions` | List versions; supports `status` filter. |
| `POST` | `/api/v1/internal/engine-versions` | Create a new version with `version`, `image_ref`, optional `options`. Validates semver shape and Docker reference. |
| `GET` | `/api/v1/internal/engine-versions/{version}` | Read one version. |
| `PATCH` | `/api/v1/internal/engine-versions/{version}` | Update `image_ref`, `options`, or `status`. |
| `DELETE` | `/api/v1/internal/engine-versions/{version}` | Soft-deprecate (`status=deprecated`). Hard delete is rejected if the version is referenced by any non-finished `runtime_records` row. |
| `GET` | `/api/v1/internal/engine-versions/{version}/image-ref` | Resolve `image_ref` only. Used by Lobby's start flow. |

`options` is a free-form `jsonb` document stored verbatim. v1 does not enforce a schema; future engine-side options follow the engine's own contract.

`status` values: `active` (deployable), `deprecated` (rejected on new starts; existing runtimes unaffected). Hard removal of a deprecated version requires that no runtime references it.

Lobby resolves `image_ref` synchronously per game start. If the resolve call fails or the version is missing, Lobby fails the start with `engine_version_not_found` and never publishes `runtime:start_jobs`.

## Trusted Surfaces

### Internal REST

The internal REST surface is consumed by:

- `Edge Gateway` — verified player commands and report reads;
- `Game Lobby` — register-runtime, image-ref resolve, membership invalidate, banish, liveness reply;
- `Admin Service` (future) — full administrative operations;
- platform probes — `/healthz`, `/readyz`.

The listener is unauthenticated; downstream services rely on network segmentation. Caller identity for audit is recorded from the optional `X-Galaxy-Caller` header (`gateway`, `lobby`, `admin`) and reflected as `op_source` in `operation_log` (`gateway_player`, `lobby_internal`, `admin_rest`); when missing or unrecognised, GM defaults to `op_source=admin_rest`. For player-command endpoints, the additional `X-User-ID` header is required and authoritative for the acting user identity.

Request and response shapes are defined in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml). Unknown JSON fields are rejected with `invalid_request`.
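
Two of the rules above lend themselves to a short illustration: deriving `op_source` from `X-Galaxy-Caller` and rejecting unknown JSON fields. The sketch below makes two assumptions: the caller values map to the `op_source` values in the order they are listed, and `OpSource`, `DecodeStrict`, and `StopRequest` are hypothetical names rather than the service's real identifiers. It relies on the standard library's `json.Decoder.DisallowUnknownFields`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// OpSource maps an X-Galaxy-Caller header value to the op_source recorded in
// operation_log. Missing or unrecognised values fall back to admin_rest.
func OpSource(caller string) string {
	switch caller {
	case "gateway":
		return "gateway_player"
	case "lobby":
		return "lobby_internal"
	default: // "admin", empty, or unrecognised
		return "admin_rest"
	}
}

// StopRequest is a hypothetical struct for the body of the stop endpoint.
type StopRequest struct {
	Reason string `json:"reason"`
}

// DecodeStrict decodes a JSON body into dst and rejects unknown fields,
// wrapping any decode error with the invalid_request code.
func DecodeStrict(body []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return fmt.Errorf("invalid_request: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(OpSource("lobby"))
	fmt.Println(OpSource(""))

	var req StopRequest
	fmt.Println(DecodeStrict([]byte(`{"reason":"admin_request"}`), &req), req.Reason)
	fmt.Println(DecodeStrict([]byte(`{"reason":"timeout","extra":1}`), &req))
}
```

Because the header is optional and unauthenticated, the resolved `op_source` is an audit hint, not an authorisation decision.
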

## Async Stream Contracts

### `gm:lobby_events` (out)

Producer: `Game Master`. Consumer: `Game Lobby`.

Two message types share the stream, discriminated by `event_type`:

| `event_type` | Body |
| --- | --- |
| `runtime_snapshot_update` | `{game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats:[{user_id, planets, population}], occurred_at_ms}` |
| `game_finished` | `{game_id, final_turn_number, runtime_status:"finished", player_turn_stats:[…], finished_at_ms}` |

Publication cadence: events only. GM publishes a snapshot when:

- a turn was generated (success or failure);
- `runtime_status` transitioned (e.g., `running ↔ generation_in_progress`, `running → engine_unreachable`, `* → finished`);
- `engine_health_summary` changed in response to a `runtime:health_events` observation (debounced — duplicates are suppressed when the summary did not change).

There is no periodic heartbeat. `Game Lobby` consumes these events to update its denormalised runtime snapshot and to feed the per-game `player_turn_stats` aggregate used at game finish.

The first `runtime_snapshot_update` published right after a successful `register-runtime` carries `player_turn_stats` projected from the engine `/admin/init` response — the per-player baseline (`planets`, `population`) at turn 0. Lobby treats this baseline as the reference point against which subsequent turn deltas are measured. For other status transitions that fire without a fresh engine state payload (e.g., a pure health-summary change), `player_turn_stats` is empty.

The full schema is enforced by [`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml).

### `runtime:health_events` (in)

Producer: `Runtime Manager`. Consumer: `Game Master`.

GM consumes the stream to update `runtime_records.engine_health` summary per game. The schema is owned by `Runtime Manager` and documented in [`../rtmanager/api/runtime-health-asyncapi.yaml`](../rtmanager/api/runtime-health-asyncapi.yaml).
GM never modifies `runtime:health_events`; it is read-only.

GM does not publish notifications in response to runtime health changes in v1; the operator surface is `gm:lobby_events` plus the GM REST inspect endpoints.

## Notification Contracts

`Game Master` publishes notification intents to `notification:intents` using the shared `pkg/notificationintent` producer module:

| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Successful turn generation | `game.turn.ready` | active members of the game | `push+email` |
| Game finish | `game.finished` | active members of the game | `push+email` |
| Turn generation failed | `game.generation_failed` | configured admin email list | `email` |

Recipient resolution: GM materialises `recipient_user_ids` from its own membership cache (loaded from Lobby) at publish time; admin recipients are resolved by `Notification Service` from configuration.

A failed publication is a notification degradation and must not roll back already committed runtime state. Failed publications are logged and counted via `gamemaster.notification.publish_attempts`.

## Persistence Layout

### PostgreSQL durable state (schema `gamemaster`)

| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game; latest known runtime status and scheduling state. | `game_id` |
| `engine_versions` | Engine version registry. | `version` |
| `player_mappings` | `(game_id, user_id) → race_name + engine_player_uuid`. | composite `(game_id, user_id)` |
| `operation_log` | Append-only audit of every GM operation. | `id` (auto) |

`runtime_records` columns:

- `game_id` — primary key, references Lobby's identifier.
- `status` — `starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `current_image_ref` — Docker reference of the running image.
- `current_engine_version` — semver string registered in `engine_versions`.
- `turn_schedule` — five-field cron expression copied from Lobby.
- `current_turn` — last completed turn number; `0` until the first turn generates.
- `next_generation_at` — UTC timestamp of the next due tick.
- `skip_next_tick` — boolean; set by `force-next-turn`, cleared after the first cron step is skipped.
- `engine_health` — short text summary derived from `runtime:health_events`.
- `created_at`, `updated_at`, `started_at`, `stopped_at`, `finished_at` — lifecycle timestamps.

`engine_versions` columns:

- `version` — primary key; semver string.
- `image_ref` — non-empty Docker reference.
- `options` — `jsonb`, free-form, default `'{}'`.
- `status` — `active | deprecated`.
- `created_at`, `updated_at`.

`player_mappings` columns:

- composite primary key `(game_id, user_id)`.
- `race_name` — non-empty string; unique per `game_id`.
- `engine_player_uuid` — UUID returned by the engine `/admin/init`.
- `created_at`.

`operation_log` columns:

- `id`, `game_id`, `op_kind` (`register_runtime | turn_generation | force_next_turn | banish | stop | patch | engine_version_create | engine_version_update | engine_version_deprecate | engine_version_delete`), `op_source`, `source_ref` (request id when known), `outcome` (`success | failure`), `error_code`, `error_message`, `started_at`, `finished_at`.

For engine-version registry entries (`op_kind` starting with `engine_version_`), the `game_id` column doubles as the audit subject and stores the canonical `version` string instead of a platform game identifier; the registry is global, not per-game. The convention is documented in [`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md).

Indexes:

- `runtime_records (status, next_generation_at)` — drives the scheduler ticker scan.
- `operation_log (game_id, started_at DESC)` — drives audit reads.
- UNIQUE on `player_mappings (game_id, race_name)` — one-race-per-game invariant.
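
The `operation_log` audit-subject convention described above (the `game_id` column carrying a `version` string for `engine_version_*` operations) can be made explicit when reading audit rows. The following is a minimal sketch, not GM's actual code: `auditSubject` and `subjectOf` are hypothetical names introduced here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// engineVersionPrefix is the op_kind prefix shared by all registry operations.
const engineVersionPrefix = "engine_version_"

// auditSubject is a hypothetical view over operation_log.game_id that makes
// the overloaded meaning of the column explicit.
type auditSubject struct {
	Kind  string // "engine_version" or "game"
	Value string // canonical version string or platform game identifier
}

// subjectOf interprets the raw game_id column according to op_kind:
// registry operations store the engine version there, everything else
// stores a platform game identifier.
func subjectOf(opKind, gameIDColumn string) auditSubject {
	if strings.HasPrefix(opKind, engineVersionPrefix) {
		return auditSubject{Kind: "engine_version", Value: gameIDColumn}
	}
	return auditSubject{Kind: "game", Value: gameIDColumn}
}

func main() {
	fmt.Println(subjectOf("engine_version_deprecate", "1.4.2"))
	fmt.Println(subjectOf("turn_generation", "game-123"))
}
```

Keeping the interpretation in one helper means audit readers never compare a version string against a game identifier by accident.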

Per-game roster reads (`WHERE game_id = $1`) are served by the leftmost prefix of the composite primary key on `player_mappings (game_id, user_id)`; no extra single-column index is added.

Migrations are embedded as a single `00001_init.sql` file (single-init pre-launch policy from `ARCHITECTURE.md §Persistence Backends`).

### Redis runtime-coordination state

| Key shape | Purpose |
| --- | --- |
| `gamemaster:stream_offsets:{label}` | Last processed entry id per consumer (`health_events`). Same shape as Lobby and RTM. |

GM does not persist the membership cache to Redis in v1; the cache is in-process. This trade-off is documented in [`./PLAN.md` Stage 16](./PLAN.md).

## Error Model

Error envelope: `{ "error": { "code": "...", "message": "..." } }`, identical to Lobby and RTM.

Stable error codes:

| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `runtime_not_found` | `runtime_records.{game_id}` does not exist. |
| `runtime_not_running` | Operation requires `status=running`. |
| `conflict` | State transition not allowed. |
| `forbidden` | Caller is not an active member or not authorised. |
| `engine_version_not_found` | `engine_versions.{version}` does not exist. |
| `engine_version_in_use` | Hard-delete attempt against a version referenced by a non-finished runtime. |
| `semver_patch_only` | Patch attempt across major/minor boundary. |
| `engine_unreachable` | Engine returned 5xx or connection error. |
| `engine_protocol_violation` | Engine response missing required fields or carries unexpected payload. |
| `engine_validation_error` | Engine returned 4xx with per-command results. |
| `service_unavailable` | Dependency (PostgreSQL, Redis, Lobby, RTM) unavailable. |
| `internal_error` | Unspecified failure. |

## Configuration

All variables use the `GAMEMASTER_` prefix. The service fails fast on startup when a required variable is missing.
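
A fail-fast check over the required variables listed in this section could look like the sketch below. It is illustrative only: `requiredVars` and `missingRequired` are hypothetical names, and treating an empty value as missing is an assumption, not a documented GM rule.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredVars mirrors the "Required" list in this README.
var requiredVars = []string{
	"GAMEMASTER_INTERNAL_HTTP_ADDR",
	"GAMEMASTER_POSTGRES_PRIMARY_DSN",
	"GAMEMASTER_REDIS_MASTER_ADDR",
	"GAMEMASTER_REDIS_PASSWORD",
	"GAMEMASTER_LOBBY_INTERNAL_BASE_URL",
	"GAMEMASTER_RTM_INTERNAL_BASE_URL",
}

// missingRequired returns every required variable that lookup reports as
// unset or blank (assumption: blank counts as missing). The lookup parameter
// has the same signature as os.LookupEnv.
func missingRequired(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || strings.TrimSpace(v) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Demo with an in-memory environment. A real entry point would pass
	// os.LookupEnv and exit non-zero when the returned slice is non-empty.
	env := map[string]string{"GAMEMASTER_INTERNAL_HTTP_ADDR": ":8097"}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }
	fmt.Println(strings.Join(missingRequired(lookup), "\n"))
}
```

Collecting all missing names before exiting reports every gap in one startup attempt instead of one per restart.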

### Required

- `GAMEMASTER_INTERNAL_HTTP_ADDR`
- `GAMEMASTER_POSTGRES_PRIMARY_DSN`
- `GAMEMASTER_REDIS_MASTER_ADDR`
- `GAMEMASTER_REDIS_PASSWORD`
- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`
- `GAMEMASTER_RTM_INTERNAL_BASE_URL`

### Configuration groups

**Listener:**

- `GAMEMASTER_INTERNAL_HTTP_ADDR` (e.g., `:8097`).
- `GAMEMASTER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `GAMEMASTER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `30s`).
- `GAMEMASTER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).

**PostgreSQL:**

- `GAMEMASTER_POSTGRES_PRIMARY_DSN` (`postgres://gamemaster:@:5432/galaxy?search_path=gamemaster&sslmode=disable`).
- `GAMEMASTER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used in v1).
- `GAMEMASTER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `GAMEMASTER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `GAMEMASTER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `GAMEMASTER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).

**Redis:**

- `GAMEMASTER_REDIS_MASTER_ADDR`.
- `GAMEMASTER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `GAMEMASTER_REDIS_PASSWORD`.
- `GAMEMASTER_REDIS_DB` (default `0`).
- `GAMEMASTER_REDIS_OPERATION_TIMEOUT` (default `2s`).

**Streams:**

- `GAMEMASTER_REDIS_LOBBY_EVENTS_STREAM` (default `gm:lobby_events`).
- `GAMEMASTER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `GAMEMASTER_REDIS_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
- `GAMEMASTER_STREAM_BLOCK_TIMEOUT` (default `5s`).

**Engine client:**

- `GAMEMASTER_ENGINE_CALL_TIMEOUT` (default `30s` — covers turn generation on large games).
- `GAMEMASTER_ENGINE_PROBE_TIMEOUT` (default `5s` — for inspect-style reads).

**Lobby internal client:**

- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`.
- `GAMEMASTER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).

**Runtime Manager internal client:**

- `GAMEMASTER_RTM_INTERNAL_BASE_URL`.
- `GAMEMASTER_RTM_INTERNAL_TIMEOUT` (default `5s`).

**Scheduler:**

- `GAMEMASTER_SCHEDULER_TICK_INTERVAL` (default `1s`).
- `GAMEMASTER_TURN_GENERATION_TIMEOUT` (default `60s`).

**Membership cache:**

- `GAMEMASTER_MEMBERSHIP_CACHE_TTL` (default `30s`).
- `GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES` (default `4096`; LRU eviction).

**Logging:**

- `GAMEMASTER_LOG_LEVEL` (default `info`).

**Lifecycle:**

- `GAMEMASTER_SHUTDOWN_TIMEOUT` (default `30s`).

**Telemetry:** uses the standard OTLP env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_PROTOCOL`, etc.) shared with other Galaxy services.

## Observability

### Metrics (OpenTelemetry, low cardinality)

- `gamemaster.register_runtime.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.turn_generation.outcomes` — counter; labels `outcome`, `error_code`, `trigger` (`scheduler | force`).
- `gamemaster.command_execute.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.order_put.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.report_get.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.banish.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.engine_call.latency` — histogram; label `op` (`init | status | turn | banish | command | order | report`).
- `gamemaster.runtime_records_by_status` — gauge; label `status`.
- `gamemaster.scheduler.due_games` — gauge.
- `gamemaster.health_events.consumed` — counter.
- `gamemaster.lobby_events.published` — counter; label `event_type`.
- `gamemaster.notification.publish_attempts` — counter; labels `notification_type`, `result` (`ok | error`).
- `gamemaster.membership_cache.hits` — counter; label `result` (`hit | miss | invalidate`).
- `gamemaster.engine_versions_total` — gauge.

Metrics avoid high-cardinality attributes such as `game_id` and `user_id`.

### Structured logs (slog JSON to stdout)

Common fields on every entry: `service=gamemaster`, `request_id`, `trace_id`, `span_id`, `game_id` (when known), `user_id` (when known), `op_kind`, `op_source`, `outcome`, `error_code`.
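
As an illustration of the common fields above, the sketch below builds a `log/slog` JSON logger that stamps `service=gamemaster` on every entry and emits one turn-generation record. The helper names (`newLogger`, `logTurnGenerated`, `sampleEntry`), the sample values, and the use of `scheduler` as an `op_source` value are assumptions made for the example; `trace_id` and `span_id` are omitted because they would come from the tracing context.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger whose entries always carry service=gamemaster.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", "gamemaster")
}

// logTurnGenerated emits one entry with the common fields plus the
// worker-specific turn field. All values are illustrative.
func logTurnGenerated(log *slog.Logger, gameID string, turn int) {
	log.Info("turn generated",
		"request_id", "req-1",
		"game_id", gameID,
		"op_kind", "turn_generation",
		"op_source", "scheduler", // assumed value
		"outcome", "success",
		"turn", turn,
	)
}

// sampleEntry logs into a buffer and decodes the single JSON line, which is
// handy for asserting on field names in tests.
func sampleEntry() (map[string]any, error) {
	var buf bytes.Buffer
	logTurnGenerated(newLogger(&buf), "game-123", 7)
	var entry map[string]any
	err := json.Unmarshal(buf.Bytes(), &entry)
	return entry, err
}

func main() {
	logTurnGenerated(newLogger(os.Stdout), "game-123", 7)
}
```

Binding `service` once with `With` keeps call sites short and guarantees the field is present on every entry.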

Worker-specific fields: `event_type` (lobby-events publisher), `stream_entry_id` (health-events consumer), `turn` (turn-generation), `engine_endpoint` (engine calls).

## Verification

Service-level (per [`./PLAN.md`](./PLAN.md)):

- Unit tests for every service-layer operation against mocked engine, Lobby, RTM, notification publisher, lobby-events publisher.
- Adapter tests using `testcontainers-go` for PostgreSQL and Redis.
- Contract tests for `internal-openapi.yaml` and `runtime-events-asyncapi.yaml`.

Service-local integration suite under `gamemaster/integration/`:

- Register-runtime + first turn happy path against the real `galaxy/game` test image.
- Force-next-turn skip behaviour.
- Engine version registry CRUD + resolve.
- Admin stop synchronous REST.
- Banish round-trip.
- Membership invalidation hook.
- `runtime:health_events` consumption.

Inter-service suite under `integration/lobbygm/` and `integration/lobbygmrtm/`:

- `lobbygm`: real Lobby + real GM + real engine + stub RTM. Covers enrollment → register-runtime → first turn → finish + capability evaluation.
- `lobbygmrtm`: full Lobby + GM + RTM + engine. Covers happy path and the documented failure paths from `ARCHITECTURE.md` flow §4.

Manual smoke (development):

```sh
docker network create galaxy-net  # once
# The DSN is quoted so the shell does not interpret `?` and `&`.
GAMEMASTER_INTERNAL_HTTP_ADDR=:8097 \
GAMEMASTER_POSTGRES_PRIMARY_DSN='postgres://gamemaster:secret@localhost:5432/galaxy?search_path=gamemaster&sslmode=disable' \
GAMEMASTER_REDIS_MASTER_ADDR=localhost:6379 \
GAMEMASTER_REDIS_PASSWORD=secret \
GAMEMASTER_LOBBY_INTERNAL_BASE_URL=http://localhost:8095 \
GAMEMASTER_RTM_INTERNAL_BASE_URL=http://localhost:8096 \
... go run ./gamemaster/cmd/gamemaster
```

After start, `curl http://localhost:8097/readyz` returns `200`. Driving Lobby through its public start flow brings up `galaxy-game-{game_id}` containers; GM then registers each runtime, generates turns on the configured schedule, and propagates events to Lobby.
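
When the smoke check is scripted rather than run by hand, the `/readyz` probe can be polled until it returns `200`. The sketch below is one possible shape, not part of GM: `waitReady` and `httpProbe` are hypothetical helpers, and the stand-in server that answers `503` before becoming ready is an assumption made so the example runs without a live service.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitReady calls probe up to attempts times, sleeping interval between
// tries, and returns nil as soon as probe reports HTTP 200.
func waitReady(probe func() (int, error), attempts int, interval time.Duration) error {
	lastErr := fmt.Errorf("no attempts made")
	for i := 0; i < attempts; i++ {
		status, err := probe()
		if err == nil && status == http.StatusOK {
			return nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("readyz returned status %d", status)
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("not ready after %d attempts: %w", attempts, lastErr)
}

// httpProbe returns a probe that issues GET url and reports the status code.
func httpProbe(client *http.Client, url string) func() (int, error) {
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	// Stand-in for GM: answers 503 twice, then 200.
	var calls atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) < 3 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	err := waitReady(httpProbe(srv.Client(), srv.URL+"/readyz"), 5, 10*time.Millisecond)
	fmt.Println("ready:", err == nil)
}
```

Against a locally started GM, the probe URL would be `http://localhost:8097/readyz`, matching the manual `curl` check.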