# Game Master

`Game Master` (GM) is the only Galaxy platform service permitted to talk to
running game engine containers. It owns the runtime and operational state of
running games, the engine version registry, the platform mapping of
`(user_id ↔ race_name ↔ engine_player_uuid)`, the per-game turn scheduler,
and the synchronous and asynchronous boundaries that other services use to
interact with running games.

## References

- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §8 Game
  Master.
- [`../TESTING.md`](../TESTING.md) §8 — testing matrix for GM.
- [`./PLAN.md`](./PLAN.md) — staged implementation plan.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry
  point (created at PLAN stage 24).
- [`./docs/stage06-contract-files.md`](./docs/stage06-contract-files.md) —
  decisions behind the OpenAPI and AsyncAPI specs frozen at PLAN stage 06.
- [`./docs/stage07-notification-catalog-audit.md`](./docs/stage07-notification-catalog-audit.md) —
  notification catalog audit and producer-side freeze test added at PLAN stage 07.
- [`./docs/stage08-module-skeleton.md`](./docs/stage08-module-skeleton.md) —
  module skeleton wiring decisions (config groups, telemetry instruments,
  Makefile targets, deferred dependencies) recorded at PLAN stage 08.
- [`./docs/stage09-postgres-migration.md`](./docs/stage09-postgres-migration.md) —
  PostgreSQL schema, embedded migration, jet generation pipeline, and
  runtime wiring landed at PLAN stage 09.
- [`./docs/stage10-domain-and-ports.md`](./docs/stage10-domain-and-ports.md) —
  domain types, port interfaces, and the six stage-10 decisions
  (operation domain package, membership DTO placement, engine-version
  options shape, schedule wrapper signature, recovery transition,
  deferred mock destination) landed at PLAN stage 10.
- [`./docs/stage11-persistence-adapters.md`](./docs/stage11-persistence-adapters.md) —
  PostgreSQL stores (`runtimerecordstore`, `engineversionstore`,
  `playermappingstore`, `operationlog`), the Redis offset store, and
  the eight stage-11 decisions (sqlx/pgtest local clones, CAS
  pattern, port-level Now extension, domain conflict sentinels, jsonb
  cast, idempotent Deprecate, multi-row BulkInsert, miniredis
  dependency) landed at PLAN stage 11.
- [`./docs/stage12-external-clients.md`](./docs/stage12-external-clients.md) —
  outbound adapters (engine, Lobby, Runtime Manager, notification
  intent publisher, lobby-events publisher) and the seven stage-12
  decisions (per-call engine base URL, dual engine timeout dispatch,
  engine population rounding, Lobby pagination cap, no extra RTM
  sentinels, AsyncAPI-aligned XADD encoding for `gm:lobby_events`,
  Makefile mocks-target guard) landed at PLAN stage 12.
- [`./docs/stage13-register-runtime.md`](./docs/stage13-register-runtime.md) —
  register-runtime service-layer orchestrator and the five
  stage-13 decisions (`RuntimeRecordStore.Delete` extension, engine
  4xx/5xx classification split, engine response validated as
  `engine_protocol_violation`, initial snapshot carries `player_turn_stats`
  from `/admin/init`, two-flag rollback gating) landed at PLAN
  stage 13.
- [`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md) —
  engine version registry service-layer orchestrator (List, Get,
  Create, Update, Deprecate, Delete, ResolveImageRef) and the five
  stage-14 decisions (`EngineVersionStore.Delete` port extension,
  reference probe before hard delete, new `engine_version_delete`
  op_kind in schema and domain, `operation_log.game_id` overloaded
  as audit subject for registry entries, JSON-object validation for
  `options`) landed at PLAN stage 14.
- [`./docs/stage15-scheduler-and-turn-generation.md`](./docs/stage15-scheduler-and-turn-generation.md) —
  scheduler ticker, turn-generation orchestrator, and snapshot
  publisher and the seven stage-15 decisions
  (`LobbyClient.GetGameSummary` extension with fail-soft `game_name`
  fallback, telemetry-only `Trigger` parameter, two-CAS pattern with
  external-mutation conflict, single-snapshot-per-outcome cadence,
  player_mappings as recipient source, stateless scheduler utility,
  in-flight set on the ticker) landed at PLAN stage 15.
- [`./docs/stage16-membership-cache-and-invalidation.md`](./docs/stage16-membership-cache-and-invalidation.md) —
  hot-path services (`commandexecute`, `orderput`, `reportget`),
  membership cache, and the six stage-16 decisions (no
  `runtime_not_running` for reports, GM-side envelope rewrite
  `commands`→`cmd` with injected `actor`, hot-path skips
  `operation_log`, hand-rolled per-game inflight tracker, raw status
  string return, missing-mapping surfaces as `forbidden`) landed at
  PLAN stage 16.
- [`./docs/stage17-admin-operations.md`](./docs/stage17-admin-operations.md) —
  admin service-layer operations (`adminstop`, `adminforce`,
  `adminpatch`, `adminbanish`, `livenessreply`) and the six
  stage-17 decisions (`RuntimeRecordStore.UpdateImage` extension,
  `adminstop` idempotent on terminal statuses and `conflict` on
  `starting`, `adminforce` always sets `skip_next_tick`,
  `adminbanish` without status check and missing race surfaces as
  `forbidden`, `livenessreply` 200 + empty status on
  `runtime_not_found`, RTM failures map to `service_unavailable`)
  landed at PLAN stage 17.
- [`./docs/stage18-health-events-consumer.md`](./docs/stage18-health-events-consumer.md) —
  `runtime:health_events` consumer worker and the seven stage-18
  decisions (event-type taxonomy expanded to seven values with
  `container_started` and `probe_recovered`, CAS-conflict fallback to
  health-only update, new `RuntimeRecordStore.UpdateEngineHealth`
  port method, in-memory dedupe of last-emitted summaries,
  read-after-write snapshot construction, `health_events` stream
  offset label, worker wiring deferred to Stage 19) landed at PLAN
  stage 18.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — internal
  trusted REST contract.
- [`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml) —
  `gm:lobby_events` Redis Stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract
  (env, ports, admin and player REST surfaces, `/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with GM.
- [`../rtmanager/README.md`](../rtmanager/README.md) — Runtime Manager
  contract used synchronously by GM admin operations.

## Purpose

A running Galaxy game lives in exactly one Docker container managed by
`Runtime Manager`. The platform must:

- register a freshly started container with platform-level membership;
- initialise the engine with the agreed race roster;
- accept and forward player commands and orders to the engine;
- route per-player report reads;
- generate turns according to a schedule;
- detect game finish and propagate it back to platform-level state;
- expose runtime/operational controls (force-next-turn, stop, patch, banish);
- own the catalogue of supported engine versions and resolve `image_ref`
  values for `Game Lobby`.

`Game Master` is the single component that performs these actions. It does
**not** own platform metadata of games (that is `Game Lobby`), Docker control
(that is `Runtime Manager`), or the full game state (that is the engine
container). Engine state on disk is the engine's domain; GM never reads or
writes the bind-mounted state directory.

## Scope

`Game Master` is the source of truth for:

- the runtime mapping `game_id → engine_endpoint` for every running game;
- the runtime status (`starting | running | generation_in_progress |
  generation_failed | stopped | engine_unreachable | finished`);
- the current turn number and the next-tick timestamp;
- the per-game `(user_id, race_name, engine_player_uuid)` triple;
- the engine version registry: `(version, image_ref, options, status)`;
- the durable history of every operation GM performed (`operation_log`);
- the latest engine health summary per game.

`Game Master` is **not** the source of truth for:

- platform game records (created, draft, enrollment, finished metadata) —
  owned by `Game Lobby`;
- container lifecycle and Docker reality — owned by `Runtime Manager`;
- in-game world state (planets, ships, science, reports) — owned by the
  engine container;
- platform user identity and entitlements — owned by `User Service`;
- in-game `race_name` reservations and the Race Name Directory — owned by
  `Game Lobby`.

## Non-Goals

- Multi-instance operation in v1. GM runs as a single process; the in-process
  scheduler is authoritative. Multi-instance with leader election is an
  explicit future iteration.
- Direct Docker access. GM never imports the Docker SDK; every container
  operation goes through `Runtime Manager` over trusted internal REST.
- Player removal/block at platform level. `Game Lobby` owns that decision;
  GM only performs the engine-side `banish` call when explicitly invoked.
- Pause/resume of a running game on the platform side. `Game Lobby.paused`
  is a platform-only state; GM only answers a liveness probe used by
  Lobby's resume flow.
- Automatic semver-patch upgrades. Patch is always an explicit admin
  operation against a target engine version present in the registry.
- TLS or mTLS on the internal listener. GM trusts its network segment.
- Direct delivery of player-visible push events. `Notification Service`
  owns user-targeted push delivery; GM publishes notification intents only.
- A separate Admin Service. GM exposes its trusted internal REST surface;
  Admin Service will adopt it in a later iteration.
- Engine state file management. Backup, archival, and cleanup of the
  bind-mounted state directories are operator concerns.

## Position in the System

```mermaid
flowchart LR
    Gateway["Edge Gateway"]
    Lobby["Game Lobby"]
    Admin["Admin Service\n(future)"]
    GM["Game Master"]
    RTM["Runtime Manager"]
    Notify["Notification Service"]
    Engine["Game Engine container\n(galaxy/game)"]
    Postgres["PostgreSQL\nschema gamemaster"]
    Redis["Redis\nstreams + caches"]

    Gateway -- "verified player commands\n(REST/JSON)" --> GM
    Lobby -- "register-runtime,\nimage-ref resolve,\nmemberships invalidate" --> GM
    Admin -- "internal REST" --> GM
    GM -- "engine HTTP API" --> Engine
    GM -- "stop / restart / patch" --> RTM
    GM -- "notification:intents" --> Notify
    GM -- "gm:lobby_events" --> Redis
    Redis -- "runtime:health_events" --> GM
    GM --> Postgres
```

`Edge Gateway` routes verified player message types (`game.command.execute`,
`game.order.put`, `game.report.get`) to GM as trusted REST/JSON after
transcoding from FlatBuffers. `Game Lobby` calls GM synchronously to
register runtimes after a successful container start, to resolve `image_ref`
from the engine version registry, to invalidate membership cache on roster
changes, and to verify GM liveness during platform resume. `Game Master`
calls `Runtime Manager` synchronously over REST for stop, restart, and
patch. `Runtime Manager` publishes `runtime:health_events`, which GM
consumes asynchronously. GM publishes `gm:lobby_events` consumed by
`Game Lobby`, and `notification:intents` consumed by `Notification Service`.

## Responsibility Boundaries

`Game Master` is responsible for:

- registering a freshly started container into platform-level runtime state;
- initialising the engine with the race roster received from Lobby;
- maintaining the platform mapping of `user_id`, `race_name`, and
  `engine_player_uuid`;
- forwarding player commands, orders, and report reads to the engine after
  authorising the actor;
- generating turns on schedule, including the force-next-turn skip rule;
- evaluating engine finish on every turn boundary;
- publishing runtime snapshot updates and the final game-finish event;
- consuming runtime health events from `Runtime Manager` and updating its
  per-game health summary;
- exposing the engine version registry CRUD;
- driving admin-level runtime operations (stop, force-next-turn, patch,
  banish) by calling `Runtime Manager` and the engine on demand.

`Game Master` is not responsible for:

- creating or stopping containers on Docker (that is `Runtime Manager`);
- evaluating whether a game is allowed to start (that is `Game Lobby`);
- deriving recipient user lists for non-game notifications (that is
  `Notification Service`);
- verifying authenticated transport, signatures, freshness, and replay
  (that is `Edge Gateway`);
- mapping `user_id` to platform-level membership (that is `Game Lobby`).

## Engine Container Contract

The engine container is `galaxy/game`. GM calls two route classes, admin and
player; the probe route is listed for completeness:

| Class | Path | Purpose |
| --- | --- | --- |
| Admin (GM-only) | `POST /api/v1/admin/init` | Initialise the engine with a race roster. |
| Admin (GM-only) | `GET /api/v1/admin/status` | Read the full game state. |
| Admin (GM-only) | `PUT /api/v1/admin/turn` | Generate the next turn. |
| Admin (GM-only) | `POST /api/v1/admin/race/banish` | Deactivate a race after permanent platform removal. Body `{race_name}`. |
| Player | `PUT /api/v1/command` | Execute a batch of player commands. |
| Player | `PUT /api/v1/order` | Validate and store a batch of player orders. |
| Player | `GET /api/v1/report` | Fetch per-player turn report. |
| Probe | `GET /healthz` | Liveness probe used by `Runtime Manager` and operator tooling. |

Admin paths are unauthenticated and reachable only from inside the trusted
network segment that connects GM to the engine container. The engine does
not enforce caller identity; network-level segmentation is the boundary.

`StateResponse` carries an extra boolean `finished` field. When `true` on a
turn-generation response, GM treats the game as finished and runs the
finish flow described below. The conditional logic that flips `finished`
to `true` lives in the engine's domain code and is not GM's concern.

The engine endpoint URL is the `engine_endpoint` value handed to GM by
`Game Lobby` during `register-runtime`: `http://galaxy-game-{game_id}:8080`.
The DNS name is stable across restart and patch.

## Runtime Surface

### Listeners

| Listener | Default address | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8097` (`GAMEMASTER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Edge Gateway`, `Game Lobby`, and `Admin Service`. |

There is no public listener. The internal listener is unauthenticated and
assumes a trusted network segment. Authentication of player commands has
already happened at `Edge Gateway`; GM enforces authorisation only.

### Background workers

| Worker | Driver | Description |
| --- | --- | --- |
| Scheduler ticker | 1 s loop | Scans `runtime_records` for due `next_generation_at`, runs the turn-generation service for each, recomputes `next_generation_at` from `turn_schedule` (skipping one tick when `skip_next_tick=true` is set). |
| `runtime:health_events` consumer | Redis Stream | XREADs from `runtime:health_events` (produced by RTM), updates `runtime_records.engine_health` summary, debounces `runtime_snapshot_update` publication. |

### Startup dependencies

In start order:

1. PostgreSQL primary (`GAMEMASTER_POSTGRES_PRIMARY_DSN`). Embedded goose
   migrations apply synchronously before any listener opens.
2. Redis master (`GAMEMASTER_REDIS_MASTER_ADDR`).
3. Telemetry exporter (OTLP grpc/http or stdout).
4. Internal HTTP listener.
5. Health-events consumer worker.
6. Scheduler ticker worker.

A failure in any step causes the process to exit with a non-zero status.

### Probes

`/healthz` reports liveness: it responds whenever the HTTP server is alive.

`/readyz` reports readiness — `200` only when the PostgreSQL pool can ping
the primary and the Redis master client can ping. No deeper dependency is
checked synchronously; the engine is reached only on demand.
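The readiness rule above can be sketched as a handler that pings each dependency in turn. This is a minimal sketch, not GM's actual wiring: the `Pinger` interface, the `staticPinger` stub, the `probe` helper, and the `503` status for the not-ready case are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Pinger is a hypothetical abstraction over the PostgreSQL pool and the
// Redis master client; readiness only needs each of them to answer a ping.
type Pinger interface {
	Ping(ctx context.Context) error
}

// staticPinger is a stub dependency used by the usage example.
type staticPinger struct{ err error }

func (p staticPinger) Ping(context.Context) error { return p.err }

var errDown = errors.New("dependency down")

// readyzHandler answers 200 only when every dependency pings successfully
// within the timeout. The 503 for the not-ready case is an assumption.
func readyzHandler(timeout time.Duration, deps ...Pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		for _, d := range deps {
			if err := d.Ping(ctx); err != nil {
				w.WriteHeader(http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe issues GET /readyz against h in-process and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	return rec.Code
}

func main() {
	up, down := staticPinger{}, staticPinger{err: errDown}
	fmt.Println(probe(readyzHandler(time.Second, up, up)))
	fmt.Println(probe(readyzHandler(time.Second, up, down)))
}
```

The engine is deliberately absent from the dependency list, matching the rule that it is reached only on demand.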

Both probes are documented in
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).

## Lifecycles

### Register-runtime

**Triggered by:** `Game Lobby` after a successful container start, calling
`POST /api/v1/internal/games/{game_id}/register-runtime` with body
`{engine_endpoint, members:[{user_id, race_name}], target_engine_version,
turn_schedule}`.

**Flow on success:**

1. Validate request shape; reject with `invalid_request` if any required
   field is missing.
2. Reject with `conflict` if `runtime_records.{game_id}` already exists.
3. Resolve `image_ref` for `target_engine_version` from `engine_versions`;
   reject with `engine_version_not_found` when missing.
4. Persist `runtime_records` with `status=starting`, `engine_endpoint`,
   `current_image_ref`, `current_engine_version`, `turn_schedule`, and
   `created_at`.
5. Call engine `POST /api/v1/admin/init` with the race-name list derived
   from `members`.
6. Read `StateResponse` and persist one `player_mappings` row per player:
   `(game_id, user_id, race_name, engine_player_uuid)`.
7. CAS `runtime_records.status: starting → running`. Persist
   `current_turn=0` and `next_generation_at` computed from `turn_schedule`.
8. Append `operation_log` entry (`op_kind=register_runtime`,
   `outcome=success`).
9. Publish `runtime_snapshot_update` to `gm:lobby_events`.
10. Return `200` with the persisted `runtime_records` row.

**Failure paths:**

| Failure | Side effect | Outcome to caller |
| --- | --- | --- |
| Invalid envelope | None | `400 invalid_request` |
| `runtime_records` already exists | None | `409 conflict` |
| Engine `/admin/init` returns 4xx | Roll back `runtime_records`; append failure to `operation_log` | `502 engine_validation_error` |
| Engine `/admin/init` returns 5xx or fails at the transport layer | Roll back; append failure | `502 engine_unreachable` |
| Engine response missing players or contains races not in roster | Roll back; append failure | `502 engine_protocol_violation` |
| PostgreSQL transaction failure | Roll back; append failure if possible | `503 service_unavailable` |

A failed `register-runtime` leaves no `runtime_records` row and no
`player_mappings` rows. `Game Lobby` then transitions the platform game
record to `paused` (per the architecture's flow §4 forced-pause path).
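The engine-related rows of the failure table can be sketched as a small classifier. This is a hedged illustration only: the function names, the `initFailure` type, and the convention of passing `0` as the engine status on a transport failure are assumptions, not part of the GM codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// initFailure pairs the GM error code with the HTTP status returned to Lobby.
type initFailure struct {
	HTTPStatus int
	Code       string
}

var errTransport = errors.New("transport failure")

// rosterMatches reports whether the engine returned exactly the races that
// were sent: none missing and none outside the roster.
func rosterMatches(roster, returned []string) bool {
	want := make(map[string]bool, len(roster))
	for _, r := range roster {
		want[r] = true
	}
	seen := make(map[string]bool, len(returned))
	for _, r := range returned {
		if !want[r] {
			return false // race not in roster
		}
		seen[r] = true
	}
	return len(seen) == len(want)
}

// classifyInitFailure maps the outcome of the engine /admin/init call to a
// failure row. engineStatus is the engine's HTTP status (0 when the call
// failed at the transport layer). The bool is false when nothing failed.
func classifyInitFailure(engineStatus int, transportErr error, rosterOK bool) (initFailure, bool) {
	switch {
	case transportErr != nil || engineStatus >= 500:
		return initFailure{502, "engine_unreachable"}, true
	case engineStatus >= 400:
		return initFailure{502, "engine_validation_error"}, true
	case !rosterOK:
		return initFailure{502, "engine_protocol_violation"}, true
	default:
		return initFailure{}, false
	}
}

func main() {
	fmt.Println(classifyInitFailure(422, nil, false))
	fmt.Println(classifyInitFailure(0, errTransport, false))
	ok := rosterMatches([]string{"Zorg", "Terran"}, []string{"Zorg"})
	fmt.Println(classifyInitFailure(200, nil, ok))
}
```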

### Turn generation

**Triggered by:** the scheduler ticker when `now >= next_generation_at`
for a game in `status=running`, or by an admin invocation of
`force-next-turn`.

**Flow on success:**

1. CAS `runtime_records.status: running → generation_in_progress`. If the
   CAS fails (status changed concurrently), the tick is skipped silently.
2. Call engine `PUT /api/v1/admin/turn`. Engine returns `StateResponse`
   with the new `turn` and the updated `player[]` array.
3. Persist `runtime_records.current_turn` and refresh
   `runtime_records.engine_health` summary.
4. If `StateResponse.finished == true`:
   - CAS `runtime_records.status: generation_in_progress → finished`;
   - publish `game_finished` to `gm:lobby_events` with
     `{game_id, final_turn_number, finished_at_ms, player_turn_stats[]}`;
   - publish `game.finished` notification intent to all `active` members.
5. If `StateResponse.finished == false`:
   - CAS `runtime_records.status: generation_in_progress → running`;
   - recompute `next_generation_at` from `turn_schedule`. If
     `skip_next_tick=true`, advance by one extra cron step and clear the
     flag;
   - publish `runtime_snapshot_update` to `gm:lobby_events` with
     `{game_id, current_turn, runtime_status, engine_health_summary,
     player_turn_stats[]}`;
   - publish `game.turn.ready` notification intent to all `active`
     members.
6. Append `operation_log` entry (`op_kind=turn_generation`,
   `outcome=success`).
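The compare-and-swap in step 1 is what keeps two triggers from generating the same turn twice. The sketch below uses an in-memory stand-in for `runtime_records`; the `memoryStore` type and its method names are hypothetical, and the SQL shown in the comment is an assumed shape rather than the actual query.

```go
package main

import (
	"fmt"
	"sync"
)

type Status string

const (
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
)

// memoryStore is an in-memory stand-in for the runtime_records table.
type memoryStore struct {
	mu     sync.Mutex
	status map[string]Status
}

func newMemoryStore() *memoryStore {
	return &memoryStore{status: make(map[string]Status)}
}

// casStatus moves gameID from `from` to `to` only if the stored status still
// equals `from`. Against PostgreSQL the same idea would presumably be
//
//	UPDATE runtime_records SET status = $3
//	WHERE game_id = $1 AND status = $2
//
// with success meaning exactly one row was affected (assumed shape).
func (s *memoryStore) casStatus(gameID string, from, to Status) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.status[gameID]; !ok || cur != from {
		return false
	}
	s.status[gameID] = to
	return true
}

func main() {
	store := newMemoryStore()
	store.status["game-1"] = StatusRunning

	// The first trigger wins the transition; the second is skipped silently.
	fmt.Println(store.casStatus("game-1", StatusRunning, StatusGenerationInProgress))
	fmt.Println(store.casStatus("game-1", StatusRunning, StatusGenerationInProgress))
}
```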

**Failure paths:**

| Failure | Side effect | Outcome |
| --- | --- | --- |
| Engine timeout / 5xx | CAS `status: generation_in_progress → generation_failed`; publish `runtime_snapshot_update`; publish `game.generation_failed` admin notification | Logged; ticker leaves the game in `generation_failed` until manual recovery (admin issues `force-next-turn` or `stop`). |
| Persistence failure after engine success | Append failure to `operation_log`; status stays `generation_in_progress` | The health-summary update on the next probe resyncs the record. |

`player_turn_stats[]` is built from `StateResponse.player[]` by mapping
`raceName → user_id` through `player_mappings` and projecting
`{user_id, planets, population}`. `ships_built` is intentionally absent
(see [`./docs/stage01-architecture-sync.md`](./docs/stage01-architecture-sync.md)).
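The projection described above can be sketched as follows. The struct names, the integer types for `planets` and `population`, and the decision to return an error for a race with no mapping are assumptions for illustration; only the field names `raceName`, `user_id`, `planets`, and `population` come from this document.

```go
package main

import "fmt"

// enginePlayer is an assumed shape for one entry of StateResponse.player[].
type enginePlayer struct {
	RaceName   string `json:"raceName"`
	Planets    int    `json:"planets"`
	Population int    `json:"population"`
}

// playerTurnStat is one entry of player_turn_stats[].
type playerTurnStat struct {
	UserID     string `json:"user_id"`
	Planets    int    `json:"planets"`
	Population int    `json:"population"`
}

// projectTurnStats maps raceName to user_id through the given lookup (built
// from player_mappings) and keeps only the projected fields.
func projectTurnStats(players []enginePlayer, raceToUser map[string]string) ([]playerTurnStat, error) {
	stats := make([]playerTurnStat, 0, len(players))
	for _, p := range players {
		userID, ok := raceToUser[p.RaceName]
		if !ok {
			// Assumption: an unmapped race is treated as an error.
			return nil, fmt.Errorf("no player mapping for race %q", p.RaceName)
		}
		stats = append(stats, playerTurnStat{
			UserID:     userID,
			Planets:    p.Planets,
			Population: p.Population,
		})
	}
	return stats, nil
}

func main() {
	players := []enginePlayer{
		{RaceName: "Zorg", Planets: 3, Population: 120},
		{RaceName: "Terran", Planets: 5, Population: 310},
	}
	mapping := map[string]string{"Zorg": "user-1", "Terran": "user-2"}
	stats, err := projectTurnStats(players, mapping)
	fmt.Println(stats, err)
}
```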

### Force-next-turn

**Triggered by:** `Admin Service` or system-admin via
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`.

**Pre-conditions:** runtime exists, `status=running`.

**Flow:**

1. Run the turn-generation flow synchronously (the same code path the
   scheduler uses).
2. After success, set `runtime_records.skip_next_tick = true`. The next
   regular tick computed from `turn_schedule` is then advanced by one
   extra step before being persisted as `next_generation_at`.
3. Append `operation_log` entry (`op_kind=force_next_turn`).

The skip rule guarantees that the inter-turn spacing is never shorter than
one schedule interval, regardless of when the force is issued.
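A minimal sketch of the skip rule, assuming a hypothetical `Schedule` interface with a single `Next` method. The `fixedInterval` type stands in for the real five-field cron schedule and, unlike cron, does not align to clock boundaries; it is used here only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a hypothetical abstraction over turn_schedule: Next returns
// the first tick strictly after the given instant.
type Schedule interface {
	Next(after time.Time) time.Time
}

// fixedInterval is a simplified stand-in for a cron schedule.
type fixedInterval struct{ step time.Duration }

func (f fixedInterval) Next(after time.Time) time.Time { return after.Add(f.step) }

// runtimeRecord holds only the two scheduling fields used by this sketch.
type runtimeRecord struct {
	NextGenerationAt time.Time
	SkipNextTick     bool
}

// reschedule recomputes NextGenerationAt after a generated turn. When
// SkipNextTick is set, it advances by one extra step and clears the flag.
func (r *runtimeRecord) reschedule(s Schedule, now time.Time) {
	next := s.Next(now)
	if r.SkipNextTick {
		next = s.Next(next)
		r.SkipNextTick = false
	}
	r.NextGenerationAt = next
}

func main() {
	hourly := fixedInterval{step: time.Hour}
	now := time.Date(2024, 1, 1, 12, 20, 0, 0, time.UTC)

	rec := runtimeRecord{SkipNextTick: true} // flag set by force-next-turn
	rec.reschedule(hourly, now)
	fmt.Println(rec.NextGenerationAt.Format(time.RFC3339), rec.SkipNextTick)

	rec.reschedule(hourly, now) // flag already cleared: regular single step
	fmt.Println(rec.NextGenerationAt.Format(time.RFC3339), rec.SkipNextTick)
}
```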

### Game finish

The finish flow is driven entirely by the engine signal `finished:bool`.
GM never decides finish independently. After `game_finished` is published,
`Game Lobby` transitions its platform record to `finished`, runs the
capability evaluation, and finalises Race Name Directory state. The GM
record stays in `status=finished` indefinitely; cleanup is operator-driven.

### Banish (engine-side player removal)

**Triggered by:** `Game Lobby` synchronously calling
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish` after a
permanent membership removal at platform level.

**Pre-conditions:** runtime exists; `race_name` resolves to an existing
`player_mappings` row.

**Flow:**

1. Call engine `POST /api/v1/admin/race/banish` with `{race_name}`.
2. On engine success, append `operation_log` entry (`op_kind=banish`,
   `outcome=success`).
3. Return `204` to Lobby.

**Failure path:** engine error returns `502 engine_unreachable`. Lobby
treats this as a degraded state and may retry; the platform-level
membership stays `removed` regardless.

### Stop

**Triggered by:** system-admin via
`POST /api/v1/internal/runtimes/{game_id}/stop` with body `{reason}`,
where `reason ∈ {admin_request, finished, timeout}`.

**Flow:**

1. Call `Runtime Manager` `POST /api/v1/internal/runtimes/{game_id}/stop`
   with the same `reason`.
2. CAS `runtime_records.status: * → stopped`.
3. Append `operation_log` entry.
4. Publish `runtime_snapshot_update` reflecting the stopped status.

### Patch

**Triggered by:** system-admin via
`POST /api/v1/internal/runtimes/{game_id}/patch` with body `{version}`.

**Pre-conditions:**

- `engine_versions.{version}` exists with `status=active`;
- the new version is a semver-patch of the current version (same major and
  minor); otherwise reject with `semver_patch_only`.

**Flow:**

1. Resolve `image_ref` from `engine_versions.{version}`.
2. Call `Runtime Manager`
   `POST /api/v1/internal/runtimes/{game_id}/patch` with `{image_ref}`.
3. On success, persist new `current_image_ref` and `current_engine_version`
   on `runtime_records`.
4. Append `operation_log` entry.

The engine container is recreated by RTM with the same DNS name; the
`engine_endpoint` is unchanged. GM does not call `/admin/init` again —
the bind-mounted state directory is preserved and the engine resumes from
the previous turn.
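The `semver_patch_only` pre-condition can be sketched with a small hand-rolled parser. This sketch handles plain `MAJOR.MINOR.PATCH` strings only (no pre-release or build metadata) and checks exactly what the pre-condition states, same major and minor. Whether the target must be strictly newer than the current version is not specified here, so the sketch does not enforce it; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type semver struct{ major, minor, patch int }

// parseSemver accepts plain MAJOR.MINOR.PATCH strings only.
func parseSemver(s string) (semver, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return semver{}, fmt.Errorf("invalid semver %q", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil || v < 0 {
			return semver{}, fmt.Errorf("invalid semver %q", s)
		}
		n[i] = v
	}
	return semver{n[0], n[1], n[2]}, nil
}

// isSemverPatch reports whether target shares major and minor with current.
// A false result corresponds to the semver_patch_only rejection.
func isSemverPatch(current, target string) (bool, error) {
	c, err := parseSemver(current)
	if err != nil {
		return false, err
	}
	t, err := parseSemver(target)
	if err != nil {
		return false, err
	}
	return c.major == t.major && c.minor == t.minor, nil
}

func main() {
	fmt.Println(isSemverPatch("1.4.2", "1.4.7"))
	fmt.Println(isSemverPatch("1.4.2", "1.5.0"))
	fmt.Println(isSemverPatch("1.4.2", "latest"))
}
```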

### Liveness reply (Lobby resume)

**Triggered by:** `Game Lobby` resuming a paused game, calling
`GET /api/v1/internal/games/{game_id}/liveness`.

**Flow:** if `runtime_records.{game_id}` exists and `status=running`,
return `200 {ready: true}`. Otherwise return `200 {ready: false, status:
"<observed status>"}`.

This endpoint never calls the engine; it reflects GM's own view only.
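The reply rule is small enough to sketch as a pure function. The type and function names are hypothetical, and omitting the `status` key when it is empty (including the case where no runtime record exists) is an assumption about the wire shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// livenessReply is an assumed wire shape for the liveness response body.
type livenessReply struct {
	Ready  bool   `json:"ready"`
	Status string `json:"status,omitempty"`
}

// liveness derives the reply from GM's own view only: found reports whether
// a runtime record exists, status is its stored status.
func liveness(found bool, status string) livenessReply {
	if found && status == "running" {
		return livenessReply{Ready: true}
	}
	if !found {
		status = "" // no record: nothing was observed
	}
	return livenessReply{Ready: false, Status: status}
}

func main() {
	for _, r := range []livenessReply{
		liveness(true, "running"),
		liveness(true, "generation_failed"),
		liveness(false, ""),
	} {
		b, _ := json.Marshal(r)
		fmt.Println(string(b))
	}
}
```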

## Hot Path

### Player commands and orders

Both `game.command.execute` and `game.order.put` use the same FlatBuffers
schema (`pkg/schema/fbs/order.fbs` `Order{updated_at, commands:[…]}`). The
gateway transcodes the verified payload to JSON via
`pkg/transcoder/order.go` before calling GM.

**GM endpoints:**

- `POST /api/v1/internal/games/{game_id}/commands` — execute now; engine
  `PUT /api/v1/command`.
- `POST /api/v1/internal/games/{game_id}/orders` — validate-and-store;
  engine `PUT /api/v1/order`.

Both endpoints accept body `{commands:[{cmd_id, @type, …}, …]}` and the
`X-User-ID` header. The actor field on the engine call is **always** set
by GM from the authenticated user identity; GM never trusts a payload
field for actor identification.

**Pre-conditions:**

- `runtime_records.{game_id}` exists with `status=running`;
- the user is an `active` member of the game (cache lookup);
- `player_mappings.(game_id, user_id)` exists.

**Errors:**

- `runtime_not_found` — runtime missing.
- `runtime_not_running` — `runtime_status` is anything other than
  `running`.
- `forbidden` — caller is not an active member.
- `engine_unreachable` — engine returned 5xx.
- `engine_validation_error` — engine returned 4xx; the body carries the
  engine's per-command result (`cmd_applied`, `cmd_error_code`).

### Reports

**GM endpoint:** `GET /api/v1/internal/games/{game_id}/reports/{turn}`
with the `X-User-ID` header.

**Flow:**

1. Authorise: caller must be an active member of the game.
2. Resolve `race_name` from `player_mappings`.
3. Call engine `GET /api/v1/report?player={race_name}&turn={turn}`.
4. Return the engine response verbatim. Reports are full per-player
   payloads and are never cached at the platform layer; the engine remains
   the source of truth.

### Membership cache and invalidation

GM holds an in-process per-game TTL cache (default 30 s) of memberships
loaded from `Lobby /api/v1/internal/games/{id}/memberships`. The cache
shape is `map[user_id]MembershipStatus` plus a load timestamp. TTL is
the safety-net fallback.

The primary invalidation mechanism is an explicit hook from Lobby:

- Endpoint: `POST /api/v1/internal/games/{game_id}/memberships/invalidate`.
- Lobby invokes it post-commit on every operation that mutates roster:
  application approval, application rejection, invite redeem, member
  remove, member block, user-lifecycle cascade.
- Failed invalidation does not roll back Lobby state; the TTL safety net
  catches stale data within the next 30 s.

This is a deliberate tight coupling. The trade-off is recorded in
[`./PLAN.md` Stage 16](./PLAN.md).
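The combination of TTL fallback and explicit invalidation can be sketched as below. This is an illustration under stated assumptions: the `loader` signature, the injected clock, and the `fakeClock` helper are hypothetical, and the sketch holds its lock while loading, which a production cache may prefer to avoid.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type MembershipStatus string

// loader fetches one game's roster from Lobby (hypothetical signature).
type loader func(gameID string) (map[string]MembershipStatus, error)

type entry struct {
	members  map[string]MembershipStatus
	loadedAt time.Time
}

type membershipCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	load  loader
	games map[string]entry
}

func newMembershipCache(ttl time.Duration, now func() time.Time, load loader) *membershipCache {
	return &membershipCache{ttl: ttl, now: now, load: load, games: make(map[string]entry)}
}

// Status returns the cached status of userID, reloading the roster when the
// game has no entry or the entry is older than the TTL. An unknown user
// yields the empty status.
func (c *membershipCache) Status(gameID, userID string) (MembershipStatus, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.games[gameID]
	if !ok || c.now().Sub(e.loadedAt) >= c.ttl {
		members, err := c.load(gameID)
		if err != nil {
			return "", err
		}
		e = entry{members: members, loadedAt: c.now()}
		c.games[gameID] = e
	}
	return e.members[userID], nil
}

// Invalidate drops the entry; it backs the explicit Lobby hook.
func (c *membershipCache) Invalidate(gameID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.games, gameID)
}

// fakeClock is a controllable clock for the usage example.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	loads := 0
	clk := &fakeClock{}
	cache := newMembershipCache(30*time.Second, clk.Now, func(string) (map[string]MembershipStatus, error) {
		loads++
		return map[string]MembershipStatus{"user-1": "active"}, nil
	})
	cache.Status("game-1", "user-1") // miss: loads from Lobby
	cache.Status("game-1", "user-1") // hit
	cache.Invalidate("game-1")       // explicit hook from Lobby
	cache.Status("game-1", "user-1") // miss again
	clk.Advance(31 * time.Second)
	cache.Status("game-1", "user-1") // TTL expired: reload
	fmt.Println("loads:", loads)
}
```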

## Engine Version Registry

The registry is the source of truth for which engine versions are
deployable. CRUD is exposed on the GM internal port; `Game Lobby`
consumes it synchronously to resolve `image_ref` for `target_engine_version`
just before publishing a `runtime:start_jobs` envelope.

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/api/v1/internal/engine-versions` | List versions; supports `status` filter. |
| `POST` | `/api/v1/internal/engine-versions` | Create a new version with `version`, `image_ref`, optional `options`. Validates semver shape and Docker reference. |
| `GET` | `/api/v1/internal/engine-versions/{version}` | Read one version. |
| `PATCH` | `/api/v1/internal/engine-versions/{version}` | Update `image_ref`, `options`, or `status`. |
| `DELETE` | `/api/v1/internal/engine-versions/{version}` | Soft-deprecate (`status=deprecated`). Hard delete is rejected if the version is referenced by any non-finished `runtime_records` row. |
| `GET` | `/api/v1/internal/engine-versions/{version}/image-ref` | Resolve `image_ref` only. Used by Lobby's start flow. |

`options` is a free-form `jsonb` document stored verbatim. v1 does not
enforce a schema; future engine-side options follow the engine's own
contract.

`status` values: `active` (deployable), `deprecated` (rejected on new
starts; existing runtimes unaffected). Hard removal of a deprecated
version requires that no runtime references it.

Lobby resolves `image_ref` synchronously per game start. If the resolve
call fails or the version is missing, Lobby fails the start with
`engine_version_not_found` and never publishes `runtime:start_jobs`.

## Trusted Surfaces

### Internal REST

The internal REST surface is consumed by:

- `Edge Gateway` — verified player commands and report reads;
- `Game Lobby` — register-runtime, image-ref resolve, membership invalidate,
  banish, liveness reply;
- `Admin Service` (future) — full administrative operations;
- platform probes — `/healthz`, `/readyz`.

The listener is unauthenticated; downstream services rely on network
segmentation. Caller identity for audit is recorded from the optional
`X-Galaxy-Caller` header (`gateway`, `lobby`, `admin`) and reflected as
`op_source` in `operation_log` (`gateway_player`, `lobby_internal`,
`admin_rest`); when missing or unrecognised, GM defaults to
`op_source=admin_rest`.
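The header-to-`op_source` mapping can be sketched as a switch. The pairing of each header value with its `op_source` follows the order in which the two lists appear above, and exact, case-sensitive matching is an assumption; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// opSourceFor maps the X-Galaxy-Caller header value to the op_source stored
// in operation_log. Missing or unrecognised values fall back to admin_rest.
func opSourceFor(caller string) string {
	switch caller {
	case "gateway":
		return "gateway_player"
	case "lobby":
		return "lobby_internal"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}

func main() {
	h := http.Header{}
	h.Set("X-Galaxy-Caller", "lobby")
	fmt.Println(opSourceFor(h.Get("X-Galaxy-Caller")))

	// A request without the header falls back to the default.
	fmt.Println(opSourceFor(http.Header{}.Get("X-Galaxy-Caller")))
}
```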

For player-command endpoints, the additional `X-User-ID` header is
required and authoritative for the acting user identity.

Request and response shapes are defined in
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml). Unknown JSON
fields are rejected with `invalid_request`.
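Rejecting unknown JSON fields is something the standard library supports directly through `json.Decoder.DisallowUnknownFields`. The sketch below shows one way to apply it; the `decodeStrict` helper, the `stopRequest` type, and the extra check for trailing data are illustrative assumptions, not the service's actual decoding code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// stopRequest is an assumed shape for the body of the stop endpoint.
type stopRequest struct {
	Reason string `json:"reason"`
}

// decodeStrict decodes data into v and fails on any field that v does not
// declare. Errors are prefixed with the invalid_request code.
func decodeStrict(data []byte, v any) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(v); err != nil {
		return fmt.Errorf("invalid_request: %w", err)
	}
	// Anything other than end of input after the first value is rejected.
	if _, err := dec.Token(); err != io.EOF {
		return errors.New("invalid_request: trailing data after JSON body")
	}
	return nil
}

func main() {
	var req stopRequest
	fmt.Println(decodeStrict([]byte(`{"reason":"admin_request"}`), &req), req.Reason)
	fmt.Println(decodeStrict([]byte(`{"reason":"timeout","force":true}`), &req))
}
```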

## Async Stream Contracts

### `gm:lobby_events` (out)

Producer: `Game Master`. Consumer: `Game Lobby`.

Two message types share the stream, discriminated by `event_type`:

| `event_type` | Body |
| --- | --- |
| `runtime_snapshot_update` | `{game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats:[{user_id, planets, population}], occurred_at_ms}` |
| `game_finished` | `{game_id, final_turn_number, runtime_status:"finished", player_turn_stats:[…], finished_at_ms}` |

Publication cadence: events only. GM publishes a snapshot when:

- a turn was generated (success or failure);
- `runtime_status` transitioned (e.g., `running ↔ generation_in_progress`,
  `running → engine_unreachable`, `* → finished`);
- `engine_health_summary` changed in response to a `runtime:health_events`
  observation (debounced — duplicates are suppressed when the summary did
  not change).

There is no periodic heartbeat. `Game Lobby` consumes these events to
update its denormalised runtime snapshot and to feed the per-game
`player_turn_stats` aggregate used at game finish.
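The debounce on `engine_health_summary` amounts to remembering the last summary emitted per game. A minimal in-memory sketch follows; the `healthDeduper` type and its method name are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// healthDeduper remembers the last health summary emitted for each game.
type healthDeduper struct {
	mu   sync.Mutex
	last map[string]string
}

func newHealthDeduper() *healthDeduper {
	return &healthDeduper{last: make(map[string]string)}
}

// ShouldPublish reports whether summary differs from the last one emitted
// for gameID, and records it as emitted when it does.
func (d *healthDeduper) ShouldPublish(gameID, summary string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, ok := d.last[gameID]; ok && prev == summary {
		return false
	}
	d.last[gameID] = summary
	return true
}

func main() {
	d := newHealthDeduper()
	fmt.Println(d.ShouldPublish("game-1", "healthy"))     // first observation
	fmt.Println(d.ShouldPublish("game-1", "healthy"))     // duplicate, suppressed
	fmt.Println(d.ShouldPublish("game-1", "probe_failed")) // changed
	fmt.Println(d.ShouldPublish("game-2", "healthy"))     // tracked per game
}
```

Because the state lives in process memory, a restart forgets it, so the first observation per game after a restart is published again.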

The first `runtime_snapshot_update` published right after a successful
`register-runtime` carries `player_turn_stats` projected from the
engine `/admin/init` response — the per-player baseline (`planets`,
`population`) at turn 0. Lobby treats this baseline as the reference
point against which subsequent turn deltas are measured. For other
status transitions that fire without a fresh engine state payload
(e.g., a pure health-summary change), `player_turn_stats` is empty.

The full schema is enforced by
[`./api/runtime-events-asyncapi.yaml`](./api/runtime-events-asyncapi.yaml).

### `runtime:health_events` (in)

Producer: `Runtime Manager`. Consumer: `Game Master`.

GM consumes the stream to update `runtime_records.engine_health` summary
per game. The schema is owned by `Runtime Manager` and documented in
[`../rtmanager/api/runtime-health-asyncapi.yaml`](../rtmanager/api/runtime-health-asyncapi.yaml).
GM never modifies `runtime:health_events`; it is read-only.

GM does not publish notifications in response to runtime health changes
in v1; the operator surface is `gm:lobby_events` plus the GM REST
inspect endpoints.

## Notification Contracts

`Game Master` publishes notification intents to `notification:intents`
using the shared `pkg/notificationintent` producer module:

| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Successful turn generation | `game.turn.ready` | active members of the game | `push+email` |
| Game finish | `game.finished` | active members of the game | `push+email` |
| Turn generation failed | `game.generation_failed` | configured admin email list | `email` |

Recipient resolution: GM materialises `recipient_user_ids` from its own
membership cache (loaded from Lobby) at publish time; admin recipients
are resolved by `Notification Service` from configuration.

A failed publication is a notification degradation and must not roll back
already committed runtime state. Failed publications are logged and
counted via `gamemaster.notification.publish_attempts`.
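
The "degrade, do not roll back" rule can be sketched as a publish helper that runs after the state commit and never returns an error to its caller. The `intentPublisher` interface, the `attemptCounter` map, and `publishBestEffort` are hypothetical names used for illustration; they do not reflect the actual `pkg/notificationintent` API or the telemetry instrument.

```go
package main

import (
	"errors"
	"fmt"
)

// intentPublisher is a hypothetical stand-in for the notification producer.
type intentPublisher interface {
	Publish(notificationType string, recipientUserIDs []string) error
}

// attemptCounter is a hypothetical stand-in for the
// gamemaster.notification.publish_attempts counter, keyed by "type/result".
type attemptCounter map[string]int

// publishBestEffort is meant to run after runtime state is committed.
// A failure is logged and counted, and deliberately not returned, so the
// caller has nothing to roll back.
func publishBestEffort(p intentPublisher, c attemptCounter, notificationType string, recipients []string) {
	result := "ok"
	if err := p.Publish(notificationType, recipients); err != nil {
		result = "error"
		fmt.Printf("notification publish failed: type=%s err=%v\n", notificationType, err)
	}
	c[notificationType+"/"+result]++
}

// failingPublisher simulates an unavailable notification stream.
type failingPublisher struct{}

func (failingPublisher) Publish(string, []string) error {
	return errors.New("stream unavailable")
}

func main() {
	counts := attemptCounter{}
	publishBestEffort(failingPublisher{}, counts, "game.turn.ready", []string{"user-1"})
	fmt.Println("error attempts:", counts["game.turn.ready/error"])
}
```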

## Persistence Layout

### PostgreSQL durable state (schema `gamemaster`)

| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game; latest known runtime status and scheduling state. | `game_id` |
| `engine_versions` | Engine version registry. | `version` |
| `player_mappings` | `(game_id, user_id) → race_name + engine_player_uuid`. | composite `(game_id, user_id)` |
| `operation_log` | Append-only audit of every GM operation. | `id` (auto) |

`runtime_records` columns:

- `game_id` — primary key, references Lobby's identifier.
- `status` — `starting | running | generation_in_progress |
  generation_failed | stopped | engine_unreachable | finished`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `current_image_ref` — Docker reference of the running image.
- `current_engine_version` — semver string registered in `engine_versions`.
- `turn_schedule` — five-field cron expression copied from Lobby.
- `current_turn` — last completed turn number; `0` until the first turn
  generates.
- `next_generation_at` — UTC timestamp of the next due tick.
- `skip_next_tick` — boolean; set by `force-next-turn`, cleared after the
  first cron step is skipped.
- `engine_health` — short text summary derived from
  `runtime:health_events`.
- `created_at`, `updated_at`, `started_at`, `stopped_at`, `finished_at` —
  lifecycle timestamps.

`engine_versions` columns:

- `version` — primary key; semver string.
- `image_ref` — non-empty Docker reference.
- `options` — `jsonb`, free-form, default `'{}'`.
- `status` — `active | deprecated`.
- `created_at`, `updated_at`.

`player_mappings` columns:

- composite primary key `(game_id, user_id)`.
- `race_name` — non-empty string; unique per `game_id`.
- `engine_player_uuid` — UUID returned by the engine `/admin/init`.
- `created_at`.

`operation_log` columns:

- `id`, `game_id`, `op_kind` (`register_runtime | turn_generation |
  force_next_turn | banish | stop | patch | engine_version_create |
  engine_version_update | engine_version_deprecate |
  engine_version_delete`), `op_source`, `source_ref` (request id
  when known), `outcome` (`success | failure`), `error_code`,
  `error_message`, `started_at`, `finished_at`.

For engine-version registry entries (`op_kind` starting with
`engine_version_`), the `game_id` column doubles as the audit subject
and stores the canonical `version` string instead of a platform game
identifier; the registry is global, not per-game. The convention is
documented in
[`./docs/stage14-engine-version-registry.md`](./docs/stage14-engine-version-registry.md).
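
A reader of `operation_log` therefore has to branch on `op_kind` before interpreting `game_id`. The sketch below illustrates that convention; `isRegistryOp` and `describeSubject` are hypothetical helper names, not part of the GM codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// engineVersionOpPrefix marks registry operations in operation_log.op_kind.
const engineVersionOpPrefix = "engine_version_"

// isRegistryOp reports whether opKind is an engine-version registry operation.
func isRegistryOp(opKind string) bool {
	return strings.HasPrefix(opKind, engineVersionOpPrefix)
}

// describeSubject interprets the game_id column according to op_kind:
// a version string for registry operations, a game identifier otherwise.
func describeSubject(opKind, gameIDColumn string) string {
	if isRegistryOp(opKind) {
		return "engine version " + gameIDColumn
	}
	return "game " + gameIDColumn
}

func main() {
	fmt.Println(describeSubject("engine_version_deprecate", "1.4.2"))
	fmt.Println(describeSubject("force_next_turn", "g-42"))
}
```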

Indexes:

- `runtime_records (status, next_generation_at)` — drives the scheduler
  ticker scan.
- `operation_log (game_id, started_at DESC)` — drives audit reads.
- UNIQUE on `player_mappings (game_id, race_name)` —
  one-race-per-game invariant.

Per-game roster reads (`WHERE game_id = $1`) are served by the
leftmost prefix of the composite primary key on
`player_mappings (game_id, user_id)`; no extra single-column index is
added.

The schema ships as a single embedded migration, `00001_init.sql`
(single-init pre-launch policy from `ARCHITECTURE.md §Persistence Backends`).

### Redis runtime-coordination state

| Key shape | Purpose |
| --- | --- |
| `gamemaster:stream_offsets:{label}` | Last processed entry id per consumer (`health_events`). Same shape as Lobby and RTM. |

GM does not persist the membership cache to Redis in v1; the cache is
in-process. This trade-off is documented in [`./PLAN.md` Stage 16](./PLAN.md).

## Error Model

Error envelope: `{ "error": { "code": "...", "message": "..." } }`,
identical to Lobby and RTM.
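
As an illustration, the envelope maps onto two small Go structs. The type and function names below are hypothetical; only the JSON shape comes from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorBody carries the stable code and a human-readable message.
type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// errorEnvelope is the top-level object returned on failure.
type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// encodeError renders the envelope as compact JSON.
func encodeError(code, message string) (string, error) {
	b, err := json.Marshal(errorEnvelope{Error: errorBody{Code: code, Message: message}})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeError("runtime_not_found", "no runtime for game")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// prints {"error":{"code":"runtime_not_found","message":"no runtime for game"}}
}
```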

Stable error codes:

| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `runtime_not_found` | `runtime_records.{game_id}` does not exist. |
| `runtime_not_running` | Operation requires `status=running`. |
| `conflict` | State transition not allowed. |
| `forbidden` | Caller is not an active member or not authorised. |
| `engine_version_not_found` | `engine_versions.{version}` does not exist. |
| `engine_version_in_use` | Hard-delete attempt against a version referenced by a non-finished runtime. |
| `semver_patch_only` | Patch attempt across major/minor boundary. |
| `engine_unreachable` | Engine returned 5xx or connection error. |
| `engine_protocol_violation` | Engine response missing required fields or carries unexpected payload. |
| `engine_validation_error` | Engine returned 4xx with per-command results. |
| `service_unavailable` | Dependency (PostgreSQL, Redis, Lobby, RTM) unavailable. |
| `internal_error` | Unspecified failure. |
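
The check behind `semver_patch_only` can be sketched as a comparison of the major and minor components. This is a simplified illustration under stated assumptions: versions are plain `MAJOR.MINOR.PATCH` strings without a leading `v`, and the helper names are hypothetical. The actual service may rely on a semver library instead.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor extracts the first two components of a MAJOR.MINOR.PATCH string.
// Anything after the second dot (patch, pre-release, build) is not inspected.
func majorMinor(version string) (major, minor int, err error) {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) != 3 {
		return 0, 0, fmt.Errorf("not a MAJOR.MINOR.PATCH version: %q", version)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", version, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("bad minor in %q: %w", version, err)
	}
	return major, minor, nil
}

// isPatchOnly reports whether target stays within current's major.minor line.
func isPatchOnly(current, target string) (bool, error) {
	cMaj, cMin, err := majorMinor(current)
	if err != nil {
		return false, err
	}
	tMaj, tMin, err := majorMinor(target)
	if err != nil {
		return false, err
	}
	return cMaj == tMaj && cMin == tMin, nil
}

func main() {
	for _, target := range []string{"1.4.7", "1.5.0"} {
		ok, err := isPatchOnly("1.4.2", target)
		fmt.Printf("1.4.2 -> %s patch-only=%v err=%v\n", target, ok, err)
	}
}
```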

## Configuration

All service-specific variables use the `GAMEMASTER_` prefix. The service
fails fast on startup when a required variable is missing.

### Required

- `GAMEMASTER_INTERNAL_HTTP_ADDR`
- `GAMEMASTER_POSTGRES_PRIMARY_DSN`
- `GAMEMASTER_REDIS_MASTER_ADDR`
- `GAMEMASTER_REDIS_PASSWORD`
- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`
- `GAMEMASTER_RTM_INTERNAL_BASE_URL`
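
A fail-fast check over this list could look like the sketch below. It is illustrative only: `lookupFunc`, `missingRequired`, and `durationOrDefault` are hypothetical names, and treating an empty value as missing is an assumption. The lookup is injected so the example runs without touching the real environment; a service would pass `os.LookupEnv`.

```go
package main

import (
	"fmt"
	"time"
)

// lookupFunc has the same shape as os.LookupEnv.
type lookupFunc func(name string) (value string, ok bool)

// requiredVars mirrors the required list in this README.
var requiredVars = []string{
	"GAMEMASTER_INTERNAL_HTTP_ADDR",
	"GAMEMASTER_POSTGRES_PRIMARY_DSN",
	"GAMEMASTER_REDIS_MASTER_ADDR",
	"GAMEMASTER_REDIS_PASSWORD",
	"GAMEMASTER_LOBBY_INTERNAL_BASE_URL",
	"GAMEMASTER_RTM_INTERNAL_BASE_URL",
}

// missingRequired returns every required variable that is unset or empty.
func missingRequired(lookup lookupFunc) []string {
	var missing []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

// durationOrDefault parses an optional duration variable such as "2s".
func durationOrDefault(lookup lookupFunc, name string, def time.Duration) (time.Duration, error) {
	v, ok := lookup(name)
	if !ok || v == "" {
		return def, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", name, err)
	}
	return d, nil
}

func main() {
	// Fake environment for the demo; only one required variable is set.
	env := map[string]string{"GAMEMASTER_INTERNAL_HTTP_ADDR": ":8097"}
	lookup := func(name string) (string, bool) { v, ok := env[name]; return v, ok }

	if missing := missingRequired(lookup); len(missing) > 0 {
		fmt.Println("startup would abort, missing:", missing)
	}
	timeout, err := durationOrDefault(lookup, "GAMEMASTER_POSTGRES_OPERATION_TIMEOUT", 2*time.Second)
	fmt.Println(timeout, err)
}
```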

### Configuration groups

**Listener:**

- `GAMEMASTER_INTERNAL_HTTP_ADDR` (e.g., `:8097`).
- `GAMEMASTER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `GAMEMASTER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `30s`).
- `GAMEMASTER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).

**PostgreSQL:**

- `GAMEMASTER_POSTGRES_PRIMARY_DSN`
  (`postgres://gamemaster:<pwd>@<host>:5432/galaxy?search_path=gamemaster&sslmode=disable`).
- `GAMEMASTER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used
  in v1).
- `GAMEMASTER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `GAMEMASTER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `GAMEMASTER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `GAMEMASTER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).

**Redis:**

- `GAMEMASTER_REDIS_MASTER_ADDR`.
- `GAMEMASTER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `GAMEMASTER_REDIS_PASSWORD`.
- `GAMEMASTER_REDIS_DB` (default `0`).
- `GAMEMASTER_REDIS_OPERATION_TIMEOUT` (default `2s`).

**Streams:**

- `GAMEMASTER_REDIS_LOBBY_EVENTS_STREAM` (default `gm:lobby_events`).
- `GAMEMASTER_REDIS_HEALTH_EVENTS_STREAM` (default
  `runtime:health_events`).
- `GAMEMASTER_REDIS_NOTIFICATION_INTENTS_STREAM` (default
  `notification:intents`).
- `GAMEMASTER_STREAM_BLOCK_TIMEOUT` (default `5s`).

**Engine client:**

- `GAMEMASTER_ENGINE_CALL_TIMEOUT` (default `30s` — covers turn generation
  on large games).
- `GAMEMASTER_ENGINE_PROBE_TIMEOUT` (default `5s` — for inspect-style
  reads).

**Lobby internal client:**

- `GAMEMASTER_LOBBY_INTERNAL_BASE_URL`.
- `GAMEMASTER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).

**Runtime Manager internal client:**

- `GAMEMASTER_RTM_INTERNAL_BASE_URL`.
- `GAMEMASTER_RTM_INTERNAL_TIMEOUT` (default `5s`).

**Scheduler:**

- `GAMEMASTER_SCHEDULER_TICK_INTERVAL` (default `1s`).
- `GAMEMASTER_TURN_GENERATION_TIMEOUT` (default `60s`).

**Membership cache:**

- `GAMEMASTER_MEMBERSHIP_CACHE_TTL` (default `30s`).
- `GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES` (default `4096`; LRU eviction).
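
The interaction of the TTL and the LRU bound can be sketched with `container/list`. This is a simplified illustration, not GM's cache: the type and method names are hypothetical, the clock is passed in explicitly for determinism, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
	"time"
)

type cacheEntry struct {
	gameID    string
	members   []string
	expiresAt time.Time
}

// membershipCache keeps at most maxGames rosters, each valid for ttl.
type membershipCache struct {
	ttl      time.Duration
	maxGames int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // game_id -> element holding *cacheEntry
}

func newMembershipCache(ttl time.Duration, maxGames int) *membershipCache {
	return &membershipCache{ttl: ttl, maxGames: maxGames, order: list.New(), items: map[string]*list.Element{}}
}

// at is a demo helper that builds a timestamp from a second offset.
func at(seconds int) time.Time { return time.Unix(int64(seconds), 0) }

func (c *membershipCache) remove(el *list.Element) {
	c.order.Remove(el)
	delete(c.items, el.Value.(*cacheEntry).gameID)
}

// Get returns the cached roster; expired entries are dropped and count as misses.
func (c *membershipCache) Get(gameID string, now time.Time) ([]string, bool) {
	el, ok := c.items[gameID]
	if !ok {
		return nil, false
	}
	e := el.Value.(*cacheEntry)
	if !now.Before(e.expiresAt) {
		c.remove(el)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.members, true
}

// Put stores a roster and evicts least recently used games beyond maxGames.
func (c *membershipCache) Put(gameID string, members []string, now time.Time) {
	if el, ok := c.items[gameID]; ok {
		c.remove(el)
	}
	e := &cacheEntry{gameID: gameID, members: members, expiresAt: now.Add(c.ttl)}
	c.items[gameID] = c.order.PushFront(e)
	for c.order.Len() > c.maxGames && c.order.Len() > 0 {
		c.remove(c.order.Back())
	}
}

// Invalidate drops a game's roster, for example after a membership change.
func (c *membershipCache) Invalidate(gameID string) {
	if el, ok := c.items[gameID]; ok {
		c.remove(el)
	}
}

func main() {
	c := newMembershipCache(30*time.Second, 2)
	c.Put("g1", []string{"u1"}, at(0))
	c.Put("g2", []string{"u2"}, at(0))
	c.Get("g1", at(1))                 // touch g1 so g2 becomes least recently used
	c.Put("g3", []string{"u3"}, at(2)) // evicts g2
	_, ok := c.Get("g2", at(3))
	fmt.Println("g2 cached:", ok)
}
```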

**Logging:**

- `GAMEMASTER_LOG_LEVEL` (default `info`).

**Lifecycle:**

- `GAMEMASTER_SHUTDOWN_TIMEOUT` (default `30s`).

**Telemetry:** uses the standard OTLP env vars
(`OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_PROTOCOL`, etc.)
shared with other Galaxy services.

## Observability

### Metrics (OpenTelemetry, low cardinality)

- `gamemaster.register_runtime.outcomes` — counter; labels `outcome`,
  `error_code`.
- `gamemaster.turn_generation.outcomes` — counter; labels `outcome`,
  `error_code`, `trigger` (`scheduler | force`).
- `gamemaster.command_execute.outcomes` — counter; labels `outcome`,
  `error_code`.
- `gamemaster.order_put.outcomes` — counter; labels `outcome`,
  `error_code`.
- `gamemaster.report_get.outcomes` — counter; labels `outcome`,
  `error_code`.
- `gamemaster.banish.outcomes` — counter; labels `outcome`, `error_code`.
- `gamemaster.engine_call.latency` — histogram; label `op` (`init |
  status | turn | banish | command | order | report`).
- `gamemaster.runtime_records_by_status` — gauge; label `status`.
- `gamemaster.scheduler.due_games` — gauge.
- `gamemaster.health_events.consumed` — counter.
- `gamemaster.lobby_events.published` — counter; label `event_type`.
- `gamemaster.notification.publish_attempts` — counter; labels
  `notification_type`, `result` (`ok | error`).
- `gamemaster.membership_cache.hits` — counter; label `result` (`hit |
  miss | invalidate`).
- `gamemaster.engine_versions_total` — gauge.

Metrics avoid high-cardinality attributes such as `game_id` and `user_id`.

### Structured logs (slog JSON to stdout)

Common fields on every entry: `service=gamemaster`, `request_id`,
`trace_id`, `span_id`, `game_id` (when known), `user_id` (when known),
`op_kind`, `op_source`, `outcome`, `error_code`.

Worker-specific fields: `event_type` (lobby-events publisher),
`stream_entry_id` (health-events consumer), `turn` (turn-generation),
`engine_endpoint` (engine calls).
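
A logger with this shape can be built from the standard `log/slog` JSON handler. The sketch below is illustrative: `newLogger` and `captureEntry` are hypothetical helpers, and the field values are made up for the demo.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger that stamps service=gamemaster on every entry.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", "gamemaster")
}

// captureEntry logs one entry into a buffer and decodes it back into a map,
// which is handy for asserting on fields in tests.
func captureEntry(msg string, args ...any) (map[string]any, error) {
	var buf bytes.Buffer
	newLogger(&buf).Info(msg, args...)
	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	// Production-style usage: one JSON object per line on stdout.
	logger := newLogger(os.Stdout)
	logger.Info("turn generated",
		"game_id", "g-42",
		"op_kind", "turn_generation",
		"op_source", "scheduler",
		"outcome", "success",
		"turn", 7,
	)

	// Inspecting an entry programmatically.
	fields, err := captureEntry("engine call", "engine_endpoint", "http://galaxy-game-g-42:8080")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["service"], fields["engine_endpoint"])
}
```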

## Verification

Service-level (per [`./PLAN.md`](./PLAN.md)):

- Unit tests for every service-layer operation against mocked engine,
  Lobby, RTM, notification publisher, lobby-events publisher.
- Adapter tests using `testcontainers-go` for PostgreSQL and Redis.
- Contract tests for `internal-openapi.yaml` and
  `runtime-events-asyncapi.yaml`.

Service-local integration suite under `gamemaster/integration/`:

- Register-runtime + first turn happy path against the real
  `galaxy/game` test image.
- Force-next-turn skip behaviour.
- Engine version registry CRUD + resolve.
- Admin stop synchronous REST.
- Banish round-trip.
- Membership invalidation hook.
- `runtime:health_events` consumption.

Inter-service suites under `integration/lobbygm/` and
`integration/lobbygmrtm/`:

- `lobbygm`: real Lobby + real GM + real engine + stub RTM. Covers
  enrollment → register-runtime → first turn → finish + capability
  evaluation.
- `lobbygmrtm`: full Lobby + GM + RTM + engine. Covers happy path and the
  documented failure paths from `ARCHITECTURE.md` flow §4.

Manual smoke (development):

```sh
docker network create galaxy-net   # once
GAMEMASTER_INTERNAL_HTTP_ADDR=:8097 \
GAMEMASTER_POSTGRES_PRIMARY_DSN='postgres://gamemaster:secret@localhost:5432/galaxy?search_path=gamemaster&sslmode=disable' \
GAMEMASTER_REDIS_MASTER_ADDR=localhost:6379 \
GAMEMASTER_REDIS_PASSWORD=secret \
GAMEMASTER_LOBBY_INTERNAL_BASE_URL=http://localhost:8095 \
GAMEMASTER_RTM_INTERNAL_BASE_URL=http://localhost:8096 \
... go run ./gamemaster/cmd/gamemaster
```

After start, `curl http://localhost:8097/readyz` returns `200`. Driving
Lobby through its public start flow brings up `galaxy-game-{game_id}`
containers; GM then registers each runtime, generates turns on the
configured schedule, and propagates events to Lobby.