Game Master

Game Master (GM) is the only Galaxy platform service permitted to talk to running game engine containers. It owns the runtime and operational state of running games, the engine version registry, the platform mapping of (user_id ↔ race_name ↔ engine_player_uuid), the per-game turn scheduler, and the synchronous and asynchronous boundaries that other services use to interact with running games.

References

  • ../ARCHITECTURE.md — system architecture, §8 Game Master.
  • ../TESTING.md §8 — testing matrix for GM.
  • ./PLAN.md — staged implementation plan.
  • ./docs/README.md — service-local documentation entry point (created at PLAN stage 24).
  • ./docs/stage06-contract-files.md — decisions behind the OpenAPI and AsyncAPI specs frozen at PLAN stage 06.
  • ./docs/stage07-notification-catalog-audit.md — notification catalog audit and producer-side freeze test added at PLAN stage 07.
  • ./docs/stage08-module-skeleton.md — module skeleton wiring decisions (config groups, telemetry instruments, Makefile targets, deferred dependencies) recorded at PLAN stage 08.
  • ./docs/stage09-postgres-migration.md — PostgreSQL schema, embedded migration, jet generation pipeline, and runtime wiring landed at PLAN stage 09.
  • ./docs/stage10-domain-and-ports.md — domain types, port interfaces, and the six stage-10 decisions (operation domain package, membership DTO placement, engine-version options shape, schedule wrapper signature, recovery transition, deferred mock destination) landed at PLAN stage 10.
  • ./docs/stage11-persistence-adapters.md — PostgreSQL stores (runtimerecordstore, engineversionstore, playermappingstore, operationlog), the Redis offset store, and the eight stage-11 decisions (sqlx/pgtest local clones, CAS pattern, port-level Now extension, domain conflict sentinels, jsonb cast, idempotent Deprecate, multi-row BulkInsert, miniredis dependency) landed at PLAN stage 11.
  • ./docs/stage12-external-clients.md — outbound adapters (engine, Lobby, Runtime Manager, notification intent publisher, lobby-events publisher) and the seven stage-12 decisions (per-call engine base URL, dual engine timeout dispatch, engine population rounding, Lobby pagination cap, no extra RTM sentinels, AsyncAPI-aligned XADD encoding for gm:lobby_events, Makefile mocks-target guard) landed at PLAN stage 12.
  • ./docs/stage13-register-runtime.md — register-runtime service-layer orchestrator and the five stage-13 decisions (RuntimeRecordStore.Delete extension, engine 4xx/5xx classification split, engine response validated as engine_protocol_violation, initial snapshot carries player_turn_stats from /admin/init, two-flag rollback gating) landed at PLAN stage 13.
  • ./docs/stage14-engine-version-registry.md — engine version registry service-layer orchestrator (List, Get, Create, Update, Deprecate, Delete, ResolveImageRef) and the five stage-14 decisions (EngineVersionStore.Delete port extension, reference probe before hard delete, new engine_version_delete op_kind in schema and domain, operation_log.game_id overloaded as audit subject for registry entries, JSON-object validation for options) landed at PLAN stage 14.
  • ./docs/stage15-scheduler-and-turn-generation.md — scheduler ticker, turn-generation orchestrator, and snapshot publisher and the seven stage-15 decisions (LobbyClient.GetGameSummary extension with fail-soft game_name fallback, telemetry-only Trigger parameter, two-CAS pattern with external-mutation conflict, single-snapshot-per-outcome cadence, player_mappings as recipient source, stateless scheduler utility, in-flight set on the ticker) landed at PLAN stage 15.
  • ./docs/stage16-membership-cache-and-invalidation.md — hot-path services (commandexecute, orderput, reportget), membership cache, and the six stage-16 decisions (no runtime_not_running for reports, GM-side envelope rewrite commandscmd with injected actor, hot-path skips operation_log, hand-rolled per-game inflight tracker, raw status string return, missing-mapping surfaces as forbidden) landed at PLAN stage 16.
  • ./docs/stage17-admin-operations.md — admin service-layer operations (adminstop, adminforce, adminpatch, adminbanish, livenessreply) and the six stage-17 decisions (RuntimeRecordStore.UpdateImage extension, adminstop idempotent on terminal statuses and conflict on starting, adminforce always sets skip_next_tick, adminbanish without status check and missing race surfaces as forbidden, livenessreply 200 + empty status on runtime_not_found, RTM failures map to service_unavailable) landed at PLAN stage 17.
  • ./docs/stage18-health-events-consumer.md — runtime:health_events consumer worker and the seven stage-18 decisions (event-type taxonomy expanded to seven values with container_started and probe_recovered, CAS-conflict fallback to health-only update, new RuntimeRecordStore.UpdateEngineHealth port method, in-memory dedupe of last-emitted summaries, read-after-write snapshot construction, health_events stream offset label, worker wiring deferred to Stage 19) landed at PLAN stage 18.
  • ./api/internal-openapi.yaml — internal trusted REST contract.
  • ./api/runtime-events-asyncapi.yaml — gm:lobby_events Redis Stream contract.
  • ../game/README.md — game engine container contract (env, ports, admin and player REST surfaces, /healthz).
  • ../lobby/README.md — Game Lobby integration with GM.
  • ../rtmanager/README.md — Runtime Manager contract used synchronously by GM admin operations.

Purpose

A running Galaxy game lives in exactly one Docker container managed by Runtime Manager. The platform must:

  • register a freshly started container with platform-level membership;
  • initialise the engine with the agreed race roster;
  • accept and forward player commands and orders to the engine;
  • route per-player report reads;
  • generate turns according to a schedule;
  • detect game finish and propagate it back to platform-level state;
  • expose runtime/operational controls (force-next-turn, stop, patch, banish);
  • own the catalogue of supported engine versions and resolve image_ref values for Game Lobby.

Game Master is the single component that performs these actions. It does not own platform metadata of games (that is Game Lobby), Docker control (that is Runtime Manager), or the full game state (that is the engine container). Engine state on disk is the engine's domain; GM never reads or writes the bind-mounted state directory.

Scope

Game Master is the source of truth for:

  • the runtime mapping game_id → engine_endpoint for every running game;
  • the runtime status (starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished);
  • the current turn number and the next-tick timestamp;
  • the per-game (user_id, race_name, engine_player_uuid) triple;
  • the engine version registry: (version, image_ref, options, status);
  • the durable history of every operation GM performed (operation_log);
  • the latest engine health summary per game.

Game Master is not the source of truth for:

  • platform game records (created, draft, enrollment, finished metadata) — owned by Game Lobby;
  • container lifecycle and Docker reality — owned by Runtime Manager;
  • in-game world state (planets, ships, science, reports) — owned by the engine container;
  • platform user identity and entitlements — owned by User Service;
  • in-game race_name reservations and the Race Name Directory — owned by Game Lobby.

Non-Goals

  • Multi-instance operation in v1. GM runs as a single process; the in-process scheduler is authoritative. Multi-instance with leader election is an explicit future iteration.
  • Direct Docker access. GM never imports the Docker SDK; every container operation goes through Runtime Manager over trusted internal REST.
  • Player removal/block at platform level. Game Lobby owns that decision; GM only performs the engine-side banish call when explicitly invoked.
  • Pause/resume of a running game on the platform side. Game Lobby.paused is a platform-only state; GM only answers a liveness probe used by Lobby's resume flow.
  • Automatic semver-patch upgrades. Patch is always an explicit admin operation against a target engine version present in the registry.
  • TLS or mTLS on the internal listener. GM trusts its network segment.
  • Direct delivery of player-visible push events. Notification Service owns user-targeted push delivery; GM publishes notification intents only.
  • A separate Admin Service. GM exposes its trusted internal REST surface; Admin Service will adopt it in a later iteration.
  • Engine state file management. Backup, archival, and cleanup of the bind-mounted state directories are operator concerns.

Position in the System

flowchart LR
    Gateway["Edge Gateway"]
    Lobby["Game Lobby"]
    Admin["Admin Service\n(future)"]
    GM["Game Master"]
    RTM["Runtime Manager"]
    Notify["Notification Service"]
    Engine["Game Engine container\n(galaxy/game)"]
    Postgres["PostgreSQL\nschema gamemaster"]
    Redis["Redis\nstreams + caches"]

    Gateway -- "verified player commands\n(REST/JSON)" --> GM
    Lobby -- "register-runtime,\nimage-ref resolve,\nmemberships invalidate" --> GM
    Admin -- "internal REST" --> GM
    GM -- "engine HTTP API" --> Engine
    GM -- "stop / restart / patch" --> RTM
    GM -- "notification:intents" --> Notify
    GM -- "gm:lobby_events" --> Redis
    Redis -- "runtime:health_events" --> GM
    GM --> Postgres

Edge Gateway routes verified player message types (game.command.execute, game.order.put, game.report.get) to GM as trusted REST/JSON after transcoding from FlatBuffers. Game Lobby calls GM synchronously to register runtimes after a successful container start, to resolve image_ref from the engine version registry, to invalidate membership cache on roster changes, and to verify GM liveness during platform resume. Game Master calls Runtime Manager synchronously over REST for stop, restart, and patch. Runtime Manager publishes runtime:health_events, which GM consumes asynchronously. GM publishes gm:lobby_events consumed by Game Lobby, and notification:intents consumed by Notification Service.

Responsibility Boundaries

Game Master is responsible for:

  • registering a freshly started container into platform-level runtime state;
  • initialising the engine with the race roster received from Lobby;
  • maintaining the platform mapping of user_id, race_name, and engine_player_uuid;
  • forwarding player commands, orders, and report reads to the engine after authorising the actor;
  • generating turns on schedule, including the force-next-turn skip rule;
  • evaluating engine finish on every turn boundary;
  • publishing runtime snapshot updates and the final game-finish event;
  • consuming runtime health events from Runtime Manager and updating its per-game health summary;
  • exposing the engine version registry CRUD;
  • driving admin-level runtime operations (stop, force-next-turn, patch, banish) by calling Runtime Manager and the engine on demand.

Game Master is not responsible for:

  • creating or stopping containers on Docker (that is Runtime Manager);
  • evaluating whether a game is allowed to start (that is Game Lobby);
  • deriving recipient user lists for non-game notifications (that is Notification Service);
  • verifying authenticated transport, signatures, freshness, and replay (that is Edge Gateway);
  • mapping user_id to platform-level membership (that is Game Lobby).

Engine Container Contract

The engine container is galaxy/game. GM uses two route classes:

| Class | Path | Purpose |
| --- | --- | --- |
| Admin (GM-only) | POST /api/v1/admin/init | Initialise the engine with a race roster. |
| Admin (GM-only) | GET /api/v1/admin/status | Read the full game state. |
| Admin (GM-only) | PUT /api/v1/admin/turn | Generate the next turn. |
| Admin (GM-only) | POST /api/v1/admin/race/banish | Deactivate a race after permanent platform removal. Body {race_name}. |
| Player | PUT /api/v1/command | Execute a batch of player commands. |
| Player | PUT /api/v1/order | Validate and store a batch of player orders. |
| Player | GET /api/v1/report | Fetch per-player turn report. |
| Probe | GET /healthz | Liveness probe used by Runtime Manager and operator tooling. |

Admin paths are unauthenticated but routed only from inside the trusted network segment that connects GM to the engine container. The engine does not enforce caller identity — network-level segmentation is the boundary.

StateResponse carries an extra boolean finished field. When true on a turn-generation response, GM treats the game as finished and runs the finish flow described below. The conditional logic that flips finished to true lives in the engine's domain code and is not GM's concern.

The engine endpoint URL is the engine_endpoint value handed to GM by Game Lobby during register-runtime: http://galaxy-game-{game_id}:8080. The DNS name is stable across restart and patch.

Runtime Surface

Listeners

| Listener | Default address | Purpose |
| --- | --- | --- |
| Internal HTTP | :8097 (GAMEMASTER_INTERNAL_HTTP_ADDR) | Probes (/healthz, /readyz) and the trusted REST surface for Edge Gateway, Game Lobby, and Admin Service. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted network segment. Authentication of player commands has already happened at Edge Gateway; GM enforces authorisation only.

Background workers

| Worker | Driver | Description |
| --- | --- | --- |
| Scheduler ticker | 1 s loop | Scans runtime_records for due next_generation_at, runs the turn-generation service for each, recomputes next_generation_at from turn_schedule (skipping one tick when skip_next_tick=true is set). |
| runtime:health_events consumer | Redis Stream | XREADs from runtime:health_events (produced by RTM), updates runtime_records.engine_health summary, debounces runtime_snapshot_update publication. |

Startup dependencies

In start order:

  1. PostgreSQL primary (GAMEMASTER_POSTGRES_PRIMARY_DSN). Embedded goose migrations apply synchronously before any listener opens.
  2. Redis master (GAMEMASTER_REDIS_MASTER_ADDR).
  3. Telemetry exporter (OTLP grpc/http or stdout).
  4. Internal HTTP listener.
  5. Health-events consumer worker.
  6. Scheduler ticker worker.

A failure in any step exits the process non-zero.

Probes

/healthz reports liveness — the process responds when the HTTP server is alive.

/readyz reports readiness — 200 only when the PostgreSQL pool can ping the primary and the Redis master client can ping. No deeper dependency is checked synchronously; the engine is reached only on demand.
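
The readiness rule can be illustrated with a small handler. This is a minimal sketch, not GM's actual code: the `Pinger` interface, the `up` and `down` stand-ins, and the 503 status on failure are assumptions (the text above only fixes the 200 case).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Pinger is a hypothetical abstraction over the PostgreSQL pool and the Redis client.
type Pinger interface {
	Ping(ctx context.Context) error
}

// pingFunc adapts a plain function to Pinger so the sketch needs no real backends.
type pingFunc func(ctx context.Context) error

func (f pingFunc) Ping(ctx context.Context) error { return f(ctx) }

// Stand-ins for a healthy and an unreachable dependency.
var (
	up   Pinger = pingFunc(func(context.Context) error { return nil })
	down Pinger = pingFunc(func(context.Context) error { return errors.New("connection refused") })
)

// readyz answers 200 only when every dependency pings within the timeout.
// The 503 on failure is an assumption of this sketch.
func readyz(timeout time.Duration, deps ...Pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		for _, d := range deps {
			if err := d.Ping(ctx); err != nil {
				w.WriteHeader(http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe issues one in-memory GET /readyz and returns the status code.
func probe(deps ...Pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
	readyz(time.Second, deps...).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(up, up))   // both dependencies healthy
	fmt.Println(probe(up, down)) // the Redis stand-in is unreachable
}
```

The engine is deliberately absent from the dependency list, matching the rule that it is reached only on demand.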

Both probes are documented in ./api/internal-openapi.yaml.

Lifecycles

Register-runtime

Triggered by: Game Lobby after a successful container start, calling POST /api/v1/internal/games/{game_id}/register-runtime with body {engine_endpoint, members:[{user_id, race_name}], target_engine_version, turn_schedule}.

Flow on success:

  1. Validate request shape; reject with invalid_request if any required field is missing.
  2. Reject with conflict if runtime_records.{game_id} already exists.
  3. Resolve image_ref for target_engine_version from engine_versions; reject with engine_version_not_found when missing.
  4. Persist runtime_records with status=starting, engine_endpoint, current_image_ref, current_engine_version, turn_schedule, and created_at.
  5. Call engine POST /api/v1/admin/init with the race-name list derived from members.
  6. Read StateResponse and persist one player_mappings row per player: (game_id, user_id, race_name, engine_player_uuid).
  7. CAS runtime_records.status: starting → running. Persist current_turn=0 and next_generation_at computed from turn_schedule.
  8. Append operation_log entry (op_kind=register_runtime, outcome=success).
  9. Publish runtime_snapshot_update to gm:lobby_events.
  10. Return 200 with the persisted runtime_records row.

Failure paths:

| Failure | Side effect | Outcome to caller |
| --- | --- | --- |
| Invalid envelope | None | 400 invalid_request |
| runtime_records already exists | None | 409 conflict |
| Engine /admin/init returns 4xx | Roll back runtime_records; append failure to operation_log | 502 engine_validation_error |
| Engine /admin/init returns 5xx or fails at the transport layer | Roll back; append failure | 502 engine_unreachable |
| Engine response missing players or contains races not in roster | Roll back; append failure | 502 engine_protocol_violation |
| PostgreSQL transaction failure | Roll back; append failure if possible | 503 service_unavailable |

A failed register-runtime leaves no runtime_records row and no player_mappings rows. Game Lobby then transitions the platform game record to paused (per the architecture's flow §4 forced-pause path).
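
The engine-related failure paths above can be sketched as a small classification function. The function names, the `errTransport` value, and the exact roster-comparison rule are assumptions made for illustration; only the status codes and error codes come from the failure paths listed above.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransport is a hypothetical transport-layer failure used by the demo.
var errTransport = errors.New("dial tcp: connection refused")

// rosterMatches reports whether the engine returned exactly the races that were sent.
func rosterMatches(roster, engineRaces []string) bool {
	want := make(map[string]bool, len(roster))
	for _, r := range roster {
		want[r] = true
	}
	seen := make(map[string]bool, len(engineRaces))
	for _, r := range engineRaces {
		if !want[r] {
			return false // the engine reported a race that is not in the roster
		}
		seen[r] = true
	}
	return len(seen) == len(want) // false when a rostered race is missing
}

// classifyInit maps the outcome of engine POST /api/v1/admin/init to the
// caller-facing HTTP status and error code; an empty code means success.
func classifyInit(transportErr error, engineStatus int, roster, engineRaces []string) (int, string) {
	switch {
	case transportErr != nil, engineStatus >= 500:
		return 502, "engine_unreachable"
	case engineStatus >= 400:
		return 502, "engine_validation_error"
	case !rosterMatches(roster, engineRaces):
		return 502, "engine_protocol_violation"
	}
	return 200, ""
}

func main() {
	roster := []string{"Humans", "Martians"}
	fmt.Println(classifyInit(errTransport, 0, roster, nil))
	fmt.Println(classifyInit(nil, 422, roster, nil))
	fmt.Println(classifyInit(nil, 200, roster, []string{"Humans"}))
	fmt.Println(classifyInit(nil, 200, roster, []string{"Martians", "Humans"}))
}
```

Checking transport errors and 5xx before 4xx keeps the two engine classes separate, which is the split the failure paths describe.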

Turn generation

Triggered by: the scheduler ticker when now >= next_generation_at for a game in status=running, or by an admin invocation of force-next-turn.

Flow on success:

  1. CAS runtime_records.status: running → generation_in_progress. If the CAS fails (status changed concurrently), the tick is skipped silently.
  2. Call engine PUT /api/v1/admin/turn. Engine returns StateResponse with the new turn and the updated player[] array.
  3. Persist runtime_records.current_turn and refresh runtime_records.engine_health summary.
  4. If StateResponse.finished == true:
    • CAS runtime_records.status: generation_in_progress → finished;
    • publish game_finished to gm:lobby_events with {game_id, final_turn_number, finished_at_ms, player_turn_stats[]};
    • publish game.finished notification intent to all active members.
  5. If StateResponse.finished == false:
    • CAS runtime_records.status: generation_in_progress → running;
    • recompute next_generation_at from turn_schedule. If skip_next_tick=true, advance by one extra cron step and clear the flag;
    • publish runtime_snapshot_update to gm:lobby_events with {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[]};
    • publish game.turn.ready notification intent to all active members.
  6. Append operation_log entry (op_kind=turn_generation, outcome=success).

Failure paths:

| Failure | Side effect | Outcome |
| --- | --- | --- |
| Engine timeout / 5xx | CAS status: generation_in_progress → generation_failed; publish runtime_snapshot_update; publish game.generation_failed admin notification | Logged; ticker leaves the game in generation_failed until manual recovery (admin issues force-next-turn or stop). |
| Persistence failure after engine success | Append failure to operation_log; status stays generation_in_progress | Health-summary update on next probe will resync. |
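
The status transitions of the turn-generation flow and its engine-failure path can be sketched with an in-memory compare-and-swap. Everything here is a hypothetical stand-in: `statusStore` replaces the runtime_records table, and the `engineOK`, `engineDone`, and `engineDown` functions replace the engine call. Publishing, persistence of the turn number, and the operation log are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Status string

const (
	Running    Status = "running"
	InProgress Status = "generation_in_progress"
	Failed     Status = "generation_failed"
	Finished   Status = "finished"
)

// statusStore is an in-memory stand-in for runtime_records.status.
type statusStore struct {
	mu     sync.Mutex
	status map[string]Status
}

func newStore(gameID string, st Status) *statusStore {
	return &statusStore{status: map[string]Status{gameID: st}}
}

// CAS swaps the status only if it still equals from.
func (s *statusStore) CAS(gameID string, from, to Status) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != from {
		return false
	}
	s.status[gameID] = to
	return true
}

func (s *statusStore) Get(gameID string) Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[gameID]
}

// Hypothetical engine outcomes: (finished, error).
func engineOK() (bool, error)   { return false, nil }
func engineDone() (bool, error) { return true, nil }
func engineDown() (bool, error) { return false, errors.New("engine timeout") }

// generateTurn reports whether the tick ran; a lost first CAS skips it silently.
func generateTurn(s *statusStore, gameID string, engineTurn func() (bool, error)) bool {
	if !s.CAS(gameID, Running, InProgress) {
		return false
	}
	finished, err := engineTurn()
	next := Running
	switch {
	case err != nil:
		next = Failed
	case finished:
		next = Finished
	}
	// The second CAS can lose against an external mutation; this sketch ignores that case.
	s.CAS(gameID, InProgress, next)
	return true
}

func main() {
	s := newStore("g1", Running)
	fmt.Println(generateTurn(s, "g1", engineOK), s.Get("g1"))
	fmt.Println(generateTurn(s, "g1", engineDown), s.Get("g1"))
	fmt.Println(generateTurn(s, "g1", engineOK), s.Get("g1")) // skipped: not running
}
```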

player_turn_stats[] is built from StateResponse.player[] by mapping raceName → user_id through player_mappings and projecting {user_id, planets, population}. ships_built is intentionally absent (see ./docs/stage01-architecture-sync.md).
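
A minimal sketch of this projection follows. The struct names, the integer field types, and the decision to collect races without a mapping in a separate slice are assumptions; the source only states the raceName → user_id lookup and the projected fields.

```go
package main

import "fmt"

// EnginePlayer mirrors the fields of StateResponse.player[] that the projection reads.
// The integer types are an assumption of this sketch.
type EnginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStat is one element of player_turn_stats[].
type PlayerTurnStat struct {
	UserID     string
	Planets    int
	Population int
}

// Demo data: two mapped races and one race with no player_mappings row.
var (
	demoPlayers = []EnginePlayer{
		{RaceName: "Humans", Planets: 5, Population: 1200},
		{RaceName: "Zorg", Planets: 1, Population: 90},
		{RaceName: "Martians", Planets: 3, Population: 800},
	}
	demoMappings = map[string]string{"Humans": "user-1", "Martians": "user-2"}
)

// projectTurnStats maps each engine player to its user_id via race name.
// Races without a mapping are returned separately instead of being guessed.
func projectTurnStats(players []EnginePlayer, raceToUser map[string]string) ([]PlayerTurnStat, []string) {
	stats := make([]PlayerTurnStat, 0, len(players))
	var unmapped []string
	for _, p := range players {
		userID, ok := raceToUser[p.RaceName]
		if !ok {
			unmapped = append(unmapped, p.RaceName)
			continue
		}
		stats = append(stats, PlayerTurnStat{UserID: userID, Planets: p.Planets, Population: p.Population})
	}
	return stats, unmapped
}

func main() {
	stats, unmapped := projectTurnStats(demoPlayers, demoMappings)
	fmt.Printf("%+v\n", stats)
	fmt.Println("unmapped:", unmapped)
}
```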

Force-next-turn

Triggered by: Admin Service or system-admin via POST /api/v1/internal/runtimes/{game_id}/force-next-turn.

Pre-conditions: runtime exists, status=running.

Flow:

  1. Run the turn-generation flow synchronously (the same code path the scheduler uses).
  2. After success, set runtime_records.skip_next_tick = true. The next regular tick computed from turn_schedule is then advanced by one extra step before being persisted as next_generation_at.
  3. Append operation_log entry (op_kind=force_next_turn).

The skip rule guarantees that the inter-turn spacing is never shorter than one schedule interval, regardless of when the force is issued.
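
The skip rule can be shown with a worked example. This sketch replaces the five-field cron expression with a hypothetical fixed-interval `Schedule` so that it runs on the standard library alone; `nextGenerationAt`, `hourly`, and `demoNow` are illustrative names, not GM's API.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a hypothetical stand-in for the parsed turn_schedule cron expression.
type Schedule interface {
	// Next returns the first scheduled instant strictly after t.
	Next(t time.Time) time.Time
}

// every fires on whole multiples of its duration, e.g. at the top of each hour.
type every time.Duration

func (e every) Next(t time.Time) time.Time {
	d := time.Duration(e)
	return t.Truncate(d).Add(d)
}

var (
	hourly  Schedule = every(time.Hour)
	demoNow          = time.Date(2026, 5, 3, 10, 20, 0, 0, time.UTC)
)

// nextGenerationAt computes the next due tick and the new value of skip_next_tick.
func nextGenerationAt(s Schedule, now time.Time, skipNextTick bool) (time.Time, bool) {
	next := s.Next(now)
	if skipNextTick {
		next = s.Next(next) // advance by one extra schedule step
	}
	return next, false // the flag is cleared once it has been consumed
}

func main() {
	regular, _ := nextGenerationAt(hourly, demoNow, false)
	skipped, _ := nextGenerationAt(hourly, demoNow, true)
	fmt.Println(regular.Format(time.RFC3339)) // 2026-05-03T11:00:00Z
	fmt.Println(skipped.Format(time.RFC3339)) // 2026-05-03T12:00:00Z
}
```

With an hourly schedule and a force issued at 10:20, the regular tick at 11:00 would arrive only 40 minutes after the forced turn. Skipping it moves the next turn to 12:00, so the gap is at least one full interval.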

Game finish

The finish flow is driven entirely by the engine signal finished:bool. GM never decides finish independently. After game_finished is published, Game Lobby transitions its platform record to finished, runs the capability evaluation, and finalises Race Name Directory state. The GM record stays in status=finished indefinitely; cleanup is operator-driven.

Banish (engine-side player removal)

Triggered by: Game Lobby synchronously calling POST /api/v1/internal/games/{game_id}/race/{race_name}/banish after a permanent membership removal at platform level.

Pre-conditions: runtime exists; race_name resolves to an existing player_mappings row.

Flow:

  1. Call engine POST /api/v1/admin/race/banish with {race_name}.
  2. On engine success, append operation_log entry (op_kind=banish, outcome=success).
  3. Return 204 to Lobby.

Failure path: engine error returns 502 engine_unreachable. Lobby treats this as a degraded state and may retry; the platform-level membership stays removed regardless.

Stop

Triggered by: system-admin via POST /api/v1/internal/runtimes/{game_id}/stop with body {reason}, where reason ∈ {admin_request, finished, timeout}.

Flow:

  1. Call Runtime Manager POST /api/v1/internal/runtimes/{game_id}/stop with the same reason.
  2. CAS runtime_records.status: * → stopped.
  3. Append operation_log entry.
  4. Publish runtime_snapshot_update reflecting the stopped status.

Patch

Triggered by: system-admin via POST /api/v1/internal/runtimes/{game_id}/patch with body {version}.

Pre-conditions:

  • engine_versions.{version} exists with status=active;
  • the new version is a semver-patch of the current version (same major and minor); otherwise reject with semver_patch_only.

Flow:

  1. Resolve image_ref from engine_versions.{version}.
  2. Call Runtime Manager POST /api/v1/internal/runtimes/{game_id}/patch with {image_ref}.
  3. On success, persist new current_image_ref and current_engine_version on runtime_records.
  4. Append operation_log entry.

The engine container is recreated by RTM with the same DNS name; the engine_endpoint is unchanged. GM does not call /admin/init again — the bind-mounted state directory is preserved and the engine resumes from the previous turn.
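
The semver-patch pre-condition can be sketched as follows. This is an illustration under stated assumptions: it accepts only plain MAJOR.MINOR.PATCH strings (no pre-release or build suffixes), and it does not decide whether a lower patch number is allowed, since the text above does not say.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCore splits a MAJOR.MINOR.PATCH string into its three numeric parts.
func parseCore(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q is not MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q has an invalid component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// isSemverPatch reports whether target shares major and minor with current.
// A false result corresponds to the semver_patch_only rejection.
func isSemverPatch(current, target string) (bool, error) {
	cur, err := parseCore(current)
	if err != nil {
		return false, err
	}
	tgt, err := parseCore(target)
	if err != nil {
		return false, err
	}
	return cur[0] == tgt[0] && cur[1] == tgt[1], nil
}

func main() {
	for _, target := range []string{"1.4.7", "1.5.0", "2.4.2", "1.4"} {
		ok, err := isSemverPatch("1.4.2", target)
		fmt.Println(target, ok, err)
	}
}
```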

Liveness reply (Lobby resume)

Triggered by: Game Lobby resuming a paused game, calling GET /api/v1/internal/games/{game_id}/liveness.

Flow: if runtime_records.{game_id} exists and status=running, return 200 {ready: true}. Otherwise return 200 {ready: false, status: "<observed status>"}.
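
The reply rule can be sketched as a pure function over GM's own record. The `LivenessReply` type and the `omitempty` choice are assumptions of this sketch; the stage-17 note in the references describes an empty status when the runtime is not found, which is modelled here by leaving the field out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LivenessReply is a hypothetical shape for the 200 response body.
type LivenessReply struct {
	Ready  bool   `json:"ready"`
	Status string `json:"status,omitempty"`
}

// liveness builds the reply from GM's stored view; it never contacts the engine.
func liveness(found bool, status string) LivenessReply {
	if !found {
		return LivenessReply{} // no record: not ready, empty status
	}
	if status == "running" {
		return LivenessReply{Ready: true}
	}
	return LivenessReply{Status: status}
}

// render encodes the reply as JSON; marshalling this struct cannot fail.
func render(r LivenessReply) string {
	b, _ := json.Marshal(r)
	return string(b)
}

func main() {
	fmt.Println(render(liveness(true, "running")))
	fmt.Println(render(liveness(true, "generation_failed")))
	fmt.Println(render(liveness(false, "")))
}
```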

This endpoint never calls the engine; it reflects GM's own view only.

Hot Path

Player commands and orders

Both game.command.execute and game.order.put use the same FlatBuffers schema (pkg/schema/fbs/order.fbs Order{updated_at, commands:[…]}). The gateway transcodes the verified payload to JSON via pkg/transcoder/order.go before calling GM.

GM endpoints:

  • POST /api/v1/internal/games/{game_id}/commands — execute now; engine PUT /api/v1/command.
  • POST /api/v1/internal/games/{game_id}/orders — validate-and-store; engine PUT /api/v1/order.

Both endpoints accept body {commands:[{cmd_id, @type, …}, …]} and the X-User-ID header. The actor field on the engine call is always set by GM from the authenticated user identity; GM never trusts a payload field for actor identification.

Pre-conditions:

  • runtime_records.{game_id} exists with status=running;
  • the user is an active member of the game (cache lookup);
  • player_mappings.(game_id, user_id) exists.

Errors:

  • runtime_not_found — runtime missing.
  • runtime_not_running — runtime_status is anything other than running.
  • forbidden — caller is not an active member.
  • engine_unreachable — engine returned 5xx.
  • engine_validation_error — engine returned 4xx; the body carries the engine's per-command result (cmd_applied, cmd_error_code).

Reports

GM endpoint: GET /api/v1/internal/games/{game_id}/reports/{turn} with the X-User-ID header.

Flow:

  1. Authorise: caller must be an active member of the game.
  2. Resolve race_name from player_mappings.
  3. Call engine GET /api/v1/report?player={race_name}&turn={turn}.
  4. Return the engine response verbatim. Reports are full per-player payloads and are never cached at the platform layer; the engine remains the source of truth.

Membership cache and invalidation

GM holds an in-process per-game TTL cache (default 30 s) of memberships loaded from Lobby /api/v1/internal/games/{id}/memberships. The cache shape is map[user_id]MembershipStatus plus a load timestamp. TTL is the safety-net fallback.

The primary invalidation mechanism is an explicit hook from Lobby:

  • Endpoint: POST /api/v1/internal/games/{game_id}/memberships/invalidate.
  • Lobby invokes it post-commit on every operation that mutates roster: application approval, application rejection, invite redeem, member remove, member block, user-lifecycle cascade.
  • Failed invalidation does not roll back Lobby state; the TTL safety net catches stale data within the next 30 s.

This is a deliberate tight coupling. The trade-off is recorded in ./PLAN.md Stage 16.
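
The interplay of the TTL and the explicit invalidation hook can be sketched with a small in-process cache. All type and function names here are hypothetical, the loader stands in for the Lobby memberships call, and holding the mutex during a load is a simplification that a real service would likely avoid.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type MembershipStatus string

const Active MembershipStatus = "active"

// Loader is a hypothetical hook that fetches the roster from Lobby.
type Loader func(gameID string) (map[string]MembershipStatus, error)

type cacheEntry struct {
	members  map[string]MembershipStatus
	loadedAt time.Time
}

type MembershipCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	load  Loader
	games map[string]cacheEntry
}

func NewMembershipCache(ttl time.Duration, now func() time.Time, load Loader) *MembershipCache {
	return &MembershipCache{ttl: ttl, now: now, load: load, games: map[string]cacheEntry{}}
}

// IsActive reloads the roster when the entry is missing or older than the TTL.
func (c *MembershipCache) IsActive(gameID, userID string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.games[gameID]
	if !ok || c.now().Sub(e.loadedAt) >= c.ttl {
		members, err := c.load(gameID)
		if err != nil {
			return false, err
		}
		e = cacheEntry{members: members, loadedAt: c.now()}
		c.games[gameID] = e
	}
	return e.members[userID] == Active, nil
}

// Invalidate drops one game's entry; this is what the Lobby hook would call.
func (c *MembershipCache) Invalidate(gameID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.games, gameID)
}

// demo wires the cache to a fake clock and a counting loader.
type demo struct {
	clock time.Time
	loads int
	cache *MembershipCache
}

func newDemo() *demo {
	d := &demo{clock: time.Date(2026, 5, 3, 8, 0, 0, 0, time.UTC)}
	d.cache = NewMembershipCache(30*time.Second,
		func() time.Time { return d.clock },
		func(string) (map[string]MembershipStatus, error) {
			d.loads++
			return map[string]MembershipStatus{"u1": Active, "u2": "removed"}, nil
		})
	return d
}

func (d *demo) advance(seconds int) { d.clock = d.clock.Add(time.Duration(seconds) * time.Second) }

func main() {
	d := newDemo()
	d.cache.IsActive("g1", "u1") // first call loads from the stand-in Lobby
	d.cache.IsActive("g1", "u1") // served from the cache
	d.cache.Invalidate("g1")     // explicit hook
	d.cache.IsActive("g1", "u1") // reloads after invalidation
	d.advance(31)                // TTL safety net
	active, _ := d.cache.IsActive("g1", "u2")
	fmt.Println(d.loads, active) // 3 false
}
```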

Engine Version Registry

The registry is the source of truth for which engine versions are deployable. CRUD is exposed on the GM internal port; Game Lobby consumes it synchronously to resolve image_ref for target_engine_version just before publishing a runtime:start_jobs envelope.

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/v1/internal/engine-versions | List versions; supports status filter. |
| POST | /api/v1/internal/engine-versions | Create a new version with version, image_ref, optional options. Validates semver shape and Docker reference. |
| GET | /api/v1/internal/engine-versions/{version} | Read one version. |
| PATCH | /api/v1/internal/engine-versions/{version} | Update image_ref, options, or status. |
| DELETE | /api/v1/internal/engine-versions/{version} | Soft-deprecate (status=deprecated). Hard delete is rejected if the version is referenced by any non-finished runtime_records row. |
| GET | /api/v1/internal/engine-versions/{version}/image-ref | Resolve image_ref only. Used by Lobby's start flow. |

options is a free-form jsonb document stored verbatim. v1 does not enforce a schema; future engine-side options follow the engine's own contract.

status values: active (deployable), deprecated (rejected on new starts; existing runtimes unaffected). Hard removal of a deprecated version requires that no runtime references it.

Lobby resolves image_ref synchronously per game start. If the resolve call fails or the version is missing, Lobby fails the start with engine_version_not_found and never publishes runtime:start_jobs.

Trusted Surfaces

Internal REST

The internal REST surface is consumed by:

  • Edge Gateway — verified player commands and report reads;
  • Game Lobby — register-runtime, image-ref resolve, membership invalidate, banish, liveness reply;
  • Admin Service (future) — full administrative operations;
  • platform probes — /healthz, /readyz.

The listener is unauthenticated; downstream services rely on network segmentation. Caller identity for audit is recorded from the optional X-Galaxy-Caller header (gateway, lobby, admin) and reflected as op_source in operation_log (gateway_player, lobby_internal, admin_rest); when missing or unrecognised, GM defaults to op_source=admin_rest.
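
The header-to-op_source rule can be sketched as a simple lookup. The pairing of header values with op_source values follows the order in which the text lists them, and exact, case-sensitive matching is an assumption; the URL in the demo is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// opSourceFor maps an X-Galaxy-Caller value to the op_source recorded in operation_log.
// Missing or unrecognised values fall back to admin_rest.
func opSourceFor(caller string) string {
	switch caller {
	case "gateway":
		return "gateway_player"
	case "lobby":
		return "lobby_internal"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}

// opSource reads the optional header from an incoming request.
func opSource(r *http.Request) string {
	return opSourceFor(r.Header.Get("X-Galaxy-Caller"))
}

func main() {
	// Hypothetical in-cluster URL; only the header matters for this sketch.
	req, err := http.NewRequest(http.MethodPost, "http://gamemaster:8097/api/v1/internal/runtimes/g1/stop", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(opSource(req)) // header missing
	req.Header.Set("X-Galaxy-Caller", "lobby")
	fmt.Println(opSource(req))
}
```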

For player-command endpoints, the additional X-User-ID header is required and authoritative for the acting user identity.

Request and response shapes are defined in ./api/internal-openapi.yaml. Unknown JSON fields are rejected with invalid_request.
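
One way to reject unknown JSON fields in Go is `json.Decoder.DisallowUnknownFields` from the standard library. The sketch below uses it on the stop request body; the `StopRequest` struct, the helper names, and the trailing-data check are assumptions, not a description of GM's actual decoder.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// StopRequest models the {reason} body of the stop endpoint.
type StopRequest struct {
	Reason string `json:"reason"`
}

// decodeStrict rejects unknown fields and any data after the first JSON value.
func decodeStrict(r io.Reader, dst any) error {
	dec := json.NewDecoder(r)
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return fmt.Errorf("invalid_request: %w", err)
	}
	// After a complete value, the next token must be end of input.
	if _, err := dec.Token(); !errors.Is(err, io.EOF) {
		return errors.New("invalid_request: trailing data after JSON body")
	}
	return nil
}

// decodeStop is a convenience wrapper used by the demo.
func decodeStop(body string) (StopRequest, error) {
	var req StopRequest
	err := decodeStrict(strings.NewReader(body), &req)
	return req, err
}

func main() {
	for _, body := range []string{
		`{"reason":"admin_request"}`,
		`{"reason":"admin_request","force":true}`,
		`{"reason":"timeout"} {}`,
	} {
		req, err := decodeStop(body)
		fmt.Printf("%q -> %+v, err=%v\n", body, req, err)
	}
}
```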

Async Stream Contracts

gm:lobby_events (out)

Producer: Game Master. Consumer: Game Lobby.

Two message types share the stream, discriminated by event_type:

| event_type | Body |
| --- | --- |
| runtime_snapshot_update | {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats:[{user_id, planets, population}], occurred_at_ms} |
| game_finished | {game_id, final_turn_number, runtime_status:"finished", player_turn_stats:[…], finished_at_ms} |

Publication cadence: events only. GM publishes a snapshot when:

  • a turn was generated (success or failure);
  • runtime_status transitioned (e.g., running ↔ generation_in_progress, running → engine_unreachable, * → finished);
  • engine_health_summary changed in response to a runtime:health_events observation (debounced — duplicates are suppressed when the summary did not change).

There is no periodic heartbeat. Game Lobby consumes these events to update its denormalised runtime snapshot and to feed the per-game player_turn_stats aggregate used at game finish.
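
The debounce on health-summary changes can be sketched as an in-memory record of the last emitted summary per game. The type name and the summary strings are hypothetical; the sketch only shows that a snapshot is published when the summary actually changes.

```go
package main

import (
	"fmt"
	"sync"
)

// summaryDedupe remembers the last emitted engine health summary per game.
type summaryDedupe struct {
	mu   sync.Mutex
	last map[string]string
}

func newSummaryDedupe() *summaryDedupe {
	return &summaryDedupe{last: map[string]string{}}
}

// Changed reports whether a snapshot should be published for this observation,
// and records the summary as the last emitted one when it does.
func (d *summaryDedupe) Changed(gameID, summary string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, ok := d.last[gameID]; ok && prev == summary {
		return false // duplicate: suppress publication
	}
	d.last[gameID] = summary
	return true
}

func main() {
	d := newSummaryDedupe()
	// Hypothetical summary values arriving from runtime:health_events.
	for _, s := range []string{"healthy", "healthy", "probe_failed", "healthy"} {
		fmt.Println(s, d.Changed("g1", s))
	}
}
```

Because the state is in memory only, the first observation after a restart always publishes; that is an accepted property of this sketch, not a statement about GM's behaviour.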

The first runtime_snapshot_update published right after a successful register-runtime carries player_turn_stats projected from the engine /admin/init response — the per-player baseline (planets, population) at turn 0. Lobby treats this baseline as the reference point against which subsequent turn deltas are measured. For other status transitions that fire without a fresh engine state payload (e.g., a pure health-summary change), player_turn_stats is empty.

The full schema is enforced by ./api/runtime-events-asyncapi.yaml.

runtime:health_events (in)

Producer: Runtime Manager. Consumer: Game Master.

GM consumes the stream to update runtime_records.engine_health summary per game. The schema is owned by Runtime Manager and documented in ../rtmanager/api/runtime-health-asyncapi.yaml. GM never modifies runtime:health_events; it is read-only.

GM does not publish notifications in response to runtime health changes in v1; the operator surface is gm:lobby_events plus the GM REST inspect endpoints.

Notification Contracts

Game Master publishes notification intents to notification:intents using the shared pkg/notificationintent producer module:

| Trigger | notification_type | Audience | Channels |
| --- | --- | --- | --- |
| Successful turn generation | game.turn.ready | active members of the game | push+email |
| Game finish | game.finished | active members of the game | push+email |
| Turn generation failed | game.generation_failed | configured admin email list | email |

Recipient resolution: GM materialises recipient_user_ids from its own membership cache (loaded from Lobby) at publish time; admin recipients are resolved by Notification Service from configuration.

A failed publication is a notification degradation and must not roll back already committed runtime state. Failed publications are logged and counted via gamemaster.notification.publish_attempts.

Persistence Layout

PostgreSQL durable state (schema gamemaster)

| Table | Purpose | Key |
| --- | --- | --- |
| runtime_records | One row per game; latest known runtime status and scheduling state. | game_id |
| engine_versions | Engine version registry. | version |
| player_mappings | (game_id, user_id) → race_name + engine_player_uuid. | composite (game_id, user_id) |
| operation_log | Append-only audit of every GM operation. | id (auto) |

runtime_records columns:

  • game_id — primary key, references Lobby's identifier.
  • status — starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished.
  • engine_endpoint — http://galaxy-game-{game_id}:8080.
  • current_image_ref — Docker reference of the running image.
  • current_engine_version — semver string registered in engine_versions.
  • turn_schedule — five-field cron expression copied from Lobby.
  • current_turn — last completed turn number; 0 until the first turn generates.
  • next_generation_at — UTC timestamp of the next due tick.
  • skip_next_tick — boolean; set by force-next-turn, cleared after the first cron step is skipped.
  • engine_health — short text summary derived from runtime:health_events.
  • created_at, updated_at, started_at, stopped_at, finished_at — lifecycle timestamps.

engine_versions columns:

  • version — primary key; semver string.
  • image_ref — non-empty Docker reference.
  • options — jsonb, free-form, default '{}'.
  • status — active | deprecated.
  • created_at, updated_at.

player_mappings columns:

  • composite primary key (game_id, user_id).
  • race_name — non-empty string; unique per game_id.
  • engine_player_uuid — UUID returned by the engine /admin/init.
  • created_at.

operation_log columns:

  • id, game_id, op_kind (register_runtime | turn_generation | force_next_turn | banish | stop | patch | engine_version_create | engine_version_update | engine_version_deprecate | engine_version_delete), op_source, source_ref (request id when known), outcome (success | failure), error_code, error_message, started_at, finished_at.

For engine-version registry entries (op_kind starting with engine_version_), the game_id column doubles as the audit subject and stores the canonical version string instead of a platform game identifier; the registry is global, not per-game. The convention is documented in ./docs/stage14-engine-version-registry.md.
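
The audit-subject convention reduces to a prefix check on op_kind. A minimal sketch, assuming a hypothetical helper named AuditSubject that picks the value written to the game_id column:

```go
package main

import (
	"fmt"
	"strings"
)

// AuditSubject returns the value stored in operation_log.game_id.
// Registry operations (op_kind prefixed with "engine_version_") record the
// canonical version string; every other operation records the game id.
func AuditSubject(opKind, gameID, version string) string {
	if strings.HasPrefix(opKind, "engine_version_") {
		return version
	}
	return gameID
}

func main() {
	fmt.Println(AuditSubject("turn_generation", "game-42", ""))
	fmt.Println(AuditSubject("engine_version_deprecate", "", "1.4.2"))
}
```

Readers of the audit log need the same check in reverse: a game_id value on an engine_version_* row is a version string, not a platform game identifier.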

Indexes:

  • runtime_records (status, next_generation_at) — drives the scheduler ticker scan.
  • operation_log (game_id, started_at DESC) — drives audit reads.
  • UNIQUE on player_mappings (game_id, race_name) — one-race-per-game invariant.

Per-game roster reads (WHERE game_id = $1) are served by the leftmost prefix of the composite primary key on player_mappings (game_id, user_id); no extra single-column index is added.

Migrations are embedded as a single file, 00001_init.sql (single-init pre-launch policy from ARCHITECTURE.md §Persistence Backends).

Redis runtime-coordination state

| Key shape | Purpose |
| --- | --- |
| gamemaster:stream_offsets:{label} | Last processed entry id per consumer (health_events). Same shape as Lobby and RTM. |

GM does not persist the membership cache to Redis in v1; the cache is in-process. This trade-off is documented in ./PLAN.md Stage 16.
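
The offset key shape can be illustrated with a small in-memory stand-in. OffsetKey, OffsetStore, and memoryOffsets are hypothetical names chosen for this sketch, and the map replaces Redis so the example runs without a server; the stream entry id is a made-up sample value.

```go
package main

import "fmt"

const offsetKeyPrefix = "gamemaster:stream_offsets:"

// OffsetKey builds the key for one consumer label, e.g. "health_events".
func OffsetKey(label string) string {
	return offsetKeyPrefix + label
}

// OffsetStore is a hypothetical port for reading and writing the last
// processed stream entry id of a consumer.
type OffsetStore interface {
	Load(label string) (id string, ok bool)
	Save(label, id string)
}

// memoryOffsets is an in-memory stand-in for the Redis-backed store.
type memoryOffsets map[string]string

func (m memoryOffsets) Load(label string) (string, bool) {
	id, ok := m[OffsetKey(label)]
	return id, ok
}

func (m memoryOffsets) Save(label, id string) {
	m[OffsetKey(label)] = id
}

func main() {
	var store OffsetStore = memoryOffsets{}
	store.Save("health_events", "1714720000000-0")
	id, ok := store.Load("health_events")
	fmt.Println(OffsetKey("health_events"), id, ok)
}
```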

Error Model

Error envelope: { "error": { "code": "...", "message": "..." } }, identical to Lobby and RTM.

Stable error codes:

| Code | Meaning |
| --- | --- |
| invalid_request | Malformed JSON, unknown fields, missing required parameter. |
| runtime_not_found | runtime_records.{game_id} does not exist. |
| runtime_not_running | Operation requires status=running. |
| conflict | State transition not allowed. |
| forbidden | Caller is not an active member or not authorised. |
| engine_version_not_found | engine_versions.{version} does not exist. |
| engine_version_in_use | Hard-delete attempt against a version referenced by a non-finished runtime. |
| semver_patch_only | Patch attempt across major/minor boundary. |
| engine_unreachable | Engine returned 5xx or connection error. |
| engine_protocol_violation | Engine response missing required fields or carries unexpected payload. |
| engine_validation_error | Engine returned 4xx with per-command results. |
| service_unavailable | Dependency (PostgreSQL, Redis, Lobby, RTM) unavailable. |
| internal_error | Unspecified failure. |

Configuration

All variables use the GAMEMASTER_ prefix. The service fails fast at startup when a required variable is missing.

Required

  • GAMEMASTER_INTERNAL_HTTP_ADDR
  • GAMEMASTER_POSTGRES_PRIMARY_DSN
  • GAMEMASTER_REDIS_MASTER_ADDR
  • GAMEMASTER_REDIS_PASSWORD
  • GAMEMASTER_LOBBY_INTERNAL_BASE_URL
  • GAMEMASTER_RTM_INTERNAL_BASE_URL

Configuration groups

Listener:

  • GAMEMASTER_INTERNAL_HTTP_ADDR (e.g., :8097).
  • GAMEMASTER_INTERNAL_HTTP_READ_TIMEOUT (default 5s).
  • GAMEMASTER_INTERNAL_HTTP_WRITE_TIMEOUT (default 30s).
  • GAMEMASTER_INTERNAL_HTTP_IDLE_TIMEOUT (default 60s).

PostgreSQL:

  • GAMEMASTER_POSTGRES_PRIMARY_DSN (postgres://gamemaster:<pwd>@<host>:5432/galaxy?search_path=gamemaster&sslmode=disable).
  • GAMEMASTER_POSTGRES_REPLICA_DSNS (optional, comma-separated; not used in v1).
  • GAMEMASTER_POSTGRES_OPERATION_TIMEOUT (default 2s).
  • GAMEMASTER_POSTGRES_MAX_OPEN_CONNS (default 10).
  • GAMEMASTER_POSTGRES_MAX_IDLE_CONNS (default 2).
  • GAMEMASTER_POSTGRES_CONN_MAX_LIFETIME (default 30m).

Redis:

  • GAMEMASTER_REDIS_MASTER_ADDR.
  • GAMEMASTER_REDIS_REPLICA_ADDRS (optional, comma-separated).
  • GAMEMASTER_REDIS_PASSWORD.
  • GAMEMASTER_REDIS_DB (default 0).
  • GAMEMASTER_REDIS_OPERATION_TIMEOUT (default 2s).

Streams:

  • GAMEMASTER_REDIS_LOBBY_EVENTS_STREAM (default gm:lobby_events).
  • GAMEMASTER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
  • GAMEMASTER_REDIS_NOTIFICATION_INTENTS_STREAM (default notification:intents).
  • GAMEMASTER_STREAM_BLOCK_TIMEOUT (default 5s).

Engine client:

  • GAMEMASTER_ENGINE_CALL_TIMEOUT (default 30s — covers turn generation on large games).
  • GAMEMASTER_ENGINE_PROBE_TIMEOUT (default 5s — for inspect-style reads).

Lobby internal client:

  • GAMEMASTER_LOBBY_INTERNAL_BASE_URL.
  • GAMEMASTER_LOBBY_INTERNAL_TIMEOUT (default 2s).

Runtime Manager internal client:

  • GAMEMASTER_RTM_INTERNAL_BASE_URL.
  • GAMEMASTER_RTM_INTERNAL_TIMEOUT (default 5s).

Scheduler:

  • GAMEMASTER_SCHEDULER_TICK_INTERVAL (default 1s).
  • GAMEMASTER_TURN_GENERATION_TIMEOUT (default 60s).

Membership cache:

  • GAMEMASTER_MEMBERSHIP_CACHE_TTL (default 30s).
  • GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES (default 4096; LRU eviction).

Logging:

  • GAMEMASTER_LOG_LEVEL (default info).

Lifecycle:

  • GAMEMASTER_SHUTDOWN_TIMEOUT (default 30s).

Telemetry: uses the standard OTLP env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, etc.) shared with other Galaxy services.

Observability

Metrics (OpenTelemetry, low cardinality)

  • gamemaster.register_runtime.outcomes — counter; labels outcome, error_code.
  • gamemaster.turn_generation.outcomes — counter; labels outcome, error_code, trigger (scheduler | force).
  • gamemaster.command_execute.outcomes — counter; labels outcome, error_code.
  • gamemaster.order_put.outcomes — counter; labels outcome, error_code.
  • gamemaster.report_get.outcomes — counter; labels outcome, error_code.
  • gamemaster.banish.outcomes — counter; labels outcome, error_code.
  • gamemaster.engine_call.latency — histogram; label op (init | status | turn | banish | command | order | report).
  • gamemaster.runtime_records_by_status — gauge; label status.
  • gamemaster.scheduler.due_games — gauge.
  • gamemaster.health_events.consumed — counter.
  • gamemaster.lobby_events.published — counter; label event_type.
  • gamemaster.notification.publish_attempts — counter; labels notification_type, result (ok | error).
  • gamemaster.membership_cache.hits — counter; label result (hit | miss | invalidate).
  • gamemaster.engine_versions_total — gauge.

Metrics avoid high-cardinality attributes such as game_id and user_id.

Structured logs (slog JSON to stdout)

Common fields on every entry: service=gamemaster, request_id, trace_id, span_id, game_id (when known), user_id (when known), op_kind, op_source, outcome, error_code.

Worker-specific fields: event_type (lobby-events publisher), stream_entry_id (health-events consumer), turn (turn-generation), engine_endpoint (engine calls).

Verification

Service-level (per ./PLAN.md):

  • Unit tests for every service-layer operation against mocked engine, Lobby, RTM, notification publisher, and lobby-events publisher.
  • Adapter tests using testcontainers-go for PostgreSQL and Redis.
  • Contract tests for internal-openapi.yaml and runtime-events-asyncapi.yaml.

Service-local integration suite under gamemaster/integration/:

  • Register-runtime + first turn happy path against the real galaxy/game test image.
  • Force-next-turn skip behaviour.
  • Engine version registry CRUD + resolve.
  • Admin stop synchronous REST.
  • Banish round-trip.
  • Membership invalidation hook.
  • runtime:health_events consumption.

Inter-service suite under integration/lobbygm/ and integration/lobbygmrtm/:

  • lobbygm: real Lobby + real GM + real engine + stub RTM. Covers enrollment → register-runtime → first turn → finish + capability evaluation.
  • lobbygmrtm: full Lobby + GM + RTM + engine. Covers happy path and the documented failure paths from ARCHITECTURE.md flow §4.

Manual smoke (development):

docker network create galaxy-net   # once
GAMEMASTER_INTERNAL_HTTP_ADDR=:8097 \
GAMEMASTER_POSTGRES_PRIMARY_DSN='postgres://gamemaster:secret@localhost:5432/galaxy?search_path=gamemaster&sslmode=disable' \
GAMEMASTER_REDIS_MASTER_ADDR=localhost:6379 \
GAMEMASTER_REDIS_PASSWORD=secret \
GAMEMASTER_LOBBY_INTERNAL_BASE_URL=http://localhost:8095 \
GAMEMASTER_RTM_INTERNAL_BASE_URL=http://localhost:8096 \
... go run ./gamemaster/cmd/gamemaster

After start, curl http://localhost:8097/readyz returns 200. Driving Lobby through its public start flow brings up galaxy-game-{game_id} containers, GM registers each runtime, generates turns on the configured schedule, and propagates events to Lobby.