Game Master
Game Master (GM) is the only Galaxy platform service permitted to talk to
running game engine containers. It owns runtime and operational state of
already-running games, the engine version registry, the platform mapping of
(user_id ↔ race_name ↔ engine_player_uuid), the per-game turn scheduler,
and the synchronous and asynchronous boundaries that other services use to
interact with running games.
References
- ../ARCHITECTURE.md: system architecture, §8 Game Master.
- ../TESTING.md §8: testing matrix for GM.
- ./PLAN.md: staged implementation plan.
- ./docs/README.md: service-local documentation entry point (created at PLAN stage 24).
- ./docs/stage06-contract-files.md: decisions behind the OpenAPI and AsyncAPI specs frozen at PLAN stage 06.
- ./docs/stage07-notification-catalog-audit.md: notification catalog audit and producer-side freeze test added at PLAN stage 07.
- ./docs/stage08-module-skeleton.md: module skeleton wiring decisions (config groups, telemetry instruments, Makefile targets, deferred dependencies) recorded at PLAN stage 08.
- ./docs/stage09-postgres-migration.md: PostgreSQL schema, embedded migration, jet generation pipeline, and runtime wiring landed at PLAN stage 09.
- ./docs/stage10-domain-and-ports.md: domain types, port interfaces, and the six stage-10 decisions (operation domain package, membership DTO placement, engine-version options shape, schedule wrapper signature, recovery transition, deferred mock destination) landed at PLAN stage 10.
- ./docs/stage11-persistence-adapters.md: PostgreSQL stores (runtimerecordstore, engineversionstore, playermappingstore, operationlog), the Redis offset store, and the eight stage-11 decisions (sqlx/pgtest local clones, CAS pattern, port-level Now extension, domain conflict sentinels, jsonb cast, idempotent Deprecate, multi-row BulkInsert, miniredis dependency) landed at PLAN stage 11.
- ./docs/stage12-external-clients.md: outbound adapters (engine, Lobby, Runtime Manager, notification intent publisher, lobby-events publisher) and the seven stage-12 decisions (per-call engine base URL, dual engine timeout dispatch, engine population rounding, Lobby pagination cap, no extra RTM sentinels, AsyncAPI-aligned XADD encoding for gm:lobby_events, Makefile mocks-target guard) landed at PLAN stage 12.
- ./docs/stage13-register-runtime.md: register-runtime service-layer orchestrator and the five stage-13 decisions (RuntimeRecordStore.Delete extension, engine 4xx/5xx classification split, engine response validated as engine_protocol_violation, initial snapshot carries player_turn_stats from /admin/init, two-flag rollback gating) landed at PLAN stage 13.
- ./docs/stage14-engine-version-registry.md: engine version registry service-layer orchestrator (List, Get, Create, Update, Deprecate, Delete, ResolveImageRef) and the five stage-14 decisions (EngineVersionStore.Delete port extension, reference probe before hard delete, new engine_version_delete op_kind in schema and domain, operation_log.game_id overloaded as audit subject for registry entries, JSON-object validation for options) landed at PLAN stage 14.
- ./docs/stage15-scheduler-and-turn-generation.md: scheduler ticker, turn-generation orchestrator, and snapshot publisher and the seven stage-15 decisions (LobbyClient.GetGameSummary extension with fail-soft game_name fallback, telemetry-only Trigger parameter, two-CAS pattern with external-mutation conflict, single-snapshot-per-outcome cadence, player_mappings as recipient source, stateless scheduler utility, in-flight set on the ticker) landed at PLAN stage 15.
- ./docs/stage16-membership-cache-and-invalidation.md: hot-path services (commandexecute, orderput, reportget), membership cache, and the six stage-16 decisions (no runtime_not_running for reports, GM-side envelope rewrite commands→cmd with injected actor, hot-path skips operation_log, hand-rolled per-game inflight tracker, raw status string return, missing-mapping surfaces as forbidden) landed at PLAN stage 16.
- ./docs/stage17-admin-operations.md: admin service-layer operations (adminstop, adminforce, adminpatch, adminbanish, livenessreply) and the six stage-17 decisions (RuntimeRecordStore.UpdateImage extension, adminstop idempotent on terminal statuses and conflict on starting, adminforce always sets skip_next_tick, adminbanish without status check and missing race surfaces as forbidden, livenessreply 200 + empty status on runtime_not_found, RTM failures map to service_unavailable) landed at PLAN stage 17.
- ./docs/stage18-health-events-consumer.md: runtime:health_events consumer worker and the seven stage-18 decisions (event-type taxonomy expanded to seven values with container_started and probe_recovered, CAS-conflict fallback to health-only update, new RuntimeRecordStore.UpdateEngineHealth port method, in-memory dedupe of last-emitted summaries, read-after-write snapshot construction, health_events stream offset label, worker wiring deferred to Stage 19) landed at PLAN stage 18.
- ./api/internal-openapi.yaml: internal trusted REST contract.
- ./api/runtime-events-asyncapi.yaml: gm:lobby_events Redis Stream contract.
- ../game/README.md: game engine container contract (env, ports, admin and player REST surfaces, /healthz).
- ../lobby/README.md: Game Lobby integration with GM.
- ../rtmanager/README.md: Runtime Manager contract used synchronously by GM admin operations.
Purpose
A running Galaxy game lives in exactly one Docker container managed by
Runtime Manager. The platform must:
- register a freshly started container with platform-level membership;
- initialise the engine with the agreed race roster;
- accept and forward player commands and orders to the engine;
- route per-player report reads;
- generate turns according to a schedule;
- detect game finish and propagate it back to platform-level state;
- expose runtime/operational controls (force-next-turn, stop, patch, banish);
- own the catalogue of supported engine versions and resolve
  image_ref values for Game Lobby.
Game Master is the single component that performs these actions. It does
not own platform metadata of games (that is Game Lobby), Docker control
(that is Runtime Manager), or the full game state (that is the engine
container). Engine state on disk is the engine's domain; GM never reads or
writes the bind-mounted state directory.
Scope
Game Master is the source of truth for:
- the runtime mapping game_id → engine_endpoint for every running game;
- the runtime status (starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished);
- the current turn number and the next-tick timestamp;
- the per-game (user_id, race_name, engine_player_uuid) triple;
- the engine version registry: (version, image_ref, options, status);
- the durable history of every operation GM performed (operation_log);
- the latest engine health summary per game.
Game Master is not the source of truth for:
- platform game records (created, draft, enrollment, finished metadata), owned by Game Lobby;
- container lifecycle and Docker reality, owned by Runtime Manager;
- in-game world state (planets, ships, science, reports), owned by the engine container;
- platform user identity and entitlements, owned by User Service;
- in-game race_name reservations and the Race Name Directory, owned by Game Lobby.
Non-Goals
- Multi-instance operation in v1. GM runs as a single process; the in-process scheduler is authoritative. Multi-instance with leader election is an explicit future iteration.
- Direct Docker access. GM never imports the Docker SDK; every container operation goes through Runtime Manager over trusted internal REST.
- Player removal/block at platform level. Game Lobby owns that decision; GM only performs the engine-side banish call when explicitly invoked.
- Pause/resume of a running game on the platform side. Game Lobby.paused is a platform-only state; GM only answers a liveness probe used by Lobby's resume flow.
- Automatic semver-patch upgrades. Patch is always an explicit admin operation against a target engine version present in the registry.
- TLS or mTLS on the internal listener. GM trusts its network segment.
- Direct delivery of player-visible push events. Notification Service owns user-targeted push delivery; GM publishes notification intents only.
- A separate Admin Service. GM exposes its trusted internal REST surface; Admin Service will adopt it in a later iteration.
- Engine state file management. Backup, archival, and cleanup of the bind-mounted state directories are operator concerns.
Position in the System
flowchart LR
Gateway["Edge Gateway"]
Lobby["Game Lobby"]
Admin["Admin Service\n(future)"]
GM["Game Master"]
RTM["Runtime Manager"]
Notify["Notification Service"]
Engine["Game Engine container\n(galaxy/game)"]
Postgres["PostgreSQL\nschema gamemaster"]
Redis["Redis\nstreams + caches"]
Gateway -- "verified player commands\n(REST/JSON)" --> GM
Lobby -- "register-runtime,\nimage-ref resolve,\nmemberships invalidate" --> GM
Admin -- "internal REST" --> GM
GM -- "engine HTTP API" --> Engine
GM -- "stop / restart / patch" --> RTM
GM -- "notification:intents" --> Notify
GM -- "gm:lobby_events" --> Redis
Redis -- "runtime:health_events" --> GM
GM --> Postgres
Edge Gateway routes verified player message types (game.command.execute,
game.order.put, game.report.get) to GM as trusted REST/JSON after
transcoding from FlatBuffers. Game Lobby calls GM synchronously to
register runtimes after a successful container start, to resolve image_ref
from the engine version registry, to invalidate membership cache on roster
changes, and to verify GM liveness during platform resume. Game Master
calls Runtime Manager synchronously over REST for stop, restart, and
patch. Runtime Manager publishes runtime:health_events, which GM
consumes asynchronously. GM publishes gm:lobby_events consumed by
Game Lobby, and notification:intents consumed by Notification Service.
Responsibility Boundaries
Game Master is responsible for:
- registering a freshly started container into platform-level runtime state;
- initialising the engine with the race roster received from Lobby;
- maintaining the platform mapping of user_id, race_name, and engine_player_uuid;
- forwarding player commands, orders, and report reads to the engine after authorising the actor;
- generating turns on schedule, including the force-next-turn skip rule;
- evaluating engine finish on every turn boundary;
- publishing runtime snapshot updates and the final game-finish event;
- consuming runtime health events from Runtime Manager and updating its per-game health summary;
- exposing the engine version registry CRUD;
- driving admin-level runtime operations (stop, force-next-turn, patch, banish) by calling Runtime Manager and the engine on demand.
Game Master is not responsible for:
- creating or stopping containers on Docker (that is Runtime Manager);
- evaluating whether a game is allowed to start (that is Game Lobby);
- deriving recipient user lists for non-game notifications (that is Notification Service);
- verifying authenticated transport, signatures, freshness, and replay (that is Edge Gateway);
- mapping user_id to platform-level membership (that is Game Lobby).
Engine Container Contract
The engine container is galaxy/game. GM uses two route classes:
| Class | Path | Purpose |
|---|---|---|
| Admin (GM-only) | POST /api/v1/admin/init | Initialise the engine with a race roster. |
| Admin (GM-only) | GET /api/v1/admin/status | Read the full game state. |
| Admin (GM-only) | PUT /api/v1/admin/turn | Generate the next turn. |
| Admin (GM-only) | POST /api/v1/admin/race/banish | Deactivate a race after permanent platform removal. Body {race_name}. |
| Player | PUT /api/v1/command | Execute a batch of player commands. |
| Player | PUT /api/v1/order | Validate and store a batch of player orders. |
| Player | GET /api/v1/report | Fetch per-player turn report. |
| Probe | GET /healthz | Liveness probe used by Runtime Manager and operator tooling. |
Admin paths are unauthenticated but routed only from inside the trusted network segment that connects GM to the engine container. The engine does not enforce caller identity — network-level segmentation is the boundary.
StateResponse carries an extra boolean finished field. When true on a
turn-generation response, GM treats the game as finished and runs the
finish flow described below. The conditional logic that flips finished
to true lives in the engine's domain code and is not GM's concern.
The engine endpoint URL is the engine_endpoint value handed to GM by
Game Lobby during register-runtime: http://galaxy-game-{game_id}:8080.
The DNS name is stable across restart and patch.
Runtime Surface
Listeners
| Listener | Default address | Purpose |
|---|---|---|
| Internal HTTP | :8097 (GAMEMASTER_INTERNAL_HTTP_ADDR) | Probes (/healthz, /readyz) and the trusted REST surface for Edge Gateway, Game Lobby, and Admin Service. |
There is no public listener. The internal listener is unauthenticated and
assumes a trusted network segment. Authentication of player commands has
already happened at Edge Gateway; GM enforces authorisation only.
Background workers
| Worker | Driver | Description |
|---|---|---|
| Scheduler ticker | 1 s loop | Scans runtime_records for due next_generation_at, runs the turn-generation service for each, recomputes next_generation_at from turn_schedule (skipping one tick when skip_next_tick=true is set). |
| runtime:health_events consumer | Redis Stream | XREADs from runtime:health_events (produced by RTM), updates runtime_records.engine_health summary, debounces runtime_snapshot_update publication. |
Startup dependencies
In start order:
- PostgreSQL primary (GAMEMASTER_POSTGRES_PRIMARY_DSN). Embedded goose migrations apply synchronously before any listener opens.
- Redis master (GAMEMASTER_REDIS_MASTER_ADDR).
- Telemetry exporter (OTLP grpc/http or stdout).
- Internal HTTP listener.
- Health-events consumer worker.
- Scheduler ticker worker.
A failure in any step exits the process non-zero.
Probes
/healthz reports liveness — the process responds when the HTTP server is
alive.
/readyz reports readiness — 200 only when the PostgreSQL pool can ping
the primary and the Redis master client can ping. No deeper dependency is
checked synchronously; the engine is reached only on demand.
Both probes are documented in
./api/internal-openapi.yaml.
Lifecycles
Register-runtime
Triggered by: Game Lobby after a successful container start, calling
POST /api/v1/internal/games/{game_id}/register-runtime with body
{engine_endpoint, members:[{user_id, race_name}], target_engine_version, turn_schedule}.
Flow on success:
- Validate request shape; reject with invalid_request if any required field is missing.
- Reject with conflict if runtime_records.{game_id} already exists.
- Resolve image_ref for target_engine_version from engine_versions; reject with engine_version_not_found when missing.
- Persist runtime_records with status=starting, engine_endpoint, current_image_ref, current_engine_version, turn_schedule, and created_at.
- Call engine POST /api/v1/admin/init with the race-name list derived from members.
- Read StateResponse and persist one player_mappings row per player: (game_id, user_id, race_name, engine_player_uuid).
- CAS runtime_records.status: starting → running. Persist current_turn=0 and next_generation_at computed from turn_schedule.
- Append operation_log entry (op_kind=register_runtime, outcome=success).
- Publish runtime_snapshot_update to gm:lobby_events.
- Return 200 with the persisted runtime_records row.
Failure paths:
| Failure | Side effect | Outcome to caller |
|---|---|---|
| Invalid envelope | None | 400 invalid_request |
| runtime_records already exists | None | 409 conflict |
| Engine /admin/init returns 4xx | Roll back runtime_records; append failure to operation_log | 502 engine_validation_error |
| Engine /admin/init returns 5xx or fails at the transport layer | Roll back; append failure | 502 engine_unreachable |
| Engine response missing players or contains races not in roster | Roll back; append failure | 502 engine_protocol_violation |
| PostgreSQL transaction failure | Roll back; append failure if possible | 503 service_unavailable |
A failed register-runtime leaves no runtime_records row and no
player_mappings rows. Game Lobby then transitions the platform game
record to paused (per the architecture's flow §4 forced-pause path).
Turn generation
Triggered by: the scheduler ticker when now >= next_generation_at
for a game in status=running, or by an admin invocation of
force-next-turn.
Flow on success:
- CAS runtime_records.status: running → generation_in_progress. If the CAS fails (status changed concurrently), the tick is skipped silently.
- Call engine PUT /api/v1/admin/turn. Engine returns StateResponse with the new turn and the updated player[] array.
- Persist runtime_records.current_turn and refresh runtime_records.engine_health summary.
- If StateResponse.finished == true:
  - CAS runtime_records.status: generation_in_progress → finished;
  - publish game_finished to gm:lobby_events with {game_id, final_turn_number, finished_at_ms, player_turn_stats[]};
  - publish game.finished notification intent to all active members.
- If StateResponse.finished == false:
  - CAS runtime_records.status: generation_in_progress → running;
  - recompute next_generation_at from turn_schedule. If skip_next_tick=true, advance by one extra cron step and clear the flag;
  - publish runtime_snapshot_update to gm:lobby_events with {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[]};
  - publish game.turn.ready notification intent to all active members.
- Append operation_log entry (op_kind=turn_generation, outcome=success).
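The compare-and-swap guard used at the start and end of this flow can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: statusStore and CASStatus are hypothetical names, and the real store is PostgreSQL, where the same effect would typically come from a conditional UPDATE that matches on the expected status.

```go
package main

import (
	"fmt"
	"sync"
)

// statusStore is an in-memory stand-in for runtime_records, keyed by game_id.
type statusStore struct {
	mu     sync.Mutex
	status map[string]string
}

// CASStatus moves gameID from `from` to `to` only when the stored status
// still equals `from`. It returns false when the record is missing or was
// changed concurrently, in which case the caller skips the tick.
func (s *statusStore) CASStatus(gameID, from, to string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.status[gameID]
	if !ok || cur != from {
		return false
	}
	s.status[gameID] = to
	return true
}

func main() {
	s := &statusStore{status: map[string]string{"g1": "running"}}
	// The first caller wins the transition; a concurrent second tick loses.
	fmt.Println(s.CASStatus("g1", "running", "generation_in_progress")) // true
	fmt.Println(s.CASStatus("g1", "running", "generation_in_progress")) // false
	// After the engine call, the second CAS returns the game to running.
	fmt.Println(s.CASStatus("g1", "generation_in_progress", "running")) // true
}
```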
Failure paths:
| Failure | Side effect | Outcome |
|---|---|---|
| Engine timeout / 5xx | CAS status: generation_in_progress → generation_failed; publish runtime_snapshot_update; publish game.generation_failed admin notification | Logged; ticker leaves the game in generation_failed until manual recovery (admin issues force-next-turn or stop). |
| Persistence failure after engine success | Append failure to operation_log; status stays generation_in_progress | Health-summary update on next probe will resync. |
player_turn_stats[] is built from StateResponse.player[] by mapping
raceName → user_id through player_mappings and projecting
{user_id, planets, population}. ships_built is intentionally absent
(see ./docs/stage01-architecture-sync.md).
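The projection described above might look like the following sketch. The Go type and field names (EnginePlayer, PlayerTurnStat, integer counters) are assumptions for illustration; only the raceName → user_id mapping and the {user_id, planets, population} shape come from the text.

```go
package main

import "fmt"

// EnginePlayer holds the fields of StateResponse.player[] used here.
// Field names and integer types are assumptions.
type EnginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStat is the projected {user_id, planets, population} entry.
type PlayerTurnStat struct {
	UserID     string
	Planets    int
	Population int
}

// projectTurnStats maps each engine player to a platform user through the
// race_name → user_id mapping. Races without a mapping are returned
// separately so the caller can decide how to report them.
func projectTurnStats(players []EnginePlayer, raceToUser map[string]string) ([]PlayerTurnStat, []string) {
	stats := make([]PlayerTurnStat, 0, len(players))
	var unknown []string
	for _, p := range players {
		userID, ok := raceToUser[p.RaceName]
		if !ok {
			unknown = append(unknown, p.RaceName)
			continue
		}
		stats = append(stats, PlayerTurnStat{UserID: userID, Planets: p.Planets, Population: p.Population})
	}
	return stats, unknown
}

func main() {
	players := []EnginePlayer{{"Zerg", 3, 1200}, {"Ghosts", 1, 90}}
	stats, unknown := projectTurnStats(players, map[string]string{"Zerg": "u-1"})
	fmt.Println(stats, unknown)
}
```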
Force-next-turn
Triggered by: Admin Service or system-admin via
POST /api/v1/internal/runtimes/{game_id}/force-next-turn.
Pre-conditions: runtime exists, status=running.
Flow:
- Run the turn-generation flow synchronously (the same code path the scheduler uses).
- After success, set runtime_records.skip_next_tick = true. The next regular tick computed from turn_schedule is then advanced by one extra step before being persisted as next_generation_at.
- Append operation_log entry (op_kind=force_next_turn).
The skip rule guarantees that the inter-turn spacing is never shorter than one schedule interval, regardless of when the force is issued.
Game finish
The finish flow is driven entirely by the engine signal finished:bool.
GM never decides finish independently. After game_finished is published,
Game Lobby transitions its platform record to finished, runs the
capability evaluation, and finalises Race Name Directory state. The GM
record stays in status=finished indefinitely; cleanup is operator-driven.
Banish (engine-side player removal)
Triggered by: Game Lobby synchronously calling
POST /api/v1/internal/games/{game_id}/race/{race_name}/banish after a
permanent membership removal at platform level.
Pre-conditions: runtime exists; race_name resolves to an existing
player_mappings row.
Flow:
- Call engine POST /api/v1/admin/race/banish with {race_name}.
- On engine success, append operation_log entry (op_kind=banish, outcome=success).
- Return 204 to Lobby.
Failure path: engine error returns 502 engine_unreachable. Lobby
treats this as a degraded state and may retry; the platform-level
membership stays removed regardless.
Stop
Triggered by: system-admin via
POST /api/v1/internal/runtimes/{game_id}/stop with body {reason},
where reason ∈ {admin_request, finished, timeout}.
Flow:
- Call Runtime Manager POST /api/v1/internal/runtimes/{game_id}/stop with the same reason.
- CAS runtime_records.status: * → stopped.
- Append operation_log entry.
- Publish runtime_snapshot_update reflecting the stopped status.
Patch
Triggered by: system-admin via
POST /api/v1/internal/runtimes/{game_id}/patch with body {version}.
Pre-conditions:
- engine_versions.{version} exists with status=active;
- the new version is a semver-patch of the current version (same major and minor); otherwise reject with semver_patch_only.
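The second pre-condition reduces to comparing the major and minor components. The sketch below is illustrative only: parseSemver and isPatchOf are hypothetical helpers that handle plain MAJOR.MINOR.PATCH strings and ignore pre-release or build metadata.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSemver parses a plain "MAJOR.MINOR.PATCH" string.
// Pre-release and build suffixes are not supported in this sketch.
func parseSemver(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid semver %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid semver %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// isPatchOf reports whether target shares major and minor with current.
// A false result corresponds to the semver_patch_only rejection.
func isPatchOf(current, target string) (bool, error) {
	c, err := parseSemver(current)
	if err != nil {
		return false, err
	}
	t, err := parseSemver(target)
	if err != nil {
		return false, err
	}
	return c[0] == t[0] && c[1] == t[1], nil
}

func main() {
	ok, _ := isPatchOf("1.4.2", "1.4.7")
	fmt.Println(ok) // true
	ok, _ = isPatchOf("1.4.2", "1.5.0")
	fmt.Println(ok) // false
}
```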
Flow:
- Resolve image_ref from engine_versions.{version}.
- Call Runtime Manager POST /api/v1/internal/runtimes/{game_id}/patch with {image_ref}.
- On success, persist new current_image_ref and current_engine_version on runtime_records.
- Append operation_log entry.
The engine container is recreated by RTM with the same DNS name; the
engine_endpoint is unchanged. GM does not call /admin/init again —
the bind-mounted state directory is preserved and the engine resumes from
the previous turn.
Liveness reply (Lobby resume)
Triggered by: Game Lobby resuming a paused game, calling
GET /api/v1/internal/games/{game_id}/liveness.
Flow: if runtime_records.{game_id} exists and status=running,
return 200 {ready: true}. Otherwise return 200 {ready: false, status: "<observed status>"}.
This endpoint never calls the engine; it reflects GM's own view only.
Hot Path
Player commands and orders
Both game.command.execute and game.order.put use the same FlatBuffers
schema (pkg/schema/fbs/order.fbs Order{updated_at, commands:[…]}). The
gateway transcodes the verified payload to JSON via
pkg/transcoder/order.go before calling GM.
GM endpoints:
- POST /api/v1/internal/games/{game_id}/commands: execute now; engine PUT /api/v1/command.
- POST /api/v1/internal/games/{game_id}/orders: validate-and-store; engine PUT /api/v1/order.
Both endpoints accept body {commands:[{cmd_id, @type, …}, …]} and the
X-User-ID header. The actor field on the engine call is always set
by GM from the authenticated user identity; GM never trusts a payload
field for actor identification.
Pre-conditions:
- runtime_records.{game_id} exists with status=running;
- the user is an active member of the game (cache lookup);
- player_mappings.(game_id, user_id) exists.
Errors:
- runtime_not_found: runtime missing.
- runtime_not_running: runtime_status is anything other than running.
- forbidden: caller is not an active member.
- engine_unreachable: engine returned 5xx.
- engine_validation_error: engine returned 4xx; the body carries the engine's per-command result (cmd_applied, cmd_error_code).
Reports
GM endpoint: GET /api/v1/internal/games/{game_id}/reports/{turn}
with the X-User-ID header.
Flow:
- Authorise: caller must be an active member of the game.
- Resolve race_name from player_mappings.
- Call engine GET /api/v1/report?player={race_name}&turn={turn}.
- Return the engine response verbatim. Reports are full per-player payloads and are never cached at the platform layer; the engine remains the source of truth.
Membership cache and invalidation
GM holds an in-process per-game TTL cache (default 30 s) of memberships
loaded from Lobby /api/v1/internal/games/{id}/memberships. The cache
shape is map[user_id]MembershipStatus plus a load timestamp. TTL is
the safety-net fallback.
The primary invalidation mechanism is an explicit hook from Lobby:
- Endpoint: POST /api/v1/internal/games/{game_id}/memberships/invalidate.
- Lobby invokes it post-commit on every operation that mutates roster: application approval, application rejection, invite redeem, member remove, member block, user-lifecycle cascade.
- Failed invalidation does not roll back Lobby state; the TTL safety net catches stale data within the next 30 s.
This is a deliberate tight coupling. The trade-off is recorded in
./PLAN.md Stage 16.
Engine Version Registry
The registry is the source of truth for which engine versions are
deployable. CRUD is exposed on the GM internal port; Game Lobby
consumes it synchronously to resolve image_ref for target_engine_version
just before publishing a runtime:start_jobs envelope.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/internal/engine-versions | List versions; supports status filter. |
| POST | /api/v1/internal/engine-versions | Create a new version with version, image_ref, optional options. Validates semver shape and Docker reference. |
| GET | /api/v1/internal/engine-versions/{version} | Read one version. |
| PATCH | /api/v1/internal/engine-versions/{version} | Update image_ref, options, or status. |
| DELETE | /api/v1/internal/engine-versions/{version} | Soft-deprecate (status=deprecated). Hard delete is rejected if the version is referenced by any non-finished runtime_records row. |
| GET | /api/v1/internal/engine-versions/{version}/image-ref | Resolve image_ref only. Used by Lobby's start flow. |
options is a free-form jsonb document stored verbatim. v1 does not
enforce a schema; future engine-side options follow the engine's own
contract.
status values: active (deployable), deprecated (rejected on new
starts; existing runtimes unaffected). Hard removal of a deprecated
version requires that no runtime references it.
Lobby resolves image_ref synchronously per game start. If the resolve
call fails or the version is missing, Lobby fails the start with
engine_version_not_found and never publishes runtime:start_jobs.
Trusted Surfaces
Internal REST
The internal REST surface is consumed by:
- Edge Gateway: verified player commands and report reads;
- Game Lobby: register-runtime, image-ref resolve, membership invalidate, banish, liveness reply;
- Admin Service (future): full administrative operations;
- platform probes: /healthz, /readyz.
The listener is unauthenticated; downstream services rely on network
segmentation. Caller identity for audit is recorded from the optional
X-Galaxy-Caller header (gateway, lobby, admin) and reflected as
op_source in operation_log (gateway_player, lobby_internal,
admin_rest); when missing or unrecognised, GM defaults to
op_source=admin_rest.
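The header-to-op_source rule is a small lookup with a default. In this sketch the helper name opSourceFor is hypothetical, the pairing of header values with op_source values follows the order in which the text lists them, and matching is exact (no trimming or case folding), which is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// opSourceFor maps the X-Galaxy-Caller header value to an op_source.
// Missing or unrecognised values fall back to admin_rest.
func opSourceFor(caller string) string {
	switch caller {
	case "gateway":
		return "gateway_player"
	case "lobby":
		return "lobby_internal"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}

func main() {
	h := http.Header{}
	h.Set("X-Galaxy-Caller", "lobby")
	fmt.Println(opSourceFor(h.Get("X-Galaxy-Caller"))) // lobby_internal
	fmt.Println(opSourceFor(""))                       // admin_rest
}
```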
For player-command endpoints, the additional X-User-ID header is
required and authoritative for the acting user identity.
Request and response shapes are defined in
./api/internal-openapi.yaml. Unknown JSON
fields are rejected with invalid_request.
Async Stream Contracts
gm:lobby_events (out)
Producer: Game Master. Consumer: Game Lobby.
Two message types share the stream, discriminated by event_type:
| event_type | Body |
|---|---|
| runtime_snapshot_update | {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats:[{user_id, planets, population}], occurred_at_ms} |
| game_finished | {game_id, final_turn_number, runtime_status:"finished", player_turn_stats:[…], finished_at_ms} |
Publication cadence: events only. GM publishes a snapshot when:
- a turn was generated (success or failure);
- runtime_status transitioned (e.g., running ↔ generation_in_progress, running → engine_unreachable, * → finished);
- engine_health_summary changed in response to a runtime:health_events observation (debounced: duplicates are suppressed when the summary did not change).
There is no periodic heartbeat. Game Lobby consumes these events to
update its denormalised runtime snapshot and to feed the per-game
player_turn_stats aggregate used at game finish.
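The duplicate suppression for health-driven snapshots can be pictured as an in-memory map of the last summary emitted per game. This is a sketch with hypothetical names (summaryDeduper, ShouldPublish); because the state is in-process, a restart would re-emit the first observed summary for each game.

```go
package main

import (
	"fmt"
	"sync"
)

// summaryDeduper remembers the last health summary published per game.
type summaryDeduper struct {
	mu   sync.Mutex
	last map[string]string
}

func newSummaryDeduper() *summaryDeduper {
	return &summaryDeduper{last: map[string]string{}}
}

// ShouldPublish reports whether summary differs from the last one recorded
// for gameID, and records it when it does.
func (d *summaryDeduper) ShouldPublish(gameID, summary string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, seen := d.last[gameID]; seen && prev == summary {
		return false
	}
	d.last[gameID] = summary
	return true
}

func main() {
	d := newSummaryDeduper()
	fmt.Println(d.ShouldPublish("g1", "healthy"))      // true
	fmt.Println(d.ShouldPublish("g1", "healthy"))      // false
	fmt.Println(d.ShouldPublish("g1", "probe_failed")) // true
}
```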
The first runtime_snapshot_update published right after a successful
register-runtime carries player_turn_stats projected from the
engine /admin/init response — the per-player baseline (planets,
population) at turn 0. Lobby treats this baseline as the reference
point against which subsequent turn deltas are measured. For other
status transitions that fire without a fresh engine state payload
(e.g., a pure health-summary change), player_turn_stats is empty.
The full schema is enforced by
./api/runtime-events-asyncapi.yaml.
runtime:health_events (in)
Producer: Runtime Manager. Consumer: Game Master.
GM consumes the stream to update runtime_records.engine_health summary
per game. The schema is owned by Runtime Manager and documented in
../rtmanager/api/runtime-health-asyncapi.yaml.
GM never modifies runtime:health_events; it is read-only.
GM does not publish notifications in response to runtime health changes
in v1; the operator surface is gm:lobby_events plus the GM REST
inspect endpoints.
Notification Contracts
Game Master publishes notification intents to notification:intents
using the shared pkg/notificationintent producer module:
| Trigger | notification_type | Audience | Channels |
|---|---|---|---|
| Successful turn generation | game.turn.ready | active members of the game | push+email |
| Game finish | game.finished | active members of the game | push+email |
| Turn generation failed | game.generation_failed | configured admin email list | email |
Recipient resolution: GM materialises recipient_user_ids from its own
membership cache (loaded from Lobby) at publish time; admin recipients
are resolved by Notification Service from configuration.
A failed publication is a notification degradation and must not roll back
already committed runtime state. Failed publications are logged and
counted via gamemaster.notification.publish_attempts.
Persistence Layout
PostgreSQL durable state (schema gamemaster)
| Table | Purpose | Key |
|---|---|---|
| runtime_records | One row per game; latest known runtime status and scheduling state. | game_id |
| engine_versions | Engine version registry. | version |
| player_mappings | (game_id, user_id) → race_name + engine_player_uuid. | composite (game_id, user_id) |
| operation_log | Append-only audit of every GM operation. | id (auto) |
runtime_records columns:
- game_id: primary key, references Lobby's identifier.
- status: starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished.
- engine_endpoint: http://galaxy-game-{game_id}:8080.
- current_image_ref: Docker reference of the running image.
- current_engine_version: semver string registered in engine_versions.
- turn_schedule: five-field cron expression copied from Lobby.
- current_turn: last completed turn number; 0 until the first turn generates.
- next_generation_at: UTC timestamp of the next due tick.
- skip_next_tick: boolean; set by force-next-turn, cleared after the first cron step is skipped.
- engine_health: short text summary derived from runtime:health_events.
- created_at, updated_at, started_at, stopped_at, finished_at: lifecycle timestamps.
engine_versions columns:
- version: primary key; semver string.
- image_ref: non-empty Docker reference.
- options: jsonb, free-form, default '{}'.
- status: active | deprecated.
- created_at, updated_at.
player_mappings columns:
- composite primary key (game_id, user_id).
- race_name: non-empty string; unique per game_id.
- engine_player_uuid: UUID returned by the engine /admin/init.
- created_at.
operation_log columns:
- id, game_id, op_kind (register_runtime | turn_generation | force_next_turn | banish | stop | patch | engine_version_create | engine_version_update | engine_version_deprecate | engine_version_delete), op_source, source_ref (request id when known), outcome (success | failure), error_code, error_message, started_at, finished_at.
For engine-version registry entries (op_kind starting with
engine_version_), the game_id column doubles as the audit subject
and stores the canonical version string instead of a platform game
identifier; the registry is global, not per-game. The convention is
documented in
./docs/stage14-engine-version-registry.md.
Indexes:
- runtime_records (status, next_generation_at): drives the scheduler ticker scan.
- operation_log (game_id, started_at DESC): drives audit reads.
- UNIQUE on player_mappings (game_id, race_name): one-race-per-game invariant.
Per-game roster reads (WHERE game_id = $1) are served by the
leftmost prefix of the composite primary key on
player_mappings (game_id, user_id); no extra single-column index is
added.
Migrations are embedded as a single 00001_init.sql file (the single-init
pre-launch policy from ARCHITECTURE.md §Persistence Backends).
Redis runtime-coordination state
| Key shape | Purpose |
|---|---|
| gamemaster:stream_offsets:{label} | Last processed entry id per consumer (health_events). Same shape as Lobby and RTM. |
GM does not persist the membership cache to Redis in v1; the cache is
in-process. This trade-off is documented in ./PLAN.md Stage 16.
Error Model
Error envelope: { "error": { "code": "...", "message": "..." } },
identical to Lobby and RTM.
Stable error codes:
| Code | Meaning |
|---|---|
| invalid_request | Malformed JSON, unknown fields, missing required parameter. |
| runtime_not_found | runtime_records.{game_id} does not exist. |
| runtime_not_running | Operation requires status=running. |
| conflict | State transition not allowed. |
| forbidden | Caller is not an active member or not authorised. |
| engine_version_not_found | engine_versions.{version} does not exist. |
| engine_version_in_use | Hard-delete attempt against a version referenced by a non-finished runtime. |
| semver_patch_only | Patch attempt across major/minor boundary. |
| engine_unreachable | Engine returned 5xx or connection error. |
| engine_protocol_violation | Engine response missing required fields or carries unexpected payload. |
| engine_validation_error | Engine returned 4xx with per-command results. |
| service_unavailable | Dependency (PostgreSQL, Redis, Lobby, RTM) unavailable. |
| internal_error | Unspecified failure. |
Configuration
All variables use the GAMEMASTER_ prefix. The service fails fast at
startup when a required variable is missing.
Required
- GAMEMASTER_INTERNAL_HTTP_ADDR
- GAMEMASTER_POSTGRES_PRIMARY_DSN
- GAMEMASTER_REDIS_MASTER_ADDR
- GAMEMASTER_REDIS_PASSWORD
- GAMEMASTER_LOBBY_INTERNAL_BASE_URL
- GAMEMASTER_RTM_INTERNAL_BASE_URL
Configuration groups
Listener:
- GAMEMASTER_INTERNAL_HTTP_ADDR (e.g., :8097).
- GAMEMASTER_INTERNAL_HTTP_READ_TIMEOUT (default 5s).
- GAMEMASTER_INTERNAL_HTTP_WRITE_TIMEOUT (default 30s).
- GAMEMASTER_INTERNAL_HTTP_IDLE_TIMEOUT (default 60s).
PostgreSQL:
- GAMEMASTER_POSTGRES_PRIMARY_DSN (postgres://gamemaster:<pwd>@<host>:5432/galaxy?search_path=gamemaster&sslmode=disable).
- GAMEMASTER_POSTGRES_REPLICA_DSNS (optional, comma-separated; not used in v1).
- GAMEMASTER_POSTGRES_OPERATION_TIMEOUT (default 2s).
- GAMEMASTER_POSTGRES_MAX_OPEN_CONNS (default 10).
- GAMEMASTER_POSTGRES_MAX_IDLE_CONNS (default 2).
- GAMEMASTER_POSTGRES_CONN_MAX_LIFETIME (default 30m).
Redis:
- GAMEMASTER_REDIS_MASTER_ADDR.
- GAMEMASTER_REDIS_REPLICA_ADDRS (optional, comma-separated).
- GAMEMASTER_REDIS_PASSWORD.
- GAMEMASTER_REDIS_DB (default 0).
- GAMEMASTER_REDIS_OPERATION_TIMEOUT (default 2s).
Streams:
- GAMEMASTER_REDIS_LOBBY_EVENTS_STREAM (default gm:lobby_events).
- GAMEMASTER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
- GAMEMASTER_REDIS_NOTIFICATION_INTENTS_STREAM (default notification:intents).
- GAMEMASTER_STREAM_BLOCK_TIMEOUT (default 5s).
Engine client:
- GAMEMASTER_ENGINE_CALL_TIMEOUT (default 30s; covers turn generation on large games).
- GAMEMASTER_ENGINE_PROBE_TIMEOUT (default 5s; for inspect-style reads).
Lobby internal client:
- GAMEMASTER_LOBBY_INTERNAL_BASE_URL.
- GAMEMASTER_LOBBY_INTERNAL_TIMEOUT (default 2s).
Runtime Manager internal client:
- GAMEMASTER_RTM_INTERNAL_BASE_URL.
- GAMEMASTER_RTM_INTERNAL_TIMEOUT (default 5s).
Scheduler:
- GAMEMASTER_SCHEDULER_TICK_INTERVAL (default 1s).
- GAMEMASTER_TURN_GENERATION_TIMEOUT (default 60s).
Membership cache:
- GAMEMASTER_MEMBERSHIP_CACHE_TTL (default 30s).
- GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES (default 4096; LRU eviction).
Logging:
- GAMEMASTER_LOG_LEVEL (default info).
Lifecycle:
- GAMEMASTER_SHUTDOWN_TIMEOUT (default 30s).
Telemetry: uses the standard OTLP env vars
(OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, etc.)
shared with other Galaxy services.
Observability
Metrics (OpenTelemetry, low cardinality)
- gamemaster.register_runtime.outcomes: counter; labels outcome, error_code.
- gamemaster.turn_generation.outcomes: counter; labels outcome, error_code, trigger (scheduler | force).
- gamemaster.command_execute.outcomes: counter; labels outcome, error_code.
- gamemaster.order_put.outcomes: counter; labels outcome, error_code.
- gamemaster.report_get.outcomes: counter; labels outcome, error_code.
- gamemaster.banish.outcomes: counter; labels outcome, error_code.
- gamemaster.engine_call.latency: histogram; label op (init | status | turn | banish | command | order | report).
- gamemaster.runtime_records_by_status: gauge; label status.
- gamemaster.scheduler.due_games: gauge.
- gamemaster.health_events.consumed: counter.
- gamemaster.lobby_events.published: counter; label event_type.
- gamemaster.notification.publish_attempts: counter; labels notification_type, result (ok | error).
- gamemaster.membership_cache.hits: counter; label result (hit | miss | invalidate).
- gamemaster.engine_versions_total: gauge.
Metrics avoid high-cardinality attributes such as game_id and user_id.
Structured logs (slog JSON to stdout)
Common fields on every entry: service=gamemaster, request_id,
trace_id, span_id, game_id (when known), user_id (when known),
op_kind, op_source, outcome, error_code.
Worker-specific fields: event_type (lobby-events publisher),
stream_entry_id (health-events consumer), turn (turn-generation),
engine_endpoint (engine calls).
Verification
Service-level (per ./PLAN.md):
- Unit tests for every service-layer operation against mocked engine, Lobby, RTM, notification publisher, lobby-events publisher.
- Adapter tests using testcontainers-go for PostgreSQL and Redis.
- Contract tests for internal-openapi.yaml and runtime-events-asyncapi.yaml.
Service-local integration suite under gamemaster/integration/:
- Register-runtime + first turn happy path against the real galaxy/gametest image.
- Force-next-turn skip behaviour.
- Engine version registry CRUD + resolve.
- Admin stop synchronous REST.
- Banish round-trip.
- Membership invalidation hook.
- runtime:health_events consumption.
Inter-service suite under integration/lobbygm/ and
integration/lobbygmrtm/:
- lobbygm: real Lobby + real GM + real engine + stub RTM. Covers enrollment → register-runtime → first turn → finish + capability evaluation.
- lobbygmrtm: full Lobby + GM + RTM + engine. Covers happy path and the documented failure paths from ARCHITECTURE.md flow §4.
Manual smoke (development):
docker network create galaxy-net # once
GAMEMASTER_INTERNAL_HTTP_ADDR=:8097 \
GAMEMASTER_POSTGRES_PRIMARY_DSN='postgres://gamemaster:secret@localhost:5432/galaxy?search_path=gamemaster&sslmode=disable' \
GAMEMASTER_REDIS_MASTER_ADDR=localhost:6379 \
GAMEMASTER_REDIS_PASSWORD=secret \
GAMEMASTER_LOBBY_INTERNAL_BASE_URL=http://localhost:8095 \
GAMEMASTER_RTM_INTERNAL_BASE_URL=http://localhost:8096 \
... go run ./gamemaster/cmd/gamemaster
After startup, curl http://localhost:8097/readyz returns 200. Driving
Lobby through its public start flow brings up galaxy-game-{game_id}
containers; GM then registers each runtime, generates turns on the
configured schedule, and propagates events to Lobby.