# Game Master Implementation Plan

This plan delivers `Game Master` (GM), the platform service that owns runtime and operational state of running Galaxy games, mediates every call to the engine container, runs the turn scheduler, and owns the engine version registry. The plan also delivers the upstream changes that GM depends on: the extracted `pkg/cronutil` module, the engine admin-path rename plus the `finished:bool` field and the new `/admin/race/banish` endpoint on `galaxy/game`, the Lobby refactor that drops `LOBBY_ENGINE_IMAGE_TEMPLATE` in favour of synchronous image-ref resolution against GM, and the membership invalidation hook from Lobby into GM.

The architectural rules behind every decision are recorded in [`./README.md`](./README.md). This file describes the order in which the implementation lands.

## Global Rules

- Documentation always lands before contracts; contracts before code.
- Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
- Existing-service refactors (Lobby image-ref resolver, Lobby membership invalidation hook, game engine path rename plus `finished` field plus banish endpoint, `pkg/cronutil` extraction) are full-fledged stages of this plan; they precede every GM stage that depends on them.
- GM never opens the Docker SDK. Every container operation goes through `Runtime Manager` over trusted internal REST.
- GM never trusts an `actor` field provided in a payload from `Edge Gateway`; it always derives `actor=race_name` from its own `(user_id → race_name)` mapping.
- Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 06 onward.
- All code, docs, and identifiers are written in English.
- Engine domain logic (when `finished=true` is set, what `banish` mutates inside the game) is user-owned and explicitly out of scope; this plan ships only the contract, router plumbing, and stub handlers for those pieces.

## Suggested Module Structure

```text
gamemaster/
├── cmd/
│   ├── gamemaster/
│   │   └── main.go
│   └── jetgen/
│       └── main.go
│
├── internal/
│   ├── app/
│   │   ├── app.go
│   │   ├── runtime.go
│   │   ├── wiring.go
│   │   └── bootstrap.go
│   │
│   ├── config/
│   │   ├── config.go
│   │   ├── env.go
│   │   └── validation.go
│   │
│   ├── logging/
│   │   ├── logger.go
│   │   └── context.go
│   │
│   ├── telemetry/
│   │   └── runtime.go
│   │
│   ├── domain/
│   │   ├── runtime/
│   │   │   ├── model.go
│   │   │   └── transitions.go
│   │   ├── engineversion/
│   │   │   ├── model.go
│   │   │   └── semver.go
│   │   ├── playermapping/
│   │   │   └── model.go
│   │   └── schedule/
│   │       └── nexttick.go
│   │
│   ├── ports/
│   │   ├── runtimerecordstore.go
│   │   ├── engineversionstore.go
│   │   ├── playermappingstore.go
│   │   ├── operationlog.go
│   │   ├── streamoffsetstore.go
│   │   ├── engineclient.go
│   │   ├── lobbyclient.go
│   │   ├── rtmclient.go
│   │   ├── notificationpublisher.go
│   │   └── lobbyeventspublisher.go
│   │
│   ├── adapters/
│   │   ├── postgres/
│   │   │   ├── migrations/
│   │   │   ├── jet/
│   │   │   ├── runtimerecordstore/
│   │   │   ├── engineversionstore/
│   │   │   ├── playermappingstore/
│   │   │   └── operationlog/
│   │   ├── redisstate/
│   │   │   └── streamoffsets/
│   │   ├── engineclient/
│   │   ├── lobbyclient/
│   │   ├── rtmclient/
│   │   ├── notificationpublisher/
│   │   ├── lobbyeventspublisher/
│   │   └── mocks/
│   │
│   ├── service/
│   │   ├── registerruntime/
│   │   ├── engineversion/
│   │   ├── scheduler/
│   │   ├── turngeneration/
│   │   ├── commandexecute/
│   │   ├── orderput/
│   │   ├── reportget/
│   │   ├── membership/
│   │   ├── adminstop/
│   │   ├── adminforce/
│   │   ├── adminpatch/
│   │   ├── adminbanish/
│   │   └── livenessreply/
│   │
│   ├── worker/
│   │   ├── schedulerticker/
│   │   └── healtheventsconsumer/
│   │
│   └── api/
│       └── internalhttp/
│           ├── server.go
│           └── handlers/
│
├── api/
│   ├── internal-openapi.yaml
│   └── runtime-events-asyncapi.yaml
│
├── integration/
│   ├── harness/
│   ├── registerruntime_test.go
│   ├── scheduler_test.go
│   ├── hotpath_test.go
│   ├── adminops_test.go
│   ├── healthevents_test.go
│   └── notification_test.go
│
├── docs/
│   ├── README.md
│   ├── runtime.md
│   ├── flows.md
│   ├── runbook.md
│   ├── examples.md
│   └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod
```

## ~~Stage 01.~~ Update `ARCHITECTURE.md`

Goal:

- align the project-wide source of truth with every decision recorded in [`./README.md`](./README.md) before any code change touches it.

Tasks:

- Expand `ARCHITECTURE.md §8` (Game Master) with subsections: engine container contract (admin vs player paths, `finished:bool` semantics, `banish` endpoint), runtime status enum (`starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished`), turn cutoff rule (no shadow window; CAS-only), force-next-turn skip rule, snapshot publishing cadence (events only, no heartbeat), single-instance topology.
- Update §«Versioning of Game Engines»: GM owns the engine version registry from v1; Lobby resolves `image_ref` synchronously through GM. `LOBBY_ENGINE_IMAGE_TEMPLATE` is removed. `engine_versions` table lives in the `gamemaster` schema.
- Update §«Fixed synchronous interactions»: add `Game Lobby → Game Master` for `register-runtime`, image-ref resolve, membership invalidation hook, banish, and liveness reply. Add `Edge Gateway → Game Master` for player commands, orders, and reports.
- Update §«Fixed asynchronous interactions»: add `Game Master → Game Lobby` runtime snapshot updates and game-finish events through the `gm:lobby_events` Redis Stream (already mentioned, expanded with cadence rules); add `Runtime Manager → Game Master` health events consumption (`runtime:health_events`) — already mentioned, confirmed.
- Update §«Persistence Backends»: add `gamemaster` schema to the schema-per-service list and to PG-backed services.
- Update §«Configuration»: add `GAMEMASTER` to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
- Update §«Recommended Order of Service Implementation» entry 8 with the scope finalised in [`./README.md`](./README.md).
- Drop `ships_built` from every architectural mention of `player_turn_stats`. Update the capability rule wording to use `planets` and `population` only (no behavioural change; `ships_built` was unused).

Files touched:

- `ARCHITECTURE.md`.

Exit criteria:

- every later GM, Lobby, Notification, or Game stage can quote its rules from `ARCHITECTURE.md` without re-deciding them.
- `go test ./...` is unaffected (this stage changes only Markdown).

## ~~Stage 02.~~ Freeze GM `README.md`

Status: implemented as part of this planning task — see [`./README.md`](./README.md).

Goal:

- publish the complete service description so contracts and code can reference one source.

Exit criteria:

- a reviewer can answer any «what does GM do when X» question by reading the README alone.

## ~~Stage 03.~~ Sync existing-service docs (Lobby, Notification, Game, RTM)

Goal:

- bring the READMEs of every touched service into agreement with the GM contract before any code in those services changes.

Tasks:

- `lobby/README.md`:
  - replace the `LOBBY_ENGINE_IMAGE_TEMPLATE` configuration entry with a new `LOBBY_GM_BASE_URL`-backed image-ref resolve via `GET /api/v1/internal/engine-versions/{version}/image-ref`;
  - document the new outgoing `POST /api/v1/internal/games/{id}/memberships/invalidate` call from `removemember`, `blockmember`, `approveapplication`, `rejectapplication`, `redeeminvite`, and the user-lifecycle cascade worker (post-commit, fail-open);
  - drop `ships_built` from the `player_turn_stats` description and from the capability evaluation wording (rule already reduces to planets + population);
  - add a paragraph in §Game Start Flow noting that `image_ref` is resolved from GM synchronously and that GM unavailability turns `lobby.game.start` into `service_unavailable`.
- `lobby/PLAN.md`: append a closing note stating that the image-ref template removal and the membership invalidation hook are landed by the Game Master plan; no new stages added in Lobby's own PLAN.
- `notification/README.md`: confirm the catalog already lists `game.turn.ready`, `game.finished`, `game.generation_failed` and add a one-line note that GM is the producer.
- `game/README.md`:
  - document the new path layout: admin endpoints under `/api/v1/admin/*` (`init`, `status`, `turn`, `race/banish`); player endpoints unchanged at `/api/v1/{command, order, report}`;
  - document the `finished:bool` extension on `StateResponse`;
  - document the `POST /api/v1/admin/race/banish` request/response shape (body `{race_name}`; response `204`).
- `rtmanager/README.md`: add a closing note that `runtime:health_events` is now consumed by Game Master in production (was reserved as a future consumer).

Files touched:

- `lobby/README.md`, `lobby/PLAN.md`, `notification/README.md`, `game/README.md`, `rtmanager/README.md`.

Exit criteria:

- every doc in the repo agrees on the post-GM contract; no contradiction remains between any two READMEs.
- `go test ./...` is unaffected.

## ~~Stage 04.~~ Extract `pkg/cronutil` + wire Lobby

Goal:

- own a single cron parser/calculator across the workspace, used today by Lobby and tomorrow by GM.

Tasks:

- Create new workspace module `pkg/cronutil/` with:
  - `cronutil.go`: thin wrapper over `github.com/robfig/cron/v3.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow)`; exports `Parse(expr string) (Schedule, error)` and `Schedule.Next(after time.Time) time.Time`;
  - `cronutil_test.go`: parser validation tests covering five-field cron expressions (e.g., `0 18 * * *`, `*/15 * * * *`), invalid expressions, DST/timezone behaviour (Schedule operates in UTC; UTC inputs yield UTC outputs);
  - `go.mod` declaring the module `galaxy/cronutil` with replace target.
- Wire from Lobby: replace any inline `robfig/cron/v3` usage in `lobby/internal/domain/game/model.go:validateCronExpr` and the enrollment automation worker with calls into `pkg/cronutil`. The enrollment automation worker does not parse cron today (it uses `enrollment_ends_at` UTC seconds), so the only Lobby caller is the cron-validation path on game records.
- Update `go.work` to include `./pkg/cronutil` and add the replace block.
- Add Lobby unit tests confirming `validateCronExpr` accepts and rejects the same expressions as before.

Files new:

- `pkg/cronutil/{cronutil.go, cronutil_test.go, go.mod, go.sum}`.

Files touched:

- `go.work`, `go.work.sum`, `lobby/internal/domain/game/model.go`, `lobby/go.mod`, `lobby/go.sum`.

Exit criteria:

- `go build ./...` succeeds.
- `go test ./pkg/cronutil/... ./lobby/...` passes.
- `lobby/internal/domain/game/model_test.go` still asserts the same acceptance set on cron expressions.

## ~~Stage 05.~~ Game engine contract: admin paths + finished + banish

Goal:

- ship the contract changes to `galaxy/game` that GM depends on: admin routes under `/api/v1/admin/*`, the `StateResponse.finished` field, and the new `/admin/race/banish` endpoint.

Tasks:

- `game/openapi.yaml`:
  - rename `/api/v1/init` → `/api/v1/admin/init` (operation `initGame` → `adminInitGame`);
  - rename `/api/v1/status` → `/api/v1/admin/status` (operation `getGameStatus` → `adminGetGameStatus`);
  - rename `/api/v1/turn` → `/api/v1/admin/turn` (operation `generateTurn` → `adminGenerateTurn`);
  - add `POST /api/v1/admin/race/banish` (operation `adminBanishRace`) with body `{race_name}` and `204 No Content` on success; document the same `400` and `500` error envelopes as the existing endpoints;
  - extend `StateResponse` schema with `finished:bool` (required; default `false` from server perspective documented in description).
- `game/internal/router/router.go` (or its router-helper file): rename the route constants and registrations to the new admin paths; add a new route for `/admin/race/banish` wired to a stub handler returning `204` with empty body.
- `game/internal/router/handler/banish.go`: new file with a stub handler that decodes the body, validates `race_name` is non-empty, and returns `204`. Logging only; no game-state mutation. The user fills in domain logic in a separate change.
- `game/internal/model/state.go`: add `Finished bool` field to the Go struct backing `StateResponse`. Default-zero (`false`) on serialisation; the user fills in conditional logic.
- `game/internal/router/{init,status,turn}_test.go`: update path literals to the new admin form; tests stay green.
- `game/openapi_contract_test.go`: assert presence of the new operation IDs (`adminInitGame`, `adminGetGameStatus`, `adminGenerateTurn`, `adminBanishRace`), the new path components, and the `finished` field on `StateResponse`.

Files new:

- `game/internal/router/handler/banish.go`, `game/internal/router/banish_test.go` (path-level test only).

Files touched:

- `game/openapi.yaml`, `game/openapi_contract_test.go`, `game/internal/router/router.go`, `game/internal/router/handler/*.go`, `game/internal/router/{init,status,turn}_test.go`, `game/internal/model/state.go`.
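
The stub banish handler described in the tasks above could look roughly like the sketch below. It assumes a plain `net/http` handler; the engine's real router framework and its `400` error envelope are not specified in this plan, so `http.Error` is only a placeholder, and `banishStub` and `banishStatus` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// banishRequest mirrors the `{race_name}` body named in the plan.
type banishRequest struct {
	RaceName string `json:"race_name"`
}

// banishStub decodes the body, checks that race_name is non-empty, logs the
// request, and answers 204. It performs no game-state mutation.
func banishStub(w http.ResponseWriter, r *http.Request) {
	var req banishRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		// Placeholder: the real handler would use the engine's error envelope.
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	if strings.TrimSpace(req.RaceName) == "" {
		http.Error(w, "race_name is required", http.StatusBadRequest)
		return
	}
	log.Printf("banish requested for race %q (stub, no mutation)", req.RaceName)
	w.WriteHeader(http.StatusNoContent)
}

// banishStatus is a hypothetical helper that drives the stub in-process and
// returns the HTTP status code it produced.
func banishStatus(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/admin/race/banish", strings.NewReader(body))
	rec := httptest.NewRecorder()
	banishStub(rec, req)
	return rec.Code
}

func main() {
	log.Println("valid body:", banishStatus(`{"race_name":"Aelinari"}`))
	log.Println("empty name:", banishStatus(`{"race_name":""}`))
}
```

A path-level test in the spirit of `banish_test.go` only needs to assert the status code, which is why the helper returns nothing else.
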

Exit criteria:

- `go test ./game/...` passes.
- `docker build -t galaxy/game:test -f game/Dockerfile .` from the workspace root still succeeds.
- `curl -X POST http://localhost:8080/api/v1/admin/race/banish -d '{"race_name":"Aelinari"}'` against a running container returns `204`.

## ~~Stage 06.~~ GM contract files and contract tests

Goal:

- ship machine-readable contracts before any GM handler is written, so the implementation has a target spec.

Tasks:

- `gamemaster/api/internal-openapi.yaml`: every internal REST endpoint with request and response schemas; error envelope `{ "error": { "code", "message" } }` identical to Lobby. Operation IDs: `internalRegisterRuntime`, `internalGetRuntime`, `internalListRuntimes`, `internalForceNextTurn`, `internalStopRuntime`, `internalPatchRuntime`, `internalBanishRace`, `internalInvalidateMemberships`, `internalGameLiveness`, `internalListEngineVersions`, `internalCreateEngineVersion`, `internalGetEngineVersion`, `internalUpdateEngineVersion`, `internalDeprecateEngineVersion`, `internalResolveEngineVersionImageRef`, `internalExecuteCommands`, `internalPutOrders`, `internalGetReport`, `internalHealthz`, `internalReadyz`.
- `gamemaster/api/runtime-events-asyncapi.yaml`: AsyncAPI 3.1.0 spec for `gm:lobby_events`. Two `event_type` values: `runtime_snapshot_update` and `game_finished`. Frozen field set per message: `runtime_snapshot_update {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[], occurred_at_ms}`; `game_finished {game_id, final_turn_number, runtime_status, player_turn_stats[], finished_at_ms}`.
- `gamemaster/contract_openapi_test.go`: load the OpenAPI spec via `kin-openapi`, assert every operation ID is present, every required field on every request/response schema is present, and that `additionalProperties: false` is set on every body schema.
- `gamemaster/contract_asyncapi_test.go`: load the AsyncAPI spec via the shared YAML walker pattern from `notification/contract_asyncapi_test.go`; assert message names, channel addresses, action vocabulary (`send`/`receive`), and `event_type` discriminator values.

Files new:

- `gamemaster/api/internal-openapi.yaml`, `gamemaster/api/runtime-events-asyncapi.yaml`, `gamemaster/contract_openapi_test.go`, `gamemaster/contract_asyncapi_test.go`.

Exit criteria:

- both specs validate.
- contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.

## ~~Stage 07.~~ Notification catalog audit (no-op or minor)

Goal:

- confirm the GM-owned notification types (`game.turn.ready`, `game.finished`, `game.generation_failed`) are already wired through `pkg/notificationintent`, the `notification` service's catalog data tables, and `notification/api/intents-asyncapi.yaml`. Add freeze assertions so a future drift breaks loudly.

Tasks:

- Run a freeze test inside `gamemaster/` that imports `galaxy/notificationintent` and asserts the existence of the three constructors plus payload struct shapes.
- Inspect `notification/api/intents-asyncapi.yaml` for the three message schemas; if any are missing the per-payload required fields, add them here.
- Inspect the notification service's routing data tables (the location is internal to `notification/internal/...`); confirm the three types are present with audience and channel decisions matching [`./README.md` §Notification Contracts](./README.md). Add entries if missing.
- Extend `notification/contract_asyncapi_test.go` if any new payload schema entries were added.

Files touched (only if drift is found):

- `notification/api/intents-asyncapi.yaml`, `notification/internal/...` (catalog data), `notification/contract_asyncapi_test.go`.

Files new:

- `gamemaster/notificationintent_audit_test.go`.

Exit criteria:

- the freeze test passes.
- `notification/contract_asyncapi_test.go` and `intent_acceptance_contract_test.go` continue to pass.

## ~~Stage 08.~~ GM module skeleton

Goal:

- create a buildable `gamemaster` binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.

Tasks:

- `gamemaster/cmd/gamemaster/main.go` mirroring `rtmanager/cmd/rtmanager/main.go`.
- `gamemaster/internal/config/{config.go, env.go, validation.go}` with env prefix `GAMEMASTER` and groups Listener, Postgres, Redis, Streams, Engine client, Lobby internal client, RTM internal client, Scheduler, Membership cache, Logging, Lifecycle, Telemetry. Required variables fail-fast.
- `gamemaster/internal/logging/{logger.go, context.go}` copied from lobby/notification.
- `gamemaster/internal/telemetry/runtime.go` registering the metrics named in [`./README.md §Observability`](./README.md).
- `gamemaster/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go}` — empty wiring with PostgreSQL open, Redis open, telemetry open, probe listener open.
- `gamemaster/internal/api/internalhttp/server.go` — listener with `/healthz` and `/readyz` only.
- `gamemaster/Makefile` with the `jet` target (real generation lands in Stage 09) and a `mocks` target.
- `gamemaster/go.mod` and `go.sum` with dependencies: `github.com/redis/go-redis/v9`, `github.com/jackc/pgx/v5`, `github.com/go-jet/jet/v2`, `github.com/pressly/goose/v3`, `github.com/stretchr/testify`, `go.uber.org/mock`, the testcontainers modules for postgres/redis, the OpenTelemetry stack identical to lobby, `galaxy/cronutil`, `galaxy/notificationintent`, `galaxy/postgres`, `galaxy/redisconn`, `galaxy/error`, `galaxy/util`.
- Update repo-level `go.work` — `./gamemaster` is already a workspace member; verify the module path and `go.work.sum`.

Files new:

- the entire skeleton tree under `gamemaster/`.

Exit criteria:

- `go build ./gamemaster/cmd/gamemaster` succeeds.
- Running with valid env brings `/healthz` and `/readyz` up.
- `SIGTERM` returns within `GAMEMASTER_SHUTDOWN_TIMEOUT`.

## ~~Stage 09.~~ PostgreSQL schema, migrations, jet

Goal:

- finalise the persistence schema and the code-generation pipeline.

Tasks:

- `gamemaster/internal/adapters/postgres/migrations/00001_init.sql` — `CREATE SCHEMA IF NOT EXISTS gamemaster;` plus the four tables and indexes from [`./README.md §Persistence Layout`](./README.md): `runtime_records`, `engine_versions`, `player_mappings`, `operation_log`. All time columns are `timestamptz`.
- `gamemaster/internal/adapters/postgres/migrations/migrations.go` — `//go:embed *.sql` and `FS()` exporter, identical pattern to lobby and rtmanager.
- `gamemaster/cmd/jetgen/main.go` — testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors `rtmanager/cmd/jetgen/main.go`.
- Generated `gamemaster/internal/adapters/postgres/jet/...` committed to the repo.
- Wire goose migrations into `gamemaster/internal/app/runtime.go` startup so they apply before any listener opens; non-zero exit on failure (matches `pkg/postgres` policy).

Files new:

- as above.

Exit criteria:

- `make -C gamemaster jet` regenerates the jet code with no diff after a clean run.
- Service start applies migrations to a fresh database and exits zero if migrations are already applied.

## ~~Stage 10.~~ Domain layer and ports

Goal:

- lock the in-memory domain model and the port interfaces for adapters.

Tasks:

- `gamemaster/internal/domain/runtime/model.go` — `RuntimeRecord` struct; status enum (`StatusStarting`, `StatusRunning`, `StatusGenerationInProgress`, `StatusGenerationFailed`, `StatusStopped`, `StatusEngineUnreachable`, `StatusFinished`); error sentinels.
- `gamemaster/internal/domain/runtime/transitions.go` — allowed transitions table and a CAS-friendly validator.
- `gamemaster/internal/domain/engineversion/{model.go, semver.go}` — `EngineVersion` struct (`Version`, `ImageRef`, `Options`, `Status`); semver parse + patch-only comparison helpers.
- `gamemaster/internal/domain/playermapping/model.go` — `PlayerMapping` struct (`GameID`, `UserID`, `RaceName`, `EnginePlayerUUID`).
- `gamemaster/internal/domain/schedule/nexttick.go` — wraps `cronutil.Schedule`; carries `skip_next_tick` semantics on `Next(after, skip bool) (time.Time, skipConsumed bool)`.
- `gamemaster/internal/ports/`:
  - `runtimerecordstore.go` — `Get`, `Insert`, `UpdateStatus` (CAS by expected status), `UpdateScheduling`, `ListDueRunning`, `ListByStatus`.
  - `engineversionstore.go` — `Get`, `List` (with `status` filter), `Insert`, `Update`, `Deprecate`, `IsReferencedByActiveRuntime`.
  - `playermappingstore.go` — `BulkInsert`, `Get(gameID, userID)`, `ListByGame(gameID)`, `DeleteByGame(gameID)`.
  - `operationlog.go` — `Append`, `ListByGame`.
  - `streamoffsetstore.go` — `Load`, `Save` (Redis offset persistence per consumer label).
  - `engineclient.go` — narrow surface GM uses: `Init`, `Status`, `Turn`, `BanishRace`, `ExecuteCommands`, `PutOrders`, `GetReport`.
  - `lobbyclient.go` — `GetMemberships(ctx, gameID) ([]Membership, error)`.
  - `rtmclient.go` — `Stop(ctx, gameID, reason) error`, `Patch(ctx, gameID, imageRef) error`, `Restart` (reserved; not in v1 feature scope).
  - `notificationpublisher.go` — `Publish(ctx, intent) error`.
  - `lobbyeventspublisher.go` — `PublishSnapshotUpdate`, `PublishGameFinished`.
- `//go:generate mockgen` directive next to each interface declaration.

Files new:

- as above.

Exit criteria:

- the package compiles.
- every interface has a `_ ports.X = (*Y)(nil)` assertion slot ready for the adapters that follow.
- `go test ./gamemaster/internal/domain/...` passes.

## ~~Stage 11.~~ Persistence adapters

Goal:

- implement the four PostgreSQL stores and the Redis offset store.

Tasks:

- `gamemaster/internal/adapters/postgres/runtimerecordstore/store.go` using jet. CAS semantics on `UpdateStatus` (expected status comparison inside the SQL `UPDATE ... WHERE game_id = $1 AND status = $2` pattern). `UpdateScheduling` mutates `next_generation_at` and `skip_next_tick` together.
- `gamemaster/internal/adapters/postgres/engineversionstore/store.go`. `IsReferencedByActiveRuntime` joins against `runtime_records WHERE status NOT IN ('finished','stopped')`.
- `gamemaster/internal/adapters/postgres/playermappingstore/store.go`. `BulkInsert` is a single `INSERT ... ON CONFLICT DO NOTHING`.
- `gamemaster/internal/adapters/postgres/operationlog/store.go`.
- `gamemaster/internal/adapters/redisstate/streamoffsets/store.go` (mirror Lobby's and RTM's `redisstate/streamoffsets`).
- For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on `runtime_records.UpdateStatus` are verified by an explicit concurrent-update test (only one of two callers wins). The semver-patch comparison in `engineversion` is verified against a curated table of cases.

Files new:

- as above and per-package `_test.go`.

Exit criteria:

- store tests pass on a CI runner with Docker available.

## ~~Stage 12.~~ External clients (engine, lobby, RTM, notification, lobby-events)

Goal:

- ship the HTTP and Redis adapters that GM uses to talk to the engine, Lobby internal API, RTM internal API, the notification stream, and the lobby-events stream.

Tasks:

- `gamemaster/internal/adapters/engineclient/client.go` — REST client over an `otelhttp`-wrapped `http.Client`. Implements `ports.EngineClient` by calling the renamed admin endpoints (`/api/v1/admin/init`, `/admin/status`, `/admin/turn`, `/admin/race/banish`) and the player endpoints (`/api/v1/command`, `/api/v1/order`, `/api/v1/report`). Builds and consumes the existing JSON shapes from `game/openapi.yaml`.
- `gamemaster/internal/adapters/lobbyclient/client.go` — REST client for `GET /api/v1/internal/games/{game_id}/memberships`. Returns a typed `Membership` slice.
- `gamemaster/internal/adapters/rtmclient/client.go` — REST client for `POST /api/v1/internal/runtimes/{game_id}/stop` and `/patch`.
- `gamemaster/internal/adapters/notificationpublisher/publisher.go` — thin XADD wrapper over `notification:intents` using `galaxy/notificationintent` constructors.
- `gamemaster/internal/adapters/lobbyeventspublisher/publisher.go` — XADD wrapper for `gm:lobby_events`. Two methods: `PublishSnapshotUpdate(ctx, msg)` and `PublishGameFinished(ctx, msg)`. Schema enforced inline against `runtime-events-asyncapi.yaml`.
- `gamemaster/internal/adapters/mocks/` — `mockgen`-generated mocks for every `ports.*` interface. Regenerated by `make -C gamemaster mocks`.
- Per-adapter unit tests with mocks for the clients (httptest server for REST adapters; miniredis for the publishers).

Files new:

- as above.

Exit criteria:

- mocks regenerate cleanly via `go generate`.
- unit tests pass.
- `go test ./gamemaster/internal/adapters/...` passes.

## ~~Stage 13.~~ Service: register-runtime

Goal:

- end-to-end `register-runtime` operation: validate, persist initial record, call engine `/admin/init`, persist player mappings, mark running, schedule first turn.

Tasks:

- `gamemaster/internal/service/registerruntime/service.go` orchestrator, following the flow from [`./README.md §Lifecycles → Register-runtime`](./README.md):
  - validate envelope;
  - reject if `runtime_records.{game_id}` exists;
  - resolve `image_ref` for `target_engine_version` from `engine_versions`;
  - persist `runtime_records.status=starting`;
  - call engine `/admin/init`;
  - persist `player_mappings` rows from the engine response;
  - CAS `status: starting → running`, persist `current_turn=0` and initial `next_generation_at`;
  - append `operation_log`;
  - publish `runtime_snapshot_update`;
  - return persisted runtime record.
- Failure paths: roll back `runtime_records` on engine failure; ensure no orphan `player_mappings` rows; record failure in `operation_log`.
- Unit tests cover happy path, idempotent re-registration (returns `conflict`), engine 4xx (`engine_validation_error`), engine 5xx (`engine_unreachable`), missing engine version (`engine_version_not_found`), partial-rollback paths.

Files new:

- `gamemaster/internal/service/registerruntime/{service.go, service_test.go, errors.go}`.

Exit criteria:

- service-level tests pass.

## ~~Stage 14.~~ Service: engine version registry CRUD + image-ref resolve

Goal:

- the registry surface used by Lobby's start flow and by Admin Service.

Tasks:

- `gamemaster/internal/service/engineversion/service.go`:
  - `List(ctx, statusFilter)` — list versions optionally filtered by `status`;
  - `Get(ctx, version)` — read one;
  - `Create(ctx, version, imageRef, options)` — validate semver, validate Docker reference shape, persist;
  - `Update(ctx, version, patch)` — partial update (`image_ref`, `options`, `status`);
  - `Deprecate(ctx, version)` — set `status=deprecated`;
  - `Delete(ctx, version)` — hard delete; rejected with `engine_version_in_use` if `IsReferencedByActiveRuntime` returns true;
  - `ResolveImageRef(ctx, version)` — read `image_ref` only; this is the hot path used by Lobby.
- Unit tests cover create-validate, delete-when-active rejection, and semver shape validation. Resolve is tested against a seeded table of versions.

Files new:

- `gamemaster/internal/service/engineversion/{service.go, service_test.go, errors.go}`.

Exit criteria:

- service-level tests pass.

## ~~Stage 15.~~ Service: scheduler + turn generation + snapshot publisher

Goal:

- the heart of GM: the periodic scheduler and the turn-generation flow, with snapshot publication and finish detection.

Tasks:

- `gamemaster/internal/service/turngeneration/service.go`:
  - input: `gameID`, `trigger ∈ {scheduler, force}`;
  - CAS `status: running → generation_in_progress`;
  - call engine `/admin/turn`;
  - on success: persist `current_turn`, evaluate `finished`, branch:
    - finished: CAS `status → finished`, persist `finished_at`, `PublishGameFinished`, publish `game.finished` notification, return;
    - not finished: CAS `status → running`, recompute `next_generation_at` (skip a tick if `skip_next_tick=true`, then clear), `PublishSnapshotUpdate`, publish `game.turn.ready` notification, return;
  - on failure: CAS `status → generation_failed`, publish `runtime_snapshot_update` reflecting the new status, publish `game.generation_failed` admin notification, return.
- `gamemaster/internal/service/scheduler/service.go`:
  - thin wrapper that builds the next-tick value from `domain/schedule.NextTick` given `turn_schedule` and `skip_next_tick`;
  - reused by both the ticker worker (Stage 19 wires it) and by the `force-next-turn` admin op (Stage 17).
- `gamemaster/internal/worker/schedulerticker/worker.go`:
  - 1-second loop;
  - calls `runtime_records.ListDueRunning(now)` and runs `turngeneration.Run(ctx, gameID, scheduler)` per game;
  - serialises per-`game_id` calls (one in-flight per game; concurrent games proceed in parallel).
- Unit tests cover happy path, finish detection, force trigger with skip consumption, generation failure, CAS contention with a concurrent external status change (e.g., admin stop).
- Player turn stats are derived from `StateResponse.player[]` and projected to `{user_id, planets, population}` via `playermappingstore.ListByGame`.

Files new:

- `gamemaster/internal/service/turngeneration/{service.go, service_test.go, errors.go}`, `gamemaster/internal/service/scheduler/{service.go, service_test.go}`, `gamemaster/internal/worker/schedulerticker/{worker.go, worker_test.go}`.

Exit criteria:

- service-level tests pass.
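
The skip-a-tick rule used when recomputing `next_generation_at` can be pictured with the sketch below. It assumes `skip_next_tick` means that exactly one scheduled tick is passed over; `Schedule` stands in for the `cronutil.Schedule` surface, while `every`, `NextTick`, and `minutesUntilNext` are hypothetical names that use a fixed interval instead of a cron expression to keep the example free of third-party imports.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is the minimal surface assumed from pkg/cronutil.
type Schedule interface {
	Next(after time.Time) time.Time
}

// every is an illustration-only schedule that fires on whole multiples of its
// period, for example every full hour in UTC.
type every time.Duration

// Next returns the first period boundary strictly after the given instant.
func (e every) Next(after time.Time) time.Time {
	d := time.Duration(e)
	return after.Truncate(d).Add(d)
}

// NextTick returns the next generation time. When skip is true, one scheduled
// tick is passed over and skipConsumed tells the caller to clear the flag.
func NextTick(s Schedule, after time.Time, skip bool) (next time.Time, skipConsumed bool) {
	next = s.Next(after)
	if skip {
		return s.Next(next), true
	}
	return next, false
}

// minutesUntilNext is a hypothetical helper that shows the effect in plain
// numbers for an hourly schedule evaluated at 17:30 UTC.
func minutesUntilNext(skip bool) (int, bool) {
	after := time.Date(2025, 1, 1, 17, 30, 0, 0, time.UTC)
	next, consumed := NextTick(every(time.Hour), after, skip)
	return int(next.Sub(after).Minutes()), consumed
}

func main() {
	for _, skip := range []bool{false, true} {
		m, consumed := minutesUntilNext(skip)
		fmt.Printf("skip=%v next_in=%dm consumed=%v\n", skip, m, consumed)
	}
}
```

Returning `skipConsumed` rather than mutating state keeps the function pure, so the caller can persist `next_generation_at` and the cleared flag together in one `UpdateScheduling` call.
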

## ~~Stage 16.~~ Service: hot-path command + order + report + membership cache

Goal:

- the gateway-facing trio: command execution, order submission, report reading. Membership cache and the invalidation hook.

Tasks:

- `gamemaster/internal/service/membership/cache.go`:
  - in-process `map[gameID]entry{members map[userID]MembershipStatus, loadedAt}`;
  - `Resolve(ctx, gameID, userID) (status, error)` — checks cache, falls back to `lobbyclient.GetMemberships` on miss or TTL expiry;
  - `Invalidate(gameID)` — purges the cache entry;
  - LRU eviction governed by `GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES`.
- `gamemaster/internal/service/commandexecute/service.go`:
  - input: `gameID`, `userID`, payload `{commands:[…]}`;
  - validate `runtime_records.{game_id}` exists with `status=running`;
  - resolve membership; reject if not active;
  - resolve `race_name` from `playermappingstore`;
  - call engine `/api/v1/command` with `CommandRequest{actor=race_name, cmd=…}`;
  - return engine response verbatim.
- `gamemaster/internal/service/orderput/service.go`: identical structure, calls `/api/v1/order`.
- `gamemaster/internal/service/reportget/service.go`: input `{gameID, userID, turn}`; resolves `race_name`; calls `/api/v1/report?player=…&turn=…`; returns body verbatim.
- Unit tests: each service covers happy path, runtime-not-running, forbidden, engine 4xx, engine 5xx; membership cache tests cover hit, miss, TTL expiry, invalidate.

Files new:

- `gamemaster/internal/service/membership/{cache.go, cache_test.go}`, `gamemaster/internal/service/commandexecute/{service.go, service_test.go}`, `gamemaster/internal/service/orderput/{service.go, service_test.go}`, `gamemaster/internal/service/reportget/{service.go, service_test.go}`.

Exit criteria:

- service-level tests pass.

## ~~Stage 17.~~ Service: admin operations (stop, force-next-turn, patch, banish, liveness)

Goal:

- the remaining service-layer operations: admin/runtime control plus the Lobby-facing liveness reply.
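
The patch operation in this stage accepts only patch-level engine upgrades (`semver_patch_only` otherwise). A minimal sketch of such a check follows, assuming that "patch-only" means the major and minor components must match; whether an identical version or a patch downgrade should also be rejected is not stated in this plan and is left open here. `parse` and `IsPatchOnly` are hypothetical names, and pre-release or build metadata is not handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version holds the three numeric semver components.
type version struct{ major, minor, patch int }

// parse accepts plain "MAJOR.MINOR.PATCH" with an optional leading "v".
func parse(s string) (version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil || v < 0 {
			return version{}, fmt.Errorf("version %q: bad component %q", s, p)
		}
		n[i] = v
	}
	return version{n[0], n[1], n[2]}, nil
}

// IsPatchOnly reports whether target stays on the same major.minor line as
// current. It returns an error when either version fails to parse.
func IsPatchOnly(current, target string) (bool, error) {
	c, err := parse(current)
	if err != nil {
		return false, err
	}
	t, err := parse(target)
	if err != nil {
		return false, err
	}
	return c.major == t.major && c.minor == t.minor, nil
}

func main() {
	for _, target := range []string{"1.4.7", "1.5.0", "2.0.0"} {
		ok, err := IsPatchOnly("1.4.2", target)
		fmt.Printf("1.4.2 -> %s patch-only=%v err=%v\n", target, ok, err)
	}
}
```

A table of such pairs is also a natural shape for the curated semver test cases mentioned in Stage 11.
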

Tasks:

- `gamemaster/internal/service/adminstop/service.go`:
  - input `{gameID, reason}`;
  - call `rtmclient.Stop(ctx, gameID, reason)`;
  - on success: CAS `runtime_records.status: * → stopped`; append `operation_log`; publish `runtime_snapshot_update`.
- `gamemaster/internal/service/adminforce/service.go`:
  - run `turngeneration.Run(ctx, gameID, force)` synchronously;
  - on success, set `runtime_records.skip_next_tick = true` (the next scheduler-driven `Next` consumes it).
- `gamemaster/internal/service/adminpatch/service.go`:
  - input `{gameID, version}`;
  - resolve new `image_ref` via `engineversion.ResolveImageRef`;
  - validate semver-patch against current `runtime_records.current_engine_version`; reject with `semver_patch_only` otherwise;
  - call `rtmclient.Patch(ctx, gameID, imageRef)`;
  - on success: persist new `current_image_ref` and `current_engine_version`; append `operation_log`.
- `gamemaster/internal/service/adminbanish/service.go`:
  - input `{gameID, raceName}`;
  - validate `playermappingstore.GetByRace(gameID, raceName)` exists;
  - call engine `/admin/race/banish`;
  - append `operation_log`.
- `gamemaster/internal/service/livenessreply/service.go`:
  - lookup `runtime_records.{game_id}`;
  - return `{ready: status==running, status: }`.
- Unit tests for each service cover happy path and each documented error code.

Files new:

- `gamemaster/internal/service/adminstop/...`, `gamemaster/internal/service/adminforce/...`, `gamemaster/internal/service/adminpatch/...`, `gamemaster/internal/service/adminbanish/...`, `gamemaster/internal/service/livenessreply/...`.

Exit criteria:

- service-level tests pass.

## ~~Stage 18.~~ Async consumer: `runtime:health_events`

Goal:

- bring runtime health into GM's view per game and propagate to Lobby via the snapshot stream.
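
As a rough illustration of this goal, the health summary derivation listed in this stage's tasks can be written as a pure function plus a small change detector. The sketch assumes the event type arrives as a plain string and does not model the `probe_failed` threshold; `healthOutcome`, `summarise`, and `debouncer` are hypothetical names.

```go
package main

import "fmt"

// healthOutcome is what one health event contributes to GM's view of a game.
type healthOutcome struct {
	Summary     string // value stored as the engine health summary
	Unreachable bool   // true when status should CAS to engine_unreachable
}

// summarise maps an RTM event type to its outcome. The second result is false
// for event types this sketch does not know.
func summarise(eventType string) (healthOutcome, bool) {
	switch eventType {
	case "healthy":
		return healthOutcome{Summary: "healthy"}, true
	case "probe_failed": // the plan applies this only after a threshold
		return healthOutcome{Summary: "probe_failed"}, true
	case "inspect_unhealthy":
		return healthOutcome{Summary: "inspect_unhealthy"}, true
	case "container_exited":
		return healthOutcome{Summary: "exited", Unreachable: true}, true
	case "container_oom":
		return healthOutcome{Summary: "oom", Unreachable: true}, true
	case "container_disappeared":
		return healthOutcome{Summary: "disappeared", Unreachable: true}, true
	}
	return healthOutcome{}, false
}

// debouncer remembers the last summary per game so that a snapshot update is
// emitted only when the summary string changes.
type debouncer struct{ last map[string]string }

func newDebouncer() *debouncer { return &debouncer{last: map[string]string{}} }

// changed records the summary and reports whether it differs from the
// previous one seen for the same game.
func (d *debouncer) changed(gameID, summary string) bool {
	if prev, ok := d.last[gameID]; ok && prev == summary {
		return false
	}
	d.last[gameID] = summary
	return true
}

func main() {
	d := newDebouncer()
	for _, ev := range []string{"healthy", "healthy", "container_oom"} {
		out, _ := summarise(ev)
		fmt.Printf("%s -> summary=%s unreachable=%v publish=%v\n",
			ev, out.Summary, out.Unreachable, d.changed("game-1", out.Summary))
	}
}
```

Keeping the mapping free of I/O makes it easy to cover every event type in a table-driven test before the Redis consumer exists.
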

Tasks:

- `gamemaster/internal/worker/healtheventsconsumer/worker.go`:
  - XREADs `runtime:health_events` with a persisted offset (via `streamoffsetstore`);
  - decodes the AsyncAPI envelope from RTM;
  - updates `runtime_records.engine_health` per `game_id`;
  - emits a debounced `runtime_snapshot_update` only when the summary string changes.
- The summary derivation rule:
  - `healthy` ⇒ summary `healthy`;
  - `probe_failed` after threshold ⇒ summary `probe_failed`;
  - `inspect_unhealthy` ⇒ summary `inspect_unhealthy`;
  - `container_exited` ⇒ summary `exited` and CAS `status → engine_unreachable`;
  - `container_oom` ⇒ summary `oom` and CAS `status → engine_unreachable`;
  - `container_disappeared` ⇒ summary `disappeared` and CAS `status → engine_unreachable`.
- Unit tests use `miniredis` and the AsyncAPI fixture from `rtmanager/api/runtime-health-asyncapi.yaml`.

Files new:

- `gamemaster/internal/worker/healtheventsconsumer/{worker.go, worker_test.go}`.

Exit criteria:

- worker tests pass.

## ~~Stage 19.~~ Internal REST handlers

Goal:

- ship the gateway-, Lobby-, and Admin-facing REST surface backed by the service layer.

Tasks:

- `gamemaster/internal/api/internalhttp/handlers/{registerruntime, getruntime, listruntimes, forcenextturn, stopruntime, patchruntime, banishrace, invalidatememberships, gameliveness, listengineversions, createengineversion, getengineversion, updateengineversion, deprecateengineversion, resolveengineversionimageref, executecommands, putorders, getreport}.go`: one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields rejected with `invalid_request`.
- Error envelope identical to lobby and rtmanager.
- Wiring under the existing internal HTTP listener; route registration in `gamemaster/internal/app/wiring.go`.
- Handler-level table-driven tests.
- OpenAPI conformance test that loads `api/internal-openapi.yaml` and asserts every defined operation is reachable and matches its declared response shape.

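
One way to meet the "unknown JSON fields rejected with `invalid_request`" rule is the strict mode of Go's standard decoder. The sketch below is illustrative only: `decodeStrict`, `errInvalidRequest`, and the `executeCommandsRequest` shape are hypothetical names, not part of the GM contract. The only real library behaviour relied on is `encoding/json`'s `Decoder.DisallowUnknownFields`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// errInvalidRequest is a hypothetical sentinel that a handler would map to
// the `invalid_request` code in the shared error envelope.
var errInvalidRequest = errors.New("invalid_request")

// executeCommandsRequest is an illustrative payload shape, not the real contract.
type executeCommandsRequest struct {
	Commands []json.RawMessage `json:"commands"`
}

// decodeStrict decodes exactly one JSON value from data into dst and rejects
// unknown fields as well as trailing data after the first value.
func decodeStrict(data []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return fmt.Errorf("%w: %v", errInvalidRequest, err)
	}
	// A second Decode must report io.EOF; anything else means extra content.
	if err := dec.Decode(&struct{}{}); !errors.Is(err, io.EOF) {
		return fmt.Errorf("%w: unexpected trailing data", errInvalidRequest)
	}
	return nil
}

func main() {
	var req executeCommandsRequest

	// A body with only declared fields is accepted.
	fmt.Println(decodeStrict([]byte(`{"commands":[]}`), &req) == nil)

	// A body carrying an undeclared field is rejected.
	err := decodeStrict([]byte(`{"commands":[],"actor":"x"}`), &req)
	fmt.Println(errors.Is(err, errInvalidRequest))
}
```

A handler built this way would translate `errInvalidRequest` into the shared error envelope; the HTTP status and envelope fields are left to the contract in `api/internal-openapi.yaml`.
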

Files new:

- handlers + tests + the conformance test `gamemaster/api/openapi_conformance_test.go`.

Exit criteria:

- OpenAPI conformance test passes for every endpoint.
- Handlers reject unknown JSON fields.

## Stage 20. Lobby refactor

Goal:

- complete the Lobby side of the new image-resolve and membership invalidation contract.

Tasks:

- Replace `lobby/internal/domain/engineimage/resolver.go` with a thin GM-client wrapper. The package goes away; the call site in `lobby/internal/service/startgame/service.go` switches from `engineimage.Resolver{}.Resolve(version)` to `gmClient.ResolveImageRef(ctx, version)`.
- Drop `LOBBY_ENGINE_IMAGE_TEMPLATE` from `lobby/internal/config/{config.go, env.go, validation.go}`. Remove the validation function and the related env-var test cases.
- Add `InvalidateMemberships(ctx, gameID) error` to `lobby/internal/ports/gmclient.go`. Regenerate the `mockgen` mock and update the inmem fake to record invocations.
- Wire the new call from:
  - `lobby/internal/service/approveapplication/service.go`: post-commit;
  - `lobby/internal/service/rejectapplication/service.go`: post-commit (only if a reservation existed prior);
  - `lobby/internal/service/redeeminvite/service.go`: post-commit;
  - `lobby/internal/service/removemember/service.go`: post-commit (already in scope of removal);
  - `lobby/internal/service/blockmember/service.go`: post-commit;
  - `lobby/internal/worker/userlifecycle/consumer.go`: post-commit per game in the cascade.
- A failed invalidation is logged at `warn` and counted in the existing `lobby.notification.publish_attempts`-style metric (or a new `lobby.gm_invalidation.publish_attempts`), but it does not roll back the business commit. The TTL on GM is the safety net.
- Update Lobby unit tests, in particular the start-flow tests (replace `engineimage` mock with `gmclient.ResolveImageRef` mock) and the membership-mutation tests (assert `InvalidateMemberships` was called post-commit).
- Update `lobby/api/internal-openapi.yaml` only if any new field surfaces (none expected; the call shape is on Lobby's outbound side, not on its REST surface).

Files touched:

- `lobby/internal/service/{startgame, approveapplication, rejectapplication, redeeminvite, removemember, blockmember}/`,
  `lobby/internal/worker/userlifecycle/`,
  `lobby/internal/config/{config.go, env.go, validation.go}`,
  `lobby/internal/ports/gmclient.go`,
  `lobby/internal/adapters/gmclient/client.go`,
  `lobby/internal/adapters/mocks/gmclient/...`,
  `lobby/internal/adapters/gmclientinmem/...` (if the inmem fake exists; otherwise the mockgen mock plus the migration described in RTM stage 22 is enough).

Files removed:

- `lobby/internal/domain/engineimage/` (entire package).

Exit criteria:

- `go test ./lobby/...` passes.
- `LOBBY_ENGINE_IMAGE_TEMPLATE` no longer appears in any Lobby source or documentation.
- Lobby's start-flow integration test still passes against a stub `gmclient` that returns `image_ref` synchronously.

## Stage 21. Service-local integration suite

Goal:

- end-to-end suite running against testcontainers PostgreSQL + Redis + the real `galaxy/game` engine container.

Tasks:

- `gamemaster/integration/harness/`: set up PostgreSQL with goose-applied migrations; Redis (testcontainers Redis for coordination suites that exercise streams); ensure the Docker bridge network exists; build `galaxy/game` test image once per package run with `sync.Once`; tear everything down via `t.Cleanup`. Reuse the RTM-built image where possible (skip rebuilding when present).
- `gamemaster/integration/registerruntime_test.go`: register-runtime happy path: GM persists the runtime record, calls engine `/admin/init`, persists `player_mappings`, transitions to `running`, publishes a `runtime_snapshot_update`. Engine answers with a real `StateResponse`.
- `gamemaster/integration/scheduler_test.go`: schedules a five-second turn cron, observes one tick, asserts engine `/admin/turn` was hit and `current_turn` advanced. Force-next-turn test asserts `skip_next_tick` consumes the next regular tick.
- `gamemaster/integration/hotpath_test.go`: full command, order, and report round-trips against the real engine. Membership invalidation hook test asserts the cache flushes on demand.
- `gamemaster/integration/adminops_test.go`: admin stop calls a stub RTM and asserts the runtime record transitions to `stopped`. Admin patch with a non-patch semver target fails with `semver_patch_only`. Admin banish hits the engine endpoint.
- `gamemaster/integration/healthevents_test.go`: publishes a fake `runtime:health_events` entry, asserts the consumer updates `engine_health` and emits a debounced snapshot.
- `gamemaster/integration/notification_test.go`: observe `notification:intents` after a successful turn (`game.turn.ready`), after a finish (`game.finished`), and after a forced engine failure (`game.generation_failed` admin email).

Files new:

- as above.

Exit criteria:

- `go test ./gamemaster/integration/...` passes locally with Docker available.
- CI runs the suite under a profile that exposes the Docker socket.

## Stage 22. Inter-service test: Lobby ↔ GM

Goal:

- exercise the new image-ref resolve, register-runtime, and membership invalidation paths end-to-end without RTM in the loop.

Tasks:

- `integration/lobbygm/` (top-level integration directory, mirroring existing `integration/lobbyrtm`): runs real Lobby, real GM, real PostgreSQL, real Redis, a stub RTM that simply returns success on `runtime:start_jobs`, and the real `galaxy/game` test engine container.
- Scenarios:
  - Lobby creates a game, resolves `image_ref` from GM, publishes a start_job, the stub RTM acks success, Lobby calls `register-runtime` on GM, GM `/admin/init`s the engine, GM transitions to `running`, GM publishes `runtime_snapshot_update`, Lobby updates its denormalised view.
  - One full turn generation cycle: scheduler ticks, GM calls engine `/admin/turn`, GM publishes `runtime_snapshot_update`, Lobby's per-game stats aggregate updates.
  - Membership change: an admin removes a member; Lobby's `removemember` post-commit calls GM `invalidate-memberships`; the next player command from that user fails with `forbidden`.
  - Game finish: engine returns `finished:true`; GM publishes `game_finished`; Lobby transitions the platform game record to `finished` and runs the capability evaluator.

Files new:

- as above.

Exit criteria:

- all scenarios pass in CI when the Docker socket is available.

## Stage 23. Inter-service test: Lobby ↔ GM ↔ RTM (full happy path)

Goal:

- the canonical end-to-end test covering the whole running-game pipeline.

Tasks:

- `integration/lobbygmrtm/`: runs real Lobby, real GM, real RTM, real PostgreSQL, real Redis, and the real `galaxy/game` test engine container.
- Scenarios:
  - Happy path: enrollment → start → RTM container → GM register-runtime → engine `/admin/init` → first player command → first scheduled turn → engine `finished:true` → GM `game_finished` → Lobby transitions to `finished` → RTM cleanup TTL.
  - Failure path A: RTM reports `start_config_invalid` on `runtime:job_results`; Lobby transitions the game to `start_failed`; no GM register-runtime is attempted.
  - Failure path B: container starts but GM is unavailable when Lobby calls `register-runtime`; Lobby transitions the game to `paused` and publishes `lobby.runtime_paused_after_start`; once GM comes back, Lobby's resume flow calls GM `/liveness`, receives `ready=true`, re-issues `register-runtime`, and the game reaches `running`.

Files new:

- as above.


Exit criteria:

- all scenarios pass in CI when the Docker socket is available.

## Stage 24. Service-local docs

Goal:

- drop per-stage decisions captured during this plan into discoverable service-local documentation, mirroring `lobby/docs/` and `rtmanager/docs/`.

Tasks:

- `gamemaster/docs/README.md`: index pointing at the five content docs and the postgres-migration record.
- `gamemaster/docs/runtime.md`: components, processes, in-memory state of each worker.
- `gamemaster/docs/flows.md`: Mermaid diagrams for: register-runtime, turn generation, force-next-turn skip, hot-path command, admin patch, finish, health consumption, banish.
- `gamemaster/docs/runbook.md`: operator scenarios: «engine became unreachable», «turn generation failed and stuck», «patch upgrade», «manual force-next-turn», «engine version registry rotation», «membership cache appears stale».
- `gamemaster/docs/examples.md`: env-var examples per environment (dev / test / prod skeletons), example payloads for each stream and each REST endpoint.
- `gamemaster/docs/postgres-migration.md`: decision record for the schema (mirrors `notification/docs/postgres-migration.md` style).
- Add per-stage decision records under `gamemaster/docs/stage-*.md` for any stage that produced a noteworthy decision (mirroring the RTM pattern). At minimum:
  - `stage11-persistence-adapters.md`,
  - `stage12-external-clients.md`,
  - `stage15-scheduler-and-turn-generation.md`,
  - `stage16-membership-cache-and-invalidation.md`,
  - `stage17-admin-operations.md`,
  - `stage18-health-events-consumer.md`,
  - `stage20-lobby-refactor.md`.

Files new:

- all of the above.

Exit criteria:

- the README of GM links to `docs/README.md`.
- a reviewer can find any operational how-to within two clicks.

## Final Acceptance Criteria

- `go build ./...` from the repository root succeeds.
- `go test ./...` from the repository root passes.
- `go test -tags=integration ./gamemaster/integration/...` passes when Docker is available.
- `go test ./integration/lobbygm/...` and `go test ./integration/lobbygmrtm/...` pass when Docker is available.
- `make -C gamemaster jet` regenerates jet code with no diff after a clean run.
- `make -C gamemaster mocks` regenerates mock code with no diff after a clean run.
- Manual smoke: bring Lobby + GM + RTM + the rest of the stack up via the existing dev compose; create a game; observe a real `galaxy-game-{game_id}` container; play one turn round-trip; observe a `runtime_snapshot_update` on `gm:lobby_events`; force-next-turn; observe the next scheduled tick is skipped; stop the game; the container moves to `exited`.
- Documentation across `ARCHITECTURE.md`, `gamemaster/`, `lobby/`, `notification/`, `game/`, and `rtmanager/` is internally consistent.

## Out of Scope

- Multi-instance GM with leader election (`Game Master` runs as a single process in v1).
- Engine state file management (backup, archival, host-side cleanup).
- Direct gateway routing of admin `message_type` values (admin operations land via Admin Service in a later iteration; v1 exposes only the GM internal REST surface).
- TLS / mTLS on the internal listener.
- Engine-version automatic patch upgrades (manual admin operation only).
- A pause/resume flow on GM's side beyond the liveness-check reply.

## Risks and Notes

- The membership invalidation hook from Lobby into GM is a deliberate tight coupling. TTL stays as the safety net for any failed invalidation; the explicit hook only optimises for the staleness window. Failure to invalidate is logged but never rolls back Lobby state. This trade-off is recorded in [`./README.md` §Hot Path](./README.md).
- Lobby refactor (Stage 20) gates on GM stages 14 (engine version registry resolve endpoint) and 19 (handlers wired). Once Lobby switches to GM for image-ref resolution, Lobby cannot start a game when GM is unavailable; this is documented as the new failure mode in `lobby/README.md` (Stage 03).
- Engine path rename (Stage 05) is internal to `galaxy/game`. No other service today calls `/api/v1/init`, `/api/v1/status`, or `/api/v1/turn` (RTM probes only `/healthz`); the rename is therefore a contained change inside the engine module. The user owns the conditional logic that fills `StateResponse.finished` and the body-level mechanics of `banish`.
- GM single-instance is a single point of failure for turn generation in v1. The trade-off is acceptable for the prototype and is documented in `gamemaster/README.md §Non-Goals`.
- Pre-launch single-init policy applies to GM exactly as documented in `ARCHITECTURE.md §Persistence Backends`: schema evolves by editing `00001_init.sql` until first production deploy.