Game Master Implementation Plan
This plan delivers Game Master (GM), the platform service that owns
runtime and operational state of running Galaxy games, mediates every call
to the engine container, runs the turn scheduler, and owns the engine
version registry.
The plan also delivers the upstream changes that GM depends on: the
extracted pkg/cronutil module, the engine admin-path rename plus the
finished:bool field and the new /admin/race/banish endpoint on
galaxy/game, the Lobby refactor that drops LOBBY_ENGINE_IMAGE_TEMPLATE
in favour of synchronous image-ref resolution against GM, and the
membership invalidation hook from Lobby into GM.
The architectural rules behind every decision are recorded in
./README.md. This file describes the order in which the
implementation lands.
Global Rules
- Documentation always lands before contracts; contracts before code.
- Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
- Existing-service refactors (Lobby image-ref resolver, Lobby membership invalidation hook, game engine path rename plus finished field plus banish endpoint, pkg/cronutil extraction) are full-fledged stages of this plan; they precede every GM stage that depends on them.
- GM never opens the Docker SDK. Every container operation goes through Runtime Manager over trusted internal REST.
- GM never trusts an actor field provided in a payload from Edge Gateway; it always derives actor=race_name from its own (user_id → race_name) mapping.
- Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 06 onward.
- All code, docs, and identifiers are written in English.
- Engine domain logic (when finished=true is set, what banish mutates inside the game) is user-owned and explicitly out of scope; this plan ships only the contract, router plumbing, and stub handlers for those pieces.
Suggested Module Structure
gamemaster/
├── cmd/
│ ├── gamemaster/
│ │ └── main.go
│ └── jetgen/
│ └── main.go
│
├── internal/
│ ├── app/
│ │ ├── app.go
│ │ ├── runtime.go
│ │ ├── wiring.go
│ │ └── bootstrap.go
│ │
│ ├── config/
│ │ ├── config.go
│ │ ├── env.go
│ │ └── validation.go
│ │
│ ├── logging/
│ │ ├── logger.go
│ │ └── context.go
│ │
│ ├── telemetry/
│ │ └── runtime.go
│ │
│ ├── domain/
│ │ ├── runtime/
│ │ │ ├── model.go
│ │ │ └── transitions.go
│ │ ├── engineversion/
│ │ │ ├── model.go
│ │ │ └── semver.go
│ │ ├── playermapping/
│ │ │ └── model.go
│ │ └── schedule/
│ │ └── nexttick.go
│ │
│ ├── ports/
│ │ ├── runtimerecordstore.go
│ │ ├── engineversionstore.go
│ │ ├── playermappingstore.go
│ │ ├── operationlog.go
│ │ ├── streamoffsetstore.go
│ │ ├── engineclient.go
│ │ ├── lobbyclient.go
│ │ ├── rtmclient.go
│ │ ├── notificationpublisher.go
│ │ └── lobbyeventspublisher.go
│ │
│ ├── adapters/
│ │ ├── postgres/
│ │ │ ├── migrations/
│ │ │ ├── jet/
│ │ │ ├── runtimerecordstore/
│ │ │ ├── engineversionstore/
│ │ │ ├── playermappingstore/
│ │ │ └── operationlog/
│ │ ├── redisstate/
│ │ │ └── streamoffsets/
│ │ ├── engineclient/
│ │ ├── lobbyclient/
│ │ ├── rtmclient/
│ │ ├── notificationpublisher/
│ │ ├── lobbyeventspublisher/
│ │ └── mocks/
│ │
│ ├── service/
│ │ ├── registerruntime/
│ │ ├── engineversion/
│ │ ├── scheduler/
│ │ ├── turngeneration/
│ │ ├── commandexecute/
│ │ ├── orderput/
│ │ ├── reportget/
│ │ ├── membership/
│ │ ├── adminstop/
│ │ ├── adminforce/
│ │ ├── adminpatch/
│ │ ├── adminbanish/
│ │ └── livenessreply/
│ │
│ ├── worker/
│ │ ├── schedulerticker/
│ │ └── healtheventsconsumer/
│ │
│ └── api/
│ └── internalhttp/
│ ├── server.go
│ └── handlers/
│
├── api/
│ ├── internal-openapi.yaml
│ └── runtime-events-asyncapi.yaml
│
├── integration/
│ ├── harness/
│ ├── registerruntime_test.go
│ ├── scheduler_test.go
│ ├── hotpath_test.go
│ ├── adminops_test.go
│ ├── healthevents_test.go
│ └── notification_test.go
│
├── docs/
│ ├── README.md
│ ├── runtime.md
│ ├── flows.md
│ ├── runbook.md
│ ├── examples.md
│ └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod
Stage 01. Update ARCHITECTURE.md
Goal:
- align the project-wide source of truth with every decision recorded in ./README.md before any code change touches it.
Tasks:
- Expand ARCHITECTURE.md §8 (Game Master) with subsections: engine container contract (admin vs player paths, finished:bool semantics, banish endpoint), runtime status enum (starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished), turn cutoff rule (no shadow window; CAS-only), force-next-turn skip rule, snapshot publishing cadence (events only, no heartbeat), single-instance topology.
- Update §«Versioning of Game Engines»: GM owns the engine version registry from v1; Lobby resolves image_ref synchronously through GM. LOBBY_ENGINE_IMAGE_TEMPLATE is removed. The engine_versions table lives in the gamemaster schema.
- Update §«Fixed synchronous interactions»: add Game Lobby → Game Master for register-runtime, image-ref resolve, membership invalidation hook, banish, and liveness reply. Add Edge Gateway → Game Master for player commands, orders, and reports.
- Update §«Fixed asynchronous interactions»: add Game Master → Game Lobby runtime snapshot updates and game-finish events through the gm:lobby_events Redis Stream (already mentioned, expanded with cadence rules); add Runtime Manager → Game Master health events consumption through runtime:health_events (already mentioned, confirmed).
- Update §«Persistence Backends»: add the gamemaster schema to the schema-per-service list and to PG-backed services.
- Update §«Configuration»: add GAMEMASTER to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
- Update §«Recommended Order of Service Implementation» entry 8 with the scope finalised in ./README.md.
- Drop ships_built from every architectural mention of player_turn_stats. Update the capability rule wording to use planets and population only (no behavioural change; ships_built was unused).
Files touched:
- ARCHITECTURE.md.
Exit criteria:
- every later GM, Lobby, Notification, or Game stage can quote its rules from ARCHITECTURE.md without re-deciding them.
- go test ./... is unaffected (this stage changes only Markdown).
Stage 02. Freeze GM README.md
Status: implemented as part of this planning task — see
./README.md.
Goal:
- publish the complete service description so contracts and code can reference one source.
Exit criteria:
- a reviewer can answer any «what does GM do when X» question by reading the README alone.
Stage 03. Sync existing-service docs (Lobby, Notification, Game, RTM)
Goal:
- bring the READMEs of every touched service into agreement with the GM contract before any code in those services changes.
Tasks:
- lobby/README.md:
  - replace the LOBBY_ENGINE_IMAGE_TEMPLATE configuration entry with a new LOBBY_GM_BASE_URL-backed image-ref resolve via GET /api/v1/internal/engine-versions/{version}/image-ref;
  - document the new outgoing POST /api/v1/internal/games/{id}/memberships/invalidate call from removemember, blockmember, approveapplication, rejectapplication, redeeminvite, and the user-lifecycle cascade worker (post-commit, fail-open);
  - drop ships_built from the player_turn_stats description and from the capability evaluation wording (rule already reduces to planets + population);
  - add a paragraph in §Game Start Flow noting that image_ref is resolved from GM synchronously and that GM unavailability turns lobby.game.start into service_unavailable.
- lobby/PLAN.md: append a closing note stating that the image-ref template removal and the membership invalidation hook are landed by the Game Master plan; no new stages are added in Lobby's own PLAN.
- notification/README.md: confirm the catalog already lists game.turn.ready, game.finished, game.generation_failed and add a one-line note that GM is the producer.
- game/README.md:
  - document the new path layout: admin endpoints under /api/v1/admin/* (init, status, turn, race/banish); player endpoints unchanged at /api/v1/{command, order, report};
  - document the finished:bool extension on StateResponse;
  - document the POST /api/v1/admin/race/banish request/response shape (body {race_name}; response 204).
- rtmanager/README.md: add a closing note that runtime:health_events is now consumed by Game Master in production (was reserved as a future consumer).
Files touched:
- lobby/README.md, lobby/PLAN.md, notification/README.md, game/README.md, rtmanager/README.md.
Exit criteria:
- every doc in the repo agrees on the post-GM contract; no contradiction remains between any two READMEs.
- go test ./... is unaffected.
Stage 04. Extract pkg/cronutil + wire Lobby
Goal:
- own a single cron parser/calculator across the workspace, used today by Lobby and tomorrow by GM.
Tasks:
- Create new workspace module pkg/cronutil/ with:
  - cronutil.go: thin wrapper over github.com/robfig/cron/v3.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow); exports Parse(expr string) (Schedule, error) and Schedule.Next(after time.Time) time.Time;
  - cronutil_test.go: parser validation tests covering five-field cron expressions (e.g., 0 18 * * *, */15 * * * *), invalid expressions, DST/timezone behaviour (Schedule operates in UTC; UTC inputs yield UTC outputs);
  - go.mod declaring the module galaxy/cronutil with replace target.
- Wire from Lobby: replace any inline robfig/cron/v3 usage in lobby/internal/domain/game/model.go:validateCronExpr and the enrollment automation worker with calls into pkg/cronutil. The enrollment automation worker does not parse cron today (it uses enrollment_ends_at UTC seconds), so the only Lobby caller is the cron-validation path on game records.
- Update go.work to include ./pkg/cronutil and add the replace block.
- Add Lobby unit tests confirming validateCronExpr accepts and rejects the same expressions as before.
Files new:
- pkg/cronutil/{cronutil.go, cronutil_test.go, go.mod, go.sum}.
Files touched:
- go.work, go.work.sum, lobby/internal/domain/game/model.go, lobby/go.mod, lobby/go.sum.
Exit criteria:
- go build ./... succeeds.
- go test ./pkg/cronutil/... ./lobby/... passes.
- lobby/internal/domain/game/model_test.go still asserts the same acceptance set on cron expressions.
Stage 05. Game engine contract: admin paths + finished + banish
Goal:
- ship the contract changes to galaxy/game that GM depends on: admin routes under /api/v1/admin/*, the StateResponse.finished field, and the new /admin/race/banish endpoint.
Tasks:
- game/openapi.yaml:
  - rename /api/v1/init → /api/v1/admin/init (operation initGame → adminInitGame);
  - rename /api/v1/status → /api/v1/admin/status (operation getGameStatus → adminGetGameStatus);
  - rename /api/v1/turn → /api/v1/admin/turn (operation generateTurn → adminGenerateTurn);
  - add POST /api/v1/admin/race/banish (operation adminBanishRace) with body {race_name} and 204 No Content on success; document the same 400 and 500 error envelopes as the existing endpoints;
  - extend the StateResponse schema with finished:bool (required; default false from the server perspective, documented in the description).
- game/internal/router/router.go (or its router-helper file): rename the route constants and registrations to the new admin paths; add a new route for /admin/race/banish wired to a stub handler returning 204 with an empty body.
- game/internal/router/handler/banish.go: new file with a stub handler that decodes the body, validates that race_name is non-empty, and returns 204. Logging only; no game-state mutation. The user fills in domain logic in a separate change.
- game/internal/model/state.go: add a Finished bool field to the Go struct backing StateResponse. Default-zero (false) on serialisation; the user fills in conditional logic.
- game/internal/router/{init,status,turn}_test.go: update path literals to the new admin form; tests stay green.
- game/openapi_contract_test.go: assert presence of the new operation IDs (adminInitGame, adminGetGameStatus, adminGenerateTurn, adminBanishRace), the new path components, and the finished field on StateResponse.
Files new:
- game/internal/router/handler/banish.go, game/internal/router/banish_test.go (path-level test only).
Files touched:
- game/openapi.yaml, game/openapi_contract_test.go, game/internal/router/router.go, game/internal/router/handler/*.go, game/internal/router/{init,status,turn}_test.go, game/internal/model/state.go.
Exit criteria:
- go test ./game/... passes.
- docker build -t galaxy/game:test -f game/Dockerfile . from the workspace root still succeeds.
- curl -X POST http://localhost:8080/api/v1/admin/race/banish -d '{"race_name":"Aelinari"}' against a running container returns 204.
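The banish stub described in this stage could look roughly like the sketch below. It uses only net/http from the standard library; the names banishRequest, banishStub, and banishStatus are hypothetical, and the plain-text 400 body is a placeholder for the error envelope that game/openapi.yaml actually defines.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// banishRequest mirrors the {race_name} body described in this stage.
type banishRequest struct {
	RaceName string `json:"race_name"`
}

// banishStub decodes the body, checks that race_name is non-empty, and
// answers 204. It deliberately mutates no game state.
func banishStub(w http.ResponseWriter, r *http.Request) {
	var req banishRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || strings.TrimSpace(req.RaceName) == "" {
		// Placeholder: the real 400 envelope comes from game/openapi.yaml.
		http.Error(w, "race_name is required", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

// banishStatus runs the stub against an in-memory request and returns the status code.
func banishStatus(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/admin/race/banish", strings.NewReader(body))
	rec := httptest.NewRecorder()
	banishStub(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(banishStatus(`{"race_name":"Aelinari"}`)) // 204
	fmt.Println(banishStatus(`{"race_name":""}`))         // 400
}
```

A path-level test in banish_test.go can follow the same shape as banishStatus: build a request, record the response, compare the code.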
Stage 06. GM contract files and contract tests
Goal:
- ship machine-readable contracts before any GM handler is written, so the implementation has a target spec.
Tasks:
- gamemaster/api/internal-openapi.yaml: every internal REST endpoint with request and response schemas; error envelope { "error": { "code", "message" } } identical to Lobby. Operation IDs: internalRegisterRuntime, internalGetRuntime, internalListRuntimes, internalForceNextTurn, internalStopRuntime, internalPatchRuntime, internalBanishRace, internalInvalidateMemberships, internalGameLiveness, internalListEngineVersions, internalCreateEngineVersion, internalGetEngineVersion, internalUpdateEngineVersion, internalDeprecateEngineVersion, internalResolveEngineVersionImageRef, internalExecuteCommands, internalPutOrders, internalGetReport, internalHealthz, internalReadyz.
- gamemaster/api/runtime-events-asyncapi.yaml: AsyncAPI 3.1.0 spec for gm:lobby_events. Two event_type values: runtime_snapshot_update and game_finished. Frozen field set per message:
  - runtime_snapshot_update {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[], occurred_at_ms};
  - game_finished {game_id, final_turn_number, runtime_status, player_turn_stats[], finished_at_ms}.
- gamemaster/contract_openapi_test.go: load the OpenAPI spec via kin-openapi, assert every operation ID is present, every required field on every request/response schema is present, and that additionalProperties: false is set on every body schema.
- gamemaster/contract_asyncapi_test.go: load the AsyncAPI spec via the shared YAML walker pattern from notification/contract_asyncapi_test.go; assert message names, channel addresses, action vocabulary (send/receive), and event_type discriminator values.
Files new:
- gamemaster/api/internal-openapi.yaml, gamemaster/api/runtime-events-asyncapi.yaml, gamemaster/contract_openapi_test.go, gamemaster/contract_asyncapi_test.go.
Exit criteria:
- both specs validate.
- contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.
Stage 07. Notification catalog audit (no-op or minor)
Goal:
- confirm the GM-owned notification types (game.turn.ready, game.finished, game.generation_failed) are already wired through pkg/notificationintent, the notification service's catalog data tables, and notification/api/intents-asyncapi.yaml. Add freeze assertions so a future drift breaks loudly.
Tasks:
- Run a freeze test inside gamemaster/ that imports galaxy/notificationintent and asserts the existence of the three constructors plus payload struct shapes.
- Inspect notification/api/intents-asyncapi.yaml for the three message schemas; if any are missing the per-payload required fields, add them here.
- Inspect the notification service's routing data tables (the location is internal to notification/internal/...); confirm the three types are present with audience and channel decisions matching ./README.md §Notification Contracts. Add entries if missing.
- Extend notification/contract_asyncapi_test.go if any new payload schema entries were added.
Files touched (only if drift is found):
- notification/api/intents-asyncapi.yaml, notification/internal/... (catalog data), notification/contract_asyncapi_test.go.
Files new:
- gamemaster/notificationintent_audit_test.go.
Exit criteria:
- the freeze test passes.
- notification/contract_asyncapi_test.go and intent_acceptance_contract_test.go continue to pass.
Stage 08. GM module skeleton
Goal:
- create a buildable gamemaster binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.
Tasks:
- gamemaster/cmd/gamemaster/main.go mirroring rtmanager/cmd/rtmanager/main.go.
- gamemaster/internal/config/{config.go, env.go, validation.go} with env prefix GAMEMASTER and groups Listener, Postgres, Redis, Streams, Engine client, Lobby internal client, RTM internal client, Scheduler, Membership cache, Logging, Lifecycle, Telemetry. Required variables fail-fast.
- gamemaster/internal/logging/{logger.go, context.go} copied from lobby/notification.
- gamemaster/internal/telemetry/runtime.go registering the metrics named in ./README.md §Observability.
- gamemaster/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go}: empty wiring with PostgreSQL open, Redis open, telemetry open, probe listener open.
- gamemaster/internal/api/internalhttp/server.go: listener with /healthz and /readyz only.
- gamemaster/Makefile with the jet target (real generation lands in Stage 09) and a mocks target.
- gamemaster/go.mod and go.sum with dependencies: github.com/redis/go-redis/v9, github.com/jackc/pgx/v5, github.com/go-jet/jet/v2, github.com/pressly/goose/v3, github.com/stretchr/testify, go.uber.org/mock, the testcontainers modules for postgres/redis, the OpenTelemetry stack identical to lobby, galaxy/cronutil, galaxy/notificationintent, galaxy/postgres, galaxy/redisconn, galaxy/error, galaxy/util.
- Update repo-level go.work: ./gamemaster is already a workspace member; verify the module path and go.work.sum.
Files new:
- the entire skeleton tree under gamemaster/.
Exit criteria:
- go build ./gamemaster/cmd/gamemaster succeeds.
- Running with valid env brings /healthz and /readyz up.
- SIGTERM returns within GAMEMASTER_SHUTDOWN_TIMEOUT.
Stage 09. PostgreSQL schema, migrations, jet
Goal:
- finalise the persistence schema and the code-generation pipeline.
Tasks:
- gamemaster/internal/adapters/postgres/migrations/00001_init.sql: CREATE SCHEMA IF NOT EXISTS gamemaster; plus the four tables and indexes from ./README.md §Persistence Layout: runtime_records, engine_versions, player_mappings, operation_log. All time columns are timestamptz.
- gamemaster/internal/adapters/postgres/migrations/migrations.go: //go:embed *.sql and FS() exporter, identical pattern to lobby and rtmanager.
- gamemaster/cmd/jetgen/main.go: testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors rtmanager/cmd/jetgen/main.go.
- Generated gamemaster/internal/adapters/postgres/jet/... committed to the repo.
- Wire goose migrations into gamemaster/internal/app/runtime.go startup so they apply before any listener opens; non-zero exit on failure (matches pkg/postgres policy).
Files new:
- as above.
Exit criteria:
- make -C gamemaster jet regenerates the jet code with no diff after a clean run.
- Service start applies migrations to a fresh database and exits zero if migrations are already applied.
Stage 10. Domain layer and ports
Goal:
- lock the in-memory domain model and the port interfaces for adapters.
Tasks:
- gamemaster/internal/domain/runtime/model.go: RuntimeRecord struct; status enum (StatusStarting, StatusRunning, StatusGenerationInProgress, StatusGenerationFailed, StatusStopped, StatusEngineUnreachable, StatusFinished); error sentinels.
- gamemaster/internal/domain/runtime/transitions.go: allowed transitions table and a CAS-friendly validator.
- gamemaster/internal/domain/engineversion/{model.go, semver.go}: EngineVersion struct (Version, ImageRef, Options, Status); semver parse + patch-only comparison helpers.
- gamemaster/internal/domain/playermapping/model.go: PlayerMapping struct (GameID, UserID, RaceName, EnginePlayerUUID).
- gamemaster/internal/domain/schedule/nexttick.go: wraps cronutil.Schedule; carries skip_next_tick semantics on Next(after time.Time, skip bool) (next time.Time, skipConsumed bool).
- gamemaster/internal/ports/:
  - runtimerecordstore.go: Get, Insert, UpdateStatus (CAS by expected status), UpdateScheduling, ListDueRunning, ListByStatus.
  - engineversionstore.go: Get, List (with status filter), Insert, Update, Deprecate, IsReferencedByActiveRuntime.
  - playermappingstore.go: BulkInsert, Get(gameID, userID), ListByGame(gameID), DeleteByGame(gameID).
  - operationlog.go: Append, ListByGame.
  - streamoffsetstore.go: Load, Save (Redis offset persistence per consumer label).
  - engineclient.go: narrow surface GM uses: Init, Status, Turn, BanishRace, ExecuteCommands, PutOrders, GetReport.
  - lobbyclient.go: GetMemberships(ctx, gameID) ([]Membership, error).
  - rtmclient.go: Stop(ctx, gameID, reason) error, Patch(ctx, gameID, imageRef) error, Restart (reserved; not in v1 feature scope).
  - notificationpublisher.go: Publish(ctx, intent) error.
  - lobbyeventspublisher.go: PublishSnapshotUpdate, PublishGameFinished.
- //go:generate mockgen directive next to each interface declaration.
Files new:
- as above.
Exit criteria:
- the package compiles.
- every interface has a _ ports.X = (*Y)(nil) assertion slot ready for the adapters that follow.
- go test ./gamemaster/internal/domain/... passes.
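A transitions table with a validator, as planned for transitions.go, might be sketched as follows. The status names come from this plan, but the concrete set of allowed edges below is an assumption inferred from the later stages (register-runtime, turn generation, admin stop, health events); the authoritative table lives in ./README.md. ValidateTransition and ErrInvalidTransition are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is the runtime status enum named in this stage.
type Status string

const (
	StatusStarting             Status = "starting"
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusStopped              Status = "stopped"
	StatusEngineUnreachable    Status = "engine_unreachable"
	StatusFinished             Status = "finished"
)

// ErrInvalidTransition is a hypothetical error sentinel.
var ErrInvalidTransition = errors.New("invalid status transition")

// allowed is an illustrative subset of edges, not the README's full table.
var allowed = map[Status][]Status{
	StatusStarting:             {StatusRunning, StatusStopped},
	StatusRunning:              {StatusGenerationInProgress, StatusStopped, StatusEngineUnreachable},
	StatusGenerationInProgress: {StatusRunning, StatusFinished, StatusGenerationFailed},
}

// ValidateTransition reports whether from -> to is in the table. Callers pass
// the same `from` value as the expected status of the store-level CAS.
func ValidateTransition(from, to Status) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("%w: %s -> %s", ErrInvalidTransition, from, to)
}

func main() {
	fmt.Println(ValidateTransition(StatusStarting, StatusRunning)) // <nil>
	fmt.Println(ValidateTransition(StatusFinished, StatusRunning))
}
```

Keeping the table as data makes the unit test a plain loop over (from, to, wantErr) rows.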
Stage 11. Persistence adapters
Goal:
- implement the four PostgreSQL stores and the Redis offset store.
Tasks:
- gamemaster/internal/adapters/postgres/runtimerecordstore/store.go using jet. CAS semantics on UpdateStatus (expected status comparison inside the SQL UPDATE ... WHERE game_id = $1 AND status = $2 pattern). UpdateScheduling mutates next_generation_at and skip_next_tick together.
- gamemaster/internal/adapters/postgres/engineversionstore/store.go. IsReferencedByActiveRuntime joins against runtime_records WHERE status NOT IN ('finished','stopped').
- gamemaster/internal/adapters/postgres/playermappingstore/store.go. BulkInsert is a single INSERT ... ON CONFLICT DO NOTHING.
- gamemaster/internal/adapters/postgres/operationlog/store.go.
- gamemaster/internal/adapters/redisstate/streamoffsets/store.go (mirror Lobby's and RTM's redisstate/streamoffsets).
- For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on runtime_records.UpdateStatus are verified by an explicit concurrent-update test (only one of two callers wins). The semver-patch comparison in engineversion is verified against a curated table of cases.
Files new:
- as above and per-package _test.go.
Exit criteria:
- store tests pass on a CI runner with Docker available.
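The concurrent-update check for UpdateStatus can be illustrated without a database. The sketch below swaps PostgreSQL for a hypothetical in-memory store (memStore) whose UpdateStatus has the same compare-and-set contract as the UPDATE ... WHERE game_id = $1 AND status = $2 pattern; the real test would run against testcontainers PostgreSQL and inspect the affected-row count instead.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// memStore is a hypothetical in-memory stand-in for runtime_records.
type memStore struct {
	mu     sync.Mutex
	status map[string]string
}

// UpdateStatus applies the change only when the stored status equals expected,
// which is what a single affected row means for the SQL CAS.
func (s *memStore) UpdateStatus(gameID, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != expected {
		return false
	}
	s.status[gameID] = next
	return true
}

// raceUpdates starts n callers with the same expected status and counts winners.
func raceUpdates(n int) int {
	store := &memStore{status: map[string]string{"game-1": "running"}}
	var winners int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if store.UpdateStatus("game-1", "running", "generation_in_progress") {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&winners))
}

func main() {
	fmt.Println(raceUpdates(2)) // 1: only one of two callers wins
}
```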
Stage 12. External clients (engine, lobby, RTM, notification, lobby-events)
Goal:
- ship the HTTP and Redis adapters that GM uses to talk to the engine, Lobby internal API, RTM internal API, the notification stream, and the lobby-events stream.
Tasks:
- gamemaster/internal/adapters/engineclient/client.go: REST client over an otelhttp-wrapped http.Client. Implements ports.EngineClient by calling the renamed admin endpoints (/api/v1/admin/init, /admin/status, /admin/turn, /admin/race/banish) and the player endpoints (/api/v1/command, /api/v1/order, /api/v1/report). Builds and consumes the existing JSON shapes from game/openapi.yaml.
- gamemaster/internal/adapters/lobbyclient/client.go: REST client for GET /api/v1/internal/games/{game_id}/memberships. Returns a typed Membership slice.
- gamemaster/internal/adapters/rtmclient/client.go: REST client for POST /api/v1/internal/runtimes/{game_id}/stop and /patch.
- gamemaster/internal/adapters/notificationpublisher/publisher.go: thin XADD wrapper over notification:intents using galaxy/notificationintent constructors.
- gamemaster/internal/adapters/lobbyeventspublisher/publisher.go: XADD wrapper for gm:lobby_events. Two methods: PublishSnapshotUpdate(ctx, msg) and PublishGameFinished(ctx, msg). Schema enforced inline against runtime-events-asyncapi.yaml.
- gamemaster/internal/adapters/mocks/: mockgen-generated mocks for every ports.* interface. Regenerated by make -C gamemaster mocks.
- Per-adapter unit tests with mocks for the clients (httptest server for REST adapters; miniredis for the publishers).
Files new:
- as above.
Exit criteria:
- mocks regenerate cleanly via go generate.
- unit tests pass.
- go test ./gamemaster/internal/adapters/... passes.
Stage 13. Service: register-runtime
Goal:
- end-to-end register-runtime operation: validate, persist initial record, call engine /admin/init, persist player mappings, mark running, schedule first turn.
Tasks:
- gamemaster/internal/service/registerruntime/service.go orchestrator, following the flow from ./README.md §Lifecycles → Register-runtime:
  - validate envelope;
  - reject if runtime_records.{game_id} exists;
  - resolve image_ref for target_engine_version from engine_versions;
  - persist runtime_records.status=starting;
  - call engine /admin/init;
  - persist player_mappings rows from the engine response;
  - CAS status: starting → running, persist current_turn=0 and initial next_generation_at;
  - append operation_log;
  - publish runtime_snapshot_update;
  - return persisted runtime record.
- Failure paths: roll back runtime_records on engine failure; ensure no orphan player_mappings rows; record failure in operation_log.
- Unit tests cover happy path, idempotent re-registration (returns conflict), engine 4xx (engine_validation_error), engine 5xx (engine_unreachable), missing engine version (engine_version_not_found), partial-rollback paths.
Files new:
- gamemaster/internal/service/registerruntime/{service.go, service_test.go, errors.go}.
Exit criteria:
- service-level tests pass.
Stage 14. Service: engine version registry CRUD + image-ref resolve
Goal:
- the registry surface used by Lobby's start flow and by Admin Service.
Tasks:
- gamemaster/internal/service/engineversion/service.go:
  - List(ctx, statusFilter): list versions optionally filtered by status;
  - Get(ctx, version): read one;
  - Create(ctx, version, imageRef, options): validate semver, validate Docker reference shape, persist;
  - Update(ctx, version, patch): partial update (image_ref, options, status);
  - Deprecate(ctx, version): set status=deprecated;
  - Delete(ctx, version): hard delete; rejected with engine_version_in_use if IsReferencedByActiveRuntime returns true;
  - ResolveImageRef(ctx, version): read image_ref only; this is the hot path used by Lobby.
- Unit tests cover create-validate, delete-when-active rejection, and semver shape validation. Resolve is tested against a seeded table of versions.
Files new:
- gamemaster/internal/service/engineversion/{service.go, service_test.go, errors.go}.
Exit criteria:
- service-level tests pass.
Stage 15. Service: scheduler + turn generation + snapshot publisher
Goal:
- the heart of GM: the periodic scheduler and the turn-generation flow, with snapshot publication and finish detection.
Tasks:
- gamemaster/internal/service/turngeneration/service.go:
  - input: gameID, trigger ∈ {scheduler, force};
  - CAS status: running → generation_in_progress;
  - call engine /admin/turn;
  - on success: persist current_turn, evaluate finished, branch:
    - finished: CAS status → finished, persist finished_at, PublishGameFinished, publish game.finished notification, return;
    - not finished: CAS status → running, recompute next_generation_at (skip a tick if skip_next_tick=true, then clear), PublishSnapshotUpdate, publish game.turn.ready notification, return;
  - on failure: CAS status → generation_failed, publish runtime_snapshot_update reflecting the new status, publish game.generation_failed admin notification, return.
- gamemaster/internal/service/scheduler/service.go:
  - thin wrapper that builds the next-tick value from domain/schedule.NextTick given turn_schedule and skip_next_tick;
  - reused by both the ticker worker (Stage 19 wires it) and by the force-next-turn admin op (Stage 17).
- gamemaster/internal/worker/schedulerticker/worker.go:
  - 1-second loop;
  - calls runtime_records.ListDueRunning(now) and runs turngeneration.Run(ctx, gameID, scheduler) per game;
  - serialises per-game_id calls (one in-flight per game; concurrent games proceed in parallel).
- Unit tests cover happy path, finish detection, force trigger with skip consumption, generation failure, CAS contention with a concurrent external status change (e.g., admin stop).
- Player turn stats are derived from StateResponse.player[] and projected to {user_id, planets, population} via playermappingstore.ListByGame.
Files new:
- gamemaster/internal/service/turngeneration/{service.go, service_test.go, errors.go}, gamemaster/internal/service/scheduler/{service.go, service_test.go}, gamemaster/internal/worker/schedulerticker/{worker.go, worker_test.go}.
Exit criteria:
- service-level tests pass.
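The skip_next_tick rule used when recomputing next_generation_at can be sketched in isolation. The Schedule interface below is the subset of cronutil.Schedule that this plan names; everyHour is a hypothetical stand-in for a parsed hourly expression so the example runs on the standard library alone, and the exact NextTick signature is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is the subset of cronutil.Schedule this sketch relies on.
type Schedule interface {
	Next(after time.Time) time.Time
}

// everyHour is a hypothetical stand-in for a parsed "0 * * * *" expression.
type everyHour struct{}

func (everyHour) Next(after time.Time) time.Time {
	return after.Truncate(time.Hour).Add(time.Hour)
}

// NextTick returns the next generation time. When skip is true the first
// upcoming tick is dropped and the one after it is returned; skipConsumed
// tells the caller to clear skip_next_tick in the same update.
func NextTick(s Schedule, after time.Time, skip bool) (next time.Time, skipConsumed bool) {
	next = s.Next(after)
	if skip {
		return s.Next(next), true
	}
	return next, false
}

// demo evaluates NextTick at a fixed instant, 2024-01-01T10:30:00Z.
func demo(skip bool) string {
	after := time.Date(2024, time.January, 1, 10, 30, 0, 0, time.UTC)
	next, _ := NextTick(everyHour{}, after, skip)
	return next.Format(time.RFC3339)
}

func main() {
	fmt.Println(demo(false)) // 2024-01-01T11:00:00Z
	fmt.Println(demo(true))  // 2024-01-01T12:00:00Z
}
```

Returning skipConsumed alongside the time lets UpdateScheduling write next_generation_at and skip_next_tick together, as Stage 11 requires.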
Stage 16. Service: hot-path command + order + report + membership cache
Goal:
- the gateway-facing trio: command execution, order submission, report reading. Membership cache and the invalidation hook.
Tasks:
- gamemaster/internal/service/membership/cache.go:
  - in-process map[gameID]entry{members map[userID]MembershipStatus, loadedAt};
  - Resolve(ctx, gameID, userID) (status, error): checks cache, falls back to lobbyclient.GetMemberships on miss or TTL expiry;
  - Invalidate(gameID): purges the cache entry;
  - LRU eviction governed by GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES.
- gamemaster/internal/service/commandexecute/service.go:
  - input: gameID, userID, payload {commands:[…]};
  - validate runtime_records.{game_id} exists with status=running;
  - resolve membership; reject if not active;
  - resolve race_name from playermappingstore;
  - call engine /api/v1/command with CommandRequest{actor=race_name, cmd=…};
  - return engine response verbatim.
- gamemaster/internal/service/orderput/service.go: identical structure, calls /api/v1/order.
- gamemaster/internal/service/reportget/service.go: input {gameID, userID, turn}; resolves race_name; calls /api/v1/report?player=…&turn=…; returns body verbatim.
- Unit tests: each service covers happy path, runtime-not-running, forbidden, engine 4xx, engine 5xx; membership cache tests cover hit, miss, TTL expiry, invalidate.
Files new:
- gamemaster/internal/service/membership/{cache.go, cache_test.go}, gamemaster/internal/service/commandexecute/{service.go, service_test.go}, gamemaster/internal/service/orderput/{service.go, service_test.go}, gamemaster/internal/service/reportget/{service.go, service_test.go}.
Exit criteria:
- service-level tests pass.
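A minimal sketch of the membership cache, covering hit, miss, TTL expiry, and invalidation, is shown below. It is simplified on purpose: the context parameter and LRU eviction are omitted, the Loader type stands in for lobbyclient.GetMemberships already reduced to a userID → status map, and all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type MembershipStatus string

// Loader stands in for lobbyclient.GetMemberships (assumed shape).
type Loader func(gameID string) (map[string]MembershipStatus, error)

type entry struct {
	members  map[string]MembershipStatus
	loadedAt time.Time
}

type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock for TTL tests
	load    Loader
	entries map[string]entry
}

func NewCache(ttl time.Duration, load Loader) *Cache {
	return &Cache{ttl: ttl, now: time.Now, load: load, entries: map[string]entry{}}
}

// Resolve serves from the cache and reloads on a miss or after the TTL.
// Holding the lock across the load keeps the sketch short; a real cache
// would likely avoid blocking other games during a Lobby call.
func (c *Cache) Resolve(gameID, userID string) (MembershipStatus, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[gameID]
	if !ok || c.now().Sub(e.loadedAt) >= c.ttl {
		members, err := c.load(gameID)
		if err != nil {
			return "", err
		}
		e = entry{members: members, loadedAt: c.now()}
		c.entries[gameID] = e
	}
	return e.members[userID], nil
}

// Invalidate purges the entry so the next Resolve reloads from Lobby.
func (c *Cache) Invalidate(gameID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, gameID)
}

// demoLoads counts loader calls across miss, hit, invalidate, miss.
func demoLoads() int {
	loads := 0
	c := NewCache(time.Minute, func(string) (map[string]MembershipStatus, error) {
		loads++
		return map[string]MembershipStatus{"user-1": "active"}, nil
	})
	c.Resolve("game-1", "user-1") // miss: loads
	c.Resolve("game-1", "user-1") // hit: served from cache
	c.Invalidate("game-1")
	c.Resolve("game-1", "user-1") // miss again after invalidation
	return loads
}

func main() {
	fmt.Println(demoLoads()) // 2
}
```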
Stage 17. Service: admin operations (stop, force-next-turn, patch, banish, liveness)
Goal:
- the remaining service-layer operations: admin/runtime control plus the Lobby-facing liveness reply.
Tasks:
- gamemaster/internal/service/adminstop/service.go:
  - input {gameID, reason};
  - call rtmclient.Stop(ctx, gameID, reason);
  - on success: CAS runtime_records.status: * → stopped; append operation_log; publish runtime_snapshot_update.
- gamemaster/internal/service/adminforce/service.go:
  - run turngeneration.Run(ctx, gameID, force) synchronously;
  - on success, set runtime_records.skip_next_tick = true (the next scheduler-driven Next consumes it).
- gamemaster/internal/service/adminpatch/service.go:
  - input {gameID, version};
  - resolve new image_ref via engineversion.ResolveImageRef;
  - validate semver-patch against current runtime_records.current_engine_version; reject with semver_patch_only otherwise;
  - call rtmclient.Patch(ctx, gameID, imageRef);
  - on success: persist new current_image_ref and current_engine_version; append operation_log.
- gamemaster/internal/service/adminbanish/service.go:
  - input {gameID, raceName};
  - validate playermappingstore.GetByRace(gameID, raceName) exists;
  - call engine /admin/race/banish;
  - append operation_log.
- gamemaster/internal/service/livenessreply/service.go:
  - lookup runtime_records.{game_id};
  - return {ready: status==running, status: <observed>}.
- Unit tests for each service cover happy path and each documented error code.
Files new:
- gamemaster/internal/service/adminstop/..., gamemaster/internal/service/adminforce/..., gamemaster/internal/service/adminpatch/..., gamemaster/internal/service/adminbanish/..., gamemaster/internal/service/livenessreply/....
Exit criteria:
- service-level tests pass.
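The semver-patch check in the patch operation might reduce to a comparison like the one below. This is a sketch under stated assumptions: versions are plain MAJOR.MINOR.PATCH with an optional leading "v", pre-release and build metadata are ignored, and any different patch number (higher or lower) counts as patch-only. Whether downgrades are acceptable is not stated in this plan. Version, Parse, IsPatchOnly, and patchOnly are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type Version struct{ Major, Minor, Patch int }

// Parse accepts MAJOR.MINOR.PATCH with an optional leading "v".
// It is looser than the full semver grammar (no pre-release, no build metadata).
func Parse(s string) (Version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("invalid semver %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid semver %q", s)
		}
		nums[i] = n
	}
	return Version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// IsPatchOnly reports whether target differs from current only in the patch component.
func IsPatchOnly(current, target Version) bool {
	return current.Major == target.Major &&
		current.Minor == target.Minor &&
		current.Patch != target.Patch
}

// patchOnly is a string-level convenience; a parse failure counts as "not patch-only".
func patchOnly(current, target string) bool {
	c, err := Parse(current)
	if err != nil {
		return false
	}
	t, err := Parse(target)
	if err != nil {
		return false
	}
	return IsPatchOnly(c, t)
}

func main() {
	fmt.Println(patchOnly("1.4.2", "1.4.7")) // true
	fmt.Println(patchOnly("1.4.2", "1.5.0")) // false: would map to semver_patch_only
}
```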
Stage 18. Async consumer: runtime:health_events
Goal:
- bring runtime health into GM's view per game and propagate to Lobby via the snapshot stream.
Tasks:
- gamemaster/internal/worker/healtheventsconsumer/worker.go:
  - XREADs runtime:health_events with a persisted offset (via streamoffsetstore);
  - decodes the AsyncAPI envelope from RTM;
  - updates runtime_records.engine_health per game_id;
  - emits a debounced runtime_snapshot_update only when the summary string changes.
- The summary derivation rule:
  - healthy ⇒ summary healthy;
  - probe_failed after threshold ⇒ summary probe_failed;
  - inspect_unhealthy ⇒ summary inspect_unhealthy;
  - container_exited ⇒ summary exited and CAS status → engine_unreachable;
  - container_oom ⇒ summary oom and CAS status → engine_unreachable;
  - container_disappeared ⇒ summary disappeared and CAS status → engine_unreachable.
- Unit tests use miniredis and the AsyncAPI fixture from rtmanager/api/runtime-health-asyncapi.yaml.
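The summary derivation rule and the emit-on-change behaviour can be sketched as follows. The function and type names are hypothetical, the probe_failed threshold is assumed to be applied by the caller before this mapping runs, and only change detection is modelled (no time-based debounce window).

```go
package main

import "fmt"

// deriveSummary maps an RTM health event type to the summary string and
// reports whether status should be CAS-ed to engine_unreachable.
// known is false for event types the plan does not list.
func deriveSummary(eventType string) (summary string, unreachable bool, known bool) {
	switch eventType {
	case "healthy":
		return "healthy", false, true
	case "probe_failed": // assumption: the caller has already applied the threshold
		return "probe_failed", false, true
	case "inspect_unhealthy":
		return "inspect_unhealthy", false, true
	case "container_exited":
		return "exited", true, true
	case "container_oom":
		return "oom", true, true
	case "container_disappeared":
		return "disappeared", true, true
	default:
		return "", false, false
	}
}

// summaryTracker remembers the last summary per game so that a snapshot
// update is emitted only when the summary string changes.
type summaryTracker struct {
	last map[string]string
}

// changed records the summary and reports whether it differs from the
// previously recorded one for the same game.
func (t *summaryTracker) changed(gameID, summary string) bool {
	if prev, ok := t.last[gameID]; ok && prev == summary {
		return false
	}
	t.last[gameID] = summary
	return true
}

func main() {
	t := &summaryTracker{last: map[string]string{}}
	for _, ev := range []string{"healthy", "healthy", "container_oom"} {
		summary, unreachable, _ := deriveSummary(ev)
		emit := t.changed("game-1", summary)
		fmt.Printf("%s: summary=%s unreachable=%v emit=%v\n", ev, summary, unreachable, emit)
	}
}
```

Keeping the mapping as a pure function makes the table-driven unit tests straightforward; the tracker is the only stateful piece.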
Files new:
- gamemaster/internal/worker/healtheventsconsumer/{worker.go, worker_test.go}.
Exit criteria:
- worker tests pass.
Stage 19. Internal REST handlers
Goal:
- ship the gateway-, Lobby-, and Admin-facing REST surface backed by the service layer.
Tasks:
- gamemaster/internal/api/internalhttp/handlers/{registerruntime, getruntime, listruntimes, forcenextturn, stopruntime, patchruntime, banishrace, invalidatememberships, gameliveness, listengineversions, createengineversion, getengineversion, updateengineversion, deprecateengineversion, resolveengineversionimageref, executecommands, putorders, getreport}.go: one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields are rejected with invalid_request.
- Error envelope identical to lobby and rtmanager.
- Wiring under the existing internal HTTP listener; route registration in gamemaster/internal/app/wiring.go.
- Handler-level table-driven tests.
- OpenAPI conformance test that loads api/internal-openapi.yaml and asserts every defined operation is reachable and matches its declared response shape.
Files new:
- handlers + tests + the conformance test gamemaster/api/openapi_conformance_test.go.
Exit criteria:
- OpenAPI conformance test passes for every endpoint.
- Handlers reject unknown JSON fields.
Stage 20. Lobby refactor
Goal:
- complete the Lobby side of the new image-resolve and membership invalidation contract.
Tasks:
- Replace lobby/internal/domain/engineimage/resolver.go with a thin GM-client wrapper. The package goes away; the call site in lobby/internal/service/startgame/service.go switches from engineimage.Resolver{}.Resolve(version) to gmClient.ResolveImageRef(ctx, version).
- Drop LOBBY_ENGINE_IMAGE_TEMPLATE from lobby/internal/config/{config.go, env.go, validation.go}. Remove the validation function and the related env-var test cases.
- Add InvalidateMemberships(ctx, gameID) error to lobby/internal/ports/gmclient.go. Regenerate the mockgen mock and update the inmem fake to record invocations.
- Wire the new call from:
  - lobby/internal/service/approveapplication/service.go: post-commit;
  - lobby/internal/service/rejectapplication/service.go: post-commit (only if a reservation existed prior);
  - lobby/internal/service/redeeminvite/service.go: post-commit;
  - lobby/internal/service/removemember/service.go: post-commit (already in scope of removal);
  - lobby/internal/service/blockmember/service.go: post-commit;
  - lobby/internal/worker/userlifecycle/consumer.go: post-commit per game in the cascade.
- Failed invalidation is logged at warn and counted in a metric styled after the existing lobby.notification.publish_attempts (or in a new lobby.gm_invalidation.publish_attempts), but does not roll back the business commit. TTL on GM is the safety net.
- Update Lobby unit tests, in particular the start-flow tests (replace the engineimage mock with a gmclient.ResolveImageRef mock) and the membership-mutation tests (assert InvalidateMemberships was called post-commit).
- Update lobby/api/internal-openapi.yaml only if any new field surfaces (none expected; the call shape is on Lobby's outbound side, not on its REST surface).
Files touched:
- lobby/internal/service/{startgame, approveapplication, rejectapplication, redeeminvite, removemember, blockmember}/, lobby/internal/worker/userlifecycle/, lobby/internal/config/{config.go, env.go, validation.go}, lobby/internal/ports/gmclient.go, lobby/internal/adapters/gmclient/client.go, lobby/internal/adapters/mocks/gmclient/..., lobby/internal/adapters/gmclientinmem/... (if the inmem fake exists; otherwise the mockgen mock plus the migration described in RTM stage 22 is enough).
Files removed:
- lobby/internal/domain/engineimage/ (entire package).
Exit criteria:
- go test ./lobby/... passes.
- LOBBY_ENGINE_IMAGE_TEMPLATE no longer appears in any Lobby source or documentation.
- Lobby's start-flow integration test still passes against a stub gmclient that returns image_ref synchronously.
Stage 21. Service-local integration suite
Goal:
- end-to-end suite running against testcontainers PostgreSQL + Redis + the real galaxy/game engine container.
Tasks:
- gamemaster/integration/harness/: set up PostgreSQL with goose-applied migrations; Redis (testcontainers Redis for coordination suites that exercise streams); ensure the Docker bridge network exists; build the galaxy/game test image once per package run with sync.Once; tear everything down via t.Cleanup. Reuse the RTM-built image where possible (skip rebuilding when present).
- gamemaster/integration/registerruntime_test.go: register-runtime happy path. GM persists the runtime record, calls engine /admin/init, persists player_mappings, transitions to running, publishes a runtime_snapshot_update. Engine answers with a real StateResponse.
- gamemaster/integration/scheduler_test.go: schedules a five-second turn cron, observes one tick, asserts engine /admin/turn was hit and current_turn advanced. Force-next-turn test asserts that the next regular tick consumes skip_next_tick and is skipped.
- gamemaster/integration/hotpath_test.go: full command, order, and report round-trips against the real engine. Membership invalidation hook test asserts the cache flushes on demand.
- gamemaster/integration/adminops_test.go: admin stop calls a stub RTM and asserts the runtime record transitions to stopped. Admin patch with a non-patch semver target fails with semver_patch_only. Admin banish hits the engine endpoint.
- gamemaster/integration/healthevents_test.go: publishes a fake runtime:health_events entry, asserts the consumer updates engine_health and emits a debounced snapshot.
- gamemaster/integration/notification_test.go: observe notification:intents after a successful turn (game.turn.ready), after a finish (game.finished), and after a forced engine failure (game.generation_failed admin email).
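The "build once per package run" step in the harness can be sketched with sync.Once. Everything except sync.Once itself is a placeholder: buildEngineImage stands in for the real Docker build, and the returned image reference is an invented example value.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	buildOnce  sync.Once
	imageRef   string
	buildErr   error
	buildCount int // counts real builds, for demonstration only
)

// buildEngineImage is a hypothetical stand-in for the expensive image build.
func buildEngineImage() (string, error) {
	buildCount++
	return "galaxy/game:test", nil // placeholder reference, not from the plan
}

// engineImage returns the test image reference, building it at most once per
// process. Every caller observes the same result, including a build error.
func engineImage() (string, error) {
	buildOnce.Do(func() {
		imageRef, buildErr = buildEngineImage()
	})
	return imageRef, buildErr
}

func main() {
	for i := 0; i < 3; i++ {
		ref, err := engineImage()
		fmt.Println(ref, err)
	}
	fmt.Println("builds:", buildCount) // builds: 1
}
```

Because Go runs each package's tests in one process, a package-level sync.Once gives exactly the "once per package run" scope; a failed build is also cached, so later tests fail fast instead of retrying.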
Files new:
- as above.
Exit criteria:
- go test ./gamemaster/integration/... passes locally with Docker available.
- CI runs the suite under a profile that exposes the Docker socket.
Stage 22. Inter-service test: Lobby ↔ GM
Goal:
- exercise the new image-ref resolve, register-runtime, and membership invalidation paths end-to-end without RTM in the loop.
Tasks:
- integration/lobbygm/ (top-level integration directory, mirroring existing integration/lobbyrtm): runs real Lobby, real GM, real PostgreSQL, real Redis, a stub RTM that simply returns success on runtime:start_jobs, and the real galaxy/game test engine container.
- Scenarios:
  - Lobby creates a game, resolves image_ref from GM, publishes a start_job, the stub RTM acks success, Lobby calls register-runtime on GM, GM calls /admin/init on the engine, GM transitions to running, GM publishes runtime_snapshot_update, Lobby updates its denormalised view.
  - One full turn generation cycle: scheduler ticks, GM calls engine /admin/turn, GM publishes runtime_snapshot_update, Lobby's per-game stats aggregate updates.
  - Membership change: an admin removes a member; Lobby's removemember post-commit calls GM invalidate-memberships; the next player command from that user fails with forbidden.
  - Game finish: engine returns finished:true; GM publishes game_finished; Lobby transitions the platform game record to finished and runs the capability evaluator.
Files new:
- as above.
Exit criteria:
- all scenarios pass in CI when the Docker socket is available.
Stage 23. Inter-service test: Lobby ↔ GM ↔ RTM (full happy path)
Goal:
- the canonical end-to-end test covering the whole running-game pipeline.
Tasks:
- integration/lobbygmrtm/: runs real Lobby, real GM, real RTM, real PostgreSQL, real Redis, and the real galaxy/game test engine container.
- Scenarios:
  - Happy path: enrollment → start → RTM container → GM register-runtime → engine /admin/init → first player command → first scheduled turn → engine finished:true → GM game_finished → Lobby transitions to finished → RTM cleanup TTL.
  - Failure path A: RTM reports start_config_invalid on runtime:job_results; Lobby transitions the game to start_failed; no GM register-runtime is attempted.
  - Failure path B: container starts but GM is unavailable when Lobby calls register-runtime; Lobby transitions the game to paused and publishes lobby.runtime_paused_after_start; once GM comes back, Lobby's resume flow calls GM /liveness, receives ready=true, re-issues register-runtime, and the game reaches running.
Files new:
- as above.
Exit criteria:
- all scenarios pass in CI when the Docker socket is available.
Stage 24. Service-local docs
Goal:
- record the per-stage decisions captured during this plan in discoverable service-local documentation, mirroring lobby/docs/ and rtmanager/docs/.
Tasks:
- gamemaster/docs/README.md: index pointing at the five content docs and the postgres-migration record.
- gamemaster/docs/runtime.md: components, processes, in-memory state of each worker.
- gamemaster/docs/flows.md: Mermaid diagrams for register-runtime, turn generation, force-next-turn skip, hot-path command, admin patch, finish, health consumption, banish.
- gamemaster/docs/runbook.md: operator scenarios: «engine became unreachable», «turn generation failed and stuck», «patch upgrade», «manual force-next-turn», «engine version registry rotation», «membership cache appears stale».
- gamemaster/docs/examples.md: env-var examples per environment (dev / test / prod skeletons), example payloads for each stream and each REST endpoint.
- gamemaster/docs/postgres-migration.md: decision record for the schema (mirrors notification/docs/postgres-migration.md style).
- Add per-stage decision records under gamemaster/docs/stage<NN>-*.md for any stage that produced a noteworthy decision (mirroring the RTM pattern). At minimum: stage11-persistence-adapters.md, stage12-external-clients.md, stage15-scheduler-and-turn-generation.md, stage16-membership-cache-and-invalidation.md, stage17-admin-operations.md, stage18-health-events-consumer.md, stage20-lobby-refactor.md.
Files new:
- all of the above.
Exit criteria:
- the README of GM links to docs/README.md.
- a reviewer can find any operational how-to within two clicks.
Final Acceptance Criteria
- go build ./... from the repository root succeeds.
- go test ./... from the repository root passes.
- go test -tags=integration ./gamemaster/integration/... passes when Docker is available.
- go test ./integration/lobbygm/... and go test ./integration/lobbygmrtm/... pass when Docker is available.
- make -C gamemaster jet regenerates jet code with no diff after a clean run.
- make -C gamemaster mocks regenerates mock code with no diff after a clean run.
- Manual smoke: bring Lobby + GM + RTM + the rest of the stack up via the existing dev compose; create a game; observe a real galaxy-game-{game_id} container; play one turn round-trip; observe a runtime_snapshot_update on gm:lobby_events; force-next-turn; observe the next scheduled tick is skipped; stop the game; the container moves to exited.
- Documentation across ARCHITECTURE.md, gamemaster/, lobby/, notification/, game/, and rtmanager/ is internally consistent.
Out of Scope
- Multi-instance GM with leader election (Game Master runs as a single process in v1).
- Engine state file management (backup, archival, host-side cleanup).
- Direct gateway routing of admin message_type values (admin operations land via Admin Service in a later iteration; v1 exposes only the GM internal REST surface).
- TLS / mTLS on the internal listener.
- Engine-version automatic patch upgrades (manual admin operation only).
- A pause/resume flow on GM's side beyond the liveness-check reply.
Risks and Notes
- The membership invalidation hook from Lobby into GM is a deliberate tight coupling. TTL stays as the safety net for any failed invalidation; the explicit hook only shortens the staleness window. Failure to invalidate is logged but never rolls back Lobby state. This trade-off is recorded in ./README.md §Hot Path.
- Lobby refactor (Stage 20) gates on GM stages 14 (engine version registry resolve endpoint) and 19 (handlers wired). Once Lobby switches to GM for image-ref resolution, Lobby cannot start a game when GM is unavailable; this is documented as the new failure mode in lobby/README.md (Stage 03).
- Engine path rename (Stage 05) is internal to galaxy/game. No other service today calls /api/v1/init, /api/v1/status, or /api/v1/turn (RTM probes only /healthz); the rename is therefore a contained change inside the engine module. The user owns the conditional logic that fills StateResponse.finished and the body-level mechanics of banish.
- GM single-instance is a single point of failure for turn generation in v1. The trade-off is acceptable for the prototype and is documented in gamemaster/README.md §Non-Goals.
- Pre-launch single-init policy applies to GM exactly as documented in ARCHITECTURE.md §Persistence Backends: schema evolves by editing 00001_init.sql until first production deploy.