galaxy-game/gamemaster/PLAN.md
2026-05-03 07:59:03 +02:00

Game Master Implementation Plan

This plan delivers Game Master (GM), the platform service that owns runtime and operational state of running Galaxy games, mediates every call to the engine container, runs the turn scheduler, and owns the engine version registry.

The plan also delivers the upstream changes that GM depends on: the extracted pkg/cronutil module, the engine admin-path rename plus the finished:bool field and the new /admin/race/banish endpoint on galaxy/game, the Lobby refactor that drops LOBBY_ENGINE_IMAGE_TEMPLATE in favour of synchronous image-ref resolution against GM, and the membership invalidation hook from Lobby into GM.

The architectural rules behind every decision are recorded in ./README.md. This file describes the order in which the implementation lands.

Global Rules

  • Documentation always lands before contracts; contracts before code.
  • Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
  • Existing-service refactors (Lobby image-ref resolver, Lobby membership invalidation hook, game engine path rename plus finished field plus banish endpoint, pkg/cronutil extraction) are full-fledged stages of this plan; they precede every GM stage that depends on them.
  • GM never opens the Docker SDK. Every container operation goes through Runtime Manager over trusted internal REST.
  • GM never trusts an actor field provided in a payload from Edge Gateway; it always derives actor=race_name from its own (user_id → race_name) mapping.
  • Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 06 onward.
  • All code, docs, and identifiers are written in English.
  • Engine domain logic (when finished=true is set, what banish mutates inside the game) is user-owned and explicitly out of scope; this plan ships only the contract, router plumbing, and stub handlers for those pieces.
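
The actor rule above (GM derives actor=race_name from its own mapping and ignores whatever the payload claims) can be illustrated with a minimal sketch. The names CommandRequest, raceByUser, deriveActor, and ErrNoMapping are hypothetical and stand in for the real gateway payload and the player_mappings lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// CommandRequest is a hypothetical gateway payload. Actor is supplied by the
// caller and is therefore never trusted.
type CommandRequest struct {
	Actor    string
	Commands []string
}

// ErrNoMapping is returned when the user has no race in the game.
var ErrNoMapping = errors.New("no player mapping for user")

// raceByUser stands in for the (user_id -> race_name) mapping of one game.
type raceByUser map[string]string

// deriveActor deliberately ignores req.Actor and returns the race name that
// the mapping assigns to userID.
func deriveActor(m raceByUser, userID string, req CommandRequest) (string, error) {
	race, ok := m[userID]
	if !ok {
		return "", ErrNoMapping
	}
	return race, nil
}

func main() {
	m := raceByUser{"user-1": "Aelinari"}
	req := CommandRequest{Actor: "SomeoneElse", Commands: []string{"noop"}}
	actor, err := deriveActor(m, "user-1", req)
	fmt.Println(actor, err)
}
```

The spoofed Actor value never reaches the engine; only the mapped race name does.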

Suggested Module Structure

gamemaster/
├── cmd/
│   ├── gamemaster/
│   │   └── main.go
│   └── jetgen/
│       └── main.go
│
├── internal/
│   ├── app/
│   │   ├── app.go
│   │   ├── runtime.go
│   │   ├── wiring.go
│   │   └── bootstrap.go
│   │
│   ├── config/
│   │   ├── config.go
│   │   ├── env.go
│   │   └── validation.go
│   │
│   ├── logging/
│   │   ├── logger.go
│   │   └── context.go
│   │
│   ├── telemetry/
│   │   └── runtime.go
│   │
│   ├── domain/
│   │   ├── runtime/
│   │   │   ├── model.go
│   │   │   └── transitions.go
│   │   ├── engineversion/
│   │   │   ├── model.go
│   │   │   └── semver.go
│   │   ├── playermapping/
│   │   │   └── model.go
│   │   └── schedule/
│   │       └── nexttick.go
│   │
│   ├── ports/
│   │   ├── runtimerecordstore.go
│   │   ├── engineversionstore.go
│   │   ├── playermappingstore.go
│   │   ├── operationlog.go
│   │   ├── streamoffsetstore.go
│   │   ├── engineclient.go
│   │   ├── lobbyclient.go
│   │   ├── rtmclient.go
│   │   ├── notificationpublisher.go
│   │   └── lobbyeventspublisher.go
│   │
│   ├── adapters/
│   │   ├── postgres/
│   │   │   ├── migrations/
│   │   │   ├── jet/
│   │   │   ├── runtimerecordstore/
│   │   │   ├── engineversionstore/
│   │   │   ├── playermappingstore/
│   │   │   └── operationlog/
│   │   ├── redisstate/
│   │   │   └── streamoffsets/
│   │   ├── engineclient/
│   │   ├── lobbyclient/
│   │   ├── rtmclient/
│   │   ├── notificationpublisher/
│   │   ├── lobbyeventspublisher/
│   │   └── mocks/
│   │
│   ├── service/
│   │   ├── registerruntime/
│   │   ├── engineversion/
│   │   ├── scheduler/
│   │   ├── turngeneration/
│   │   ├── commandexecute/
│   │   ├── orderput/
│   │   ├── reportget/
│   │   ├── membership/
│   │   ├── adminstop/
│   │   ├── adminforce/
│   │   ├── adminpatch/
│   │   ├── adminbanish/
│   │   └── livenessreply/
│   │
│   ├── worker/
│   │   ├── schedulerticker/
│   │   └── healtheventsconsumer/
│   │
│   └── api/
│       └── internalhttp/
│           ├── server.go
│           └── handlers/
│
├── api/
│   ├── internal-openapi.yaml
│   └── runtime-events-asyncapi.yaml
│
├── integration/
│   ├── harness/
│   ├── registerruntime_test.go
│   ├── scheduler_test.go
│   ├── hotpath_test.go
│   ├── adminops_test.go
│   ├── healthevents_test.go
│   └── notification_test.go
│
├── docs/
│   ├── README.md
│   ├── runtime.md
│   ├── flows.md
│   ├── runbook.md
│   ├── examples.md
│   └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod

Stage 01. Update ARCHITECTURE.md

Goal:

  • align the project-wide source of truth with every decision recorded in ./README.md before any code change touches it.

Tasks:

  • Expand ARCHITECTURE.md §8 (Game Master) with subsections: engine container contract (admin vs player paths, finished:bool semantics, banish endpoint), runtime status enum (starting | running | generation_in_progress | generation_failed | stopped | engine_unreachable | finished), turn cutoff rule (no shadow window; CAS-only), force-next-turn skip rule, snapshot publishing cadence (events only, no heartbeat), single-instance topology.
  • Update §«Versioning of Game Engines»: GM owns the engine version registry from v1; Lobby resolves image_ref synchronously through GM. LOBBY_ENGINE_IMAGE_TEMPLATE is removed. engine_versions table lives in the gamemaster schema.
  • Update §«Fixed synchronous interactions»: add Game Lobby → Game Master for register-runtime, image-ref resolve, membership invalidation hook, banish, and liveness reply. Add Edge Gateway → Game Master for player commands, orders, and reports.
  • Update §«Fixed asynchronous interactions»: add Game Master → Game Lobby runtime snapshot updates and game-finish events through the gm:lobby_events Redis Stream (already mentioned, expanded with cadence rules); add Runtime Manager → Game Master health events consumption (runtime:health_events) — already mentioned, confirmed.
  • Update §«Persistence Backends»: add gamemaster schema to the schema-per-service list and to PG-backed services.
  • Update §«Configuration»: add GAMEMASTER to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
  • Update §«Recommended Order of Service Implementation» entry 8 with the scope finalised in ./README.md.
  • Drop ships_built from every architectural mention of player_turn_stats. Update the capability rule wording to use planets and population only (no behavioural change; ships_built was unused).

Files touched:

  • ARCHITECTURE.md.

Exit criteria:

  • every later GM, Lobby, Notification, or Game stage can quote its rules from ARCHITECTURE.md without re-deciding them.
  • go test ./... is unaffected (this stage changes only Markdown).

Stage 02. Freeze GM README.md

Status: implemented as part of this planning task — see ./README.md.

Goal:

  • publish the complete service description so contracts and code can reference one source.

Exit criteria:

  • a reviewer can answer any «what does GM do when X» question by reading the README alone.

Stage 03. Sync existing-service docs (Lobby, Notification, Game, RTM)

Goal:

  • bring the READMEs of every touched service into agreement with the GM contract before any code in those services changes.

Tasks:

  • lobby/README.md:
    • replace the LOBBY_ENGINE_IMAGE_TEMPLATE configuration entry with a new LOBBY_GM_BASE_URL-backed image-ref resolve via GET /api/v1/internal/engine-versions/{version}/image-ref;
    • document the new outgoing POST /api/v1/internal/games/{id}/memberships/invalidate call from removemember, blockmember, approveapplication, rejectapplication, redeeminvite, and the user-lifecycle cascade worker (post-commit, fail-open);
    • drop ships_built from the player_turn_stats description and from the capability evaluation wording (rule already reduces to planets + population);
    • add a paragraph in §Game Start Flow noting that image_ref is resolved from GM synchronously and that GM unavailability turns lobby.game.start into service_unavailable.
  • lobby/PLAN.md: append a closing note stating that the image-ref template removal and the membership invalidation hook are landed by the Game Master plan; no new stages added in Lobby's own PLAN.
  • notification/README.md: confirm the catalog already lists game.turn.ready, game.finished, game.generation_failed and add a one-line note that GM is the producer.
  • game/README.md:
    • document the new path layout: admin endpoints under /api/v1/admin/* (init, status, turn, race/banish); player endpoints unchanged at /api/v1/{command, order, report};
    • document the finished:bool extension on StateResponse;
    • document the POST /api/v1/admin/race/banish request/response shape (body {race_name}; response 204).
  • rtmanager/README.md: add a closing note that runtime:health_events is now consumed by Game Master in production (was reserved as a future consumer).

Files touched:

  • lobby/README.md, lobby/PLAN.md, notification/README.md, game/README.md, rtmanager/README.md.

Exit criteria:

  • every doc in the repo agrees on the post-GM contract; no contradiction remains between any two READMEs.
  • go test ./... is unaffected.

Stage 04. Extract pkg/cronutil + wire Lobby

Goal:

  • own a single cron parser/calculator across the workspace, used today by Lobby and tomorrow by GM.

Tasks:

  • Create new workspace module pkg/cronutil/ with:
    • cronutil.go: thin wrapper over github.com/robfig/cron/v3.NewParser(cron.Minute | cron.Hour | cron.Dom | cron.Month | cron.Dow); exports Parse(expr string) (Schedule, error) and Schedule.Next(after time.Time) time.Time;
    • cronutil_test.go: parser validation tests covering five-field cron expressions (e.g., 0 18 * * *, */15 * * * *), invalid expressions, DST/timezone behaviour (Schedule operates in UTC; UTC inputs yield UTC outputs);
    • go.mod declaring the module galaxy/cronutil with replace target.
  • Wire from Lobby: replace the inline robfig/cron/v3 usage in lobby/internal/domain/game/model.go:validateCronExpr with calls into pkg/cronutil. The enrollment automation worker does not parse cron today (it uses enrollment_ends_at UTC seconds), so the cron-validation path on game records is the only Lobby caller.
  • Update go.work to include ./pkg/cronutil and add the replace block.
  • Add Lobby unit tests confirming validateCronExpr accepts and rejects the same expressions as before.

Files new:

  • pkg/cronutil/{cronutil.go, cronutil_test.go, go.mod, go.sum}.

Files touched:

  • go.work, go.work.sum, lobby/internal/domain/game/model.go, lobby/go.mod, lobby/go.sum.

Exit criteria:

  • go build ./... succeeds.
  • go test ./pkg/cronutil/... ./lobby/... passes.
  • lobby/internal/domain/game/model_test.go still asserts the same acceptance set on cron expressions.

Stage 05. Game engine contract: admin paths + finished + banish

Goal:

  • ship the contract changes to galaxy/game that GM depends on: admin routes under /api/v1/admin/*, the StateResponse.finished field, and the new /admin/race/banish endpoint.

Tasks:

  • game/openapi.yaml:
    • rename /api/v1/init → /api/v1/admin/init (operation initGame → adminInitGame);
    • rename /api/v1/status → /api/v1/admin/status (operation getGameStatus → adminGetGameStatus);
    • rename /api/v1/turn → /api/v1/admin/turn (operation generateTurn → adminGenerateTurn);
    • add POST /api/v1/admin/race/banish (operation adminBanishRace) with body {race_name} and 204 No Content on success; document the same 400 and 500 error envelopes as the existing endpoints;
    • extend StateResponse schema with finished:bool (required; default false from server perspective documented in description).
  • game/internal/router/router.go (or its router-helper file): rename the route constants and registrations to the new admin paths; add a new route for /admin/race/banish wired to a stub handler returning 204 with empty body.
  • game/internal/router/handler/banish.go: new file with a stub handler that decodes the body, validates race_name is non-empty, and returns 204. Logging only; no game-state mutation. The user fills in domain logic in a separate change.
  • game/internal/model/state.go: add Finished bool field to the Go struct backing StateResponse. Default-zero (false) on serialisation; the user fills in conditional logic.
  • game/internal/router/{init,status,turn}_test.go: update path literals to the new admin form; tests stay green.
  • game/openapi_contract_test.go: assert presence of the new operation IDs (adminInitGame, adminGetGameStatus, adminGenerateTurn, adminBanishRace), the new path components, and the finished field on StateResponse.
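
The banish stub described above (decode the body, require a non-empty race_name, return 204, mutate nothing) might look roughly like the following sketch. It assumes a plain net/http handler; the actual router library in galaxy/game, the handler name, and the error envelope shape are not reproduced here, and banishStatus is a hypothetical helper used only to exercise the handler in-process.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// banishRequest mirrors the {race_name} request body from the plan.
type banishRequest struct {
	RaceName string `json:"race_name"`
}

// banishHandler is a stub: it validates the input and answers 204 without
// touching any game state. Plain-text 400 bodies are a simplification; the
// real service would use its documented error envelope.
func banishHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}
	var req banishRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	if strings.TrimSpace(req.RaceName) == "" {
		http.Error(w, "race_name must not be empty", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

// banishStatus sends body to the handler in-process and returns the status code.
func banishStatus(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/admin/race/banish", strings.NewReader(body))
	rec := httptest.NewRecorder()
	banishHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(banishStatus(`{"race_name":"Aelinari"}`))
	fmt.Println(banishStatus(`{"race_name":""}`))
}
```

A path-level test in the spirit of banish_test.go only needs to assert the status codes, which keeps it stable when the domain logic is filled in later.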

Files new:

  • game/internal/router/handler/banish.go, game/internal/router/banish_test.go (path-level test only).

Files touched:

  • game/openapi.yaml, game/openapi_contract_test.go, game/internal/router/router.go, game/internal/router/handler/*.go, game/internal/router/{init,status,turn}_test.go, game/internal/model/state.go.

Exit criteria:

  • go test ./game/... passes.
  • docker build -t galaxy/game:test -f game/Dockerfile . from the workspace root still succeeds.
  • curl -X POST http://localhost:8080/api/v1/admin/race/banish -d '{"race_name":"Aelinari"}' against a running container returns 204.

Stage 06. GM contract files and contract tests

Goal:

  • ship machine-readable contracts before any GM handler is written, so the implementation has a target spec.

Tasks:

  • gamemaster/api/internal-openapi.yaml: every internal REST endpoint with request and response schemas; error envelope { "error": { "code", "message" } } identical to Lobby. Operation IDs: internalRegisterRuntime, internalGetRuntime, internalListRuntimes, internalForceNextTurn, internalStopRuntime, internalPatchRuntime, internalBanishRace, internalInvalidateMemberships, internalGameLiveness, internalListEngineVersions, internalCreateEngineVersion, internalGetEngineVersion, internalUpdateEngineVersion, internalDeprecateEngineVersion, internalResolveEngineVersionImageRef, internalExecuteCommands, internalPutOrders, internalGetReport, internalHealthz, internalReadyz.
  • gamemaster/api/runtime-events-asyncapi.yaml: AsyncAPI 3.1.0 spec for gm:lobby_events. Two event_type values: runtime_snapshot_update and game_finished. Frozen field set per message: runtime_snapshot_update {game_id, current_turn, runtime_status, engine_health_summary, player_turn_stats[], occurred_at_ms}; game_finished {game_id, final_turn_number, runtime_status, player_turn_stats[], finished_at_ms}.
  • gamemaster/contract_openapi_test.go: load the OpenAPI spec via kin-openapi, assert every operation ID is present, every required field on every request/response schema is present, and that additionalProperties: false is set on every body schema.
  • gamemaster/contract_asyncapi_test.go: load the AsyncAPI spec via the shared YAML walker pattern from notification/contract_asyncapi_test.go; assert message names, channel addresses, action vocabulary (send/receive), and event_type discriminator values.
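
For orientation, the error envelope { "error": { "code", "message" } } named above could be modelled in Go as follows. This is only a sketch: the type names errorEnvelope and errorBody and the helper encodeError are illustrative, and the message text is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorBody carries the machine-readable code and the human-readable message.
type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// errorEnvelope wraps errorBody under the top-level "error" key.
type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// encodeError renders the envelope as JSON; it returns an empty string if
// marshalling fails, which cannot happen for these plain string fields.
func encodeError(code, message string) string {
	b, err := json.Marshal(errorEnvelope{Error: errorBody{Code: code, Message: message}})
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encodeError("engine_version_not_found", "engine version is not registered"))
}
```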

Files new:

  • gamemaster/api/internal-openapi.yaml, gamemaster/api/runtime-events-asyncapi.yaml, gamemaster/contract_openapi_test.go, gamemaster/contract_asyncapi_test.go.

Exit criteria:

  • both specs validate.
  • contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.

Stage 07. Notification catalog audit (no-op or minor)

Goal:

  • confirm the GM-owned notification types (game.turn.ready, game.finished, game.generation_failed) are already wired through pkg/notificationintent, the notification service's catalog data tables, and notification/api/intents-asyncapi.yaml. Add freeze assertions so a future drift breaks loudly.

Tasks:

  • Run a freeze test inside gamemaster/ that imports galaxy/notificationintent and asserts the existence of the three constructors plus payload struct shapes.
  • Inspect notification/api/intents-asyncapi.yaml for the three message schemas; if any are missing the per-payload required fields, add them here.
  • Inspect the notification service's routing data tables (the location is internal to notification/internal/...); confirm the three types are present with audience and channel decisions matching ./README.md §Notification Contracts. Add entries if missing.
  • Extend notification/contract_asyncapi_test.go if any new payload schema entries were added.

Files touched (only if drift is found):

  • notification/api/intents-asyncapi.yaml, notification/internal/... (catalog data), notification/contract_asyncapi_test.go.

Files new:

  • gamemaster/notificationintent_audit_test.go.

Exit criteria:

  • the freeze test passes.
  • notification/contract_asyncapi_test.go and intent_acceptance_contract_test.go continue to pass.

Stage 08. GM module skeleton

Goal:

  • create a buildable gamemaster binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.

Tasks:

  • gamemaster/cmd/gamemaster/main.go mirroring rtmanager/cmd/rtmanager/main.go.
  • gamemaster/internal/config/{config.go, env.go, validation.go} with env prefix GAMEMASTER and groups Listener, Postgres, Redis, Streams, Engine client, Lobby internal client, RTM internal client, Scheduler, Membership cache, Logging, Lifecycle, Telemetry. Required variables fail-fast.
  • gamemaster/internal/logging/{logger.go, context.go} copied from lobby/notification.
  • gamemaster/internal/telemetry/runtime.go registering the metrics named in ./README.md §Observability.
  • gamemaster/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go} — empty wiring with PostgreSQL open, Redis open, telemetry open, probe listener open.
  • gamemaster/internal/api/internalhttp/server.go — listener with /healthz and /readyz only.
  • gamemaster/Makefile with the jet target (real generation lands in Stage 09) and a mocks target.
  • gamemaster/go.mod and go.sum with dependencies: github.com/redis/go-redis/v9, github.com/jackc/pgx/v5, github.com/go-jet/jet/v2, github.com/pressly/goose/v3, github.com/stretchr/testify, go.uber.org/mock, the testcontainers modules for postgres/redis, the OpenTelemetry stack identical to lobby, galaxy/cronutil, galaxy/notificationintent, galaxy/postgres, galaxy/redisconn, galaxy/error, galaxy/util.
  • Update repo-level go.work: ./gamemaster is already a workspace member; verify the module path and go.work.sum.

Files new:

  • the entire skeleton tree under gamemaster/.

Exit criteria:

  • go build ./gamemaster/cmd/gamemaster succeeds.
  • Running with valid env brings /healthz and /readyz up.
  • SIGTERM returns within GAMEMASTER_SHUTDOWN_TIMEOUT.

Stage 09. PostgreSQL schema, migrations, jet

Goal:

  • finalise the persistence schema and the code-generation pipeline.

Tasks:

  • gamemaster/internal/adapters/postgres/migrations/00001_init.sql: CREATE SCHEMA IF NOT EXISTS gamemaster; plus the four tables and indexes from ./README.md §Persistence Layout: runtime_records, engine_versions, player_mappings, operation_log. All time columns are timestamptz.
  • gamemaster/internal/adapters/postgres/migrations/migrations.go: //go:embed *.sql and FS() exporter, identical pattern to lobby and rtmanager.
  • gamemaster/cmd/jetgen/main.go — testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors rtmanager/cmd/jetgen/main.go.
  • Generated gamemaster/internal/adapters/postgres/jet/... committed to the repo.
  • Wire goose migrations into gamemaster/internal/app/runtime.go startup so they apply before any listener opens; non-zero exit on failure (matches pkg/postgres policy).

Files new:

  • as above.

Exit criteria:

  • make -C gamemaster jet regenerates the jet code with no diff after a clean run.
  • Service start applies migrations to a fresh database and exits zero if migrations are already applied.

Stage 10. Domain layer and ports

Goal:

  • lock the in-memory domain model and the port interfaces for adapters.

Tasks:

  • gamemaster/internal/domain/runtime/model.go: RuntimeRecord struct; status enum (StatusStarting, StatusRunning, StatusGenerationInProgress, StatusGenerationFailed, StatusStopped, StatusEngineUnreachable, StatusFinished); error sentinels.
  • gamemaster/internal/domain/runtime/transitions.go — allowed transitions table and a CAS-friendly validator.
  • gamemaster/internal/domain/engineversion/{model.go, semver.go}: EngineVersion struct (Version, ImageRef, Options, Status); semver parse + patch-only comparison helpers.
  • gamemaster/internal/domain/playermapping/model.go: PlayerMapping struct (GameID, UserID, RaceName, EnginePlayerUUID).
  • gamemaster/internal/domain/schedule/nexttick.go — wraps cronutil.Schedule; carries skip_next_tick semantics on Next(after, skip bool) (time.Time, skipConsumed bool).
  • gamemaster/internal/ports/:
    • runtimerecordstore.go: Get, Insert, UpdateStatus (CAS by expected status), UpdateScheduling, ListDueRunning, ListByStatus.
    • engineversionstore.go: Get, List (with status filter), Insert, Update, Deprecate, IsReferencedByActiveRuntime.
    • playermappingstore.go: BulkInsert, Get(gameID, userID), ListByGame(gameID), DeleteByGame(gameID).
    • operationlog.go: Append, ListByGame.
    • streamoffsetstore.go: Load, Save (Redis offset persistence per consumer label).
    • engineclient.go: narrow surface GM uses: Init, Status, Turn, BanishRace, ExecuteCommands, PutOrders, GetReport.
    • lobbyclient.go: GetMemberships(ctx, gameID) ([]Membership, error).
    • rtmclient.go: Stop(ctx, gameID, reason) error, Patch(ctx, gameID, imageRef) error, Restart (reserved; not in v1 feature scope).
    • notificationpublisher.go: Publish(ctx, intent) error.
    • lobbyeventspublisher.go: PublishSnapshotUpdate, PublishGameFinished.
  • //go:generate mockgen directive next to each interface declaration.
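
As a rough illustration of the transitions table and validator in domain/runtime, the sketch below encodes only the edges that this plan mentions in passing (starting → running, running → generation_in_progress, the three outcomes of a turn, and stop from any live status). The exact edge set is an assumption; the authoritative table belongs to ./README.md. CanTransition and Validate are hypothetical names.

```go
package main

import "fmt"

// Status is the runtime status enum from the plan.
type Status string

const (
	StatusStarting             Status = "starting"
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusStopped              Status = "stopped"
	StatusEngineUnreachable    Status = "engine_unreachable"
	StatusFinished             Status = "finished"
)

// allowed is an illustrative table, not the authoritative one. Terminal
// statuses (stopped, finished) have no outgoing edges in this sketch.
var allowed = map[Status]map[Status]bool{
	StatusStarting:          {StatusRunning: true, StatusStopped: true},
	StatusRunning:           {StatusGenerationInProgress: true, StatusStopped: true},
	StatusGenerationFailed:  {StatusStopped: true},
	StatusEngineUnreachable: {StatusStopped: true},
	StatusGenerationInProgress: {
		StatusRunning:          true,
		StatusFinished:         true,
		StatusGenerationFailed: true,
		StatusStopped:          true,
	},
}

// CanTransition reports whether from -> to is listed in the table.
func CanTransition(from, to Status) bool {
	return allowed[from][to]
}

// Validate returns an error for a disallowed edge. A caller would run it
// before issuing the CAS update with from as the expected status.
func Validate(from, to Status) error {
	if !CanTransition(from, to) {
		return fmt.Errorf("transition %s -> %s is not allowed", from, to)
	}
	return nil
}

func main() {
	fmt.Println(Validate(StatusStarting, StatusRunning))
	fmt.Println(Validate(StatusFinished, StatusRunning))
}
```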

Files new:

  • as above.

Exit criteria:

  • the package compiles.
  • every interface has a _ ports.X = (*Y)(nil) assertion slot ready for the adapters that follow.
  • go test ./gamemaster/internal/domain/... passes.

Stage 11. Persistence adapters

Goal:

  • implement the four PostgreSQL stores and the Redis offset store.

Tasks:

  • gamemaster/internal/adapters/postgres/runtimerecordstore/store.go using jet. CAS semantics on UpdateStatus (expected status comparison inside the SQL UPDATE ... WHERE game_id = $1 AND status = $2 pattern). UpdateScheduling mutates next_generation_at and skip_next_tick together.
  • gamemaster/internal/adapters/postgres/engineversionstore/store.go. IsReferencedByActiveRuntime joins against runtime_records WHERE status NOT IN ('finished','stopped').
  • gamemaster/internal/adapters/postgres/playermappingstore/store.go. BulkInsert is a single INSERT ... ON CONFLICT DO NOTHING.
  • gamemaster/internal/adapters/postgres/operationlog/store.go.
  • gamemaster/internal/adapters/redisstate/streamoffsets/store.go (mirror Lobby's and RTM's redisstate/streamoffsets).
  • For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on runtime_records.UpdateStatus are verified by an explicit concurrent-update test (only one of two callers wins). The semver-patch comparison in engineversion is verified against a curated table of cases.
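
The concurrent-update test mentioned above checks one property: when several callers attempt the same CAS, exactly one succeeds. The sketch below demonstrates that property against an in-memory stand-in, where a mutex plays the role that row-level atomicity of UPDATE ... WHERE game_id = $1 AND status = $2 plays in PostgreSQL. memStore and race are hypothetical names; the real test would run against testcontainers PostgreSQL.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is an in-memory stand-in for runtime_records (game_id -> status).
type memStore struct {
	mu     sync.Mutex
	status map[string]string
}

// UpdateStatus mimics the conditional UPDATE: the write happens only when
// the current status equals expected, and the result says whether it did.
func (s *memStore) UpdateStatus(gameID, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != expected {
		return false
	}
	s.status[gameID] = next
	return true
}

// race starts n concurrent CAS attempts from "running" and returns how many
// of them won.
func race(n int) int {
	s := &memStore{status: map[string]string{"game-1": "running"}}
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.UpdateStatus("game-1", "running", "generation_in_progress") {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	fmt.Println("winners:", race(2))
}
```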

Files new:

  • as above and per-package _test.go.

Exit criteria:

  • store tests pass on a CI runner with Docker available.

Stage 12. External clients (engine, lobby, RTM, notification, lobby-events)

Goal:

  • ship the HTTP and Redis adapters that GM uses to talk to the engine, Lobby internal API, RTM internal API, the notification stream, and the lobby-events stream.

Tasks:

  • gamemaster/internal/adapters/engineclient/client.go — REST client over an otelhttp-wrapped http.Client. Implements ports.EngineClient by calling the renamed admin endpoints (/api/v1/admin/init, /admin/status, /admin/turn, /admin/race/banish) and the player endpoints (/api/v1/command, /api/v1/order, /api/v1/report). Builds and consumes the existing JSON shapes from game/openapi.yaml.
  • gamemaster/internal/adapters/lobbyclient/client.go — REST client for GET /api/v1/internal/games/{game_id}/memberships. Returns a typed Membership slice.
  • gamemaster/internal/adapters/rtmclient/client.go — REST client for POST /api/v1/internal/runtimes/{game_id}/stop and /patch.
  • gamemaster/internal/adapters/notificationpublisher/publisher.go — thin XADD wrapper over notification:intents using galaxy/notificationintent constructors.
  • gamemaster/internal/adapters/lobbyeventspublisher/publisher.go — XADD wrapper for gm:lobby_events. Two methods: PublishSnapshotUpdate(ctx, msg) and PublishGameFinished(ctx, msg). Schema enforced inline against runtime-events-asyncapi.yaml.
  • gamemaster/internal/adapters/mocks/: mockgen-generated mocks for every ports.* interface. Regenerated by make -C gamemaster mocks.
  • Per-adapter unit tests with mocks for the clients (httptest server for REST adapters; miniredis for the publishers).

Files new:

  • as above.

Exit criteria:

  • mocks regenerate cleanly via go generate.
  • unit tests pass.
  • go test ./gamemaster/internal/adapters/... passes.

Stage 13. Service: register-runtime

Goal:

  • end-to-end register-runtime operation: validate, persist initial record, call engine /admin/init, persist player mappings, mark running, schedule first turn.

Tasks:

  • gamemaster/internal/service/registerruntime/service.go orchestrator, following the flow from ./README.md §Lifecycles → Register-runtime:
    • validate envelope;
    • reject if runtime_records.{game_id} exists;
    • resolve image_ref for target_engine_version from engine_versions;
    • persist runtime_records.status=starting;
    • call engine /admin/init;
    • persist player_mappings rows from the engine response;
    • CAS status: starting → running, persist current_turn=0 and initial next_generation_at;
    • append operation_log;
    • publish runtime_snapshot_update;
    • return persisted runtime record.
  • Failure paths: roll back runtime_records on engine failure; ensure no orphan player_mappings rows; record failure in operation_log.
  • Unit tests cover happy path, duplicate registration (returns conflict), engine 4xx (engine_validation_error), engine 5xx (engine_unreachable), missing engine version (engine_version_not_found), partial-rollback paths.

Files new:

  • gamemaster/internal/service/registerruntime/{service.go, service_test.go, errors.go}.

Exit criteria:

  • service-level tests pass.

Stage 14. Service: engine version registry CRUD + image-ref resolve

Goal:

  • the registry surface used by Lobby's start flow and by Admin Service.

Tasks:

  • gamemaster/internal/service/engineversion/service.go:
    • List(ctx, statusFilter) — list versions optionally filtered by status;
    • Get(ctx, version) — read one;
    • Create(ctx, version, imageRef, options) — validate semver, validate Docker reference shape, persist;
    • Update(ctx, version, patch) — partial update (image_ref, options, status);
    • Deprecate(ctx, version) — set status=deprecated;
    • Delete(ctx, version) — hard delete; rejected with engine_version_in_use if IsReferencedByActiveRuntime returns true;
    • ResolveImageRef(ctx, version) — read image_ref only; this is the hot path used by Lobby.
  • Unit tests cover create-validate, delete-when-active rejection, and semver shape validation. Resolve is tested against a seeded table of versions.

Files new:

  • gamemaster/internal/service/engineversion/{service.go, service_test.go, errors.go}.

Exit criteria:

  • service-level tests pass.

Stage 15. Service: scheduler + turn generation + snapshot publisher

Goal:

  • the heart of GM: the periodic scheduler and the turn-generation flow, with snapshot publication and finish detection.

Tasks:

  • gamemaster/internal/service/turngeneration/service.go:
    • input: gameID, trigger ∈ {scheduler, force};
    • CAS status: running → generation_in_progress;
    • call engine /admin/turn;
    • on success: persist current_turn, evaluate finished, branch:
      • finished: CAS status → finished, persist finished_at, PublishGameFinished, publish game.finished notification, return;
      • not finished: CAS status → running, recompute next_generation_at (skip a tick if skip_next_tick=true, then clear), PublishSnapshotUpdate, publish game.turn.ready notification, return;
    • on failure: CAS status → generation_failed, publish runtime_snapshot_update reflecting the new status, publish game.generation_failed admin notification, return.
  • gamemaster/internal/service/scheduler/service.go:
    • thin wrapper that builds the next-tick value from domain/schedule.NextTick given turn_schedule and skip_next_tick;
    • reused by both the ticker worker (Stage 19 wires it) and by the force-next-turn admin op (Stage 17).
  • gamemaster/internal/worker/schedulerticker/worker.go:
    • 1-second loop;
    • calls runtime_records.ListDueRunning(now) and runs turngeneration.Run(ctx, gameID, scheduler) per game;
    • serialises per-game_id calls (one in-flight per game; concurrent games proceed in parallel).
  • Unit tests cover happy path, finish detection, force trigger with skip consumption, generation failure, CAS contention with a concurrent external status change (e.g., admin stop).
  • Player turn stats are derived from StateResponse.player[] and projected to {user_id, planets, population} via playermappingstore.ListByGame.
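
The skip_next_tick handling referenced above (skip one tick, then clear the flag) can be sketched as follows. The Schedule interface matches the Next method this plan gives cronutil.Schedule; the hourly type is a stand-in so the example needs no cron parser, and NextTick and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule matches the Next method that the plan assigns to cronutil.Schedule.
type Schedule interface {
	Next(after time.Time) time.Time
}

// hourly is a stand-in schedule that fires at the top of every hour in UTC.
type hourly struct{}

func (hourly) Next(after time.Time) time.Time {
	return after.UTC().Truncate(time.Hour).Add(time.Hour)
}

// NextTick returns the next generation time. When skip is true it passes
// over one tick and reports skipConsumed=true so the caller can clear
// skip_next_tick in the same update that stores the new time.
func NextTick(s Schedule, after time.Time, skip bool) (next time.Time, skipConsumed bool) {
	next = s.Next(after)
	if skip {
		return s.Next(next), true
	}
	return next, false
}

// demo computes both outcomes for one fixed instant and formats them.
func demo() (plain, skipped string) {
	after := time.Date(2026, 5, 3, 17, 30, 0, 0, time.UTC)
	p, _ := NextTick(hourly{}, after, false)
	s, _ := NextTick(hourly{}, after, true)
	return p.Format(time.RFC3339), s.Format(time.RFC3339)
}

func main() {
	plain, skipped := demo()
	fmt.Println("without skip:", plain)
	fmt.Println("with skip:   ", skipped)
}
```

Returning skipConsumed instead of mutating state keeps the helper pure, which fits an UpdateScheduling call that writes next_generation_at and skip_next_tick together.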

Files new:

  • gamemaster/internal/service/turngeneration/{service.go, service_test.go, errors.go}, gamemaster/internal/service/scheduler/{service.go, service_test.go}, gamemaster/internal/worker/schedulerticker/{worker.go, worker_test.go}.

Exit criteria:

  • service-level tests pass.

Stage 16. Service: hot-path command + order + report + membership cache

Goal:

  • the gateway-facing trio: command execution, order submission, report reading. Membership cache and the invalidation hook.

Tasks:

  • gamemaster/internal/service/membership/cache.go:
    • in-process map[gameID]entry{members map[userID]MembershipStatus, loadedAt};
    • Resolve(ctx, gameID, userID) (status, error) — checks cache, falls back to lobbyclient.GetMemberships on miss or TTL expiry;
    • Invalidate(gameID) — purges the cache entry;
    • LRU eviction governed by GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES.
  • gamemaster/internal/service/commandexecute/service.go:
    • input: gameID, userID, payload {commands:[…]};
    • validate runtime_records.{game_id} exists with status=running;
    • resolve membership; reject if not active;
    • resolve race_name from playermappingstore;
    • call engine /api/v1/command with CommandRequest{actor=race_name, cmd=…};
    • return engine response verbatim.
  • gamemaster/internal/service/orderput/service.go: identical structure, calls /api/v1/order.
  • gamemaster/internal/service/reportget/service.go: input {gameID, userID, turn}; resolves race_name; calls /api/v1/report?player=…&turn=…; returns body verbatim.
  • Unit tests: each service covers happy path, runtime-not-running, forbidden, engine 4xx, engine 5xx; membership cache tests cover hit, miss, TTL expiry, invalidate.
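
A minimal sketch of the membership cache behaviour listed above (hit, miss, TTL expiry, invalidate) follows. It is simplified on purpose: it has no locking and no LRU bound, Resolve returns only a status string rather than (status, error), and the load function stands in for lobbyclient.GetMemberships. Cache, entry, and newDemoCache are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// entry holds the memberships of one game and the time they were loaded.
type entry struct {
	members  map[string]string // userID -> membership status
	loadedAt time.Time
}

// Cache is a single-goroutine sketch; a real cache needs a mutex and the
// LRU bound from GAMEMASTER_MEMBERSHIP_CACHE_MAX_GAMES.
type Cache struct {
	ttl     time.Duration
	now     func() time.Time
	load    func(gameID string) map[string]string // stands in for the Lobby call
	entries map[string]entry
	Loads   int // number of upstream loads, exposed for the demo
}

// Resolve serves from the cache and reloads on a miss or once the TTL elapsed.
func (c *Cache) Resolve(gameID, userID string) string {
	e, ok := c.entries[gameID]
	if !ok || c.now().Sub(e.loadedAt) >= c.ttl {
		e = entry{members: c.load(gameID), loadedAt: c.now()}
		c.entries[gameID] = e
		c.Loads++
	}
	return e.members[userID]
}

// Invalidate purges the entry for gameID so the next Resolve reloads it.
func (c *Cache) Invalidate(gameID string) {
	delete(c.entries, gameID)
}

// newDemoCache builds a cache with a one-minute TTL and a controllable
// clock; the returned function advances that clock by the given seconds.
func newDemoCache() (*Cache, func(seconds int)) {
	clock := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	c := &Cache{
		ttl:     time.Minute,
		now:     func() time.Time { return clock },
		load:    func(string) map[string]string { return map[string]string{"u1": "active"} },
		entries: map[string]entry{},
	}
	return c, func(seconds int) { clock = clock.Add(time.Duration(seconds) * time.Second) }
}

func main() {
	c, advance := newDemoCache()
	status := c.Resolve("g1", "u1")
	fmt.Println("status:", status)
	c.Resolve("g1", "u1") // served from cache
	advance(61)
	c.Resolve("g1", "u1") // TTL elapsed, reloads
	fmt.Println("upstream loads:", c.Loads)
}
```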

Files new:

  • gamemaster/internal/service/membership/{cache.go, cache_test.go}, gamemaster/internal/service/commandexecute/{service.go, service_test.go}, gamemaster/internal/service/orderput/{service.go, service_test.go}, gamemaster/internal/service/reportget/{service.go, service_test.go}.

Exit criteria:

  • service-level tests pass.

Stage 17. Service: admin operations (stop, force-next-turn, patch, banish, liveness)

Goal:

  • the remaining service-layer operations: admin/runtime control plus the Lobby-facing liveness reply.

Tasks:

  • gamemaster/internal/service/adminstop/service.go:
    • input {gameID, reason};
    • call rtmclient.Stop(ctx, gameID, reason);
    • on success: CAS runtime_records.status: * → stopped; append operation_log; publish runtime_snapshot_update.
  • gamemaster/internal/service/adminforce/service.go:
    • run turngeneration.Run(ctx, gameID, force) synchronously;
    • on success, set runtime_records.skip_next_tick = true (the next scheduler-driven Next consumes it).
  • gamemaster/internal/service/adminpatch/service.go:
    • input {gameID, version};
    • resolve new image_ref via engineversion.ResolveImageRef;
    • validate semver-patch against current runtime_records.current_engine_version; reject with semver_patch_only otherwise;
    • call rtmclient.Patch(ctx, gameID, imageRef);
    • on success: persist new current_image_ref and current_engine_version; append operation_log.
  • gamemaster/internal/service/adminbanish/service.go:
    • input {gameID, raceName};
    • validate playermappingstore.GetByRace(gameID, raceName) exists;
    • call engine /admin/race/banish;
    • append operation_log.
  • gamemaster/internal/service/livenessreply/service.go:
    • lookup runtime_records.{game_id};
    • return {ready: status==running, status: <observed>}.
  • Unit tests for each service cover happy path and each documented error code.
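
The semver-patch check in adminpatch can be illustrated with a small sketch. It assumes plain `MAJOR.MINOR.PATCH` version strings (an optional leading `v`, no pre-release or build metadata) and accepts any target that shares major and minor with the current version; whether a patch downgrade or an identical version should be rejected is not specified in this plan, so the sketch leaves that open. The function names and the `ErrSemverPatchOnly` variable are hypothetical; only the `semver_patch_only` code comes from the task list.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ErrSemverPatchOnly mirrors the semver_patch_only error code named in the plan.
var ErrSemverPatchOnly = errors.New("semver_patch_only")

// parseVersion splits "MAJOR.MINOR.PATCH" (optional leading "v") into integers.
// Pre-release and build metadata are not handled in this sketch.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// ValidatePatchUpgrade returns nil only when target differs from current
// in the patch component at most; major and minor must match.
func ValidatePatchUpgrade(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	if cur[0] != tgt[0] || cur[1] != tgt[1] {
		return ErrSemverPatchOnly
	}
	return nil
}
```

Running the check before `rtmclient.Patch` keeps a rejected request from touching the container at all.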

Files new:

  • gamemaster/internal/service/adminstop/..., gamemaster/internal/service/adminforce/..., gamemaster/internal/service/adminpatch/..., gamemaster/internal/service/adminbanish/..., gamemaster/internal/service/livenessreply/....

Exit criteria:

  • service-level tests pass.

Stage 18. Async consumer: runtime:health_events

Goal:

  • bring runtime health into GM's view per game and propagate to Lobby via the snapshot stream.

Tasks:

  • gamemaster/internal/worker/healtheventsconsumer/worker.go:
    • XREADs runtime:health_events with a persisted offset (via streamoffsetstore);
    • decodes the AsyncAPI envelope from RTM;
    • updates runtime_records.engine_health per game_id;
    • emits a debounced runtime_snapshot_update only when the summary string changes.
  • The summary derivation rule:
    • healthy ⇒ summary healthy;
    • probe_failed after threshold ⇒ summary probe_failed;
    • inspect_unhealthy ⇒ summary inspect_unhealthy;
    • container_exited ⇒ summary exited and CAS status → engine_unreachable;
    • container_oom ⇒ summary oom and CAS status → engine_unreachable;
    • container_disappeared ⇒ summary disappeared and CAS status → engine_unreachable.
  • Unit tests use miniredis and the AsyncAPI fixture from rtmanager/api/runtime-health-asyncapi.yaml.
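
The summary derivation rule lends itself to a table-like switch. The sketch below is illustrative only: it assumes the event type arrives as a plain string with the names listed above, and it assumes the probe_failed threshold has already been applied before the call (threshold counting is not shown). The function names are hypothetical.

```go
package main

// deriveSummary maps a health event type to its summary string and reports
// whether the runtime status should be CAS-ed to engine_unreachable.
// ok is false for event types this sketch does not recognise.
func deriveSummary(eventType string) (summary string, unreachable bool, ok bool) {
	switch eventType {
	case "healthy":
		return "healthy", false, true
	case "probe_failed": // assumed to be past the failure threshold already
		return "probe_failed", false, true
	case "inspect_unhealthy":
		return "inspect_unhealthy", false, true
	case "container_exited":
		return "exited", true, true
	case "container_oom":
		return "oom", true, true
	case "container_disappeared":
		return "disappeared", true, true
	default:
		return "", false, false
	}
}

// shouldEmit reports whether a runtime_snapshot_update is warranted:
// only a change in the summary string triggers one.
func shouldEmit(previous, next string) bool {
	return previous != next
}
```

Keeping the mapping in one pure function makes the rule easy to cover with a table-driven test, separate from the stream-reading code.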

Files new:

  • gamemaster/internal/worker/healtheventsconsumer/{worker.go, worker_test.go}.

Exit criteria:

  • worker tests pass.

Stage 19. Internal REST handlers

Goal:

  • ship the gateway-, Lobby-, and Admin-facing REST surface backed by the service layer.

Tasks:

  • gamemaster/internal/api/internalhttp/handlers/{registerruntime, getruntime, listruntimes, forcenextturn, stopruntime, patchruntime, banishrace, invalidatememberships, gameliveness, listengineversions, createengineversion, getengineversion, updateengineversion, deprecateengineversion, resolveengineversionimageref, executecommands, putorders, getreport}.go — one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields rejected with invalid_request.
  • Error envelope identical to lobby and rtmanager.
  • Wiring under the existing internal HTTP listener; route registration in gamemaster/internal/app/wiring.go.
  • Handler-level table-driven tests.
  • OpenAPI conformance test that loads api/internal-openapi.yaml and asserts every defined operation is reachable and matches its declared response shape.
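
Rejecting unknown JSON fields is typically done with the standard library's `json.Decoder.DisallowUnknownFields`. The following is a minimal sketch, assuming a hypothetical `stopRequest` body and a hypothetical `decodeStrict` helper; the real handlers may structure this differently.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"io"
)

// stopRequest is a hypothetical request body used only for illustration.
type stopRequest struct {
	Reason string `json:"reason"`
}

// errInvalidRequest stands in for the invalid_request error code.
var errInvalidRequest = errors.New("invalid_request")

// decodeStrict decodes exactly one JSON value into dst and rejects
// unknown fields, malformed input, and trailing data.
func decodeStrict(body []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return errInvalidRequest
	}
	// After a single value the stream must be exhausted.
	if _, err := dec.Token(); err != io.EOF {
		return errInvalidRequest
	}
	return nil
}
```

Sharing one helper across all handlers would keep the unknown-field behaviour uniform, which is what the exit criteria check.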

Files new:

  • handlers + tests + the conformance test gamemaster/api/openapi_conformance_test.go.

Exit criteria:

  • OpenAPI conformance test passes for every endpoint.
  • Handlers reject unknown JSON fields.

Stage 20. Lobby refactor

Goal:

  • complete the Lobby side of the new image-resolve and membership invalidation contract.

Tasks:

  • Replace lobby/internal/domain/engineimage/resolver.go with a thin GM-client wrapper. The package goes away; the call site in lobby/internal/service/startgame/service.go switches from engineimage.Resolver{}.Resolve(version) to gmClient.ResolveImageRef(ctx, version).
  • Drop LOBBY_ENGINE_IMAGE_TEMPLATE from lobby/internal/config/{config.go, env.go, validation.go}. Remove the validation function and the related env-var test cases.
  • Add InvalidateMemberships(ctx, gameID) error to lobby/internal/ports/gmclient.go. Regenerate the mockgen-mock and update the inmem fake to record invocations.
  • Wire the new call from:
    • lobby/internal/service/approveapplication/service.go — post-commit;
    • lobby/internal/service/rejectapplication/service.go — post-commit (only if a reservation existed prior);
    • lobby/internal/service/redeeminvite/service.go — post-commit;
    • lobby/internal/service/removemember/service.go — post-commit (already in scope of removal);
    • lobby/internal/service/blockmember/service.go — post-commit;
    • lobby/internal/worker/userlifecycle/consumer.go — post-commit per game in the cascade.
  • A failed invalidation is logged at warn level and counted in a metric modelled on the existing lobby.notification.publish_attempts (or in a new lobby.gm_invalidation.publish_attempts), but it does not roll back the business commit. The membership cache TTL on GM is the safety net.
  • Update Lobby unit tests, in particular the start-flow tests (replace engineimage mock with gmclient.ResolveImageRef mock) and the membership-mutation tests (assert InvalidateMemberships was called post-commit).
  • Update lobby/api/internal-openapi.yaml only if any new field surfaces (none expected; the call shape is on Lobby's outbound side, not on its REST surface).

Files touched:

  • lobby/internal/service/{startgame, approveapplication, rejectapplication, redeeminvite, removemember, blockmember}/, lobby/internal/worker/userlifecycle/, lobby/internal/config/{config.go, env.go, validation.go}, lobby/internal/ports/gmclient.go, lobby/internal/adapters/gmclient/client.go, lobby/internal/adapters/mocks/gmclient/..., lobby/internal/adapters/gmclientinmem/... (if the inmem fake exists; otherwise the mockgen mock plus the migration described in RTM stage 22 is enough).

Files removed:

  • lobby/internal/domain/engineimage/ (entire package).

Exit criteria:

  • go test ./lobby/... passes.
  • LOBBY_ENGINE_IMAGE_TEMPLATE no longer appears in any Lobby source or documentation.
  • Lobby's start-flow integration test still passes against a stub gmclient that returns image_ref synchronously.

Stage 21. Service-local integration suite

Goal:

  • end-to-end suite running against testcontainers PostgreSQL + Redis + the real galaxy/game engine container.

Tasks:

  • gamemaster/integration/harness/ — set up PostgreSQL with goose-applied migrations; Redis (testcontainers Redis for coordination suites that exercise streams); ensure the Docker bridge network exists; build galaxy/game test image once per package run with sync.Once; tear everything down via t.Cleanup. Reuse the RTM-built image where possible (skip rebuilding when present).
  • gamemaster/integration/registerruntime_test.go — register-runtime happy path: GM persists the runtime record, calls engine /admin/init, persists player_mappings, transitions to running, publishes a runtime_snapshot_update. Engine answers with a real StateResponse.
  • gamemaster/integration/scheduler_test.go — schedules a five-second turn cron, observes one tick, asserts engine /admin/turn was hit and current_turn advanced. Force-next-turn test asserts skip_next_tick consumes the next regular tick.
  • gamemaster/integration/hotpath_test.go — full command, order, and report round-trips against the real engine. Membership invalidation hook test asserts the cache flushes on demand.
  • gamemaster/integration/adminops_test.go — admin stop calls a stub RTM and asserts the runtime record transitions to stopped. Admin patch with a non-patch semver target fails with semver_patch_only. Admin banish hits the engine endpoint.
  • gamemaster/integration/healthevents_test.go — publishes a fake runtime:health_events entry, asserts the consumer updates engine_health and emits a debounced snapshot.
  • gamemaster/integration/notification_test.go — observe notification:intents after a successful turn (game.turn.ready), after a finish (game.finished), and after a forced engine failure (game.generation_failed admin email).

Files new:

  • as above.

Exit criteria:

  • go test ./gamemaster/integration/... passes locally with Docker available.
  • CI runs the suite under a profile that exposes the Docker socket.

Stage 22. Inter-service test: Lobby ↔ GM

Goal:

  • exercise the new image-ref resolve, register-runtime, and membership invalidation paths end-to-end without RTM in the loop.

Tasks:

  • integration/lobbygm/ (top-level integration directory, mirroring existing integration/lobbyrtm): runs real Lobby, real GM, real PostgreSQL, real Redis, a stub RTM that simply returns success on runtime:start_jobs, and the real galaxy/game test engine container.
  • Scenarios:
    • Lobby creates a game, resolves image_ref from GM, publishes a start_job, the stub RTM acks success, Lobby calls register-runtime on GM, GM /admin/inits the engine, GM transitions to running, GM publishes runtime_snapshot_update, Lobby updates its denormalised view.
    • One full turn generation cycle: scheduler ticks, GM calls engine /admin/turn, GM publishes runtime_snapshot_update, Lobby's per-game stats aggregate updates.
    • Membership change: an admin removes a member; Lobby's removemember post-commit calls GM invalidate-memberships; the next player command from that user fails with forbidden.
    • Game finish: engine returns finished:true; GM publishes game_finished; Lobby transitions the platform game record to finished and runs the capability evaluator.

Files new:

  • as above.

Exit criteria:

  • all scenarios pass in CI when the Docker socket is available.

Stage 23. Inter-service test: Lobby ↔ GM ↔ RTM (full happy path)

Goal:

  • the canonical end-to-end test covering the whole running-game pipeline.

Tasks:

  • integration/lobbygmrtm/: runs real Lobby, real GM, real RTM, real PostgreSQL, real Redis, and the real galaxy/game test engine container.
  • Scenarios:
    • Happy path: enrollment → start → RTM container → GM register-runtime → engine /admin/init → first player command → first scheduled turn → engine finished:true → GM game_finished → Lobby transitions to finished → RTM cleanup TTL.
    • Failure path A: RTM reports start_config_invalid on runtime:job_results; Lobby transitions the game to start_failed; no GM register-runtime is attempted.
    • Failure path B: container starts but GM is unavailable when Lobby calls register-runtime; Lobby transitions the game to paused and publishes lobby.runtime_paused_after_start; once GM comes back, Lobby's resume flow calls GM /liveness, receives ready=true, re-issues register-runtime, and the game reaches running.

Files new:

  • as above.

Exit criteria:

  • all scenarios pass in CI when the Docker socket is available.

Stage 24. Service-local docs

Goal:

  • collect the per-stage decisions captured during this plan into discoverable service-local documentation, mirroring lobby/docs/ and rtmanager/docs/.

Tasks:

  • gamemaster/docs/README.md — index pointing at the five content docs and the postgres-migration record.
  • gamemaster/docs/runtime.md — components, processes, in-memory state of each worker.
  • gamemaster/docs/flows.md — Mermaid diagrams for: register-runtime, turn generation, force-next-turn skip, hot-path command, admin patch, finish, health consumption, banish.
  • gamemaster/docs/runbook.md — operator scenarios: «engine became unreachable», «turn generation failed and stuck», «patch upgrade», «manual force-next-turn», «engine version registry rotation», «membership cache appears stale».
  • gamemaster/docs/examples.md — env-var examples per environment (dev / test / prod skeletons), example payloads for each stream and each REST endpoint.
  • gamemaster/docs/postgres-migration.md — decision record for the schema (mirrors notification/docs/postgres-migration.md style).
  • Add per-stage decision records under gamemaster/docs/stage<NN>-*.md for any stage that produced a noteworthy decision (mirroring the RTM pattern). At minimum:
    • stage11-persistence-adapters.md,
    • stage12-external-clients.md,
    • stage15-scheduler-and-turn-generation.md,
    • stage16-membership-cache-and-invalidation.md,
    • stage17-admin-operations.md,
    • stage18-health-events-consumer.md,
    • stage20-lobby-refactor.md.

Files new:

  • all of the above.

Exit criteria:

  • the README of GM links to docs/README.md.
  • a reviewer can find any operational how-to within two clicks.

Final Acceptance Criteria

  • go build ./... from the repository root succeeds.
  • go test ./... from the repository root passes.
  • go test -tags=integration ./gamemaster/integration/... passes when Docker is available.
  • go test ./integration/lobbygm/... and go test ./integration/lobbygmrtm/... pass when Docker is available.
  • make -C gamemaster jet regenerates jet code with no diff after a clean run.
  • make -C gamemaster mocks regenerates mock code with no diff after a clean run.
  • Manual smoke: bring Lobby + GM + RTM + the rest of the stack up via the existing dev compose; create a game; observe a real galaxy-game-{game_id} container; play one turn round-trip; observe a runtime_snapshot_update on gm:lobby_events; force-next-turn; observe the next scheduled tick is skipped; stop the game; the container moves to exited.
  • Documentation across ARCHITECTURE.md, gamemaster/, lobby/, notification/, game/, and rtmanager/ is internally consistent.

Out of Scope

  • Multi-instance GM with leader election (Game Master runs as a single process in v1).
  • Engine state file management (backup, archival, host-side cleanup).
  • Direct gateway routing of admin message_type values (admin operations land via Admin Service in a later iteration; v1 exposes only the GM internal REST surface).
  • TLS / mTLS on the internal listener.
  • Engine-version automatic patch upgrades (manual admin operation only).
  • A pause/resume flow on GM's side beyond the liveness-check reply.

Risks and Notes

  • The membership invalidation hook from Lobby into GM is a deliberate tight coupling. TTL stays as the safety net for any failed invalidation; the explicit hook only shortens the staleness window. A failed invalidation is logged but never rolls back Lobby state. This trade-off is recorded in ./README.md §Hot Path.
  • The Lobby refactor (Stage 20) is gated on GM Stages 14 (engine version registry resolve endpoint) and 19 (handlers wired). Once Lobby switches to GM for image-ref resolution, Lobby cannot start a game while GM is unavailable; this is documented as the new failure mode in lobby/README.md (Stage 03).
  • Engine path rename (Stage 05) is internal to galaxy/game. No other service today calls /api/v1/init, /api/v1/status, or /api/v1/turn (RTM probes only /healthz); the rename is therefore a contained change inside the engine module. The user owns the conditional logic that fills StateResponse.finished and the body-level mechanics of banish.
  • GM single-instance is a single point of failure for turn generation in v1. The trade-off is acceptable for the prototype and is documented in gamemaster/README.md §Non-Goals.
  • Pre-launch single-init policy applies to GM exactly as documented in ARCHITECTURE.md §Persistence Backends: schema evolves by editing 00001_init.sql until first production deploy.