feat: gamemaster

Ilia Denisov
2026-05-03 07:59:03 +02:00
committed by GitHub
parent a7cee15115
commit 3e2622757e
229 changed files with 41521 additions and 1098 deletions
# Stage 01 — Architecture sync
This decision record captures the non-obvious choice from
[`../PLAN.md` Stage 01](../PLAN.md#stage-01-update-architecturemd):
the drop of `ships_built` from every architectural mention of
`player_turn_stats`.
## Context
Before Stage 01, `ARCHITECTURE.md` and `lobby/README.md` described
`player_turn_stats` as carrying `{user_id, planets, population,
ships_built}`, and the Race Name Directory capability rule was wired in
prose as if `ships_built` could affect the outcome. In practice, the
formal capability rule was already
`max_planets > initial_planets AND max_population > initial_population`;
`ships_built` was named in the stats payload but never referenced by
the rule.
## Decision
`player_turn_stats` carries `{user_id, planets, population}` only.
`ships_built` is removed from:
- `ARCHITECTURE.md §8 Game Master` — `runtime_snapshot_update` payload
description.
- `ARCHITECTURE.md §7 Game Lobby` — per-member aggregate description
(`current and running-max of planets and population`).
- `gamemaster/README.md` — already aligned at the stage-02 README
freeze.
The capability rule wording is unchanged because it was already
`planets`/`population`-only; only the surrounding prose mentioning the
unused field was inaccurate.
This is a documentation-only change. No runtime behaviour, wire format,
schema, or test fixture is affected.
## Why
`ships_built` was unused. Naming it in the contract obliged every
producer (GM) and consumer (Lobby aggregator) to populate and forward a
field with no consumer. Dropping it now — before any GM code lands —
keeps the contract minimal and avoids future drift between "what the
spec lists" and "what the code uses". `lobby/README.md` and the lobby
aggregate code are aligned in Stage 03 of the same plan.
## Alternatives considered
- **Keep `ships_built` in the contract for future use.** Rejected: no
concrete plan exists for a `ships_built`-driven capability or stat
surface; speculative fields rot.
- **Add `ships_built` only as an opaque stat without changing the
capability rule.** Rejected: the runtime cost of carrying it is
negligible, but the documentation burden of explaining why an unused
field is in the payload is not.
## References
- [`../PLAN.md` Stage 01](../PLAN.md)
- [`../../ARCHITECTURE.md` §7 Game Lobby](../../ARCHITECTURE.md)
- [`../../ARCHITECTURE.md` §8 Game Master](../../ARCHITECTURE.md)
- [`../README.md`](../README.md) — `player_turn_stats[]` description.
---
stage: 03
title: Existing-service docs sync (Lobby, Notification, Game, RTM)
---
# Stage 03 — Existing-service docs sync
This decision record captures the non-obvious choices made while
synchronising every touched-service README with the post-Game-Master
contract before any code change lands. The mechanical edits
(strikethrough renames, drop of `ships_built`, replacement of the
`engineimage.Resolver` block) are not enumerated here — they are direct
consequences of the rules already recorded in
[`../README.md`](../README.md) and
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md).
## Context
Stage 03 had to reach a state where every README in the repository
agreed on three new contractual rules before any service-level code
landed:
- `image_ref` is resolved synchronously from `Game Master`'s engine
version registry, not from a Go-template held by `Game Lobby`.
- A new outgoing `POST /api/v1/internal/games/{game_id}/memberships/invalidate`
hook from `Game Lobby` into `Game Master` fires post-commit on every
roster mutation.
- The engine container splits its REST surface into `/api/v1/admin/*`
(GM-only) and `/api/v1/{command,order,report}` (player), and
`StateResponse` carries a new boolean `finished` field that GM uses
as the sole finish signal.
Three decisions were not derivable from the GM README and required a
deliberate choice while editing `lobby/README.md`, `game/README.md`,
and `rtmanager/README.md`.
## Decision 1 — `lobby.game.start` failure modes for GM-driven image resolve
`Game Lobby` now calls
`GET /api/v1/internal/engine-versions/{version}/image-ref` synchronously
before publishing `runtime:start_jobs`. The contract defines two new
failure modes for the `lobby.game.start` command:
- GM unreachable (network error, timeout, `5xx`) ⇒
`lobby.game.start` returns `service_unavailable`; the game stays in
`ready_to_start`. No container is created, no envelope is published.
- GM reports the version is missing or deprecated (`404` or
`engine_version_not_found` payload) ⇒ `lobby.game.start` returns
`engine_version_not_found`; the game stays in `ready_to_start`.
Both error codes were added to the stable error code list in
`lobby/README.md`. They are deliberately distinct from the existing
GM-unavailable-after-container-start path, which transitions the game to
`paused` (the container is alive; only platform tracking is missing).
Conflating the two would force operators to inspect the `paused` set
for misconfigurations that never produced a container.
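A minimal Go sketch of the failure mapping above; the function name and the `httpStatus == 0` network-error sentinel are illustrative, not the real Lobby code:

```go
package main

import "fmt"

// classifyResolveFailure maps the outcome of the synchronous image-ref
// resolve call to the lobby.game.start error code. The game stays in
// ready_to_start in both failure cases; no container is created.
func classifyResolveFailure(httpStatus int, errCode string) string {
	switch {
	case httpStatus == 404 || errCode == "engine_version_not_found":
		return "engine_version_not_found" // version missing or deprecated
	case httpStatus == 0 || httpStatus >= 500:
		return "service_unavailable" // network error, timeout, or 5xx
	default:
		return "" // resolved; proceed to publish runtime:start_jobs
	}
}

func main() {
	fmt.Println(classifyResolveFailure(503, "")) // service_unavailable
	fmt.Println(classifyResolveFailure(404, "")) // engine_version_not_found
}
```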
Alternatives considered and rejected:
- treat GM-unavailable at resolve time as `paused` for symmetry with the
later path — rejected because no container exists, so the
`lobby.runtime_paused_after_start` admin notification (which announces
a stranded container) would be a lie;
- silently fall back to a Go-template default when GM is unreachable —
rejected because it brings back the very coupling the stage is
retiring and lets a misconfigured registry slip through unnoticed.
## Decision 2 — Membership invalidate hook is fail-open
The new outgoing
`POST /api/v1/internal/games/{game_id}/memberships/invalidate` call from
`approveapplication`, `rejectapplication`, `redeeminvite`,
`removemember`, `blockmember`, and the user-lifecycle cascade worker is
documented as **fail-open**: a non-2xx response is logged and metered
but never rolls back the Lobby commit. GM's TTL safety net catches
stale data within the next cache TTL window.
This matches the architectural rule that a failed cross-service hook
must not invalidate an already committed business state. The TTL on
GM's in-process membership cache (default `30s`) bounds the staleness
window; the explicit hook only optimises for the time between commit
and TTL expiry.
Alternatives considered and rejected:
- two-phase commit across Lobby and GM — rejected: GM is allowed to be
unavailable without rolling back Lobby's roster mutation;
- queue the invalidation on a Redis Stream and let GM consume it
asynchronously — rejected for v1 because it introduces a new stream
contract for a rare event, and the synchronous post-commit call is
cheap enough that the staleness reduction beats the operational cost.
## Decision 3 — Keep `runtime:start_jobs` envelope shape unchanged
The `runtime:start_jobs` envelope continues to carry `image_ref` as a
top-level string field. Only the source of that string changes (from a
Lobby-side template substitution to a Lobby-side synchronous call into
GM). `Runtime Manager` does not need a contract change in this stage
and does not learn about engine versions — it still receives a
ready-to-pull Docker reference.
Alternatives considered and rejected:
- replace `image_ref` with `engine_version` and have RTM resolve the
image — rejected: it would force RTM to call GM, which violates the
rule that RTM has no upstream service dependencies for runtime
operations;
- attach the resolved version metadata to the envelope alongside
`image_ref` — rejected: RTM has no consumer for the metadata and
carrying it would invite divergence between Lobby and RTM views of
the engine version registry.
## References
- [`../PLAN.md` Stage 03](../PLAN.md)
- [`../README.md`](../README.md) — Game Master service description.
- [`../../lobby/README.md`](../../lobby/README.md) — updated Game Start
Flow, internal trusted REST, configuration, and error codes.
- [`../../game/README.md`](../../game/README.md) — admin path layout,
`StateResponse.finished`, `/admin/race/banish` shape.
- [`../../rtmanager/README.md`](../../rtmanager/README.md) —
`runtime:health_events` consumer note.
- [`../../notification/README.md`](../../notification/README.md) — GM as
the producer of the three `game.*` notification types.
---
stage: 06
title: Contract files and contract tests
---
# Stage 06 — Contract files and contract tests
This decision record captures the non-obvious choices made while
producing the machine-readable contracts for `Game Master`:
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml),
[`../api/runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml),
and the matching contract tests in the `gamemaster` package.
## Context
[`../PLAN.md` Stage 06](../PLAN.md) freezes the GM REST and event
contracts before any handler is written, so later stages have a target
spec. The plan enumerates the 20 internal REST `operationId` values and
the two `gm:lobby_events` message types and asks contract tests to
fail loudly if anything drifts.
Three decisions were not derivable from `../README.md` or
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) and required a
deliberate choice while writing the YAML.
## Decision 1 — Two messages and two send operations on one channel
`gm:lobby_events` carries two distinct message types — a recurring
`runtime_snapshot_update` and a terminal `game_finished`. The AsyncAPI
3.1.0 surface encodes them as **two separate messages on one channel
with one `send` operation per message**:
```yaml
channels:
lobbyEvents:
address: gm:lobby_events
messages:
runtimeSnapshotUpdate: { $ref: '#/components/messages/RuntimeSnapshotUpdate' }
gameFinished: { $ref: '#/components/messages/GameFinished' }
operations:
publishRuntimeSnapshotUpdate: { action: send, ... }
publishGameFinished: { action: send, ... }
```
The `notification:intents` contract uses a single message with
`allOf`-conditional discriminator branches; the `runtime:health_events`
contract uses a single message with a `oneOf` `details` field. Both
patterns work when most fields are shared and only one variant slot
differs.
For `gm:lobby_events` the two payloads share only `event_type`,
`game_id`, `runtime_status`, and `player_turn_stats[]`. The remaining
fields (`current_turn`, `engine_health_summary`, `occurred_at_ms` on
the snapshot vs `final_turn_number`, `finished_at_ms` on the finish
event) have no overlap, and their semantics differ — the snapshot is
recurring, the finish event is terminal. Two messages reflect this
asymmetry directly and keep each payload schema closed without
needing per-variant `if/then` rules.
Alternatives considered:
- **One message with `allOf` discriminator** — rejected: would force
every shared field to be optional at the envelope level and
re-required inside each `if/then` branch, doubling the schema size
and complicating the contract test. The notification spec accepts
this cost because it has 18 message types and the payload-shape
asymmetry is the whole point; here it's two types with no field
overlap.
- **Two channels** — rejected: would require Game Lobby to subscribe
to two streams, breaking the cadence guarantees in `../README.md`
§Async Stream Contracts ("snapshot transitions and finish are
ordered relative to each other on the same stream").
## Decision 2 — `event_type` is a required schema-level `const`
[`../PLAN.md` Stage 06](../PLAN.md) lists the "frozen field set per
message" without naming `event_type`. The implementation pins
`event_type` as a required schema property with a `const` value:
```yaml
RuntimeSnapshotUpdatePayload:
required: [event_type, ...]
properties:
event_type: { type: string, const: runtime_snapshot_update }
```
Reasons:
1. The wire payload must carry a discriminator; consumers (Game Lobby)
dispatch on `event_type` after `XREAD`. Omitting it from the schema
would require Game Master to inject the value at publish time
without spec backing.
2. `const` at the schema level lets the contract test assert the
discriminator value, which is the only meaningful check Stage 06
asks for ("`event_type` discriminator values"). Asserting only the
message component name without the on-wire `event_type` would not
protect consumers from a misconfigured publisher.
3. `rtmanager/api/runtime-health-asyncapi.yaml` already uses
`event_type` as a schema-level enum-typed discriminator; treating
`gm:lobby_events` the same way keeps the patterns consistent for a
reader cross-walking the two specs.
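The consumer-side reason can be sketched in Go; the `dispatch` helper is illustrative, not Lobby code — after `XREAD` only the payload field set is visible, so discrimination must come from `event_type` itself:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dispatch peels event_type off a raw stream entry and routes on it.
func dispatch(raw []byte) (string, error) {
	var head struct {
		EventType string `json:"event_type"`
	}
	if err := json.Unmarshal(raw, &head); err != nil {
		return "", err
	}
	switch head.EventType {
	case "runtime_snapshot_update", "game_finished":
		return head.EventType, nil
	default:
		return "", fmt.Errorf("unknown event_type %q", head.EventType)
	}
}

func main() {
	kind, _ := dispatch([]byte(`{"event_type":"game_finished","game_id":"g-1"}`))
	fmt.Println(kind)
}
```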
Alternatives considered:
- **Leave `event_type` out of the spec and produce it only at the
publish-side adapter** — rejected: hides the discriminator from the
contract test, which then cannot fail when the publisher renames or
drops it.
- **Encode discrimination through AsyncAPI message names alone**
(relying on `header.X-Message-Type` or similar) — rejected: Redis
Streams have no message-headers concept; everything travels in the
payload field set.
## Decision 3 — `additionalProperties: true` on engine pass-through schemas
Three internal REST operations forward engine-owned payloads without
modification:
- `internalExecuteCommands` → `POST /api/v1/command` on the engine
- `internalPutOrders` → `PUT /api/v1/order` on the engine
- `internalGetReport` → `GET /api/v1/report` on the engine
Their request and response bodies use `additionalProperties: true`:
```yaml
ExecuteCommandsRequest:
type: object
additionalProperties: true
required: [commands]
properties:
commands:
type: array
items: { type: object, additionalProperties: true }
```
Game Master does not own the shape of these payloads — `galaxy/game/openapi.yaml`
is the source of truth — and freezing them in the GM contract would
turn every engine-side schema bump into a coordinated GM release. The
same reasoning applies to `EngineVersion.options`, which is a
free-form `jsonb` document Game Master stores verbatim.
To prevent the open-by-default flag from spreading by accident, the
contract test
[`../contract_openapi_test.go`](../contract_openapi_test.go) maintains
two explicit allowlists:
- `gmOwnedClosedSchemas` — every schema for which Game Master owns
the wire shape; the test asserts each one closes with
`additionalProperties: false`.
- `engineOwnedPassthroughSchemas` — the five pass-through schemas
(request and response bodies of the three hot-path operations); the
test asserts each one keeps `additionalProperties: true`.
Adding a new GM schema requires registering it in
`gmOwnedClosedSchemas`; the test fails loudly if it isn't.
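The two-allowlist assertion can be sketched as follows; the map shape and helper are illustrative stand-ins for the real contract test:

```go
package main

import (
	"fmt"
	"slices"
)

// checkSchemas enforces that every schema appears in exactly one
// allowlist: GM-owned schemas must close with additionalProperties:
// false, pass-through schemas must keep additionalProperties: true,
// and an unregistered schema fails loudly.
func checkSchemas(additional map[string]bool, closed, passthrough []string) error {
	for name, open := range additional {
		switch {
		case slices.Contains(closed, name):
			if open {
				return fmt.Errorf("GM-owned schema %s must close with additionalProperties: false", name)
			}
		case slices.Contains(passthrough, name):
			if !open {
				return fmt.Errorf("pass-through schema %s must keep additionalProperties: true", name)
			}
		default:
			return fmt.Errorf("schema %s is registered in neither allowlist", name)
		}
	}
	return nil
}

func main() {
	err := checkSchemas(
		map[string]bool{"EngineVersion": false, "ExecuteCommandsRequest": true},
		[]string{"EngineVersion"},
		[]string{"ExecuteCommandsRequest"},
	)
	fmt.Println(err) // <nil>
}
```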
Alternatives considered:
- **Close the pass-through schemas with `additionalProperties: false`
and hand-mirror every engine field** — rejected: `galaxy/game` and
`galaxy/gamemaster` would have to release in lockstep; even cosmetic
field renames in the engine would break Edge Gateway routing.
- **Rely on a `// pass-through` comment in the YAML alone** — rejected:
comments do not survive automated reformatters and provide no
test-time signal.
## References
- [`../PLAN.md` Stage 06](../PLAN.md)
- [`../README.md` §Hot Path](../README.md), [`../README.md` §Async Stream Contracts](../README.md)
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml)
- [`../api/runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml)
- [`../contract_openapi_test.go`](../contract_openapi_test.go)
- [`../contract_asyncapi_test.go`](../contract_asyncapi_test.go)
- [`../../lobby/contract_openapi_test.go`](../../lobby/contract_openapi_test.go) — OpenAPI test pattern reused here.
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go) — YAML walker pattern reused here.
- [`../../rtmanager/api/runtime-health-asyncapi.yaml`](../../rtmanager/api/runtime-health-asyncapi.yaml) — `event_type` const precedent.
---
stage: 07
title: Notification catalog audit
---
# Stage 07 — Notification catalog audit
This decision record captures the audit outcome and the freeze-test
choice made for the GM-owned notification types
(`game.turn.ready`, `game.finished`, `game.generation_failed`).
## Context
[`../PLAN.md` Stage 07](../PLAN.md) asks for confirmation that the three
notification types `Game Master` will produce in Stage 15 are already
wired through the shared producer module
[`../../pkg/notificationintent/`](../../pkg/notificationintent/), the
`notification` service AsyncAPI contract
[`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml),
and the catalog freeze in
[`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go).
The stage is described as "no-op or minor": edits land elsewhere only if
the audit finds drift.
The producer-side surface is consumed in Stage 15 by
`gamemaster/internal/adapters/notificationpublisher/`; this stage locks
the contract before the publisher is implemented.
## Audit outcome — no drift
Each artefact already matches the `Game Master` notification table at
[`../README.md` §Notification Contracts](../README.md):
- [`../../pkg/notificationintent/intent.go`](../../pkg/notificationintent/intent.go)
declares `NotificationTypeGameTurnReady`, `NotificationTypeGameFinished`,
`NotificationTypeGameGenerationFailed`; `ExpectedProducer` maps the
three to `ProducerGameMaster`; `SupportsAudience` and `SupportsChannel`
encode `user + (push|email)` for the first two and `admin_email + email`
for the failure type.
- [`../../pkg/notificationintent/payloads.go`](../../pkg/notificationintent/payloads.go)
defines `GameTurnReadyPayload`, `GameFinishedPayload`,
`GameGenerationFailedPayload` with the exact field set required by the
README table, and exposes `NewGameTurnReadyIntent`,
`NewGameFinishedIntent`, `NewGameGenerationFailedIntent`. The
user-targeted constructors take `recipientUserIDs`; the admin-email
constructor does not.
- [`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml)
carries the three values in the `notification_type` enum, declares
one `if/then` branch each on the envelope, and defines the
`GameTurnReadyPayload`, `GameFinishedPayload`,
`GameGenerationFailedPayload` schemas with the per-type required
fields.
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go)
freezes the three types inside `expectedNotificationCatalog` and
exercises them through `TestIntentAsyncAPISpecFreezesNotificationCatalogBranches`
and `TestNotificationCatalogDocsStayInSync`.
There is no separate "catalog data table" inside `notification/internal/`:
the routing decisions live in `pkg/notificationintent/intent.go` and are
shared by every producer and by the notification service itself.
Consequently no edits to
`notification/api/intents-asyncapi.yaml`,
`notification/internal/...`, or
`notification/contract_asyncapi_test.go` are required by this stage.
## Decision — producer-side compile-time freeze in addition to the YAML freeze
[`../notificationintent_audit_test.go`](../notificationintent_audit_test.go)
imports `galaxy/notificationintent` from inside the `gamemaster`
package. Because the test names every constant, constructor, and
payload struct field directly, any rename or removal in
`pkg/notificationintent` breaks `go build ./gamemaster/...` before the
test even runs. At runtime the test additionally asserts:
- the wire value of every `NotificationType` constant
(`game.turn.ready`, `game.finished`, `game.generation_failed`);
- the `Producer`, `AudienceKind`, recipient handling, and `Validate()`
outcome of the constructed intent;
- the on-wire field names through `Contains` checks against
`Intent.PayloadJSON` (catches a JSON tag rename even when the Go
struct field name stays);
- the audience/channel matrix via `SupportsAudience` and
`SupportsChannel`.
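A minimal sketch of the compile-time half of this freeze; the local consts stand in for the real `pkg/notificationintent` import, which is not reproduced here:

```go
package main

import "fmt"

// Because the audit test names the constants directly, a rename in
// pkg/notificationintent breaks the build before any assertion runs.
// The wire values below are the three from this record.
const (
	NotificationTypeGameTurnReady        = "game.turn.ready"
	NotificationTypeGameFinished         = "game.finished"
	NotificationTypeGameGenerationFailed = "game.generation_failed"
)

func main() {
	// The runtime half of the test then pins the wire values.
	for _, v := range []string{
		NotificationTypeGameTurnReady,
		NotificationTypeGameFinished,
		NotificationTypeGameGenerationFailed,
	} {
		fmt.Println(v)
	}
}
```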
Reasons for adding this in addition to the YAML freeze in
`notification/contract_asyncapi_test.go`:
1. The YAML freeze runs in the `notification` module. A drift in
`pkg/notificationintent` that is *consistent* with a drift in
`notification/api/intents-asyncapi.yaml` would still be caught, but
the failure surface is on the consumer side, not the producer side.
The GM-side test fails first and points the engineer at the producer
they own.
2. The test binds the contract at compile time. A field rename in
   `pkg/notificationintent/payloads.go` cannot land without breaking the
   build of `gamemaster/notificationintent_audit_test.go`, even before
   `go test` runs.
3. Stage 15 will introduce a publisher adapter that calls the same
constructors. Locking the constructor signatures here removes one
class of churn from that stage — the test serves as a contract
reference that the adapter has to satisfy.
Alternatives considered:
- **YAML re-parse in `gamemaster/`** — rejected: would duplicate the
walker logic already present in
`notification/contract_asyncapi_test.go` and bind the GM module to
the YAML file path through a relative `../notification/` reference.
The Go-import test catches the relevant drift class with no
cross-module file lookups.
- **No GM-side test, rely on the YAML freeze alone** — rejected:
Stage 07's exit criterion is "the freeze test passes", which the
PLAN explicitly anchors to a new file under `gamemaster/`. The YAML
freeze alone would also miss a Go-side rename that the test author
forgot to mirror in the YAML in the same change.
## References
- [`../PLAN.md` Stage 07](../PLAN.md)
- [`../README.md` §Notification Contracts](../README.md)
- [`../notificationintent_audit_test.go`](../notificationintent_audit_test.go)
- [`../../pkg/notificationintent/intent.go`](../../pkg/notificationintent/intent.go)
- [`../../pkg/notificationintent/payloads.go`](../../pkg/notificationintent/payloads.go)
- [`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml)
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go) — YAML-level catalog freeze.
---
stage: 08
title: Module skeleton
---
# Stage 08 — GM module skeleton
This decision record captures the wiring choices made when bootstrapping
the runnable `gamemaster` binary on top of the contracts and freeze
tests landed by Stages 01–07.
## Context
[`../PLAN.md` Stage 08](../PLAN.md) calls for a buildable `gamemaster`
process that loads its environment-driven configuration, opens
PostgreSQL and Redis pools, installs the OpenTelemetry runtime, exposes
`/healthz` and `/readyz` on the trusted internal HTTP listener, and
exits cleanly on `SIGTERM` within `GAMEMASTER_SHUTDOWN_TIMEOUT`. No
business endpoints, no workers, and no persistence stores yet.
The reference implementation is `rtmanager`, the most recently landed
Galaxy service that follows the platform-wide skeleton conventions
(layered `cmd / internal/{app, api, config, logging, telemetry}`,
`app.Component` lifecycle, OpenTelemetry runtime with deferred
observable gauges, fail-fast environment loader). Stage 08 mirrors that
skeleton with two deliberate divergences described below.
## Decisions
### 1. `go.mod` scope is minimal at Stage 08
Only modules actually imported by Stage 08 code land in
[`../go.mod`](../go.mod):
- `galaxy/postgres`, `galaxy/redisconn`, `galaxy/notificationintent`
(the last one was already present from the Stage 07 freeze test);
- the OpenTelemetry stack (`otel`, `metric`, `trace`, `sdk`,
`sdk/metric`, OTLP exporters for traces and metrics over gRPC and
HTTP, stdout exporters);
- `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp`;
- `github.com/redis/go-redis/v9` (promoted from indirect to direct);
- `github.com/jackc/pgx/v5` (transitive via `pkg/postgres`).
PLAN-listed modules that arrive with later consumers (`go-jet/jet/v2`,
`pressly/goose/v3`, the testcontainers modules, `go.uber.org/mock`,
`galaxy/cronutil`, `galaxy/error`, `galaxy/util`) are deliberately left
out of Stage 08's `go.mod`. They join the module together with their
first consumers in Stages 09 / 10 / 11 / 12.
Reasoning: keeping `go mod tidy` honest at every stage is cheaper than
pre-declaring blank-import stubs. The PLAN's full list is the eventual
shape of the module across the series, not a Stage 08 contract.
### 2. `ShutdownTimeout` lives at the top level of `Config`
The README §Configuration groups one variable —
`GAMEMASTER_SHUTDOWN_TIMEOUT` — under a documentation group called
"Lifecycle". The Go struct does not split that single field into a
substruct: `Config.ShutdownTimeout` mirrors the
`rtmanager.Config.ShutdownTimeout` shape so the two services stay
isomorphic. The "Lifecycle" group remains a documentation grouping in
[`../README.md`](../README.md) only.
### 3. Telemetry — counters and histograms now, observable gauges later
`internal/telemetry/runtime.go` registers every counter and histogram
listed under [`../README.md` §Observability](../README.md) at process
start (`buildRuntime`). The three observable gauges
(`gamemaster.runtime_records_by_status`,
`gamemaster.scheduler.due_games`, `gamemaster.engine_versions_total`)
are declared up front but their callbacks are installed via a deferred
`Runtime.RegisterGauges(deps)` call. The wiring layer at Stages 11 / 14
/ 15 supplies the probes (per-status row count, due-now scheduler
count, registered engine versions) once the persistence stores and the
scheduler exist.
This matches the `rtmanager` pattern where
`runtime_records_by_status` is registered through an analogous
`RegisterGauges` plumbing.
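The deferred-gauge plumbing can be sketched with a hand-rolled registry; the real code wires this through OpenTelemetry observable-gauge callbacks instead, and the type and probe name here are illustrative:

```go
package main

import "fmt"

// Runtime declares instrument names at buildRuntime time; the probe
// callbacks arrive later via RegisterGauges once the stores exist.
type Runtime struct {
	probes map[string]func() int64
}

func buildRuntime() *Runtime {
	return &Runtime{probes: map[string]func() int64{
		"gamemaster.scheduler.due_games": nil, // declared, no callback yet
	}}
}

// RegisterGauges installs the probes deferred to Stages 11 / 14 / 15.
func (r *Runtime) RegisterGauges(dueGames func() int64) {
	r.probes["gamemaster.scheduler.due_games"] = dueGames
}

func main() {
	rt := buildRuntime()
	rt.RegisterGauges(func() int64 { return 3 })
	fmt.Println(rt.probes["gamemaster.scheduler.due_games"]()) // 3
}
```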
### 4. PostgreSQL migrations are deferred to Stage 09
The README §Startup dependencies states "Embedded goose migrations
apply synchronously before any listener opens." Stage 08 opens,
instruments, and pings the PostgreSQL pool but **does not** call
`postgres.RunMigrations`. The migrations package
(`internal/adapters/postgres/migrations/`) is shipped by Stage 09; the
runtime adds the one-line `RunMigrations` call at that stage.
Until then, the runtime is buildable, listener-ready, and serves
`/healthz` + `/readyz` against a fresh PostgreSQL pool with no schema
applied. This is acceptable because Stage 08 ships no business handlers
and no workers; nothing reads or writes `gamemaster.*` tables yet.
### 5. Makefile mirrors `rtmanager`
[`../Makefile`](../Makefile) declares `jet`, `mocks`, `integration`
targets identical in shape to `rtmanager/Makefile`. The `jet` target
runs `go run ./cmd/jetgen`; the binary lands in Stage 09. The `mocks`
target runs `go generate ./internal/ports/...
./internal/api/internalhttp/handlers/...`; the `//go:generate`
directives land in Stages 10 / 12 / 19. Both targets fail until their
prerequisites land — accepted because Stage 08 does not require either
to succeed; only `go build` and `go test ./gamemaster/...` matter.
### 6. No Docker dependency
`Game Master` is forbidden from importing the Docker SDK
([`../README.md` §Non-Goals](../README.md)). The skeleton therefore
drops the `newDockerClient` / `pingDocker` helpers from
`internal/app/bootstrap.go` and the Docker-related fields from
`internal/app/wiring.go`. The readiness probe pings PostgreSQL and
Redis only.
## Files landed
- `cmd/gamemaster/main.go` — process entrypoint.
- `internal/config/{config.go, env.go, validation.go, config_test.go}`
  — GAMEMASTER-prefixed env loader plus required-vars fail-fast.
- `internal/logging/{logger.go, context.go}` — slog JSON-stdout logger
with request id and span id helpers.
- `internal/telemetry/{runtime.go, runtime_test.go}` — OpenTelemetry
runtime, instruments listed in §Observability, deferred gauge
plumbing.
- `internal/api/internalhttp/{server.go, server_test.go}` — `/healthz`
and `/readyz` listener with observability middleware.
- `internal/app/{app.go, app_test.go, bootstrap.go, runtime.go,
wiring.go}` — process lifecycle (component supervisor + reverse-order
cleanup), Redis bootstrap helpers, minimal placeholder wiring.
- `Makefile` — `jet`, `mocks`, `integration` target stubs.
- Updated `go.mod` / `go.sum` with the dependencies and replace
directives for `galaxy/postgres` and `galaxy/redisconn`.
## Verification
- `go build ./gamemaster/...` succeeds.
- `go test ./gamemaster/...` passes (existing contract / freeze tests
plus the four new test files).
- Manual smoke against a local Postgres + Redis confirms:
`/healthz` returns `200 ok`, `/readyz` returns `200 ready` while both
dependencies respond, and `503 service_unavailable` once one of them
is brought down.
- `SIGTERM` ends the process within `GAMEMASTER_SHUTDOWN_TIMEOUT`,
releasing PostgreSQL pool, Redis client, and telemetry providers in
reverse construction order.
---
stage: 09
title: PostgreSQL schema, migrations, jet
---
# Stage 09 — PostgreSQL schema, migrations, jet
This decision record captures the schema and code-generation pipeline
landed for Game Master at PLAN Stage 09. It is a service-local mirror
of [`../../rtmanager/docs/postgres-migration.md`](../../rtmanager/docs/postgres-migration.md)
but only documents the decisions specific to Stage 09; the stage-24
[`postgres-migration.md`](postgres-migration.md) reorganisation will
later subsume and supersede this record.
## Context
[`../PLAN.md` Stage 09](../PLAN.md) finalises the persistence schema
and the code-generation pipeline. Stage 08 already opens, instruments,
and pings the PostgreSQL pool but does not apply any migrations. The
durable surface for runtime state, engine version registry, player
mappings, and the audit log is described in
[`../README.md` §Persistence Layout](../README.md). Stage 09 ships:
- `internal/adapters/postgres/migrations/00001_init.sql` plus the
matching embed package;
- `cmd/jetgen` — a testcontainers-driven regeneration pipeline for
the go-jet/v2 query builder code;
- the generated jet code under
`internal/adapters/postgres/jet/gamemaster/{model,table}/`,
committed verbatim;
- the `postgres.RunMigrations` call in `internal/app/runtime.go`,
applied after the PostgreSQL pool ping and before any listener is
built.
The reference precedent is `rtmanager`, the most recently landed
PG-backed service in the workspace.
## Decisions
### 1. Schema and role provisioning are excluded from `00001_init.sql`
**Decision.** The `gamemaster` schema and the matching
`gamemasterservice` role are created outside the migration sequence
(in tests by `provisionRoleAndSchema` in
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go); in production by an
ops init script not in scope for this stage). The embedded migration
`00001_init.sql` only
contains DDL for the four service-owned tables and indexes and assumes
it runs as the schema owner with `search_path=gamemaster`.
**Why.** [`../../ARCHITECTURE.md` §Database topology](../../ARCHITECTURE.md)
mandates that each service connects with its own role whose grants are
restricted to its own schema. Mixing role creation, schema creation,
and table DDL into one script forces the migration to run as a
superuser on every replica boot and effectively relaxes the per-service
role boundary. The `rtmanager` precedent settled on the split first;
GM follows it for the same architectural reason. This is a deliberate
deviation from PLAN Stage 09's literal `CREATE SCHEMA IF NOT EXISTS
gamemaster;` instruction, called out in the comment header at the top
of `00001_init.sql`.
### 2. Natural primary keys mirror the platform identifiers
**Decision.** Every PK is a natural identifier already owned by another
component:
- `runtime_records.game_id` — Lobby's platform identifier;
- `engine_versions.version` — semver string from the registry;
- `player_mappings (game_id, user_id)` — composite, both columns owned
by Lobby/User Service;
- `operation_log.id``bigserial`, the only synthetic PK because the
audit table has no natural identity per row.
**Why.** The same reasoning as in
[`../../rtmanager/docs/postgres-migration.md` §2](../../rtmanager/docs/postgres-migration.md)
applies: surrogate keys would force every cross-service join through a
lookup table, while the natural keys keep the persistence layer
pin-compatible with the contracts (every `register-runtime` envelope
already names `game_id`, every Lobby resolve names `version`, every
player command names `user_id`).
### 3. Defense-in-depth CHECK constraints on every status enum
**Decision.** Five CHECK constraints reproduce the Go-level enums in
the schema:
- `runtime_records_status_chk` — seven runtime statuses
(`starting`, `running`, `generation_in_progress`, `generation_failed`,
`stopped`, `engine_unreachable`, `finished`);
- `engine_versions_status_chk``active | deprecated`;
- `operation_log_op_kind_chk` — nine operation kinds
(`register_runtime`, `turn_generation`, `force_next_turn`, `banish`,
`stop`, `patch`, `engine_version_create`, `engine_version_update`,
`engine_version_deprecate`);
- `operation_log_op_source_chk` — three op sources
(`gateway_player`, `lobby_internal`, `admin_rest`);
- `operation_log_outcome_chk``success | failure`.
The Go-level enums in the domain layer (added in Stage 10) remain the
source of truth for application code.
**Why.** The same defense-in-depth argument as for `rtmanager`: the
storage boundary catches an adapter regression that would otherwise
persist an unexpected string. Operator-side queries (`SELECT … WHERE
op_kind = 'patch'`) benefit from the enum being verifiable directly in
psql without consulting the Go source. PostgreSQL's `CREATE TYPE … AS
ENUM` was rejected because adding values to a PG enum type requires
`ALTER TYPE` outside a transaction and complicates the single-init
pre-launch policy (decision §6).
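As a hedged illustration of the pattern (not the shipped DDL; the authoritative text is `00001_init.sql`, and the column set here is trimmed), a status CHECK of this shape keeps the column an ordinary `text` that is verifiable straight from psql:

```sql
-- Illustrative sketch only: real columns and DDL live in 00001_init.sql.
CREATE TABLE runtime_records (
    game_id text PRIMARY KEY,
    status  text NOT NULL,
    CONSTRAINT runtime_records_status_chk CHECK (status IN (
        'starting', 'running', 'generation_in_progress',
        'generation_failed', 'stopped', 'engine_unreachable', 'finished'
    ))
);
```

Swapping such a constraint later is an ordinary transactional DDL change (`ALTER TABLE … DROP CONSTRAINT` plus `ADD CONSTRAINT`), which is the property the decision relies on.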
### 4. Indexes derive from concrete query shapes
**Decision.** Three secondary indexes ship with `00001_init.sql`:
- `runtime_records (status, next_generation_at)` — drives the
scheduler ticker scan
(`WHERE status='running' AND next_generation_at <= now()` once per
second);
- `player_mappings (game_id, race_name)` UNIQUE — enforces the
one-race-per-game invariant at the storage boundary;
- `operation_log (game_id, started_at DESC)` — drives audit reads
ordered by recency.
The README §Persistence Layout list also mentions `player_mappings
(game_id)`, which is intentionally **not** added: the composite
primary key on `(game_id, user_id)` already serves as a leftmost-prefix
index for `WHERE game_id = $1`, and a one-column duplicate would only
double the write cost for no plan-stability gain. The README's
indexes list is corrected in the same patch to drop the redundant
entry.
**Why.** Each remaining index has a single concrete read shape behind
it. The composite ordering on `(status, next_generation_at)` lets the
planner satisfy the scheduler scan with one index sweep. The descending
ordering on `(game_id, started_at DESC)` matches the
`ListByGame ORDER BY started_at DESC` shape already established by
`rtmanager.operationlogstore.ListByGame`.
### 5. `next_generation_at` is nullable
**Decision.** `runtime_records.next_generation_at timestamptz` admits
NULL; `runtime_records.skip_next_tick boolean NOT NULL DEFAULT false`
does not.
**Why.** A row enters the table at register-runtime with
`status='starting'` and no scheduled tick yet — the tick is only
computed once the engine `/admin/init` succeeds and the CAS flips the
status to `running`. NULL captures «no tick scheduled» without forcing
a sentinel value into the column. The scheduler index
`(status, next_generation_at)` still works correctly: the predicate
`next_generation_at <= now()` evaluates to NULL for NULL inputs, which
`WHERE` treats as not true, so PG excludes those rows from the result
set, which is the desired behaviour. `skip_next_tick` is a boolean
knob set or cleared by the
force-next-turn flow; NULL would be a third state with no semantic, so
the column is NOT NULL with a `false` default.
### 6. Single-init pre-launch policy applies as documented
**Decision.** `00001_init.sql` evolves in place until first production
deploy. Adding a column, an index, or a new table during the
pre-launch development window edits this file directly rather than
producing `00002_*.sql`. The runtime applies the migration on every
boot; if the schema is already at head, `pkg/postgres`'s goose
adapter exits zero.
**Why.** The schema-per-service architectural rule
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
endorses a single-init policy for pre-launch services. The pre-launch
window allows non-additive changes (column rename, type narrowing,
CHECK tightening) that a multi-step migration sequence would force into
awkward two-step rewrites. Once the service ships to production, the
next schema change becomes `00002_*.sql` and the policy lifts.
### 7. `cmd/jetgen` is a one-to-one mirror of `rtmanager/cmd/jetgen`
**Decision.** [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) follows
the same shape as
[`../../rtmanager/cmd/jetgen/main.go`](../../rtmanager/cmd/jetgen/main.go):
spin a `postgres:16-alpine` testcontainer, open it as superuser,
provision the role and schema, open a second pool with
`search_path=gamemaster`, apply the embedded goose migrations, then
invoke `github.com/go-jet/jet/v2/generator/postgres.GenerateDB` with
schema=gamemaster. Constants differ (`gamemasterservice`,
`gamemaster`, `galaxy_gamemaster`) but the algorithm and helper shape
are intentionally identical.
**Why.** Two PG-backed services should not diverge on a dev-only code
generator that nothing else in the workspace relies on. Mirroring
`rtmanager` keeps `make -C <service> jet` interchangeable for
operators and minimises the cognitive overhead of moving between
services.
### 8. Generated jet code is committed
**Decision.** The output of `make -C gamemaster jet` lands under
[`../internal/adapters/postgres/jet/gamemaster/{model,table}/`](../internal/adapters/postgres/jet/gamemaster)
and is committed verbatim.
**Why.** `go build ./...` from the repository root must work without
Docker; CI runners and contributor machines without a local Docker
daemon must still pass `go test ./gamemaster/...` for the non-PG-store
parts of the module. The generation pipeline itself remains available
behind `make jet` for everyone who wants to regenerate.
### 9. Migrations apply synchronously before any listener opens
**Decision.** [`../internal/app/runtime.go`](../internal/app/runtime.go)
calls `postgres.RunMigrations(ctx, pgPool, migrations.FS(), ".")`
immediately after the `postgres.Ping` succeeds and before
`newWiring`/`internalhttp.NewServer` are constructed. A non-zero exit
on migration failure follows the `pkg/postgres` policy.
**Why.** [`../README.md` §Startup dependencies](../README.md)
specifies that «embedded goose migrations apply synchronously before
any listener opens». Repeated process boots against a head schema
return goose's «no work to do» success — this is how the policy stays
operationally cheap, since a freshly-spawned replica re-applies the
same `00001_init.sql` with no work and proceeds straight to opening
its listeners.
## Files landed
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— full schema for the four service tables plus indexes and CHECK
constraints.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
`//go:embed *.sql` and `FS()` exporter.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) — testcontainers +
goose + jet pipeline.
- [`../internal/adapters/postgres/jet/gamemaster/`](../internal/adapters/postgres/jet/gamemaster)
— generated model and table packages.
- [`../internal/app/runtime.go`](../internal/app/runtime.go) — wired
`postgres.RunMigrations` call after the pool ping.
- [`../Makefile`](../Makefile) — refreshed `jet` target comment now
that the pipeline is real.
- [`../go.mod`](../go.mod), [`../go.sum`](../go.sum) — promoted
`github.com/go-jet/jet/v2`, `github.com/testcontainers/testcontainers-go`,
and `github.com/testcontainers/testcontainers-go/modules/postgres`
to direct dependencies.
- [`../README.md`](../README.md) — corrected §Persistence Layout
indexes list (dropped redundant `player_mappings (game_id)` entry)
and added a §References pointer to this record.
## Verification
- `cd gamemaster && go mod tidy` — no missing dependency, no
superfluous indirect.
- `make -C gamemaster jet` — bring up `postgres:16-alpine`, apply
`00001_init.sql`, regenerate `internal/adapters/postgres/jet/...`;
`git status` is clean after a second run.
- `go build ./gamemaster/...` succeeds (including the generated jet
code).
- `go test ./gamemaster/...` passes — existing contract, freeze, and
config/telemetry/HTTP tests are unaffected.
- Manual smoke against a local PostgreSQL with an empty `gamemaster`
schema and a `gamemasterservice` role: the process applies the
migration, `/readyz` returns `200`, and a second boot exits zero on
the «no work to do» path.
---
stage: 10
title: Domain layer and ports
---
# Stage 10 — Domain layer and ports
This decision record captures the non-obvious choices made while
introducing the in-memory domain model and port interfaces of Game
Master at PLAN Stage 10.
## Context
[`../PLAN.md` Stage 10](../PLAN.md) freezes the domain types and the
port surfaces that adapters (Stage 11/12), services (Stages 13–17), and
workers (Stage 18) will adopt. No adapter or service code lands here;
the stage exists so every consumer of these types in later stages can
import a stable contract.
The reference precedent is `rtmanager`, the most recently landed
PG-backed service. Its
[`internal/domain/`](../../rtmanager/internal/domain) and
[`internal/ports/`](../../rtmanager/internal/ports) directories define
the shape every Stage 10 file follows: `Status string` enums with
`IsKnown` / `AllStatuses`; `*InvalidTransitionError` wrapping
`ErrInvalidTransition`; transition tables keyed by `(from, to)` pairs;
input structs with `Validate()` methods on every store mutation.
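The enum-and-transition-table shape referenced above can be sketched as follows (the identifiers and the trimmed three-status set are illustrative; the real declarations live under `internal/domain/runtime`):

```go
package main

import "fmt"

// Status mirrors the `Status string` enum pattern described above.
type Status string

const (
	StatusStarting Status = "starting"
	StatusRunning  Status = "running"
	StatusStopped  Status = "stopped"
)

// AllStatuses returns every known status, in declaration order.
func AllStatuses() []Status {
	return []Status{StatusStarting, StatusRunning, StatusStopped}
}

// IsKnown reports whether s is one of the declared statuses.
func (s Status) IsKnown() bool {
	for _, known := range AllStatuses() {
		if s == known {
			return true
		}
	}
	return false
}

// transitions is the table keyed by (from, to) pairs; absence of a key
// means the edge is forbidden.
var transitions = map[[2]Status]bool{
	{StatusStarting, StatusRunning}: true,
	{StatusRunning, StatusStopped}:  true,
}

// CanTransition reports whether from -> to is a declared edge.
func CanTransition(from, to Status) bool {
	return transitions[[2]Status{from, to}]
}

func main() {
	fmt.Println(Status("running").IsKnown())                 // true
	fmt.Println(CanTransition(StatusStopped, StatusRunning)) // false
}
```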
Six decisions deviate from a direct copy of `rtmanager` or extend the
literal task list of PLAN Stage 10. Each is recorded below.
## Decisions
### 1. `internal/domain/operation/` is added beyond the literal task list
**Decision.** Stage 10 ships
[`internal/domain/operation/log.go`](../internal/domain/operation/log.go)
with `OperationEntry`, `OpKind`, `OpSource`, and `Outcome` types even
though PLAN Stage 10's bullet list does not enumerate them.
**Why.** The Stage 09
[`00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
schema already declares CHECK constraints on `op_kind`, `op_source`,
and `outcome`. The
[`ports/operationlog.go`](../internal/ports/operationlog.go) interface
returns and accepts an `OperationEntry` parameter, which must therefore
live in the domain layer or be redefined inside `ports`. The
`rtmanager` precedent
([`rtmanager/internal/domain/operation/log.go`](../../rtmanager/internal/domain/operation/log.go))
treats it as a domain package; mirroring that keeps Game Master's layout
recognisable and lets later service code import a single canonical
type. The alternative (defining the type on the port file) would
duplicate the SQL CHECK enums in two places once Stage 11's adapter
ships and would force every service-layer caller to import the port
package for what is structurally a value type.
### 2. `Membership` lives on `ports/lobbyclient.go`, not in the domain
**Decision.** The DTO consumed by `LobbyClient.GetMemberships` is
declared inside
[`ports/lobbyclient.go`](../internal/ports/lobbyclient.go) rather than a
new `internal/domain/membership/` package.
**Why.** Game Master does not own membership state — Game Lobby does
([`../../ARCHITECTURE.md` §Membership rules](../../ARCHITECTURE.md)).
Anything GM holds about membership is a remote projection used solely
for hot-path authorisation. Treating it as a port-level DTO matches
`rtmanager`'s precedent for cross-service projections
([`rtmanager/internal/ports/lobbyinternal.go:LobbyGameRecord`](../../rtmanager/internal/ports/lobbyinternal.go))
and keeps the domain layer free of types that GM does not author.
Promoting it to a domain package later costs nothing if a real
GM-owned invariant ever attaches to it, but the v1 surface has none.
### 3. `EngineVersion.Options` is `[]byte`, not `map[string]any`
**Decision.**
[`engineversion.EngineVersion.Options`](../internal/domain/engineversion/model.go)
is declared as `[]byte` carrying the raw `jsonb` document.
**Why.** The OpenAPI contract
([`../api/internal-openapi.yaml`](../api/internal-openapi.yaml)) marks
`EngineVersion.options` as `additionalProperties: true` — the engine
owns the schema, GM is a pass-through registry. A `map[string]any` Go
field would encourage callers to introspect or mutate keys, breaking
that pass-through guarantee. `[]byte` matches how `rtmanager` keeps
`Details json.RawMessage` on health snapshots
([`rtmanager/internal/domain/health/snapshot.go`](../../rtmanager/internal/domain/health/snapshot.go))
for the same reason. Schema-aware handling can introduce a typed shape
in a future iteration without disturbing existing rows.
### 4. `Schedule.Next(after, skip)` returns `skipConsumed`, not mutated state
**Decision.** The wrapper at
[`internal/domain/schedule/nexttick.go`](../internal/domain/schedule/nexttick.go)
exposes `Next(after time.Time, skip bool) (time.Time, bool)`. The
boolean return reports whether the skip flag was consumed; the wrapper
itself stores no state.
**Why.** Persisting `skip_next_tick=false` is a column update on the
`runtime_records` row and belongs to the service layer (Stage 15),
together with the `next_generation_at` write. Encapsulating that
mutation inside the schedule wrapper would couple a pure value type to
the store; the boolean return keeps the wrapper trivially testable and
lets the caller (service layer) issue the column update via an
existing `UpdateScheduling` port call.
### 5. The transition table includes `engine_unreachable → running`
**Decision.** The runtime transitions map
([`internal/domain/runtime/transitions.go`](../internal/domain/runtime/transitions.go))
permits `engine_unreachable → running` even though Stage 10's task
list does not introduce a producer for that edge.
**Why.** The Stage 18
([`../PLAN.md` Stage 18](../PLAN.md)) health-events consumer must be
able to recover an engine that previously appeared unreachable when a
subsequent health observation reports `healthy`. Declaring the edge in
Stage 10 means Stage 18 needs no transitions.go edit — the consumer
calls `UpdateStatus` with the existing CAS guard. The alternative
(wait until Stage 18 to add the edge) would couple two unrelated
stages and force a domain-level edit during a worker stage.
### 6. mockgen directives target `internal/adapters/mocks/` (deferred)
**Decision.** Every port file carries a
`//go:generate go run go.uber.org/mock/mockgen
-destination=../adapters/mocks/mock_<file>.go -package=mocks
galaxy/gamemaster/internal/ports <Interface>` directive even though
the destination directory does not exist yet.
**Why.** Stage 12 ships the
[`internal/adapters/mocks/`](../internal/adapters/mocks) directory and
the first regeneration of `make mocks`. Putting the directives in
place during Stage 10 means Stage 12 only adds the directory and the
generated files; no port file has to be edited then. The directives
are inert until the destination directory exists; running
`go generate ./internal/ports/...` before Stage 12 is expected to
fail. The
[`Makefile`](../Makefile)'s `mocks` target already references the
directives, matching the lobby and rtmanager pattern
([`../../lobby/internal/ports/gmclient.go`](../../lobby/internal/ports/gmclient.go),
[`../../rtmanager/internal/ports/dockerclient.go`](../../rtmanager/internal/ports/dockerclient.go)).
## Files landed
- [`../internal/domain/runtime/{model,errors,transitions}.go`](../internal/domain/runtime)
with seven-status enum, `RuntimeRecord` struct, and the transition
table from PLAN Stage 10 plus decision §5.
- [`../internal/domain/engineversion/{model,semver}.go`](../internal/domain/engineversion)
with the registry status enum, `EngineVersion` struct, and the
`ParseSemver` / `IsPatchUpgrade` helpers.
- [`../internal/domain/playermapping/model.go`](../internal/domain/playermapping/model.go)
carrying the (game_id, user_id) → race_name + engine_player_uuid
projection.
- [`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
per decision §1.
- [`../internal/domain/schedule/nexttick.go`](../internal/domain/schedule/nexttick.go)
per decision §4.
- Ten port files under
[`../internal/ports/`](../internal/ports) covering the runtime
record, engine version, player mapping, operation log, stream
offset, engine, lobby, runtime manager, notification publisher, and
lobby events surfaces.
- Unit tests next to every source file; the suite covers status
enums, transition matrix, validators, semver normalisation, and
schedule skip semantics.
- [`../go.mod`](../go.mod) gains direct dependencies on
`galaxy/cronutil` and `golang.org/x/mod` for the schedule wrapper
and the semver helpers.
## Verification
- `cd gamemaster && go build ./...` — clean.
- `cd gamemaster && go test ./internal/domain/... ./internal/ports/...`
— green; transition matrix exhaustively asserts every allowed and
forbidden pair, semver parser rejects shortened forms, schedule
wrapper honours both `skip` modes.
- `cd gamemaster && go vet ./internal/...` — clean.
- `gofmt -l gamemaster/internal` — empty.
- Stage 09 contract tests
([`../contract_openapi_test.go`](../contract_openapi_test.go),
[`../contract_asyncapi_test.go`](../contract_asyncapi_test.go),
[`../notificationintent_audit_test.go`](../notificationintent_audit_test.go))
remain green; Stage 10 introduces no contract changes.
---
stage: 11
title: Persistence adapters
---
# Stage 11 — Persistence adapters
This decision record captures the non-obvious choices made while
implementing the four PostgreSQL stores and the Redis offset store of
Game Master at PLAN Stage 11.
## Context
[`../PLAN.md` Stage 11](../PLAN.md) ships the persistence layer that
the service-layer stages (13-17) and the worker stage (18) consume.
Stage 09 already shipped the schema, embedded migration, and the
generated jet code; Stage 10 fixed the domain types and the port
interfaces. Stage 11 plugs concrete adapters into those ports.
The reference precedent is `rtmanager`, the most recently landed
PG-backed service. Its
[`internal/adapters/postgres/`](../../rtmanager/internal/adapters/postgres)
and
[`internal/adapters/redisstate/`](../../rtmanager/internal/adapters/redisstate)
trees define the shape every Stage 11 file follows: per-store package
under `postgres/<store>/store.go`, helper packages under
`internal/sqlx` and `internal/pgtest`, `Config`/`Store`/`New` triple,
ColumnList-driven canonical SELECTs, `sqlx.WithTimeout`/`sqlx.IsNoRows`/
`sqlx.IsUniqueViolation` shared boundary helpers.
Eight decisions either deviate from a literal copy of `rtmanager` or
extend the literal task list of PLAN Stage 11. Each is recorded below.
## Decisions
### 1. `internal/sqlx` and `internal/pgtest` are local clones, not a shared module
**Decision.**
[`internal/adapters/postgres/internal/sqlx/sqlx.go`](../internal/adapters/postgres/internal/sqlx/sqlx.go)
and
[`internal/adapters/postgres/internal/pgtest/pgtest.go`](../internal/adapters/postgres/internal/pgtest/pgtest.go)
are full copies of `rtmanager`'s sibling files, with the few constants
that name the schema and role (`gamemaster`, `gamemasterservice`,
`galaxy_gamemaster`) replaced verbatim.
**Why.** Each PG-backed service owns its own role, schema, and
migration FS. Promoting these helpers into `pkg/postgres` would force
that package to either know about every schema or take them as
configuration; either path adds surface area for a runtime helper that
already covers exactly one boundary. The `rtmanager` precedent settled
on the per-service clone first and Game Master mirrors it for the
same architectural reason. The duplication cost is small (≈250 lines
total, mechanical) and the alternative would couple services through a
testing concern that has no business in production code.
### 2. CAS via `(game_id, status)` predicate, not `SELECT … FOR UPDATE`
**Decision.**
[`runtimerecordstore.UpdateStatus`](../internal/adapters/postgres/runtimerecordstore/store.go)
encodes the compare-and-swap as a `WHERE game_id = $1 AND status = $2`
predicate on a single `UPDATE`, then probes the row's existence on
`RowsAffected == 0` to distinguish `runtime.ErrConflict` (status
changed concurrently) from `runtime.ErrNotFound` (row absent).
**Why.** Same reasoning as
[`rtmanager/docs/postgres-migration.md` §CAS](../../rtmanager/docs/postgres-migration.md):
holding a `SELECT … FOR UPDATE` lock would block every other tick on
the same game while the Go code computed the next status, lengthening
the locked region for no correctness gain. The CAS-only path is
verified by `TestUpdateStatusConcurrentCAS` (8 goroutines, exactly one
winner).
### 3. Port-level deviation: `UpdateEngineVersionInput.Now` and `Deprecate(ctx, version, now)`
**Decision.**
[`ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
gains a `Now time.Time` field on `UpdateEngineVersionInput` (validated
by `Validate` to be non-zero) and a `now time.Time` argument on
`Deprecate`. The corresponding port-level test fixtures in
`engineversionstore_test.go` are updated to carry the new value.
**Why.** Stage 10's literal port did not include a wall-clock for the
engine-version mutators, while
[`UpdateStatusInput`](../internal/ports/runtimerecordstore.go) and
[`UpdateSchedulingInput`](../internal/ports/runtimerecordstore.go) do.
Without `Now` in the input, the adapter would have to either call
`time.Now()` directly (loses test determinism) or accept a `Clock`
dependency in `Config` (adds adapter infrastructure for a single use
case). Aligning the inputs is a small, targeted contract change
allowed by the pre-launch single-init policy and consistent with the
clock-from-input convention adopted everywhere else in the service.
### 4. Domain-level conflict sentinels `engineversion.ErrConflict` and `playermapping.ErrConflict`
**Decision.** The domain packages
[`engineversion`](../internal/domain/engineversion/model.go) and
[`playermapping`](../internal/domain/playermapping/model.go) gain
`ErrConflict` sentinels. Adapters surface PostgreSQL unique violations
as `fmt.Errorf("...: %w", <pkg>.ErrConflict)` so service callers can
branch with `errors.Is`.
**Why.** `runtime.ErrConflict` already exists in the runtime package
and the rest of the codebase (lobby, rtmanager, notification) uses
domain-level conflict sentinels (e.g.
`membership.ErrConflict`,
`runtime.ErrConflict`). Returning a generic wrapped error for
engine-version and player-mapping conflicts would break the
established pattern and force the service layer to carry adapter
implementation knowledge (`sqlx.IsUniqueViolation`). Adding two
sentinels is a small, idiomatic deviation from PLAN Stage 11's bullet
list, called out here so future contract diffs do not re-litigate it.
### 5. `Options` jsonb requires explicit `CAST(... AS jsonb)` in dynamic UPDATE
**Decision.** In
[`engineversionstore.Update`](../internal/adapters/postgres/engineversionstore/store.go)
the dynamic assignment for `options` wraps the value in
`pg.StringExp(pg.CAST(pg.String(...)).AS("jsonb"))`. The plain
`pg.String(...)` literal makes PostgreSQL infer the right-hand side as
`text` and the assignment to a `jsonb` column then fails with
SQLSTATE `42804` (`column is of type jsonb but expression is of type
text`).
**Why.** `INSERT ... VALUES(...)` paths bind the `[]byte` through pgx,
which knows how to coerce text into jsonb at the protocol level.
Dynamic `UPDATE … SET options = '...'` does not go through that bind
because the SQL contains a string literal directly; PostgreSQL applies
its own type inference and fails. Using
[`jet`'s `CAST`](https://pkg.go.dev/github.com/go-jet/jet/v2/postgres#CAST)
is the cleanest way to force the right-hand-side type without dropping
to raw SQL. Storing `'{}'::jsonb` as the empty default mirrors the SQL
column default.
### 6. `Deprecate` is idempotent through a pre-check `Get`
**Decision.**
[`engineversionstore.Deprecate`](../internal/adapters/postgres/engineversionstore/store.go)
runs `Get(version)` first to distinguish three cases: row absent
(return `engineversion.ErrNotFound`), row already deprecated (return
`nil` with no further mutation), row active (run the
`UPDATE ... SET status='deprecated'`). Without the pre-check the
adapter would have to interpret `RowsAffected == 0` against an
ambiguous SQL guard (`WHERE version = ? AND status != 'deprecated'`).
**Why.** Deprecation is a relatively rare admin operation; the extra
read costs roughly one millisecond and removes the ambiguity. The
alternative is the same `classifyMissingUpdate` probe pattern used by
`UpdateStatus`, which would still need a Get to tell "missing" from
"already deprecated". The pre-check is the simplest path.
### 7. `BulkInsert` ships every row in one multi-row `INSERT`, not a transaction
**Decision.**
[`playermappingstore.BulkInsert`](../internal/adapters/postgres/playermappingstore/store.go)
emits a single `INSERT ... VALUES (a), (b), …` with as many tuples as
the input slice. Any unique-violation rolls back every row in the same
statement.
**Why.** The atomicity guarantee Game Master needs (no partial
roster) is already provided by PostgreSQL's per-statement implicit
transaction; wrapping the same rows in `BEGIN; INSERT; INSERT; COMMIT`
buys nothing and adds round-trips. The multi-row form is also the
only path that lets jet's
[`InsertStatement.VALUES(...)`](https://pkg.go.dev/github.com/go-jet/jet/v2/postgres#InsertStatement)
chain without escape hatches. Atomicity is verified end-to-end by
[`TestBulkInsertAtomicConflictRaceName`](../internal/adapters/postgres/playermappingstore/store_test.go)
(3 valid rows + 1 conflicting → 0 rows persisted).
### 8. `miniredis/v2` is a direct gamemaster dependency
**Decision.**
[`go.mod`](../go.mod) gains `github.com/alicebob/miniredis/v2` as a
direct dependency. The
[`streamoffsets` test suite](../internal/adapters/redisstate/streamoffsets/store_test.go)
uses `miniredis.RunT(t)` per test for full isolation.
**Why.** Same reasoning as `rtmanager`: an in-memory Redis is faster
than testcontainers Redis, fully isolated per test, and fits the
shape of the offset-store API. Adding it as a direct dep matches the
pattern in the repo (`rtmanager`, `notification`, `lobby` all do this
for similar adapter test suites).
## Files landed
- [`../internal/domain/engineversion/model.go`](../internal/domain/engineversion/model.go)
`ErrConflict` sentinel.
- [`../internal/domain/playermapping/model.go`](../internal/domain/playermapping/model.go)
`ErrConflict` sentinel.
- [`../internal/ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
`Now` field, `Deprecate(ctx, version, now)` signature.
- [`../internal/ports/engineversionstore_test.go`](../internal/ports/engineversionstore_test.go)
— port-level fixtures plus the new `now must not be zero` reject
case.
- [`../internal/adapters/postgres/internal/sqlx/sqlx.go`](../internal/adapters/postgres/internal/sqlx/sqlx.go)
`WithTimeout`, `IsNoRows`, `IsUniqueViolation`, `Nullable*`
helpers (mirror of `rtmanager`).
- [`../internal/adapters/postgres/internal/pgtest/pgtest.go`](../internal/adapters/postgres/internal/pgtest/pgtest.go)
— testcontainers harness scoped to the `gamemaster` schema and
service role.
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go)
with full `_test.go`.
- [`../internal/adapters/postgres/engineversionstore/store.go`](../internal/adapters/postgres/engineversionstore/store.go)
with full `_test.go`.
- [`../internal/adapters/postgres/playermappingstore/store.go`](../internal/adapters/postgres/playermappingstore/store.go)
with full `_test.go`.
- [`../internal/adapters/postgres/operationlog/store.go`](../internal/adapters/postgres/operationlog/store.go)
with full `_test.go`.
- [`../internal/adapters/redisstate/keyspace.go`](../internal/adapters/redisstate/keyspace.go).
- [`../internal/adapters/redisstate/streamoffsets/store.go`](../internal/adapters/redisstate/streamoffsets/store.go)
with full `_test.go`.
- [`../go.mod`](../go.mod), [`../go.sum`](../go.sum) — `miniredis/v2`
promoted to a direct dependency.
- [`../README.md`](../README.md) — §References pointer to this
record.
## Verification
```sh
cd gamemaster
# Domain + port unit tests still pass after the Stage-11 contract
# touch-ups.
go test ./internal/domain/... ./internal/ports/...
# All adapter test suites (require Docker for testcontainers; without
# Docker, the pgtest helpers call t.Skip).
go test ./internal/adapters/postgres/...
go test ./internal/adapters/redisstate/...
# CAS race coverage with -race; the test must observe exactly one
# winner per run.
go test -count=3 -race -run TestUpdateStatusConcurrentCAS \
./internal/adapters/postgres/runtimerecordstore
# Stage 06/07 contract freeze tests stay green:
go test ./... -run Contract
go test ./... -run NotificationIntent
```
The full repo-level `go build ./...` from the workspace root also
succeeds; service-layer stages (13+) and the mocks regeneration
(stage 12) are unaffected by Stage 11's adapter additions.
---
stage: 12
title: External clients
---
# Stage 12 — External clients
This decision record captures the non-obvious choices made while
implementing the five outbound adapters Game Master uses to talk to
the engine, Game Lobby, Runtime Manager, the notification stream, and
the lobby-events stream at PLAN Stage 12.
## Context
[`../PLAN.md` Stage 12](../PLAN.md) ships the adapter layer the
service-layer stages 13–18 depend on. Ports were frozen by Stage 10
([`stage10-domain-and-ports.md`](./stage10-domain-and-ports.md)) and
the AsyncAPI/OpenAPI contracts were frozen by Stage 06
([`stage06-contract-files.md`](./stage06-contract-files.md)). The
reference precedent is `rtmanager`'s adapter tree
([`rtmanager/internal/adapters/lobbyclient`](../../rtmanager/internal/adapters/lobbyclient),
[`rtmanager/internal/adapters/notificationpublisher`](../../rtmanager/internal/adapters/notificationpublisher),
[`rtmanager/internal/adapters/healtheventspublisher`](../../rtmanager/internal/adapters/healtheventspublisher)),
which Stage 11 already locked in as the canonical shape for Game
Master persistence adapters. Stage 12 extends that precedent to the
HTTP clients and stream publishers.
Six decisions deviate from a literal copy of the `rtmanager` precedent
or extend the literal task list of PLAN Stage 12. Each is recorded
below.
## Decisions
### 1. Engine client carries no `BaseURL` in `Config`
**Decision.**
[`engineclient.Config`](../internal/adapters/engineclient/client.go)
exposes only `CallTimeout` and `ProbeTimeout`. The engine endpoint
URL is supplied per call from `runtime_records.engine_endpoint`.
**Why.** Game Master operates on N concurrent games at runtime; each
game lives behind its own DNS hostname (`http://galaxy-game-{game_id}:8080`).
Binding a base URL at construction would force a per-game client
instance and complicate the caller. The port already reflects the
right shape (`baseURL` is a method parameter on every method), so the
adapter follows it. The `*http.Client` is shared, so the HTTP
connection pool stays single-instance.
### 2. Two timeouts on the engine client, dispatched per method
**Decision.** The engine client routes turn-generation-class methods
(`Init`, `Turn`, `BanishRace`, `ExecuteCommands`, `PutOrders`)
through `CallTimeout` and inspect-style methods (`Status`,
`GetReport`) through `ProbeTimeout`. Both are required and must be
positive at construction.
**Why.** README §Configuration already declares the two
(`GAMEMASTER_ENGINE_CALL_TIMEOUT=30s`,
`GAMEMASTER_ENGINE_PROBE_TIMEOUT=5s`) for exactly this dispatch:
turn generation on a large game can run for tens of seconds, while
status/report reads are bounded and benefit from a tight ceiling.
A single shared timeout would either starve the long calls or relax
the short ones; the dispatch keeps the contract consistent with the
documented intent.
### 3. Engine `population` (number) decoded into `int` via `math.Round`
**Decision.**
[`engineclient`](../internal/adapters/engineclient/client.go) decodes
each `PlayerState.population` (typed as `number` in `game/openapi.yaml`)
into a private `float64` field, then converts to the port-level `int`
through `int(math.Round(value))`. NaN, infinite, and negative values
are rejected as `ports.ErrEngineProtocolViolation`.
**Why.** The port (Stage 10) and the AsyncAPI for `gm:lobby_events`
both treat population as a non-negative integer; the engine spec is
the only place it is typed as `number`. The engine in practice
returns whole values, but a defensive `math.Round` removes any
floating-point noise that would otherwise propagate to Lobby.
Rejecting NaN/Inf/negative payloads keeps the protocol invariant
explicit at the trust boundary.
### 4. Lobby client walks pagination with a hard page cap
**Decision.**
[`lobbyclient.GetMemberships`](../internal/adapters/lobbyclient/client.go)
walks the `next_page_token` chain transparently with `page_size=200`,
stopping when the upstream response carries an empty
`next_page_token`. A hard cap of 64 pages (`maxPages`) surfaces as
`fmt.Errorf("%w: pagination overflow ...", ports.ErrLobbyUnavailable)`
when crossed.
**Why.** The port contract is "every membership of gameID, in any
status"; the only way to satisfy it across Lobby's paged contract is
to follow the chain. The 64-page cap is a defensive guard against a
broken upstream that keeps issuing tokens; 64 × 200 = 12 800
memberships per game, two orders of magnitude beyond any realistic
Galaxy roster, so legitimate traffic never trips it. Surfacing the
overflow as `ErrLobbyUnavailable` lets the membership cache treat it
the same as any other transport fault.
### 5. RTM client does not introduce `ErrSemverPatchOnly`
**Decision.** RTM's `409 Conflict` response with `error_code=semver_patch_only`
is wrapped as `fmt.Errorf("%w: rtm patch: ... (error_code=semver_patch_only)", ports.ErrRTMUnavailable)`
without a dedicated typed sentinel.
**Why.** The Stage 10 port [`RTMClient.Patch`](../internal/ports/rtmclient.go)
declares only `ErrRTMUnavailable`. Adding `ErrSemverPatchOnly` here
would extend the port contract beyond Stage 10's frozen surface, and
the v1 service-layer caller (Stage 17, `adminpatch`) already
validates semver-patch eligibility against `engineversionstore`
before issuing the call. The 409 path is therefore a defence-in-depth
signal, not a primary branch; a single wrapped error keeps the port
narrow and lets the caller match on the message substring if it
ever needs to (today it does not).
### 6. Lobby-events publisher reuses the `rtmanager/healtheventspublisher` shape, with two methods sharing one stream
**Decision.**
[`lobbyeventspublisher.Publisher`](../internal/adapters/lobbyeventspublisher/publisher.go)
exposes `PublishSnapshotUpdate` and `PublishGameFinished`, both
hitting the same Redis Stream key (`cfg.Streams.LobbyEvents`,
default `gm:lobby_events`). Each XADD encodes the same field
vocabulary as `rtmanager/healtheventspublisher`: integer fields are
serialised through `strconv.FormatInt` / `strconv.Itoa`, the
per-player projection is JSON-encoded into one stream field
(`player_turn_stats`), and the discriminator field (`event_type`) is
a string literal pinned to one of the two AsyncAPI const values.
No MAXLEN cap is set on XADD; an empty `PlayerTurnStats` slice is
serialised as `"[]"` (literal). All `time.Time` fields are coerced
to UTC before `UnixMilli()` so the published timestamps match the
contract regardless of caller-supplied timezone.
**Why.** The two messages share one channel per the AsyncAPI spec
([`runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml));
the discriminator is the documented dispatch key for Lobby's
consumer. Using the existing field-encoding pattern from
`rtmanager/healtheventspublisher` keeps the wire format consistent
across services and lets Lobby reuse the same XADD-decoding helpers
it already runs against `runtime:health_events`. Setting MAXLEN was
considered and rejected: Game Master never processes the stream
itself, and the Lobby consumer owns its consumer-group offset, so
trimming would risk dropping unconsumed entries. The empty `"[]"`
default keeps the stream entry valid JSON for the field even before
the first turn generates (when no per-player stats exist yet).
### 7. Defensive Makefile guard for `make mocks` between Stage 12 and Stage 19
**Decision.** The `mocks` Makefile target now skips the
`internal/api/internalhttp/handlers/...` line when that directory
does not yet exist:
```makefile
mocks:
go generate ./internal/ports/...
@if [ -d ./internal/api/internalhttp/handlers ]; then \
go generate ./internal/api/internalhttp/handlers/...; \
fi
```
**Why.** Stage 8 wired the Makefile to regenerate both port-level
and handler-level mocks, but the handlers directory only appears at
Stage 19. Without the guard, `make mocks` fails with `lstat: no such
file or directory` between Stage 12 and Stage 19 — exactly when GM
is being grown stage by stage. The guard makes the target idempotent
across stages and adds zero cost when the directory is finally
created.
## Files landed
- [`../internal/adapters/engineclient/client.go`](../internal/adapters/engineclient/client.go),
[`../internal/adapters/engineclient/client_test.go`](../internal/adapters/engineclient/client_test.go)
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go),
[`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go)
- [`../internal/adapters/rtmclient/client.go`](../internal/adapters/rtmclient/client.go),
[`../internal/adapters/rtmclient/client_test.go`](../internal/adapters/rtmclient/client_test.go)
- [`../internal/adapters/notificationpublisher/publisher.go`](../internal/adapters/notificationpublisher/publisher.go),
[`../internal/adapters/notificationpublisher/publisher_test.go`](../internal/adapters/notificationpublisher/publisher_test.go)
- [`../internal/adapters/lobbyeventspublisher/publisher.go`](../internal/adapters/lobbyeventspublisher/publisher.go),
[`../internal/adapters/lobbyeventspublisher/publisher_test.go`](../internal/adapters/lobbyeventspublisher/publisher_test.go)
- [`../internal/adapters/mocks/`](../internal/adapters/mocks) — ten
generated `mockgen` files covering every Stage 10 port (engine,
lobby, rtm, notification publisher, lobby-events publisher, plus
the five store/log ports landed by Stage 11).
- [`../Makefile`](../Makefile) — defensive guard on the `mocks`
target.
- [`../README.md`](../README.md) — §References pointer to this
record.
## Verification
```sh
cd gamemaster
# Mocks regenerate cleanly with no diff after a second run.
make mocks
git diff --exit-code internal/adapters/mocks
# Adapter-level unit tests against httptest / miniredis.
go test ./internal/adapters/engineclient/...
go test ./internal/adapters/lobbyclient/...
go test ./internal/adapters/rtmclient/...
go test ./internal/adapters/notificationpublisher/...
go test ./internal/adapters/lobbyeventspublisher/...
# Full repo build remains green; Stage 06/07/09-11 contract and
# adapter tests are unaffected.
go test ./...
```
---
stage: 13
title: Register-runtime service
---
# Stage 13 — Register-runtime service
This decision record captures the non-obvious choices made while
implementing the `register-runtime` service-layer orchestrator at PLAN
Stage 13. The service is the single entry point Game Lobby uses (after
Runtime Manager has reported a successful container start) to install a
freshly-started game in Game Master.
## Context
[`../PLAN.md` Stage 13](../PLAN.md) ships the first service-layer stage
of Game Master. It lays down the orchestrator pattern that Stages 14–17 will
reuse (engine version registry CRUD, scheduler, hot path, admin
operations). The lifecycle the service drives is frozen by
[`../README.md` §Lifecycles → Register-runtime](../README.md):
1. validate request shape;
2. reject if `runtime_records.{game_id}` already exists;
3. resolve `image_ref` for `target_engine_version`;
4. persist `runtime_records` with `status=starting`;
5. call engine `POST /api/v1/admin/init`;
6. persist `player_mappings` from the engine response;
7. CAS `status: starting → running` and persist initial scheduling;
8. append `operation_log`;
9. publish `runtime_snapshot_update`;
10. return the persisted record.
The reference precedent is
[`rtmanager/internal/service/startruntime`](../../rtmanager/internal/service/startruntime),
which established the `Input` / `Result` / `Dependencies` / `NewService`
/ `Handle` shape, the `recordFailure` helper, and the
`bestEffortAppend` audit-log convention.
Five decisions deviate from a literal reading of either PLAN Stage 13
or the rtmanager precedent. Each is recorded below.
## Decisions
### 1. `RuntimeRecordStore.Delete` extension
**Decision.** [`ports.RuntimeRecordStore`](../internal/ports/runtimerecordstore.go)
gains an idempotent `Delete(ctx, gameID) error` method. The
PostgreSQL-backed adapter
[`runtimerecordstore.Store.Delete`](../internal/adapters/postgres/runtimerecordstore/store.go)
issues a single `DELETE FROM runtime_records WHERE game_id = $1` and
returns `nil` even when no row matches. The mock at
[`internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go)
is regenerated by `make -C gamemaster mocks`. A lone integration
test `TestDeleteIdempotent` mirrors `TestDeleteByGameIdempotent` in
`playermappingstore`.
**Why.** The README's failure paths for `register-runtime` mandate
"roll back `runtime_records`" on every post-Insert failure. The Stage 10
port surface had no Delete primitive, so the orchestrator could not
satisfy the README without one. Three alternatives were considered
and rejected:
- **Reorder the flow** (call engine init first, only then persist
`runtime_records`): contradicts the README, which lists the Insert
step before the engine call so that the in-flight `starting` row is
observable to inspect surfaces and acts as a coordination point for
concurrent register-runtime requests on the same game id.
- **Introduce a `removed` status enum**: changes the runtime status
machine for one transient bookkeeping case; complicates indexes,
filters, and the inspect surface; is not described anywhere in
README §Game Master status model.
- **Single SQL transaction across both stores**: requires the adapter
layer to expose a transactional sub-interface, breaking the per-port
abstraction Stage 10 set up. The cost of one extra method on a
single port is far smaller.
This is the same pattern Stage 11 used for `UpdateEngineVersionInput.Now`
and `Deprecate(ctx, version, now)`: a small, targeted contract delta
admitted by the pre-launch single-init policy.
### 2. Engine 4xx → `engine_validation_error`, engine 5xx → `engine_unreachable`
**Decision.** When the engine `/admin/init` call returns 4xx, the
service produces `Result{ErrorCode: engine_validation_error}`. When it
returns 5xx (or fails at the transport layer), the service produces
`Result{ErrorCode: engine_unreachable}`. The classification lives in
[`classifyEngineError`](../internal/service/registerruntime/service.go)
and dispatches on the engine port sentinels
(`ports.ErrEngineValidation`, `ports.ErrEngineUnreachable`,
`ports.ErrEngineProtocolViolation`).
**Why.** [`../PLAN.md` Stage 13](../PLAN.md) lists the two as separate
test cases ("engine 4xx (engine_validation_error), engine 5xx
(engine_unreachable)"), but [`../README.md` §Lifecycles →
Register-runtime](../README.md)'s failure-path table at the time of
Stage 13 lumped them as `engine_unreachable`. PLAN's classification is
more useful operationally:
- 4xx from the engine signals a contract violation (the engine
rejected the request shape, which is a Game Master bug or a stale
contract). Treating this as `engine_unreachable` would push
operators down the "is the engine alive?" branch when the right
branch is "did the GM build send the right shape?".
- 5xx (and transport failures) signal that the engine is unreachable
or unhealthy. `engine_unreachable` is the right code.
The README §Lifecycles failure-path table is updated in the same
patch to reflect the split, so the two documents agree.
### 3. Engine response validated as `engine_protocol_violation`
**Decision.** After a successful engine `/admin/init` HTTP response,
the service performs two extra checks before persisting any
player_mappings:
- the number of returned players must equal the input roster size;
- the set of `RaceName` values returned must be a subset of the
roster (no extra races, no missing races).
A failure on either check rolls back the runtime record and returns
`Result{ErrorCode: engine_protocol_violation}`.
**Why.** The README's failure-path table includes
`engine_protocol_violation` for "engine response missing players or
contains races not in roster". The engine adapter ([Stage 12,
`engineclient.decodeStateResponse`](../internal/adapters/engineclient/client.go))
validates the wire shape (presence of required fields, well-formed
numeric values), but it cannot validate against the roster Game Master
sent — only the service layer knows the roster. Splitting the two
checks keeps the adapter narrow and lets the service-layer error code
carry the semantic meaning.
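The two checks compose into one set-equality test; a sketch over plain race-name strings (the real code presumably works on the decoded player structs):

```go
package main

// validateInitResponse enforces the two service-layer checks: the
// returned player count equals the roster size, and every returned
// race is one the roster sent, at most once. Together the checks
// amount to exact set equality. Names are illustrative.
func validateInitResponse(roster, returned []string) bool {
	if len(returned) != len(roster) {
		return false
	}
	known := make(map[string]bool, len(roster))
	for _, r := range roster {
		known[r] = true
	}
	for _, r := range returned {
		if !known[r] {
			return false // extra or duplicated race
		}
		delete(known, r) // each race may appear once
	}
	return true
}
```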
### 4. Initial `runtime_snapshot_update` carries non-empty `player_turn_stats`
**Decision.** The first `runtime_snapshot_update` published by
register-runtime carries one
`PlayerTurnStats{UserID, Planets, Population}` row per active member,
projected from the `engine.Init` response by joining on `RaceName`
against the input roster. The projection is sorted by `UserID` for a
deterministic wire order.
**Why.** The README §Async Stream Contracts cadence note used to read
"empty when the snapshot is published for a status transition with no
new turn payload". For register-runtime there *is* a new payload — the
engine returns the initial player state in its `/admin/init` response,
including `Planets` and `Population`. That state is the turn-0
baseline against which Lobby's per-game stats aggregator measures
later deltas: without it, the first per-player delta after turn 1
would silently equal "everything" instead of "the change since
turn 0". The README cadence wording is updated in the same patch to
say the register-runtime snapshot carries the engine's turn-0 stats.
### 5. Best-effort rollback with two-flag gating
**Decision.** The service exposes a single `rollback(ctx, gameID,
playerMappingsInstalled)` helper that always tries `runtime_records.Delete`
and conditionally tries `playermappings.DeleteByGame`. The two booleans
on `recordFailure` (`runtimeInserted`, `playerMappingsInstalled`)
gate the rollback so:
- a pre-Insert failure (`invalid_request`, `conflict` from `Get`,
`engine_version_not_found`, `Insert`'s own `ErrConflict`) skips
rollback entirely;
- a post-Insert / pre-BulkInsert failure deletes only the runtime
row;
- a post-BulkInsert failure deletes both. Note that `BulkInsert` errors
  themselves never install rows (per Stage 11 D7's per-statement
  atomicity), so on `BulkInsert` returning `ErrConflict` the rollback
  flag for `player_mappings` is `false`.
The rollback uses a fresh `context.Background()` with a 5-second
timeout so a cancelled request context does not strand the
`starting` row.
**Why.** A common pitfall in rollback paths is to call `Delete` on
state owned by another caller. The Insert-conflict branch is the
canonical example: when our `Insert` returns `ErrConflict`, another
request inserted the row first and owns it. Blindly deleting it
would corrupt that other caller's state. The two-flag gating makes
the ownership transfer explicit. The fresh background context
mirrors the same pattern in `rtmanager.startruntime.releaseLease`.
## Files landed
- [`../internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
— added `Delete` to the interface and the comment block.
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go)
— implemented `Delete`.
- [`../internal/adapters/postgres/runtimerecordstore/store_test.go`](../internal/adapters/postgres/runtimerecordstore/store_test.go)
— added `TestDeleteIdempotent` and `TestDeleteRejectsEmptyGameID`.
- [`../internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go)
— regenerated.
- [`../internal/service/registerruntime/service.go`](../internal/service/registerruntime/service.go)
with [`errors.go`](../internal/service/registerruntime/errors.go)
and [`service_test.go`](../internal/service/registerruntime/service_test.go)
— new orchestrator package and tests.
- [`../README.md`](../README.md) — §References pointer to this record
plus one-line clarifications in §Lifecycles → Register-runtime
(failure-path table now splits 4xx/5xx per **D2**) and §Async Stream
Contracts (cadence note now says the register-runtime snapshot
carries `player_turn_stats` from the engine-init response per **D4**).
- [`../PLAN.md`](../PLAN.md) — Stage 13 marked done.
## Verification
```sh
cd gamemaster
# Mocks regenerate cleanly with no diff after the port extension.
make mocks
git diff --exit-code internal/adapters/mocks
# Domain + port tests still pass.
go test ./internal/domain/... ./internal/ports/...
# Adapter test for the new Delete method.
go test ./internal/adapters/postgres/runtimerecordstore/...
# Service-level tests for the new orchestrator.
go test ./internal/service/registerruntime/...
# Stage 06/07/09-12 contract / adapter / freeze tests stay green.
go test ./...
```
The full repo-level `go build ./...` from the workspace root succeeds;
later stages (14+) build on the orchestrator shape Stage 13
establishes.
---
stage: 14
title: Engine version registry service
---
# Stage 14 — Engine version registry service
This decision record captures the non-obvious choices made while
implementing the `engine_version` registry service-layer at PLAN
Stage 14. The service backs the
`/api/v1/internal/engine-versions/*` REST surface (Stage 19) and the
hot-path `image_ref` resolve called synchronously by Game Lobby's
start flow.
## Context
[`../PLAN.md` Stage 14](../PLAN.md) lists seven service methods:
`List`, `Get`, `Create`, `Update`, `Deprecate`, `Delete`,
`ResolveImageRef`. The lifecycle the service drives is frozen by
[`../README.md` §Engine Version Registry](../README.md). The reference
precedent for shape and audit semantics is
[`../internal/service/registerruntime`](../internal/service/registerruntime/service.go)
landed at Stage 13.
Five decisions deviate from a literal reading of either Stage 14 or
the existing port and migration shapes. Each is recorded below.
## Decisions
### 1. `EngineVersionStore.Delete` extension
**Decision.** [`ports.EngineVersionStore`](../internal/ports/engineversionstore.go)
gains a `Delete(ctx, version) error` method that returns
`engineversion.ErrNotFound` when no row matches. The PostgreSQL-backed
adapter [`engineversionstore.Store.Delete`](../internal/adapters/postgres/engineversionstore/store.go)
issues a single `DELETE FROM engine_versions WHERE version = $1` and
distinguishes "missing" from "removed" via `RowsAffected`. The mock at
[`internal/adapters/mocks/mock_engineversionstore.go`](../internal/adapters/mocks/mock_engineversionstore.go)
is regenerated by `make -C gamemaster mocks`. Three adapter tests
(`TestDeleteHappy`, `TestDeleteNotFound`, `TestDeleteRejectsEmptyVersion`)
mirror the pattern from the existing Deprecate tests.
**Why.** Stage 14 explicitly requires the service to expose a hard
`Delete` distinct from `Deprecate`. The Stage 11 port surface only
carried `Deprecate` (idempotent soft-mark) and
`IsReferencedByActiveRuntime` (read probe). Three alternatives were
considered and rejected:
- **Skip hard delete**: omits a Stage 14 deliverable and forces a port
delta later. The OpenAPI 409 `engine_version_in_use` example would
also become a dangling spec entry.
- **Reuse `Deprecate` for both soft and hard semantics**: contradicts
README §Engine Version Registry ("`status` values: ... `deprecated`
(rejected on new starts; existing runtimes unaffected)"). A
referenced version must remain deprecable so the operator can phase
in a successor while existing runtimes finish out — folding the
reference check into Deprecate would break that flow.
- **Inline the SQL inside the service**: contradicts the per-port
abstraction Stage 10 set up; the service must not import the jet
table package.
This is the same pattern Stage 13 D1 used for
`RuntimeRecordStore.Delete`: a small, targeted contract delta admitted
by the pre-launch single-init policy.
### 2. Hard-delete reference probe runs before adapter `Delete`
**Decision.** [`Service.Delete`](../internal/service/engineversion/service.go)
calls `versions.IsReferencedByActiveRuntime` first; on a positive
result it surfaces `ErrInUse` without ever calling the adapter
`Delete`. Only when the probe reports zero references does the service
issue the SQL DELETE.
**Why.** Two alternatives were rejected:
- **Single transaction with `SELECT ... FOR UPDATE` plus DELETE**:
requires the adapter to expose a transactional sub-interface and
forces the service into store-internal locking semantics. The plan
is single-instance (README §Non-Goals), so the small race window
between probe and delete is acceptable and self-correcting (a
late-arriving register-runtime against a deprecated version would
fail at `runtime_records` insert anyway because the version row is
gone — the eventual outcome is the same).
- **Probe-after-delete**: leaks the DELETE on transient probe
failures and surfaces a misleading "deleted" outcome to the caller.
Surfacing `engine_version_in_use` before any mutation matches the
README §Error Model wording and the OpenAPI `EngineVersionInUseError`
example.
### 3. `engine_version_delete` op kind added to schema and domain
**Decision.** A new audit value `engine_version_delete` is added to:
- [`domain/operation.OpKind`](../internal/domain/operation/log.go)
(constant, `IsKnown`, `AllOpKinds`);
- [`migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
(the `operation_log_op_kind_chk` CHECK constraint);
- README §Persistence Layout (the `op_kind` enum listing in the
`operation_log` description).
The pre-launch single-init policy from
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
allows editing `00001_init.sql` until first production deploy.
**Why.** Two alternatives were rejected:
- **Reuse `engine_version_deprecate`** for hard delete: semantically
weak; audit consumers would have to inspect outcome plus an
out-of-band column to tell soft from hard, defeating the audit's
signal value.
- **Skip audit for hard delete**: inconsistent with every other
service-layer mutation (every Stage 13/14 mutation writes
operation_log). Forensics on a destructive admin action are exactly
where audit matters most.
### 4. `operation_log.game_id` column doubles as audit subject
**Decision.** Engine-version CRUD audit entries store the canonical
`version` string in the `OperationEntry.GameID` field (and therefore
in the `operation_log.game_id` column). For `OpKindEngineVersionCreate`
the canonical post-`ParseSemver` form is used (`v1.2.3`); for
`OpKindEngineVersionUpdate` / `Deprecate` / `Delete` the user-supplied
version is used so failed lookups still record the attempt verbatim.
**Why.** Three alternatives were considered and rejected:
- **Make `game_id` nullable and add a `subject_id` column**: requires
a migration delta + jet regeneration + a domain field rename. Out
of scope for Stage 14 and inconsistent with the minimal-diff
principle.
- **Use a sentinel `engine_version:<v>` prefix**: harder to query
alongside per-game audit reads; the index
`operation_log (game_id, started_at DESC)` already covers
subject-scoped reads, and a sentinel prefix would force callers to
strip it.
- **Skip audit for engine-version CRUD**: README §Persistence Layout
explicitly lists `engine_version_create | engine_version_update |
engine_version_deprecate` as op_kind values; the audit table is
the canonical surface.
The decision is recorded both here and in the README §Persistence
Layout note so future readers can find the overload rationale.
### 5. JSON-object validation for `Options`
**Decision.** [`Service.Create`](../internal/service/engineversion/service.go)
and `Service.Update` validate the `Options` byte slice as a JSON
object before persisting (raw bytes are decoded into
`map[string]any`; non-objects, including arrays and scalars, are
rejected with `invalid_request`). Empty/whitespace-only input passes
through as nil; the adapter (Stage 11 D5) already substitutes the
schema default `'{}'::jsonb`.
**Why.** The `engine_versions.options` column is `jsonb`. Persisting
an array, scalar, or malformed JSON would either be rejected by the
PostgreSQL parser at INSERT time (surfacing as a generic 500) or
accepted and break engine-side consumers that expect an object. The
service-layer validation surfaces a clear `invalid_request` early and
keeps the contract honest. README §Engine Version Registry already
describes `options` as a "free-form `jsonb` document" (object
implied); the validation makes that wording load-bearing.
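The validation rule fits in a few lines. This sketch adds an explicit rejection of JSON `null`, which `json.Unmarshal` would otherwise silently accept into a nil map — an assumption about what the real code must guard, not a documented detail:

```go
package main

import (
	"encoding/json"
	"errors"
	"strings"
)

var errInvalidRequest = errors.New("invalid_request")

// validateOptions enforces the object-only rule: empty or
// whitespace-only input passes through as nil (the Stage 11 adapter
// substitutes the schema default '{}'::jsonb); anything else must
// decode to a JSON object. Arrays, scalars, malformed JSON, and
// null are all rejected early with invalid_request.
func validateOptions(raw []byte) ([]byte, error) {
	if strings.TrimSpace(string(raw)) == "" {
		return nil, nil
	}
	var obj map[string]any
	if err := json.Unmarshal(raw, &obj); err != nil {
		return nil, errInvalidRequest
	}
	if obj == nil {
		return nil, errInvalidRequest // JSON null is not an object
	}
	return raw, nil
}
```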
## Files landed
- [`../internal/ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
— added `Delete` to the interface and the comment block.
- [`../internal/adapters/postgres/engineversionstore/store.go`](../internal/adapters/postgres/engineversionstore/store.go)
— implemented `Delete`.
- [`../internal/adapters/postgres/engineversionstore/store_test.go`](../internal/adapters/postgres/engineversionstore/store_test.go)
— added `TestDeleteHappy`, `TestDeleteNotFound`,
`TestDeleteRejectsEmptyVersion`.
- [`../internal/adapters/mocks/mock_engineversionstore.go`](../internal/adapters/mocks/mock_engineversionstore.go)
— regenerated.
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— added `engine_version_delete` to `operation_log_op_kind_chk`.
- [`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
with [`log_test.go`](../internal/domain/operation/log_test.go)
— added `OpKindEngineVersionDelete` plus `IsKnown`/`AllOpKinds`
membership.
- [`../internal/service/engineversion/service.go`](../internal/service/engineversion/service.go)
with [`errors.go`](../internal/service/engineversion/errors.go)
and [`service_test.go`](../internal/service/engineversion/service_test.go)
— new orchestrator package and tests.
- [`../internal/service/registerruntime/service_test.go`](../internal/service/registerruntime/service_test.go)
`fakeEngineVersions` gains a stub `Delete` to satisfy the
extended port.
- [`../README.md`](../README.md) — §References pointer to this
record; §Persistence Layout note that engine-version CRUD audit
entries store `version` in the `game_id` column and that
`engine_version_delete` joins the op_kind enum.
- [`../PLAN.md`](../PLAN.md) — Stage 14 marked done.
## Verification
```sh
cd gamemaster
# Mocks regenerate cleanly with no diff after the port extension is
# committed alongside this stage.
make mocks
git diff --exit-code internal/adapters/mocks
# Domain + port tests still pass (operation log enum membership).
go test ./internal/domain/... ./internal/ports/...
# Adapter test for the new Delete method and the migration's CHECK
# constraint.
go test ./internal/adapters/postgres/engineversionstore/...
go test ./internal/adapters/postgres/operationlog/...
# Service-level tests for the new orchestrator.
go test ./internal/service/engineversion/...
# Stage 13 service tests still pass (the fake gains a stub Delete).
go test ./internal/service/registerruntime/...
# Repo build succeeds at the workspace root.
go build ./...
```
---
stage: 15
title: Scheduler, turn generation, and snapshot publisher
---
# Stage 15 — Scheduler, turn generation, and snapshot publisher
This decision record captures the non-obvious choices made while
implementing the scheduler ticker, the turn-generation orchestrator,
and the publication of `gm:lobby_events` plus `notification:intents`
at PLAN Stage 15. It is the heart of Game Master: every running game
flows through this code path on every scheduled or admin-forced turn.
## Context
[`../PLAN.md` Stage 15](../PLAN.md) ships three components that
together drive a turn:
1. `service/turngeneration` — the orchestrator that CAS's `running →
generation_in_progress`, calls the engine `/admin/turn`, branches
on `finished`, and publishes a `runtime_snapshot_update` /
`game_finished` event plus the corresponding `game.turn.ready` /
`game.finished` / `game.generation_failed` notification.
2. `service/scheduler` — a thin, stateless wrapper around
`domain/schedule.Schedule.Next` reused by the turn-generation
recompute step and (in Stage 17) by `service/adminforce`.
3. `worker/schedulerticker` — the 1-second loop that scans
`runtime_records.ListDueRunning(now)` and dispatches one
`turngeneration.Handle` per due game.
The lifecycle the orchestrator drives is frozen by
[`../README.md` §Lifecycles → Turn generation](../README.md), and the
publication cadence by [§Async Stream Contracts](../README.md) and
[§Notification Contracts](../README.md). The reference precedent for
the orchestrator shape (Input / Result / Dependencies / NewService /
Handle) is Stage 13's `service/registerruntime`.
Seven decisions deviate from a literal reading of either PLAN Stage 15,
the README, or the Stage 13 precedent. Each is recorded below.
## Decisions
### D1. Resolve `game_name` synchronously from Lobby per notification
**Decision.** [`ports.LobbyClient`](../internal/ports/lobbyclient.go)
gains a `GetGameSummary(ctx, gameID) (GameSummary, error)` method plus
a narrow `GameSummary{GameID, GameName, Status}` type. The
HTTP-backed adapter at
[`internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go)
issues a `GET /api/v1/internal/games/{game_id}` against the Lobby
internal listener, decodes the `GameRecord` shape (Lobby's frozen
contract), and wraps every non-success outcome with
`ports.ErrLobbyUnavailable`. The `turngeneration` service calls it
before publishing each `notification:intents` entry; on any error the
orchestrator falls back to using `game_id` as `game_name` and logs a
`warn` event with `error_code=lobby_unavailable`.
**Why.** `notificationintent.GameTurnReadyPayload`,
`GameFinishedPayload`, and `GameGenerationFailedPayload` all require a
`game_name` string, but Game Master does not own the platform name and
the `register-runtime` envelope does not carry it. Three alternatives
were considered and rejected:
- **Extend the `register-runtime` contract with `game_name` and
persist it on `runtime_records`.** Cleanest architecturally, but
requires editing the Stage 06 frozen OpenAPI spec, the contract
test, the Stage 09 migration, the Stage 10 domain type, the
Stage 11 store and tests, the Stage 13 register-runtime service and
tests, and the regenerated jet code. Substantial cross-stage churn
for a single denormalised string.
- **Use `game_id` as the `game_name` placeholder unconditionally.**
Zero change cost, but every push notification a user receives
carries the opaque platform identifier — a user-visible regression.
- **Defer notification publication to Stage 16.** Contradicts the
PLAN Stage 15 task list, which explicitly enumerates
`game.turn.ready`, `game.finished`, and `game.generation_failed`
publication.
The chosen design adds one method and one return type to a port
already established in Stage 12, with fail-soft fallback semantics
that keep notification publication best-effort.
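The fail-soft fallback can be sketched as a small resolver; the types and names are stand-ins, and the `warn` log with `error_code=lobby_unavailable` is left to the caller as the record describes:

```go
package main

import "errors"

// Stand-in for ports.ErrLobbyUnavailable.
var errLobbyUnavailable = errors.New("lobby unavailable")

// gameSummary mirrors the narrow port-level type (illustrative shape).
type gameSummary struct {
	GameID   string
	GameName string
}

// resolveGameName keeps notification publication best-effort: on any
// Lobby error the notification still goes out, carrying game_id in
// place of the platform name.
func resolveGameName(gameID string, lookup func(string) (gameSummary, error)) string {
	s, err := lookup(gameID)
	if err != nil {
		return gameID // fail-soft; caller logs the lobby_unavailable warn
	}
	return s.GameName
}
```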
### D2. `Trigger` parameter classifies telemetry, never logic
**Decision.** The plan's input shape `{gameID, trigger ∈ {scheduler,
force}}` is preserved as `turngeneration.Input.Trigger`. The value
flows into the
`gamemaster.turn_generation.outcomes` counter as a
`trigger` label and into structured logs; it does **not** branch the
orchestrator's persistence path. The skip-tick mechanic is driven
exclusively by the runtime record's `skip_next_tick` column.
**Why.** [`../README.md §Force-next-turn`](../README.md) describes
adminforce as: "Run the turn-generation flow synchronously (the same
code path the scheduler uses). After success, set
`runtime_records.skip_next_tick = true`." Adminforce flips the flag
*after* the forced turn completes; the *next* scheduler-driven
generation consumes it. Forking the orchestrator on `Trigger` would
duplicate the recompute logic in two places and reopen the question
"what if a force fires while skip_next_tick is already true?".
Single-path makes the answer fall out of the existing rule (read the
flag at start, clear at recompute) without special cases.
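A minimal sketch of the rule, with hypothetical names (the real `turngeneration.Input` carries more fields):

```go
package main

// Input mirrors the plan's {gameID, trigger} shape; Trigger is a
// telemetry classification only (illustrative field set).
type Input struct {
	GameID  string
	Trigger string // "scheduler" or "force"
}

// shouldSkip is driven solely by the persisted skip_next_tick column;
// the trigger never branches the orchestrator's logic.
func shouldSkip(skipNextTick bool, trigger string) bool {
	_ = trigger // used as a metrics/log label elsewhere, never here
	return skipNextTick
}
```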
### D3. Two-CAS pattern with cleanup on engine failure
**Decision.** Persistence steps mirror Stage 13's CAS-then-rollback
pattern with two CAS transitions per generation:
1. `running → generation_in_progress` at the start. On
`runtime.ErrConflict` (concurrent stop / external mutation) the
orchestrator returns `Result{ErrorCode: conflict}` without
publishing events; the external mutation is responsible for its
own snapshot.
2. After the engine call:
- success + `finished=true` → `generation_in_progress → finished`;
- success + `finished=false` → `generation_in_progress → running`;
- engine error → `generation_in_progress → generation_failed`.
The post-engine CAS surfaces `runtime.ErrConflict` only when an
external mutation (typical cause: admin issued a stop while the engine
was generating) overtook the orchestrator. The engine call has
already mutated state, but the runtime row is owned by the new actor;
the orchestrator records the audit failure with `conflict` and exits.
**Why.** This keeps Stage 13's pattern intact: every CAS knows what
state the row should be in before the call, and a mismatch always
yields `conflict`. Mixing the two CAS guards with a single combined
status update (e.g., a transactional "running and not stopped") would
require the adapter to expose multi-status CAS predicates, breaking
the per-row CAS abstraction Stage 11 settled on.
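The second CAS's target selection can be sketched as a pure function. Status strings follow the document; the function and sentinel names are illustrative:

```go
package main

import "errors"

const (
	statusRunning              = "running"
	statusGenerationInProgress = "generation_in_progress"
	statusGenerationFailed     = "generation_failed"
	statusFinished             = "finished"
)

// errEngine is an illustrative stand-in for an engine-call failure.
var errEngine = errors.New("engine call failed")

// postEngineTarget maps the engine outcome to the second CAS transition
// out of generation_in_progress.
func postEngineTarget(engineErr error, finished bool) string {
	switch {
	case engineErr != nil:
		return statusGenerationFailed
	case finished:
		return statusFinished
	default:
		return statusRunning
	}
}
```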
### D4. Snapshot cadence: one publication per outcome
**Decision.** The orchestrator publishes exactly one
`runtime_snapshot_update` *or* `game_finished` per turn-generation
call:
- success + not finished → `PublishSnapshotUpdate` with full
`player_turn_stats`;
- success + finished → `PublishGameFinished` with full
`player_turn_stats`;
- engine failure → `PublishSnapshotUpdate` with
`RuntimeStatus=generation_failed` and empty `player_turn_stats`
(no fresh engine payload).
The intermediate `running → generation_in_progress` transition is
**not** broadcast.
**Why.** The README cadence enumerates "transitioned" cases as
examples (`running ↔ generation_in_progress`), but PLAN Stage 15
explicitly anchors publication on the outcome side. Publishing twice
would double Lobby's processing cost without delivering new
information, because `generation_in_progress` carries no fresh engine
state and Lobby cannot act on the in-progress moment.
### D5. Notification recipients = `playermappingstore.ListByGame`
**Decision.** `game.turn.ready` and `game.finished` use
`AudienceKindUser` and need a sorted unique non-empty
`recipient_user_ids` list. The orchestrator derives it from
`playermappingstore.ListByGame(gameID)` projected to `UserID` values,
deduplicated and sorted ascending. Empty rosters cause the
notification to be skipped silently with a `warn` log; the runtime
mutation persists.
**Why.** This is the only roster data Game Master owns until Stage 16
delivers the membership cache. After Stage 17 wires `banish`, the
player_mappings rows still represent the engine-known roster and
remain a correct conservative recipient set (banished members will be
filtered separately by Notification Service's user resolution if
absent in `User Service`). Adding a synchronous Lobby
`GetMemberships` call here would duplicate the work Stage 16 is
already on the hook to provide.
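The roster projection reduces to a dedup-and-sort over the mapping rows; the types below are illustrative stand-ins:

```go
package main

import "sort"

// playerMapping is an illustrative stand-in for the player_mappings row.
type playerMapping struct {
	UserID   string
	RaceName string
}

// recipientUserIDs projects mappings to user IDs, deduplicates, and
// sorts ascending; an empty result makes the caller skip the
// notification with a warn log while the runtime mutation persists.
func recipientUserIDs(mappings []playerMapping) []string {
	seen := make(map[string]struct{}, len(mappings))
	ids := make([]string, 0, len(mappings))
	for _, m := range mappings {
		if _, dup := seen[m.UserID]; dup {
			continue
		}
		seen[m.UserID] = struct{}{}
		ids = append(ids, m.UserID)
	}
	sort.Strings(ids)
	return ids
}
```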
### D6. Scheduler service is a stateless utility
**Decision.**
[`service/scheduler.Service`](../internal/service/scheduler/service.go)
exposes a single `ComputeNext(turnSchedule, after, skipNextTick)
(time.Time, bool, error)` method that wraps `schedule.Parse(...).Next(after,
skipNextTick)`. The service holds no dependencies and no clock; the
caller passes `after`. `turngeneration` injects a
`*scheduler.Service` and uses it during the post-success recompute;
Stage 17 will reuse the same instance from `adminforce`.
**Why.** Centralising the parse-then-next sequence keeps the skip
rule in one place and makes the future Stage 17 caller trivial.
Holding no state means tests are pure value tests against the
`domain/schedule` wrapper; no clock injection or dependency wiring is
required.
### D7. Per-game in-flight set on the scheduler ticker
**Decision.**
[`worker/schedulerticker.Worker`](../internal/worker/schedulerticker/worker.go)
holds a `sync.Map[gameID]struct{}` of currently-dispatched games. At
each tick the worker scans `RuntimeRecords.ListDueRunning(now)` and
launches one goroutine per due game; if `LoadOrStore` reports the game
is already in-flight, the worker logs at `debug` and skips. The
goroutine releases the slot via `defer w.inflight.Delete(gameID)`.
**Why.** A 1-second tick is shorter than typical engine call latency
plus PostgreSQL round-trips, so two ticks can observe the same due row
before the first completes. The CAS in `turngeneration` is the
authoritative protection (only one goroutine can flip `running →
generation_in_progress`), but two goroutines doing the engine call and
discarding the loser as `conflict` would waste an engine call and
inflate `engine_validation_error` / `engine_unreachable` counters with
spurious entries. The in-flight set is a 4-line optimisation that
removes the spurious work.
`Worker.Wait` exposes the in-flight `sync.WaitGroup` so tests (and
Stage 19's wiring) can drive `Tick` deterministically and observe
completion. `Run` itself waits on the same group before returning so
context cancellation gracefully drains in-flight work.
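The guard fits in a few lines; names below are illustrative, and the real worker also records a debug log on the skip path:

```go
package main

import "sync"

type ticker struct {
	inflight sync.Map // gameID → struct{}
	wg       sync.WaitGroup
}

// dispatch launches fn for gameID unless a previous dispatch is still
// running; it reports whether a goroutine was actually started.
func (t *ticker) dispatch(gameID string, fn func()) bool {
	if _, loaded := t.inflight.LoadOrStore(gameID, struct{}{}); loaded {
		return false // already in flight: skip this tick
	}
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		defer t.inflight.Delete(gameID)
		fn()
	}()
	return true
}

// wait drains all dispatched work, mirroring Worker.Wait.
func (t *ticker) wait() { t.wg.Wait() }
```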
## Files landed
**Modified:**
- [`../internal/ports/lobbyclient.go`](../internal/ports/lobbyclient.go)
— added `GetGameSummary` to the interface plus the `GameSummary`
type.
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go)
— implemented `GetGameSummary` with the same `ErrLobbyUnavailable`
wrapping precedent as `GetMemberships`.
- [`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go)
— table-driven tests for happy path, 404, 5xx, malformed JSON,
missing required fields, timeout, and bad input.
- [`../internal/adapters/mocks/mock_lobbyclient.go`](../internal/adapters/mocks/mock_lobbyclient.go)
— regenerated.
**Created:**
- [`../internal/service/scheduler/service.go`](../internal/service/scheduler/service.go),
[`../internal/service/scheduler/service_test.go`](../internal/service/scheduler/service_test.go)
— stateless scheduler utility.
- [`../internal/service/turngeneration/service.go`](../internal/service/turngeneration/service.go),
[`../internal/service/turngeneration/errors.go`](../internal/service/turngeneration/errors.go),
[`../internal/service/turngeneration/service_test.go`](../internal/service/turngeneration/service_test.go)
— turn-generation orchestrator and tests.
- [`../internal/worker/schedulerticker/worker.go`](../internal/worker/schedulerticker/worker.go),
[`../internal/worker/schedulerticker/worker_test.go`](../internal/worker/schedulerticker/worker_test.go)
— scheduler ticker worker and tests.
- This decision record.
**Reused (not modified):**
- `internal/domain/runtime/{model.go, transitions.go}` —
`running → generation_in_progress`, `generation_in_progress →
running`, `generation_in_progress → generation_failed`,
`generation_in_progress → finished` were all permitted by the
Stage 10 transitions table.
- `internal/domain/schedule/nexttick.go` — the cron + skip wrapper.
- `internal/domain/operation/log.go` — the `OpKindTurnGeneration`
enum value already in place.
- `internal/ports/{runtimerecordstore.go, engineclient.go,
playermappingstore.go, operationlog.go,
notificationpublisher.go, lobbyeventspublisher.go}` — every store
and publisher used by the orchestrator was already present.
- `internal/telemetry/runtime.go` — `RecordTurnGenerationOutcome`,
`RecordLobbyEventPublished`, `RecordNotificationPublishAttempt`.
- `pkg/notificationintent.NewGameTurnReadyIntent`,
`NewGameFinishedIntent`, `NewGameGenerationFailedIntent`.
## Verification
```sh
cd gamemaster
# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks
# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...
# Scheduler utility.
go test ./internal/service/scheduler/...
# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...
# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...
# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...
# Module-wide build remains green.
go test ./...
```
Out-of-scope for this stage: app wiring (Stage 19), service-local
integration suite (Stage 21), cross-service Lobby ↔ GM tests
(Stage 22).
---
stage: 16
title: Hot-path services and membership cache
---
# Stage 16 — Hot-path services and membership cache
This decision record captures the non-obvious choices made while
implementing the gateway-facing trio of player services
(`commandexecute`, `orderput`, `reportget`) and the in-process membership
cache that authorises every hot-path call. It is the last service-layer
stage before Stage 17 (admin operations) and Stage 19 (REST handlers and
wiring).
## Context
[`../PLAN.md` Stage 16](../PLAN.md) ships four components that together
make the player surface usable:
1. `service/membership` — concurrent in-process LRU cache holding the
per-game `user_id → status` projection from
`Lobby /api/v1/internal/games/{game_id}/memberships`. TTL is the
safety net; the explicit invalidation hook from Lobby is the
primary staleness control.
2. `service/commandexecute` — orchestrator behind
`POST /api/v1/internal/games/{game_id}/commands`. Authorises the
caller, resolves `actor=race_name`, reshapes the JSON envelope, and
forwards `PUT /api/v1/command` to the engine.
3. `service/orderput` — same shape as `commandexecute`, targeting the
engine `PUT /api/v1/order`.
4. `service/reportget` — orchestrator behind
`GET /api/v1/internal/games/{game_id}/reports/{turn}`. Authorises
the caller, resolves `race_name`, and forwards
`GET /api/v1/report?player=<race>&turn=<turn>` to the engine.
The reference precedent for the orchestrator shape (Input / Result /
Dependencies / NewService / Handle, plus a private `classifyEngineError`
helper) is Stage 15's `service/turngeneration`. Six decisions deviate
from a literal reading of the README, the OpenAPI surface, or the
turngeneration precedent. Each is recorded below.
## Decisions
### D1. `reportget` does not require `runtime_records.status = running`
**Decision.**
[`service/reportget`](../internal/service/reportget/service.go) accepts
any non-deleted runtime row and forwards the read to the engine.
`runtime_not_running` is **not** part of `reportget`'s error vocabulary
([`errors.go`](../internal/service/reportget/errors.go)).
`commandexecute` and `orderput`, by contrast, reject anything other than
`StatusRunning` with `runtime_not_running`.
**Why.** Three signals point at the same conclusion:
- The OpenAPI surface for `internalGetReport`
  (`api/internal-openapi.yaml` lines 546–575) lists only
`403 / 404 / 502 / 500` responses; there is no 409 / `runtime_not_running`
on the report path. The matching error response on commands and
orders (lines 502, 540) does include 409.
- The README §Reports flow (`../README.md` lines 508–520) lists only
authorisation, race-name resolution, and engine forwarding. The
  preceding §Player commands and orders block (lines 492–506) lists the
`status=running` precondition explicitly. The two sections are
separately worded by design.
- A finished or stopped runtime is a normal target for a post-mortem
read of older turns. Refusing the read forces operators to use ad-hoc
database access for the same data the engine already exposes.
The `engine_unreachable` outcome remains the natural failure mode when
the engine container is genuinely gone (e.g., on `engine_unreachable`
status); no extra branch is required.
This decision was confirmed with the user during plan-mode review.
### D2. GM rewrites the engine envelope (`commands` → `cmd`, inject `actor`)
**Decision.**
[`commandexecute.rewriteCommandPayload`](../internal/service/commandexecute/service.go)
and the parallel
[`orderput.rewriteOrderPayload`](../internal/service/orderput/service.go)
unmarshal the GM `ExecuteCommandsRequest` / `PutOrdersRequest` body as
`map[string]json.RawMessage`, take the `commands` field, and emit a
fresh JSON object containing only `actor` (set to the resolved race
name) and `cmd` (carrying the original array). Every other top-level
key is dropped. The OpenAPI descriptions for `ExecuteCommandsRequest`
and `PutOrdersRequest` were updated in the same patch to document the
rewrite.
**Why.** The literal "forwarded verbatim" wording in the original
Stage 06 OpenAPI description conflicted with two upstream constraints:
- The engine `CommandRequest` schema in `game/openapi.yaml` lines
  345–364 declares `actor` and `cmd` as required, with no top-level
`commands`.
- The README §Hot Path rule "GM never trusts a payload field for actor
  identification" (`../README.md` lines 487–490) requires GM to set
`actor` from the authenticated user identity.
Two alternatives were rejected:
- **Move the rewrite into `engineclient`.** The adapter's role is thin
transport; injecting actor (an authorisation concern) into transport
would muddle the boundary and make the adapter test harness
authorisation-aware. The service is the right home.
- **Inject `actor` only and keep the `commands` key.** The engine schema
requires `cmd`; this would require an engine contract change outside
the Stage 16 scope and break Stage 05's frozen path.
The transform is duplicated across the two services rather than
extracted to a shared package. Each implementation is twelve lines and
each service is otherwise independent; a shared package would add
import-edge surface for marginal savings, and the project convention is
to prefer the minimal diff (`CLAUDE.md §Priorities`). The duplication is
explicitly documented in both file-level comments.
This decision was confirmed with the user during plan-mode review.
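The rewrite can be sketched as follows (error handling simplified; the real implementation is duplicated in each service as noted above):

```go
package main

import (
	"encoding/json"
	"errors"
)

// rewriteCommandPayload re-emits only `actor` (the resolved race name,
// never a payload field) and `cmd` (the original `commands` array);
// every other top-level key is dropped.
func rewriteCommandPayload(body []byte, actor string) ([]byte, error) {
	var envelope map[string]json.RawMessage
	if err := json.Unmarshal(body, &envelope); err != nil {
		return nil, err
	}
	commands, ok := envelope["commands"]
	if !ok {
		return nil, errors.New("missing commands field")
	}
	actorJSON, err := json.Marshal(actor)
	if err != nil {
		return nil, err
	}
	return json.Marshal(map[string]json.RawMessage{
		"actor": actorJSON,
		"cmd":   commands,
	})
}
```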
### D3. Hot-path services do not append to `operation_log`
**Decision.** None of the three services emit an `operation_log` entry.
The `Input` shape carries no `OpSource`/`SourceRef` fields. Telemetry
counters
(`gamemaster.command_execute.outcomes`,
`gamemaster.order_put.outcomes`, `gamemaster.report_get.outcomes`) are
the only audit surface.
**Why.** The `operation.OpKind` enum
(`internal/domain/operation/log.go`) intentionally has no value for
command, order, or report — it stops at admin and lifecycle operations.
Every hot-path call would multiply audit volume by the order rate
without adding investigative value: the telemetry counter already
exposes outcome distribution, and the engine itself is the source of
truth for per-command results. Adding three new `OpKind` values would
also bloat the SQL CHECK on `operation_log` with no operational
consumer.
### D4. Membership cache uses a hand-rolled per-game inflight tracker
**Decision.**
[`Cache.fetch`](../internal/service/membership/cache.go) coordinates
concurrent misses on the same `game_id` through a tiny
`map[gameID]*flight` plus a per-flight `done` channel. Joiners block on
`select { case <-existing.done: case <-ctx.Done(): }`. The leader
populates `members` (or `err`) on the flight before closing the channel.
**Why.** `golang.org/x/sync/singleflight` would be a sharper tool, but
adding it as a *direct* dependency (it is currently only an indirect
transitive of other modules in the workspace) requires the
"justification for direct deps" bar set by `CLAUDE.md §Dependencies`.
The cache is the only consumer in `gamemaster`, the implementation is
~30 lines, and a context-cancellable wait is one extra `select` line we
would otherwise have to wrap around `singleflight.Do` anyway. The
cache-internal helper is the cheaper choice.
### D5. Cache returns the raw status string
**Decision.**
[`Cache.Resolve`](../internal/service/membership/cache.go) returns
`(status string, err error)` where the status is the verbatim Lobby
vocabulary (`"active"`, `"removed"`, `"blocked"`) plus the empty string
when the user is not in the roster. Callers compare against
`membershipStatusActive = "active"` directly. There is no typed
wrapper.
**Why.** `ports.Membership.Status` is already `string`
(`internal/ports/lobbyclient.go` line 56); introducing a `MembershipStatus`
domain type purely to be passed through would add boilerplate without
enforcing any invariant Go's type system can check. The hot-path
services need only a single equality check, so a typed enum buys
nothing. A typed enum would also need a defensive fallback for
status values it does not yet know, guarding against future Lobby
additions, which is more decision surface than the cache should own.
### D6. Empty roster slot surfaces as `forbidden`
**Decision.** Two distinct underlying conditions both surface as
`ErrorCodeForbidden` from the three services:
- The membership cache returns the empty string for the requested
`(gameID, userID)`: the user is not present in the Lobby roster.
- The membership cache returns `"active"` but
`playermappingstore.Get(gameID, userID)` returns
`playermapping.ErrNotFound`: the user is an active platform member
but has no engine roster slot.
The second condition is an internal inconsistency (register-runtime
should have installed the row), but the user-visible semantics — "you
are not authorised to act on this game" — are identical to the first.
The structured log captures the underlying cause.
**Why.** Surfacing the second condition as `internal_error` would
expose a 500 for the perfectly routine "user not part of the engine roster"
case and obscure the actual outcome from the gateway and the user. The
inconsistency, if it ever materialises, is an operator concern visible
in the warn-level log and the `forbidden` metric attribution; treating
it as a 5xx would not help operators (who would then ignore the false
alarm) nor users (who only care that they cannot act).
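The collapse reduces to a two-step check; names are illustrative, and the real services log the distinct underlying cause before returning:

```go
package main

const statusActive = "active"

// authorise collapses both "not in roster" and "active but no engine
// slot" into the same forbidden outcome.
func authorise(membershipStatus string, hasMapping bool) (errorCode string, ok bool) {
	if membershipStatus != statusActive {
		return "forbidden", false // absent from (or inactive in) the Lobby roster
	}
	if !hasMapping {
		return "forbidden", false // active member, no engine roster slot
	}
	return "", true
}
```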
## Files landed
**Created:**
- [`../internal/service/membership/{errors.go, cache.go, cache_test.go}`](../internal/service/membership/)
— concurrent LRU cache plus `ErrLobbyUnavailable` sentinel.
- [`../internal/service/commandexecute/{errors.go, service.go, service_test.go}`](../internal/service/commandexecute/)
— command-execute orchestrator and tests.
- [`../internal/service/orderput/{errors.go, service.go, service_test.go}`](../internal/service/orderput/)
— order-put orchestrator and tests.
- [`../internal/service/reportget/{errors.go, service.go, service_test.go}`](../internal/service/reportget/)
— report-get orchestrator and tests.
- This decision record.
**Modified:**
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml) —
rewrote the description fields of `ExecuteCommandsRequest` and
`PutOrdersRequest` to document the GM-side envelope rewrite.
**Reused (not modified):**
- `internal/ports/{engineclient.go, lobbyclient.go,
playermappingstore.go, runtimerecordstore.go}` — every interface and
sentinel was already present.
- `internal/domain/runtime/model.go` — `StatusRunning` constant + the
whole status vocabulary.
- `internal/domain/playermapping/model.go` — `PlayerMapping` and
`ErrNotFound`.
- `internal/domain/operation/log.go` — `Outcome` enum.
- `internal/config/config.go` — `MembershipCacheConfig.{TTL, MaxGames}`
with defaults `30s` / `4096`.
- `internal/telemetry/runtime.go` —
`RecordCommandExecuteOutcome`, `RecordOrderPutOutcome`,
`RecordReportGetOutcome`, `RecordMembershipCacheResult`,
`RecordEngineCall` (already wired in Stage 08).
## Verification
```sh
cd gamemaster
# Membership cache (race-clean concurrency).
go test -race ./internal/service/membership/...
# Each new player service.
go test ./internal/service/commandexecute/...
go test ./internal/service/orderput/...
go test ./internal/service/reportget/...
# Module-wide build + suite.
go build ./...
go test ./...
```
Out-of-scope for this stage: app wiring (Stage 19), service-local
integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).
---
stage: 17
title: Admin operations and Lobby-facing liveness
---
# Stage 17 — Admin operations and Lobby-facing liveness
This decision record captures the non-obvious choices made while
implementing the five Game Master admin/inspect service-layer
operations, including the Lobby-facing liveness reply
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`,
`livenessreply`). Stage 17 is the last service-layer stage before
Stage 18 (health-events consumer) and Stage 19 (REST handlers and
wiring).
## Context
[`../PLAN.md` Stage 17](../PLAN.md) ships five services that close
the GM service surface:
1. `service/adminstop` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime
Manager and CASes `runtime_records.status → stopped`.
2. `service/adminforce` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs
the inner `service/turngeneration` flow synchronously, then sets
`runtime_records.skip_next_tick = true`.
3. `service/adminpatch` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime
Manager and rotates `runtime_records.current_image_ref` plus
`current_engine_version`.
4. `service/adminbanish` — orchestrator behind
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`.
Resolves the race and calls the engine `/admin/race/banish`.
5. `service/livenessreply` — orchestrator behind
`GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own
view of the runtime without ever calling the engine.
The reference precedent for the orchestrator shape (`Input` /
`Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's
`service/registerruntime` and Stage 15's `service/turngeneration`.
Six decisions deviate from a literal reading of the README, the
OpenAPI surface, or the turngeneration precedent. Each is recorded
below.
## Decisions
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
**Decision.**
[`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its
own `UpdateImageInput` struct and `Validate`. The Postgres adapter
gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`.
The existing `UpdateStatus` is **not** repurposed for patch updates.
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
`ExpectedFrom == To`. Patch deliberately keeps the runtime in
`running`, so any attempt to feed `UpdateStatus` with
`ExpectedFrom == To == running` is rejected before the SQL even
runs. Three alternatives were on the table:
- Drop the `runtime.Transition` invariant from `UpdateStatusInput`
to allow self-transitions. That would weaken the CAS validator
for every existing caller — register-runtime, turngeneration,
  health-events consumer — and reintroduce the "accidental no-op
  status update" class of bugs the validator was added to catch.
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning`
edge in `domain/runtime/transitions.go`. Same blast radius as
above, only with stronger semantic baggage in the transition table.
- Add a dedicated `UpdateImage` method that only writes the two
image columns plus `updated_at`. Bounded blast radius (one new
method, one new input struct, one new SQL UPDATE), preserves the
CAS invariant, and matches how Stage 11 already separated
`UpdateScheduling` from `UpdateStatus` for the same reason.
The third option is what shipped. Existing fakes (`registerruntime`,
`turngeneration`, hot-path tests, schedulerticker) carry a no-op
`UpdateImage` stub that returns `errors.New(...)` so a test that
accidentally exercises the new path fails loudly.
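The new port surface can be sketched as follows; field names follow the document's description, but the exact shapes are illustrative:

```go
package main

import (
	"context"
	"errors"
)

// UpdateImageInput carries the CAS guard plus the two image columns;
// no status transition is involved, so the Transition validator never
// has to permit a self-edge.
type UpdateImageInput struct {
	GameID               string
	ExpectedStatus       string // CAS guard: row must still be in this status
	CurrentImageRef      string
	CurrentEngineVersion string
}

func (in UpdateImageInput) Validate() error {
	if in.GameID == "" || in.ExpectedStatus == "" || in.CurrentImageRef == "" {
		return errors.New("update image: missing required field")
	}
	return nil
}

// RuntimeRecordStore gains UpdateImage alongside the existing
// UpdateStatus / UpdateScheduling methods (other methods elided).
type RuntimeRecordStore interface {
	UpdateImage(ctx context.Context, in UpdateImageInput) error
}
```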
### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`
**Decision.**
[`service/adminstop`](../internal/service/adminstop/service.go) reads
the runtime row first; if `Status ∈ {stopped, finished}`, the service
returns `OutcomeSuccess` without calling Runtime Manager and without
publishing a `runtime_snapshot_update`. If `Status == starting`, the
service returns `conflict` with `OutcomeFailure`. Every other
non-terminal status (`running`, `generation_in_progress`,
`generation_failed`, `engine_unreachable`) takes the regular path:
RTM call → CAS → snapshot publication.
**Why.** The README §Stop says "CAS `runtime_records.status: * →
stopped`" but in practice three edge cases pull the service away
from a literal CAS-only implementation:
- `stopped` and `finished` are common operator races: an admin clicks
  "stop" on a UI list while another admin already pressed it (or the
game finished naturally). Returning `conflict` would force the UI
to retry the read and confuse the operator. Idempotent success is
the smallest-surprise behaviour and matches how Lobby's other
admin-cancel flows handle terminal states.
- `starting` is the active engine-init window. RTM has just been
asked to start the container; an admin stop here would race the
init flow and almost certainly leave the system in a partially
cleaned state. The transition table in Stage 10 deliberately
excludes `starting → stopped` for the same reason. Returning
  `conflict` lets the admin tooling surface "runtime is mid-init,
  retry in a moment" instead of pretending the stop succeeded.
- The "obvious" fourth path — letting the CAS validator reject
`starting → stopped` and surface that as the natural conflict —
was rejected because it depends on validator implementation
detail leaking through; the explicit pre-CAS check makes the
intent obvious in the audit log and the structured logs.
The audit log records every pre-CAS rejection with
`outcome=failure / error_code=conflict`, and every idempotent no-op
with `outcome=success`, so operators can distinguish the cases in
post-hoc analysis.
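The pre-CAS check can be sketched as a status switch (status strings from the document; the helper name and outcome vocabulary are illustrative):

```go
package main

const (
	statusStarting = "starting"
	statusStopped  = "stopped"
	statusFinished = "finished"
)

// preCheck returns (handled, errorCode): terminal states short-circuit
// as idempotent success, starting rejects with conflict, every other
// status proceeds to the RTM call, CAS, and snapshot publication.
func preCheck(status string) (handled bool, errorCode string) {
	switch status {
	case statusStopped, statusFinished:
		return true, "" // idempotent no-op success
	case statusStarting:
		return true, "conflict" // mid-init window: refuse
	default:
		return false, "" // regular path
	}
}
```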
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
**Decision.**
[`service/adminforce`](../internal/service/adminforce/service.go)
issues `UpdateScheduling{SkipNextTick=true,
NextGenerationAt=turnResult.Record.NextGenerationAt,
CurrentTurn=turnResult.Record.CurrentTurn}` after every successful
inner turn-generation, regardless of whether `Result.Finished` is
`true`.
**Why.** The cleaner branch — "skip the scheduling write when the
turn just finished the game" — was considered and rejected:
- `turngeneration` already cleared `next_generation_at` and updated
`current_turn` on the finishing branch (Stage 15
`completeFinished`). A redundant write that re-affirms those
values plus sets `skip_next_tick=true` does no harm: the row is
already in `status=finished` and no scheduler tick will ever
consume the flag.
- The branchless code is shorter and the test contract is simpler
  ("adminforce always writes the skip flag on success"). One extra
conditional saves zero SQL on the production path but doubles the
set of cases the test matrix has to assert.
- The README §Force-next-turn wording "After success, set
  `runtime_records.skip_next_tick = true`" is unconditional. Adding
a runtime-side branch would silently weaken that contract.
The driver `op_kind=force_next_turn` audit row records the eventual
outcome (success / failure with the same error code that
turngeneration surfaced) so audit consumers can tell apart a forced
turn that finished the game from a forced turn that prepared the
next regular tick.
### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`
**Decision.**
[`service/adminbanish`](../internal/service/adminbanish/service.go)
reads the runtime row only to retrieve the `engine_endpoint`, then
calls `playermappingstore.GetByRace`. A missing row maps to
`error_code=forbidden`. The runtime status itself is **not**
inspected; banish is dispatched even when the runtime is in
`stopped`, `finished`, or `engine_unreachable`.
**Why.** Two threads informed the choice:
- README §Banish lists only two preconditions: "runtime exists"
  and "`race_name` resolves to an existing player_mappings row".
Adding a status guard would silently extend the contract beyond
what Lobby is allowed to depend on, and would make the banish
flow fail differently from the documented set.
- A banish on a stopped/finished runtime is a no-op at the engine
side (the container is exited or absent). The engine call will
fail with `engine_unreachable`, which is the right error for the
  caller to see — it means "the runtime was stopped before banish
  could land". Pre-rejecting with a different code would hide the
real state from the operator.
The `forbidden` mapping for missing race mirrors Stage 16 D6 ("empty
roster surfaces as `forbidden`"). The frozen error vocabulary does
not contain a `race_not_found` code, and `forbidden` is the
semantically closest match: "the platform user this race belonged
to is no longer authorised to act on the runtime".
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
**Decision.**
[`service/livenessreply`](../internal/service/livenessreply/service.go)
absorbs `runtime.ErrNotFound` into a successful Result with
`Ready=false` and `Status=runtime.Status("")`. The Go-level error
return is reserved for non-business failures only (nil context, nil
receiver, store-read errors, invalid input). A handler that wraps
this service answers 200 with body `{"ready": false, "status": ""}`
when GM has no record for the requested game.
**Why.** README §Liveness reply specifies the endpoint "never calls
the engine; it reflects GM's own view only" and explicitly says it
returns 200 even when the runtime is not running. Three response
shapes were considered:
- 200 with `status="runtime_not_found"`. Mixes runtime-status
values with error codes in the same field, breaking the
caller's enum-match dispatch.
- 404 `runtime_not_found`. Contradicts the README §Liveness reply
  "return `200`" wording and forces Lobby's resume flow to add a
  404 handler that means "no observation" — semantically the same
as `Ready=false`.
- 200 with `status=""`. The empty status reads naturally as «GM
has no observation»; Lobby's resume flow already needs to handle
the `Ready=false` branch and the empty status is exactly what
«no observation» looks like in practice. Chosen for the smallest
caller-side complexity.
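The absorption can be sketched as follows; the sentinel and result shapes are illustrative stand-ins for the real `runtime.ErrNotFound` and service Result:

```go
package main

import "errors"

// errRuntimeNotFound stands in for runtime.ErrNotFound.
var errRuntimeNotFound = errors.New("runtime not found")

type runtimeRecord struct{ Status string }

type livenessResult struct {
	Ready  bool
	Status string
}

// reply turns "no record" into a successful empty-status observation;
// the Go error return is reserved for genuine store failures.
func reply(rec *runtimeRecord, readErr error) (livenessResult, error) {
	if errors.Is(readErr, errRuntimeNotFound) {
		// "No observation" is a business answer, not a failure:
		// the handler answers 200 {"ready": false, "status": ""}.
		return livenessResult{Ready: false, Status: ""}, nil
	}
	if readErr != nil {
		return livenessResult{}, readErr
	}
	return livenessResult{Ready: rec.Status == "running", Status: rec.Status}, nil
}
```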
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
**Decision.** Both `service/adminstop` and `service/adminpatch` map
every error from `RTMClient.Stop` / `RTMClient.Patch` to
`error_code=service_unavailable`, regardless of whether the
underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a
dialler-level transport error.
**Why.** The frozen error vocabulary in
[`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml)
does not contain a `runtime_manager_unavailable` code. Three options
were on the table:
- Add a new code. Rejected: the OpenAPI surface is contract-frozen
from Stage 06 and adding a new error code is a wire-format change
that pulls every consumer into a re-validation. Stage 17 deals
with service-layer code only; no contract change is in scope.
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call
is a sibling-service hop, not an engine call; mixing the two in
a single label confuses operators reading metric / log labels.
- Map RTM failures to `service_unavailable`. Accepted: the
vocabulary already documents `service_unavailable` as «a
steady-state dependency was unreachable for this call», which is
exactly what an RTM outage looks like from GM's perspective.
The Stage 12 D5 decision record in
[`stage12-external-clients.md`](./stage12-external-clients.md)
already records that the RTM adapter wraps every non-success
outcome in `ports.ErrRTMUnavailable` without distinguishing
sub-cases; Stage 17 simply consumes the unified sentinel.
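The mapping collapses to a one-liner; the sketch below is illustrative (the code string and helper name are assumptions, only the behaviour is from the text):

```go
package main

import "errors"

// Assumed stand-in for ports.ErrRTMUnavailable.
var ErrRTMUnavailable = errors.New("runtime manager unavailable")

const codeServiceUnavailable = "service_unavailable"

// mapRTMError maps every RTM client failure — the unified sentinel,
// a wrapped HTTP 5xx, or a dialler-level transport error — to the
// same frozen error code. No sub-case distinction is attempted.
func mapRTMError(err error) (errorCode string, ok bool) {
	if err == nil {
		return "", true
	}
	return codeServiceUnavailable, false
}
```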
## Cross-stage consequences
- The new port surface `RuntimeRecordStore.UpdateImage` is
available to every later consumer; Stage 18 and Stage 19 do not
use it. Existing hand-rolled fakes carry a no-op stub.
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish`
were introduced in Stage 09 / Stage 10 already; Stage 17 is their
first writer.
- The telemetry counter `gamemaster.banish.outcomes` (declared in
Stage 08) gets its first call site in `service/adminbanish`. No
new counters are introduced for `adminstop` / `adminforce` /
`adminpatch` / `livenessreply`; the README §Observability list
does not mention them and Stage 17 deliberately stays inside the
declared instrument set.
- The Stage 19 REST handlers consume the five services without
service-layer changes: each handler decodes the JSON envelope,
fills `Input.OpSource` / `Input.SourceRef` from the
`X-Galaxy-Caller` header convention, and translates `Result.ErrorCode`
into the standard error envelope.
---
stage: 18
title: runtime:health_events consumer
---
# Stage 18 — `runtime:health_events` consumer
This decision record captures the non-obvious choices made while
implementing the asynchronous consumer of the `runtime:health_events`
Redis Stream produced by Runtime Manager. The consumer translates RTM
observations into three effects on Game Master state:
1. Updates `runtime_records.engine_health` per game with a short
summary string.
2. For terminal container events applies a CAS
`running → engine_unreachable`; for `probe_recovered` applies the
symmetric recovery CAS `engine_unreachable → running`.
3. Publishes a debounced `runtime_snapshot_update` on `gm:lobby_events`
only when the engine-health summary or the runtime status actually
changed.
The reference precedent for the worker shape (`Dependencies` /
`NewWorker` / `Run` / `Shutdown` / exported `HandleMessage`) is the
Lobby `gmevents` consumer at `lobby/internal/worker/gmevents`. Seven
decisions deviate from a literal reading of [`../PLAN.md`](../PLAN.md)
or are sharp enough to surface here.
## Decisions
### D1. Event-type taxonomy expanded to seven values
**Decision.** The consumer maps all seven values published by RTM
([`rtmanager/internal/domain/health/snapshot.go`](../../rtmanager/internal/domain/health/snapshot.go)),
not the six listed in PLAN Stage 18. The added values are
`container_started` and `probe_recovered`. Both are mapped to the
summary string `healthy`. `probe_recovered` additionally attempts the
recovery CAS `engine_unreachable → running`. `container_started` does
not transition status — Game Master owns runtime startup through the
register-runtime flow, so RTM's `container_started` observation is
informational at the consumer level.
**Why.** The transition table in
[`internal/domain/runtime/transitions.go`](../internal/domain/runtime/transitions.go)
already declares `engine_unreachable → running` with the comment
`reserved for the Stage 18 consumer; declared here so Stage 18 needs
no transitions edit`. The reserved transition is only useful when an
event in the input stream actually triggers it; the only such event in
RTM's vocabulary is `probe_recovered`. Leaving the two extra event
types unmapped would either drop information (if ignored entirely) or
keep the recovery transition forever unreachable. Mapping them now is
the minimum diff that closes the loop.
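The taxonomy can be pictured as a lookup from event type to effect. The sketch below covers only the events named in this record; the `effect` struct and the remaining terminal container events are assumptions for illustration.

```go
package main

// effect describes what a consumed event does: the summary string
// written to engine_health, and an optional target status for a CAS
// (empty means no status transition is attempted).
type effect struct {
	Summary      string
	TransitionTo string
}

var eventEffects = map[string]effect{
	// Added beyond the six in PLAN Stage 18:
	"container_started": {Summary: "healthy"}, // informational only
	"probe_recovered":   {Summary: "healthy", TransitionTo: "running"},
	// Terminal container events CAS running → engine_unreachable, e.g.:
	"container_oom": {Summary: "oom", TransitionTo: "engine_unreachable"},
	// ...remaining event types from RTM's vocabulary map analogously.
}
```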
### D2. CAS conflict on a status mutation falls back to a health-only update
**Decision.** When the worker plans a status transition (e.g.,
`running → engine_unreachable` for `container_oom`) and
`RuntimeRecordStore.UpdateStatus` returns `runtime.ErrConflict` or
`runtime.ErrInvalidTransition`, the worker logs the conflict at debug
and falls back to `RuntimeRecordStore.UpdateEngineHealth`. The summary
column is refreshed; the status column stays under whatever the
concurrent flow holds.
**Why.** Two flows can hold the runtime row when an RTM event arrives:
turn generation (`generation_in_progress`) and admin operations
(`stopped`, `finished`). Forcing the consumer to win over those flows
would either reintroduce stale-status writes or require expanding the
allowed-transitions table to include every non-terminal source — the
latter weakens the guard that turn generation relies on. The failure
semantics turn-generation already implements (engine call timeout →
`generation_failed`) cover the case where an `oom` arrives while a
turn is in flight: the engine call from turngeneration will fail
naturally a moment later. The consumer's job in that window is to keep
the summary current so operators see «last known: oom» on
`gm:lobby_events`.
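The fallback shape, reduced to its control flow. Function and field names here are hypothetical stand-ins for the worker's internals; only the conflict-to-fallback behaviour is from the decision above.

```go
package main

import "errors"

// Assumed stand-ins for the runtime package sentinels.
var (
	ErrConflict          = errors.New("status conflict")
	ErrInvalidTransition = errors.New("invalid transition")
)

type stores struct {
	updateStatus       func(gameID, from, to string) error
	updateEngineHealth func(gameID, summary string) error
}

// applyEvent plans a status CAS when `to` is non-empty; on conflict
// with a concurrent flow it degrades to a health-only update so the
// summary column stays current while the status stays untouched.
func applyEvent(s stores, gameID, from, to, summary string) (statusChanged bool, err error) {
	if to != "" {
		err = s.updateStatus(gameID, from, to)
		switch {
		case err == nil:
			statusChanged = true
		case errors.Is(err, ErrConflict), errors.Is(err, ErrInvalidTransition):
			// a concurrent flow owns the status; fall through to health-only
		default:
			return false, err
		}
	}
	return statusChanged, s.updateEngineHealth(gameID, summary)
}
```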
### D3. New port method `UpdateEngineHealth`
**Decision.** [`internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
gains a new method `UpdateEngineHealth(ctx, UpdateEngineHealthInput) error`
with its own input struct and `Validate`. The Postgres adapter gains a
matching `UPDATE runtime_records SET engine_health = $1, updated_at =
$2 WHERE game_id = $3`. The existing `UpdateStatus` is **not**
repurposed for health-only updates.
**Why.** `UpdateStatusInput.Validate` calls
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
`ExpectedFrom == To` (Stage 17 D1). A health-only update keeps the
runtime in its current status, so any attempt to feed `UpdateStatus`
with `ExpectedFrom == To` is rejected before the SQL even runs. The
same precedent led Stage 17 to add `UpdateImage` rather than relax the
self-transition guard. Stage 18 follows that precedent.
In addition, the health update is not gated on a CAS at all:
late-arriving events should still bookkeep the summary regardless of the
current status (including `stopped` and `finished`). A guarded
`UpdateStatus`-shaped variant would have to enumerate every source
status the consumer might observe; an unguarded `UpdateEngineHealth`
sidesteps the question.
### D4. In-memory dedupe of last-emitted summaries per game
**Decision.** The worker keeps a `map[string]string` (`gameID →
lastEmittedSummary`) under a `sync.RWMutex`. A snapshot is published
either when the status transitioned in this iteration or when the new
summary differs from the cached one for the same game. The cache is
process-local; on restart it is empty.
**Why.** [`./README.md` §`gm:lobby_events`](../README.md) freezes the
publication rule: snapshots are emitted on transitions and on
health-summary changes («debounced — duplicates are suppressed when the
summary did not change»). Stage 18 chooses an in-process map over a
Redis-backed dedupe for two reasons:
1. Game Master is single-instance in v1
([`./README.md §Non-Goals`](../README.md)); a per-process map is
sufficient for v1 correctness.
2. Losing the cache on restart causes at most one extra snapshot per
game right after restart — Lobby's `gmevents` consumer is
idempotent (CAS-protected status transitions, deterministic
snapshot blob), so the extra emission is benign.
A Redis-backed dedupe is cheap to introduce later if multi-instance
Game Master ever lands; until then the simpler choice ships less code.
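The dedupe fits in a few lines; this sketch assumes a `shouldPublish` helper name and folds the «transition or changed summary» rule into one check:

```go
package main

import "sync"

// summaryCache is the process-local dedupe: gameID → lastEmittedSummary.
// Empty after restart, so at most one extra snapshot per game is emitted.
type summaryCache struct {
	mu   sync.RWMutex
	last map[string]string
}

func newSummaryCache() *summaryCache {
	return &summaryCache{last: make(map[string]string)}
}

// shouldPublish reports whether a snapshot is due — the status
// transitioned this iteration, or the summary changed for the game —
// and records the summary as emitted when it is.
func (c *summaryCache) shouldPublish(gameID, summary string, statusChanged bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !statusChanged && c.last[gameID] == summary {
		return false
	}
	c.last[gameID] = summary
	return true
}
```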
### D5. Snapshot construction reads the runtime row again after the mutation
**Decision.** Whenever the worker decides to publish, it re-reads the
runtime record (`RuntimeRecordStore.Get`) and builds the
`RuntimeSnapshotUpdate` from that fresh row. The `EngineHealthSummary`,
`RuntimeStatus`, and `CurrentTurn` fields therefore reflect whatever
the database holds after the mutation, rather than what the worker
just intended to write.
**Why.** Two paths can produce the same publish decision: the CAS
succeeded (status changed, summary changed), or the CAS conflicted and
the fallback `UpdateEngineHealth` took over (status unchanged from the
worker's point of view, but possibly mutated by a concurrent flow
between the conflict and the read). A single read-after-write reduces
both paths to the same envelope-building code and keeps the snapshot
honest about what is actually in the database. `PlayerTurnStats` is
intentionally left as `nil`: the consumer does not have a fresh engine
state payload, so per-player stats stay empty until the next turn
(this matches [`./README.md` §`gm:lobby_events`](../README.md) for
status-only transitions).
### D6. Stream-offset label is `health_events`
**Decision.** The consumer uses the short label `health_events` for
`StreamOffsetStore.Load` / `Save`. The corresponding Redis key is
`gamemaster:stream_offsets:health_events`.
**Why.** The label convention is documented in
[`./README.md §Persistence Layout / Redis runtime-coordination state`](../README.md):
short logical identifier of the consumer, stable across renames of the
underlying stream key. The Lobby `gmevents` consumer follows the same
shape (`gm_lobby_events`).
### D7. Worker wiring deferred to Stage 19
**Decision.** Stage 18 ships the worker package and unit/loop tests but
does not register the worker as an `app.Component` in
`internal/app/runtime.go`. Wiring is deferred to Stage 19.
**Why.** The same pattern is already in place for the scheduler ticker
introduced at Stage 15: the worker exists in the source tree but is
not wired into `runtime.app = New(cfg, internalServer)`. Stage 19
explicitly bundles handler wiring with worker wiring (see PLAN
Stage 19), so deferring is consistent with the precedent. The
configuration values the wiring will need (stream name, block timeout,
offset-store DSN) are already loaded by `internal/config` and were
introduced in Stage 08.
---
stage: 19
title: Internal REST handlers
---
# Stage 19 — Internal REST handlers
This decision record captures the non-obvious choices made while
bringing the trusted internal REST listener of Game Master to full
contract coverage. The handlers wire the existing service layer
(stages 13–17) and the membership cache (stage 16) to the eighteen
operations frozen by
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml). The
listener lifecycle, OpenTelemetry middleware, and the `/healthz` /
`/readyz` probes were established in stage 08; this stage adds the
per-operation handler subpackage, widens the listener `Dependencies`
struct to thread every service port, and grows
[`../internal/app/wiring.go`](../internal/app/wiring.go) to construct
the entire dependency graph (stores, adapters, services, workers).
The reference precedent for the handler shape is the rtmanager
`internal/api/internalhttp/handlers` tree; the conformance test
mirrors `rtmanager/internal/api/internalhttp/conformance_test.go`.
Eight decisions deviate from a literal reading of
[`../PLAN.md`](../PLAN.md) or are sharp enough to surface here.
## Decisions
### D1. Conformance test lives inside the listener package
**Decision.** The OpenAPI conformance test ships at
[`../internal/api/internalhttp/conformance_test.go`](../internal/api/internalhttp/conformance_test.go),
in the `internalhttp` package, not at
`gamemaster/api/openapi_conformance_test.go` as the literal text of
PLAN.md Stage 19 suggests.
**Why.** The test instantiates the live `Server.handler` through
`NewServer(...)` with stub services and replays each documented
operation against it. That requires reading the unexported
`handler` field and wiring stub implementations of the
handler-package interfaces; both are package-internal concerns that a
sibling test under `gamemaster/api/` would not have access to without
exporting hooks that exist solely for the test. The rtmanager
service ships the analogous test inside its own `internalhttp`
package; we follow the same idiom.
**How to apply.** Future surface-shape audits go in this file.
The PLAN.md text is treated as drift; the constraint that the spec is
covered by kin-openapi-driven validation is honoured exactly.
### D2. `DELETE /engine-versions/{version}` calls `Service.Deprecate`
**Decision.** The handler bound to the OpenAPI operation
`internalDeprecateEngineVersion` calls
[`engineversion.Service.Deprecate`](../internal/service/engineversion/service.go)
and never `Service.Delete`. The 409 response declared by the
spec for `engine_version_in_use` is therefore unreachable on this
endpoint.
**Why.** The operation id and the first sentence of the description
explicitly say «Sets the engine version status to `deprecated`». The
sentence about hard removal and `engine_version_in_use` is a
leftover of an earlier intent — `Service.Deprecate` does not consult
`IsReferencedByActiveRuntime`, so the in-use rejection cannot fire
through this code path. Hard delete is a future Admin Service
operation; v1 does not expose it through REST.
**How to apply.** Calls that need to release the registry row
permanently must use `Service.Delete` directly (not yet wired through
REST). The spec's leftover 409 example is recorded here so a future
contract reviewer does not chase a phantom failure mode.
### D3. Workers wired and started alongside the listener
**Decision.** This stage constructs the scheduler ticker (stage 15)
and the runtime:health_events consumer (stage 18) inside
`wiring.buildWorkers` and registers them as `App.Component`s next
to the internal HTTP server.
**Why.** Stage 19's narrow text says «ship the gateway-, Lobby- and
Admin-facing REST surface backed by the service layer». But the
service layer collaborators referenced from the listener (turn
generation, membership cache, runtime record store, etc.) only make
sense inside a process that is also producing turns and consuming
health events. Keeping the workers idle would leave the wiring graph
half-built and the dev experience surprising. Constructing and
starting them here makes a freshly-deployed process production-ready
the moment the listener accepts traffic.
**How to apply.** The two workers are owned by `App.Run` exactly
like the listener: both `Run` (long-lived) and `Shutdown` are part
of `App.Component`. See D4 for the trivial `Shutdown` added on the
scheduler ticker.
### D4. `schedulerticker.Worker.Shutdown` is a no-op
**Decision.** The scheduler ticker adds a one-line
`Shutdown(_ context.Context) error { return nil }` so the type
satisfies `app.Component`.
**Why.** The worker's `Run` already returns when the supplied
context is cancelled, and `wg.Wait` drains the in-flight per-game
goroutines before `Run` returns. There is nothing additional to
release. The `healtheventsconsumer.Worker` already had a `Shutdown`
from stage 18; this just brings the two workers to the same shape.
**How to apply.** When future workers grow real shutdown logic
(buffered output to flush, persistent connections to drain), they
should embed it inside `Shutdown` rather than relying on context
cancellation alone.
### D5. New `RuntimeRecordStore.List(ctx)` method
**Decision.** The port grows a fifth read method:
`List(ctx) ([]runtime.RuntimeRecord, error)`. The PostgreSQL
adapter implements it as one SELECT ordered by
`(created_at DESC, game_id ASC)`.
**Why.** The OpenAPI operation `internalListRuntimes` accepts an
optional `status` query parameter. With the parameter set, the
existing `ListByStatus` answers; without it, no method on the port
returned every record. Composing the unfiltered list as a
loop-over-statuses would dilute the ordering guarantee and double
the round-trip cost. The new method is additive — every other
caller keeps using its narrow read.
**How to apply.** Test fakes (`fakeRuntimeRecords` in service tests,
`fakeRuntimeRecordsBackend` in scheduler-ticker tests) gained the
method as well. The handler-side `RuntimeRecordsReader` interface
exposes only the three read methods (`Get`, `List`, `ListByStatus`)
so the listener cannot accidentally mutate runtime state.
### D6. `next_generation_at` encodes as `0` when unscheduled
**Decision.** The wire `RuntimeRecord.next_generation_at` field is
declared `required: true` and `format: int64`. The domain holds
`*time.Time` and may carry `nil` — typically while a runtime is in
status `starting` and the first scheduling write has not yet
landed. The encoder writes `0` in that case and writes the UTC
millisecond value otherwise.
**Why.** Encoding `nil` as `0` keeps the wire shape JSON-Schema-valid
without forcing every record reader to handle a missing field.
Optional pointer-typed timestamps (`started_at`, `stopped_at`,
`finished_at`) are still omitted from the JSON form via `omitempty`,
matching the `required` list in the spec.
**How to apply.** Readers must treat `next_generation_at == 0` as
«not yet scheduled» when the status warrants it; the field will
turn into a real Unix-millisecond value once the scheduler's first
write lands. The conformance test seeds a non-nil
`NextGenerationAt`, so the strict response validator never sees
this edge case at the wire boundary.
### D7. Hot-path bodies are pass-through, not strict-decoded
**Decision.** The `internalExecuteCommands` and `internalPutOrders`
handlers read the request body as raw bytes. The body is rejected only
empty or not valid JSON; unknown fields pass through.
**Why.** The OpenAPI request schemas for these operations carry
`additionalProperties: true` because the envelopes are engine-owned
(`galaxy/game/openapi.yaml`). Strict decoding here would reject
legitimate engine extensions and force every contract bump to land
in two services in lockstep.
**How to apply.** Engine `engine_validation_error` responses still
surface as the canonical Game Master error envelope at HTTP 502 —
the engine response body is recorded in `result.RawResponse` for
audit but the OpenAPI spec mandates the error envelope on this code
path. If a future contract version requires forwarding the engine's
4xx body to the gateway, a separate response shape needs to land in
the spec first.
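The pass-through check reduces to a validity gate; a minimal sketch with an assumed helper name:

```go
package main

import (
	"encoding/json"
	"errors"
)

// validatePassThroughBody rejects only empty or syntactically invalid
// JSON; unknown fields flow through untouched because the envelope is
// engine-owned (additionalProperties: true).
func validatePassThroughBody(body []byte) (json.RawMessage, error) {
	if len(body) == 0 {
		return nil, errors.New("empty body")
	}
	if !json.Valid(body) {
		return nil, errors.New("body is not valid JSON")
	}
	return json.RawMessage(body), nil
}
```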
### D8. `X-Galaxy-Caller` mapping with admin default
**Decision.** The `resolveOpSource` helper maps the
`X-Galaxy-Caller` header values to
[`operation.OpSource`](../internal/domain/operation/log.go) as
follows: `gateway → OpSourceGatewayPlayer`,
`lobby → OpSourceLobbyInternal`, `admin → OpSourceAdminRest`.
Missing or unrecognised values fall back to `OpSourceAdminRest`,
matching the contract documented in
[`../README.md` §«Internal REST API»](../README.md).
**Why.** The default is conservative: an Admin Service request
without the header still records as admin instead of being dropped.
The other two values are reserved for the documented callers, and the
mapping trims and lowercases tolerantly so a casing slip in development
does not produce a confusing audit row.
**How to apply.** New REST callers should set the header
explicitly. Adding a fourth caller type requires an `OpSource`
constant alongside the mapping change.
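The mapping can be sketched as follows; the `OpSource` constant values are illustrative (only the constant names appear in the text), the trim/lowercase tolerance and admin default are from the decision:

```go
package main

import "strings"

type OpSource string

// Constant values are assumptions; only the names are documented.
const (
	OpSourceGatewayPlayer OpSource = "gateway_player"
	OpSourceLobbyInternal OpSource = "lobby_internal"
	OpSourceAdminRest     OpSource = "admin_rest"
)

// resolveOpSource maps the X-Galaxy-Caller header tolerantly: values
// are trimmed and lowercased, and anything missing or unrecognised
// falls back to the conservative admin default.
func resolveOpSource(header string) OpSource {
	switch strings.ToLower(strings.TrimSpace(header)) {
	case "gateway":
		return OpSourceGatewayPlayer
	case "lobby":
		return OpSourceLobbyInternal
	default:
		return OpSourceAdminRest
	}
}
```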
## What ships
- Eighteen operation handlers under
[`../internal/api/internalhttp/handlers`](../internal/api/internalhttp/handlers).
- The probe-only `internal/api/internalhttp/server.go` now widens
`Dependencies` and forwards the per-operation services to
`handlers.Register`.
- Full dependency graph in
[`../internal/app/wiring.go`](../internal/app/wiring.go): five
stores, five external adapters, eleven services, two workers.
- `RuntimeRecordStore.List(ctx)` plus its PostgreSQL adapter
implementation and regression tests
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore)).
- `schedulerticker.Worker.Shutdown` so the worker is an
`App.Component`.
- Mockgen-generated handler-port mocks under
[`../internal/api/internalhttp/handlers/mocks`](../internal/api/internalhttp/handlers/mocks).
- A kin-openapi-driven conformance test
([`../internal/api/internalhttp/conformance_test.go`](../internal/api/internalhttp/conformance_test.go))
that validates request and response shapes for every documented
operation against
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml).
- Per-handler unit tests covering happy paths, error-code mapping,
unknown-field rejection, and header validation.
## What remains for later stages
- Lobby refactor (stage 20) flips Lobby's start flow to call
`GET /api/v1/internal/engine-versions/{version}/image-ref`
synchronously and adds the `InvalidateMemberships` outbound call
on every roster mutation.
- Service-local integration suite (stage 21) drives the listener
end-to-end against a real engine container.
- Cross-service integration tests (stages 22–23) cover Lobby + GM,
Lobby + GM + RTM happy and failure paths.