feat: gamemaster
@@ -0,0 +1,62 @@
|
||||
# Stage 01 — Architecture sync
|
||||
|
||||
This decision record captures the non-obvious choice from
[`../PLAN.md` Stage 01](../PLAN.md#stage-01-update-architecturemd):
removing `ships_built` from every architectural mention of
`player_turn_stats`.
|
||||
|
||||
## Context
|
||||
|
||||
Before Stage 01, `ARCHITECTURE.md` and `lobby/README.md` described
|
||||
`player_turn_stats` as carrying `{user_id, planets, population,
|
||||
ships_built}`, and the Race Name Directory capability rule was described in
prose as if `ships_built` could affect the outcome. In practice, the
|
||||
formal capability rule was already
|
||||
`max_planets > initial_planets AND max_population > initial_population`
|
||||
— `ships_built` was named in the stats payload but never referenced by
|
||||
the rule.
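
The rule quoted above can be sketched in Go as follows. This is a minimal illustration, not the Lobby implementation: the `MemberAggregate` type, its `Observe` method, and all Go field types are assumptions; only the payload field names and the rule itself come from this record.

```go
package main

import "fmt"

// PlayerTurnStats mirrors the trimmed payload {user_id, planets, population}.
// The Go types are assumptions; the record only names the fields.
type PlayerTurnStats struct {
	UserID     string `json:"user_id"`
	Planets    int    `json:"planets"`
	Population int    `json:"population"`
}

// MemberAggregate is a hypothetical Lobby-side running-max aggregate.
type MemberAggregate struct {
	InitialPlanets, MaxPlanets       int
	InitialPopulation, MaxPopulation int
}

// Observe folds one turn's stats into the running maxima.
func (a *MemberAggregate) Observe(s PlayerTurnStats) {
	if s.Planets > a.MaxPlanets {
		a.MaxPlanets = s.Planets
	}
	if s.Population > a.MaxPopulation {
		a.MaxPopulation = s.Population
	}
}

// Capable applies the rule:
// max_planets > initial_planets AND max_population > initial_population.
func (a MemberAggregate) Capable() bool {
	return a.MaxPlanets > a.InitialPlanets && a.MaxPopulation > a.InitialPopulation
}

func main() {
	agg := MemberAggregate{InitialPlanets: 1, MaxPlanets: 1, InitialPopulation: 100, MaxPopulation: 100}
	agg.Observe(PlayerTurnStats{UserID: "u1", Planets: 3, Population: 100})
	fmt.Println(agg.Capable()) // planets grew, population did not
	agg.Observe(PlayerTurnStats{UserID: "u1", Planets: 2, Population: 250})
	fmt.Println(agg.Capable()) // both maxima now exceed the initial values
}
```

Nothing in the rule needs a ship count, which is why the field could be dropped without touching the rule.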
|
||||
|
||||
## Decision
|
||||
|
||||
`player_turn_stats` carries `{user_id, planets, population}` only.
|
||||
`ships_built` is removed from:
|
||||
|
||||
- `ARCHITECTURE.md §8 Game Master` — `runtime_snapshot_update` payload
|
||||
description.
|
||||
- `ARCHITECTURE.md §7 Game Lobby` — per-member aggregate description
|
||||
(`current and running-max of planets and population`).
|
||||
- `gamemaster/README.md` — already aligned at the stage-02 README
|
||||
freeze.
|
||||
|
||||
The capability rule wording is unchanged because it was already
|
||||
`planets`/`population`-only; only the surrounding prose mentioning the
|
||||
unused field was inaccurate.
|
||||
|
||||
This is a documentation-only change. No runtime behaviour, wire format,
|
||||
schema, or test fixture is affected.
|
||||
|
||||
## Why
|
||||
|
||||
`ships_built` was unused. Naming it in the contract obliged the
producer (GM) and the consumer (Lobby aggregator) to populate and forward
a field that nothing reads. Dropping it now, before any GM code lands,
|
||||
keeps the contract minimal and avoids future drift between "what the
|
||||
spec lists" and "what the code uses". `lobby/README.md` and the lobby
|
||||
aggregate code are aligned in Stage 03 of the same plan.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- **Keep `ships_built` in the contract for future use.** Rejected: no
|
||||
concrete plan exists for a `ships_built`-driven capability or stat
|
||||
surface; speculative fields rot.
|
||||
- **Retain `ships_built` only as an opaque stat without changing the
|
||||
capability rule.** Rejected: the runtime cost of carrying it is
|
||||
negligible, but the documentation burden of explaining why an unused
|
||||
field is in the payload is not.
|
||||
|
||||
## References
|
||||
|
||||
- [`../PLAN.md` Stage 01](../PLAN.md)
|
||||
- [`../../ARCHITECTURE.md` §7 Game Lobby](../../ARCHITECTURE.md)
|
||||
- [`../../ARCHITECTURE.md` §8 Game Master](../../ARCHITECTURE.md)
|
||||
- [`../README.md`](../README.md) — `player_turn_stats[]` description.
|
||||
@@ -0,0 +1,124 @@
|
||||
---
stage: 03
title: Existing-service docs sync (Lobby, Notification, Game, RTM)
---
|
||||
|
||||
# Stage 03 — Existing-service docs sync
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
synchronising every touched-service README with the post-Game-Master
|
||||
contract before any code change lands. The mechanical edits
|
||||
(strikethrough renames, drop of `ships_built`, replacement of the
|
||||
`engineimage.Resolver` block) are not enumerated here — they are direct
|
||||
consequences of the rules already recorded in
|
||||
[`../README.md`](../README.md) and
|
||||
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md).
|
||||
|
||||
## Context
|
||||
|
||||
Stage 03 had to reach a state where every README in the repository
|
||||
agreed on three new contractual rules before any service-level code
|
||||
landed:
|
||||
|
||||
- `image_ref` is resolved synchronously from `Game Master`'s engine
|
||||
version registry, not from a Go-template held by `Game Lobby`.
|
||||
- A new outgoing `POST /api/v1/internal/games/{game_id}/memberships/invalidate`
|
||||
hook from `Game Lobby` into `Game Master` fires post-commit on every
|
||||
roster mutation.
|
||||
- The engine container splits its REST surface into `/api/v1/admin/*`
|
||||
(GM-only) and `/api/v1/{command,order,report}` (player), and
|
||||
`StateResponse` carries a new boolean `finished` field that GM uses
|
||||
as the sole finish signal.
|
||||
|
||||
Three decisions were not derivable from the GM README and required a
|
||||
deliberate choice while editing `lobby/README.md`, `game/README.md`,
|
||||
and `rtmanager/README.md`.
|
||||
|
||||
## Decision 1 — `lobby.game.start` failure modes for GM-driven image resolve
|
||||
|
||||
`Game Lobby` now calls
|
||||
`GET /api/v1/internal/engine-versions/{version}/image-ref` synchronously
|
||||
before publishing `runtime:start_jobs`. The contract defines two new
|
||||
failure modes for the `lobby.game.start` command:
|
||||
|
||||
- GM unreachable (network error, timeout, `5xx`) ⇒
|
||||
`lobby.game.start` returns `service_unavailable`; the game stays in
|
||||
`ready_to_start`. No container is created, no envelope is published.
|
||||
- GM reports the version is missing or deprecated (`404` or
|
||||
`engine_version_not_found` payload) ⇒ `lobby.game.start` returns
|
||||
`engine_version_not_found`; the game stays in `ready_to_start`.
|
||||
|
||||
Both error codes were added to the stable error code list in
|
||||
`lobby/README.md`. They are deliberately distinct from the existing
|
||||
GM-unavailable-after-container-start path, which transitions the game to
|
||||
`paused` (the container is alive; only platform tracking is missing).
|
||||
Conflating the two would force operators to inspect the `paused` set
|
||||
for misconfigurations that never produced a container.
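
The two failure modes can be sketched as a small classification step on the Lobby side. This is an illustration under stated assumptions: the function name `classifyResolve`, its parameters, and the `errTimeout` value are hypothetical, and folding unexpected non-`5xx` statuses into `service_unavailable` is an assumption the record does not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Error codes named in this record.
const (
	codeServiceUnavailable    = "service_unavailable"
	codeEngineVersionNotFound = "engine_version_not_found"
)

// errTimeout stands in for any transport-level failure (hypothetical).
var errTimeout = errors.New("gm: request timed out")

// classifyResolve maps the outcome of the image-ref call to the error code
// returned by lobby.game.start. An empty string means "proceed to publish".
// In both failure cases the game stays in ready_to_start.
func classifyResolve(status int, payloadCode string, callErr error) string {
	switch {
	case callErr != nil: // network error or timeout
		return codeServiceUnavailable
	case status == http.StatusNotFound || payloadCode == codeEngineVersionNotFound:
		return codeEngineVersionNotFound
	case status >= 200 && status <= 299:
		return ""
	default:
		// 5xx per the record; other unexpected statuses are folded in
		// here as an assumption.
		return codeServiceUnavailable
	}
}

func main() {
	fmt.Printf("%q\n", classifyResolve(0, "", errTimeout))
	fmt.Printf("%q\n", classifyResolve(http.StatusBadGateway, "", nil))
	fmt.Printf("%q\n", classifyResolve(http.StatusNotFound, "", nil))
	fmt.Printf("%q\n", classifyResolve(http.StatusOK, "", nil))
}
```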
|
||||
|
||||
Alternatives considered and rejected:
|
||||
|
||||
- treat GM-unavailable at resolve time as `paused` for symmetry with the
|
||||
later path — rejected because no container exists, so the
|
||||
`lobby.runtime_paused_after_start` admin notification (which announces
|
||||
a stranded container) would be a lie;
|
||||
- silently fall back to a Go-template default when GM is unreachable —
|
||||
rejected because it brings back the very coupling the stage is
|
||||
retiring and lets a misconfigured registry slip through unnoticed.
|
||||
|
||||
## Decision 2 — Membership invalidate hook is fail-open
|
||||
|
||||
The new outgoing
|
||||
`POST /api/v1/internal/games/{game_id}/memberships/invalidate` call from
|
||||
`approveapplication`, `rejectapplication`, `redeeminvite`,
|
||||
`removemember`, `blockmember`, and the user-lifecycle cascade worker is
|
||||
documented as **fail-open**: a non-2xx response is logged and metered
|
||||
but never rolls back the Lobby commit. GM's TTL safety net catches
|
||||
stale data within the next cache TTL window.
|
||||
|
||||
This matches the architectural rule that a failed cross-service hook
|
||||
must not invalidate an already committed business state. The TTL on
|
||||
GM's in-process membership cache (default `30s`) bounds the staleness
|
||||
window; the explicit hook only optimises for the time between commit
|
||||
and TTL expiry.
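
A fail-open post-commit call could look roughly like the sketch below. The helper names (`invalidateMemberships`, `demoFailOpen`) and the stub GM server are hypothetical; only the endpoint path and the "log, never roll back" behaviour come from this record. Metering is omitted for brevity.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// invalidateMemberships calls the GM hook and reports whether GM
// acknowledged it. Every failure is logged and swallowed: the caller has
// already committed and must not roll back.
func invalidateMemberships(ctx context.Context, c *http.Client, gmBaseURL, gameID string) bool {
	endpoint := gmBaseURL + "/api/v1/internal/games/" + url.PathEscape(gameID) + "/memberships/invalidate"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, nil)
	if err != nil {
		log.Printf("membership invalidate: build request: %v", err)
		return false
	}
	resp, err := c.Do(req)
	if err != nil {
		log.Printf("membership invalidate: call failed: %v", err)
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		log.Printf("membership invalidate: GM answered %d", resp.StatusCode)
		return false
	}
	return true
}

// demoFailOpen runs a hypothetical roster mutation against a stub GM that
// always answers gmStatus. It reports whether the Lobby commit survived
// and whether GM acknowledged the hook.
func demoFailOpen(gmStatus int) (committed, acked bool) {
	gm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(gmStatus)
	}))
	defer gm.Close()

	committed = true // stands in for the Lobby transaction commit
	client := &http.Client{Timeout: 2 * time.Second}
	acked = invalidateMemberships(context.Background(), client, gm.URL, "game-1")
	return committed, acked
}

func main() {
	committed, acked := demoFailOpen(http.StatusServiceUnavailable)
	fmt.Println("committed:", committed, "acked:", acked)
}
```

When GM does not acknowledge, the roster change still stands and the cache TTL bounds how long GM may serve stale membership data.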
|
||||
|
||||
Alternatives considered and rejected:
|
||||
|
||||
- two-phase commit across Lobby and GM — rejected: GM is allowed to be
|
||||
unavailable without rolling back Lobby's roster mutation;
|
||||
- queue the invalidation on a Redis Stream and let GM consume it
|
||||
asynchronously — rejected for v1 because it introduces a new stream
|
||||
contract for a rare event, and the synchronous post-commit call is
|
||||
cheap enough that the staleness reduction beats the operational cost.
|
||||
|
||||
## Decision 3 — Keep `runtime:start_jobs` envelope shape unchanged
|
||||
|
||||
The `runtime:start_jobs` envelope continues to carry `image_ref` as a
|
||||
top-level string field. Only the source of that string changes (from a
|
||||
Lobby-side template substitution to a Lobby-side synchronous call into
|
||||
GM). `Runtime Manager` does not need a contract change in this stage
|
||||
and does not learn about engine versions — it still receives a
|
||||
ready-to-pull Docker reference.
|
||||
|
||||
Alternatives considered and rejected:
|
||||
|
||||
- replace `image_ref` with `engine_version` and have RTM resolve the
|
||||
image — rejected: it would force RTM to call GM, which violates the
|
||||
rule that RTM has no upstream service dependencies for runtime
|
||||
operations;
|
||||
- attach the resolved version metadata to the envelope alongside
|
||||
`image_ref` — rejected: RTM has no consumer for the metadata and
|
||||
carrying it would invite divergence between Lobby and RTM views of
|
||||
the engine version registry.
|
||||
|
||||
## References
|
||||
|
||||
- [`../PLAN.md` Stage 03](../PLAN.md)
|
||||
- [`../README.md`](../README.md) — Game Master service description.
|
||||
- [`../../lobby/README.md`](../../lobby/README.md) — updated Game Start
|
||||
Flow, internal trusted REST, configuration, and error codes.
|
||||
- [`../../game/README.md`](../../game/README.md) — admin path layout,
|
||||
`StateResponse.finished`, `/admin/race/banish` shape.
|
||||
- [`../../rtmanager/README.md`](../../rtmanager/README.md) —
|
||||
`runtime:health_events` consumer note.
|
||||
- [`../../notification/README.md`](../../notification/README.md) — GM as
|
||||
the producer of the three `game.*` notification types.
|
||||
@@ -0,0 +1,177 @@
|
||||
---
stage: 06
title: Contract files and contract tests
---
|
||||
|
||||
# Stage 06 — Contract files and contract tests
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
producing the machine-readable contracts for `Game Master`:
|
||||
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml),
|
||||
[`../api/runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml),
|
||||
and the matching contract tests in the `gamemaster` package.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 06](../PLAN.md) freezes the GM REST and event
|
||||
contracts before any handler is written, so later stages have a target
|
||||
spec. The plan enumerates the 20 internal REST `operationId` values and
|
||||
the two `gm:lobby_events` message types and asks contract tests to
|
||||
fail loudly if anything drifts.
|
||||
|
||||
Three decisions were not derivable from `../README.md` or
|
||||
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) and required a
|
||||
deliberate choice while writing the YAML.
|
||||
|
||||
## Decision 1 — Two messages and two send operations on one channel
|
||||
|
||||
`gm:lobby_events` carries two distinct message types — a recurring
|
||||
`runtime_snapshot_update` and a terminal `game_finished`. The AsyncAPI
|
||||
3.1.0 surface encodes them as **two separate messages on one channel
|
||||
with one `send` operation per message**:
|
||||
|
||||
```yaml
channels:
  lobbyEvents:
    address: gm:lobby_events
    messages:
      runtimeSnapshotUpdate: { $ref: '#/components/messages/RuntimeSnapshotUpdate' }
      gameFinished: { $ref: '#/components/messages/GameFinished' }
operations:
  publishRuntimeSnapshotUpdate: { action: send, ... }
  publishGameFinished: { action: send, ... }
```
|
||||
|
||||
The `notification:intents` contract uses a single message with
|
||||
`allOf`-conditional discriminator branches; the `runtime:health_events`
|
||||
contract uses a single message with a `oneOf` `details` field. Both
|
||||
patterns work when most fields are shared and only one variant slot
|
||||
differs.
|
||||
|
||||
For `gm:lobby_events` the two payloads share only `event_type`,
|
||||
`game_id`, `runtime_status`, and `player_turn_stats[]`. The remaining
|
||||
fields (`current_turn`, `engine_health_summary`, `occurred_at_ms` on
|
||||
the snapshot vs `final_turn_number`, `finished_at_ms` on the finish
|
||||
event) have no overlap, and their semantics differ — the snapshot is
|
||||
recurring, the finish event is terminal. Two messages reflect this
|
||||
asymmetry directly and keep each payload schema closed without
|
||||
needing per-variant `if/then` rules.
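
On the consumer side, two closed payloads map naturally onto two Go structs decoded strictly. The sketch below is an assumption-heavy illustration: the field names come from this record, but every Go type, the JSON encoding of the stream entry, and the helper names (`decodeLobbyEvent`, `describe`) are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type PlayerTurnStats struct {
	UserID     string `json:"user_id"`
	Planets    int    `json:"planets"`
	Population int    `json:"population"`
}

// RuntimeSnapshotUpdate is the recurring message.
type RuntimeSnapshotUpdate struct {
	EventType           string            `json:"event_type"`
	GameID              string            `json:"game_id"`
	RuntimeStatus       string            `json:"runtime_status"`
	PlayerTurnStats     []PlayerTurnStats `json:"player_turn_stats"`
	CurrentTurn         int               `json:"current_turn"`
	EngineHealthSummary string            `json:"engine_health_summary"`
	OccurredAtMs        int64             `json:"occurred_at_ms"`
}

// GameFinished is the terminal message.
type GameFinished struct {
	EventType       string            `json:"event_type"`
	GameID          string            `json:"game_id"`
	RuntimeStatus   string            `json:"runtime_status"`
	PlayerTurnStats []PlayerTurnStats `json:"player_turn_stats"`
	FinalTurnNumber int               `json:"final_turn_number"`
	FinishedAtMs    int64             `json:"finished_at_ms"`
}

// decodeStrict rejects unknown fields, mirroring a closed schema.
func decodeStrict(raw []byte, out any) error {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	return dec.Decode(out)
}

// decodeLobbyEvent peeks at event_type, then decodes into the matching type.
func decodeLobbyEvent(raw []byte) (any, error) {
	var head struct {
		EventType string `json:"event_type"`
	}
	if err := json.Unmarshal(raw, &head); err != nil {
		return nil, err
	}
	switch head.EventType {
	case "runtime_snapshot_update":
		var m RuntimeSnapshotUpdate
		if err := decodeStrict(raw, &m); err != nil {
			return nil, err
		}
		return m, nil
	case "game_finished":
		var m GameFinished
		if err := decodeStrict(raw, &m); err != nil {
			return nil, err
		}
		return m, nil
	default:
		return nil, fmt.Errorf("unknown event_type %q", head.EventType)
	}
}

// describe is a small helper for demonstration purposes.
func describe(raw string) string {
	ev, err := decodeLobbyEvent([]byte(raw))
	if err != nil {
		return "rejected"
	}
	if _, ok := ev.(GameFinished); ok {
		return "finished"
	}
	return "snapshot"
}

func main() {
	fmt.Println(describe(`{"event_type":"runtime_snapshot_update","game_id":"g1","current_turn":4}`))
	fmt.Println(describe(`{"event_type":"game_finished","game_id":"g1","final_turn_number":9}`))
	fmt.Println(describe(`{"event_type":"runtime_snapshot_update","final_turn_number":9}`))
}
```

A snapshot that carries a finish-only field is rejected outright, which is the behaviour a closed per-message schema is meant to give.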
|
||||
|
||||
Alternatives considered:
|
||||
|
||||
- **One message with `allOf` discriminator** — rejected: would force
|
||||
every shared field to be optional at the envelope level and
|
||||
re-required inside each `if/then` branch, doubling the schema size
|
||||
and complicating the contract test. The notification spec accepts
|
||||
this cost because it has 18 message types and the payload-shape
asymmetry is the whole point; here there are only two types, and they
have no field overlap beyond the four shared fields.
|
||||
- **Two channels** — rejected: would require Game Lobby to subscribe
|
||||
to two streams, breaking the cadence guarantees in `../README.md`
|
||||
§Async Stream Contracts ("snapshot transitions and finish are
|
||||
ordered relative to each other on the same stream").
|
||||
|
||||
## Decision 2 — `event_type` is a required schema-level `const`
|
||||
|
||||
[`../PLAN.md` Stage 06](../PLAN.md) lists the "frozen field set per
|
||||
message" without naming `event_type`. The implementation pins
|
||||
`event_type` as a required schema property with a `const` value:
|
||||
|
||||
```yaml
RuntimeSnapshotUpdatePayload:
  required: [event_type, ...]
  properties:
    event_type: { type: string, const: runtime_snapshot_update }
```
|
||||
|
||||
Reasons:
|
||||
|
||||
1. The wire payload must carry a discriminator; consumers (Game Lobby)
|
||||
dispatch on `event_type` after `XREAD`. Omitting it from the schema
|
||||
would require Game Master to inject the value at publish time
|
||||
without spec backing.
|
||||
2. `const` at the schema level lets the contract test assert the
|
||||
discriminator value, which is the only meaningful check Stage 06
|
||||
asks for ("`event_type` discriminator values"). Asserting only the
|
||||
message component name without the on-wire `event_type` would not
|
||||
protect consumers from a misconfigured publisher.
|
||||
3. `rtmanager/api/runtime-health-asyncapi.yaml` already uses
|
||||
`event_type` as a schema-level enum-typed discriminator; treating
|
||||
`gm:lobby_events` the same way keeps the patterns consistent for a
|
||||
reader cross-walking the two specs.
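
The kind of assertion that reason 2 describes can be sketched as a walk over the decoded schema. This is a hedged illustration, not the actual contract test: `checkEventTypeConst` and `sampleSchema` are hypothetical names, and the schema is built by hand as the nested map a YAML decoder would typically produce.

```go
package main

import "fmt"

// checkEventTypeConst verifies that a decoded payload schema lists
// event_type as required and pins it to the expected const value.
func checkEventTypeConst(schema map[string]any, want string) error {
	required, _ := schema["required"].([]any)
	found := false
	for _, r := range required {
		if r == "event_type" {
			found = true
		}
	}
	if !found {
		return fmt.Errorf("event_type is not listed in required")
	}
	props, _ := schema["properties"].(map[string]any)
	eventType, _ := props["event_type"].(map[string]any)
	got, _ := eventType["const"].(string)
	if got != want {
		return fmt.Errorf("event_type const is %q, want %q", got, want)
	}
	return nil
}

// sampleSchema builds a hand-written stand-in for the decoded YAML.
func sampleSchema(constValue string, required bool) map[string]any {
	schema := map[string]any{
		"properties": map[string]any{
			"event_type": map[string]any{"type": "string", "const": constValue},
		},
	}
	if required {
		schema["required"] = []any{"event_type"}
	}
	return schema
}

func main() {
	fmt.Println(checkEventTypeConst(sampleSchema("runtime_snapshot_update", true), "runtime_snapshot_update"))
	fmt.Println(checkEventTypeConst(sampleSchema("snapshot_update", true), "runtime_snapshot_update"))
	fmt.Println(checkEventTypeConst(sampleSchema("runtime_snapshot_update", false), "runtime_snapshot_update"))
}
```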
|
||||
|
||||
Alternatives considered:
|
||||
|
||||
- **Leave `event_type` out of the spec and produce it only at the
|
||||
publish-side adapter** — rejected: hides the discriminator from the
|
||||
contract test, which then cannot fail when the publisher renames or
|
||||
drops it.
|
||||
- **Encode discrimination through AsyncAPI message names alone**
|
||||
(relying on `header.X-Message-Type` or similar) — rejected: Redis
|
||||
Streams have no message-headers concept; everything travels in the
|
||||
payload field set.
|
||||
|
||||
## Decision 3 — `additionalProperties: true` on engine pass-through schemas
|
||||
|
||||
Three internal REST operations forward engine-owned payloads without
|
||||
modification:
|
||||
|
||||
- `internalExecuteCommands` — `POST /api/v1/command` on the engine
|
||||
- `internalPutOrders` — `PUT /api/v1/order` on the engine
|
||||
- `internalGetReport` — `GET /api/v1/report` on the engine
|
||||
|
||||
Their request and response bodies use `additionalProperties: true`:
|
||||
|
||||
```yaml
ExecuteCommandsRequest:
  type: object
  additionalProperties: true
  required: [commands]
  properties:
    commands:
      type: array
      items: { type: object, additionalProperties: true }
```
|
||||
|
||||
Game Master does not own the shape of these payloads — `galaxy/game/openapi.yaml`
|
||||
is the source of truth — and freezing them in the GM contract would
|
||||
turn every engine-side schema bump into a coordinated GM release. The
|
||||
same reasoning applies to `EngineVersion.options`, which is a
|
||||
free-form `jsonb` document Game Master stores verbatim.
|
||||
|
||||
To prevent the open-by-default flag from spreading by accident, the
|
||||
contract test
|
||||
[`../contract_openapi_test.go`](../contract_openapi_test.go) maintains
|
||||
two explicit allowlists:
|
||||
|
||||
- `gmOwnedClosedSchemas` — every schema for which Game Master owns
|
||||
the wire shape; the test asserts each one closes with
|
||||
`additionalProperties: false`.
|
||||
- `engineOwnedPassthroughSchemas` — the five pass-through schemas
|
||||
(request and response bodies of the three hot-path operations); the
|
||||
test asserts each one keeps `additionalProperties: true`.
|
||||
|
||||
Adding a new GM schema requires registering it in
|
||||
`gmOwnedClosedSchemas`; the test fails loudly if it isn't.
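
The two-allowlist check can be sketched as below. This is an illustration only: the function `auditAdditionalProperties` is hypothetical, the schema names passed in `main` are examples rather than the real list contents, and the input map stands in for values a real test would read from the OpenAPI file.

```go
package main

import (
	"fmt"
	"sort"
)

// auditAdditionalProperties compares each schema's additionalProperties
// value with the allowlist it is registered in. schemas maps a schema name
// to its additionalProperties value as found in the spec.
func auditAdditionalProperties(schemas map[string]bool, closed, passthrough []string) []string {
	want := map[string]bool{} // schema name -> expected additionalProperties
	for _, name := range closed {
		want[name] = false
	}
	for _, name := range passthrough {
		want[name] = true
	}

	var problems []string
	for name, got := range schemas {
		expected, registered := want[name]
		switch {
		case !registered:
			problems = append(problems, name+": not registered in either allowlist")
		case got != expected:
			problems = append(problems, fmt.Sprintf("%s: additionalProperties is %t, want %t", name, got, expected))
		}
	}
	for name := range want {
		if _, ok := schemas[name]; !ok {
			problems = append(problems, name+": listed but missing from the spec")
		}
	}
	sort.Strings(problems) // stable output regardless of map iteration order
	return problems
}

func main() {
	// Illustrative list contents; the real allowlists live in the test file.
	gmOwnedClosedSchemas := []string{"EngineVersion"}
	engineOwnedPassthroughSchemas := []string{"ExecuteCommandsRequest"}

	spec := map[string]bool{
		"EngineVersion":          false,
		"ExecuteCommandsRequest": true,
		"BrandNewSchema":         false, // added to the spec but never registered
	}
	for _, p := range auditAdditionalProperties(spec, gmOwnedClosedSchemas, engineOwnedPassthroughSchemas) {
		fmt.Println(p)
	}
}
```

Requiring registration in exactly one list is what turns an accidental `additionalProperties: true` into a test failure instead of a silent contract change.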
|
||||
|
||||
Alternatives considered:
|
||||
|
||||
- **Close the pass-through schemas with `additionalProperties: false`
|
||||
and hand-mirror every engine field** — rejected: `galaxy/game` and
|
||||
`galaxy/gamemaster` would have to release in lockstep; even cosmetic
|
||||
field renames in the engine would break Edge Gateway routing.
|
||||
- **Rely on a `# pass-through` comment in the YAML alone.** Rejected:
|
||||
comments do not survive automated reformatters and provide no
|
||||
test-time signal.
|
||||
|
||||
## References
|
||||
|
||||
- [`../PLAN.md` Stage 06](../PLAN.md)
|
||||
- [`../README.md` §Hot Path](../README.md), [`../README.md` §Async Stream Contracts](../README.md)
|
||||
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml)
|
||||
- [`../api/runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml)
|
||||
- [`../contract_openapi_test.go`](../contract_openapi_test.go)
|
||||
- [`../contract_asyncapi_test.go`](../contract_asyncapi_test.go)
|
||||
- [`../../lobby/contract_openapi_test.go`](../../lobby/contract_openapi_test.go) — OpenAPI test pattern reused here.
|
||||
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go) — YAML walker pattern reused here.
|
||||
- [`../../rtmanager/api/runtime-health-asyncapi.yaml`](../../rtmanager/api/runtime-health-asyncapi.yaml) — `event_type` const precedent.
|
||||
@@ -0,0 +1,125 @@
|
||||
---
stage: 07
title: Notification catalog audit
---
|
||||
|
||||
# Stage 07 — Notification catalog audit
|
||||
|
||||
This decision record captures the audit outcome and the freeze-test
|
||||
choice made for the GM-owned notification types
|
||||
(`game.turn.ready`, `game.finished`, `game.generation_failed`).
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 07](../PLAN.md) asks for confirmation that the three
|
||||
notification types `Game Master` will produce in Stage 15 are already
|
||||
wired through the shared producer module
|
||||
[`../../pkg/notificationintent/`](../../pkg/notificationintent/), the
|
||||
`notification` service AsyncAPI contract
|
||||
[`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml),
|
||||
and the catalog freeze in
|
||||
[`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go).
|
||||
The stage is described as "no-op or minor": edits land elsewhere only if
|
||||
the audit finds drift.
|
||||
|
||||
The producer-side surface is consumed in Stage 15 by
|
||||
`gamemaster/internal/adapters/notificationpublisher/`; this stage locks
|
||||
the contract before the publisher is implemented.
|
||||
|
||||
## Audit outcome — no drift
|
||||
|
||||
Each artefact already matches the `Game Master` notification table at
|
||||
[`../README.md` §Notification Contracts](../README.md):
|
||||
|
||||
- [`../../pkg/notificationintent/intent.go`](../../pkg/notificationintent/intent.go)
|
||||
declares `NotificationTypeGameTurnReady`, `NotificationTypeGameFinished`,
|
||||
`NotificationTypeGameGenerationFailed`; `ExpectedProducer` maps the
|
||||
three to `ProducerGameMaster`; `SupportsAudience` and `SupportsChannel`
|
||||
encode `user + (push|email)` for the first two and `admin_email + email`
|
||||
for the failure type.
|
||||
- [`../../pkg/notificationintent/payloads.go`](../../pkg/notificationintent/payloads.go)
|
||||
defines `GameTurnReadyPayload`, `GameFinishedPayload`,
|
||||
`GameGenerationFailedPayload` with the exact field set required by the
|
||||
README table, and exposes `NewGameTurnReadyIntent`,
|
||||
`NewGameFinishedIntent`, `NewGameGenerationFailedIntent`. The
|
||||
user-targeted constructors take `recipientUserIDs`; the admin-email
|
||||
constructor does not.
|
||||
- [`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml)
|
||||
carries the three values in the `notification_type` enum, declares
|
||||
one `if/then` branch each on the envelope, and defines the
|
||||
`GameTurnReadyPayload`, `GameFinishedPayload`,
|
||||
`GameGenerationFailedPayload` schemas with the per-type required
|
||||
fields.
|
||||
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go)
|
||||
freezes the three types inside `expectedNotificationCatalog` and
|
||||
exercises them through `TestIntentAsyncAPISpecFreezesNotificationCatalogBranches`
|
||||
and `TestNotificationCatalogDocsStayInSync`.
|
||||
|
||||
There is no separate "catalog data table" inside `notification/internal/`:
|
||||
the routing decisions live in `pkg/notificationintent/intent.go` and are
|
||||
shared by every producer and by the notification service itself.
|
||||
Consequently no edits to
|
||||
`notification/api/intents-asyncapi.yaml`,
|
||||
`notification/internal/...`, or
|
||||
`notification/contract_asyncapi_test.go` are required by this stage.
|
||||
|
||||
## Decision — producer-side compile-time freeze in addition to the YAML freeze
|
||||
|
||||
[`../notificationintent_audit_test.go`](../notificationintent_audit_test.go)
|
||||
imports `galaxy/notificationintent` from inside the `gamemaster`
|
||||
package. Because the test names every constant, constructor, and
payload struct field directly, any rename or removal in
`pkg/notificationintent` breaks compilation of the `gamemaster` test
binary (`go vet ./gamemaster/...` or `go test ./gamemaster/...`; plain
`go build` skips `_test.go` files) before any assertion runs. At
runtime the test additionally asserts:
|
||||
|
||||
- the wire value of every `NotificationType` constant
|
||||
(`game.turn.ready`, `game.finished`, `game.generation_failed`);
|
||||
- the `Producer`, `AudienceKind`, recipient handling, and `Validate()`
|
||||
outcome of the constructed intent;
|
||||
- the on-wire field names through `Contains` checks against
|
||||
`Intent.PayloadJSON` (catches a JSON tag rename even when the Go
|
||||
struct field name stays);
|
||||
- the audience/channel matrix via `SupportsAudience` and
|
||||
`SupportsChannel`.
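
The on-wire field-name check in the third bullet can be illustrated as follows. This is a sketch with hypothetical types: `turnReadyPayload`, `renamedPayload`, and their fields are invented for the example and do not reflect the real payload structs in `pkg/notificationintent`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// turnReadyPayload is a hypothetical payload with the expected JSON tags.
type turnReadyPayload struct {
	GameID     string `json:"game_id"`
	TurnNumber int    `json:"turn_number"`
}

// renamedPayload keeps the Go field names but changes one JSON tag,
// which a compile-time reference alone would not catch.
type renamedPayload struct {
	GameID     string `json:"gameId"`
	TurnNumber int    `json:"turn_number"`
}

// missingWireFields marshals the payload and returns every expected
// field name that does not appear as a JSON key in the encoded form.
func missingWireFields(payload any, names ...string) []string {
	raw, err := json.Marshal(payload)
	if err != nil {
		return names
	}
	var missing []string
	for _, name := range names {
		if !strings.Contains(string(raw), `"`+name+`":`) {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	fmt.Println(missingWireFields(turnReadyPayload{GameID: "g1", TurnNumber: 7}, "game_id", "turn_number"))
	fmt.Println(missingWireFields(renamedPayload{GameID: "g1", TurnNumber: 7}, "game_id", "turn_number"))
}
```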
|
||||
|
||||
Reasons for adding this in addition to the YAML freeze in
|
||||
`notification/contract_asyncapi_test.go`:
|
||||
|
||||
1. The YAML freeze runs in the `notification` module. A drift in
|
||||
`pkg/notificationintent` that is *consistent* with a drift in
|
||||
`notification/api/intents-asyncapi.yaml` would still be caught, but
|
||||
the failure surface is on the consumer side, not the producer side.
|
||||
The GM-side test fails first and points the engineer at the producer
|
||||
they own.
|
||||
2. The test binds the contract at compile time. A field rename in
`pkg/notificationintent/payloads.go` cannot land without breaking the
build of `gamemaster/notificationintent_audit_test.go`, so `go test`
fails before any assertion runs.
|
||||
3. Stage 15 will introduce a publisher adapter that calls the same
|
||||
constructors. Locking the constructor signatures here removes one
|
||||
class of churn from that stage — the test serves as a contract
|
||||
reference that the adapter has to satisfy.
|
||||
|
||||
Alternatives considered:
|
||||
|
||||
- **YAML re-parse in `gamemaster/`** — rejected: would duplicate the
|
||||
walker logic already present in
|
||||
`notification/contract_asyncapi_test.go` and bind the GM module to
|
||||
the YAML file path through a relative `../notification/` reference.
|
||||
The Go-import test catches the relevant drift class with no
|
||||
cross-module file lookups.
|
||||
- **No GM-side test, rely on the YAML freeze alone** — rejected:
|
||||
Stage 07's exit criterion is "the freeze test passes", which the
PLAN explicitly anchors to a new file under `gamemaster/`. The YAML
freeze alone would also miss a Go-side rename that its author
forgot to mirror in the YAML in the same change.
|
||||
|
||||
## References
|
||||
|
||||
- [`../PLAN.md` Stage 07](../PLAN.md)
|
||||
- [`../README.md` §Notification Contracts](../README.md)
|
||||
- [`../notificationintent_audit_test.go`](../notificationintent_audit_test.go)
|
||||
- [`../../pkg/notificationintent/intent.go`](../../pkg/notificationintent/intent.go)
|
||||
- [`../../pkg/notificationintent/payloads.go`](../../pkg/notificationintent/payloads.go)
|
||||
- [`../../notification/api/intents-asyncapi.yaml`](../../notification/api/intents-asyncapi.yaml)
|
||||
- [`../../notification/contract_asyncapi_test.go`](../../notification/contract_asyncapi_test.go) — YAML-level catalog freeze.
|
||||
@@ -0,0 +1,145 @@
|
||||
---
stage: 08
title: Module skeleton
---
|
||||
|
||||
# Stage 08 — GM module skeleton
|
||||
|
||||
This decision record captures the wiring choices made when bootstrapping
|
||||
the runnable `gamemaster` binary on top of the contracts and freeze
|
||||
tests landed by Stages 01–07.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 08](../PLAN.md) calls for a buildable `gamemaster`
|
||||
process that loads its environment-driven configuration, opens
|
||||
PostgreSQL and Redis pools, installs the OpenTelemetry runtime, exposes
|
||||
`/healthz` and `/readyz` on the trusted internal HTTP listener, and
|
||||
exits cleanly on `SIGTERM` within `GAMEMASTER_SHUTDOWN_TIMEOUT`. No
|
||||
business endpoints, no workers, and no persistence stores yet.
|
||||
|
||||
The reference implementation is `rtmanager`, the most recently landed
|
||||
Galaxy service that follows the platform-wide skeleton conventions
|
||||
(layered `cmd / internal/{app, api, config, logging, telemetry}`,
|
||||
`app.Component` lifecycle, OpenTelemetry runtime with deferred
|
||||
observable gauges, fail-fast environment loader). Stage 08 mirrors that
|
||||
skeleton with two deliberate divergences described below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. `go.mod` scope is minimal at Stage 08
|
||||
|
||||
Only modules actually imported by Stage 08 code land in
|
||||
[`../go.mod`](../go.mod):
|
||||
|
||||
- `galaxy/postgres`, `galaxy/redisconn`, `galaxy/notificationintent`
|
||||
(the last one was already present from the Stage 07 freeze test);
|
||||
- the OpenTelemetry stack (`otel`, `metric`, `trace`, `sdk`,
|
||||
`sdk/metric`, OTLP exporters for traces and metrics over gRPC and
|
||||
HTTP, stdout exporters);
|
||||
- `go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp`;
|
||||
- `github.com/redis/go-redis/v9` (promoted from indirect to direct);
|
||||
- `github.com/jackc/pgx/v5` (transitive via `pkg/postgres`).
|
||||
|
||||
PLAN-listed modules that arrive with later consumers (`go-jet/jet/v2`,
|
||||
`pressly/goose/v3`, the testcontainers modules, `go.uber.org/mock`,
|
||||
`galaxy/cronutil`, `galaxy/error`, `galaxy/util`) are deliberately left
|
||||
out of Stage 08's `go.mod`. They join the module together with their
|
||||
first consumers in Stages 09 / 10 / 11 / 12.
|
||||
|
||||
Reasoning: keeping `go mod tidy` honest at every stage is cheaper than
|
||||
pre-declaring blank-import stubs. The PLAN's full list is the eventual
|
||||
shape of the module across the series, not a Stage 08 contract.
|
||||
|
||||
### 2. `ShutdownTimeout` lives at the top level of `Config`
|
||||
|
||||
The README §Configuration groups one variable —
|
||||
`GAMEMASTER_SHUTDOWN_TIMEOUT` — under a documentation group called
|
||||
"Lifecycle". The Go struct does not split that single field into a
|
||||
substruct: `Config.ShutdownTimeout` mirrors the
|
||||
`rtmanager.Config.ShutdownTimeout` shape so the two services stay
|
||||
isomorphic. The "Lifecycle" group remains a documentation grouping in
|
||||
[`../README.md`](../README.md) only.
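
A flat `Config` with a fail-fast loader might look like the sketch below. It is not the actual `internal/config` code: the `loadConfig` signature and the helper `shutdownTimeoutMillis` are hypothetical, and treating the variable as required and strictly positive is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const envShutdownTimeout = "GAMEMASTER_SHUTDOWN_TIMEOUT"

// Config keeps ShutdownTimeout at the top level, with no Lifecycle substruct.
type Config struct {
	ShutdownTimeout time.Duration
}

// loadConfig reads the variable through lookup (os.LookupEnv in production)
// and fails fast on a missing, malformed, or non-positive value.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	raw, ok := lookup(envShutdownTimeout)
	if !ok {
		return Config{}, fmt.Errorf("%s is required", envShutdownTimeout)
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return Config{}, fmt.Errorf("%s: %w", envShutdownTimeout, err)
	}
	if d <= 0 {
		return Config{}, fmt.Errorf("%s must be positive, got %s", envShutdownTimeout, d)
	}
	return Config{ShutdownTimeout: d}, nil
}

// shutdownTimeoutMillis is a demonstration helper: it returns the parsed
// timeout in milliseconds, or -1 when loading fails.
func shutdownTimeoutMillis(value string, present bool) int64 {
	cfg, err := loadConfig(func(string) (string, bool) { return value, present })
	if err != nil {
		return -1
	}
	return cfg.ShutdownTimeout.Milliseconds()
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("shutdown timeout:", cfg.ShutdownTimeout)
}
```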
|
||||
|
||||
### 3. Telemetry — counters and histograms now, observable gauges later
|
||||
|
||||
`internal/telemetry/runtime.go` registers every counter and histogram
|
||||
listed under [`../README.md` §Observability](../README.md) at process
|
||||
start (`buildRuntime`). The three observable gauges
|
||||
(`gamemaster.runtime_records_by_status`,
|
||||
`gamemaster.scheduler.due_games`, `gamemaster.engine_versions_total`)
|
||||
are declared up front but their callbacks are installed via a deferred
|
||||
`Runtime.RegisterGauges(deps)` call. The wiring layer at Stages 11 / 14
|
||||
/ 15 supplies the probes (per-status row count, due-now scheduler
|
||||
count, registered engine versions) once the persistence stores and the
|
||||
scheduler exist.
|
||||
|
||||
This matches the `rtmanager` pattern where
|
||||
`runtime_records_by_status` is registered through an analogous
|
||||
`RegisterGauges` plumbing.
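
The deferred-gauge idea can be modelled with the standard library alone. The sketch below deliberately does not use the OpenTelemetry API: `Runtime`, `GaugeDeps`, and `Collect` are simplified stand-ins, and only the gauge names and the `RegisterGauges` name come from this record.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// GaugeDeps carries the probes that later stages supply once the
// persistence stores and the scheduler exist.
type GaugeDeps struct {
	DueGames       func() int64
	EngineVersions func() int64
}

// Runtime declares its gauges up front but reports nothing for them
// until RegisterGauges has installed the probes.
type Runtime struct {
	mu   sync.RWMutex
	deps *GaugeDeps
}

// RegisterGauges installs the probes exactly once.
func (r *Runtime) RegisterGauges(deps GaugeDeps) error {
	if deps.DueGames == nil || deps.EngineVersions == nil {
		return errors.New("telemetry: every gauge probe must be set")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.deps != nil {
		return errors.New("telemetry: gauges already registered")
	}
	r.deps = &deps
	return nil
}

// Collect stands in for one metrics collection cycle.
func (r *Runtime) Collect() map[string]int64 {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := map[string]int64{}
	if r.deps == nil {
		return out // skeleton stage: no probes wired yet
	}
	out["gamemaster.scheduler.due_games"] = r.deps.DueGames()
	out["gamemaster.engine_versions_total"] = r.deps.EngineVersions()
	return out
}

func main() {
	rt := &Runtime{}
	fmt.Println(len(rt.Collect())) // nothing observed before registration

	err := rt.RegisterGauges(GaugeDeps{
		DueGames:       func() int64 { return 2 },
		EngineVersions: func() int64 { return 5 },
	})
	fmt.Println(err, rt.Collect())
}
```

The point of the split is that the skeleton can start and export counters and histograms without any store being available.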
|
||||
|
||||
### 4. PostgreSQL migrations are deferred to Stage 09
|
||||
|
||||
The README §Startup dependencies states "Embedded goose migrations
|
||||
apply synchronously before any listener opens." Stage 08 opens,
|
||||
instruments, and pings the PostgreSQL pool but **does not** call
|
||||
`postgres.RunMigrations`. The migrations package
|
||||
(`internal/adapters/postgres/migrations/`) is shipped by Stage 09; the
|
||||
runtime adds the one-line `RunMigrations` call at that stage.
|
||||
|
||||
Until then, the runtime is buildable, listener-ready, and serves
|
||||
`/healthz` + `/readyz` against a fresh PostgreSQL pool with no schema
|
||||
applied. This is acceptable because Stage 08 ships no business handlers
|
||||
and no workers; nothing reads or writes `gamemaster.*` tables yet.
|
||||
|
||||
### 5. Makefile mirrors `rtmanager`
|
||||
|
||||
[`../Makefile`](../Makefile) declares `jet`, `mocks`, `integration`
|
||||
targets identical in shape to `rtmanager/Makefile`. The `jet` target
|
||||
runs `go run ./cmd/jetgen`; the binary lands in Stage 09. The `mocks`
|
||||
target runs `go generate ./internal/ports/...
|
||||
./internal/api/internalhttp/handlers/...`; the `//go:generate`
|
||||
directives land in Stages 10 / 12 / 19. Both targets fail until their
|
||||
prerequisites land — accepted because Stage 08 does not require either
|
||||
to succeed; only `go build` and `go test ./gamemaster/...` matter.
|
||||
|
||||
### 6. No Docker dependency
|
||||
|
||||
`Game Master` is forbidden from importing the Docker SDK
|
||||
([`../README.md` §Non-Goals](../README.md)). The skeleton therefore
|
||||
drops the `newDockerClient` / `pingDocker` helpers from
|
||||
`internal/app/bootstrap.go` and the Docker-related fields from
|
||||
`internal/app/wiring.go`. The readiness probe pings PostgreSQL and
|
||||
Redis only.
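
A readiness probe limited to two dependencies could be sketched like this. The `pinger` interface, `newProbeMux`, and `probeStatus` are hypothetical names, and the one-second ping budget is an assumption; the response bodies follow the smoke-test wording in the Verification section of this record.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the only capability the readiness probe needs.
type pinger interface {
	Ping(ctx context.Context) error
}

type pingFunc func(ctx context.Context) error

func (f pingFunc) Ping(ctx context.Context) error { return f(ctx) }

// newProbeMux wires /healthz (liveness) and /readyz (dependency pings).
func newProbeMux(deps ...pinger) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), time.Second)
		defer cancel()
		for _, dep := range deps {
			if err := dep.Ping(ctx); err != nil {
				http.Error(w, "service_unavailable", http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprint(w, "ready")
	})
	return mux
}

// probeStatus is a demonstration helper: it builds one fake dependency per
// flag (true = responding) and returns the status code for the given path.
func probeStatus(path string, healthy ...bool) int {
	deps := make([]pinger, 0, len(healthy))
	for _, ok := range healthy {
		ok := ok
		deps = append(deps, pingFunc(func(context.Context) error {
			if ok {
				return nil
			}
			return errors.New("dependency down")
		}))
	}
	rec := httptest.NewRecorder()
	newProbeMux(deps...).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(probeStatus("/readyz", true, true))  // PostgreSQL and Redis both respond
	fmt.Println(probeStatus("/readyz", true, false)) // one dependency is down
	fmt.Println(probeStatus("/healthz", false, false))
}
```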
|
||||
|
||||
## Files landed
|
||||
|
||||
- `cmd/gamemaster/main.go` — process entrypoint.
|
||||
- `internal/config/{config.go, env.go, validation.go, config_test.go}` —
|
||||
GAMEMASTER-prefixed env loader plus required-vars fail-fast.
|
||||
- `internal/logging/{logger.go, context.go}` — slog JSON-stdout logger
|
||||
with request id and span id helpers.
|
||||
- `internal/telemetry/{runtime.go, runtime_test.go}` — OpenTelemetry
|
||||
runtime, instruments listed in §Observability, deferred gauge
|
||||
plumbing.
|
||||
- `internal/api/internalhttp/{server.go, server_test.go}` — `/healthz`
|
||||
and `/readyz` listener with observability middleware.
|
||||
- `internal/app/{app.go, app_test.go, bootstrap.go, runtime.go,
|
||||
wiring.go}` — process lifecycle (component supervisor + reverse-order
|
||||
cleanup), Redis bootstrap helpers, minimal placeholder wiring.
|
||||
- `Makefile` — `jet`, `mocks`, `integration` target stubs.
|
||||
- Updated `go.mod` / `go.sum` with the dependencies and replace
|
||||
directives for `galaxy/postgres` and `galaxy/redisconn`.
|
||||
|
||||
## Verification
|
||||
|
||||
- `go build ./gamemaster/...` succeeds.
|
||||
- `go test ./gamemaster/...` passes (existing contract / freeze tests
|
||||
plus the four new test files).
|
||||
- Manual smoke against a local Postgres + Redis confirms:
|
||||
`/healthz` returns `200 ok`, `/readyz` returns `200 ready` while both
|
||||
dependencies respond, and `503 service_unavailable` once one of them
|
||||
is brought down.
|
||||
- `SIGTERM` ends the process within `GAMEMASTER_SHUTDOWN_TIMEOUT`,
|
||||
releasing PostgreSQL pool, Redis client, and telemetry providers in
|
||||
reverse construction order.
|
||||
@@ -0,0 +1,257 @@
|
||||
---
|
||||
stage: 09
|
||||
title: PostgreSQL schema, migrations, jet
|
||||
---
|
||||
|
||||
# Stage 09 — PostgreSQL schema, migrations, jet
|
||||
|
||||
This decision record captures the schema and code-generation pipeline
|
||||
landed for Game Master at PLAN Stage 09. It is a service-local mirror
|
||||
of [`../../rtmanager/docs/postgres-migration.md`](../../rtmanager/docs/postgres-migration.md)
|
||||
but only documents the decisions specific to Stage 09; the stage-24
|
||||
[`postgres-migration.md`](postgres-migration.md) reorganisation will
|
||||
later subsume and supersede this record.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 09](../PLAN.md) finalises the persistence schema
|
||||
and the code-generation pipeline. Stage 08 already opens, instruments,
|
||||
and pings the PostgreSQL pool but does not apply any migrations. The
|
||||
durable surface for runtime state, engine version registry, player
|
||||
mappings, and the audit log is described in
|
||||
[`../README.md` §Persistence Layout](../README.md). Stage 09 ships:
|
||||
|
||||
- `internal/adapters/postgres/migrations/00001_init.sql` plus the
|
||||
matching embed package;
|
||||
- `cmd/jetgen` — a testcontainers-driven regeneration pipeline for
|
||||
the go-jet/v2 query builder code;
|
||||
- the generated jet code under
|
||||
`internal/adapters/postgres/jet/gamemaster/{model,table}/`,
|
||||
committed verbatim;
|
||||
- the `postgres.RunMigrations` call in `internal/app/runtime.go`,
|
||||
applied after the PostgreSQL pool ping and before any listener is
|
||||
built.
|
||||
|
||||
The reference precedent is `rtmanager`, the most recently landed
|
||||
PG-backed service in the workspace.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Schema and role provisioning are excluded from `00001_init.sql`
|
||||
|
||||
**Decision.** The `gamemaster` schema and the matching
|
||||
`gamemasterservice` role are created outside the migration sequence
|
||||
(in tests by [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go)
|
||||
`provisionRoleAndSchema`; in production by an ops init script not in
|
||||
scope for this stage). The embedded migration `00001_init.sql` only
|
||||
contains DDL for the four service-owned tables and indexes and assumes
|
||||
it runs as the schema owner with `search_path=gamemaster`.
|
||||
|
||||
**Why.** [`../../ARCHITECTURE.md` §Database topology](../../ARCHITECTURE.md)
|
||||
mandates that each service connects with its own role whose grants are
|
||||
restricted to its own schema. Mixing role creation, schema creation,
|
||||
and table DDL into one script forces the migration to run as a
|
||||
superuser on every replica boot and effectively relaxes the per-service
|
||||
role boundary. The `rtmanager` precedent settled on the split first;
|
||||
GM follows it for the same architectural reason. This is a deliberate
|
||||
deviation from PLAN Stage 09's literal `CREATE SCHEMA IF NOT EXISTS
|
||||
gamemaster;` instruction, called out in the comment header at the top
|
||||
of `00001_init.sql`.
|
||||
|
||||
### 2. Natural primary keys mirror the platform identifiers
|
||||
|
||||
**Decision.** Every PK except `operation_log.id` is a natural identifier
already owned by another component:
|
||||
|
||||
- `runtime_records.game_id` — Lobby's platform identifier;
|
||||
- `engine_versions.version` — semver string from the registry;
|
||||
- `player_mappings (game_id, user_id)` — composite, both columns owned
|
||||
by Lobby/User Service;
|
||||
- `operation_log.id` — `bigserial`, the only synthetic PK because the
|
||||
audit table has no natural identity per row.
|
||||
|
||||
**Why.** The same reasoning as in
|
||||
[`../../rtmanager/docs/postgres-migration.md` §2](../../rtmanager/docs/postgres-migration.md)
|
||||
applies: surrogate keys would force every cross-service join through a
|
||||
lookup table, while the natural keys keep the persistence layer
|
||||
pin-compatible with the contracts (every `register-runtime` envelope
|
||||
already names `game_id`, every Lobby resolve names `version`, every
|
||||
player command names `user_id`).
|
||||
|
||||
### 3. Defense-in-depth CHECK constraints on every status enum
|
||||
|
||||
**Decision.** Five CHECK constraints reproduce the Go-level enums in
|
||||
the schema:
|
||||
|
||||
- `runtime_records_status_chk` — seven runtime statuses
|
||||
(`starting`, `running`, `generation_in_progress`, `generation_failed`,
|
||||
`stopped`, `engine_unreachable`, `finished`);
|
||||
- `engine_versions_status_chk` — `active | deprecated`;
|
||||
- `operation_log_op_kind_chk` — nine operation kinds
|
||||
(`register_runtime`, `turn_generation`, `force_next_turn`, `banish`,
|
||||
`stop`, `patch`, `engine_version_create`, `engine_version_update`,
|
||||
`engine_version_deprecate`);
|
||||
- `operation_log_op_source_chk` — three op sources
|
||||
(`gateway_player`, `lobby_internal`, `admin_rest`);
|
||||
- `operation_log_outcome_chk` — `success | failure`.
|
||||
|
||||
The Go-level enums in the domain layer (added in Stage 10) remain the
|
||||
source of truth for application code.
|
||||
|
||||
**Why.** The same defense-in-depth argument as for `rtmanager`: the
|
||||
storage boundary catches an adapter regression that would otherwise
|
||||
persist an unexpected string. Operator-side queries (`SELECT … WHERE
|
||||
op_kind = 'patch'`) benefit from the enum being verifiable directly in
|
||||
psql without consulting the Go source. PostgreSQL's `CREATE TYPE … AS
|
||||
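ENUM` was rejected because adding a value to a PG enum type requires
`ALTER TYPE … ADD VALUE`, whose new value cannot be used until that
statement's transaction commits (and which could not run inside a
transaction block at all before PostgreSQL 12); this complicates the single-init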
|
||||
pre-launch policy (decision §6).
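
The Go side of this pairing can be sketched as follows. The status values are the seven listed for `runtime_records_status_chk`, and the `IsKnown` / `AllStatuses` names follow the shape the Stage 10 record attributes to `rtmanager`; the constant names and method bodies are assumptions, not the actual domain code.

```go
package main

import "fmt"

// Status is a runtime status stored as plain text in runtime_records.status.
type Status string

// The seven values accepted by runtime_records_status_chk.
const (
	StatusStarting             Status = "starting"
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusStopped              Status = "stopped"
	StatusEngineUnreachable    Status = "engine_unreachable"
	StatusFinished             Status = "finished"
)

// AllStatuses returns every known status in declaration order.
func AllStatuses() []Status {
	return []Status{
		StatusStarting, StatusRunning, StatusGenerationInProgress,
		StatusGenerationFailed, StatusStopped, StatusEngineUnreachable,
		StatusFinished,
	}
}

// IsKnown reports whether s is one of the declared statuses. An adapter
// can call it before writing, so the CHECK constraint only ever fires on
// a genuine regression.
func (s Status) IsKnown() bool {
	for _, known := range AllStatuses() {
		if s == known {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(Status("running").IsKnown()) // true
	fmt.Println(Status("paused").IsKnown())  // false
}
```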
|
||||
|
||||
### 4. Indexes derive from concrete query shapes
|
||||
|
||||
**Decision.** Three secondary indexes ship with `00001_init.sql`:
|
||||
|
||||
- `runtime_records (status, next_generation_at)` — drives the
|
||||
scheduler ticker scan
|
||||
(`WHERE status='running' AND next_generation_at <= now()` once per
|
||||
second);
|
||||
- `player_mappings (game_id, race_name)` UNIQUE — enforces the
|
||||
invariant that each race name is used at most once per game, at the
storage boundary;
|
||||
- `operation_log (game_id, started_at DESC)` — drives audit reads
|
||||
ordered by recency.
|
||||
|
||||
The README §Persistence Layout list also mentions `player_mappings
|
||||
(game_id)`, which is intentionally **not** added: the composite
|
||||
primary key on `(game_id, user_id)` already serves as a leftmost-prefix
|
||||
index for `WHERE game_id = $1`, and a one-column duplicate would only
|
||||
double the write cost for no plan-stability gain. The README's
|
||||
indexes list is corrected in the same patch to drop the redundant
|
||||
entry.
|
||||
|
||||
**Why.** Each remaining index has a single concrete read shape behind
|
||||
it. The composite ordering on `(status, next_generation_at)` lets the
|
||||
planner satisfy the scheduler scan with one index sweep. The descending
|
||||
ordering on `(game_id, started_at DESC)` matches the
|
||||
`ListByGame ORDER BY started_at DESC` shape already established by
|
||||
`rtmanager.operationlogstore.ListByGame`.
|
||||
|
||||
### 5. `next_generation_at` is nullable
|
||||
|
||||
**Decision.** `runtime_records.next_generation_at timestamptz` admits
|
||||
NULL; `runtime_records.skip_next_tick boolean NOT NULL DEFAULT false`
|
||||
does not.
|
||||
|
||||
**Why.** A row enters the table at register-runtime with
|
||||
`status='starting'` and no scheduled tick yet — the tick is only
|
||||
computed once the engine `/admin/init` succeeds and the CAS flips the
|
||||
status to `running`. NULL captures «no tick scheduled» without forcing
|
||||
a sentinel value into the column. The scheduler index
|
||||
`(status, next_generation_at)` still works correctly: the predicate
|
||||
`next_generation_at <= now()` evaluates to NULL for NULL inputs, and PG
excludes those rows because `WHERE` keeps only rows whose predicate is
true, which is the desired
|
||||
behaviour. `skip_next_tick` is a boolean knob set or cleared by the
|
||||
force-next-turn flow; NULL would be a third state with no semantic, so
|
||||
the column is NOT NULL with a `false` default.
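
The same semantics can be mirrored in Go with a pointer for the nullable column. This is a sketch under stated assumptions: the `record` struct, `isDue`, and the `at` helper are hypothetical and only model the two columns discussed here.

```go
package main

import (
	"fmt"
	"time"
)

// record is a hypothetical, trimmed-down view of a runtime_records row.
type record struct {
	Status           string
	NextGenerationAt *time.Time // nil models SQL NULL: no tick scheduled
	SkipNextTick     bool       // NOT NULL DEFAULT false: exactly two states
}

// isDue mirrors the scheduler predicate
// status='running' AND next_generation_at <= now().
// A nil timestamp never matches, just as a NULL comparison is never true.
func isDue(r record, now time.Time) bool {
	if r.Status != "running" || r.NextGenerationAt == nil {
		return false
	}
	return !r.NextGenerationAt.After(now)
}

// at builds a UTC instant from Unix seconds; it only keeps the demo short.
func at(unix int64) time.Time {
	return time.Unix(unix, 0).UTC()
}

func main() {
	due := at(999)
	fmt.Println(isDue(record{Status: "running", NextGenerationAt: &due}, at(1000))) // true
	fmt.Println(isDue(record{Status: "starting"}, at(1000)))                        // false
}
```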
|
||||
|
||||
### 6. Single-init pre-launch policy applies as documented
|
||||
|
||||
**Decision.** `00001_init.sql` evolves in place until first production
|
||||
deploy. Adding a column, an index, or a new table during the
|
||||
pre-launch development window edits this file directly rather than
|
||||
producing `00002_*.sql`. The runtime applies the migration on every
|
||||
boot; if the schema is already at head, `pkg/postgres`'s goose
|
||||
adapter exits zero.
|
||||
|
||||
**Why.** The schema-per-service architectural rule
|
||||
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
|
||||
endorses a single-init policy for pre-launch services. The pre-launch
|
||||
window allows non-additive changes (column rename, type narrowing,
|
||||
CHECK tightening) that a multi-step migration sequence would force into
|
||||
awkward two-step rewrites. Once the service ships to production, the
|
||||
next schema change becomes `00002_*.sql` and the policy lifts.
|
||||
|
||||
### 7. `cmd/jetgen` is a one-to-one mirror of `rtmanager/cmd/jetgen`
|
||||
|
||||
**Decision.** [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) follows
|
||||
the same shape as
|
||||
[`../../rtmanager/cmd/jetgen/main.go`](../../rtmanager/cmd/jetgen/main.go):
|
||||
spin a `postgres:16-alpine` testcontainer, open it as superuser,
|
||||
provision the role and schema, open a second pool with
|
||||
`search_path=gamemaster`, apply the embedded goose migrations, then
|
||||
invoke `github.com/go-jet/jet/v2/generator/postgres.GenerateDB` with
|
||||
schema=gamemaster. Constants differ (`gamemasterservice`,
|
||||
`gamemaster`, `galaxy_gamemaster`) but the algorithm and helper shape
|
||||
are intentionally identical.
|
||||
|
||||
**Why.** Two PG-backed services should not diverge on a dev-only code
|
||||
generator that nothing else in the workspace relies on. Mirroring
|
||||
`rtmanager` keeps `make -C <service> jet` interchangeable for
|
||||
operators and minimises the cognitive overhead of moving between
|
||||
services.
|
||||
|
||||
### 8. Generated jet code is committed
|
||||
|
||||
**Decision.** The output of `make -C gamemaster jet` lands under
|
||||
[`../internal/adapters/postgres/jet/gamemaster/{model,table}/`](../internal/adapters/postgres/jet/gamemaster)
|
||||
and is committed verbatim.
|
||||
|
||||
**Why.** `go build ./...` from the repository root must work without
|
||||
Docker; CI runners and contributor machines without a local Docker
|
||||
daemon must still pass `go test ./gamemaster/...` for the non-PG-store
|
||||
parts of the module. The generation pipeline itself remains available
|
||||
behind `make jet` for everyone who wants to regenerate.
|
||||
|
||||
### 9. Migrations apply synchronously before any listener opens
|
||||
|
||||
**Decision.** [`../internal/app/runtime.go`](../internal/app/runtime.go)
|
||||
calls `postgres.RunMigrations(ctx, pgPool, migrations.FS(), ".")`
|
||||
immediately after the `postgres.Ping` succeeds and before
|
||||
`newWiring`/`internalhttp.NewServer` are constructed. A non-zero exit
|
||||
on migration failure follows the `pkg/postgres` policy.
|
||||
|
||||
**Why.** [`../README.md` §Startup dependencies](../README.md)
|
||||
specifies that «embedded goose migrations apply synchronously before
|
||||
any listener opens». Repeated process boots against a head schema
|
||||
return goose's «no work to do» success — this is how the policy stays
|
||||
operationally cheap, since a freshly-spawned replica re-applies the
|
||||
same `00001_init.sql` with no work and proceeds straight to opening
|
||||
its listeners.
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
— full schema for the four service tables plus indexes and CHECK
|
||||
constraints.
|
||||
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
|
||||
— `//go:embed *.sql` and `FS()` exporter.
|
||||
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) — testcontainers +
|
||||
goose + jet pipeline.
|
||||
- [`../internal/adapters/postgres/jet/gamemaster/`](../internal/adapters/postgres/jet/gamemaster)
|
||||
— generated model and table packages.
|
||||
- [`../internal/app/runtime.go`](../internal/app/runtime.go) — wired
|
||||
`postgres.RunMigrations` call after the pool ping.
|
||||
- [`../Makefile`](../Makefile) — refreshed `jet` target comment now
|
||||
that the pipeline is real.
|
||||
- [`../go.mod`](../go.mod), [`../go.sum`](../go.sum) — promoted
|
||||
`github.com/go-jet/jet/v2`, `github.com/testcontainers/testcontainers-go`,
|
||||
and `github.com/testcontainers/testcontainers-go/modules/postgres`
|
||||
to direct dependencies.
|
||||
- [`../README.md`](../README.md) — corrected §Persistence Layout
|
||||
indexes list (dropped redundant `player_mappings (game_id)` entry)
|
||||
and added a §References pointer to this record.
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd gamemaster && go mod tidy` — no missing dependency, no
|
||||
superfluous indirect.
|
||||
- `make -C gamemaster jet` — bring up `postgres:16-alpine`, apply
|
||||
`00001_init.sql`, regenerate `internal/adapters/postgres/jet/...`;
|
||||
`git status` is clean after a second run.
|
||||
- `go build ./gamemaster/...` succeeds (including the generated jet
|
||||
code).
|
||||
- `go test ./gamemaster/...` passes — existing contract, freeze, and
|
||||
config/telemetry/HTTP tests are unaffected.
|
||||
- Manual smoke against a local PostgreSQL with an empty `gamemaster`
|
||||
schema and a `gamemasterservice` role: the process applies the
|
||||
migration, `/readyz` returns `200`, and a second boot exits zero on
|
||||
the «no work to do» path.
|
||||
@@ -0,0 +1,184 @@
|
||||
---
|
||||
stage: 10
|
||||
title: Domain layer and ports
|
||||
---
|
||||
|
||||
# Stage 10 — Domain layer and ports
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
introducing the in-memory domain model and port interfaces of Game
|
||||
Master at PLAN Stage 10.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 10](../PLAN.md) freezes the domain types and the
|
||||
port surfaces that adapters (Stages 11/12), services (Stages 13–17), and
|
||||
workers (Stage 18) will adopt. No adapter or service code lands here;
|
||||
the stage exists so every consumer of these types in later stages can
|
||||
import a stable contract.
|
||||
|
||||
The reference precedent is `rtmanager`, the most recently landed
|
||||
PG-backed service. Its
|
||||
[`internal/domain/`](../../rtmanager/internal/domain) and
|
||||
[`internal/ports/`](../../rtmanager/internal/ports) directories define
|
||||
the shape every Stage 10 file follows: `Status string` enums with
|
||||
`IsKnown` / `AllStatuses`; `*InvalidTransitionError` wrapping
|
||||
`ErrInvalidTransition`; transition tables keyed by `(from, to)` pairs;
|
||||
input structs with `Validate()` methods on every store mutation.
|
||||
|
||||
Six decisions deviate from a direct copy of `rtmanager` or extend the
|
||||
literal task list of PLAN Stage 10. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. `internal/domain/operation/` is added beyond the literal task list
|
||||
|
||||
**Decision.** Stage 10 ships
|
||||
[`internal/domain/operation/log.go`](../internal/domain/operation/log.go)
|
||||
with `OperationEntry`, `OpKind`, `OpSource`, and `Outcome` types even
|
||||
though PLAN Stage 10's bullet list does not enumerate them.
|
||||
|
||||
**Why.** The Stage 09
|
||||
[`00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
schema already declares CHECK constraints on `op_kind`, `op_source`,
|
||||
and `outcome`. The
|
||||
[`ports/operationlog.go`](../internal/ports/operationlog.go) interface
|
||||
accepts and returns `OperationEntry` values, which must therefore
|
||||
live in the domain layer or be redefined inside `ports`. The
|
||||
`rtmanager` precedent
|
||||
([`rtmanager/internal/domain/operation/log.go`](../../rtmanager/internal/domain/operation/log.go))
|
||||
treats it as a domain package; mirroring that keeps Game Master's layout
|
||||
recognisable and lets later service code import a single canonical
|
||||
type. The alternative (defining the type on the port file) would
|
||||
duplicate the SQL CHECK enums in two places once Stage 11's adapter
|
||||
ships and would force every service-layer caller to import the port
|
||||
package for what is structurally a value type.
|
||||
|
||||
### 2. `Membership` lives on `ports/lobbyclient.go`, not in the domain
|
||||
|
||||
**Decision.** The DTO consumed by `LobbyClient.GetMemberships` is
|
||||
declared inside
|
||||
[`ports/lobbyclient.go`](../internal/ports/lobbyclient.go) rather than a
|
||||
new `internal/domain/membership/` package.
|
||||
|
||||
**Why.** Game Master does not own membership state — Game Lobby does
|
||||
([`../../ARCHITECTURE.md` §Membership rules](../../ARCHITECTURE.md)).
|
||||
Anything GM holds about membership is a remote projection used solely
|
||||
for hot-path authorisation. Treating it as a port-level DTO matches
|
||||
`rtmanager`'s precedent for cross-service projections
|
||||
([`rtmanager/internal/ports/lobbyinternal.go:LobbyGameRecord`](../../rtmanager/internal/ports/lobbyinternal.go))
|
||||
and keeps the domain layer free of types that GM does not author.
|
||||
Promoting it to a domain package later costs nothing if a real
|
||||
GM-owned invariant ever attaches to it, but the v1 surface has none.
|
||||
|
||||
### 3. `EngineVersion.Options` is `[]byte`, not `map[string]any`
|
||||
|
||||
**Decision.**
|
||||
[`engineversion.EngineVersion.Options`](../internal/domain/engineversion/model.go)
|
||||
is declared as `[]byte` carrying the raw `jsonb` document.
|
||||
|
||||
**Why.** The OpenAPI contract
|
||||
([`../api/internal-openapi.yaml`](../api/internal-openapi.yaml)) marks
|
||||
`EngineVersion.options` as `additionalProperties: true` — the engine
|
||||
owns the schema, GM is a pass-through registry. A `map[string]any` Go
|
||||
field would encourage callers to introspect or mutate keys, breaking
|
||||
that pass-through guarantee. `[]byte` matches how `rtmanager` keeps
|
||||
`Details json.RawMessage` on health snapshots
|
||||
([`rtmanager/internal/domain/health/snapshot.go`](../../rtmanager/internal/domain/health/snapshot.go))
|
||||
for the same reason. Schema-aware handling can introduce a typed shape
|
||||
in a future iteration without disturbing existing rows.
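
To illustrate the pass-through property, the sketch below keeps the options document as opaque bytes and shows that nested structure and key order survive a decode/encode round trip. The struct is a trimmed, hypothetical stand-in; it uses `json.RawMessage` (a named `[]byte` type) so that `encoding/json` emits the bytes as JSON, and how the real struct is serialised is not shown in this record.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EngineVersion is a hypothetical, trimmed stand-in for the domain struct.
type EngineVersion struct {
	Version string `json:"version"`
	// Options is never decoded by the registry; it is stored and
	// forwarded exactly as the engine supplied it.
	Options json.RawMessage `json:"options"`
}

// roundTrip decodes a payload and re-encodes it without inspecting Options.
func roundTrip(in string) (string, error) {
	var ev EngineVersion
	if err := json.Unmarshal([]byte(in), &ev); err != nil {
		return "", err
	}
	out, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	in := `{"version":"1.2.3","options":{"z":1,"a":{"nested":[true,null]}}}`
	out, err := roundTrip(in)
	fmt.Println(out == in, err)
}
```

A `map[string]any` field would re-encode the keys in sorted order and decode numbers as `float64`, which is the kind of silent reshaping the `[]byte` choice avoids.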
|
||||
|
||||
### 4. `Schedule.Next(after, skip)` returns `skipConsumed`, not mutated state
|
||||
|
||||
**Decision.** The wrapper at
|
||||
[`internal/domain/schedule/nexttick.go`](../internal/domain/schedule/nexttick.go)
|
||||
exposes `Next(after time.Time, skip bool) (time.Time, bool)`. The
|
||||
boolean return reports whether the skip flag was consumed; the wrapper
|
||||
itself stores no state.
|
||||
|
||||
**Why.** Persisting `skip_next_tick=false` is a column update on the
|
||||
`runtime_records` row and belongs to the service layer (Stage 15),
|
||||
together with the `next_generation_at` write. Encapsulating that
|
||||
mutation inside the schedule wrapper would couple a pure value type to
|
||||
the store; the boolean return keeps the wrapper trivially testable and
|
||||
lets the caller (service layer) issue the column update via an
|
||||
existing `UpdateScheduling` port call.
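
A stateless `Next` of this shape might look like the sketch below. Two points are assumptions made for illustration: the schedule is a fixed interval (the real wrapper presumably evaluates a cron expression through `galaxy/cronutil`), and `skip` is read as "pass over the upcoming tick and return the one after it".

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a hypothetical fixed-interval stand-in for the real wrapper.
type Schedule struct {
	Every time.Duration
}

// hourly is a sample schedule used by the demo below.
var hourly = Schedule{Every: time.Hour}

// Next returns the first tick strictly after `after`. When skip is true it
// returns the tick after that one and reports the flag as consumed. The
// method keeps no state, so clearing skip_next_tick is left to the caller.
func (s Schedule) Next(after time.Time, skip bool) (time.Time, bool) {
	next := after.Truncate(s.Every).Add(s.Every)
	if skip {
		return next.Add(s.Every), true
	}
	return next, false
}

// at builds a UTC instant on a fixed day; it only keeps the demo short.
func at(hour, minute int) time.Time {
	return time.Date(2025, time.January, 1, hour, minute, 0, 0, time.UTC)
}

func main() {
	next, consumed := hourly.Next(at(12, 30), false)
	fmt.Println(next.Format("15:04"), consumed) // 13:00 false

	next, consumed = hourly.Next(at(12, 30), true)
	fmt.Println(next.Format("15:04"), consumed) // 14:00 true
}
```

The caller would persist the returned time together with `skip_next_tick=false` whenever the boolean is true.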
|
||||
|
||||
### 5. The transition table includes `engine_unreachable → running`
|
||||
|
||||
**Decision.** The runtime transitions map
|
||||
([`internal/domain/runtime/transitions.go`](../internal/domain/runtime/transitions.go))
|
||||
permits `engine_unreachable → running` even though Stage 10's task
|
||||
list does not introduce a producer for that edge.
|
||||
|
||||
**Why.** The Stage 18
|
||||
([`../PLAN.md` Stage 18](../PLAN.md)) health-events consumer must be
|
||||
able to recover an engine that previously appeared unreachable when a
|
||||
subsequent health observation reports `healthy`. Declaring the edge in
|
||||
Stage 10 means Stage 18 needs no transitions.go edit — the consumer
|
||||
calls `UpdateStatus` with the existing CAS guard. The alternative
|
||||
(wait until Stage 18 to add the edge) would couple two unrelated
|
||||
stages and force a domain-level edit during a worker stage.
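
A transition table keyed by `(from, to)` pairs can be sketched as below. Only the `engine_unreachable → running` edge is taken from this record; the other edge, the `CheckTransition` name, and the error message are illustrative assumptions rather than the contents of `transitions.go`.

```go
package main

import (
	"errors"
	"fmt"
)

type Status string

const (
	StatusRunning           Status = "running"
	StatusEngineUnreachable Status = "engine_unreachable"
	StatusFinished          Status = "finished"
)

// ErrInvalidTransition is the sentinel that callers match with errors.Is.
var ErrInvalidTransition = errors.New("invalid runtime status transition")

// InvalidTransitionError carries the offending pair and wraps the sentinel.
type InvalidTransitionError struct {
	From, To Status
}

func (e *InvalidTransitionError) Error() string {
	return fmt.Sprintf("%s -> %s: %v", e.From, e.To, ErrInvalidTransition)
}

func (e *InvalidTransitionError) Unwrap() error { return ErrInvalidTransition }

type edge struct{ from, to Status }

// allowed is an illustrative subset of the table, not the full matrix.
var allowed = map[edge]struct{}{
	{StatusRunning, StatusEngineUnreachable}: {}, // assumed edge
	{StatusEngineUnreachable, StatusRunning}: {}, // recovery edge from this decision
}

// CheckTransition returns nil for a declared edge and a typed error otherwise.
func CheckTransition(from, to Status) error {
	if _, ok := allowed[edge{from, to}]; ok {
		return nil
	}
	return &InvalidTransitionError{From: from, To: to}
}

// isInvalidTransition shows how a caller branches on the sentinel.
func isInvalidTransition(err error) bool {
	return errors.Is(err, ErrInvalidTransition)
}

func main() {
	fmt.Println(CheckTransition(StatusEngineUnreachable, StatusRunning))
	fmt.Println(CheckTransition(StatusFinished, StatusRunning))
}
```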
|
||||
|
||||
### 6. mockgen directives target `internal/adapters/mocks/` (deferred)
|
||||
|
||||
**Decision.** Every port file carries a
|
||||
`//go:generate go run go.uber.org/mock/mockgen
|
||||
-destination=../adapters/mocks/mock_<file>.go -package=mocks
|
||||
galaxy/gamemaster/internal/ports <Interface>` directive even though
|
||||
the destination directory does not exist yet.
|
||||
|
||||
**Why.** Stage 12 ships the
|
||||
[`internal/adapters/mocks/`](../internal/adapters/mocks) directory and
|
||||
the first regeneration of `make mocks`. Putting the directives in
|
||||
place during Stage 10 means Stage 12 only adds the directory and the
|
||||
generated files; no port file has to be edited then. The directives
|
||||
are inert until the destination directory exists; running
|
||||
`go generate ./internal/ports/...` before Stage 12 is expected to
|
||||
fail. The
|
||||
[`Makefile`](../Makefile)'s `mocks` target already references the
|
||||
directives, matching the lobby and rtmanager pattern
|
||||
([`../../lobby/internal/ports/gmclient.go`](../../lobby/internal/ports/gmclient.go),
|
||||
[`../../rtmanager/internal/ports/dockerclient.go`](../../rtmanager/internal/ports/dockerclient.go)).
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/domain/runtime/{model,errors,transitions}.go`](../internal/domain/runtime)
|
||||
with seven-status enum, `RuntimeRecord` struct, and the transition
|
||||
table from PLAN Stage 10 plus decision §5.
|
||||
- [`../internal/domain/engineversion/{model,semver}.go`](../internal/domain/engineversion)
|
||||
with the registry status enum, `EngineVersion` struct, and the
|
||||
`ParseSemver` / `IsPatchUpgrade` helpers.
|
||||
- [`../internal/domain/playermapping/model.go`](../internal/domain/playermapping/model.go)
|
||||
carrying the (game_id, user_id) → race_name + engine_player_uuid
|
||||
projection.
|
||||
- [`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
|
||||
per decision §1.
|
||||
- [`../internal/domain/schedule/nexttick.go`](../internal/domain/schedule/nexttick.go)
|
||||
per decision §4.
|
||||
- Ten port files under
|
||||
[`../internal/ports/`](../internal/ports) covering the runtime
|
||||
record, engine version, player mapping, operation log, stream
|
||||
offset, engine, lobby, runtime manager, notification publisher, and
|
||||
lobby events surfaces.
|
||||
- Unit tests next to every source file; the suite covers status
|
||||
enums, transition matrix, validators, semver normalisation, and
|
||||
schedule skip semantics.
|
||||
- [`../go.mod`](../go.mod) gains direct dependencies on
|
||||
`galaxy/cronutil` and `golang.org/x/mod` for the schedule wrapper
|
||||
and the semver helpers.
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd gamemaster && go build ./...` — clean.
|
||||
- `cd gamemaster && go test ./internal/domain/... ./internal/ports/...`
|
||||
— green; transition matrix exhaustively asserts every allowed and
|
||||
forbidden pair, semver parser rejects shortened forms, schedule
|
||||
wrapper honours both `skip` modes.
|
||||
- `cd gamemaster && go vet ./internal/...` — clean.
|
||||
- `gofmt -l gamemaster/internal` — empty.
|
||||
- Stage 09 contract tests
|
||||
([`../contract_openapi_test.go`](../contract_openapi_test.go),
|
||||
[`../contract_asyncapi_test.go`](../contract_asyncapi_test.go),
|
||||
[`../notificationintent_audit_test.go`](../notificationintent_audit_test.go))
|
||||
remain green; Stage 10 introduces no contract changes.
|
||||
@@ -0,0 +1,242 @@
|
||||
---
|
||||
stage: 11
|
||||
title: Persistence adapters
|
||||
---
|
||||
|
||||
# Stage 11 — Persistence adapters
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the four PostgreSQL stores and the Redis offset store of
|
||||
Game Master at PLAN Stage 11.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 11](../PLAN.md) ships the persistence layer that
|
||||
the service-layer stages (13–17) and the worker stage (18) consume.
|
||||
Stage 09 already shipped the schema, embedded migration, and the
|
||||
generated jet code; Stage 10 fixed the domain types and the port
|
||||
interfaces. Stage 11 plugs concrete adapters into those ports.
|
||||
|
||||
The reference precedent is `rtmanager`, the most recently landed
|
||||
PG-backed service. Its
|
||||
[`internal/adapters/postgres/`](../../rtmanager/internal/adapters/postgres)
|
||||
and
|
||||
[`internal/adapters/redisstate/`](../../rtmanager/internal/adapters/redisstate)
|
||||
trees define the shape every Stage 11 file follows: per-store package
|
||||
under `postgres/<store>/store.go`, helper packages under
|
||||
`internal/sqlx` and `internal/pgtest`, `Config`/`Store`/`New` triple,
|
||||
ColumnList-driven canonical SELECTs, `sqlx.WithTimeout`/`sqlx.IsNoRows`/
|
||||
`sqlx.IsUniqueViolation` shared boundary helpers.
|
||||
|
||||
Eight decisions either deviate from a literal copy of `rtmanager` or
|
||||
extend the literal task list of PLAN Stage 11. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. `internal/sqlx` and `internal/pgtest` are local clones, not a shared module
|
||||
|
||||
**Decision.**
|
||||
[`internal/adapters/postgres/internal/sqlx/sqlx.go`](../internal/adapters/postgres/internal/sqlx/sqlx.go)
|
||||
and
|
||||
[`internal/adapters/postgres/internal/pgtest/pgtest.go`](../internal/adapters/postgres/internal/pgtest/pgtest.go)
|
||||
are full copies of `rtmanager`'s sibling files; only the few constants
that name the schema and role (`gamemaster`, `gamemasterservice`,
`galaxy_gamemaster`) differ.
|
||||
|
||||
**Why.** Each PG-backed service owns its own role, schema, and
|
||||
migration FS. Promoting these helpers into `pkg/postgres` would force
|
||||
that package to either know about every schema or take them as
|
||||
configuration; either path adds surface area for a runtime helper that
|
||||
already covers exactly one boundary. The `rtmanager` precedent settled
|
||||
on the per-service clone first and Game Master mirrors it for the
|
||||
same architectural reason. The duplication cost is small (≈250 lines
|
||||
total, mechanical) and the alternative would couple services through a
|
||||
testing concern that has no business in production code.
|
||||
|
||||
### 2. CAS via `(game_id, status)` predicate, not `SELECT … FOR UPDATE`
|
||||
|
||||
**Decision.**
|
||||
[`runtimerecordstore.UpdateStatus`](../internal/adapters/postgres/runtimerecordstore/store.go)
|
||||
encodes the compare-and-swap as a `WHERE game_id = $1 AND status = $2`
|
||||
predicate on a single `UPDATE`, then probes the row's existence on
|
||||
`RowsAffected == 0` to distinguish `runtime.ErrConflict` (status
|
||||
changed concurrently) from `runtime.ErrNotFound` (row absent).
|
||||
|
||||
**Why.** Same reasoning as
|
||||
[`rtmanager/docs/postgres-migration.md` §CAS](../../rtmanager/docs/postgres-migration.md):
|
||||
holding a `SELECT … FOR UPDATE` lock would block every other tick on
|
||||
the same game while the Go code computed the next status, lengthening
|
||||
the locked region for no correctness gain. The CAS-only path is
|
||||
verified by `TestUpdateStatusConcurrentCAS` (8 goroutines, exactly one
|
||||
winner).
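
The control flow can be modelled in memory as below. This is a sketch, not the adapter: a mutex-guarded map stands in for the `runtime_records` table, the existence probe happens under the same lock instead of as a follow-up query, and the `memStore` and `race` names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	ErrConflict = errors.New("runtime: status changed concurrently")
	ErrNotFound = errors.New("runtime: record not found")
)

// memStore stands in for the runtime_records table (game_id -> status).
type memStore struct {
	mu   sync.Mutex
	rows map[string]string
}

// UpdateStatus mimics UPDATE ... WHERE game_id = $1 AND status = $2.
// When no row matches, it probes existence to pick the right error.
func (s *memStore) UpdateStatus(gameID, expected, next string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	current, exists := s.rows[gameID]
	if exists && current == expected {
		s.rows[gameID] = next // one row affected
		return nil
	}
	if !exists {
		return ErrNotFound
	}
	return ErrConflict
}

// race lets n goroutines attempt the same transition and counts the winners.
func race(s *memStore, n int) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		winners int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := s.UpdateStatus("g1", "starting", "running"); err == nil {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	s := &memStore{rows: map[string]string{"g1": "starting"}}
	fmt.Println(race(s, 8)) // 1
	fmt.Println(s.UpdateStatus("missing", "starting", "running"))
}
```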
|
||||
|
||||
### 3. Port-level deviation: `UpdateEngineVersionInput.Now` and `Deprecate(ctx, version, now)`
|
||||
|
||||
**Decision.**
|
||||
[`ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
|
||||
gains a `Now time.Time` field on `UpdateEngineVersionInput` (validated
|
||||
by `Validate` to be non-zero) and a `now time.Time` argument on
|
||||
`Deprecate`. The corresponding port-level test fixtures in
|
||||
`engineversionstore_test.go` are updated to carry the new value.
|
||||
|
||||
**Why.** Stage 10's literal port did not include a wall-clock for the
|
||||
engine-version mutators, while
|
||||
[`UpdateStatusInput`](../internal/ports/runtimerecordstore.go) and
|
||||
[`UpdateSchedulingInput`](../internal/ports/runtimerecordstore.go) do.
|
||||
Without `Now` in the input, the adapter would have to either call
|
||||
`time.Now()` directly (loses test determinism) or accept a `Clock`
|
||||
dependency in `Config` (adds adapter infrastructure for a single use
|
||||
case). Aligning the inputs is a small, targeted contract change
|
||||
allowed by the pre-launch single-init policy and consistent with the
|
||||
clock-from-input convention adopted everywhere else in the service.
|
||||
|
||||
### 4. Domain-level conflict sentinels `engineversion.ErrConflict` and `playermapping.ErrConflict`
|
||||
|
||||
**Decision.** The domain packages
|
||||
[`engineversion`](../internal/domain/engineversion/model.go) and
|
||||
[`playermapping`](../internal/domain/playermapping/model.go) gain
|
||||
`ErrConflict` sentinels. Adapters surface PostgreSQL unique violations
|
||||
as `fmt.Errorf("...: %w", <pkg>.ErrConflict)` so service callers can
|
||||
branch with `errors.Is`.
|
||||
|
||||
**Why.** `runtime.ErrConflict` already exists in the runtime package
|
||||
and the rest of the codebase (lobby, rtmanager, notification) uses
|
||||
domain-level conflict sentinels (e.g.
|
||||
`membership.ErrConflict`,
|
||||
`runtime.ErrConflict`). Returning a generic wrapped error for
|
||||
engine-version and player-mapping conflicts would break the
|
||||
established pattern and force the service layer to carry adapter
|
||||
implementation knowledge (`sqlx.IsUniqueViolation`). Adding two
|
||||
sentinels is a small, idiomatic deviation from PLAN Stage 11's bullet
|
||||
list, called out here so future contract diffs do not re-litigate it.
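
The wrapping pattern can be sketched with the standard library alone. The in-memory `insertVersion` function and the `isConflict` helper are hypothetical; in the real adapter the trigger would be a PostgreSQL unique violation detected through `sqlx.IsUniqueViolation`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict plays the role of engineversion.ErrConflict.
var ErrConflict = errors.New("engine version conflict")

// insertVersion is a hypothetical adapter method. A map models the
// primary key on engine_versions.version; a duplicate key is translated
// into the domain sentinel with %w so that context and identity both survive.
func insertVersion(existing map[string]bool, version string) error {
	if existing[version] {
		return fmt.Errorf("insert engine version %q: %w", version, ErrConflict)
	}
	existing[version] = true
	return nil
}

// isConflict is what a service-layer caller does: it branches on the
// sentinel without knowing anything about the storage driver.
func isConflict(err error) bool {
	return errors.Is(err, ErrConflict)
}

func main() {
	versions := map[string]bool{}
	fmt.Println(insertVersion(versions, "1.0.0")) // <nil>

	err := insertVersion(versions, "1.0.0")
	fmt.Println(err)
	fmt.Println(isConflict(err)) // true
}
```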
|
||||
|
||||
### 5. `Options` jsonb requires explicit `CAST(... AS jsonb)` in dynamic UPDATE
|
||||
|
||||
**Decision.** In
|
||||
[`engineversionstore.Update`](../internal/adapters/postgres/engineversionstore/store.go)
|
||||
the dynamic assignment for `options` wraps the value in
|
||||
`pg.StringExp(pg.CAST(pg.String(...)).AS("jsonb"))`. The plain
|
||||
`pg.String(...)` literal makes PostgreSQL infer the right-hand side as
|
||||
`text` and the assignment to a `jsonb` column then fails with
|
||||
SQLSTATE `42804` (`column is of type jsonb but expression is of type
|
||||
text`).
|
||||
|
||||
**Why.** `INSERT ... VALUES(...)` paths bind the `[]byte` through pgx,
|
||||
which knows how to coerce text into jsonb at the protocol level.
|
||||
Dynamic `UPDATE … SET options = '...'` does not go through that bind
|
||||
because the SQL contains a string literal directly; PostgreSQL applies
|
||||
its own type inference and fails. Using
|
||||
[`jet`'s `CAST`](https://pkg.go.dev/github.com/go-jet/jet/v2/postgres#CAST)
|
||||
is the cleanest way to force the right-hand-side type without dropping
|
||||
to raw SQL. Storing `'{}'::jsonb` as the empty default mirrors the SQL
|
||||
column default.
|
||||
|
||||
### 6. `Deprecate` is idempotent through a pre-check `Get`
|
||||
|
||||
**Decision.**
|
||||
[`engineversionstore.Deprecate`](../internal/adapters/postgres/engineversionstore/store.go)
|
||||
runs `Get(version)` first to distinguish three cases: row absent
|
||||
(return `engineversion.ErrNotFound`), row already deprecated (return
|
||||
`nil` with no further mutation), row active (run the
|
||||
`UPDATE ... SET status='deprecated'`). Without the pre-check the
|
||||
adapter would have to interpret `RowsAffected == 0` against an
|
||||
ambiguous SQL guard (`WHERE version = ? AND status != 'deprecated'`).
|
||||
|
||||
**Why.** Deprecation is a relatively rare admin operation; the extra
|
||||
read costs roughly one millisecond and removes the ambiguity. The
|
||||
alternative is the same `classifyMissingUpdate` probe pattern used by
|
||||
`UpdateStatus`, which would still need a Get to tell "missing" from
|
||||
"already deprecated". The pre-check is the simplest path.
|
||||
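
The three-way pre-check described above can be sketched against an in-memory store. This is an assumption-laden illustration: `memStore` and its `updates` counter are hypothetical, and the real adapter performs the same branching around a SQL `Get` and `UPDATE`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound stands in for engineversion.ErrNotFound.
var ErrNotFound = errors.New("engine version not found")

// memStore is a hypothetical in-memory replacement for the PostgreSQL
// adapter; updates counts how many mutations were actually issued.
type memStore struct {
	status  map[string]string
	updates int
}

// Deprecate reads first, then decides between the three cases:
// absent, already deprecated, or active.
func (s *memStore) Deprecate(version string) error {
	current, ok := s.status[version]
	switch {
	case !ok:
		return ErrNotFound // row absent
	case current == "deprecated":
		return nil // already deprecated: no further mutation
	}
	s.status[version] = "deprecated" // row active: run the update
	s.updates++
	return nil
}

func main() {
	store := &memStore{status: map[string]string{"1.0.0": "active"}}
	fmt.Println(store.Deprecate("1.0.0"), store.Deprecate("1.0.0"), store.updates)
	fmt.Println(store.Deprecate("9.9.9"))
}
```

The second call returns `nil` without touching the row, which is the idempotency the decision is after.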
|
||||
### 7. `BulkInsert` ships every row in one multi-row `INSERT`, not a transaction
|
||||
|
||||
**Decision.**
|
||||
[`playermappingstore.BulkInsert`](../internal/adapters/postgres/playermappingstore/store.go)
|
||||
emits a single `INSERT ... VALUES (a), (b), …` with as many tuples as
|
||||
the input slice. Any unique-violation rolls back every row in the same
|
||||
statement.
|
||||
|
||||
**Why.** The atomicity guarantee Game Master needs (no partial
|
||||
roster) is already provided by PostgreSQL's per-statement implicit
|
||||
transaction; wrapping the same rows in `BEGIN; INSERT; INSERT; COMMIT`
|
||||
buys nothing and adds round-trips. The multi-row form is also the
|
||||
only path that lets jet's
|
||||
[`InsertStatement.VALUES(...)`](https://pkg.go.dev/github.com/go-jet/jet/v2/postgres#InsertStatement)
|
||||
chain without escape hatches. Atomicity is verified end-to-end by
|
||||
[`TestBulkInsertAtomicConflictRaceName`](../internal/adapters/postgres/playermappingstore/store_test.go)
|
||||
(3 valid rows + 1 conflicting → 0 rows persisted).
|
||||
|
||||
### 8. `miniredis/v2` is a direct gamemaster dependency
|
||||
|
||||
**Decision.**
|
||||
[`go.mod`](../go.mod) gains `github.com/alicebob/miniredis/v2` as a
|
||||
direct dependency. The
|
||||
[`streamoffsets` test suite](../internal/adapters/redisstate/streamoffsets/store_test.go)
|
||||
uses `miniredis.RunT(t)` per test for full isolation.
|
||||
|
||||
**Why.** Same reasoning as `rtmanager`: an in-memory Redis is faster
|
||||
than testcontainers Redis, fully isolated per test, and fits the
|
||||
shape of the offset-store API. Adding it as a direct dep matches the
|
||||
pattern in the repo (`rtmanager`, `notification`, `lobby` all do this
|
||||
for similar adapter test suites).
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/domain/engineversion/model.go`](../internal/domain/engineversion/model.go)
|
||||
— `ErrConflict` sentinel.
|
||||
- [`../internal/domain/playermapping/model.go`](../internal/domain/playermapping/model.go)
|
||||
— `ErrConflict` sentinel.
|
||||
- [`../internal/ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
|
||||
— `Now` field, `Deprecate(ctx, version, now)` signature.
|
||||
- [`../internal/ports/engineversionstore_test.go`](../internal/ports/engineversionstore_test.go)
|
||||
— port-level fixtures plus the new `now must not be zero` reject
|
||||
case.
|
||||
- [`../internal/adapters/postgres/internal/sqlx/sqlx.go`](../internal/adapters/postgres/internal/sqlx/sqlx.go)
|
||||
— `WithTimeout`, `IsNoRows`, `IsUniqueViolation`, `Nullable*`
|
||||
helpers (mirror of `rtmanager`).
|
||||
- [`../internal/adapters/postgres/internal/pgtest/pgtest.go`](../internal/adapters/postgres/internal/pgtest/pgtest.go)
|
||||
— testcontainers harness scoped to the `gamemaster` schema and
|
||||
service role.
|
||||
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go)
|
||||
with full `_test.go`.
|
||||
- [`../internal/adapters/postgres/engineversionstore/store.go`](../internal/adapters/postgres/engineversionstore/store.go)
|
||||
with full `_test.go`.
|
||||
- [`../internal/adapters/postgres/playermappingstore/store.go`](../internal/adapters/postgres/playermappingstore/store.go)
|
||||
with full `_test.go`.
|
||||
- [`../internal/adapters/postgres/operationlog/store.go`](../internal/adapters/postgres/operationlog/store.go)
|
||||
with full `_test.go`.
|
||||
- [`../internal/adapters/redisstate/keyspace.go`](../internal/adapters/redisstate/keyspace.go).
|
||||
- [`../internal/adapters/redisstate/streamoffsets/store.go`](../internal/adapters/redisstate/streamoffsets/store.go)
|
||||
with full `_test.go`.
|
||||
- [`../go.mod`](../go.mod), [`../go.sum`](../go.sum) — `miniredis/v2`
|
||||
promoted to a direct dependency.
|
||||
- [`../README.md`](../README.md) — §References pointer to this
|
||||
record.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
|
||||
cd gamemaster
|
||||
|
||||
# Domain + port unit tests still pass after the Stage-11 contract
|
||||
# touch-ups.
|
||||
go test ./internal/domain/... ./internal/ports/...
|
||||
|
||||
# All adapter test suites (require Docker for testcontainers; without
|
||||
# Docker, the pgtest helpers call t.Skip).
|
||||
go test ./internal/adapters/postgres/...
|
||||
go test ./internal/adapters/redisstate/...
|
||||
|
||||
# CAS race coverage with -race; the test must observe exactly one
|
||||
# winner per run.
|
||||
go test -count=3 -race -run TestUpdateStatusConcurrentCAS \
|
||||
./internal/adapters/postgres/runtimerecordstore
|
||||
|
||||
# Stage 06/07 contract freeze tests stay green:
|
||||
go test ./... -run Contract
|
||||
go test ./... -run NotificationIntent
|
||||
```
|
||||
|
||||
The full repo-level `go build ./...` from the workspace root also
|
||||
succeeds; service-layer stages (13+) and the mocks regeneration
|
||||
(stage 12) are unaffected by Stage 11's adapter additions.
|
||||
@@ -0,0 +1,211 @@
|
||||
---
|
||||
stage: 12
|
||||
title: External clients
|
||||
---
|
||||
|
||||
# Stage 12 — External clients
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the five outbound adapters Game Master uses to talk to
|
||||
the engine, Game Lobby, Runtime Manager, the notification stream, and
|
||||
the lobby-events stream at PLAN Stage 12.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 12](../PLAN.md) ships the adapter layer the
|
||||
service-layer stages 13–18 depend on. Ports were frozen by Stage 10
|
||||
([`stage10-domain-and-ports.md`](./stage10-domain-and-ports.md)) and
|
||||
the AsyncAPI/OpenAPI contracts were frozen by Stage 06
|
||||
([`stage06-contract-files.md`](./stage06-contract-files.md)). The
|
||||
reference precedent is `rtmanager`'s adapter tree
|
||||
([`rtmanager/internal/adapters/lobbyclient`](../../rtmanager/internal/adapters/lobbyclient),
|
||||
[`rtmanager/internal/adapters/notificationpublisher`](../../rtmanager/internal/adapters/notificationpublisher),
|
||||
[`rtmanager/internal/adapters/healtheventspublisher`](../../rtmanager/internal/adapters/healtheventspublisher)),
|
||||
which Stage 11 already locked in as the canonical shape for Game
|
||||
Master persistence adapters. Stage 12 extends that precedent to the
|
||||
HTTP clients and stream publishers.
|
||||
|
||||
Seven decisions deviate from a literal copy of the `rtmanager` precedent
|
||||
or extend the literal task list of PLAN Stage 12. Each is recorded
|
||||
below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Engine client carries no `BaseURL` in `Config`
|
||||
|
||||
**Decision.**
|
||||
[`engineclient.Config`](../internal/adapters/engineclient/client.go)
|
||||
exposes only `CallTimeout` and `ProbeTimeout`. The engine endpoint
|
||||
URL is supplied per call from `runtime_records.engine_endpoint`.
|
||||
|
||||
**Why.** Game Master operates on N concurrent games at runtime; each
|
||||
game lives behind its own DNS hostname (`http://galaxy-game-{game_id}:8080`).
|
||||
Binding a base URL at construction would force a per-game client
|
||||
instance and complicate the caller. The port already reflects the
|
||||
right shape (`baseURL` is a method parameter on every method), so the
|
||||
adapter follows it. The `*http.Client` is shared, so the HTTP
|
||||
connection pool stays single-instance.
|
||||
|
||||
### 2. Two timeouts on the engine client, dispatched per method
|
||||
|
||||
**Decision.** The engine client routes turn-generation-class methods
|
||||
(`Init`, `Turn`, `BanishRace`, `ExecuteCommands`, `PutOrders`)
|
||||
through `CallTimeout` and inspect-style methods (`Status`,
|
||||
`GetReport`) through `ProbeTimeout`. Both are required and must be
|
||||
positive at construction.
|
||||
|
||||
**Why.** README §Configuration already declares the two
|
||||
(`GAMEMASTER_ENGINE_CALL_TIMEOUT=30s`,
|
||||
`GAMEMASTER_ENGINE_PROBE_TIMEOUT=5s`) for exactly this dispatch:
|
||||
turn generation on a large game can run for tens of seconds, while
|
||||
status/report reads are bounded and benefit from a tight ceiling.
|
||||
A single shared timeout would either starve the long calls or relax
|
||||
the short ones; the dispatch keeps the contract consistent with the
|
||||
documented intent.
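
A minimal sketch of the per-method dispatch, assuming a `Config` with the two fields named above. The `method` type, its constants, `Validate`, and `timeoutFor` are hypothetical names chosen for illustration; the real client may organise the dispatch differently.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the two timeouts described in the decision.
type Config struct {
	CallTimeout  time.Duration
	ProbeTimeout time.Duration
}

// Validate enforces that both timeouts are set and positive.
func (c Config) Validate() error {
	if c.CallTimeout <= 0 {
		return errors.New("engine client: call timeout must be positive")
	}
	if c.ProbeTimeout <= 0 {
		return errors.New("engine client: probe timeout must be positive")
	}
	return nil
}

// method is a hypothetical tag for the engine operation being issued.
type method string

const (
	methodInit      method = "Init"
	methodTurn      method = "Turn"
	methodStatus    method = "Status"
	methodGetReport method = "GetReport"
)

// timeoutFor routes inspect-style reads to the tight ceiling and every
// other method to the long one.
func (c Config) timeoutFor(m method) time.Duration {
	switch m {
	case methodStatus, methodGetReport:
		return c.ProbeTimeout
	default:
		return c.CallTimeout
	}
}

// exampleConfig uses the defaults quoted from README §Configuration.
var exampleConfig = Config{CallTimeout: 30 * time.Second, ProbeTimeout: 5 * time.Second}

func main() {
	if err := exampleConfig.Validate(); err != nil {
		panic(err)
	}
	fmt.Println(exampleConfig.timeoutFor(methodTurn), exampleConfig.timeoutFor(methodStatus))
}
```

Each call would then derive its deadline with `context.WithTimeout(ctx, cfg.timeoutFor(m))` while sharing one `*http.Client`.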
|
||||
|
||||
### 3. Engine `population` (number) decoded into `int` via `math.Round`
|
||||
|
||||
**Decision.**
|
||||
[`engineclient`](../internal/adapters/engineclient/client.go) decodes
|
||||
each `PlayerState.population` (typed as `number` in `game/openapi.yaml`)
|
||||
into a private `float64` field, then converts to the port-level `int`
|
||||
through `int(math.Round(value))`. NaN, infinite, and negative values
|
||||
are rejected as `ports.ErrEngineProtocolViolation`.
|
||||
|
||||
**Why.** The port (Stage 10) and the AsyncAPI for `gm:lobby_events`
|
||||
both treat population as a non-negative integer; the engine spec is
|
||||
the only place it is typed as `number`. The engine in practice
|
||||
returns whole values, but a defensive `math.Round` removes any
|
||||
floating-point noise that would otherwise propagate to Lobby.
|
||||
Rejecting NaN/Inf/negative payloads keeps the protocol invariant
|
||||
explicit at the trust boundary.
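
The conversion can be sketched as a small helper. `populationToInt` is a hypothetical name, and the local `ErrEngineProtocolViolation` stands in for `ports.ErrEngineProtocolViolation`; the actual adapter code may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// ErrEngineProtocolViolation stands in for the port-level sentinel.
var ErrEngineProtocolViolation = errors.New("engine protocol violation")

// populationToInt rejects values that cannot represent a non-negative
// integer and rounds away floating-point noise for everything else.
func populationToInt(value float64) (int, error) {
	if math.IsNaN(value) || math.IsInf(value, 0) || value < 0 {
		return 0, fmt.Errorf("%w: population %v", ErrEngineProtocolViolation, value)
	}
	return int(math.Round(value)), nil
}

func main() {
	for _, v := range []float64{42, 41.9999999, -1, math.NaN()} {
		n, err := populationToInt(v)
		fmt.Println(n, err)
	}
}
```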
|
||||
|
||||
### 4. Lobby client walks pagination with a hard page cap
|
||||
|
||||
**Decision.**
|
||||
[`lobbyclient.GetMemberships`](../internal/adapters/lobbyclient/client.go)
|
||||
walks the `next_page_token` chain transparently with `page_size=200`,
|
||||
stopping when the upstream response carries an empty
|
||||
`next_page_token`. A hard cap of 64 pages (`maxPages`) surfaces as
|
||||
`fmt.Errorf("%w: pagination overflow ...", ports.ErrLobbyUnavailable)`
|
||||
when crossed.
|
||||
|
||||
**Why.** The port contract is "every membership of gameID, in any
|
||||
status"; the only way to satisfy it across Lobby's paged contract is
|
||||
to follow the chain. The 64-page cap is a defensive guard against a
|
||||
broken upstream that keeps issuing tokens; 64 × 200 = 12 800
|
||||
memberships per game, two orders of magnitude beyond any realistic
|
||||
Galaxy roster, so legitimate traffic never trips it. Surfacing the
|
||||
overflow as `ErrLobbyUnavailable` lets the membership cache treat it
|
||||
the same as any other transport fault.
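
The capped walk can be sketched with a pluggable fetch function. Everything here except `maxPages` and the `ErrLobbyUnavailable` name is hypothetical (`page`, `fetchFunc`, `collectAll`, and the two sample fetchers); the real client issues HTTP requests with `page_size=200` instead.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLobbyUnavailable stands in for ports.ErrLobbyUnavailable.
var ErrLobbyUnavailable = errors.New("lobby unavailable")

const maxPages = 64

// page is a hypothetical decoded response; an empty NextPageToken ends
// the chain.
type page struct {
	Items         []string
	NextPageToken string
}

type fetchFunc func(pageToken string) (page, error)

// collectAll follows the token chain and fails once maxPages responses
// have been consumed without reaching the end.
func collectAll(fetch fetchFunc) ([]string, error) {
	var all []string
	token := ""
	for i := 0; i < maxPages; i++ {
		p, err := fetch(token)
		if err != nil {
			return nil, fmt.Errorf("%w: %v", ErrLobbyUnavailable, err)
		}
		all = append(all, p.Items...)
		if p.NextPageToken == "" {
			return all, nil
		}
		token = p.NextPageToken
	}
	return nil, fmt.Errorf("%w: pagination overflow after %d pages", ErrLobbyUnavailable, maxPages)
}

// twoPages simulates a healthy upstream with two pages.
func twoPages(token string) (page, error) {
	if token == "" {
		return page{Items: []string{"alice", "bob"}, NextPageToken: "p2"}, nil
	}
	return page{Items: []string{"carol"}}, nil
}

// endless simulates a broken upstream that never stops issuing tokens.
func endless(token string) (page, error) {
	return page{Items: []string{"x"}, NextPageToken: "again"}, nil
}

func main() {
	fmt.Println(collectAll(twoPages))
	_, err := collectAll(endless)
	fmt.Println(err)
}
```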
|
||||
|
||||
### 5. RTM client does not introduce `ErrSemverPatchOnly`
|
||||
|
||||
**Decision.** RTM's `409 conflict` with `error_code=semver_patch_only`
|
||||
is wrapped as `fmt.Errorf("%w: rtm patch: ... (error_code=semver_patch_only)", ports.ErrRTMUnavailable)`
|
||||
without a dedicated typed sentinel.
|
||||
|
||||
**Why.** The Stage 10 port [`RTMClient.Patch`](../internal/ports/rtmclient.go)
|
||||
declares only `ErrRTMUnavailable`. Adding `ErrSemverPatchOnly` here
|
||||
would extend the port contract beyond Stage 10's frozen surface, and
|
||||
the v1 service-layer caller (Stage 17, `adminpatch`) already
|
||||
validates semver-patch eligibility against `engineversionstore`
|
||||
before issuing the call. The 409 path is therefore a defence-in-depth
|
||||
signal, not a primary branch; a single wrapped error keeps the port
|
||||
narrow and lets the caller match on the message substring if it
|
||||
ever needs to (today it does not).
|
||||
|
||||
### 6. Lobby-events publisher reuses the `rtmanager/healtheventspublisher`
|
||||
shape, with two methods sharing one stream
|
||||
|
||||
**Decision.**
|
||||
[`lobbyeventspublisher.Publisher`](../internal/adapters/lobbyeventspublisher/publisher.go)
|
||||
exposes `PublishSnapshotUpdate` and `PublishGameFinished`, both
|
||||
hitting the same Redis Stream key (`cfg.Streams.LobbyEvents`,
|
||||
default `gm:lobby_events`). Each XADD encodes the same field
|
||||
vocabulary as `rtmanager/healtheventspublisher`: integer fields are
|
||||
serialised through `strconv.FormatInt` / `strconv.Itoa`, the
|
||||
per-player projection is JSON-encoded into one stream field
|
||||
(`player_turn_stats`), and the discriminator field (`event_type`) is
|
||||
a string literal pinned to one of the two AsyncAPI const values.
|
||||
No MAXLEN cap is set on XADD; an empty `PlayerTurnStats` slice is
|
||||
serialised as `"[]"` (literal). All `time.Time` fields are coerced
|
||||
to UTC before `UnixMilli()` so the published timestamps match the
|
||||
contract regardless of caller-supplied timezone.
|
||||
|
||||
**Why.** The two messages share one channel per the AsyncAPI spec
|
||||
([`runtime-events-asyncapi.yaml`](../api/runtime-events-asyncapi.yaml));
|
||||
the discriminator is the documented dispatch key for Lobby's
|
||||
consumer. Using the existing field-encoding pattern from
|
||||
`rtmanager/healtheventspublisher` keeps the wire format consistent
|
||||
across services and lets Lobby reuse the same XADD-decoding helpers
|
||||
it already runs against `runtime:health_events`. Setting MAXLEN was
|
||||
considered and rejected: Game Master never processes the stream
|
||||
itself, and the Lobby consumer owns its consumer-group offset, so
|
||||
trimming would risk dropping unconsumed entries. The empty `"[]"`
|
||||
default keeps the stream entry valid JSON for the field even before
|
||||
the first turn generates (when no per-player stats exist yet).
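
A sketch of the field encoding, under stated assumptions: only `event_type` and `player_turn_stats` are field names taken from this record; `game_id`, `current_turn`, `occurred_at_ms`, the `runtime_snapshot_update` discriminator value, and the helper names are illustrative guesses. The map is what would be handed to an XADD call.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"time"
)

// PlayerTurnStats is the per-player projection carried in one field.
type PlayerTurnStats struct {
	UserID     string `json:"user_id"`
	Planets    int    `json:"planets"`
	Population int    `json:"population"`
}

// snapshotFields builds the string-valued stream fields for one entry.
func snapshotFields(gameID string, turn int, at time.Time, stats []PlayerTurnStats) (map[string]string, error) {
	if stats == nil {
		// json.Marshal of a nil slice yields "null"; the contract wants "[]".
		stats = []PlayerTurnStats{}
	}
	encoded, err := json.Marshal(stats)
	if err != nil {
		return nil, fmt.Errorf("encode player_turn_stats: %w", err)
	}
	return map[string]string{
		"event_type":        "runtime_snapshot_update", // assumed const value
		"game_id":           gameID,                    // hypothetical field name
		"current_turn":      strconv.Itoa(turn),        // hypothetical field name
		"occurred_at_ms":    strconv.FormatInt(at.UTC().UnixMilli(), 10),
		"player_turn_stats": string(encoded),
	}, nil
}

// exampleFields encodes an entry with no per-player stats and a
// non-UTC caller timestamp.
func exampleFields() (map[string]string, error) {
	zone := time.FixedZone("UTC+3", 3*60*60)
	at := time.Date(2024, time.January, 1, 12, 0, 0, 0, zone)
	return snapshotFields("game-1", 0, at, nil)
}

func main() {
	fields, err := exampleFields()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["player_turn_stats"], fields["occurred_at_ms"])
}
```

The explicit nil check matters because `encoding/json` serialises a nil slice as `null`, which would break the `"[]"` guarantee.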
|
||||
|
||||
### 7. Defensive Makefile guard for `make mocks` between Stage 12 and Stage 19
|
||||
|
||||
**Decision.** The `mocks` Makefile target now skips the
|
||||
`internal/api/internalhttp/handlers/...` line when that directory
|
||||
does not yet exist:
|
||||
|
||||
```makefile
|
||||
mocks:
|
||||
go generate ./internal/ports/...
|
||||
@if [ -d ./internal/api/internalhttp/handlers ]; then \
|
||||
go generate ./internal/api/internalhttp/handlers/...; \
|
||||
fi
|
||||
```
|
||||
|
||||
**Why.** Stage 08 wired the Makefile to regenerate both port-level
|
||||
and handler-level mocks, but the handlers directory only appears at
|
||||
Stage 19. Without the guard, `make mocks` fails with `lstat: no such
|
||||
file or directory` between Stage 12 and Stage 19 — exactly when GM
|
||||
is being grown stage by stage. The guard keeps the target runnable
|
||||
across stages and adds zero cost when the directory is finally
|
||||
created.
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/adapters/engineclient/client.go`](../internal/adapters/engineclient/client.go),
|
||||
[`../internal/adapters/engineclient/client_test.go`](../internal/adapters/engineclient/client_test.go)
|
||||
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go),
|
||||
[`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go)
|
||||
- [`../internal/adapters/rtmclient/client.go`](../internal/adapters/rtmclient/client.go),
|
||||
[`../internal/adapters/rtmclient/client_test.go`](../internal/adapters/rtmclient/client_test.go)
|
||||
- [`../internal/adapters/notificationpublisher/publisher.go`](../internal/adapters/notificationpublisher/publisher.go),
|
||||
[`../internal/adapters/notificationpublisher/publisher_test.go`](../internal/adapters/notificationpublisher/publisher_test.go)
|
||||
- [`../internal/adapters/lobbyeventspublisher/publisher.go`](../internal/adapters/lobbyeventspublisher/publisher.go),
|
||||
[`../internal/adapters/lobbyeventspublisher/publisher_test.go`](../internal/adapters/lobbyeventspublisher/publisher_test.go)
|
||||
- [`../internal/adapters/mocks/`](../internal/adapters/mocks) — ten
|
||||
generated `mockgen` files covering every Stage 10 port (engine,
|
||||
lobby, rtm, notification publisher, lobby-events publisher, plus
|
||||
the five store/log ports landed by Stage 11).
|
||||
- [`../Makefile`](../Makefile) — defensive guard on the `mocks`
|
||||
target.
|
||||
- [`../README.md`](../README.md) — §References pointer to this
|
||||
record.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
|
||||
cd gamemaster
|
||||
|
||||
# Mocks regenerate cleanly with no diff after a second run.
|
||||
make mocks
|
||||
git diff --exit-code internal/adapters/mocks
|
||||
|
||||
# Adapter-level unit tests against httptest / miniredis.
|
||||
go test ./internal/adapters/engineclient/...
|
||||
go test ./internal/adapters/lobbyclient/...
|
||||
go test ./internal/adapters/rtmclient/...
|
||||
go test ./internal/adapters/notificationpublisher/...
|
||||
go test ./internal/adapters/lobbyeventspublisher/...
|
||||
|
||||
# Full repo test run remains green; Stage 06/07/09–11 contract and
|
||||
# adapter tests are unaffected.
|
||||
go test ./...
|
||||
```
|
||||
@@ -0,0 +1,230 @@
|
||||
---
|
||||
stage: 13
|
||||
title: Register-runtime service
|
||||
---
|
||||
|
||||
# Stage 13 — Register-runtime service
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the `register-runtime` service-layer orchestrator at PLAN
|
||||
Stage 13. The service is the single entry point Game Lobby uses (after
|
||||
Runtime Manager has reported a successful container start) to install a
|
||||
freshly-started game in Game Master.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 13](../PLAN.md) ships the first service-layer stage
|
||||
of Game Master. It lays the orchestrator pattern that Stages 14–17 will
|
||||
reuse (engine version registry CRUD, scheduler, hot path, admin
|
||||
operations). The lifecycle the service drives is frozen by
|
||||
[`../README.md` §Lifecycles → Register-runtime](../README.md):
|
||||
|
||||
1. validate request shape;
|
||||
2. reject if `runtime_records.{game_id}` already exists;
|
||||
3. resolve `image_ref` for `target_engine_version`;
|
||||
4. persist `runtime_records` with `status=starting`;
|
||||
5. call engine `POST /api/v1/admin/init`;
|
||||
6. persist `player_mappings` from the engine response;
|
||||
7. CAS `status: starting → running` and persist initial scheduling;
|
||||
8. append `operation_log`;
|
||||
9. publish `runtime_snapshot_update`;
|
||||
10. return the persisted record.
|
||||
|
||||
The reference precedent is
|
||||
[`rtmanager/internal/service/startruntime`](../../rtmanager/internal/service/startruntime),
|
||||
which established the `Input` / `Result` / `Dependencies` / `NewService`
|
||||
/ `Handle` shape, the `recordFailure` helper, and the
|
||||
`bestEffortAppend` audit-log convention.
|
||||
|
||||
Five decisions deviate from a literal reading of either PLAN Stage 13
|
||||
or the rtmanager precedent. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. `RuntimeRecordStore.Delete` extension
|
||||
|
||||
**Decision.** [`ports.RuntimeRecordStore`](../internal/ports/runtimerecordstore.go)
|
||||
gains an idempotent `Delete(ctx, gameID) error` method. The
|
||||
PostgreSQL-backed adapter
|
||||
[`runtimerecordstore.Store.Delete`](../internal/adapters/postgres/runtimerecordstore/store.go)
|
||||
issues a single `DELETE FROM runtime_records WHERE game_id = $1` and
|
||||
returns `nil` even when no row matches. The mock at
|
||||
[`internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go)
|
||||
is regenerated by `make -C gamemaster mocks`. A lone integration
|
||||
test `TestDeleteIdempotent` mirrors `TestDeleteByGameIdempotent` in
|
||||
`playermappingstore`.
|
||||
|
||||
**Why.** The README's failure paths for `register-runtime` mandate
|
||||
"roll back `runtime_records`" on every post-Insert failure. The Stage 10
|
||||
port surface had no Delete primitive, so the orchestrator could not
|
||||
satisfy the README without one. Three alternatives were considered
|
||||
and rejected:
|
||||
|
||||
- **Reorder the flow** (call engine init first, only then persist
|
||||
`runtime_records`): contradicts the README, which lists the Insert
|
||||
step before the engine call so that the in-flight `starting` row is
|
||||
observable to inspect surfaces and acts as a coordination point for
|
||||
concurrent register-runtime requests on the same game id.
|
||||
- **Introduce a `removed` status enum**: changes the runtime status
|
||||
machine for one transient bookkeeping case; complicates indexes,
|
||||
filters, and the inspect surface; is not described anywhere in
|
||||
README §Game Master status model.
|
||||
- **Single SQL transaction across both stores**: requires the adapter
|
||||
layer to expose a transactional sub-interface, breaking the per-port
|
||||
abstraction Stage 10 set up. The cost of one extra method on a
|
||||
single port is far smaller.
|
||||
|
||||
This is the same pattern Stage 11 used for `UpdateEngineVersionInput.Now`
|
||||
and `Deprecate(ctx, version, now)`: a small, targeted contract delta
|
||||
admitted by the pre-launch single-init policy.
|
||||
|
||||
### 2. Engine 4xx → `engine_validation_error`, engine 5xx →
|
||||
`engine_unreachable`
|
||||
|
||||
**Decision.** When the engine `/admin/init` call returns 4xx, the
|
||||
service produces `Result{ErrorCode: engine_validation_error}`. When it
|
||||
returns 5xx (or fails at the transport layer), the service produces
|
||||
`Result{ErrorCode: engine_unreachable}`. The classification lives in
|
||||
[`classifyEngineError`](../internal/service/registerruntime/service.go)
|
||||
and dispatches on the engine port sentinels
|
||||
(`ports.ErrEngineValidation`, `ports.ErrEngineUnreachable`,
|
||||
`ports.ErrEngineProtocolViolation`).
|
||||
|
||||
**Why.** [`../PLAN.md` Stage 13](../PLAN.md) lists the two as separate
|
||||
test cases ("engine 4xx (engine_validation_error), engine 5xx
|
||||
(engine_unreachable)"), but [`../README.md` §Lifecycles →
|
||||
Register-runtime](../README.md)'s failure-path table at the time of
|
||||
Stage 13 lumped them as `engine_unreachable`. PLAN's classification is
|
||||
more useful operationally:
|
||||
|
||||
- 4xx from the engine signals a contract violation (the engine
|
||||
rejected the request shape, which is a Game Master bug or a stale
|
||||
contract). Treating this as `engine_unreachable` would push
|
||||
operators down the "is the engine alive?" branch when the right
|
||||
branch is "did the GM build send the right shape?".
|
||||
- 5xx (and transport failures) signal that the engine is unreachable
|
||||
or unhealthy. `engine_unreachable` is the right code.
|
||||
|
||||
The README §Lifecycles failure-path table is updated in the same
|
||||
patch to reflect the split, so the two documents agree.
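
The classification can be sketched as a switch over the port sentinels. The local sentinels stand in for the `ports.*` values, and two details are assumptions rather than facts from this record: that `ErrEngineProtocolViolation` maps to `engine_protocol_violation`, and that unrecognised errors fall back to `engine_unreachable`.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the engine port sentinels.
var (
	ErrEngineValidation        = errors.New("engine validation")
	ErrEngineUnreachable       = errors.New("engine unreachable")
	ErrEngineProtocolViolation = errors.New("engine protocol violation")
)

// classifyEngineError maps a wrapped adapter error to a result code.
func classifyEngineError(err error) string {
	switch {
	case errors.Is(err, ErrEngineValidation):
		return "engine_validation_error" // engine answered 4xx
	case errors.Is(err, ErrEngineProtocolViolation):
		return "engine_protocol_violation" // assumed mapping
	default:
		// 5xx, transport failures, and (by assumption) anything unknown.
		return "engine_unreachable"
	}
}

// wrapped simulates the adapter adding context around a sentinel.
func wrapped(sentinel error) error {
	return fmt.Errorf("engine init: %w", sentinel)
}

func main() {
	fmt.Println(classifyEngineError(wrapped(ErrEngineValidation)))
	fmt.Println(classifyEngineError(wrapped(ErrEngineUnreachable)))
}
```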
|
||||
|
||||
### 3. Engine response validated as `engine_protocol_violation`
|
||||
|
||||
**Decision.** After a successful engine `/admin/init` HTTP response,
|
||||
the service performs two extra checks before persisting any
|
||||
player_mappings:
|
||||
|
||||
- the number of returned players must equal the input roster size;
|
||||
- the set of `RaceName` values returned must exactly match the input
|
||||
roster (no extra races, no missing races).
|
||||
|
||||
A failure on either check rolls back the runtime record and returns
|
||||
`Result{ErrorCode: engine_protocol_violation}`.
|
||||
|
||||
**Why.** The README's failure-path table includes
|
||||
`engine_protocol_violation` for "engine response missing players or
|
||||
contains races not in roster". The engine adapter ([Stage 12,
|
||||
`engineclient.decodeStateResponse`](../internal/adapters/engineclient/client.go))
|
||||
validates the wire shape (presence of required fields, well-formed
|
||||
numeric values), but it cannot validate against the roster Game Master
|
||||
sent — only the service layer knows the roster. Splitting the two
|
||||
checks keeps the adapter narrow and lets the service-layer error code
|
||||
carry the semantic meaning.
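
The two service-level checks can be sketched as one helper. `validateRoster` is a hypothetical name, and race names are reduced to plain strings; the sketch also assumes the input roster itself contains no duplicate race names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrEngineProtocolViolation stands in for the service-level failure.
var ErrEngineProtocolViolation = errors.New("engine protocol violation")

// validateRoster checks the engine's returned race names against the
// roster Game Master sent: same count, no unknown names, no repeats.
// With a duplicate-free roster those three conditions imply set equality.
func validateRoster(roster, returned []string) error {
	if len(returned) != len(roster) {
		return fmt.Errorf("%w: got %d players, want %d",
			ErrEngineProtocolViolation, len(returned), len(roster))
	}
	expected := make(map[string]bool, len(roster))
	for _, race := range roster {
		expected[race] = true
	}
	seen := make(map[string]bool, len(returned))
	for _, race := range returned {
		if !expected[race] {
			return fmt.Errorf("%w: race %q not in roster", ErrEngineProtocolViolation, race)
		}
		if seen[race] {
			return fmt.Errorf("%w: race %q returned twice", ErrEngineProtocolViolation, race)
		}
		seen[race] = true
	}
	return nil
}

func main() {
	roster := []string{"Zerg", "Terran"}
	fmt.Println(validateRoster(roster, []string{"Terran", "Zerg"}))
	fmt.Println(validateRoster(roster, []string{"Terran", "Protoss"}))
}
```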
|
||||
|
||||
### 4. Initial `runtime_snapshot_update` carries non-empty
|
||||
`player_turn_stats`
|
||||
|
||||
**Decision.** The first `runtime_snapshot_update` published by
|
||||
register-runtime carries one
|
||||
`PlayerTurnStats{UserID, Planets, Population}` row per active member,
|
||||
projected from the `engine.Init` response by joining on `RaceName`
|
||||
against the input roster. The projection is sorted by `UserID` for a
|
||||
deterministic wire order.
|
||||
|
||||
**Why.** The README §Async Stream Contracts cadence note used to read
|
||||
"empty when the snapshot is published for a status transition with no
|
||||
new turn payload". For register-runtime there *is* a new payload — the
|
||||
engine returns the initial player state in its `/admin/init` response,
|
||||
including `Planets` and `Population`. That state is the turn-0
|
||||
baseline against which Lobby's per-game stats aggregator measures
|
||||
later deltas: without it, the first per-player delta after turn 1
|
||||
would silently equal "everything" instead of "the change since
|
||||
turn 0". The README cadence wording is updated in the same patch to
|
||||
say the register-runtime snapshot carries the engine's turn-0 stats.
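
The join-and-sort projection can be sketched as follows. The `RosterEntry` and `PlayerState` types and the `projectStats` helper are hypothetical simplifications of the real input and engine response types.

```go
package main

import (
	"fmt"
	"sort"
)

// RosterEntry is a simplified roster row from the register-runtime input.
type RosterEntry struct {
	UserID   string
	RaceName string
}

// PlayerState is a simplified per-player row from the engine init response.
type PlayerState struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStats is the row published in the snapshot.
type PlayerTurnStats struct {
	UserID     string
	Planets    int
	Population int
}

// projectStats joins engine players to roster users on RaceName and
// sorts the result by UserID so the wire order is deterministic.
func projectStats(roster []RosterEntry, players []PlayerState) []PlayerTurnStats {
	userByRace := make(map[string]string, len(roster))
	for _, entry := range roster {
		userByRace[entry.RaceName] = entry.UserID
	}
	stats := make([]PlayerTurnStats, 0, len(players))
	for _, p := range players {
		userID, ok := userByRace[p.RaceName]
		if !ok {
			continue // roster mismatches are rejected earlier by validation
		}
		stats = append(stats, PlayerTurnStats{UserID: userID, Planets: p.Planets, Population: p.Population})
	}
	sort.Slice(stats, func(i, j int) bool { return stats[i].UserID < stats[j].UserID })
	return stats
}

func main() {
	roster := []RosterEntry{{UserID: "u2", RaceName: "Zerg"}, {UserID: "u1", RaceName: "Terran"}}
	players := []PlayerState{{RaceName: "Zerg", Planets: 1, Population: 100}, {RaceName: "Terran", Planets: 1, Population: 120}}
	fmt.Println(projectStats(roster, players))
}
```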
|
||||
|
||||
### 5. Best-effort rollback with two-flag gating
|
||||
|
||||
**Decision.** The service exposes a single `rollback(ctx, gameID,
|
||||
playerMappingsInstalled)` helper that always tries `runtime_records.Delete`
|
||||
and conditionally tries `playermappings.DeleteByGame`. The two booleans
|
||||
on `recordFailure` (`runtimeInserted`, `playerMappingsInstalled`)
|
||||
gate the rollback so:
|
||||
|
||||
- a pre-Insert failure (`invalid_request`, `conflict` from `Get`,
|
||||
`engine_version_not_found`, `Insert`'s own `ErrConflict`) skips
|
||||
rollback entirely;
|
||||
- a post-Insert / pre-BulkInsert failure deletes only the runtime
|
||||
row;
|
||||
- a post-BulkInsert failure deletes both. Note that BulkInsert errors
|
||||
themselves never install rows (per stage 11 D7's per-statement
|
||||
atomicity), so on `BulkInsert` returning ErrConflict the rollback
|
||||
flag for player_mappings is `false`.
|
||||
|
||||
The rollback uses a fresh `context.Background()` with a 5-second
|
||||
timeout so a cancelled request context does not strand the
|
||||
`starting` row.
|
||||
|
||||
**Why.** A common pitfall in rollback paths is to call `Delete` on
|
||||
state owned by another caller. The Insert-conflict branch is the
|
||||
canonical example: when our `Insert` returns `ErrConflict`, another
|
||||
request inserted the row first and owns it. Blindly deleting it
|
||||
would corrupt that other caller's state. The two-flag gating makes
|
||||
the ownership transfer explicit. The fresh background context
|
||||
mirrors the same pattern in `rtmanager.startruntime.releaseLease`.
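
The gating can be sketched as below. This is a simplification, not the service's code: it folds both flags into one hypothetical `rollback` method, models the two stores as plain function fields, and picks an arbitrary delete order, none of which is confirmed by this record.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleter abstracts one store's delete primitive.
type deleter func(ctx context.Context, gameID string) error

type service struct {
	deleteRuntime  deleter // stands in for RuntimeRecordStore.Delete
	deleteMappings deleter // stands in for the player-mapping DeleteByGame
}

// rollback only touches state this request created. It deliberately
// ignores the request context so a cancelled caller cannot strand rows.
func (s *service) rollback(gameID string, runtimeInserted, playerMappingsInstalled bool) {
	if !runtimeInserted {
		return // pre-Insert failure: another caller may own the row
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if playerMappingsInstalled {
		_ = s.deleteMappings(ctx, gameID) // best effort
	}
	_ = s.deleteRuntime(ctx, gameID) // best effort
}

// runRollback records which deletes a given flag combination triggers.
func runRollback(runtimeInserted, playerMappingsInstalled bool) []string {
	var calls []string
	svc := &service{
		deleteRuntime: func(ctx context.Context, gameID string) error {
			calls = append(calls, "runtime_records:"+gameID)
			return nil
		},
		deleteMappings: func(ctx context.Context, gameID string) error {
			calls = append(calls, "player_mappings:"+gameID)
			return nil
		},
	}
	svc.rollback("game-1", runtimeInserted, playerMappingsInstalled)
	return calls
}

func main() {
	fmt.Println(runRollback(false, false))
	fmt.Println(runRollback(true, false))
	fmt.Println(runRollback(true, true))
}
```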
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
|
||||
— added `Delete` to the interface and the comment block.
|
||||
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go)
|
||||
— implemented `Delete`.
|
||||
- [`../internal/adapters/postgres/runtimerecordstore/store_test.go`](../internal/adapters/postgres/runtimerecordstore/store_test.go)
|
||||
— added `TestDeleteIdempotent` and `TestDeleteRejectsEmptyGameID`.
|
||||
- [`../internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go)
|
||||
— regenerated.
|
||||
- [`../internal/service/registerruntime/service.go`](../internal/service/registerruntime/service.go)
|
||||
with [`errors.go`](../internal/service/registerruntime/errors.go)
|
||||
and [`service_test.go`](../internal/service/registerruntime/service_test.go)
|
||||
— new orchestrator package and tests.
|
||||
- [`../README.md`](../README.md) — §References pointer to this record
|
||||
plus one-line clarifications in §Lifecycles → Register-runtime
|
||||
(failure-path table now splits 4xx/5xx per **D2**) and §Async Stream
|
||||
Contracts (cadence note now says the register-runtime snapshot
|
||||
carries `player_turn_stats` from the engine-init response per **D4**).
|
||||
- [`../PLAN.md`](../PLAN.md) — Stage 13 marked done.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
|
||||
cd gamemaster
|
||||
|
||||
# Mocks regenerate cleanly with no diff after the port extension.
|
||||
make mocks
|
||||
git diff --exit-code internal/adapters/mocks
|
||||
|
||||
# Domain + port tests still pass.
|
||||
go test ./internal/domain/... ./internal/ports/...
|
||||
|
||||
# Adapter test for the new Delete method.
|
||||
go test ./internal/adapters/postgres/runtimerecordstore/...
|
||||
|
||||
# Service-level tests for the new orchestrator.
|
||||
go test ./internal/service/registerruntime/...
|
||||
|
||||
# Stage 06/07/09–12 contract / adapter / freeze tests stay green.
|
||||
go test ./...
|
||||
```
|
||||
|
||||
The full repo-level `go build ./...` from the workspace root succeeds;
|
||||
later stages (14+) build on the orchestrator shape Stage 13
|
||||
establishes.
|
||||
@@ -0,0 +1,220 @@
|
||||
---
|
||||
stage: 14
|
||||
title: Engine version registry service
|
||||
---
|
||||
|
||||
# Stage 14 — Engine version registry service
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the `engine_version` registry service-layer at PLAN
|
||||
Stage 14. The service backs the
|
||||
`/api/v1/internal/engine-versions/*` REST surface (Stage 19) and the
|
||||
hot-path `image_ref` resolve called synchronously by Game Lobby's
|
||||
start flow.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 14](../PLAN.md) lists seven service methods:
|
||||
`List`, `Get`, `Create`, `Update`, `Deprecate`, `Delete`,
|
||||
`ResolveImageRef`. The lifecycle the service drives is frozen by
|
||||
[`../README.md` §Engine Version Registry](../README.md). The reference
|
||||
precedent for shape and audit semantics is
|
||||
[`../internal/service/registerruntime`](../internal/service/registerruntime/service.go)
|
||||
landed at Stage 13.
|
||||
|
||||
Five decisions deviate from a literal reading of either Stage 14 or
|
||||
the existing port and migration shapes. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. `EngineVersionStore.Delete` extension
|
||||
|
||||
**Decision.** [`ports.EngineVersionStore`](../internal/ports/engineversionstore.go)
|
||||
gains a `Delete(ctx, version) error` method that returns
|
||||
`engineversion.ErrNotFound` when no row matches. The PostgreSQL-backed
|
||||
adapter [`engineversionstore.Store.Delete`](../internal/adapters/postgres/engineversionstore/store.go)
|
||||
issues a single `DELETE FROM engine_versions WHERE version = $1` and
|
||||
distinguishes "missing" from "removed" via `RowsAffected`. The mock at
|
||||
[`internal/adapters/mocks/mock_engineversionstore.go`](../internal/adapters/mocks/mock_engineversionstore.go)
|
||||
is regenerated by `make -C gamemaster mocks`. Three adapter tests
|
||||
(`TestDeleteHappy`, `TestDeleteNotFound`, `TestDeleteRejectsEmptyVersion`)
|
||||
mirror the pattern from the existing Deprecate tests.
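The `RowsAffected` check described above can be sketched as follows. This is a minimal illustration against `database/sql`, not the adapter's actual code: `execer`, `deleteVersion`, `fakeDB`, and `deleteTwice` are hypothetical names, and the local `ErrNotFound` stands in for `engineversion.ErrNotFound`.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// ErrNotFound stands in for engineversion.ErrNotFound (assumption).
var ErrNotFound = errors.New("engine version not found")

// execer is a hypothetical subset of *sql.DB; *sql.DB and *sql.Tx satisfy it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// deleteVersion issues a single DELETE and maps "zero rows affected" to
// ErrNotFound so callers can tell "missing" from "removed".
func deleteVersion(ctx context.Context, db execer, version string) error {
	if version == "" {
		return errors.New("version must not be empty")
	}
	res, err := db.ExecContext(ctx, `DELETE FROM engine_versions WHERE version = $1`, version)
	if err != nil {
		return fmt.Errorf("delete engine version: %w", err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return fmt.Errorf("delete engine version: rows affected: %w", err)
	}
	if n == 0 {
		return ErrNotFound
	}
	return nil
}

// fakeResult and fakeDB are in-memory stand-ins so the sketch runs without PostgreSQL.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

type fakeDB struct{ rows map[string]bool }

func (f *fakeDB) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	version, _ := args[0].(string)
	if !f.rows[version] {
		return fakeResult(0), nil
	}
	delete(f.rows, version)
	return fakeResult(1), nil
}

// deleteTwice deletes the same version twice and returns both outcomes.
func deleteTwice(version string) (first, second error) {
	db := &fakeDB{rows: map[string]bool{version: true}}
	ctx := context.Background()
	first = deleteVersion(ctx, db, version)
	second = deleteVersion(ctx, db, version)
	return first, second
}

func main() {
	first, second := deleteTwice("v1.2.3")
	fmt.Println("first delete:", first)
	fmt.Println("second delete:", second)
}
```

The second call reports the sentinel because the row is already gone, which is the case `TestDeleteNotFound` is described as covering.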
|
||||
|
||||
**Why.** Stage 14 explicitly requires the service to expose a hard
|
||||
`Delete` distinct from `Deprecate`. The Stage 11 port surface only
|
||||
carried `Deprecate` (idempotent soft-mark) and
|
||||
`IsReferencedByActiveRuntime` (read probe). Three alternatives were
|
||||
considered and rejected:
|
||||
|
||||
- **Skip hard delete**: omits a Stage 14 deliverable and forces a port
|
||||
delta later. The OpenAPI 409 `engine_version_in_use` example would
|
||||
also become a dangling spec entry.
|
||||
- **Reuse `Deprecate` for both soft and hard semantics**: contradicts
|
||||
README §Engine Version Registry ("`status` values: ... `deprecated`
|
||||
(rejected on new starts; existing runtimes unaffected)"). A
|
||||
referenced version must remain deprecable so the operator can phase
|
||||
in a successor while existing runtimes finish out — folding the
|
||||
reference check into Deprecate would break that flow.
|
||||
- **Inline the SQL inside the service**: contradicts the per-port
|
||||
abstraction Stage 10 set up; the service must not import the jet
|
||||
table package.
|
||||
|
||||
This is the same pattern Stage 13 D1 used for
|
||||
`RuntimeRecordStore.Delete`: a small, targeted contract delta admitted
|
||||
by the pre-launch single-init policy.
|
||||
|
||||
### 2. Hard-delete reference probe runs before adapter `Delete`
|
||||
|
||||
**Decision.** [`Service.Delete`](../internal/service/engineversion/service.go)
|
||||
calls `versions.IsReferencedByActiveRuntime` first; on a positive
|
||||
result it surfaces `ErrInUse` without ever calling the adapter
|
||||
`Delete`. Only when the probe reports zero references does the service
|
||||
issue the SQL DELETE.
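A minimal sketch of the probe-then-delete ordering, under stated assumptions: the `versionStore` interface is a hypothetical slice of `ports.EngineVersionStore`, the method signatures are guessed from the prose, and `fakeStore` and `tryDelete` exist only to make the example runnable.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrInUse stands in for the service's ErrInUse sentinel (assumption).
var ErrInUse = errors.New("engine_version_in_use")

// versionStore is a hypothetical subset of the port; signatures are assumed.
type versionStore interface {
	IsReferencedByActiveRuntime(ctx context.Context, version string) (bool, error)
	Delete(ctx context.Context, version string) error
}

// deleteEngineVersion probes for references first and only mutates when the
// probe reports none, so a probe failure never leaks a DELETE.
func deleteEngineVersion(ctx context.Context, versions versionStore, version string) error {
	referenced, err := versions.IsReferencedByActiveRuntime(ctx, version)
	if err != nil {
		return fmt.Errorf("probe references: %w", err) // nothing mutated yet
	}
	if referenced {
		return ErrInUse
	}
	return versions.Delete(ctx, version)
}

// fakeStore records how often Delete was reached.
type fakeStore struct {
	referenced  bool
	deleteCalls int
}

func (f *fakeStore) IsReferencedByActiveRuntime(context.Context, string) (bool, error) {
	return f.referenced, nil
}

func (f *fakeStore) Delete(context.Context, string) error {
	f.deleteCalls++
	return nil
}

// tryDelete runs one delete attempt and reports the Delete call count.
func tryDelete(referenced bool) (int, error) {
	store := &fakeStore{referenced: referenced}
	err := deleteEngineVersion(context.Background(), store, "v1.2.3")
	return store.deleteCalls, err
}

func main() {
	calls, err := tryDelete(true)
	fmt.Println("referenced:", calls, err)
	calls, err = tryDelete(false)
	fmt.Println("unreferenced:", calls, err)
}
```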
|
||||
|
||||
**Why.** Two alternatives were rejected:
|
||||
|
||||
- **Single transaction with `SELECT ... FOR UPDATE` plus DELETE**:
|
||||
requires the adapter to expose a transactional sub-interface and
|
||||
forces the service into store-internal locking semantics. The plan
|
||||
is single-instance (README §Non-Goals), so the small race window
|
||||
between probe and delete is acceptable and self-correcting (a
|
||||
late-arriving register-runtime against a deprecated version would
|
||||
fail at `runtime_records` insert anyway because the version row is
|
||||
gone — the eventual outcome is the same).
|
||||
- **Probe-after-delete**: leaks the DELETE on transient probe
|
||||
failures and surfaces a misleading "deleted" outcome to the caller.
|
||||
|
||||
Surfacing `engine_version_in_use` before any mutation matches the
|
||||
README §Error Model wording and the OpenAPI `EngineVersionInUseError`
|
||||
example.
|
||||
|
||||
### 3. `engine_version_delete` op kind added to schema and domain
|
||||
|
||||
**Decision.** A new audit value `engine_version_delete` is added to:
|
||||
|
||||
- [`domain/operation.OpKind`](../internal/domain/operation/log.go)
|
||||
(constant, `IsKnown`, `AllOpKinds`);
|
||||
- [`migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
(the `operation_log_op_kind_chk` CHECK constraint);
|
||||
- README §Persistence Layout (the `op_kind` enum listing in the
|
||||
`operation_log` description).
|
||||
|
||||
The pre-launch single-init policy from
|
||||
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
|
||||
allows editing `00001_init.sql` until first production deploy.
|
||||
|
||||
**Why.** Two alternatives were rejected:
|
||||
|
||||
- **Reuse `engine_version_deprecate`** for hard delete: semantically
|
||||
weak; audit consumers would have to inspect outcome plus an
|
||||
out-of-band column to tell soft from hard, defeating the audit's
|
||||
signal value.
|
||||
- **Skip audit for hard delete**: inconsistent with every other
|
||||
service-layer mutation (every Stage 13/14 mutation writes
|
||||
operation_log). Forensics on a destructive admin action are exactly
|
||||
where audit matters most.
|
||||
|
||||
### 4. `operation_log.game_id` column doubles as audit subject
|
||||
|
||||
**Decision.** Engine-version CRUD audit entries store the canonical
|
||||
`version` string in the `OperationEntry.GameID` field (and therefore
|
||||
in the `operation_log.game_id` column). For `OpKindEngineVersionCreate`
|
||||
the canonical post-`ParseSemver` form is used (`v1.2.3`); for
|
||||
`OpKindEngineVersionUpdate` / `Deprecate` / `Delete` the user-supplied
|
||||
version is used so failed lookups still record the attempt verbatim.
|
||||
|
||||
**Why.** Three alternatives were considered and rejected:
|
||||
|
||||
- **Make `game_id` nullable and add a `subject_id` column**: requires
|
||||
a migration delta + jet regeneration + a domain field rename. Out
|
||||
of scope for Stage 14 and inconsistent with the minimal-diff
|
||||
principle.
|
||||
- **Use a sentinel `engine_version:<v>` prefix**: harder to query
|
||||
alongside per-game audit reads; the index
|
||||
`operation_log (game_id, started_at DESC)` already covers
|
||||
subject-scoped reads, and a sentinel prefix would force callers to
|
||||
strip it.
|
||||
- **Skip audit for engine-version CRUD**: README §Persistence Layout
|
||||
explicitly lists `engine_version_create | engine_version_update |
|
||||
engine_version_deprecate` as op_kind values; the audit table is
|
||||
the canonical surface.
|
||||
|
||||
The decision is recorded both here and in the README §Persistence
|
||||
Layout note so future readers can find the overload rationale.
|
||||
|
||||
### 5. JSON-object validation for `Options`
|
||||
|
||||
**Decision.** [`Service.Create`](../internal/service/engineversion/service.go)
|
||||
and `Service.Update` validate the `Options` byte slice as a JSON
|
||||
object before persisting (raw bytes are decoded into
|
||||
`map[string]any`; non-objects, including arrays and scalars, are
|
||||
rejected with `invalid_request`). Empty/whitespace-only input passes
|
||||
through as nil; the adapter (Stage 11 D5) already substitutes the
|
||||
schema default `'{}'::jsonb`.
|
||||
|
||||
**Why.** The `engine_versions.options` column is `jsonb`. Persisting
|
||||
an array, scalar, or malformed JSON would either be rejected by the
|
||||
PostgreSQL parser at INSERT time (surfacing as a generic 500) or
|
||||
accepted and break engine-side consumers that expect an object. The
|
||||
service-layer validation surfaces a clear `invalid_request` early and
|
||||
keeps the contract honest. README §Engine Version Registry already
|
||||
describes `options` as a "free-form `jsonb` document" (object
|
||||
implied); the validation makes that wording load-bearing.
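The object-only check can be sketched with `encoding/json` as below. The function name `normalizeOptions` and the local `ErrInvalidRequest` sentinel are hypothetical. One detail worth noting: the literal `null` decodes into a nil map without an error, so the sketch rejects it explicitly; whether the real service does the same is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// ErrInvalidRequest stands in for the service's invalid_request error (assumption).
var ErrInvalidRequest = errors.New("invalid_request")

// normalizeOptions returns nil for empty or whitespace-only input and the
// trimmed bytes for a JSON object; everything else is rejected.
func normalizeOptions(raw []byte) ([]byte, error) {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return nil, nil // the adapter substitutes the schema default
	}
	var obj map[string]any
	if err := json.Unmarshal(trimmed, &obj); err != nil {
		// Arrays, scalars, and malformed JSON all fail to decode into a map.
		return nil, fmt.Errorf("%w: options must be a JSON object", ErrInvalidRequest)
	}
	if obj == nil {
		// The literal null decodes without error but is not an object.
		return nil, fmt.Errorf("%w: options must not be null", ErrInvalidRequest)
	}
	return trimmed, nil
}

func main() {
	for _, in := range []string{`{"speed": 2}`, `   `, `[1, 2]`, `42`, `null`, `{`} {
		_, err := normalizeOptions([]byte(in))
		fmt.Printf("%q accepted=%v\n", in, err == nil)
	}
}
```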
|
||||
|
||||
## Files landed
|
||||
|
||||
- [`../internal/ports/engineversionstore.go`](../internal/ports/engineversionstore.go)
|
||||
— added `Delete` to the interface and the comment block.
|
||||
- [`../internal/adapters/postgres/engineversionstore/store.go`](../internal/adapters/postgres/engineversionstore/store.go)
|
||||
— implemented `Delete`.
|
||||
- [`../internal/adapters/postgres/engineversionstore/store_test.go`](../internal/adapters/postgres/engineversionstore/store_test.go)
|
||||
— added `TestDeleteHappy`, `TestDeleteNotFound`,
|
||||
`TestDeleteRejectsEmptyVersion`.
|
||||
- [`../internal/adapters/mocks/mock_engineversionstore.go`](../internal/adapters/mocks/mock_engineversionstore.go)
|
||||
— regenerated.
|
||||
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
— added `engine_version_delete` to `operation_log_op_kind_chk`.
|
||||
- [`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
|
||||
with [`log_test.go`](../internal/domain/operation/log_test.go)
|
||||
— added `OpKindEngineVersionDelete` plus `IsKnown`/`AllOpKinds`
|
||||
membership.
|
||||
- [`../internal/service/engineversion/service.go`](../internal/service/engineversion/service.go)
|
||||
with [`errors.go`](../internal/service/engineversion/errors.go)
|
||||
and [`service_test.go`](../internal/service/engineversion/service_test.go)
|
||||
— new orchestrator package and tests.
|
||||
- [`../internal/service/registerruntime/service_test.go`](../internal/service/registerruntime/service_test.go)
|
||||
— `fakeEngineVersions` gains a stub `Delete` to satisfy the
|
||||
extended port.
|
||||
- [`../README.md`](../README.md) — §References pointer to this
|
||||
record; §Persistence Layout note that engine-version CRUD audit
|
||||
entries store `version` in the `game_id` column and that
|
||||
`engine_version_delete` joins the op_kind enum.
|
||||
- [`../PLAN.md`](../PLAN.md) — Stage 14 marked done.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
cd gamemaster

# Mocks regenerate cleanly with no diff after the port extension is
# committed alongside this stage.
make mocks
git diff --exit-code internal/adapters/mocks

# Domain + port tests still pass (operation log enum membership).
go test ./internal/domain/... ./internal/ports/...

# Adapter test for the new Delete method and the migration's CHECK
# constraint.
go test ./internal/adapters/postgres/engineversionstore/...
go test ./internal/adapters/postgres/operationlog/...

# Service-level tests for the new orchestrator.
go test ./internal/service/engineversion/...

# Stage 13 service tests still pass (the fake gains a stub Delete).
go test ./internal/service/registerruntime/...

# Repo build succeeds at the workspace root.
go build ./...
```
|
||||
@@ -0,0 +1,297 @@
|
||||
---
|
||||
stage: 15
|
||||
title: Scheduler, turn generation, and snapshot publisher
|
||||
---
|
||||
|
||||
# Stage 15 — Scheduler, turn generation, and snapshot publisher
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the scheduler ticker, the turn-generation orchestrator,
|
||||
and the publication of `gm:lobby_events` plus `notification:intents`
|
||||
at PLAN Stage 15. It is the heart of Game Master: every running game
|
||||
flows through this code path on every scheduled or admin-forced turn.
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 15](../PLAN.md) ships three components that
|
||||
together drive a turn:
|
||||
|
||||
1. `service/turngeneration` — the orchestrator that CAS's `running →
|
||||
generation_in_progress`, calls the engine `/admin/turn`, branches
|
||||
on `finished`, and publishes a `runtime_snapshot_update` /
|
||||
`game_finished` event plus the corresponding `game.turn.ready` /
|
||||
`game.finished` / `game.generation_failed` notification.
|
||||
2. `service/scheduler` — a thin, stateless wrapper around
|
||||
`domain/schedule.Schedule.Next` reused by the turn-generation
|
||||
recompute step and (in Stage 17) by `service/adminforce`.
|
||||
3. `worker/schedulerticker` — the 1-second loop that scans
|
||||
`runtime_records.ListDueRunning(now)` and dispatches one
|
||||
`turngeneration.Handle` per due game.
|
||||
|
||||
The lifecycle the orchestrator drives is frozen by
|
||||
[`../README.md` §Lifecycles → Turn generation](../README.md), and the
|
||||
publication cadence by [§Async Stream Contracts](../README.md) and
|
||||
[§Notification Contracts](../README.md). The reference precedent for
|
||||
the orchestrator shape (Input / Result / Dependencies / NewService /
|
||||
Handle) is Stage 13's `service/registerruntime`.
|
||||
|
||||
Seven decisions deviate from a literal reading of either PLAN Stage 15,
|
||||
the README, or the Stage 13 precedent. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1. Resolve `game_name` synchronously from Lobby per notification
|
||||
|
||||
**Decision.** [`ports.LobbyClient`](../internal/ports/lobbyclient.go)
|
||||
gains a `GetGameSummary(ctx, gameID) (GameSummary, error)` method plus
|
||||
a narrow `GameSummary{GameID, GameName, Status}` type. The
|
||||
HTTP-backed adapter at
|
||||
[`internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go)
|
||||
issues a `GET /api/v1/internal/games/{game_id}` against the Lobby
|
||||
internal listener, decodes the `GameRecord` shape (Lobby's frozen
|
||||
contract), and wraps every non-success outcome with
|
||||
`ports.ErrLobbyUnavailable`. The `turngeneration` service calls it
|
||||
before publishing each `notification:intents` entry; on any error the
|
||||
orchestrator falls back to using `game_id` as `game_name` and logs a
|
||||
`warn` event with `error_code=lobby_unavailable`.
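The fail-soft lookup can be sketched as follows. This is an illustration only: `lobbyClient` is a hypothetical one-method view of `ports.LobbyClient`, `resolveGameName`, `fakeLobby`, and `demoName` are invented for the example, and the use of `log/slog` is an assumption about the logging library.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
)

// ErrLobbyUnavailable stands in for ports.ErrLobbyUnavailable (assumption).
var ErrLobbyUnavailable = errors.New("lobby unavailable")

// GameSummary mirrors the narrow type described in the decision.
type GameSummary struct {
	GameID   string
	GameName string
	Status   string
}

// lobbyClient is a hypothetical subset of the Lobby port.
type lobbyClient interface {
	GetGameSummary(ctx context.Context, gameID string) (GameSummary, error)
}

// resolveGameName returns the Lobby-owned name, or falls back to the game ID
// and logs a warning when the lookup fails for any reason.
func resolveGameName(ctx context.Context, lobby lobbyClient, logger *slog.Logger, gameID string) string {
	summary, err := lobby.GetGameSummary(ctx, gameID)
	if err != nil {
		logger.WarnContext(ctx, "game name lookup failed; using game_id",
			"game_id", gameID, "error_code", "lobby_unavailable", "err", err)
		return gameID
	}
	return summary.GameName
}

// fakeLobby returns a fixed name or a fixed error.
type fakeLobby struct {
	name string
	err  error
}

func (f fakeLobby) GetGameSummary(_ context.Context, gameID string) (GameSummary, error) {
	if f.err != nil {
		return GameSummary{}, f.err
	}
	return GameSummary{GameID: gameID, GameName: f.name, Status: "running"}, nil
}

// demoName resolves the name for a sample game with Lobby up or down.
func demoName(lobbyDown bool) string {
	lobby := fakeLobby{name: "Orion Spur"}
	if lobbyDown {
		lobby.err = ErrLobbyUnavailable
	}
	return resolveGameName(context.Background(), lobby, slog.Default(), "game-42")
}

func main() {
	fmt.Println(demoName(false))
	fmt.Println(demoName(true))
}
```

Because the fallback never returns an error, notification publication stays best-effort: a Lobby outage degrades the text of the notification rather than blocking it.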
|
||||
|
||||
**Why.** `notificationintent.GameTurnReadyPayload`,
|
||||
`GameFinishedPayload`, and `GameGenerationFailedPayload` all require a
|
||||
`game_name` string, but Game Master does not own the platform name and
|
||||
the `register-runtime` envelope does not carry it. Three alternatives
|
||||
were considered and rejected:
|
||||
|
||||
- **Extend the `register-runtime` contract with `game_name` and
|
||||
persist it on `runtime_records`.** Cleanest architecturally, but
|
||||
requires editing the Stage 06 frozen OpenAPI spec, the contract
|
||||
test, the Stage 09 migration, the Stage 10 domain type, the
|
||||
Stage 11 store and tests, the Stage 13 register-runtime service and
|
||||
tests, and the regenerated jet code. Substantial cross-stage churn
|
||||
for a single denormalised string.
|
||||
- **Use `game_id` as the `game_name` placeholder unconditionally.**
|
||||
Zero change cost, but every push notification a user receives
|
||||
carries the opaque platform identifier — a user-visible regression.
|
||||
- **Defer notification publication to Stage 16.** Contradicts the
|
||||
PLAN Stage 15 task list, which explicitly enumerates
|
||||
`game.turn.ready`, `game.finished`, and `game.generation_failed`
|
||||
publication.
|
||||
|
||||
The chosen design adds one method and one return type to a port
|
||||
already established in Stage 12, with fail-soft fallback semantics
|
||||
that keep notification publication best-effort.
|
||||
|
||||
### D2. `Trigger` parameter classifies telemetry, never logic
|
||||
|
||||
**Decision.** The plan's input shape `{gameID, trigger ∈ {scheduler,
|
||||
force}}` is preserved as `turngeneration.Input.Trigger`. The value
|
||||
flows into the
|
||||
`gamemaster.turn_generation.outcomes` counter as a
|
||||
`trigger` label and into structured logs; it does **not** branch the
|
||||
orchestrator's persistence path. The skip-tick mechanic is driven
|
||||
exclusively by the runtime record's `skip_next_tick` column.
|
||||
|
||||
**Why.** [`../README.md §Force-next-turn`](../README.md) describes
|
||||
adminforce as: "Run the turn-generation flow synchronously (the same
|
||||
code path the scheduler uses). After success, set
|
||||
`runtime_records.skip_next_tick = true`." Adminforce flips the flag
|
||||
*after* the forced turn completes; the *next* scheduler-driven
|
||||
generation consumes it. Forking the orchestrator on `Trigger` would
|
||||
duplicate the recompute logic in two places and reopen the question
|
||||
"what if a force fires while skip_next_tick is already true?".
|
||||
Single-path makes the answer fall out of the existing rule (read the
|
||||
flag at start, clear at recompute) without special cases.
|
||||
|
||||
### D3. Two-CAS pattern with cleanup on engine failure
|
||||
|
||||
**Decision.** Persistence steps mirror Stage 13's CAS-then-rollback
|
||||
pattern with two CAS transitions per generation:
|
||||
|
||||
1. `running → generation_in_progress` at the start. On
|
||||
`runtime.ErrConflict` (concurrent stop / external mutation) the
|
||||
orchestrator returns `Result{ErrorCode: conflict}` without
|
||||
publishing events; the external mutation is responsible for its
|
||||
own snapshot.
|
||||
2. After the engine call:
|
||||
- success + `finished=true` → `generation_in_progress → finished`;
|
||||
- success + `finished=false` → `generation_in_progress → running`;
|
||||
- engine error → `generation_in_progress → generation_failed`.
|
||||
|
||||
The post-engine CAS surfaces `runtime.ErrConflict` only when an
|
||||
external mutation (typical cause: admin issued a stop while the engine
|
||||
was generating) overtook the orchestrator. The engine call has
|
||||
already mutated state, but the runtime row is owned by the new actor;
|
||||
the orchestrator records the audit failure with `conflict` and exits.
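The two transitions can be sketched against an in-memory store. Everything here is illustrative: `memStore`, `cas`, and `generateTurn` are hypothetical names, the real store is PostgreSQL-backed, and event publication and audit writes are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Status string

const (
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusFinished             Status = "finished"
)

var (
	// ErrConflict stands in for runtime.ErrConflict (assumption).
	ErrConflict = errors.New("conflict")
	errEngine   = errors.New("engine unreachable")
)

// memStore is a toy stand-in for the runtime record store.
type memStore struct {
	mu     sync.Mutex
	status map[string]Status
}

// cas flips the status only when the row is still in the expected state.
func (s *memStore) cas(gameID string, from, to Status) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != from {
		return ErrConflict
	}
	s.status[gameID] = to
	return nil
}

// generateTurn performs CAS #1, calls the engine, then performs CAS #2 with a
// target chosen from the engine outcome. Either CAS can report a conflict.
func generateTurn(store *memStore, gameID string, callEngine func() (finished bool, err error)) (Status, error) {
	if err := store.cas(gameID, StatusRunning, StatusGenerationInProgress); err != nil {
		return "", err // someone else owns the row; publish nothing
	}
	finished, engineErr := callEngine()
	next := StatusRunning
	switch {
	case engineErr != nil:
		next = StatusGenerationFailed
	case finished:
		next = StatusFinished
	}
	if err := store.cas(gameID, StatusGenerationInProgress, next); err != nil {
		return "", err // an external mutation overtook the orchestrator
	}
	return next, nil
}

func main() {
	store := &memStore{status: map[string]Status{"g1": StatusRunning}}
	next, err := generateTurn(store, "g1", func() (bool, error) { return false, nil })
	fmt.Println(next, err)
	next, err = generateTurn(store, "g1", func() (bool, error) { return false, errEngine })
	fmt.Println(next, err)
}
```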
|
||||
|
||||
**Why.** This keeps Stage 13's pattern intact: every CAS knows what
|
||||
state the row should be in before the call, and a mismatch always
|
||||
yields `conflict`. Mixing the two CAS guards with a single combined
|
||||
status update (e.g., a transactional "running and not stopped") would
|
||||
require the adapter to expose multi-status CAS predicates, breaking
|
||||
the per-row CAS abstraction Stage 11 settled on.
|
||||
|
||||
### D4. Snapshot cadence: one publication per outcome
|
||||
|
||||
**Decision.** The orchestrator publishes exactly one
|
||||
`runtime_snapshot_update` *or* `game_finished` per turn-generation
|
||||
call:
|
||||
|
||||
- success + not finished → `PublishSnapshotUpdate` with full
|
||||
`player_turn_stats`;
|
||||
- success + finished → `PublishGameFinished` with full
|
||||
`player_turn_stats`;
|
||||
- engine failure → `PublishSnapshotUpdate` with
|
||||
`RuntimeStatus=generation_failed` and empty `player_turn_stats`
|
||||
(no fresh engine payload).
|
||||
|
||||
The intermediate `running → generation_in_progress` transition is
|
||||
**not** broadcast.
|
||||
|
||||
**Why.** The README cadence enumerates "transitioned" cases as
|
||||
examples (`running ↔ generation_in_progress`), but PLAN Stage 15
|
||||
explicitly anchors publication on the outcome side. Publishing twice
|
||||
would double Lobby's processing cost without delivering new
|
||||
information, because `generation_in_progress` carries no fresh engine
|
||||
state and Lobby cannot act on the in-progress moment.
|
||||
|
||||
### D5. Notification recipients = `playermappingstore.ListByGame`
|
||||
|
||||
**Decision.** `game.turn.ready` and `game.finished` use
|
||||
`AudienceKindUser` and need a sorted unique non-empty
|
||||
`recipient_user_ids` list. The orchestrator derives it from
|
||||
`playermappingstore.ListByGame(gameID)` projected to `UserID` values,
|
||||
deduplicated and sorted ascending. Empty rosters cause the
|
||||
notification to be skipped silently with a `warn` log; the runtime
|
||||
mutation persists.
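The projection to a sorted, unique recipient list can be sketched as below. `PlayerMapping` is a hypothetical shape for the store's rows, and skipping blank `UserID` values is an assumption added for safety rather than something the record states.

```go
package main

import (
	"fmt"
	"sort"
)

// PlayerMapping is a hypothetical row shape for playermappingstore.ListByGame.
type PlayerMapping struct {
	GameID   string
	UserID   string
	RaceName string
}

// recipientUserIDs projects mappings to user IDs, removes duplicates and
// blanks, and sorts ascending. An empty result means "skip the notification".
func recipientUserIDs(mappings []PlayerMapping) []string {
	seen := make(map[string]struct{}, len(mappings))
	ids := make([]string, 0, len(mappings))
	for _, m := range mappings {
		if m.UserID == "" {
			continue // assumption: blank IDs are never valid recipients
		}
		if _, dup := seen[m.UserID]; dup {
			continue
		}
		seen[m.UserID] = struct{}{}
		ids = append(ids, m.UserID)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	roster := []PlayerMapping{
		{GameID: "g1", UserID: "u3", RaceName: "Zorn"},
		{GameID: "g1", UserID: "u1", RaceName: "Akari"},
		{GameID: "g1", UserID: "u3", RaceName: "Zorn"},
	}
	fmt.Println(recipientUserIDs(roster))
}
```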
|
||||
|
||||
**Why.** This is the only roster data Game Master owns until Stage 16
|
||||
delivers the membership cache. After Stage 17 wires `banish`, the
|
||||
player_mappings rows still represent the engine-known roster and
|
||||
remain a correct conservative recipient set (banished members will be
|
||||
filtered separately by Notification Service's user resolution if
|
||||
absent in `User Service`). Adding a synchronous Lobby
|
||||
`GetMemberships` call here would duplicate the work Stage 16 is
|
||||
already on the hook to provide.
|
||||
|
||||
### D6. Scheduler service is a stateless utility
|
||||
|
||||
**Decision.**
|
||||
[`service/scheduler.Service`](../internal/service/scheduler/service.go)
|
||||
exposes a single `ComputeNext(turnSchedule, after, skipNextTick)
|
||||
(time.Time, bool, error)` method that wraps `schedule.Parse(...).Next(after,
|
||||
skipNextTick)`. The service holds no dependencies and no clock; the
|
||||
caller passes `after`. `turngeneration` injects a
|
||||
`*scheduler.Service` and uses it during the post-success recompute;
|
||||
Stage 17 will reuse the same instance from `adminforce`.
|
||||
|
||||
**Why.** Centralising the parse-then-next sequence in one place keeps
|
||||
the skip rule in one place and makes the future Stage 17 caller
|
||||
trivial. Holding no state means tests are pure value tests against the
|
||||
`domain/schedule` wrapper; no clock injection or dependency wiring is
|
||||
required.
|
||||
|
||||
### D7. Per-game in-flight set on the scheduler ticker
|
||||
|
||||
**Decision.**
|
||||
[`worker/schedulerticker.Worker`](../internal/worker/schedulerticker/worker.go)
|
||||
holds a `sync.Map` (keyed by game ID, with `struct{}` values) of currently-dispatched games. At
|
||||
each tick the worker scans `RuntimeRecords.ListDueRunning(now)` and
|
||||
launches one goroutine per due game; if `LoadOrStore` reports the game
|
||||
is already in-flight, the worker logs at `debug` and skips. The
|
||||
goroutine releases the slot via `defer w.inflight.Delete(gameID)`.
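The in-flight guard can be sketched as follows. The `Worker` here is a stripped-down, hypothetical version: `dispatch` and the `handle` callback are invented names, and the real worker also scans the store, logs skips, and records telemetry.

```go
package main

import (
	"fmt"
	"sync"
)

// Worker is a minimal stand-in for the scheduler ticker worker.
type Worker struct {
	inflight sync.Map // game ID -> struct{}
	wg       sync.WaitGroup
	handle   func(gameID string) // stands in for turngeneration.Handle
}

// dispatch launches one goroutine per due game that is not already in flight
// and returns how many goroutines it started.
func (w *Worker) dispatch(due []string) int {
	launched := 0
	for _, gameID := range due {
		if _, loaded := w.inflight.LoadOrStore(gameID, struct{}{}); loaded {
			continue // a previous tick is still working on this game
		}
		launched++
		w.wg.Add(1)
		go func(id string) {
			defer w.wg.Done()
			defer w.inflight.Delete(id) // release the slot when done
			w.handle(id)
		}(gameID)
	}
	return launched
}

// Wait blocks until every dispatched goroutine has finished.
func (w *Worker) Wait() { w.wg.Wait() }

func main() {
	release := make(chan struct{})
	w := &Worker{handle: func(string) { <-release }}

	fmt.Println("first tick launched:", w.dispatch([]string{"g1", "g2"}))
	fmt.Println("second tick launched:", w.dispatch([]string{"g1"}))

	close(release)
	w.Wait()
	fmt.Println("after drain launched:", w.dispatch([]string{"g1"}))
	w.Wait()
}
```

The slot is claimed synchronously inside `dispatch`, before the goroutine starts, so a second tick that sees the same due row is filtered out regardless of goroutine scheduling.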
|
||||
|
||||
**Why.** A 1-second tick is shorter than typical engine call latency
|
||||
plus PostgreSQL round-trips, so two ticks can observe the same due row
|
||||
before the first completes. The CAS in `turngeneration` is the
|
||||
authoritative protection (only one goroutine can flip `running →
|
||||
generation_in_progress`), but two goroutines doing the engine call and
|
||||
discarding the loser as `conflict` would waste an engine call and
|
||||
inflate `engine_validation_error` / `engine_unreachable` counters with
|
||||
spurious entries. The in-flight set is a 4-line optimisation that
|
||||
removes the spurious work.
|
||||
|
||||
`Worker.Wait` exposes the in-flight `sync.WaitGroup` so tests (and
|
||||
Stage 19's wiring) can drive `Tick` deterministically and observe
|
||||
completion. `Run` itself waits on the same group before returning so
|
||||
context cancellation gracefully drains in-flight work.
|
||||
|
||||
## Files landed
|
||||
|
||||
**Modified:**
|
||||
|
||||
- [`../internal/ports/lobbyclient.go`](../internal/ports/lobbyclient.go)
|
||||
— added `GetGameSummary` to the interface plus the `GameSummary`
|
||||
type.
|
||||
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go)
|
||||
— implemented `GetGameSummary` with the same `ErrLobbyUnavailable`
|
||||
wrapping precedent as `GetMemberships`.
|
||||
- [`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go)
|
||||
— table-driven tests for happy path, 404, 5xx, malformed JSON,
|
||||
missing required fields, timeout, and bad input.
|
||||
- [`../internal/adapters/mocks/mock_lobbyclient.go`](../internal/adapters/mocks/mock_lobbyclient.go)
|
||||
— regenerated.
|
||||
|
||||
**Created:**
|
||||
|
||||
- [`../internal/service/scheduler/service.go`](../internal/service/scheduler/service.go),
|
||||
[`../internal/service/scheduler/service_test.go`](../internal/service/scheduler/service_test.go)
|
||||
— stateless scheduler utility.
|
||||
- [`../internal/service/turngeneration/service.go`](../internal/service/turngeneration/service.go),
|
||||
[`../internal/service/turngeneration/errors.go`](../internal/service/turngeneration/errors.go),
|
||||
[`../internal/service/turngeneration/service_test.go`](../internal/service/turngeneration/service_test.go)
|
||||
— turn-generation orchestrator and tests.
|
||||
- [`../internal/worker/schedulerticker/worker.go`](../internal/worker/schedulerticker/worker.go),
|
||||
[`../internal/worker/schedulerticker/worker_test.go`](../internal/worker/schedulerticker/worker_test.go)
|
||||
— scheduler ticker worker and tests.
|
||||
- This decision record.
|
||||
|
||||
**Reused (not modified):**
|
||||
|
||||
- `internal/domain/runtime/{model.go, transitions.go}` —
|
||||
`running → generation_in_progress`, `generation_in_progress →
|
||||
running`, `generation_in_progress → generation_failed`,
|
||||
`generation_in_progress → finished` were all permitted by the
|
||||
Stage 10 transitions table.
|
||||
- `internal/domain/schedule/nexttick.go` — the cron + skip wrapper.
|
||||
- `internal/domain/operation/log.go` — the `OpKindTurnGeneration`
|
||||
enum value already in place.
|
||||
- `internal/ports/{runtimerecordstore.go, engineclient.go,
|
||||
playermappingstore.go, operationlog.go,
|
||||
notificationpublisher.go, lobbyeventspublisher.go}` — every store
|
||||
and publisher used by the orchestrator was already present.
|
||||
- `internal/telemetry/runtime.go` — `RecordTurnGenerationOutcome`,
|
||||
`RecordLobbyEventPublished`, `RecordNotificationPublishAttempt`.
|
||||
- `pkg/notificationintent.NewGameTurnReadyIntent`,
|
||||
`NewGameFinishedIntent`, `NewGameGenerationFailedIntent`.
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
cd gamemaster

# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks

# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Scheduler utility.
go test ./internal/service/scheduler/...

# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...

# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...

# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...

# Module-wide build and tests remain green.
go test ./...
```
|
||||
|
||||
Out of scope for this stage: app wiring (Stage 19), service-local
|
||||
integration suite (Stage 21), cross-service Lobby ↔ GM tests
|
||||
(Stage 22).
|
||||
@@ -0,0 +1,256 @@
|
||||
---
|
||||
stage: 16
|
||||
title: Hot-path services and membership cache
|
||||
---
|
||||
|
||||
# Stage 16 — Hot-path services and membership cache
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the gateway-facing trio of player services
|
||||
(`commandexecute`, `orderput`, `reportget`) and the in-process membership
|
||||
cache that authorises every hot-path call. It is the last service-layer
|
||||
stage before Stage 17 (admin operations) and Stage 19 (REST handlers and
|
||||
wiring).
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 16](../PLAN.md) ships four components that together
|
||||
make the player surface usable:
|
||||
|
||||
1. `service/membership` — concurrent in-process LRU cache holding the
|
||||
per-game `user_id → status` projection from
|
||||
`Lobby /api/v1/internal/games/{game_id}/memberships`. TTL is the
|
||||
safety net; the explicit invalidation hook from Lobby is the
|
||||
primary staleness control.
|
||||
2. `service/commandexecute` — orchestrator behind
|
||||
`POST /api/v1/internal/games/{game_id}/commands`. Authorises the
|
||||
caller, resolves `actor=race_name`, reshapes the JSON envelope, and
|
||||
forwards `PUT /api/v1/command` to the engine.
|
||||
3. `service/orderput` — same shape as `commandexecute`, targeting the
|
||||
engine `PUT /api/v1/order`.
|
||||
4. `service/reportget` — orchestrator behind
|
||||
`GET /api/v1/internal/games/{game_id}/reports/{turn}`. Authorises
|
||||
the caller, resolves `race_name`, and forwards
|
||||
`GET /api/v1/report?player=<race>&turn=<turn>` to the engine.
|
||||
|
||||
The reference precedent for the orchestrator shape (Input / Result /
|
||||
Dependencies / NewService / Handle, plus a private `classifyEngineError`
|
||||
helper) is Stage 15's `service/turngeneration`. Six decisions deviate
|
||||
from a literal reading of the README, the OpenAPI surface, or the
|
||||
turngeneration precedent. Each is recorded below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1. `reportget` does not require `runtime_records.status = running`
|
||||
|
||||
**Decision.**
|
||||
[`service/reportget`](../internal/service/reportget/service.go) accepts
|
||||
any non-deleted runtime row and forwards the read to the engine.
|
||||
`runtime_not_running` is **not** part of `reportget`'s error vocabulary
|
||||
([`errors.go`](../internal/service/reportget/errors.go)).
|
||||
`commandexecute` and `orderput`, by contrast, reject anything other than
|
||||
`StatusRunning` with `runtime_not_running`.
|
||||
|
||||
**Why.** Three signals point to the same conclusion:
|
||||
|
||||
- The OpenAPI surface for `internalGetReport`
|
||||
(`api/internal-openapi.yaml` lines 546–575) lists only
|
||||
`403 / 404 / 502 / 500` responses; there is no 409 / `runtime_not_running`
|
||||
on the report path. The matching error response on commands and
|
||||
orders (lines 502, 540) does include 409.
|
||||
- The README §Reports flow (`../README.md` lines 508–520) lists only
|
||||
authorisation, race-name resolution, and engine forwarding. The
|
||||
preceding §Player commands and orders block (lines 492–506) lists the
|
||||
`status=running` precondition explicitly. The two sections are
|
||||
separately worded by design.
|
||||
- A finished or stopped runtime is a normal target for a post-mortem
|
||||
read of older turns. Refusing the read forces operators to use ad-hoc
|
||||
database access for the same data the engine already exposes.
|
||||
|
||||
The `engine_unreachable` outcome remains the natural failure mode when
|
||||
the engine container is genuinely gone (e.g., on `engine_unreachable`
|
||||
status); no extra branch is required.
|
||||
|
||||
This decision was confirmed with the user during plan-mode review.
|
||||
|
||||
### D2. GM rewrites the engine envelope (`commands` → `cmd`, inject `actor`)
|
||||
|
||||
**Decision.**
|
||||
[`commandexecute.rewriteCommandPayload`](../internal/service/commandexecute/service.go)
|
||||
and the parallel
|
||||
[`orderput.rewriteOrderPayload`](../internal/service/orderput/service.go)
|
||||
unmarshal the GM `ExecuteCommandsRequest` / `PutOrdersRequest` body as
|
||||
`map[string]json.RawMessage`, take the `commands` field, and emit a
|
||||
fresh JSON object containing only `actor` (set to the resolved race
|
||||
name) and `cmd` (carrying the original array). Every other top-level
|
||||
key is dropped. The OpenAPI descriptions for `ExecuteCommandsRequest`
|
||||
and `PutOrdersRequest` were updated in the same patch to document the
|
||||
rewrite.
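The envelope rewrite can be sketched with `encoding/json` as below. `rewritePayload` is a hypothetical name covering both the command and order variants, and the error messages are invented; only the `commands` to `cmd` mapping and the injected `actor` come from the decision above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rewritePayload keeps only the commands array from the GM request body and
// emits a fresh engine envelope with the actor set by the caller. Any
// client-supplied top-level key, including a spoofed actor, is dropped.
func rewritePayload(body []byte, actor string) ([]byte, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(body, &top); err != nil {
		return nil, fmt.Errorf("decode request body: %w", err)
	}
	commands, ok := top["commands"]
	if !ok {
		return nil, errors.New("request body has no commands field")
	}
	return json.Marshal(struct {
		Actor string          `json:"actor"`
		Cmd   json.RawMessage `json:"cmd"`
	}{Actor: actor, Cmd: commands})
}

func main() {
	in := []byte(`{"commands":[{"type":"move"}],"actor":"spoofed","extra":true}`)
	out, err := rewritePayload(in, "Zerg")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Building the output from a fresh struct, rather than deleting keys from the decoded map, makes the allow-list explicit: a field reaches the engine only if the struct names it.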
|
||||
|
||||
**Why.** The literal "forwarded verbatim" wording in the original
|
||||
Stage 06 OpenAPI description conflicted with two upstream constraints:
|
||||
|
||||
- The engine `CommandRequest` schema in `game/openapi.yaml` lines
|
||||
345–364 declares `actor` and `cmd` as required, with no top-level
|
||||
`commands`.
|
||||
- The README §Hot Path rule "GM never trusts a payload field for actor
|
||||
identification" (`../README.md` lines 487–490) requires GM to set
|
||||
`actor` from the authenticated user identity.
|
||||
|
||||
Two alternatives were rejected:
|
||||
|
||||
- **Move the rewrite into `engineclient`.** The adapter's role is thin
|
||||
transport; injecting actor (an authorisation concern) into transport
|
||||
would muddle the boundary and make the adapter test harness
|
||||
authorisation-aware. The service is the right home.
|
||||
- **Inject `actor` only and keep the `commands` key.** The engine schema
|
||||
requires `cmd`; this would require an engine contract change outside
|
||||
the Stage 16 scope and break Stage 05's frozen path.
|
||||
|
||||
The transform is duplicated across the two services rather than
|
||||
extracted to a shared package. Each implementation is twelve lines and
|
||||
each service is otherwise independent; a shared package would add
|
||||
import-edge surface for marginal savings, and the project convention is
|
||||
to prefer the minimal diff (`CLAUDE.md §Priorities`). The duplication is
|
||||
explicitly documented in both file-level comments.
|
||||
|
||||
This decision was confirmed with the user during plan-mode review.
|
||||
|
||||
### D3. Hot-path services do not append to `operation_log`
|
||||
|
||||
**Decision.** None of the three services emit an `operation_log` entry.
|
||||
The `Input` shape carries no `OpSource`/`SourceRef` fields. Telemetry
|
||||
counters
|
||||
(`gamemaster.command_execute.outcomes`,
|
||||
`gamemaster.order_put.outcomes`, `gamemaster.report_get.outcomes`) are
|
||||
the only audit surface.
|
||||
|
||||
**Why.** The `operation.OpKind` enum
|
||||
(`internal/domain/operation/log.go`) intentionally has no value for
|
||||
command, order, or report — it stops at admin and lifecycle operations.
|
||||
Logging every hot-path call would multiply audit volume by the order rate
|
||||
without adding investigative value: the telemetry counter already
|
||||
exposes outcome distribution, and the engine itself is the source of
|
||||
truth for per-command results. Adding three new `OpKind` values would
|
||||
also bloat the SQL CHECK on `operation_log` with no operational
|
||||
consumer.
|
||||
|
||||
### D4. Membership cache uses a hand-rolled per-game inflight tracker
|
||||
|
||||
**Decision.**
|
||||
[`Cache.fetch`](../internal/service/membership/cache.go) coordinates
|
||||
concurrent misses on the same `game_id` through a tiny
|
||||
`map[gameID]*flight` plus a per-flight `done` channel. Joiners block on
|
||||
`select { case <-existing.done: case <-ctx.Done(): }`. The leader
|
||||
populates `members` (or `err`) on the flight before closing the channel.
|
||||
|
||||
**Why.** `golang.org/x/sync/singleflight` would be a sharper tool, but
|
||||
adding it as a *direct* dependency (it is currently only an indirect
|
||||
transitive dependency of other modules in the workspace) would have to clear the
|
||||
"justification for direct deps" bar set by `CLAUDE.md §Dependencies`.
|
||||
The cache is the only consumer in `gamemaster`, the implementation is
|
||||
~30 lines, and a context-cancellable wait is one extra `select` line we
|
||||
would otherwise have to wrap around `singleflight.Do` anyway. The
|
||||
cache-internal helper is the cheaper choice.
|
||||
|
||||
### D5. Cache returns the raw status string
|
||||
|
||||
**Decision.**
|
||||
[`Cache.Resolve`](../internal/service/membership/cache.go) returns
|
||||
`(status string, err error)` where the status is the verbatim Lobby
|
||||
vocabulary (`"active"`, `"removed"`, `"blocked"`) plus the empty string
|
||||
when the user is not in the roster. Callers compare against
|
||||
`membershipStatusActive = "active"` directly. There is no typed
|
||||
wrapper.
|
||||
|
||||
**Why.** `ports.Membership.Status` is already `string`
|
||||
(`internal/ports/lobbyclient.go` line 56); introducing a `MembershipStatus`
|
||||
domain type purely to be passed through would add boilerplate without
|
||||
enforcing any invariant Go's type system can check. The hot-path
|
||||
services need only a single equality check, so a typed enum buys
|
||||
nothing; it would also need a fallback for "unknown vocabulary"
|
||||
as a defence against future Lobby additions, which is more decision
|
||||
surface than the cache should own.
|
||||
|
||||
### D6. Empty roster slot surfaces as `forbidden`
|
||||
|
||||
**Decision.** Two distinct underlying conditions both surface as
|
||||
`ErrorCodeForbidden` from the three services:
|
||||
|
||||
- The membership cache returns the empty string for the requested
|
||||
`(gameID, userID)`: the user is not present in the Lobby roster.
|
||||
- The membership cache returns `"active"` but
|
||||
`playermappingstore.Get(gameID, userID)` returns
|
||||
`playermapping.ErrNotFound`: the user is an active platform member
|
||||
but has no engine roster slot.
|
||||
|
||||
The second condition is an internal inconsistency (register-runtime
|
||||
should have installed the row), but the user-visible semantics — "you
|
||||
are not authorised to act on this game" — are identical to the first.
|
||||
The structured log captures the underlying cause.
|
||||
|
||||
**Why.** Surfacing the second condition as `internal_error` would
|
||||
expose a 500 for a perfectly routine "user not part of the engine roster"
|
||||
case and obscure the actual outcome from the gateway and the user. The
|
||||
inconsistency, if it ever materialises, is an operator concern visible
|
||||
in the warn-level log and the `forbidden` metric attribution; treating
|
||||
it as a 5xx would help neither operators (who would then ignore the false
|
||||
alarm) nor users (who only care that they cannot act).
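
The mapping in D6 can be sketched as a single decision function. This is an illustration under stated assumptions: `authorise`, the `cause` strings, and `errMappingNotFound` (standing in for `playermapping.ErrNotFound`) are hypothetical, and the handling of non-active statuses and of other store errors is not specified by this record.

```go
package main

import (
	"errors"
	"fmt"
)

// errMappingNotFound stands in for playermapping.ErrNotFound.
var errMappingNotFound = errors.New("player mapping not found")

// authorise returns the error code shown to the caller plus the cause that
// would go to the structured log. An empty code means the call may proceed.
func authorise(status string, mappingErr error) (code, cause string) {
	switch {
	case status == "":
		return "forbidden", "user not in Lobby roster"
	case status != "active":
		// Assumption: D6 only discusses the empty and active cases.
		return "forbidden", "membership status " + status
	case errors.Is(mappingErr, errMappingNotFound):
		return "forbidden", "active member without engine roster slot"
	case mappingErr != nil:
		// Assumption: other store failures are outside the scope of D6.
		return "internal_error", mappingErr.Error()
	}
	return "", ""
}

func main() {
	fmt.Println(authorise("", nil))
	fmt.Println(authorise("active", errMappingNotFound))
	fmt.Println(authorise("active", nil))
}
```

Both forbidden branches return the same code, so the gateway and the user see one outcome while the log keeps the two causes apart.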
|
||||
|
||||
## Files landed
|
||||
|
||||
**Created:**
|
||||
|
||||
- [`../internal/service/membership/{errors.go, cache.go, cache_test.go}`](../internal/service/membership/)
|
||||
— concurrent LRU cache plus `ErrLobbyUnavailable` sentinel.
|
||||
- [`../internal/service/commandexecute/{errors.go, service.go, service_test.go}`](../internal/service/commandexecute/)
|
||||
— command-execute orchestrator and tests.
|
||||
- [`../internal/service/orderput/{errors.go, service.go, service_test.go}`](../internal/service/orderput/)
|
||||
— order-put orchestrator and tests.
|
||||
- [`../internal/service/reportget/{errors.go, service.go, service_test.go}`](../internal/service/reportget/)
|
||||
— report-get orchestrator and tests.
|
||||
- This decision record.
|
||||
|
||||
**Modified:**
|
||||
|
||||
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml) —
|
||||
rewrote the description fields of `ExecuteCommandsRequest` and
|
||||
`PutOrdersRequest` to document the GM-side envelope rewrite.
|
||||
|
||||
**Reused (not modified):**
|
||||
|
||||
- `internal/ports/{engineclient.go, lobbyclient.go,
|
||||
playermappingstore.go, runtimerecordstore.go}` — every interface and
|
||||
sentinel was already present.
|
||||
- `internal/domain/runtime/model.go` — `StatusRunning` constant + the
|
||||
whole status vocabulary.
|
||||
- `internal/domain/playermapping/model.go` — `PlayerMapping` and
|
||||
`ErrNotFound`.
|
||||
- `internal/domain/operation/log.go` — `Outcome` enum.
|
||||
- `internal/config/config.go` — `MembershipCacheConfig.{TTL, MaxGames}`
|
||||
with defaults `30s` / `4096`.
|
||||
- `internal/telemetry/runtime.go` —
|
||||
`RecordCommandExecuteOutcome`, `RecordOrderPutOutcome`,
|
||||
`RecordReportGetOutcome`, `RecordMembershipCacheResult`,
|
||||
`RecordEngineCall` (already wired in Stage 08).
|
||||
|
||||
## Verification
|
||||
|
||||
```sh
|
||||
cd gamemaster
|
||||
|
||||
# Membership cache (race-clean concurrency).
|
||||
go test -race ./internal/service/membership/...
|
||||
|
||||
# Each new player service.
|
||||
go test ./internal/service/commandexecute/...
|
||||
go test ./internal/service/orderput/...
|
||||
go test ./internal/service/reportget/...
|
||||
|
||||
# Module-wide build + suite.
|
||||
go build ./...
|
||||
go test ./...
|
||||
```
|
||||
|
||||
Out of scope for this stage: app wiring (Stage 19), service-local
|
||||
integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).
|
||||
@@ -0,0 +1,264 @@
|
||||
---
|
||||
stage: 17
|
||||
title: Admin operations and Lobby-facing liveness
|
||||
---
|
||||
|
||||
# Stage 17 — Admin operations and Lobby-facing liveness
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the four Game Master admin/inspect service-layer
|
||||
operations and the Lobby-facing liveness reply
|
||||
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`,
|
||||
`livenessreply`). Stage 17 is the last service-layer stage before
|
||||
Stage 18 (health-events consumer) and Stage 19 (REST handlers and
|
||||
wiring).
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 17](../PLAN.md) ships five services that close
|
||||
the GM service surface:
|
||||
|
||||
1. `service/adminstop` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime
|
||||
Manager and CASes `runtime_records.status → stopped`.
|
||||
2. `service/adminforce` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs
|
||||
the inner `service/turngeneration` flow synchronously, then sets
|
||||
`runtime_records.skip_next_tick = true`.
|
||||
3. `service/adminpatch` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime
|
||||
Manager and rotates `runtime_records.current_image_ref` plus
|
||||
`current_engine_version`.
|
||||
4. `service/adminbanish` — orchestrator behind
|
||||
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`.
|
||||
Resolves the race and calls the engine `/admin/race/banish`.
|
||||
5. `service/livenessreply` — orchestrator behind
|
||||
`GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own
|
||||
view of the runtime without ever calling the engine.
|
||||
|
||||
The reference precedent for the orchestrator shape (`Input` /
|
||||
`Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's
|
||||
`service/registerruntime` and Stage 15's `service/turngeneration`.
|
||||
Six decisions deviate from a literal reading of the README, the
|
||||
OpenAPI surface, or the turngeneration precedent. Each is recorded
|
||||
below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
|
||||
|
||||
**Decision.**
|
||||
[`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
|
||||
adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its
|
||||
own `UpdateImageInput` struct and `Validate`. The Postgres adapter
|
||||
gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`.
|
||||
The existing `UpdateStatus` is **not** repurposed for patch updates.
|
||||
|
||||
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls
|
||||
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
|
||||
`ExpectedFrom == To`. Patch deliberately keeps the runtime in
|
||||
`running`, so any attempt to feed `UpdateStatus` with
|
||||
`ExpectedFrom == To == running` is rejected before the SQL even
|
||||
runs. Three alternatives were on the table:
|
||||
|
||||
- Drop the `runtime.Transition` invariant from `UpdateStatusInput`
|
||||
to allow self-transitions. That would weaken the CAS validator
|
||||
for every existing caller — register-runtime, turngeneration,
|
||||
health-events consumer — and reintroduce the «accidental no-op
|
||||
status update» class of bugs the validator was added to catch.
|
||||
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning`
|
||||
edge in `domain/runtime/transitions.go`. Same blast radius as
|
||||
above, only with stronger semantic baggage in the transition table.
|
||||
- Add a dedicated `UpdateImage` method that only writes the two
|
||||
image columns plus `updated_at`. Bounded blast radius (one new
|
||||
method, one new input struct, one new SQL UPDATE), preserves the
|
||||
CAS invariant, and matches how Stage 11 already separated
|
||||
`UpdateScheduling` from `UpdateStatus` for the same reason.
|
||||
|
||||
The third option is what shipped. Existing fakes (`registerruntime`,
|
||||
`turngeneration`, hot-path tests, schedulerticker) carry an
|
||||
`UpdateImage` stub that returns `errors.New(...)` so a test that
|
||||
accidentally exercises the new path fails loudly.
|
||||
|
||||
### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`
|
||||
|
||||
**Decision.**
|
||||
[`service/adminstop`](../internal/service/adminstop/service.go) reads
|
||||
the runtime row first; if `Status ∈ {stopped, finished}`, the service
|
||||
returns `OutcomeSuccess` without calling Runtime Manager and without
|
||||
publishing a `runtime_snapshot_update`. If `Status == starting`, the
|
||||
service returns `conflict` with `OutcomeFailure`. Every other
|
||||
non-terminal status (`running`, `generation_in_progress`,
|
||||
`generation_failed`, `engine_unreachable`) takes the regular path:
|
||||
RTM call → CAS → snapshot publication.
|
||||
|
||||
**Why.** The README §Stop says «CAS `runtime_records.status: * →
|
||||
stopped`» but in practice three edge cases pull the service away
|
||||
from a literal CAS-only implementation:
|
||||
|
||||
- `stopped` and `finished` are common operator races: an admin clicks
|
||||
«stop» on a UI list while another admin already pressed it (or the
|
||||
game finished naturally). Returning `conflict` would force the UI
|
||||
to retry the read and confuse the operator. Idempotent success is
|
||||
the smallest-surprise behaviour and matches how Lobby's other
|
||||
admin-cancel flows handle terminal states.
|
||||
- `starting` is the active engine-init window. RTM has just been
|
||||
asked to start the container; an admin stop here would race the
|
||||
init flow and almost certainly leave the system in a partially
|
||||
cleaned state. The transition table in Stage 10 deliberately
|
||||
excludes `starting → stopped` for the same reason. Returning
|
||||
`conflict` lets the admin tooling surface «runtime is mid-init,
|
||||
retry in a moment» instead of pretending the stop succeeded.
|
||||
- The «obvious» fourth path — letting the CAS validator reject
|
||||
`starting → stopped` and surface that as the natural conflict —
|
||||
was rejected because it depends on validator implementation
|
||||
detail leaking through; the explicit pre-CAS check makes the
|
||||
intent obvious in the audit log and the structured logs.
|
||||
|
||||
The audit log records every pre-CAS rejection with
|
||||
`outcome=failure / error_code=conflict`, and every idempotent no-op
|
||||
with `outcome=success`, so operators can distinguish the cases in
|
||||
post-hoc analysis.
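
The pre-CAS check in D2 amounts to a three-way classification of the current status. A minimal sketch, assuming hypothetical names (`planStop`, `stopAction`) and treating any status outside the list in this record as an error:

```go
package main

import "fmt"

type stopAction string

const (
	actionNoop     stopAction = "noop_success" // no RTM call, no snapshot
	actionConflict stopAction = "conflict"     // rejected before the CAS
	actionStop     stopAction = "stop"         // RTM call, CAS, snapshot
)

// planStop classifies the current runtime status before any side effect.
func planStop(status string) (stopAction, error) {
	switch status {
	case "stopped", "finished":
		return actionNoop, nil
	case "starting":
		return actionConflict, nil
	case "running", "generation_in_progress", "generation_failed", "engine_unreachable":
		return actionStop, nil
	default:
		// Assumption: unknown statuses are treated as a programming error.
		return "", fmt.Errorf("unknown runtime status %q", status)
	}
}

func main() {
	for _, s := range []string{"finished", "starting", "engine_unreachable"} {
		action, err := planStop(s)
		fmt.Println(s, action, err)
	}
}
```

Keeping the classification explicit means the audit row can record the intended branch directly rather than inferring it from a validator error.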
|
||||
|
||||
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
|
||||
|
||||
**Decision.**
|
||||
[`service/adminforce`](../internal/service/adminforce/service.go)
|
||||
issues `UpdateScheduling{SkipNextTick=true,
|
||||
NextGenerationAt=turnResult.Record.NextGenerationAt,
|
||||
CurrentTurn=turnResult.Record.CurrentTurn}` after every successful
|
||||
inner turn-generation, regardless of whether `Result.Finished` is
|
||||
`true`.
|
||||
|
||||
**Why.** The cleaner branch — «skip the scheduling write when the
|
||||
turn just finished the game» — was considered and rejected:
|
||||
|
||||
- `turngeneration` already cleared `next_generation_at` and updated
|
||||
`current_turn` on the finishing branch (Stage 15
|
||||
`completeFinished`). A redundant write that re-affirms those
|
||||
values plus sets `skip_next_tick=true` does no harm: the row is
|
||||
already in `status=finished` and no scheduler tick will ever
|
||||
consume the flag.
|
||||
- The branchless code is shorter and the test contract is simpler
|
||||
(«adminforce always writes the skip flag on success»). One extra
|
||||
conditional saves zero SQL on the production path but doubles the
|
||||
set of cases the test matrix has to assert.
|
||||
- The README §Force-next-turn wording «After success, set
|
||||
`runtime_records.skip_next_tick = true`» is unconditional. Adding
|
||||
a runtime-side branch would silently weaken that contract.
|
||||
|
||||
The driver `op_kind=force_next_turn` audit row records the eventual
|
||||
outcome (success / failure with the same error code that
|
||||
turngeneration surfaced) so audit consumers can tell apart a forced
|
||||
turn that finished the game from a forced turn that prepared the
|
||||
next regular tick.
|
||||
|
||||
### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`
|
||||
|
||||
**Decision.**
|
||||
[`service/adminbanish`](../internal/service/adminbanish/service.go)
|
||||
reads the runtime row only to retrieve the `engine_endpoint`, then
|
||||
calls `playermappingstore.GetByRace`. A missing row maps to
|
||||
`error_code=forbidden`. The runtime status itself is **not**
|
||||
inspected; banish is dispatched even when the runtime is in
|
||||
`stopped`, `finished`, or `engine_unreachable`.
|
||||
|
||||
**Why.** Two threads informed the choice:
|
||||
|
||||
- README §Banish lists only two preconditions: «runtime exists»
|
||||
and «`race_name` resolves to an existing player_mappings row».
|
||||
Adding a status guard would silently extend the contract beyond
|
||||
what Lobby is allowed to depend on, and would make the banish
|
||||
flow fail differently from the documented set.
|
||||
- A banish on a stopped/finished runtime is a no-op at the engine
|
||||
side (the container is exited or absent). The engine call will
|
||||
fail with `engine_unreachable`, which is the right error for the
|
||||
caller to see — it means «the runtime was stopped before banish
|
||||
could land». Pre-rejecting with a different code would hide the
|
||||
real state from the operator.
|
||||
|
||||
The `forbidden` mapping for missing race mirrors Stage 16 D6 («empty
|
||||
roster surfaces as `forbidden`»). The frozen error vocabulary does
|
||||
not contain a `race_not_found` code, and `forbidden` is the
|
||||
semantically closest match: «the platform user this race belonged
|
||||
to is no longer authorised to act on the runtime».
|
||||
|
||||
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
|
||||
|
||||
**Decision.**
|
||||
[`service/livenessreply`](../internal/service/livenessreply/service.go)
|
||||
absorbs `runtime.ErrNotFound` into a successful Result with
|
||||
`Ready=false` and `Status=runtime.Status("")`. The Go-level error
|
||||
return is reserved for non-business failures only (nil context, nil
|
||||
receiver, store-read errors, invalid input). A handler that wraps
|
||||
this service answers 200 with body `{"ready": false, "status": ""}`
|
||||
when GM has no record for the requested game.
|
||||
|
||||
**Why.** README §Liveness reply specifies the endpoint «never calls
|
||||
the engine; it reflects GM's own view only» and explicitly says it
|
||||
returns 200 even when the runtime is not running. Three response
|
||||
shapes were considered:
|
||||
|
||||
- 200 with `status="runtime_not_found"`. Mixes runtime-status
|
||||
values with error codes in the same field, breaking the
|
||||
caller's enum-match dispatch.
|
||||
- 404 `runtime_not_found`. Contradicts the README §Liveness reply
|
||||
«return `200`» wording and forces Lobby's resume flow to add a
|
||||
404 handler that means «no observation» — semantically the same
|
||||
as `Ready=false`.
|
||||
- 200 with `status=""`. The empty status reads naturally as «GM
|
||||
has no observation»; Lobby's resume flow already needs to handle
|
||||
the `Ready=false` branch and the empty status is exactly what
|
||||
«no observation» looks like in practice. Chosen for the smallest
|
||||
caller-side complexity.
|
||||
|
||||
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
|
||||
|
||||
**Decision.** Both `service/adminstop` and `service/adminpatch` map
|
||||
every error from `RTMClient.Stop` / `RTMClient.Patch` to
|
||||
`error_code=service_unavailable`, regardless of whether the
|
||||
underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a
|
||||
dialler-level transport error.
|
||||
|
||||
**Why.** The frozen error vocabulary in
|
||||
[`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml)
|
||||
does not contain a `runtime_manager_unavailable` code. Three options
|
||||
were on the table:
|
||||
|
||||
- Add a new code. Rejected: the OpenAPI surface is contract-frozen
|
||||
from Stage 06 and adding a new error code is a wire-format change
|
||||
that pulls every consumer into a re-validation. Stage 17 deals
|
||||
with service-layer code only; no contract change is in scope.
|
||||
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call
|
||||
is a sibling-service hop, not an engine call; mixing the two in
|
||||
a single label confuses operators reading metric / log labels.
|
||||
- Map RTM failures to `service_unavailable`. Accepted: the
|
||||
vocabulary already documents `service_unavailable` as «a
|
||||
steady-state dependency was unreachable for this call», which is
|
||||
exactly what an RTM outage looks like from GM's perspective.
|
||||
|
||||
The Stage 12 D5 decision record in
|
||||
[`stage12-external-clients.md`](./stage12-external-clients.md)
|
||||
already records that the RTM adapter wraps every non-success
|
||||
outcome in `ports.ErrRTMUnavailable` without distinguishing
|
||||
sub-cases; Stage 17 simply consumes the unified sentinel.
|
||||
|
||||
## Cross-stage consequences
|
||||
|
||||
- The new port surface `RuntimeRecordStore.UpdateImage` is
|
||||
available to every later consumer; Stage 18 and Stage 19 do not
|
||||
use it. Existing hand-rolled fakes carry an error-returning stub (D1).
|
||||
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish`
|
||||
were introduced in Stage 09 / Stage 10 already; Stage 17 is their
|
||||
first writer.
|
||||
- The telemetry counter `gamemaster.banish.outcomes` (declared in
|
||||
Stage 08) gets its first call site in `service/adminbanish`. No
|
||||
new counters are introduced for `adminstop` / `adminforce` /
|
||||
`adminpatch` / `livenessreply`; the README §Observability list
|
||||
does not mention them and Stage 17 deliberately stays inside the
|
||||
declared instrument set.
|
||||
- The Stage 19 REST handlers consume the five services without
|
||||
service-layer changes: each handler decodes the JSON envelope,
|
||||
fills `Input.OpSource` / `Input.SourceRef` from the
|
||||
`X-Galaxy-Caller` header convention, and translates `Result.ErrorCode`
|
||||
into the standard error envelope.
|
||||
@@ -0,0 +1,171 @@
|
||||
---
|
||||
stage: 18
|
||||
title: runtime:health_events consumer
|
||||
---
|
||||
|
||||
# Stage 18 — `runtime:health_events` consumer
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the asynchronous consumer of the `runtime:health_events`
|
||||
Redis Stream produced by Runtime Manager. The consumer translates RTM
|
||||
observations into three effects on Game Master state:
|
||||
|
||||
1. Updates `runtime_records.engine_health` per game with a short
|
||||
summary string.
|
||||
2. For terminal container events applies a CAS
|
||||
`running → engine_unreachable`; for `probe_recovered` applies the
|
||||
symmetric recovery CAS `engine_unreachable → running`.
|
||||
3. Publishes a debounced `runtime_snapshot_update` on `gm:lobby_events`
|
||||
only when the engine-health summary or the runtime status actually
|
||||
changed.
|
||||
|
||||
The reference precedent for the worker shape (`Dependencies` /
|
||||
`NewWorker` / `Run` / `Shutdown` / exported `HandleMessage`) is the
|
||||
Lobby `gmevents` consumer at `lobby/internal/worker/gmevents`. Seven
|
||||
decisions deviate from a literal reading of [`../PLAN.md`](../PLAN.md)
|
||||
or are sharp enough to surface here.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1. Event-type taxonomy expanded to seven values
|
||||
|
||||
**Decision.** The consumer maps all seven values published by RTM
|
||||
([`rtmanager/internal/domain/health/snapshot.go`](../../rtmanager/internal/domain/health/snapshot.go)),
|
||||
not the six listed in PLAN Stage 18. The added values are
|
||||
`container_started` and `probe_recovered`. Both are mapped to the
|
||||
summary string `healthy`. `probe_recovered` additionally attempts the
|
||||
recovery CAS `engine_unreachable → running`. `container_started` does
|
||||
not transition status — Game Master owns runtime startup through the
|
||||
register-runtime flow, so RTM's container_started observation is
|
||||
informational at the consumer level.
|
||||
|
||||
**Why.** The transition table in
|
||||
[`internal/domain/runtime/transitions.go`](../internal/domain/runtime/transitions.go)
|
||||
already declares `engine_unreachable → running` with the comment
|
||||
`reserved for the Stage 18 consumer; declared here so Stage 18 needs
|
||||
no transitions edit`. The reserved transition is only useful when an
|
||||
event in the input stream actually triggers it; the only such event in
|
||||
RTM's vocabulary is `probe_recovered`. Leaving the two extra event
|
||||
types unmapped would either drop information (if ignored entirely) or
|
||||
keep the recovery transition forever unreachable. Mapping them now is
|
||||
the minimum diff that closes the loop.
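
The mapping for the event types named in this record can be sketched as a lookup. Only three of the seven RTM values appear here, so the sketch covers just those; the `plan` struct, the `planFor` name, and the `oom` summary string for `container_oom` are assumptions.

```go
package main

import "fmt"

// plan describes what the consumer intends to do for one event.
// An empty From/To pair means "no status transition".
type plan struct {
	Summary  string
	From, To string
}

// planFor covers only the event types this record names explicitly.
func planFor(eventType string) (plan, bool) {
	switch eventType {
	case "container_started":
		// Informational: register-runtime owns startup.
		return plan{Summary: "healthy"}, true
	case "probe_recovered":
		return plan{Summary: "healthy", From: "engine_unreachable", To: "running"}, true
	case "container_oom":
		// Summary string assumed for illustration.
		return plan{Summary: "oom", From: "running", To: "engine_unreachable"}, true
	}
	return plan{}, false
}

func main() {
	for _, ev := range []string{"container_started", "probe_recovered", "container_oom"} {
		p, ok := planFor(ev)
		fmt.Printf("%s ok=%v %+v\n", ev, ok, p)
	}
}
```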
|
||||
|
||||
### D2. CAS conflict on a status mutation falls back to a health-only update
|
||||
|
||||
**Decision.** When the worker plans a status transition (e.g.,
|
||||
`running → engine_unreachable` for `container_oom`) and
|
||||
`RuntimeRecordStore.UpdateStatus` returns `runtime.ErrConflict` or
|
||||
`runtime.ErrInvalidTransition`, the worker logs the conflict at debug level
|
||||
and falls back to `RuntimeRecordStore.UpdateEngineHealth`. The summary
|
||||
column is refreshed; the status column stays under whatever the
|
||||
concurrent flow holds.
|
||||
|
||||
**Why.** Two flows can hold the runtime row when an RTM event arrives:
|
||||
turn generation (`generation_in_progress`) and admin operations
|
||||
(`stopped`, `finished`). Forcing the consumer to win over those flows
|
||||
would either reintroduce stale-status writes or require expanding the
|
||||
allowed-transitions table to include every non-terminal source — the
|
||||
latter weakens the guard that turn generation relies on. The failure
|
||||
semantics turn-generation already implements (engine call timeout →
|
||||
`generation_failed`) cover the case where an `oom` arrives while a
|
||||
turn is in flight: the engine call from turngeneration will fail
|
||||
naturally a moment later. The consumer's job in that window is to keep
|
||||
the summary current so operators see «last known: oom» on
|
||||
`gm:lobby_events`.
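
The fallback in D2 can be sketched as follows. The trimmed `healthStore` interface, its positional arguments, and the idea that `UpdateStatus` also writes the summary on success are assumptions for illustration; the sentinel names mirror `runtime.ErrConflict` and `runtime.ErrInvalidTransition`.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrConflict          = errors.New("runtime: conflict")
	ErrInvalidTransition = errors.New("runtime: invalid transition")
)

// healthStore is a trimmed, hypothetical view of RuntimeRecordStore.
type healthStore interface {
	UpdateStatus(gameID, from, to, summary string) error
	UpdateEngineHealth(gameID, summary string) error
}

// applyEvent tries the planned transition and reports whether the status
// changed. On a CAS conflict it refreshes only the health summary.
func applyEvent(s healthStore, gameID, from, to, summary string) (bool, error) {
	err := s.UpdateStatus(gameID, from, to, summary)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, ErrConflict), errors.Is(err, ErrInvalidTransition):
		return false, s.UpdateEngineHealth(gameID, summary)
	default:
		return false, err
	}
}

// fakeStore holds one row and rejects a CAS whose expected status is stale.
type fakeStore struct{ status, health string }

func (f *fakeStore) UpdateStatus(_, from, to, summary string) error {
	if f.status != from {
		return ErrConflict
	}
	f.status, f.health = to, summary
	return nil
}

func (f *fakeStore) UpdateEngineHealth(_, summary string) error {
	f.health = summary
	return nil
}

func main() {
	// Turn generation holds the row, so the planned CAS conflicts.
	row := &fakeStore{status: "generation_in_progress"}
	changed, err := applyEvent(row, "g1", "running", "engine_unreachable", "oom")
	fmt.Println(changed, err, row.status, row.health)
}
```

In the conflicting case the status stays with the concurrent flow while the summary still reflects the latest RTM observation.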
|
||||
|
||||
### D3. New port method `UpdateEngineHealth`
|
||||
|
||||
**Decision.** [`internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
|
||||
gains a new method `UpdateEngineHealth(ctx, UpdateEngineHealthInput) error`
|
||||
with its own input struct and `Validate`. The Postgres adapter gains a
|
||||
matching `UPDATE runtime_records SET engine_health = $1, updated_at =
|
||||
$2 WHERE game_id = $3`. The existing `UpdateStatus` is **not**
|
||||
repurposed for health-only updates.
|
||||
|
||||
**Why.** `UpdateStatusInput.Validate` calls
|
||||
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
|
||||
`ExpectedFrom == To` (Stage 17 D1). A health-only update keeps the
|
||||
runtime in its current status, so any attempt to feed `UpdateStatus`
|
||||
with `ExpectedFrom == To` is rejected before the SQL even runs. The
|
||||
same precedent led Stage 17 to add `UpdateImage` rather than relax the
|
||||
self-transition guard. Stage 18 follows that precedent.
|
||||
|
||||
In addition, the health update is not gated on a CAS at all:
|
||||
late-arriving events should still bookkeep the summary regardless of the
|
||||
current status (including `stopped` and `finished`). A guarded
|
||||
`UpdateStatus`-shaped variant would have to enumerate every source
|
||||
status the consumer might observe; an unguarded `UpdateEngineHealth`
|
||||
sidesteps the question.
|
||||
|
||||
### D4. In-memory dedupe of last-emitted summaries per game
|
||||
|
||||
**Decision.** The worker keeps a `map[string]string` (`gameID →
|
||||
lastEmittedSummary`) under a `sync.RWMutex`. A snapshot is published
|
||||
when either the status transitioned in this iteration or when the new
|
||||
summary differs from the cached one for the same game. The cache is
|
||||
process-local; on restart it is empty.
|
||||
|
||||
**Why.** [`../README.md` §`gm:lobby_events`](../README.md) freezes the
|
||||
publication rule: snapshots are emitted on transitions and on health-
|
||||
summary changes («debounced — duplicates are suppressed when the
|
||||
summary did not change»). Stage 18 chooses an in-process map over a
|
||||
Redis-backed dedupe for two reasons:
|
||||
|
||||
1. Game Master is single-instance in v1
|
||||
([`../README.md §Non-Goals`](../README.md)); a per-process map is
|
||||
sufficient for v1 correctness.
|
||||
2. Losing the cache on restart causes at most one extra snapshot per
|
||||
game right after restart — Lobby's `gmevents` consumer is
|
||||
idempotent (CAS-protected status transitions, deterministic
|
||||
snapshot blob), so the extra emission is benign.
|
||||
|
||||
A Redis-backed dedupe is cheap to introduce later if multi-instance
|
||||
Game Master ever lands; until then the simpler choice ships less code.
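
A minimal sketch of the in-process dedupe described in D4. The type and method names are hypothetical, and recording the summary in a separate `markPublished` step (rather than at check time) is an assumption about ordering that this record does not spell out.

```go
package main

import (
	"fmt"
	"sync"
)

// summaryDedupe remembers the last emitted health summary per game.
type summaryDedupe struct {
	mu   sync.RWMutex
	last map[string]string // gameID -> last emitted summary
}

func newSummaryDedupe() *summaryDedupe {
	return &summaryDedupe{last: make(map[string]string)}
}

// shouldPublish is true on a status transition, on the first observation
// of a game, or when the summary differs from the cached one.
func (d *summaryDedupe) shouldPublish(gameID, summary string, statusChanged bool) bool {
	if statusChanged {
		return true
	}
	d.mu.RLock()
	defer d.mu.RUnlock()
	prev, seen := d.last[gameID]
	return !seen || prev != summary
}

// markPublished records the summary after a snapshot was emitted.
func (d *summaryDedupe) markPublished(gameID, summary string) {
	d.mu.Lock()
	d.last[gameID] = summary
	d.mu.Unlock()
}

func main() {
	d := newSummaryDedupe()
	for _, summary := range []string{"healthy", "healthy", "oom"} {
		publish := d.shouldPublish("g1", summary, false)
		if publish {
			d.markPublished("g1", summary)
		}
		fmt.Println(summary, publish)
	}
}
```

An empty map after restart makes the first observation per game publish once, which matches the "at most one extra snapshot" bound given above.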
|
||||
|
||||
### D5. Snapshot construction reads the runtime row again after the mutation
|
||||
|
||||
**Decision.** Whenever the worker decides to publish, it re-reads the
|
||||
runtime record (`RuntimeRecordStore.Get`) and builds the
|
||||
`RuntimeSnapshotUpdate` from that fresh row. The `EngineHealthSummary`,
|
||||
`RuntimeStatus`, and `CurrentTurn` fields therefore reflect whatever
|
||||
the database holds after the mutation, rather than what the worker
|
||||
just intended to write.
|
||||
|
||||
**Why.** Two paths can produce the same publish decision: the CAS
|
||||
succeeded (status changed, summary changed), or the CAS conflicted and
|
||||
the fallback `UpdateEngineHealth` took over (status unchanged from the
|
||||
worker's point of view, but possibly mutated by a concurrent flow
|
||||
between the conflict and the read). A single read-after-write reduces
|
||||
both paths to the same envelope-building code and keeps the snapshot
|
||||
honest about what is actually in the database. `PlayerTurnStats` is
|
||||
intentionally left as `nil`: the consumer does not have a fresh engine
|
||||
state payload, so per-player stats stay empty until the next turn
|
||||
(this matches [`../README.md` §`gm:lobby_events`](../README.md) for status-only
|
||||
transitions).
|
||||
|
||||
### D6. Stream-offset label is `health_events`
|
||||
|
||||
**Decision.** The consumer uses the short label `health_events` for
|
||||
`StreamOffsetStore.Load` / `Save`. The corresponding Redis key is
|
||||
`gamemaster:stream_offsets:health_events`.
|
||||
|
||||
**Why.** The label convention is documented in
|
||||
[`../README.md §Persistence Layout / Redis runtime-coordination state`](../README.md):
|
||||
short logical identifier of the consumer, stable across renames of the
|
||||
underlying stream key. The Lobby `gmevents` consumer follows the same
|
||||
shape (`gm_lobby_events`).
|
||||
|
||||
### D7. Worker wiring deferred to Stage 19
|
||||
|
||||
**Decision.** Stage 18 ships the worker package and unit/loop tests but
|
||||
does not register the worker as an `app.Component` in
|
||||
`internal/app/runtime.go`. Wiring is deferred to Stage 19.
|
||||
|
||||
**Why.** The same pattern is already in place for the scheduler ticker
|
||||
introduced at Stage 15: the worker exists in the source tree but is
|
||||
not wired into `runtime.app = New(cfg, internalServer)`. Stage 19
|
||||
explicitly bundles handler wiring with worker wiring (see PLAN
|
||||
Stage 19), so deferring is consistent with the precedent. The
|
||||
configuration values the wiring will need (stream name, block timeout,
|
||||
offset-store DSN) are already loaded by `internal/config` and were
|
||||
introduced in Stage 08.
|
||||
@@ -0,0 +1,230 @@
|
||||
---
|
||||
stage: 19
|
||||
title: Internal REST handlers
|
||||
---
|
||||
|
||||
# Stage 19 — Internal REST handlers
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
bringing the trusted internal REST listener of Game Master to full
|
||||
contract coverage. The handlers wire the existing service layer
|
||||
(Stages 13–17) and the membership cache (Stage 16) to the eighteen
|
||||
operations frozen by
|
||||
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml). The
|
||||
listener lifecycle, OpenTelemetry middleware, and the `/healthz` /
|
||||
`/readyz` probes were established in Stage 08; this stage adds the
|
||||
per-operation handler subpackage, widens the listener `Dependencies`
|
||||
struct to thread every service port, and grows
|
||||
[`../internal/app/wiring.go`](../internal/app/wiring.go) to construct
|
||||
the entire dependency graph (stores, adapters, services, workers).
|
||||
|
||||
The reference precedent for the handler shape is the rtmanager
|
||||
`internal/api/internalhttp/handlers` tree; the conformance test
|
||||
mirrors `rtmanager/internal/api/internalhttp/conformance_test.go`.
|
||||
Eight decisions deviate from a literal reading of
|
||||
[`../PLAN.md`](../PLAN.md) or are sharp enough to surface here.
|
||||
|
||||
## Decisions

### D1. Conformance test lives inside the listener package

**Decision.** The OpenAPI conformance test ships at
[`../internal/api/internalhttp/conformance_test.go`](../internal/api/internalhttp/conformance_test.go),
in the `internalhttp` package, not at
`gamemaster/api/openapi_conformance_test.go` as the literal text of
PLAN.md Stage 19 suggests.

**Why.** The test instantiates the live `Server.handler` through
`NewServer(...)` with stub services and replays each documented
operation against it. That requires reading the unexported
`handler` field and wiring stub implementations of the
handler-package interfaces; both are package-internal concerns that a
sibling test under `gamemaster/api/` could not reach without
exporting hooks that exist solely for the test. The rtmanager
service ships the analogous test inside its own `internalhttp`
package; we follow the same idiom.

**How to apply.** Future surface-shape audits go in this file.
The PLAN.md location is treated as drift; the constraint that the
spec is covered by a kin-openapi-driven validation is honoured exactly.

### D2. `DELETE /engine-versions/{version}` calls `Service.Deprecate`

**Decision.** The handler bound to the OpenAPI operation
`internalDeprecateEngineVersion` calls
[`engineversion.Service.Deprecate`](../internal/service/engineversion/service.go)
and never `Service.Delete`. The 409 response declared by the
spec for `engine_version_in_use` is therefore unreachable on this
endpoint.

**Why.** The operation id and the first sentence of the description
explicitly say «Sets the engine version status to `deprecated`». The
sentence about hard removal and `engine_version_in_use` is left over
from an earlier intent: `Service.Deprecate` does not consult
`IsReferencedByActiveRuntime`, so the in-use rejection cannot fire
through this code path. Hard delete is a future Admin Service
operation; v1 does not expose it through REST.

**How to apply.** Callers that need to remove the registry row
permanently must use `Service.Delete` directly (not yet wired through
REST). The spec's leftover 409 example is recorded here so a future
contract reviewer does not chase a phantom failure mode.

### D3. Workers wired and started alongside the listener

**Decision.** This stage constructs the scheduler ticker (stage 15)
and the `runtime:health_events` consumer (stage 18) inside
`wiring.buildWorkers` and registers them as `app.Component` values
next to the internal HTTP server.

**Why.** Stage 19's narrow text says «ship the gateway-, Lobby- and
Admin-facing REST surface backed by the service layer». But the
service layer collaborators referenced from the listener (turn
generation, membership cache, runtime record store, etc.) only make
sense inside a process that is also producing turns and consuming
health events. Keeping the workers idle would leave the wiring graph
half-built and the dev experience surprising. Constructing and
starting them here makes a freshly-deployed process production-ready
the moment the listener accepts traffic.

**How to apply.** The two workers are owned by `App.Run` exactly
like the listener: both `Run` (long-lived) and `Shutdown` are part
of `app.Component`. See D4 for the trivial `Shutdown` added on the
scheduler ticker.

### D4. `schedulerticker.Worker.Shutdown` is a no-op

**Decision.** The scheduler ticker adds a one-line
`Shutdown(_ context.Context) error { return nil }` so the type
satisfies `app.Component`.

**Why.** The worker's `Run` already returns when the supplied
context is cancelled, and `wg.Wait` drains the in-flight per-game
goroutines before `Run` returns. There is nothing additional to
release. The `healtheventsconsumer.Worker` already had a `Shutdown`
from stage 18; this just brings the two workers to the same shape.

**How to apply.** When future workers grow real shutdown logic
(buffered output to flush, persistent connections to drain), they
should embed it inside `Shutdown` rather than relying on context
cancellation alone.

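The shape described in D4 can be sketched as follows. This is a minimal illustration, not the actual `schedulerticker` code: the `Component` interface stands in for `app.Component`, and the `Worker` fields, the tick interval, and the choice to return `ctx.Err()` from `Run` are all assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Component is a hypothetical stand-in for app.Component:
// a long-lived Run plus a Shutdown hook.
type Component interface {
	Run(ctx context.Context) error
	Shutdown(ctx context.Context) error
}

// Worker is a simplified ticker; the real worker's fields are not shown
// in the decision record, so these are illustrative only.
type Worker struct {
	interval time.Duration
	wg       sync.WaitGroup
}

// Run ticks until ctx is cancelled, then drains in-flight goroutines
// before returning, so nothing is left for Shutdown to release.
func (w *Worker) Run(ctx context.Context) error {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			w.wg.Wait() // wait for per-game work started by earlier ticks
			return ctx.Err()
		case <-ticker.C:
			w.wg.Add(1)
			go func() {
				defer w.wg.Done()
				// Per-game turn generation would happen here.
			}()
		}
	}
}

// Shutdown is a no-op: Run has already released everything.
func (w *Worker) Shutdown(_ context.Context) error { return nil }

// Compile-time check that *Worker satisfies the interface.
var _ Component = (*Worker)(nil)

func main() {
	w := &Worker{interval: 5 * time.Millisecond}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Millisecond)
	defer cancel()

	runErr := w.Run(ctx)
	fmt.Println("run:", runErr)
	fmt.Println("shutdown:", w.Shutdown(context.Background()))
}
```

The compile-time assertion is the useful part of the sketch: it fails the build if a worker stops satisfying the component interface, which is the only job the no-op `Shutdown` has.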
### D5. New `RuntimeRecordStore.List(ctx)` method

**Decision.** The port grows a fifth read method:
`List(ctx) ([]runtime.RuntimeRecord, error)`. The PostgreSQL
adapter implements it as one SELECT ordered by
`(created_at DESC, game_id ASC)`.

**Why.** The OpenAPI operation `internalListRuntimes` accepts an
optional `status` query parameter. With the parameter set, the
existing `ListByStatus` answers; without it, no method on the port
returned every record. Composing the unfiltered list as a
loop-over-statuses would weaken the ordering guarantee and double
the round-trip cost. The new method is additive: every other
caller keeps using its narrow read.

**How to apply.** Test fakes (`fakeRuntimeRecords` in service tests,
`fakeRuntimeRecordsBackend` in scheduler-ticker tests) gained the
method as well. The handler-side `RuntimeRecordsReader` interface
exposes only the three read methods (`Get`, `List`, `ListByStatus`)
so the listener cannot accidentally mutate runtime state.

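A test fake that honours the same ordering as the adapter could look roughly like the sketch below. The `RuntimeRecord` fields, the `listOrdered` helper, and the sample status values other than `starting` are hypothetical; only the `(created_at DESC, game_id ASC)` ordering comes from the decision above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// RuntimeRecord is a trimmed, hypothetical version of the domain type.
type RuntimeRecord struct {
	GameID    string
	Status    string
	CreatedAt time.Time
}

// listOrdered returns a copy of records sorted by
// (created_at DESC, game_id ASC), the order an in-memory fake
// would need to match the SELECT described in D5.
func listOrdered(records []RuntimeRecord) []RuntimeRecord {
	out := make([]RuntimeRecord, len(records))
	copy(out, records)
	sort.SliceStable(out, func(i, j int) bool {
		if !out[i].CreatedAt.Equal(out[j].CreatedAt) {
			return out[i].CreatedAt.After(out[j].CreatedAt) // newest first
		}
		return out[i].GameID < out[j].GameID // tie-break on game id
	})
	return out
}

// sampleRecords builds three records, two of which share a timestamp.
func sampleRecords() []RuntimeRecord {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []RuntimeRecord{
		{GameID: "game-b", Status: "running", CreatedAt: base},
		{GameID: "game-c", Status: "starting", CreatedAt: base.Add(time.Hour)},
		{GameID: "game-a", Status: "stopped", CreatedAt: base},
	}
}

func main() {
	for _, r := range listOrdered(sampleRecords()) {
		fmt.Println(r.GameID, r.Status)
	}
}
```

With the sample data, `game-c` sorts first because it is newest, and `game-a` precedes `game-b` because the two share a creation time and fall back to the id tie-break.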
### D6. `next_generation_at` encodes as `0` when unscheduled

**Decision.** The wire `RuntimeRecord.next_generation_at` field is
declared required with `format: int64`. The domain holds
`*time.Time` and may carry `nil`, typically while a runtime is in
status `starting` and the first scheduling write has not yet
landed. The encoder writes `0` in that case and writes the UTC
millisecond value otherwise.

**Why.** Encoding `nil` as `0` keeps the wire shape JSON-Schema-valid
without forcing every record reader to handle a missing field.
Optional pointer-typed timestamps (`started_at`, `stopped_at`,
`finished_at`) are still omitted from the JSON form via `omitempty`,
matching the `required` list in the spec.

**How to apply.** Readers must treat `next_generation_at == 0` as
«not yet scheduled» when the status warrants it; the field will
turn into a real Unix-millisecond value once the scheduler's first
write lands. The conformance test seeds a non-nil
`NextGenerationAt`, so the strict response validator never sees
this edge case at the wire boundary.

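The encoding rule in D6 is small enough to show in full. The sketch below assumes a hypothetical wire struct and helper name (`runtimeRecordWire`, `encodeNextGenerationAt`); the real encoder may be organised differently, but the nil-to-zero rule and the `omitempty` treatment of optional timestamps follow the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// runtimeRecordWire is a hypothetical, trimmed wire shape.
type runtimeRecordWire struct {
	GameID           string `json:"game_id"`
	NextGenerationAt int64  `json:"next_generation_at"`   // required: always present
	StartedAt        *int64 `json:"started_at,omitempty"` // optional: omitted when nil
}

// encodeNextGenerationAt maps a nil domain timestamp to 0 and any
// other value to Unix milliseconds.
func encodeNextGenerationAt(t *time.Time) int64 {
	if t == nil {
		return 0
	}
	return t.UTC().UnixMilli()
}

// exampleScheduled is an arbitrary instant used only for the demo.
var exampleScheduled = time.UnixMilli(1_700_000_000_000)

func mustMarshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	unscheduled := runtimeRecordWire{
		GameID:           "game-a",
		NextGenerationAt: encodeNextGenerationAt(nil),
	}
	scheduled := runtimeRecordWire{
		GameID:           "game-a",
		NextGenerationAt: encodeNextGenerationAt(&exampleScheduled),
	}
	fmt.Println(mustMarshal(unscheduled))
	fmt.Println(mustMarshal(scheduled))
}
```

Both records carry `next_generation_at`, while `started_at` is absent from both, which is the asymmetry between required and optional timestamps that the decision describes.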
### D7. Hot-path bodies are pass-through, not strict-decoded

**Decision.** The handlers for `internalExecuteCommands` and
`internalPutOrders` read the request body as raw bytes. The body is
rejected only when empty or not valid JSON; unknown fields pass through.

**Why.** The OpenAPI request schemas for these operations carry
`additionalProperties: true` because the envelopes are engine-owned
(`galaxy/game/openapi.yaml`). Strict decoding here would reject
legitimate engine extensions and force every contract bump to land
in two services in lockstep.

**How to apply.** Engine `engine_validation_error` responses still
surface as the canonical Game Master error envelope at HTTP 502.
The engine response body is recorded in `result.RawResponse` for
audit, but the OpenAPI spec mandates the error envelope on this code
path. If a future contract version requires forwarding the engine's
4xx body to the gateway, a separate response shape needs to land in
the spec first.

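The pass-through check in D7 amounts to two tests on the raw bytes. The following is a sketch under stated assumptions: `validatePassThroughBody` and the two error values are hypothetical names, and treating a whitespace-only body as empty is a choice made for the example rather than something the record specifies.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical error values for the two rejection cases.
var (
	errEmptyBody   = errors.New("request body is empty")
	errInvalidJSON = errors.New("request body is not valid JSON")
)

// validatePassThroughBody accepts any syntactically valid JSON document.
// It never decodes into a struct, so unknown fields are not rejected.
func validatePassThroughBody(body []byte) error {
	if len(bytes.TrimSpace(body)) == 0 {
		return errEmptyBody
	}
	if !json.Valid(body) {
		return errInvalidJSON
	}
	return nil
}

func main() {
	samples := [][]byte{
		[]byte(`{"commands":[],"future_engine_field":true}`),
		[]byte(``),
		[]byte(`{"commands":`),
	}
	for _, body := range samples {
		fmt.Printf("%q -> %v\n", body, validatePassThroughBody(body))
	}
}
```

Because `json.Valid` only checks syntax, a field the engine adds later reaches the engine unchanged without a matching release of this service.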
### D8. `X-Galaxy-Caller` mapping with admin default

**Decision.** The `resolveOpSource` helper maps the
`X-Galaxy-Caller` header values to
[`operation.OpSource`](../internal/domain/operation/log.go) as
follows: `gateway → OpSourceGatewayPlayer`,
`lobby → OpSourceLobbyInternal`, `admin → OpSourceAdminRest`.
Missing or unrecognised values fall back to `OpSourceAdminRest`,
matching the contract documented in
[`../README.md` §«Internal REST API»](../README.md).

**Why.** The default is conservative: an Admin Service request
without the header is still recorded as admin instead of being dropped.
The other two values are reserved for the documented callers; the
header value is trimmed and lowercased before matching, so a casing
slip in development does not produce a confusing audit row.

**How to apply.** New REST callers should set the header
explicitly. Adding a fourth caller type requires an `OpSource`
constant alongside the mapping change.

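The mapping in D8 can be sketched as a single switch. The constant names and the helper name come from the text above; the `OpSource` underlying type and the string values assigned to the constants are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// OpSource mirrors the domain type by name; the string values below
// are hypothetical placeholders.
type OpSource string

const (
	OpSourceGatewayPlayer OpSource = "gateway_player"
	OpSourceLobbyInternal OpSource = "lobby_internal"
	OpSourceAdminRest     OpSource = "admin_rest"
)

// resolveOpSource normalises the X-Galaxy-Caller header value and maps
// it to an OpSource. Missing or unrecognised values fall back to admin.
func resolveOpSource(header string) OpSource {
	switch strings.ToLower(strings.TrimSpace(header)) {
	case "gateway":
		return OpSourceGatewayPlayer
	case "lobby":
		return OpSourceLobbyInternal
	default: // "admin", empty, or anything unknown
		return OpSourceAdminRest
	}
}

func main() {
	for _, h := range []string{"gateway", " Lobby ", "admin", "", "billing"} {
		fmt.Printf("%q -> %s\n", h, resolveOpSource(h))
	}
}
```

Folding `admin` into the default branch keeps the fallback and the explicit value on one code path, so a fourth caller type only needs a new `case` and a new constant.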
## What ships

- Eighteen operation handlers under
  [`../internal/api/internalhttp/handlers`](../internal/api/internalhttp/handlers).
- The previously probe-only `internal/api/internalhttp/server.go` now
  widens `Dependencies` and forwards the per-operation services to
  `handlers.Register`.
- Full dependency graph in
  [`../internal/app/wiring.go`](../internal/app/wiring.go): five
  stores, five external adapters, eleven services, two workers.
- `RuntimeRecordStore.List(ctx)` plus its PostgreSQL adapter
  implementation and regression tests
  ([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore)).
- `schedulerticker.Worker.Shutdown` so the worker is an
  `app.Component`.
- Mockgen-generated handler-port mocks under
  [`../internal/api/internalhttp/handlers/mocks`](../internal/api/internalhttp/handlers/mocks).
- A kin-openapi-driven conformance test
  ([`../internal/api/internalhttp/conformance_test.go`](../internal/api/internalhttp/conformance_test.go))
  that validates request and response shapes for every documented
  operation against
  [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml).
- Per-handler unit tests covering happy paths, error-code mapping,
  unknown-field rejection, and header validation.

## What remains for later stages

- Lobby refactor (stage 20) flips Lobby's start flow to call
  `GET /api/v1/internal/engine-versions/{version}/image-ref`
  synchronously and adds the `InvalidateMemberships` outbound call
  on every roster mutation.
- Service-local integration suite (stage 21) drives the listener
  end-to-end against a real engine container.
- Cross-service integration tests (stages 22–23) cover the Lobby + GM
  and Lobby + GM + RTM happy and failure paths.