feat: gamemaster
@@ -0,0 +1,230 @@
---
stage: 13
title: Register-runtime service
---

# Stage 13 — Register-runtime service

This decision record captures the non-obvious choices made while
implementing the `register-runtime` service-layer orchestrator at PLAN
Stage 13. The service is the single entry point Game Lobby uses (after
Runtime Manager has reported a successful container start) to install a
freshly-started game in Game Master.

## Context

[`../PLAN.md` Stage 13](../PLAN.md) ships the first service-layer stage
of Game Master. It lays the orchestrator pattern that Stages 14–17 will
reuse (engine version registry CRUD, scheduler, hot path, admin
operations). The lifecycle the service drives is frozen by
[`../README.md` §Lifecycles → Register-runtime](../README.md):

1. validate request shape;
2. reject if `runtime_records.{game_id}` already exists;
3. resolve `image_ref` for `target_engine_version`;
4. persist `runtime_records` with `status=starting`;
5. call engine `POST /api/v1/admin/init`;
6. persist `player_mappings` from the engine response;
7. CAS `status: starting → running` and persist initial scheduling;
8. append `operation_log`;
9. publish `runtime_snapshot_update`;
10. return the persisted record.

The reference precedent is
[`rtmanager/internal/service/startruntime`](../../rtmanager/internal/service/startruntime),
which established the `Input` / `Result` / `Dependencies` / `NewService`
/ `Handle` shape, the `recordFailure` helper, and the
`bestEffortAppend` audit-log convention.

Five decisions deviate from a literal reading of either PLAN Stage 13
or the rtmanager precedent. Each is recorded below.

## Decisions

### 1. `RuntimeRecordStore.Delete` extension

**Decision.** [`ports.RuntimeRecordStore`](../internal/ports/runtimerecordstore.go)
gains an idempotent `Delete(ctx, gameID) error` method. The
PostgreSQL-backed adapter
[`runtimerecordstore.Store.Delete`](../internal/adapters/postgres/runtimerecordstore/store.go)
issues a single `DELETE FROM runtime_records WHERE game_id = $1` and
returns `nil` even when no row matches. The mock at
[`internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go)
is regenerated by `make -C gamemaster mocks`. An integration
test, `TestDeleteIdempotent`, mirrors `TestDeleteByGameIdempotent` in
`playermappingstore`; it lands alongside `TestDeleteRejectsEmptyGameID`.

**Why.** The README's failure paths for `register-runtime` mandate
"roll back `runtime_records`" on every post-Insert failure. The Stage 10
port surface had no Delete primitive, so the orchestrator could not
satisfy the README without one. Three alternatives were considered
and rejected:

- **Reorder the flow** (call engine init first, only then persist
  `runtime_records`): contradicts the README, which lists the Insert
  step before the engine call so that the in-flight `starting` row is
  observable to inspect surfaces and acts as a coordination point for
  concurrent register-runtime requests on the same game id.
- **Introduce a `removed` status enum**: changes the runtime status
  machine for one transient bookkeeping case; complicates indexes,
  filters, and the inspect surface; is not described anywhere in
  README §Game Master status model.
- **Single SQL transaction across both stores**: requires the adapter
  layer to expose a transactional sub-interface, breaking the per-port
  abstraction Stage 10 set up. The cost of one extra method on a
  single port is far smaller.

This is the same pattern Stage 11 used for `UpdateEngineVersionInput.Now`
and `Deprecate(ctx, version, now)`: a small, targeted contract delta
admitted by the pre-launch single-init policy.

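The idempotent delete from Decision 1 can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the `execer` interface, the `Store` layout, the `errEmptyGameID` sentinel, and the in-memory `fakeExec` are all hypothetical, and the empty-id check is only an assumption inferred from the `TestDeleteRejectsEmptyGameID` test name. The point it shows is that `Delete` never inspects `RowsAffected`, so a missing row is not an error.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// execer is the slice of *sql.DB this sketch needs (hypothetical name).
// *sql.DB and *sql.Tx both satisfy it.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// Store is a hypothetical stand-in for the PostgreSQL-backed adapter.
type Store struct{ db execer }

// errEmptyGameID is an assumed sentinel; the real adapter may differ.
var errEmptyGameID = errors.New("runtimerecordstore: empty game id")

// Delete removes the runtime record for gameID. It deliberately ignores
// RowsAffected, so deleting a row that does not exist returns nil.
func (s *Store) Delete(ctx context.Context, gameID string) error {
	if gameID == "" {
		return errEmptyGameID
	}
	const query = `DELETE FROM runtime_records WHERE game_id = $1`
	if _, err := s.db.ExecContext(ctx, query, gameID); err != nil {
		return fmt.Errorf("delete runtime record %q: %w", gameID, err)
	}
	return nil
}

// affected is a tiny sql.Result used by the fake executor below.
type affected int64

func (a affected) LastInsertId() (int64, error) { return 0, errors.New("not supported") }
func (a affected) RowsAffected() (int64, error) { return int64(a), nil }

// fakeExec is an in-memory executor used only to exercise the sketch.
type fakeExec struct{ rows map[string]bool }

func (f *fakeExec) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	id, _ := args[0].(string)
	if f.rows[id] {
		delete(f.rows, id)
		return affected(1), nil
	}
	return affected(0), nil
}

// demoDelete deletes the same id twice and reports both outcomes
// plus the number of rows left in the fake table.
func demoDelete() (first, second error, remaining int) {
	fake := &fakeExec{rows: map[string]bool{"game-1": true}}
	store := &Store{db: fake}
	first = store.Delete(context.Background(), "game-1")
	second = store.Delete(context.Background(), "game-1")
	return first, second, len(fake.rows)
}

func main() {
	first, second, remaining := demoDelete()
	fmt.Println(first, second, remaining)
}
```

Because the second call is a no-op rather than an error, a rollback path can call `Delete` without first checking whether the row is still there.
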
### 2. Engine 4xx → `engine_validation_error`, engine 5xx → `engine_unreachable`

**Decision.** When the engine `/admin/init` call returns 4xx, the
service produces `Result{ErrorCode: engine_validation_error}`. When it
returns 5xx (or fails at the transport layer), the service produces
`Result{ErrorCode: engine_unreachable}`. The classification lives in
[`classifyEngineError`](../internal/service/registerruntime/service.go)
and dispatches on the engine port sentinels
(`ports.ErrEngineValidation`, `ports.ErrEngineUnreachable`,
`ports.ErrEngineProtocolViolation`).

**Why.** [`../PLAN.md` Stage 13](../PLAN.md) lists the two as separate
test cases ("engine 4xx (engine_validation_error), engine 5xx
(engine_unreachable)"), but [`../README.md` §Lifecycles →
Register-runtime](../README.md)'s failure-path table at the time of
Stage 13 lumped them as `engine_unreachable`. PLAN's classification is
more useful operationally:

- 4xx from the engine signals a contract violation (the engine
  rejected the request shape, which is a Game Master bug or a stale
  contract). Treating this as `engine_unreachable` would push
  operators down the "is the engine alive?" branch when the right
  branch is "did the GM build send the right shape?".
- 5xx (and transport failures) signal that the engine is unreachable
  or unhealthy. `engine_unreachable` is the right code.

The README §Lifecycles failure-path table is updated in the same
patch to reflect the split, so the two documents agree.

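The dispatch described in Decision 2 can be sketched with `errors.Is`. The sentinels below are local stand-ins for `ports.ErrEngineValidation`, `ports.ErrEngineUnreachable`, and `ports.ErrEngineProtocolViolation`; the function body is an illustration rather than the code in `service.go`, and mapping unrecognised errors to `engine_unreachable` is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the sentinels exported by the ports package.
var (
	ErrEngineValidation        = errors.New("engine: validation failed")
	ErrEngineUnreachable       = errors.New("engine: unreachable")
	ErrEngineProtocolViolation = errors.New("engine: protocol violation")
)

// classifyEngineError maps an engine-port error to a service error code.
// errors.Is walks the wrap chain, so adapters may add context with %w.
func classifyEngineError(err error) string {
	switch {
	case errors.Is(err, ErrEngineValidation):
		// Engine answered 4xx: the request shape was rejected.
		return "engine_validation_error"
	case errors.Is(err, ErrEngineProtocolViolation):
		// Engine answered, but the response broke the contract.
		return "engine_protocol_violation"
	default:
		// ErrEngineUnreachable (5xx, transport failures). Anything
		// unrecognised also lands here, which is an assumption.
		return "engine_unreachable"
	}
}

// demoClassify classifies one wrapped instance of each sentinel.
func demoClassify() []string {
	return []string{
		classifyEngineError(fmt.Errorf("admin/init: %w", ErrEngineValidation)),
		classifyEngineError(fmt.Errorf("admin/init: %w", ErrEngineUnreachable)),
		classifyEngineError(fmt.Errorf("admin/init: %w", ErrEngineProtocolViolation)),
	}
}

func main() {
	for _, code := range demoClassify() {
		fmt.Println(code)
	}
}
```

Keeping the HTTP status inspection in the adapter and the code mapping in the service means the service never needs to see a status code.
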
### 3. Engine response validated as `engine_protocol_violation`

**Decision.** After a successful engine `/admin/init` HTTP response,
the service performs two extra checks before persisting any
`player_mappings`:

- the number of returned players must equal the input roster size;
- the set of `RaceName` values returned must equal the roster's set
  (no extra races, no missing races).

A failure on either check rolls back the runtime record and returns
`Result{ErrorCode: engine_protocol_violation}`.

**Why.** The README's failure-path table includes
`engine_protocol_violation` for "engine response missing players or
contains races not in roster". The engine adapter ([Stage 12,
`engineclient.decodeStateResponse`](../internal/adapters/engineclient/client.go))
validates the wire shape (presence of required fields, well-formed
numeric values), but it cannot validate against the roster Game Master
sent, because only the service layer knows the roster. Splitting the
two checks keeps the adapter narrow and lets the service-layer error
code carry the semantic meaning.

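The two roster checks from Decision 3 can be sketched as below. The function name `validateRaces`, the local sentinel, and the use of plain string slices are hypothetical simplifications. The sketch assumes roster race names are unique, and it treats a duplicated race in the response as a violation, since otherwise a duplicate could hide a missing race while still passing the count check.

```go
package main

import (
	"errors"
	"fmt"
)

// errProtocolViolation is a local stand-in for the service-level
// engine_protocol_violation outcome.
var errProtocolViolation = errors.New("engine_protocol_violation")

// validateRaces checks the engine's returned race names against the
// roster Game Master sent: same count, no extra, no missing.
func validateRaces(roster, returned []string) error {
	if len(returned) != len(roster) {
		return fmt.Errorf("%w: got %d players, want %d",
			errProtocolViolation, len(returned), len(roster))
	}
	// pending holds roster races not yet matched by the response.
	pending := make(map[string]struct{}, len(roster))
	for _, race := range roster {
		pending[race] = struct{}{}
	}
	for _, race := range returned {
		if _, ok := pending[race]; !ok {
			// Either not in the roster at all, or already matched once.
			return fmt.Errorf("%w: unexpected or duplicate race %q",
				errProtocolViolation, race)
		}
		delete(pending, race)
	}
	// Equal counts plus one-to-one matching imply nothing is missing.
	return nil
}

func main() {
	roster := []string{"Elves", "Orcs"}
	fmt.Println(validateRaces(roster, []string{"Orcs", "Elves"}))
	fmt.Println(validateRaces(roster, []string{"Elves"}))
	fmt.Println(validateRaces(roster, []string{"Elves", "Dwarves"}))
}
```
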
### 4. Initial `runtime_snapshot_update` carries non-empty `player_turn_stats`

**Decision.** The first `runtime_snapshot_update` published by
register-runtime carries one
`PlayerTurnStats{UserID, Planets, Population}` row per active member,
projected from the `engine.Init` response by joining on `RaceName`
against the input roster. The projection is sorted by `UserID` for a
deterministic wire order.

**Why.** The README §Async Stream Contracts cadence note used to read
"empty when the snapshot is published for a status transition with no
new turn payload". For register-runtime there *is* a new payload: the
engine returns the initial player state in its `/admin/init` response,
including `Planets` and `Population`. That state is the turn-0
baseline against which Lobby's per-game stats aggregator measures
later deltas. Without it, the first per-player delta after turn 1
would silently equal "everything" instead of "the change since
turn 0". The README cadence wording is updated in the same patch to
say the register-runtime snapshot carries the engine's turn-0 stats.

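The projection from Decision 4 amounts to a join on `RaceName` followed by a sort on `UserID`. A minimal sketch follows; the `RosterMember` and `EnginePlayer` types, the field types (strings and ints), and the function name `projectTurnStats` are assumptions made for illustration, and only the `PlayerTurnStats{UserID, Planets, Population}` field names come from the record above.

```go
package main

import (
	"fmt"
	"sort"
)

// RosterMember is a hypothetical view of one input roster entry.
type RosterMember struct {
	UserID   string
	RaceName string
}

// EnginePlayer is a hypothetical view of one player in the init response.
type EnginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStats mirrors the field names used in this record;
// the field types are assumed.
type PlayerTurnStats struct {
	UserID     string
	Planets    int
	Population int
}

// projectTurnStats joins engine players to roster members on RaceName
// and returns the rows sorted by UserID for a deterministic order.
func projectTurnStats(roster []RosterMember, players []EnginePlayer) []PlayerTurnStats {
	userByRace := make(map[string]string, len(roster))
	for _, m := range roster {
		userByRace[m.RaceName] = m.UserID
	}
	stats := make([]PlayerTurnStats, 0, len(players))
	for _, p := range players {
		userID, ok := userByRace[p.RaceName]
		if !ok {
			// The roster validation is expected to have rejected this
			// response already; skip defensively in the sketch.
			continue
		}
		stats = append(stats, PlayerTurnStats{
			UserID:     userID,
			Planets:    p.Planets,
			Population: p.Population,
		})
	}
	sort.Slice(stats, func(i, j int) bool { return stats[i].UserID < stats[j].UserID })
	return stats
}

func main() {
	roster := []RosterMember{{UserID: "u2", RaceName: "Orcs"}, {UserID: "u1", RaceName: "Elves"}}
	players := []EnginePlayer{{RaceName: "Orcs", Planets: 3, Population: 100}, {RaceName: "Elves", Planets: 5, Population: 250}}
	for _, row := range projectTurnStats(roster, players) {
		fmt.Printf("%+v\n", row)
	}
}
```

Sorting after the join, rather than relying on the engine's response order, keeps the published snapshot stable across retries and test runs.
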
### 5. Best-effort rollback with two-flag gating

**Decision.** The service exposes a single `rollback(ctx, gameID,
playerMappingsInstalled)` helper that always tries `runtime_records.Delete`
and conditionally tries `playermappings.DeleteByGame`. The two booleans
on `recordFailure` (`runtimeInserted`, `playerMappingsInstalled`)
gate the rollback so:

- a pre-Insert failure (`invalid_request`, `conflict` from `Get`,
  `engine_version_not_found`, `Insert`'s own `ErrConflict`) skips
  rollback entirely;
- a post-Insert / pre-BulkInsert failure deletes only the runtime
  row;
- a post-BulkInsert failure deletes both. Note that `BulkInsert` errors
  themselves never install rows (per Stage 11 D7's per-statement
  atomicity), so when `BulkInsert` returns `ErrConflict` the rollback
  flag for `player_mappings` is `false`.

The rollback uses a fresh `context.Background()` with a 5-second
timeout so a cancelled request context does not strand the
`starting` row.

**Why.** A common pitfall in rollback paths is to call `Delete` on
state owned by another caller. The Insert-conflict branch is the
canonical example: when our `Insert` returns `ErrConflict`, another
request inserted the row first and owns it. Blindly deleting it
would corrupt that other caller's state. The two-flag gating makes
the ownership transfer explicit. The fresh background context
mirrors the same pattern in `rtmanager.startruntime.releaseLease`.

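The gating and the detached cleanup context from Decision 5 can be sketched as follows. The interfaces, the `Service` and `Result` layouts, the `recorder` fake, and the exact `recordFailure` signature are hypothetical. Deleting player mappings before the runtime row is also an assumption of this sketch, since the record does not state the order.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical narrow views of the two stores the rollback touches.
type runtimeRecordDeleter interface {
	Delete(ctx context.Context, gameID string) error
}
type playerMappingDeleter interface {
	DeleteByGame(ctx context.Context, gameID string) error
}

type Service struct {
	runtimeRecords runtimeRecordDeleter
	playerMappings playerMappingDeleter
}

type Result struct{ ErrorCode string }

const rollbackTimeout = 5 * time.Second

// recordFailure rolls back only state this request installed itself.
func (s *Service) recordFailure(ctx context.Context, gameID, code string,
	runtimeInserted, playerMappingsInstalled bool) Result {
	if runtimeInserted {
		s.rollback(ctx, gameID, playerMappingsInstalled)
	}
	return Result{ErrorCode: code}
}

// rollback is best-effort: errors are ignored here (a real service would
// likely log them). The request context is deliberately not used for the
// store calls, so a cancelled request cannot strand the starting row.
func (s *Service) rollback(_ context.Context, gameID string, playerMappingsInstalled bool) {
	cleanupCtx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()
	if playerMappingsInstalled {
		_ = s.playerMappings.DeleteByGame(cleanupCtx, gameID) // order is assumed
	}
	_ = s.runtimeRecords.Delete(cleanupCtx, gameID)
}

// recorder is a fake store that notes each call made with a live context.
type recorder struct{ calls []string }

func (r *recorder) Delete(ctx context.Context, _ string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	r.calls = append(r.calls, "runtime_records.Delete")
	return nil
}

func (r *recorder) DeleteByGame(ctx context.Context, _ string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	r.calls = append(r.calls, "player_mappings.DeleteByGame")
	return nil
}

// demoRollback runs recordFailure under an already-cancelled request
// context and returns the store calls that still went through.
func demoRollback(runtimeInserted, playerMappingsInstalled bool) []string {
	rec := &recorder{}
	svc := &Service{runtimeRecords: rec, playerMappings: rec}
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a caller that has gone away
	svc.recordFailure(ctx, "game-1", "engine_unreachable", runtimeInserted, playerMappingsInstalled)
	return rec.calls
}

func main() {
	fmt.Println(demoRollback(false, false))
	fmt.Println(demoRollback(true, false))
	fmt.Println(demoRollback(true, true))
}
```

Passing both flags as `false` on the Insert-conflict branch is what keeps this request from deleting a row that another request owns.
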
## Files landed

- [`../internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go):
  added `Delete` to the interface and the comment block.
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go):
  implemented `Delete`.
- [`../internal/adapters/postgres/runtimerecordstore/store_test.go`](../internal/adapters/postgres/runtimerecordstore/store_test.go):
  added `TestDeleteIdempotent` and `TestDeleteRejectsEmptyGameID`.
- [`../internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go):
  regenerated.
- [`../internal/service/registerruntime/service.go`](../internal/service/registerruntime/service.go)
  with [`errors.go`](../internal/service/registerruntime/errors.go)
  and [`service_test.go`](../internal/service/registerruntime/service_test.go):
  new orchestrator package and tests.
- [`../README.md`](../README.md): §References pointer to this record
  plus one-line clarifications in §Lifecycles → Register-runtime
  (failure-path table now splits 4xx/5xx per **D2**) and §Async Stream
  Contracts (cadence note now says the register-runtime snapshot
  carries `player_turn_stats` from the engine-init response per **D4**).
- [`../PLAN.md`](../PLAN.md): Stage 13 marked done.

## Verification

```sh
cd gamemaster

# Mocks regenerate cleanly with no diff after the port extension.
make mocks
git diff --exit-code internal/adapters/mocks

# Domain + port tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Adapter test for the new Delete method.
go test ./internal/adapters/postgres/runtimerecordstore/...

# Service-level tests for the new orchestrator.
go test ./internal/service/registerruntime/...

# Stage 06/07/09–12 contract / adapter / freeze tests stay green.
go test ./...
```

The full repo-level `go build ./...` from the workspace root succeeds;
later stages (14+) build on the orchestrator shape Stage 13
establishes.