| stage | title |
|---|---|
| 13 | Register-runtime service |
Stage 13 — Register-runtime service
This decision record captures the non-obvious choices made while
implementing the register-runtime service-layer orchestrator at PLAN
Stage 13. The service is the single entry point Game Lobby uses (after
Runtime Manager has reported a successful container start) to install a
freshly-started game in Game Master.
Context
../PLAN.md Stage 13 is the first service-layer stage
of Game Master. It lays down the orchestrator pattern that Stages 14–17 will
reuse (engine version registry CRUD, scheduler, hot path, admin
operations). The lifecycle the service drives is frozen by
../README.md §Lifecycles → Register-runtime:
- validate request shape;
- reject if runtime_records.{game_id} already exists;
- resolve image_ref for target_engine_version;
- persist runtime_records with status=starting;
- call engine POST /api/v1/admin/init;
- persist player_mappings from the engine response;
- CAS status: starting → running and persist initial scheduling;
- append operation_log;
- publish runtime_snapshot_update;
- return the persisted record.
The reference precedent is
rtmanager/internal/service/startruntime,
which established the Input / Result / Dependencies / NewService
/ Handle shape, the recordFailure helper, and the
bestEffortAppend audit-log convention.
Five decisions deviate from a literal reading of either PLAN Stage 13 or the rtmanager precedent. Each is recorded below.
Decisions
1. RuntimeRecordStore.Delete extension
Decision. ports.RuntimeRecordStore
gains an idempotent Delete(ctx, gameID) error method. The
PostgreSQL-backed adapter
runtimerecordstore.Store.Delete
issues a single DELETE FROM runtime_records WHERE game_id = $1 and
returns nil even when no row matches. The mock at
internal/adapters/mocks/mock_runtimerecordstore.go
is regenerated by make -C gamemaster mocks. A lone integration
test TestDeleteIdempotent mirrors TestDeleteByGameIdempotent in
playermappingstore.
Why. The README's failure paths for register-runtime mandate
"roll back runtime_records" on every post-Insert failure. The Stage 10
port surface had no Delete primitive, so the orchestrator could not
satisfy the README without one. Three alternatives were considered
and rejected:
- Reorder the flow (call engine init first, only then persist runtime_records): contradicts the README, which lists the Insert step before the engine call so that the in-flight starting row is observable to inspect surfaces and acts as a coordination point for concurrent register-runtime requests on the same game id.
- Introduce a removed status enum: changes the runtime status machine for one transient bookkeeping case; complicates indexes, filters, and the inspect surface; is not described anywhere in README §Game Master status model.
- Single SQL transaction across both stores: requires the adapter layer to expose a transactional sub-interface, breaking the per-port abstraction Stage 10 set up. The cost of one extra method on a single port is far smaller.
This is the same pattern Stage 11 used for UpdateEngineVersionInput.Now
and Deprecate(ctx, version, now): a small, targeted contract delta
admitted by the pre-launch single-init policy.
2. Engine 4xx → engine_validation_error, engine 5xx → engine_unreachable
Decision. When the engine /admin/init call returns 4xx, the
service produces Result{ErrorCode: engine_validation_error}. When it
returns 5xx (or fails at the transport layer), the service produces
Result{ErrorCode: engine_unreachable}. The classification lives in
classifyEngineError
and dispatches on the engine port sentinels
(ports.ErrEngineValidation, ports.ErrEngineUnreachable,
ports.ErrEngineProtocolViolation).
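A sentinel-based dispatch of this kind might look like the sketch below. The sentinel variables are local stand-ins for the ports package, and two mappings are assumptions rather than facts from this record: ErrEngineProtocolViolation mapping to engine_protocol_violation, and unrecognised errors falling back to engine_unreachable.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the engine port sentinels named in this record.
var (
	ErrEngineValidation        = errors.New("engine validation")
	ErrEngineUnreachable       = errors.New("engine unreachable")
	ErrEngineProtocolViolation = errors.New("engine protocol violation")
)

// classifyEngineError maps an engine port error to a result error code.
// errors.Is is used so that wrapped sentinels are still recognised.
func classifyEngineError(err error) string {
	switch {
	case errors.Is(err, ErrEngineValidation):
		return "engine_validation_error" // engine answered 4xx
	case errors.Is(err, ErrEngineProtocolViolation):
		return "engine_protocol_violation" // assumed mapping
	default:
		// ErrEngineUnreachable (5xx, transport failure); treating unknown
		// errors the same way is an assumption of this sketch.
		return "engine_unreachable"
	}
}

func main() {
	wrapped := fmt.Errorf("POST /api/v1/admin/init: %w", ErrEngineValidation)
	fmt.Println(classifyEngineError(wrapped))              // engine_validation_error
	fmt.Println(classifyEngineError(ErrEngineUnreachable)) // engine_unreachable
}
```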
Why. ../PLAN.md Stage 13 lists the two as separate
test cases ("engine 4xx (engine_validation_error), engine 5xx
(engine_unreachable)"), but ../README.md §Lifecycles →
Register-runtime's failure-path table at the time of
Stage 13 lumped them as engine_unreachable. PLAN's classification is
more useful operationally:
- 4xx from the engine signals a contract violation (the engine rejected the request shape, which is a Game Master bug or a stale contract). Treating this as engine_unreachable would push operators down the "is the engine alive?" branch when the right branch is "did the GM build send the right shape?".
- 5xx (and transport failures) signal that the engine is unreachable or unhealthy. engine_unreachable is the right code.
The README §Lifecycles failure-path table is updated in the same patch to reflect the split, so the two documents agree.
3. Engine response validated as engine_protocol_violation
Decision. After a successful engine /admin/init HTTP response,
the service performs two extra checks before persisting any
player_mappings:
- the number of returned players must equal the input roster size;
- the set of RaceName values returned must be a subset of the roster (no extra races, no missing races).
A failure on either check rolls back the runtime record and returns
Result{ErrorCode: engine_protocol_violation}.
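The two checks could be sketched as below. The function name, the use of plain string slices for race names, and the local sentinel are illustrative assumptions. Consuming each roster race at most once is also an assumption about how "no missing races" is enforced; it relies on roster race names being unique.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-in for the sentinel behind engine_protocol_violation.
var ErrEngineProtocolViolation = errors.New("engine protocol violation")

// validateAgainstRoster compares the race names the engine returned with
// the race names Game Master sent in the roster.
func validateAgainstRoster(roster, returned []string) error {
	// Check 1: player count must equal roster size.
	if len(returned) != len(roster) {
		return fmt.Errorf("%w: got %d players, want %d",
			ErrEngineProtocolViolation, len(returned), len(roster))
	}
	// Check 2: every returned race must be a roster race.
	remaining := make(map[string]struct{}, len(roster))
	for _, race := range roster {
		remaining[race] = struct{}{}
	}
	for _, race := range returned {
		if _, ok := remaining[race]; !ok {
			return fmt.Errorf("%w: race %q is not in the roster or appears twice",
				ErrEngineProtocolViolation, race)
		}
		// Each roster race may be claimed once, so equal counts plus
		// membership imply that no roster race is missing.
		delete(remaining, race)
	}
	return nil
}

func main() {
	roster := []string{"Aqua", "Terra"}
	fmt.Println(validateAgainstRoster(roster, []string{"Terra", "Aqua"}))
	fmt.Println(validateAgainstRoster(roster, []string{"Aqua", "Ignis"}))
}
```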
Why. The README's failure-path table includes
engine_protocol_violation for "engine response missing players or
contains races not in roster". The engine adapter (Stage 12,
engineclient.decodeStateResponse)
validates the wire shape (presence of required fields, well-formed
numeric values), but it cannot validate against the roster Game Master
sent — only the service layer knows the roster. Splitting the two
checks keeps the adapter narrow and lets the service-layer error code
carry the semantic meaning.
4. Initial runtime_snapshot_update carries non-empty player_turn_stats
Decision. The first runtime_snapshot_update published by
register-runtime carries one
PlayerTurnStats{UserID, Planets, Population} row per active member,
projected from the engine.Init response by joining on RaceName
against the input roster. The projection is sorted by UserID for a
deterministic wire order.
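A minimal sketch of that projection, under stated assumptions: RosterMember and EnginePlayer are hypothetical type names, and the field types (string identifiers, int counters) are guesses. Only the PlayerTurnStats field names, the join on RaceName, and the sort by UserID come from this record.

```go
package main

import (
	"fmt"
	"sort"
)

// RosterMember is a hypothetical shape for one input roster entry.
type RosterMember struct {
	UserID   string
	RaceName string
}

// EnginePlayer is a hypothetical shape for one player in the init response.
type EnginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStats carries the fields named in this record; types are assumed.
type PlayerTurnStats struct {
	UserID     string
	Planets    int
	Population int
}

// projectTurnStats joins engine players to roster members on RaceName and
// returns one row per matched player, sorted by UserID.
func projectTurnStats(roster []RosterMember, players []EnginePlayer) []PlayerTurnStats {
	userByRace := make(map[string]string, len(roster))
	for _, m := range roster {
		userByRace[m.RaceName] = m.UserID
	}
	stats := make([]PlayerTurnStats, 0, len(players))
	for _, p := range players {
		userID, ok := userByRace[p.RaceName]
		if !ok {
			continue // unreachable if the roster check has already passed
		}
		stats = append(stats, PlayerTurnStats{
			UserID: userID, Planets: p.Planets, Population: p.Population,
		})
	}
	// Sorting makes the wire order independent of the engine's ordering.
	sort.Slice(stats, func(i, j int) bool { return stats[i].UserID < stats[j].UserID })
	return stats
}

func main() {
	roster := []RosterMember{{UserID: "u2", RaceName: "Aqua"}, {UserID: "u1", RaceName: "Terra"}}
	players := []EnginePlayer{
		{RaceName: "Aqua", Planets: 1, Population: 40},
		{RaceName: "Terra", Planets: 3, Population: 120},
	}
	fmt.Printf("%+v\n", projectTurnStats(roster, players))
}
```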
Why. The README §Async Stream Contracts cadence note used to read
"empty when the snapshot is published for a status transition with no
new turn payload". For register-runtime there is a new payload — the
engine returns the initial player state in its /admin/init response,
including Planets and Population. That state is the turn-0
baseline against which Lobby's per-game stats aggregator measures
later deltas: without it, the first per-player delta after turn 1
would silently equal "everything" instead of "the change since
turn 0". The README cadence wording is updated in the same patch to
say the register-runtime snapshot carries the engine's turn-0 stats.
5. Best-effort rollback with two-flag gating
Decision. The service exposes a single rollback(ctx, gameID, playerMappingsInstalled) helper that always tries runtime_records.Delete
and conditionally tries playermappings.DeleteByGame. The two booleans
on recordFailure (runtimeInserted, playerMappingsInstalled)
gate the rollback so:
- a pre-Insert failure (invalid_request, conflict from Get, engine_version_not_found, Insert's own ErrConflict) skips rollback entirely;
- a post-Insert / pre-BulkInsert failure deletes only the runtime row;
- a post-BulkInsert failure deletes both. Note that BulkInsert errors themselves never install rows (per Stage 11 D7's per-statement atomicity), so on BulkInsert returning ErrConflict the rollback flag for player_mappings is false.
The rollback uses a fresh context.Background() with a 5-second
timeout so a cancelled request context does not strand the
starting row.
Why. A common pitfall in rollback paths is to call Delete on
state owned by another caller. The Insert-conflict branch is the
canonical example: when our Insert returns ErrConflict, another
request inserted the row first and owns it. Blindly deleting it
would corrupt that other caller's state. The two-flag gating makes
the ownership transfer explicit. The fresh background context
mirrors the same pattern in rtmanager.startruntime.releaseLease.
Files landed
- ../internal/ports/runtimerecordstore.go: added Delete to the interface and the comment block.
- ../internal/adapters/postgres/runtimerecordstore/store.go: implemented Delete.
- ../internal/adapters/postgres/runtimerecordstore/store_test.go: added TestDeleteIdempotent and TestDeleteRejectsEmptyGameID.
- ../internal/adapters/mocks/mock_runtimerecordstore.go: regenerated.
- ../internal/service/registerruntime/service.go with errors.go and service_test.go: new orchestrator package and tests.
- ../README.md: §References pointer to this record plus one-line clarifications in §Lifecycles → Register-runtime (failure-path table now splits 4xx/5xx per D2) and §Async Stream Contracts (cadence note now says the register-runtime snapshot carries player_turn_stats from the engine-init response per D4).
- ../PLAN.md: Stage 13 marked done.
Verification
```sh
cd gamemaster

# Mocks regenerate cleanly with no diff after the port extension.
make mocks
git diff --exit-code internal/adapters/mocks

# Domain + port tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Adapter test for the new Delete method.
go test ./internal/adapters/postgres/runtimerecordstore/...

# Service-level tests for the new orchestrator.
go test ./internal/service/registerruntime/...

# Stage 06/07/09–12 contract / adapter / freeze tests stay green.
go test ./...
```
The full repo-level go build ./... from the workspace root succeeds;
later stages (14+) build on the orchestrator shape Stage 13
establishes.