---
stage: 13
title: Register-runtime service
---

# Stage 13 — Register-runtime service

This decision record captures the non-obvious choices made while implementing the `register-runtime` service-layer orchestrator at PLAN Stage 13. The service is the single entry point Game Lobby uses (after Runtime Manager has reported a successful container start) to install a freshly-started game in Game Master.

## Context

[`../PLAN.md` Stage 13](../PLAN.md) ships the first service-layer stage of Game Master. It lays the orchestrator pattern that Stages 14–17 will reuse (engine version registry CRUD, scheduler, hot path, admin operations).

The lifecycle the service drives is frozen by [`../README.md` §Lifecycles → Register-runtime](../README.md):

1. validate request shape;
2. reject if `runtime_records.{game_id}` already exists;
3. resolve `image_ref` for `target_engine_version`;
4. persist `runtime_records` with `status=starting`;
5. call engine `POST /api/v1/admin/init`;
6. persist `player_mappings` from the engine response;
7. CAS `status: starting → running` and persist initial scheduling;
8. append `operation_log`;
9. publish `runtime_snapshot_update`;
10. return the persisted record.

The reference precedent is [`rtmanager/internal/service/startruntime`](../../rtmanager/internal/service/startruntime), which established the `Input` / `Result` / `Dependencies` / `NewService` / `Handle` shape, the `recordFailure` helper, and the `bestEffortAppend` audit-log convention.

Five decisions deviate from a literal reading of either PLAN Stage 13 or the rtmanager precedent. Each is recorded below.

## Decisions

### 1. `RuntimeRecordStore.Delete` extension

**Decision.** [`ports.RuntimeRecordStore`](../internal/ports/runtimerecordstore.go) gains an idempotent `Delete(ctx, gameID) error` method.
The PostgreSQL-backed adapter [`runtimerecordstore.Store.Delete`](../internal/adapters/postgres/runtimerecordstore/store.go) issues a single `DELETE FROM runtime_records WHERE game_id = $1` and returns `nil` even when no row matches. The mock at [`internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go) is regenerated by `make -C gamemaster mocks`. A lone integration test `TestDeleteIdempotent` mirrors `TestDeleteByGameIdempotent` in `playermappingstore`.

**Why.** The README's failure paths for `register-runtime` mandate "roll back `runtime_records`" on every post-Insert failure. The Stage 10 port surface had no Delete primitive, so the orchestrator could not satisfy the README without one. Three alternatives were considered and rejected:

- **Reorder the flow** (call engine init first, only then persist `runtime_records`): contradicts the README, which lists the Insert step before the engine call so that the in-flight `starting` row is observable to inspect surfaces and acts as a coordination point for concurrent register-runtime requests on the same game id.
- **Introduce a `removed` status enum**: changes the runtime status machine for one transient bookkeeping case; complicates indexes, filters, and the inspect surface; is not described anywhere in README §Game Master status model.
- **Single SQL transaction across both stores**: requires the adapter layer to expose a transactional sub-interface, breaking the per-port abstraction Stage 10 set up. The cost of one extra method on a single port is far smaller.

This is the same pattern Stage 11 used for `UpdateEngineVersionInput.Now` and `Deprecate(ctx, version, now)`: a small, targeted contract delta admitted by the pre-launch single-init policy.

### 2. Engine 4xx → `engine_validation_error`, engine 5xx → `engine_unreachable`

**Decision.** When the engine `/admin/init` call returns 4xx, the service produces `Result{ErrorCode: engine_validation_error}`.
When it returns 5xx (or fails at the transport layer), the service produces `Result{ErrorCode: engine_unreachable}`. The classification lives in [`classifyEngineError`](../internal/service/registerruntime/service.go) and dispatches on the engine port sentinels (`ports.ErrEngineValidation`, `ports.ErrEngineUnreachable`, `ports.ErrEngineProtocolViolation`).

**Why.** [`../PLAN.md` Stage 13](../PLAN.md) lists the two as separate test cases ("engine 4xx (engine_validation_error), engine 5xx (engine_unreachable)"), but [`../README.md` §Lifecycles → Register-runtime](../README.md)'s failure-path table at the time of Stage 13 lumped them as `engine_unreachable`. PLAN's classification is more useful operationally:

- 4xx from the engine signals a contract violation (the engine rejected the request shape, which is a Game Master bug or a stale contract). Treating this as `engine_unreachable` would push operators down the "is the engine alive?" branch when the right branch is "did the GM build send the right shape?".
- 5xx (and transport failures) signal that the engine is unreachable or unhealthy. `engine_unreachable` is the right code.

The README §Lifecycles failure-path table is updated in the same patch to reflect the split, so the two documents agree.

### 3. Engine response validated as `engine_protocol_violation`

**Decision.** After a successful engine `/admin/init` HTTP response, the service performs two extra checks before persisting any player_mappings:

- the number of returned players must equal the input roster size;
- the set of `RaceName` values returned must be a subset of the roster (no extra races, no missing races).

A failure on either check rolls back the runtime record and returns `Result{ErrorCode: engine_protocol_violation}`.

**Why.** The README's failure-path table includes `engine_protocol_violation` for "engine response missing players or contains races not in roster".
The engine adapter ([Stage 12, `engineclient.decodeStateResponse`](../internal/adapters/engineclient/client.go)) validates the wire shape (presence of required fields, well-formed numeric values), but it cannot validate against the roster Game Master sent, because only the service layer knows the roster. Splitting the two checks keeps the adapter narrow and lets the service-layer error code carry the semantic meaning.

### 4. Initial `runtime_snapshot_update` carries non-empty `player_turn_stats`

**Decision.** The first `runtime_snapshot_update` published by register-runtime carries one `PlayerTurnStats{UserID, Planets, Population}` row per active member, projected from the `engine.Init` response by joining on `RaceName` against the input roster. The projection is sorted by `UserID` for a deterministic wire order.

**Why.** The README §Async Stream Contracts cadence note used to read "empty when the snapshot is published for a status transition with no new turn payload". For register-runtime there *is* a new payload: the engine returns the initial player state in its `/admin/init` response, including `Planets` and `Population`. That state is the turn-0 baseline against which Lobby's per-game stats aggregator measures later deltas: without it, the first per-player delta after turn 1 would silently equal "everything" instead of "the change since turn 0". The README cadence wording is updated in the same patch to say the register-runtime snapshot carries the engine's turn-0 stats.

### 5. Best-effort rollback with two-flag gating

**Decision.** The service exposes a single `rollback(ctx, gameID, playerMappingsInstalled)` helper that always tries `runtime_records.Delete` and conditionally tries `playermappings.DeleteByGame`.
The two booleans on `recordFailure` (`runtimeInserted`, `playerMappingsInstalled`) gate the rollback so that:

- a pre-Insert failure (`invalid_request`, `conflict` from `Get`, `engine_version_not_found`, `Insert`'s own `ErrConflict`) skips rollback entirely;
- a post-Insert / pre-BulkInsert failure deletes only the runtime row;
- a post-BulkInsert failure deletes both.

Note that `BulkInsert` errors themselves never install rows (per Stage 11 D7's per-statement atomicity), so when `BulkInsert` returns `ErrConflict` the rollback flag for `player_mappings` is `false`. The rollback uses a fresh `context.Background()` with a 5-second timeout so a cancelled request context does not strand the `starting` row.

**Why.** A common pitfall in rollback paths is to call `Delete` on state owned by another caller. The Insert-conflict branch is the canonical example: when our `Insert` returns `ErrConflict`, another request inserted the row first and owns it. Blindly deleting it would corrupt that other caller's state. The two-flag gating makes the ownership transfer explicit. The fresh background context mirrors the same pattern in `rtmanager.startruntime.releaseLease`.

## Files landed

- [`../internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go) — added `Delete` to the interface and the comment block.
- [`../internal/adapters/postgres/runtimerecordstore/store.go`](../internal/adapters/postgres/runtimerecordstore/store.go) — implemented `Delete`.
- [`../internal/adapters/postgres/runtimerecordstore/store_test.go`](../internal/adapters/postgres/runtimerecordstore/store_test.go) — added `TestDeleteIdempotent` and `TestDeleteRejectsEmptyGameID`.
- [`../internal/adapters/mocks/mock_runtimerecordstore.go`](../internal/adapters/mocks/mock_runtimerecordstore.go) — regenerated.
- [`../internal/service/registerruntime/service.go`](../internal/service/registerruntime/service.go) with [`errors.go`](../internal/service/registerruntime/errors.go) and [`service_test.go`](../internal/service/registerruntime/service_test.go) — new orchestrator package and tests.
- [`../README.md`](../README.md) — §References pointer to this record plus one-line clarifications in §Lifecycles → Register-runtime (failure-path table now splits 4xx/5xx per **D2**) and §Async Stream Contracts (cadence note now says the register-runtime snapshot carries `player_turn_stats` from the engine-init response per **D4**).
- [`../PLAN.md`](../PLAN.md) — Stage 13 marked done.

## Verification

```sh
cd gamemaster

# Mocks regenerate cleanly with no diff after the port extension.
make mocks
git diff --exit-code internal/adapters/mocks

# Domain + port tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Adapter test for the new Delete method.
go test ./internal/adapters/postgres/runtimerecordstore/...

# Service-level tests for the new orchestrator.
go test ./internal/service/registerruntime/...

# Stage 06/07/09–12 contract / adapter / freeze tests stay green.
go test ./...
```

The full repo-level `go build ./...` from the workspace root succeeds; later stages (14+) build on the orchestrator shape Stage 13 establishes.