Stage 13 — Register-runtime service

This decision record captures the non-obvious choices made while implementing the register-runtime service-layer orchestrator at PLAN Stage 13. The service is the single entry point Game Lobby uses (after Runtime Manager has reported a successful container start) to install a freshly-started game in Game Master.

Context

../PLAN.md Stage 13 ships the first service-layer stage of Game Master. It establishes the orchestrator pattern that Stages 14-17 will reuse (engine version registry CRUD, scheduler, hot path, admin operations). The lifecycle the service drives is frozen by ../README.md §Lifecycles → Register-runtime:

  1. validate request shape;
  2. reject if runtime_records.{game_id} already exists;
  3. resolve image_ref for target_engine_version;
  4. persist runtime_records with status=starting;
  5. call engine POST /api/v1/admin/init;
  6. persist player_mappings from the engine response;
  7. CAS status: starting → running and persist initial scheduling;
  8. append operation_log;
  9. publish runtime_snapshot_update;
  10. return the persisted record.

The reference precedent is rtmanager/internal/service/startruntime, which established the Input / Result / Dependencies / NewService / Handle shape, the recordFailure helper, and the bestEffortAppend audit-log convention.

Five decisions deviate from a literal reading of either PLAN Stage 13 or the rtmanager precedent. Each is recorded below.

Decisions

1. RuntimeRecordStore.Delete extension

Decision. ports.RuntimeRecordStore gains an idempotent Delete(ctx, gameID) error method. The PostgreSQL-backed adapter runtimerecordstore.Store.Delete issues a single DELETE FROM runtime_records WHERE game_id = $1 and returns nil even when no row matches. The mock at internal/adapters/mocks/mock_runtimerecordstore.go is regenerated by make -C gamemaster mocks. A lone integration test TestDeleteIdempotent mirrors TestDeleteByGameIdempotent in playermappingstore.

Why. The README's failure paths for register-runtime mandate "roll back runtime_records" on every post-Insert failure. The Stage 10 port surface had no Delete primitive, so the orchestrator could not satisfy the README without one. Three alternatives were considered and rejected:

  • Reorder the flow (call engine init first, only then persist runtime_records): contradicts the README, which lists the Insert step before the engine call so that the in-flight starting row is observable to inspect surfaces and acts as a coordination point for concurrent register-runtime requests on the same game id.
  • Introduce a removed status enum: changes the runtime status machine for one transient bookkeeping case; complicates indexes, filters, and the inspect surface; is not described anywhere in README §Game Master status model.
  • Single SQL transaction across both stores: requires the adapter layer to expose a transactional sub-interface, breaking the per-port abstraction Stage 10 set up. The cost of one extra method on a single port is far smaller.

This is the same pattern Stage 11 used for UpdateEngineVersionInput.Now and Deprecate(ctx, version, now): a small, targeted contract delta admitted by the pre-launch single-init policy.
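The idempotent contract can be illustrated with a minimal sketch. It assumes a hypothetical `execer` seam and an in-memory `fakeDB` so the example runs without PostgreSQL; only the SQL statement and the "nil when no row matches" behaviour come from this record, and the real adapter may be structured differently.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// execer is a hypothetical narrow seam over *sql.DB / *sql.Tx, introduced
// here only so the sketch can run without a live database.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// Store is an illustrative stand-in for runtimerecordstore.Store.
type Store struct{ db execer }

// Delete removes the runtime record for gameID. RowsAffected is deliberately
// not inspected: deleting a missing row is treated as a successful no-op.
func (s *Store) Delete(ctx context.Context, gameID string) error {
	const q = `DELETE FROM runtime_records WHERE game_id = $1`
	if _, err := s.db.ExecContext(ctx, q, gameID); err != nil {
		return fmt.Errorf("delete runtime record %q: %w", gameID, err)
	}
	return nil
}

// affected is a tiny sql.Result implementation for the fake below.
type affected int64

func (a affected) LastInsertId() (int64, error) { return 0, nil }
func (a affected) RowsAffected() (int64, error) { return int64(a), nil }

// fakeDB is an in-memory execer (assumption, test double only).
type fakeDB struct{ rows map[string]bool }

func (f *fakeDB) ExecContext(_ context.Context, _ string, args ...any) (sql.Result, error) {
	id, _ := args[0].(string)
	if f.rows[id] {
		delete(f.rows, id)
		return affected(1), nil
	}
	return affected(0), nil
}

// deleteTwice deletes the same game id twice and returns both errors.
func deleteTwice(gameID string) (first, second error) {
	s := &Store{db: &fakeDB{rows: map[string]bool{gameID: true}}}
	ctx := context.Background()
	return s.Delete(ctx, gameID), s.Delete(ctx, gameID)
}

func main() {
	first, second := deleteTwice("game-1")
	fmt.Println(first, second)
}
```

Because the second call succeeds as well, a rollback path can call Delete without first checking whether the row still exists.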

2. Engine 4xx → engine_validation_error, engine 5xx → engine_unreachable

Decision. When the engine /admin/init call returns 4xx, the service produces Result{ErrorCode: engine_validation_error}. When it returns 5xx (or fails at the transport layer), the service produces Result{ErrorCode: engine_unreachable}. The classification lives in classifyEngineError and dispatches on the engine port sentinels (ports.ErrEngineValidation, ports.ErrEngineUnreachable, ports.ErrEngineProtocolViolation).

Why. ../PLAN.md Stage 13 lists the two as separate test cases ("engine 4xx (engine_validation_error), engine 5xx (engine_unreachable)"), but ../README.md §Lifecycles → Register-runtime's failure-path table at the time of Stage 13 lumped them as engine_unreachable. PLAN's classification is more useful operationally:

  • 4xx from the engine signals a contract violation (the engine rejected the request shape, which is a Game Master bug or a stale contract). Treating this as engine_unreachable would push operators down the "is the engine alive?" branch when the right branch is "did the GM build send the right shape?".
  • 5xx (and transport failures) signal that the engine is unreachable or unhealthy. engine_unreachable is the right code.

The README §Lifecycles failure-path table is updated in the same patch to reflect the split, so the two documents agree.
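A classifier of this kind might look like the sketch below. The sentinel variables are local stand-ins for the ports.Err* values named above, the `wrap` helper is hypothetical, and the fallback to engine_unreachable for unrecognised errors is an assumption of this sketch rather than something the record states.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the engine port sentinels named in this record.
var (
	ErrEngineValidation        = errors.New("engine validation")
	ErrEngineUnreachable       = errors.New("engine unreachable")
	ErrEngineProtocolViolation = errors.New("engine protocol violation")
)

// classifyEngineError maps an engine port error to a Result error code.
// errors.Is is used so that wrapped sentinels are still recognised.
func classifyEngineError(err error) string {
	switch {
	case errors.Is(err, ErrEngineValidation):
		// Engine answered 4xx: the request shape was rejected.
		return "engine_validation_error"
	case errors.Is(err, ErrEngineProtocolViolation):
		// Engine answered, but the response broke the contract.
		return "engine_protocol_violation"
	default:
		// 5xx and transport failures. Treating unknown errors the same
		// way is an assumption made for this sketch.
		return "engine_unreachable"
	}
}

// wrap is a hypothetical helper that mimics an adapter adding context.
func wrap(err error) error {
	return fmt.Errorf("engine /admin/init: %w", err)
}

func main() {
	fmt.Println(classifyEngineError(wrap(ErrEngineValidation)))
	fmt.Println(classifyEngineError(wrap(ErrEngineUnreachable)))
}
```

Dispatching on sentinels rather than on HTTP status codes keeps the service layer unaware of the transport the engine adapter uses.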

3. Engine response validated as engine_protocol_violation

Decision. After a successful engine /admin/init HTTP response, the service performs two extra checks before persisting any player_mappings:

  • the number of returned players must equal the input roster size;
  • the set of RaceName values returned must match the roster exactly (no extra races, no missing races).

A failure on either check rolls back the runtime record and returns Result{ErrorCode: engine_protocol_violation}.

Why. The README's failure-path table includes engine_protocol_violation for "engine response missing players or contains races not in roster". The engine adapter (Stage 12, engineclient.decodeStateResponse) validates the wire shape (presence of required fields, well-formed numeric values), but it cannot validate against the roster Game Master sent — only the service layer knows the roster. Splitting the two checks keeps the adapter narrow and lets the service-layer error code carry the semantic meaning.
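The two service-layer checks could be sketched as follows. The function name `validateRaces`, the plain string slices, and the local sentinel are hypothetical; the sketch also assumes roster race names are unique, which is what makes "same count, no unknown race, no duplicate" equivalent to "no extra and no missing races".

```go
package main

import (
	"errors"
	"fmt"
)

// errProtocolViolation is a local stand-in for the service error code.
var errProtocolViolation = errors.New("engine_protocol_violation")

// validateRaces checks the engine's returned race names against the roster
// that Game Master sent. Roster names are assumed to be unique.
func validateRaces(roster, returned []string) error {
	// Check 1: player count must equal roster size.
	if len(returned) != len(roster) {
		return fmt.Errorf("%w: got %d players, want %d",
			errProtocolViolation, len(returned), len(roster))
	}
	want := make(map[string]bool, len(roster))
	for _, race := range roster {
		want[race] = true
	}
	// Check 2: every returned race is in the roster and appears once.
	// Together with check 1 this rules out missing races as well.
	seen := make(map[string]bool, len(returned))
	for _, race := range returned {
		if !want[race] {
			return fmt.Errorf("%w: race %q not in roster", errProtocolViolation, race)
		}
		if seen[race] {
			return fmt.Errorf("%w: race %q returned twice", errProtocolViolation, race)
		}
		seen[race] = true
	}
	return nil
}

// isViolation reports whether err carries the protocol-violation sentinel.
func isViolation(err error) bool {
	return errors.Is(err, errProtocolViolation)
}

func main() {
	roster := []string{"Terran", "Zerg"}
	fmt.Println(validateRaces(roster, []string{"Zerg", "Terran"}))
	fmt.Println(validateRaces(roster, []string{"Zerg", "Protoss"}))
}
```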

4. Initial runtime_snapshot_update carries non-empty player_turn_stats

Decision. The first runtime_snapshot_update published by register-runtime carries one PlayerTurnStats{UserID, Planets, Population} row per active member, projected from the engine.Init response by joining on RaceName against the input roster. The projection is sorted by UserID for a deterministic wire order.

Why. The README §Async Stream Contracts cadence note used to read "empty when the snapshot is published for a status transition with no new turn payload". For register-runtime there is a new payload — the engine returns the initial player state in its /admin/init response, including Planets and Population. That state is the turn-0 baseline against which Lobby's per-game stats aggregator measures later deltas: without it, the first per-player delta after turn 1 would silently equal "everything" instead of "the change since turn 0". The README cadence wording is updated in the same patch to say the register-runtime snapshot carries the engine's turn-0 stats.
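The projection described in this decision can be sketched as a join on RaceName followed by a sort on UserID. The `RosterMember` and `EnginePlayer` types and all field types are assumptions made for illustration; only the PlayerTurnStats field names come from this record.

```go
package main

import (
	"fmt"
	"sort"
)

// RosterMember and EnginePlayer are hypothetical shapes for the input
// roster and the engine's /admin/init response rows.
type RosterMember struct {
	UserID   string
	RaceName string
}

type EnginePlayer struct {
	RaceName   string
	Planets    int
	Population int
}

// PlayerTurnStats mirrors the field names in this record; the field types
// are assumptions.
type PlayerTurnStats struct {
	UserID     string
	Planets    int
	Population int
}

// projectTurnStats joins engine players to roster members on RaceName and
// returns one row per matched player, sorted by UserID.
func projectTurnStats(roster []RosterMember, players []EnginePlayer) []PlayerTurnStats {
	userByRace := make(map[string]string, len(roster))
	for _, m := range roster {
		userByRace[m.RaceName] = m.UserID
	}
	stats := make([]PlayerTurnStats, 0, len(players))
	for _, p := range players {
		userID, ok := userByRace[p.RaceName]
		if !ok {
			continue // unreachable if the roster check already passed
		}
		stats = append(stats, PlayerTurnStats{
			UserID:     userID,
			Planets:    p.Planets,
			Population: p.Population,
		})
	}
	// Sort by UserID so the wire order does not depend on engine order.
	sort.Slice(stats, func(i, j int) bool { return stats[i].UserID < stats[j].UserID })
	return stats
}

func main() {
	roster := []RosterMember{{"u2", "Zerg"}, {"u1", "Terran"}}
	players := []EnginePlayer{{"Zerg", 3, 100}, {"Terran", 1, 50}}
	fmt.Println(projectTurnStats(roster, players))
}
```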

5. Best-effort rollback with two-flag gating

Decision. The service exposes a single rollback(ctx, gameID, playerMappingsInstalled) helper that always tries runtime_records.Delete and conditionally tries playermappings.DeleteByGame. The two booleans on recordFailure (runtimeInserted, playerMappingsInstalled) gate the rollback so:

  • a pre-Insert failure (invalid_request, conflict from Get, engine_version_not_found, Insert's own ErrConflict) skips rollback entirely;
  • a post-Insert / pre-BulkInsert failure deletes only the runtime row;
  • a post-BulkInsert failure deletes both. BulkInsert errors themselves never install rows (per Stage 11 D7's per-statement atomicity), so when BulkInsert returns ErrConflict the playerMappingsInstalled flag is false and only the runtime row is deleted.

The rollback uses a fresh context.Background() with a 5-second timeout so a cancelled request context does not strand the starting row.

Why. A common pitfall in rollback paths is to call Delete on state owned by another caller. The Insert-conflict branch is the canonical example: when our Insert returns ErrConflict, another request inserted the row first and owns it. Blindly deleting it would corrupt that other caller's state. The two-flag gating makes the ownership transfer explicit. The fresh background context mirrors the same pattern in rtmanager.startruntime.releaseLease.
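The gating and the fresh context might be sketched as below. The interfaces, the `recorder` test double, the `simulateFailure` helper, and the order of the two deletes are assumptions of this sketch; the flag names, the rollback signature, and the 5-second background context follow the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Narrow, hypothetical views of the two stores used by rollback.
type runtimeRecordDeleter interface {
	Delete(ctx context.Context, gameID string) error
}
type playerMappingDeleter interface {
	DeleteByGame(ctx context.Context, gameID string) error
}

type service struct {
	runtimeRecords runtimeRecordDeleter
	playerMappings playerMappingDeleter
}

const rollbackTimeout = 5 * time.Second

// rollback ignores the request context on purpose: a cancelled request must
// not prevent cleanup of the starting row.
func (s *service) rollback(_ context.Context, gameID string, playerMappingsInstalled bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()
	var errs []error
	if err := s.runtimeRecords.Delete(ctx, gameID); err != nil {
		errs = append(errs, err)
	}
	if playerMappingsInstalled {
		if err := s.playerMappings.DeleteByGame(ctx, gameID); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every step succeeded
}

// recordFailure rolls back only state this request owns, then returns code.
func (s *service) recordFailure(ctx context.Context, gameID, code string,
	runtimeInserted, playerMappingsInstalled bool) string {
	if runtimeInserted {
		if err := s.rollback(ctx, gameID, playerMappingsInstalled); err != nil {
			fmt.Println("best-effort rollback failed:", err)
		}
	}
	return code
}

// recorder is a test double that refuses to work on a cancelled context.
type recorder struct{ calls []string }

func (r *recorder) Delete(ctx context.Context, _ string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	r.calls = append(r.calls, "runtime_records.Delete")
	return nil
}

func (r *recorder) DeleteByGame(ctx context.Context, _ string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	r.calls = append(r.calls, "player_mappings.DeleteByGame")
	return nil
}

// simulateFailure runs recordFailure with an already-cancelled request
// context and returns the store calls that were actually made.
func simulateFailure(runtimeInserted, playerMappingsInstalled bool) []string {
	rec := &recorder{}
	svc := &service{runtimeRecords: rec, playerMappings: rec}
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel()
	svc.recordFailure(reqCtx, "game-1", "engine_unreachable", runtimeInserted, playerMappingsInstalled)
	return rec.calls
}

func main() {
	fmt.Println(simulateFailure(false, false))
	fmt.Println(simulateFailure(true, false))
	fmt.Println(simulateFailure(true, true))
}
```

In this sketch an Insert that returns ErrConflict would reach recordFailure with runtimeInserted set to false, so the row owned by the other request is never touched.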

Files landed

Verification

cd gamemaster

# Mocks regenerate cleanly with no diff after the port extension.
make mocks
git diff --exit-code internal/adapters/mocks

# Domain + port tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Adapter test for the new Delete method.
go test ./internal/adapters/postgres/runtimerecordstore/...

# Service-level tests for the new orchestrator.
go test ./internal/service/registerruntime/...

# Stage 06/07/09-12 contract / adapter / freeze tests stay green.
go test ./...

The full repo-level go build ./... from the workspace root succeeds; later stages (14+) build on the orchestrator shape Stage 13 establishes.