---
stage: 17
title: Admin operations and Lobby-facing liveness
---

# Stage 17 — Admin operations and Lobby-facing liveness

This decision record captures the non-obvious choices made while implementing the five Game Master admin/inspect service-layer operations and the Lobby-facing liveness reply (`adminstop`, `adminforce`, `adminpatch`, `adminbanish`, `livenessreply`). Stage 17 is the last service-layer stage before Stage 18 (health-events consumer) and Stage 19 (REST handlers and wiring).

## Context

[`../PLAN.md` Stage 17](../PLAN.md) ships five services that close the GM service surface:

1. `service/adminstop` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime Manager and CASes `runtime_records.status → stopped`.
2. `service/adminforce` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs the inner `service/turngeneration` flow synchronously, then sets `runtime_records.skip_next_tick = true`.
3. `service/adminpatch` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime Manager and rotates `runtime_records.current_image_ref` plus `current_engine_version`.
4. `service/adminbanish` — orchestrator behind `POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`. Resolves the race and calls the engine `/admin/race/banish`.
5. `service/livenessreply` — orchestrator behind `GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own view of the runtime without ever calling the engine.

The reference precedent for the orchestrator shape (`Input` / `Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's `service/registerruntime` and Stage 15's `service/turngeneration`.

Six decisions deviate from a literal reading of the README, the OpenAPI surface, or the turngeneration precedent. Each is recorded below.

## Decisions

### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method

**Decision.** [`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go) adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its own `UpdateImageInput` struct and `Validate`. The Postgres adapter gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`. The existing `UpdateStatus` is **not** repurposed for patch updates.

**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls `runtime.Transition(ExpectedFrom, To)` and rejects every pair where `ExpectedFrom == To`. Patch deliberately keeps the runtime in `running`, so any attempt to feed `UpdateStatus` with `ExpectedFrom == To == running` is rejected before the SQL even runs.

Three alternatives were on the table:

- Drop the `runtime.Transition` invariant from `UpdateStatusInput` to allow self-transitions. That would weaken the CAS validator for every existing caller — register-runtime, turngeneration, health-events consumer — and reintroduce the "accidental no-op status update" class of bugs the validator was added to catch.
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning` edge in `domain/runtime/transitions.go`. Same blast radius as above, only with stronger semantic baggage in the transition table.
- Add a dedicated `UpdateImage` method that only writes the two image columns plus `updated_at`. Bounded blast radius (one new method, one new input struct, one new SQL UPDATE), preserves the CAS invariant, and matches how Stage 11 already separated `UpdateScheduling` from `UpdateStatus` for the same reason.

The third option is what shipped. Existing fakes (`registerruntime`, `turngeneration`, hot-path tests, schedulerticker) carry a no-op `UpdateImage` stub that returns `errors.New(...)` so a test that accidentally exercises the new path fails loudly.

### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`

**Decision.** [`service/adminstop`](../internal/service/adminstop/service.go) reads the runtime row first; if `Status ∈ {stopped, finished}`, the service returns `OutcomeSuccess` without calling Runtime Manager and without publishing a `runtime_snapshot_update`. If `Status == starting`, the service returns `conflict` with `OutcomeFailure`. Every other non-terminal status (`running`, `generation_in_progress`, `generation_failed`, `engine_unreachable`) takes the regular path: RTM call → CAS → snapshot publication.

**Why.** The README §Stop says "CAS `runtime_records.status: * → stopped`", but in practice three edge cases pull the service away from a literal CAS-only implementation:

- `stopped` and `finished` are common operator races: an admin clicks "stop" on a UI list while another admin already pressed it (or the game finished naturally). Returning `conflict` would force the UI to retry the read and confuse the operator. Idempotent success is the smallest-surprise behaviour and matches how Lobby's other admin-cancel flows handle terminal states.
- `starting` is the active engine-init window. RTM has just been asked to start the container; an admin stop here would race the init flow and almost certainly leave the system in a partially cleaned state. The transition table in Stage 10 deliberately excludes `starting → stopped` for the same reason. Returning `conflict` lets the admin tooling surface "runtime is mid-init, retry in a moment" instead of pretending the stop succeeded.
- The "obvious" fourth path — letting the CAS validator reject `starting → stopped` and surface that as the natural conflict — was rejected because it depends on validator implementation detail leaking through; the explicit pre-CAS check makes the intent obvious in the audit log and the structured logs.
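The status dispatch above can be sketched as a small decision function. This is a hypothetical illustration, not the shipped `service/adminstop` code: the status strings and the `Decision` names are illustrative stand-ins for the real `domain/runtime` identifiers.

```go
package main

import "fmt"

// Decision labels the three ways the pre-CAS status check can resolve.
// These names are illustrative, not the real service-layer types.
type Decision string

const (
	DecisionIdempotentSuccess Decision = "idempotent_success" // no RTM call, no snapshot publication
	DecisionConflict          Decision = "conflict"           // runtime is mid-init, admin should retry
	DecisionRegularPath       Decision = "regular_path"       // RTM call, then CAS, then snapshot publication
)

// decideStop encodes the D2 rules: terminal states short-circuit to an
// idempotent success, `starting` is rejected as a conflict, and every
// other status takes the full stop path.
func decideStop(status string) Decision {
	switch status {
	case "stopped", "finished":
		return DecisionIdempotentSuccess
	case "starting":
		return DecisionConflict
	default: // running, generation_in_progress, generation_failed, engine_unreachable
		return DecisionRegularPath
	}
}

func main() {
	for _, s := range []string{"stopped", "starting", "running"} {
		fmt.Printf("%s -> %s\n", s, decideStop(s))
	}
}
```

Keeping the check as an explicit switch (rather than leaning on the CAS validator's rejection) is what makes the intent legible in the audit log.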
The audit log records every pre-CAS rejection with `outcome=failure / error_code=conflict`, and every idempotent no-op with `outcome=success`, so operators can distinguish the cases in post-hoc analysis.

### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn

**Decision.** [`service/adminforce`](../internal/service/adminforce/service.go) issues `UpdateScheduling{SkipNextTick=true, NextGenerationAt=turnResult.Record.NextGenerationAt, CurrentTurn=turnResult.Record.CurrentTurn}` after every successful inner turn-generation, regardless of whether `Result.Finished` is `true`.

**Why.** The cleaner branch — "skip the scheduling write when the turn just finished the game" — was considered and rejected:

- `turngeneration` already cleared `next_generation_at` and updated `current_turn` on the finishing branch (Stage 15 `completeFinished`). A redundant write that re-affirms those values plus sets `skip_next_tick=true` does no harm: the row is already in `status=finished` and no scheduler tick will ever consume the flag.
- The branchless code is shorter and the test contract is simpler ("adminforce always writes the skip flag on success"). One extra conditional saves zero SQL on the production path but doubles the set of cases the test matrix has to assert.
- The README §Force-next-turn wording "After success, set `runtime_records.skip_next_tick = true`" is unconditional. Adding a runtime-side branch would silently weaken that contract.

The driver `op_kind=force_next_turn` audit row records the eventual outcome (success, or failure with the same error code that turngeneration surfaced) so audit consumers can tell apart a forced turn that finished the game from a forced turn that prepared the next regular tick.

### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`

**Decision.** [`service/adminbanish`](../internal/service/adminbanish/service.go) reads the runtime row only to retrieve the `engine_endpoint`, then calls `playermappingstore.GetByRace`. A missing row maps to `error_code=forbidden`. The runtime status itself is **not** inspected; banish is dispatched even when the runtime is in `stopped`, `finished`, or `engine_unreachable`.

**Why.** Two threads informed the choice:

- README §Banish lists only two preconditions: "runtime exists" and "`race_name` resolves to an existing player_mappings row". Adding a status guard would silently extend the contract beyond what Lobby is allowed to depend on, and would make the banish flow fail differently from the documented set.
- A banish on a stopped/finished runtime is a no-op at the engine side (the container is exited or absent). The engine call will fail with `engine_unreachable`, which is the right error for the caller to see — it means "the runtime was stopped before banish could land". Pre-rejecting with a different code would hide the real state from the operator.

The `forbidden` mapping for missing race mirrors Stage 16 D6 ("empty roster surfaces as `forbidden`"). The frozen error vocabulary does not contain a `race_not_found` code, and `forbidden` is the semantically closest match: "the platform user this race belonged to is no longer authorised to act on the runtime".

### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`

**Decision.** [`service/livenessreply`](../internal/service/livenessreply/service.go) absorbs `runtime.ErrNotFound` into a successful Result with `Ready=false` and `Status=runtime.Status("")`. The Go-level error return is reserved for non-business failures only (nil context, nil receiver, store-read errors, invalid input). A handler that wraps this service answers 200 with body `{"ready": false, "status": ""}` when GM has no record for the requested game.
**Why.** README §Liveness reply specifies that the endpoint "never calls the engine; it reflects GM's own view only" and explicitly says it returns 200 even when the runtime is not running. Three response shapes were considered:

- 200 with `status="runtime_not_found"`. Mixes runtime-status values with error codes in the same field, breaking the caller's enum-match dispatch.
- 404 `runtime_not_found`. Contradicts the README §Liveness reply "return `200`" wording and forces Lobby's resume flow to add a 404 handler that means "no observation" — semantically the same as `Ready=false`.
- 200 with `status=""`. The empty status reads naturally as "GM has no observation"; Lobby's resume flow already needs to handle the `Ready=false` branch, and the empty status is exactly what "no observation" looks like in practice. Chosen for the smallest caller-side complexity.

### D6. RTM client errors surface as `service_unavailable`, not a dedicated code

**Decision.** Both `service/adminstop` and `service/adminpatch` map every error from `RTMClient.Stop` / `RTMClient.Patch` to `error_code=service_unavailable`, regardless of whether the underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a dialler-level transport error.

**Why.** The frozen error vocabulary in [`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml) does not contain a `runtime_manager_unavailable` code. Three options were on the table:

- Add a new code. Rejected: the OpenAPI surface is contract-frozen from Stage 06, and adding a new error code is a wire-format change that pulls every consumer into a re-validation. Stage 17 deals with service-layer code only; no contract change is in scope.
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call is a sibling-service hop, not an engine call; mixing the two in a single label confuses operators reading metric / log labels.
- Map RTM failures to `service_unavailable`. Accepted: the vocabulary already documents `service_unavailable` as "a steady-state dependency was unreachable for this call", which is exactly what an RTM outage looks like from GM's perspective.

The Stage 12 D5 decision record in [`stage12-external-clients.md`](./stage12-external-clients.md) already records that the RTM adapter wraps every non-success outcome in `ports.ErrRTMUnavailable` without distinguishing sub-cases; Stage 17 simply consumes the unified sentinel.

## Cross-stage consequences

- The new port surface `RuntimeRecordStore.UpdateImage` is available to every later consumer; Stage 18 and Stage 19 do not use it. Existing hand-rolled fakes carry a no-op stub.
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish` were introduced in Stage 09 / Stage 10 already; Stage 17 is their first writer.
- The telemetry counter `gamemaster.banish.outcomes` (declared in Stage 08) gets its first call site in `service/adminbanish`. No new counters are introduced for `adminstop` / `adminforce` / `adminpatch` / `livenessreply`; the README §Observability list does not mention them, and Stage 17 deliberately stays inside the declared instrument set.
- The Stage 19 REST handlers consume the five services without service-layer changes: each handler decodes the JSON envelope, fills `Input.OpSource` / `Input.SourceRef` from the `X-Galaxy-Caller` header convention, and translates `Result.ErrorCode` into the standard error envelope.
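The handler-side `Result.ErrorCode` translation mentioned in the last bullet could look roughly like this. The mapping below is an assumption for illustration, not the frozen OpenAPI contract; it reuses only the error codes that appear in this record.

```go
package main

import (
	"fmt"
	"net/http"
)

// httpStatusFor sketches how a Stage 19 handler might translate a
// service-layer Result.ErrorCode into an HTTP status for the standard
// error envelope. The exact mapping is hypothetical.
func httpStatusFor(errorCode string) int {
	switch errorCode {
	case "":
		return http.StatusOK // outcome was success; no error envelope
	case "conflict":
		return http.StatusConflict
	case "forbidden":
		return http.StatusForbidden
	case "service_unavailable", "engine_unreachable":
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError // unknown code: fail closed
	}
}

func main() {
	for _, code := range []string{"", "conflict", "forbidden", "service_unavailable"} {
		fmt.Printf("%q -> %d\n", code, httpStatusFor(code))
	}
}
```

Because both RTM outages and engine outages end up as 503s, the distinction operators care about (D6) lives in the error code and log labels, not in the HTTP status.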