---
stage: 17
title: Admin operations and Lobby-facing liveness
---
# Stage 17 — Admin operations and Lobby-facing liveness
This decision record captures the non-obvious choices made while
implementing the four Game Master admin service-layer operations
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`) and the
Lobby-facing liveness reply (`livenessreply`). Stage 17 is the last
service-layer stage before Stage 18 (health-events consumer) and
Stage 19 (REST handlers and wiring).
## Context
[Stage 17 of `PLAN.md`](../PLAN.md) ships five services that close
the GM service surface:
1. `service/adminstop` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime
Manager and CASes `runtime_records.status → stopped`.
2. `service/adminforce` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs
the inner `service/turngeneration` flow synchronously, then sets
`runtime_records.skip_next_tick = true`.
3. `service/adminpatch` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime
Manager and rotates `runtime_records.current_image_ref` plus
`current_engine_version`.
4. `service/adminbanish` — orchestrator behind
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`.
Resolves the race and calls the engine `/admin/race/banish`.
5. `service/livenessreply` — orchestrator behind
`GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own
view of the runtime without ever calling the engine.

The reference precedent for the orchestrator shape (`Input` /
`Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's
`service/registerruntime` and Stage 15's `service/turngeneration`.
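
For orientation, a minimal sketch of that shape. Only `Input.OpSource`,
`Input.SourceRef`, `Result.ErrorCode` and the two outcome values are
named by this record; every other identifier below is a placeholder:

```go
// Hedged sketch of the orchestrator shape shared by all five services.
package adminstop

import "context"

type Outcome string

const (
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
)

type Input struct {
	GameID    string
	OpSource  string // filled by the Stage 19 handler layer
	SourceRef string
}

type Result struct {
	Outcome   Outcome
	ErrorCode string // one of the frozen error-vocabulary codes, or ""
}

// Dependencies carries the ports the service consumes (stores,
// clients, audit log); concrete fields elided in this sketch.
type Dependencies struct{}

type Service struct{ deps Dependencies }

// NewService validates the dependency set and returns a ready service.
func NewService(deps Dependencies) (*Service, error) {
	return &Service{deps: deps}, nil
}

// Handle runs one operation end to end. Business failures are reported
// through Result.ErrorCode; the Go-level error is reserved for
// non-business failures.
func (s *Service) Handle(ctx context.Context, in Input) (Result, error) {
	// validate input → read state → call dependency → CAS → audit
	return Result{Outcome: OutcomeSuccess}, nil
}
```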
Six decisions deviate from a literal reading of the README, the
OpenAPI surface, or the turngeneration precedent. Each is recorded
below.
## Decisions
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
**Decision.**
[`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its
own `UpdateImageInput` struct and `Validate`. The Postgres adapter
gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`.
The existing `UpdateStatus` is **not** repurposed for patch updates.
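
A sketch of the new port surface (the method, input-struct and
`Validate` names are from this record; the exact field names and the
SQL shape are assumptions consistent with the CAS guard described
above):

```go
// Hedged sketch of the D1 addition to ports/runtimerecordstore.go.
package ports

import "context"

// UpdateImageInput carries the patch write. Field names are assumed.
type UpdateImageInput struct {
	GameID         string
	ExpectedStatus string // CAS guard: the row must still be in this status
	ImageRef       string // new runtime_records.current_image_ref
	EngineVersion  string // new runtime_records.current_engine_version
}

// Validate deliberately does not call runtime.Transition: patch keeps
// the runtime in `running`, so there is no status transition to check.
func (in UpdateImageInput) Validate() error {
	return nil // field-presence checks elided in this sketch
}

type RuntimeRecordStore interface {
	// ...existing methods: UpdateStatus, UpdateScheduling, ...

	// UpdateImage writes only the two image columns plus updated_at,
	// roughly: UPDATE runtime_records SET current_image_ref = $1,
	//   current_engine_version = $2, updated_at = now()
	//   WHERE game_id = $3 AND status = $4
	UpdateImage(ctx context.Context, in UpdateImageInput) error
}
```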
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
`ExpectedFrom == To`. Patch deliberately keeps the runtime in
`running`, so any attempt to feed `UpdateStatus` with
`ExpectedFrom == To == running` is rejected before the SQL even
runs. Three alternatives were on the table:
- Drop the `runtime.Transition` invariant from `UpdateStatusInput`
to allow self-transitions. That would weaken the CAS validator
for every existing caller — register-runtime, turngeneration,
health-events consumer — and reintroduce the "accidental no-op
status update" class of bugs the validator was added to catch.
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning`
edge in `domain/runtime/transitions.go`. Same blast radius as
above, with heavier semantic baggage in the transition table.
- Add a dedicated `UpdateImage` method that only writes the two
image columns plus `updated_at`. Bounded blast radius (one new
method, one new input struct, one new SQL UPDATE), preserves the
CAS invariant, and matches how Stage 11 already separated
`UpdateScheduling` from `UpdateStatus` for the same reason.

The third option is what shipped. Existing fakes (`registerruntime`,
`turngeneration`, hot-path tests, schedulerticker) carry a no-op
`UpdateImage` stub that returns `errors.New(...)` so a test that
accidentally exercises the new path fails loudly.
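
For illustration, the shape of that loud stub (the package, fake type
name and error text are assumptions; the input struct mirrors the
`ports` sketch above):

```go
package fakes

import (
	"context"
	"errors"
)

// UpdateImageInput mirrors the ports struct sketched earlier.
type UpdateImageInput struct{ GameID, ExpectedStatus, ImageRef, EngineVersion string }

// fakeRuntimeRecordStore stands in for the hand-rolled fakes; the
// existing fields and methods are elided here.
type fakeRuntimeRecordStore struct{}

// UpdateImage fails loudly so a test that accidentally exercises the
// new path cannot pass silently.
func (f *fakeRuntimeRecordStore) UpdateImage(ctx context.Context, in UpdateImageInput) error {
	return errors.New("fakeRuntimeRecordStore: unexpected UpdateImage call")
}
```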
### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`
**Decision.**
[`service/adminstop`](../internal/service/adminstop/service.go) reads
the runtime row first; if `Status ∈ {stopped, finished}`, the service
returns `OutcomeSuccess` without calling Runtime Manager and without
publishing a `runtime_snapshot_update`. If `Status == starting`, the
service returns `conflict` with `OutcomeFailure`. Every other
non-terminal status (`running`, `generation_in_progress`,
`generation_failed`, `engine_unreachable`) takes the regular path:
RTM call → CAS → snapshot publication.
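
A sketch of the gate at the top of `Handle` (`rec` is the runtime row
read first; `Result` and the outcome values follow the orchestrator
sketch earlier; the status-constant names are assumptions consistent
with the statuses this record lists):

```go
// Pre-CAS gate in adminstop.Handle (hedged sketch).
switch rec.Status {
case runtime.StatusStopped, runtime.StatusFinished:
	// Operator race: already terminal. Idempotent success, no RTM
	// call, no runtime_snapshot_update publication.
	return Result{Outcome: OutcomeSuccess}, nil
case runtime.StatusStarting:
	// Active engine-init window: refuse rather than race the init flow.
	return Result{Outcome: OutcomeFailure, ErrorCode: "conflict"}, nil
}
// running / generation_in_progress / generation_failed /
// engine_unreachable take the regular path: RTM call, CAS, snapshot.
```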
**Why.** The README §Stop says "CAS `runtime_records.status: * →
stopped`" but in practice three edge cases pull the service away
from a literal CAS-only implementation:
- `stopped` and `finished` are common operator races: an admin clicks
"stop" on a UI list while another admin already pressed it (or the
game finished naturally). Returning `conflict` would force the UI
to retry the read and confuse the operator. Idempotent success is
the smallest-surprise behaviour and matches how Lobby's other
admin-cancel flows handle terminal states.
- `starting` is the active engine-init window. RTM has just been
asked to start the container; an admin stop here would race the
init flow and almost certainly leave the system in a partially
cleaned state. The transition table in Stage 10 deliberately
excludes `starting → stopped` for the same reason. Returning
`conflict` lets the admin tooling surface "runtime is mid-init,
retry in a moment" instead of pretending the stop succeeded.
- The "obvious" fourth path — letting the CAS validator reject
`starting → stopped` and surface that as the natural conflict —
was rejected because it depends on a validator implementation
detail leaking through; the explicit pre-CAS check makes the
intent obvious in the audit log and the structured logs.

The audit log records every pre-CAS rejection with
`outcome=failure / error_code=conflict`, and every idempotent no-op
with `outcome=success`, so operators can distinguish the cases in
post-hoc analysis.
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
**Decision.**
[`service/adminforce`](../internal/service/adminforce/service.go)
issues `UpdateScheduling{SkipNextTick=true,
NextGenerationAt=turnResult.Record.NextGenerationAt,
CurrentTurn=turnResult.Record.CurrentTurn}` after every successful
inner turn-generation, regardless of whether `Result.Finished` is
`true`.
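
A sketch of that branchless write (the call mirrors the one quoted in
the decision; the input-struct name, the `GameID` field and the error
handling are assumptions):

```go
// After every successful inner turn generation, unconditionally:
if err := s.deps.RuntimeRecords.UpdateScheduling(ctx, ports.UpdateSchedulingInput{
	GameID:           in.GameID,
	SkipNextTick:     true, // even when the turn finished the game
	NextGenerationAt: turnResult.Record.NextGenerationAt,
	CurrentTurn:      turnResult.Record.CurrentTurn,
}); err != nil {
	return Result{}, err
}
```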
**Why.** The cleaner branch — "skip the scheduling write when the
turn just finished the game" — was considered and rejected:
- `turngeneration` already cleared `next_generation_at` and updated
`current_turn` on the finishing branch (Stage 15
`completeFinished`). A redundant write that re-affirms those
values plus sets `skip_next_tick=true` does no harm: the row is
already in `status=finished` and no scheduler tick will ever
consume the flag.
- The branchless code is shorter and the test contract is simpler
("adminforce always writes the skip flag on success"). One extra
conditional saves zero SQL on the production path but doubles the
set of cases the test matrix has to assert.
- The README §Force-next-turn wording "After success, set
`runtime_records.skip_next_tick = true`" is unconditional. Adding
a runtime-side branch would silently weaken that contract.

The driver-level `op_kind=force_next_turn` audit row records the eventual
outcome (success / failure with the same error code that
turngeneration surfaced) so audit consumers can tell apart a forced
turn that finished the game from a forced turn that prepared the
next regular tick.
### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`
**Decision.**
[`service/adminbanish`](../internal/service/adminbanish/service.go)
reads the runtime row only to retrieve the `engine_endpoint`, then
calls `playermappingstore.GetByRace`. A missing row maps to
`error_code=forbidden`. The runtime status itself is **not**
inspected; banish is dispatched even when the runtime is in
`stopped`, `finished`, or `engine_unreachable`.
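
A sketch of the flow (`GetByRace` and `/admin/race/banish` are named
by this record; `Get`, `EngineEndpoint`, the not-found sentinel and
the dispatch helper are assumptions):

```go
// adminbanish.Handle, abridged (hedged sketch).
rec, err := s.deps.RuntimeRecords.Get(ctx, in.GameID)
if err != nil {
	return Result{}, err // runtime must exist; its status is NOT inspected
}
mapping, err := s.deps.PlayerMappings.GetByRace(ctx, in.GameID, in.RaceName)
if errors.Is(err, ports.ErrNotFound) {
	// The frozen vocabulary has no race_not_found; see the Why below.
	return Result{Outcome: OutcomeFailure, ErrorCode: "forbidden"}, nil
}
if err != nil {
	return Result{}, err
}
// Dispatch even for stopped/finished/engine_unreachable runtimes; a
// dead container surfaces as engine_unreachable from the call itself.
return s.callEngineBanish(ctx, rec.EngineEndpoint, mapping) // POST /admin/race/banish
```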
**Why.** Two threads informed the choice:
- README §Banish lists only two preconditions: "runtime exists"
and "`race_name` resolves to an existing player_mappings row".
Adding a status guard would silently extend the contract beyond
what Lobby is allowed to depend on, and would make the banish
flow fail in ways outside the documented set.
- A banish on a stopped/finished runtime has no engine to land on
(the container has exited or is absent). The engine call fails
with `engine_unreachable`, which is the right error for the
caller to see — it means "the runtime was stopped before the
banish could land". Pre-rejecting with a different code would
hide the real state from the operator.

The `forbidden` mapping for a missing race mirrors Stage 16 D6
("empty roster surfaces as `forbidden`"). The frozen error
vocabulary does not contain a `race_not_found` code, and
`forbidden` is the semantically closest match: "the platform user
this race belonged to is no longer authorised to act on the
runtime".
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
**Decision.**
[`service/livenessreply`](../internal/service/livenessreply/service.go)
absorbs `runtime.ErrNotFound` into a successful Result with
`Ready=false` and `Status=runtime.Status("")`. The Go-level error
return is reserved for non-business failures only (nil context, nil
receiver, store-read errors, invalid input). A handler that wraps
this service answers 200 with body `{"ready": false, "status": ""}`
when GM has no record for the requested game.
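
A sketch of the absorb branch (`runtime.ErrNotFound`, `Ready` and
`Status` appear in this record; the store method and the found-row
helper are assumptions):

```go
// livenessreply.Handle, abridged (hedged sketch).
rec, err := s.deps.RuntimeRecords.Get(ctx, in.GameID)
if errors.Is(err, runtime.ErrNotFound) {
	// Absorbed: GM has no observation. The wrapping handler answers
	// 200 with {"ready": false, "status": ""}.
	return Result{Ready: false, Status: runtime.Status("")}, nil
}
if err != nil {
	// Store-read failures stay Go-level errors (non-business path).
	return Result{}, err
}
return resultForRecord(rec), nil // found-row mapping elided here
```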
**Why.** README §Liveness reply specifies that the endpoint "never
calls the engine; it reflects GM's own view only" and explicitly
says it returns 200 even when the runtime is not running. Three response
shapes were considered:
- 200 with `status="runtime_not_found"`. Mixes runtime-status
values with error codes in the same field, breaking the
caller's enum-match dispatch.
- 404 `runtime_not_found`. Contradicts the README §Liveness reply
"return `200`" wording and forces Lobby's resume flow to add a
404 handler that means "no observation" — semantically the same
as `Ready=false`.
- 200 with `status=""`. The empty status reads naturally as "GM
has no observation"; Lobby's resume flow already needs to handle
the `Ready=false` branch, and the empty status is exactly what
"no observation" looks like in practice. Chosen for the smallest
caller-side complexity.
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
**Decision.** Both `service/adminstop` and `service/adminpatch` map
every error from `RTMClient.Stop` / `RTMClient.Patch` to
`error_code=service_unavailable`, regardless of whether the
underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a
dialler-level transport error.
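
In code, the collapse is a single mapping (a sketch; the `RTM`
dependency field name is an assumption, `RTMClient.Stop` is named
above):

```go
// Every RTM client failure maps to the one frozen code, whether it is
// ports.ErrRTMUnavailable, a wrapped HTTP 5xx, or a dialler error.
if err := s.deps.RTM.Stop(ctx, in.GameID); err != nil {
	return Result{Outcome: OutcomeFailure, ErrorCode: "service_unavailable"}, nil
}
```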
**Why.** The frozen error vocabulary in
[`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml)
does not contain a `runtime_manager_unavailable` code. Three options
were on the table:
- Add a new code. Rejected: the OpenAPI surface is contract-frozen
from Stage 06 and adding a new error code is a wire-format change
that pulls every consumer into a re-validation. Stage 17 deals
with service-layer code only; no contract change is in scope.
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call
is a sibling-service hop, not an engine call; mixing the two in
a single label confuses operators reading metric / log labels.
- Map RTM failures to `service_unavailable`. Accepted: the
vocabulary already documents `service_unavailable` as "a
steady-state dependency was unreachable for this call", which is
exactly what an RTM outage looks like from GM's perspective.

The Stage 12 D5 decision record in
[`stage12-external-clients.md`](./stage12-external-clients.md)
already records that the RTM adapter wraps every non-success
outcome in `ports.ErrRTMUnavailable` without distinguishing
sub-cases; Stage 17 simply consumes the unified sentinel.
## Cross-stage consequences
- The new port surface `RuntimeRecordStore.UpdateImage` is
available to every later consumer; Stage 18 and Stage 19 do not
use it. Existing hand-rolled fakes carry a no-op stub.
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish`
were introduced in Stage 09 / Stage 10 already; Stage 17 is their
first writer.
- The telemetry counter `gamemaster.banish.outcomes` (declared in
Stage 08) gets its first call site in `service/adminbanish`. No
new counters are introduced for `adminstop` / `adminforce` /
`adminpatch` / `livenessreply`; the README §Observability list
does not mention them and Stage 17 deliberately stays inside the
declared instrument set.
- The Stage 19 REST handlers consume the five services without
service-layer changes: each handler decodes the JSON envelope,
fills `Input.OpSource` / `Input.SourceRef` from the
`X-Galaxy-Caller` header convention, and translates `Result.ErrorCode`
into the standard error envelope.
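
A forward-looking sketch of that handler shape (mux choice, the
envelope helper, and how `X-Galaxy-Caller` splits into `OpSource` /
`SourceRef` are assumptions; only the header name and the `Input` /
`Result` fields come from this record):

```go
// Stage 19 handler wrapping adminstop (hedged sketch, Go 1.22 mux).
func stopHandler(svc *adminstop.Service) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		caller := r.Header.Get("X-Galaxy-Caller")
		in := adminstop.Input{
			GameID:    r.PathValue("game_id"),
			OpSource:  caller, // how the header maps onto the two
			SourceRef: caller, // fields is not spelled out here
		}
		res, err := svc.Handle(r.Context(), in)
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		if res.ErrorCode != "" {
			writeErrorEnvelope(w, res.ErrorCode) // hypothetical helper
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```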