---
stage: 17
title: Admin operations and Lobby-facing liveness
---
# Stage 17 — Admin operations and Lobby-facing liveness
This decision record captures the non-obvious choices made while
implementing the four Game Master admin service-layer operations
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`) and the
Lobby-facing liveness reply (`livenessreply`). Stage 17 is the last
service-layer stage before Stage 18 (health-events consumer) and
Stage 19 (REST handlers and wiring).
## Context
[Stage 17 of `PLAN.md`](../PLAN.md) ships five services that close
the GM service surface:
1. `service/adminstop` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime
Manager and CASes `runtime_records.status → stopped`.
2. `service/adminforce` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs
the inner `service/turngeneration` flow synchronously, then sets
`runtime_records.skip_next_tick = true`.
3. `service/adminpatch` — orchestrator behind
`POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime
Manager and rotates `runtime_records.current_image_ref` plus
`current_engine_version`.
4. `service/adminbanish` — orchestrator behind
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`.
Resolves the race and calls the engine `/admin/race/banish`.
5. `service/livenessreply` — orchestrator behind
`GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own
view of the runtime without ever calling the engine.

The reference precedent for the orchestrator shape (`Input` /
`Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's
`service/registerruntime` and Stage 15's `service/turngeneration`.
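
For orientation, a minimal sketch of that shape. Only `Input.OpSource`,
`Input.SourceRef`, `Result.ErrorCode` and the two outcome values are
named by this record; every other identifier below is a placeholder:

```go
// Hedged sketch of the orchestrator shape shared by all five services.
package adminstop

import "context"

type Outcome string

const (
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
)

type Input struct {
	GameID    string
	OpSource  string // filled by the Stage 19 handler layer
	SourceRef string
}

type Result struct {
	Outcome   Outcome
	ErrorCode string // one of the frozen error-vocabulary codes, or ""
}

// Dependencies carries the ports the service consumes (stores,
// clients, audit log); concrete fields elided in this sketch.
type Dependencies struct{}

type Service struct{ deps Dependencies }

// NewService validates the dependency set and returns a ready service.
func NewService(deps Dependencies) (*Service, error) {
	return &Service{deps: deps}, nil
}

// Handle runs one operation end to end. Business failures are reported
// through Result.ErrorCode; the Go-level error is reserved for
// non-business failures.
func (s *Service) Handle(ctx context.Context, in Input) (Result, error) {
	// validate input → read state → call dependency → CAS → audit
	return Result{Outcome: OutcomeSuccess}, nil
}
```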
Six decisions deviate from a literal reading of the README, the
OpenAPI surface, or the turngeneration precedent. Each is recorded
below.
## Decisions
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
**Decision.**
[`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its
own `UpdateImageInput` struct and `Validate`. The Postgres adapter
gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`.
The existing `UpdateStatus` is **not** repurposed for patch updates.
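
A sketch of the new port surface (the method, input-struct and
`Validate` names are from this record; the exact field names and the
SQL shape are assumptions consistent with the CAS guard described
above):

```go
// Hedged sketch of the D1 addition to ports/runtimerecordstore.go.
package ports

import "context"

// UpdateImageInput carries the patch write. Field names are assumed.
type UpdateImageInput struct {
	GameID         string
	ExpectedStatus string // CAS guard: the row must still be in this status
	ImageRef       string // new runtime_records.current_image_ref
	EngineVersion  string // new runtime_records.current_engine_version
}

// Validate deliberately does not call runtime.Transition: patch keeps
// the runtime in `running`, so there is no status transition to check.
func (in UpdateImageInput) Validate() error {
	return nil // field-presence checks elided in this sketch
}

type RuntimeRecordStore interface {
	// ...existing methods: UpdateStatus, UpdateScheduling, ...

	// UpdateImage writes only the two image columns plus updated_at,
	// roughly: UPDATE runtime_records SET current_image_ref = $1,
	//   current_engine_version = $2, updated_at = now()
	//   WHERE game_id = $3 AND status = $4
	UpdateImage(ctx context.Context, in UpdateImageInput) error
}
```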
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
`ExpectedFrom == To`. Patch deliberately keeps the runtime in
`running`, so any attempt to feed `UpdateStatus` with
`ExpectedFrom == To == running` is rejected before the SQL even
runs. Three alternatives were on the table:
- Drop the `runtime.Transition` invariant from `UpdateStatusInput`
to allow self-transitions. That would weaken the CAS validator
for every existing caller — register-runtime, turngeneration,
health-events consumer — and reintroduce the "accidental no-op
status update" class of bugs the validator was added to catch.
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning`
edge in `domain/runtime/transitions.go`. Same blast radius as
above, with heavier semantic baggage in the transition table.
- Add a dedicated `UpdateImage` method that only writes the two
image columns plus `updated_at`. Bounded blast radius (one new
method, one new input struct, one new SQL UPDATE), preserves the
CAS invariant, and matches how Stage 11 already separated
`UpdateScheduling` from `UpdateStatus` for the same reason.

The third option is what shipped. Existing fakes (`registerruntime`,
`turngeneration`, hot-path tests, schedulerticker) carry a no-op
`UpdateImage` stub that returns `errors.New(...)` so a test that
accidentally exercises the new path fails loudly.
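
For illustration, the shape of that loud stub (the package, fake type
name and error text are assumptions; the input struct mirrors the
`ports` sketch above):

```go
package fakes

import (
	"context"
	"errors"
)

// UpdateImageInput mirrors the ports struct sketched earlier.
type UpdateImageInput struct{ GameID, ExpectedStatus, ImageRef, EngineVersion string }

// fakeRuntimeRecordStore stands in for the hand-rolled fakes; the
// existing fields and methods are elided here.
type fakeRuntimeRecordStore struct{}

// UpdateImage fails loudly so a test that accidentally exercises the
// new path cannot pass silently.
func (f *fakeRuntimeRecordStore) UpdateImage(ctx context.Context, in UpdateImageInput) error {
	return errors.New("fakeRuntimeRecordStore: unexpected UpdateImage call")
}
```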
### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`
**Decision.**
[`service/adminstop`](../internal/service/adminstop/service.go) reads
the runtime row first; if `Status ∈ {stopped, finished}`, the service
returns `OutcomeSuccess` without calling Runtime Manager and without
publishing a `runtime_snapshot_update`. If `Status == starting`, the
service returns `conflict` with `OutcomeFailure`. Every other
non-terminal status (`running`, `generation_in_progress`,
`generation_failed`, `engine_unreachable`) takes the regular path:
RTM call → CAS → snapshot publication.
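
A sketch of the gate at the top of `Handle` (`rec` is the runtime row
read first; `Result` and the outcome values follow the orchestrator
sketch earlier; the status-constant names are assumptions consistent
with the statuses this record lists):

```go
// Pre-CAS gate in adminstop.Handle (hedged sketch).
switch rec.Status {
case runtime.StatusStopped, runtime.StatusFinished:
	// Operator race: already terminal. Idempotent success, no RTM
	// call, no runtime_snapshot_update publication.
	return Result{Outcome: OutcomeSuccess}, nil
case runtime.StatusStarting:
	// Active engine-init window: refuse rather than race the init flow.
	return Result{Outcome: OutcomeFailure, ErrorCode: "conflict"}, nil
}
// running / generation_in_progress / generation_failed /
// engine_unreachable take the regular path: RTM call, CAS, snapshot.
```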
**Why.** The README §Stop says "CAS `runtime_records.status: * →
stopped`" but in practice three edge cases pull the service away
from a literal CAS-only implementation:
- `stopped` and `finished` are common operator races: an admin clicks
"stop" on a UI list while another admin already pressed it (or the
game finished naturally). Returning `conflict` would force the UI
to retry the read and confuse the operator. Idempotent success is
the smallest-surprise behaviour and matches how Lobby's other
admin-cancel flows handle terminal states.
- `starting` is the active engine-init window. RTM has just been
asked to start the container; an admin stop here would race the
init flow and almost certainly leave the system in a partially
cleaned state. The transition table in Stage 10 deliberately
excludes `starting → stopped` for the same reason. Returning
`conflict` lets the admin tooling surface "runtime is mid-init,
retry in a moment" instead of pretending the stop succeeded.
- The "obvious" fourth path — letting the CAS validator reject
`starting → stopped` and surface that as the natural conflict —
was rejected because it depends on a validator implementation
detail leaking through; the explicit pre-CAS check makes the
intent obvious in the audit log and the structured logs.

The audit log records every pre-CAS rejection with
`outcome=failure / error_code=conflict`, and every idempotent no-op
with `outcome=success`, so operators can distinguish the cases in
post-hoc analysis.
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
**Decision.**
[`service/adminforce`](../internal/service/adminforce/service.go)
issues `UpdateScheduling{SkipNextTick=true,
NextGenerationAt=turnResult.Record.NextGenerationAt,
CurrentTurn=turnResult.Record.CurrentTurn}` after every successful
inner turn-generation, regardless of whether `Result.Finished` is
`true`.
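
A sketch of that branchless write (the call mirrors the one quoted in
the decision; the input-struct name, the `GameID` field and the error
handling are assumptions):

```go
// After every successful inner turn generation, unconditionally:
if err := s.deps.RuntimeRecords.UpdateScheduling(ctx, ports.UpdateSchedulingInput{
	GameID:           in.GameID,
	SkipNextTick:     true, // even when the turn finished the game
	NextGenerationAt: turnResult.Record.NextGenerationAt,
	CurrentTurn:      turnResult.Record.CurrentTurn,
}); err != nil {
	return Result{}, err
}
```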
**Why.** The cleaner branch — "skip the scheduling write when the
turn just finished the game" — was considered and rejected:
- `turngeneration` already cleared `next_generation_at` and updated
`current_turn` on the finishing branch (Stage 15
`completeFinished`). A redundant write that re-affirms those
values plus sets `skip_next_tick=true` does no harm: the row is
already in `status=finished` and no scheduler tick will ever
consume the flag.
- The branchless code is shorter and the test contract is simpler
("adminforce always writes the skip flag on success"). One extra
conditional saves zero SQL on the production path but doubles the
set of cases the test matrix has to assert.
- The README §Force-next-turn wording "After success, set
`runtime_records.skip_next_tick = true`" is unconditional. Adding
a runtime-side branch would silently weaken that contract.

The driver-level `op_kind=force_next_turn` audit row records the eventual
outcome (success / failure with the same error code that
turngeneration surfaced) so audit consumers can tell apart a forced
turn that finished the game from a forced turn that prepared the
next regular tick.
### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`
**Decision.**
[`service/adminbanish`](../internal/service/adminbanish/service.go)
reads the runtime row only to retrieve the `engine_endpoint`, then
calls `playermappingstore.GetByRace`. A missing row maps to
`error_code=forbidden`. The runtime status itself is **not**
inspected; banish is dispatched even when the runtime is in
`stopped`, `finished`, or `engine_unreachable`.
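
A sketch of the flow (`GetByRace` and `/admin/race/banish` are named
by this record; `Get`, `EngineEndpoint`, the not-found sentinel and
the dispatch helper are assumptions):

```go
// adminbanish.Handle, abridged (hedged sketch).
rec, err := s.deps.RuntimeRecords.Get(ctx, in.GameID)
if err != nil {
	return Result{}, err // runtime must exist; its status is NOT inspected
}
mapping, err := s.deps.PlayerMappings.GetByRace(ctx, in.GameID, in.RaceName)
if errors.Is(err, ports.ErrNotFound) {
	// The frozen vocabulary has no race_not_found; see the Why below.
	return Result{Outcome: OutcomeFailure, ErrorCode: "forbidden"}, nil
}
if err != nil {
	return Result{}, err
}
// Dispatch even for stopped/finished/engine_unreachable runtimes; a
// dead container surfaces as engine_unreachable from the call itself.
return s.callEngineBanish(ctx, rec.EngineEndpoint, mapping) // POST /admin/race/banish
```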
**Why.** Two threads informed the choice:
- README §Banish lists only two preconditions: "runtime exists"
and "`race_name` resolves to an existing player_mappings row".
Adding a status guard would silently extend the contract beyond
what Lobby is allowed to depend on, and would make the banish
flow fail in ways outside the documented set.
- A banish on a stopped/finished runtime has no engine to land on
(the container has exited or is absent). The engine call fails
with `engine_unreachable`, which is the right error for the
caller to see — it means "the runtime was stopped before the
banish could land". Pre-rejecting with a different code would
hide the real state from the operator.

The `forbidden` mapping for a missing race mirrors Stage 16 D6
("empty roster surfaces as `forbidden`"). The frozen error
vocabulary does not contain a `race_not_found` code, and
`forbidden` is the semantically closest match: "the platform user
this race belonged to is no longer authorised to act on the
runtime".
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
**Decision.**
[`service/livenessreply`](../internal/service/livenessreply/service.go)
absorbs `runtime.ErrNotFound` into a successful Result with
`Ready=false` and `Status=runtime.Status("")`. The Go-level error
return is reserved for non-business failures only (nil context, nil
receiver, store-read errors, invalid input). A handler that wraps
this service answers 200 with body `{"ready": false, "status": ""}`
when GM has no record for the requested game.
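
A sketch of the absorb branch (`runtime.ErrNotFound`, `Ready` and
`Status` appear in this record; the store method and the found-row
helper are assumptions):

```go
// livenessreply.Handle, abridged (hedged sketch).
rec, err := s.deps.RuntimeRecords.Get(ctx, in.GameID)
if errors.Is(err, runtime.ErrNotFound) {
	// Absorbed: GM has no observation. The wrapping handler answers
	// 200 with {"ready": false, "status": ""}.
	return Result{Ready: false, Status: runtime.Status("")}, nil
}
if err != nil {
	// Store-read failures stay Go-level errors (non-business path).
	return Result{}, err
}
return resultForRecord(rec), nil // found-row mapping elided here
```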
**Why.** README §Liveness reply specifies that the endpoint "never
calls the engine; it reflects GM's own view only" and explicitly
says it returns 200 even when the runtime is not running. Three response
shapes were considered:
- 200 with `status="runtime_not_found"`. Mixes runtime-status
values with error codes in the same field, breaking the
caller's enum-match dispatch.
- 404 `runtime_not_found`. Contradicts the README §Liveness reply
"return `200`" wording and forces Lobby's resume flow to add a
404 handler that means "no observation" — semantically the same
as `Ready=false`.
- 200 with `status=""`. The empty status reads naturally as "GM
has no observation"; Lobby's resume flow already needs to handle
the `Ready=false` branch, and the empty status is exactly what
"no observation" looks like in practice. Chosen for the smallest
caller-side complexity.
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
**Decision.** Both `service/adminstop` and `service/adminpatch` map
every error from `RTMClient.Stop` / `RTMClient.Patch` to
`error_code=service_unavailable`, regardless of whether the
underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a
dialler-level transport error.
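
In code, the collapse is a single mapping (a sketch; the `RTM`
dependency field name is an assumption, `RTMClient.Stop` is named
above):

```go
// Every RTM client failure maps to the one frozen code, whether it is
// ports.ErrRTMUnavailable, a wrapped HTTP 5xx, or a dialler error.
if err := s.deps.RTM.Stop(ctx, in.GameID); err != nil {
	return Result{Outcome: OutcomeFailure, ErrorCode: "service_unavailable"}, nil
}
```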
**Why.** The frozen error vocabulary in
[`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml)
does not contain a `runtime_manager_unavailable` code. Three options
were on the table:
- Add a new code. Rejected: the OpenAPI surface is contract-frozen
from Stage 06 and adding a new error code is a wire-format change
that pulls every consumer into a re-validation. Stage 17 deals
with service-layer code only; no contract change is in scope.
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call
is a sibling-service hop, not an engine call; mixing the two in
a single label confuses operators reading metric / log labels.
- Map RTM failures to `service_unavailable`. Accepted: the
vocabulary already documents `service_unavailable` as "a
steady-state dependency was unreachable for this call", which is
exactly what an RTM outage looks like from GM's perspective.

The Stage 12 D5 decision record in
[`stage12-external-clients.md`](./stage12-external-clients.md)
already records that the RTM adapter wraps every non-success
outcome in `ports.ErrRTMUnavailable` without distinguishing
sub-cases; Stage 17 simply consumes the unified sentinel.
## Cross-stage consequences
- The new port surface `RuntimeRecordStore.UpdateImage` is
available to every later consumer; Stage 18 and Stage 19 do not
use it. Existing hand-rolled fakes carry a no-op stub.
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish`
were introduced in Stage 09 / Stage 10 already; Stage 17 is their
first writer.
- The telemetry counter `gamemaster.banish.outcomes` (declared in
Stage 08) gets its first call site in `service/adminbanish`. No
new counters are introduced for `adminstop` / `adminforce` /
`adminpatch` / `livenessreply`; the README §Observability list
does not mention them and Stage 17 deliberately stays inside the
declared instrument set.
- The Stage 19 REST handlers consume the five services without
service-layer changes: each handler decodes the JSON envelope,
fills `Input.OpSource` / `Input.SourceRef` from the
`X-Galaxy-Caller` header convention, and translates `Result.ErrorCode`
into the standard error envelope.
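
A forward-looking sketch of that handler shape (mux choice, the
envelope helper, and how `X-Galaxy-Caller` splits into `OpSource` /
`SourceRef` are assumptions; only the header name and the `Input` /
`Result` fields come from this record):

```go
// Stage 19 handler wrapping adminstop (hedged sketch, Go 1.22 mux).
func stopHandler(svc *adminstop.Service) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		caller := r.Header.Get("X-Galaxy-Caller")
		in := adminstop.Input{
			GameID:    r.PathValue("game_id"),
			OpSource:  caller, // how the header maps onto the two
			SourceRef: caller, // fields is not spelled out here
		}
		res, err := svc.Handle(r.Context(), in)
		if err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError)
			return
		}
		if res.ErrorCode != "" {
			writeErrorEnvelope(w, res.ErrorCode) // hypothetical helper
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```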