# Stage 17 — Admin operations and Lobby-facing liveness
This decision record captures the non-obvious choices made while
implementing the five Game Master admin/inspect service-layer
operations and the Lobby-facing liveness reply
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`,
`livenessreply`). Stage 17 is the last service-layer stage before
Stage 18 (health-events consumer) and Stage 19 (REST handlers and
wiring).
## Context
As laid out in ../PLAN.md, Stage 17 ships five services that close the GM service surface:

- `service/adminstop` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime Manager and CASes `runtime_records.status → stopped`.
- `service/adminforce` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs the inner `service/turngeneration` flow synchronously, then sets `runtime_records.skip_next_tick = true`.
- `service/adminpatch` — orchestrator behind `POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime Manager and rotates `runtime_records.current_image_ref` plus `current_engine_version`.
- `service/adminbanish` — orchestrator behind `POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`. Resolves the race and calls the engine `/admin/race/banish`.
- `service/livenessreply` — orchestrator behind `GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own view of the runtime without ever calling the engine.
The reference precedent for the orchestrator shape (Input /
Result / Dependencies / NewService / Handle) is Stage 13's
`service/registerruntime` and Stage 15's `service/turngeneration`.
Six decisions deviate from a literal reading of the README, the
OpenAPI surface, or the turngeneration precedent. Each is recorded
below.
## Decisions
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
**Decision.** `ports/runtimerecordstore.go` adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its own `UpdateImageInput` struct and `Validate`. The Postgres adapter gains a matching SQL `UPDATE` under a CAS guard on `(game_id, status)`. The existing `UpdateStatus` is not repurposed for patch updates.
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls `runtime.Transition(ExpectedFrom, To)` and rejects every pair where `ExpectedFrom == To`. Patch deliberately keeps the runtime in `running`, so any attempt to feed `UpdateStatus` with `ExpectedFrom == To == running` is rejected before the SQL even runs. Three alternatives were on the table:
- Drop the `runtime.Transition` invariant from `UpdateStatusInput` to allow self-transitions. That would weaken the CAS validator for every existing caller — register-runtime, turngeneration, health-events consumer — and reintroduce the «accidental no-op status update» class of bugs the validator was added to catch.
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning` edge in `domain/runtime/transitions.go`. Same blast radius as above, only with stronger semantic baggage in the transition table.
- Add a dedicated `UpdateImage` method that only writes the two image columns plus `updated_at`. Bounded blast radius (one new method, one new input struct, one new SQL `UPDATE`), preserves the CAS invariant, and matches how Stage 11 already separated `UpdateScheduling` from `UpdateStatus` for the same reason.
The third option is what shipped. Existing fakes (registerruntime,
turngeneration, hot-path tests, schedulerticker) carry a no-op
`UpdateImage` stub that returns `errors.New(...)` so a test that
accidentally exercises the new path fails loudly.
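The port surface from D1 can be sketched like this. The field names, the SQL column list, and the CAS condition are assumptions reconstructed from this record, not the real `ports/runtimerecordstore.go`:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// UpdateImageInput carries only what the patch flow writes.
// Field names are illustrative.
type UpdateImageInput struct {
	GameID        string
	ImageRef      string
	EngineVersion string
}

func (in UpdateImageInput) Validate() error {
	switch {
	case in.GameID == "":
		return errors.New("game_id is required")
	case in.ImageRef == "":
		return errors.New("current_image_ref is required")
	case in.EngineVersion == "":
		return errors.New("current_engine_version is required")
	}
	return nil
}

// RuntimeRecordStore fragment: UpdateImage writes only the two image
// columns plus updated_at — no status transition, so the CAS validator
// on UpdateStatus stays intact.
type RuntimeRecordStore interface {
	UpdateImage(ctx context.Context, in UpdateImageInput) error
}

// Illustrative CAS-guarded UPDATE the Postgres adapter might issue;
// the row is only touched while the runtime is still `running`.
const updateImageSQL = `
UPDATE runtime_records
   SET current_image_ref = $2,
       current_engine_version = $3,
       updated_at = now()
 WHERE game_id = $1 AND status = 'running'`

func main() {
	err := UpdateImageInput{GameID: "g1", ImageRef: "img:2", EngineVersion: "1.4.0"}.Validate()
	fmt.Println(err == nil)
}
```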
### D2. `adminstop` is idempotent on stopped and finished, rejects starting
**Decision.** `service/adminstop` reads the runtime row first; if `Status ∈ {stopped, finished}`, the service returns `OutcomeSuccess` without calling Runtime Manager and without publishing a `runtime_snapshot_update`. If `Status == starting`, the service returns `conflict` with `OutcomeFailure`. Every other non-terminal status (`running`, `generation_in_progress`, `generation_failed`, `engine_unreachable`) takes the regular path: RTM call → CAS → snapshot publication.
**Why.** The README §Stop says «CAS `runtime_records.status`: `* → stopped`», but in practice three edge cases pull the service away from a literal CAS-only implementation:

- `stopped` and `finished` are common operator races: an admin clicks «stop» on a UI list while another admin already pressed it (or the game finished naturally). Returning `conflict` would force the UI to retry the read and confuse the operator. Idempotent success is the smallest-surprise behaviour and matches how Lobby's other admin-cancel flows handle terminal states.
- `starting` is the active engine-init window. RTM has just been asked to start the container; an admin stop here would race the init flow and almost certainly leave the system in a partially cleaned state. The transition table in Stage 10 deliberately excludes `starting → stopped` for the same reason. Returning `conflict` lets the admin tooling surface «runtime is mid-init, retry in a moment» instead of pretending the stop succeeded.
- The «obvious» fourth path — letting the CAS validator reject `starting → stopped` and surface that as the natural conflict — was rejected because it depends on a validator implementation detail leaking through; the explicit pre-CAS check makes the intent obvious in the audit log and the structured logs.
The audit log records every pre-CAS rejection with
`outcome=failure / error_code=conflict`, and every idempotent no-op
with `outcome=success`, so operators can distinguish the cases in
post-hoc analysis.
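The pre-CAS gate from D2 reduces to a three-way classification. A minimal sketch, assuming plain status strings (the real service works on the domain status type and the full Input/Result shape):

```go
package main

import "fmt"

// stopDecision is the three-way outcome of the D2 pre-CAS gate.
type stopDecision int

const (
	stopProceed      stopDecision = iota // RTM call → CAS → snapshot publication
	stopIdempotentOK                     // OutcomeSuccess, no RTM call, no snapshot
	stopConflict                         // OutcomeFailure, error_code=conflict
)

func classifyStop(status string) stopDecision {
	switch status {
	case "stopped", "finished":
		// Operator race: the runtime is already terminal.
		return stopIdempotentOK
	case "starting":
		// Active engine-init window: refuse to race it.
		return stopConflict
	default:
		// running, generation_in_progress, generation_failed,
		// engine_unreachable — regular stop path.
		return stopProceed
	}
}

func main() {
	fmt.Println(classifyStop("finished") == stopIdempotentOK)
	fmt.Println(classifyStop("starting") == stopConflict)
	fmt.Println(classifyStop("running") == stopProceed)
}
```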
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
**Decision.** `service/adminforce` issues `UpdateScheduling{SkipNextTick=true, NextGenerationAt=turnResult.Record.NextGenerationAt, CurrentTurn=turnResult.Record.CurrentTurn}` after every successful inner turn-generation, regardless of whether `Result.Finished` is true.
**Why.** The cleaner branch — «skip the scheduling write when the turn just finished the game» — was considered and rejected:

- `turngeneration` already cleared `next_generation_at` and updated `current_turn` on the finishing branch (Stage 15 `completeFinished`). A redundant write that re-affirms those values plus sets `skip_next_tick=true` does no harm: the row is already in `status=finished` and no scheduler tick will ever consume the flag.
- The branchless code is shorter and the test contract is simpler («adminforce always writes the skip flag on success»). One extra conditional saves zero SQL on the production path but doubles the set of cases the test matrix has to assert.
- The README §Force-next-turn wording «After success, set `runtime_records.skip_next_tick = true`» is unconditional. Adding a runtime-side branch would silently weaken that contract.
The driver `op_kind=force_next_turn` audit row records the eventual
outcome (success / failure with the same error code that
`turngeneration` surfaced) so audit consumers can tell apart a forced
turn that finished the game from a forced turn that prepared the
next regular tick.
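The branchless contract from D3 is easiest to see in code. A sketch under assumed types (the real `turngeneration` result and `UpdateSchedulingInput` carry more fields, and `NextGenerationAt` is a timestamp):

```go
package main

import "fmt"

// schedulingUpdate mirrors the UpdateScheduling write; names are
// illustrative, and NextGenerationAt is a string here for brevity.
type schedulingUpdate struct {
	SkipNextTick     bool
	NextGenerationAt string
	CurrentTurn      int
}

// turnResult is a stand-in for the inner turn-generation result.
type turnResult struct {
	Finished         bool
	NextGenerationAt string
	CurrentTurn      int
}

// schedulingAfterForce is deliberately branchless on r.Finished:
// re-affirming the finishing branch's values is a harmless redundant
// write, because a finished row is never consumed by the scheduler.
func schedulingAfterForce(r turnResult) schedulingUpdate {
	return schedulingUpdate{
		SkipNextTick:     true,
		NextGenerationAt: r.NextGenerationAt,
		CurrentTurn:      r.CurrentTurn,
	}
}

func main() {
	u := schedulingAfterForce(turnResult{Finished: true, CurrentTurn: 42})
	fmt.Println(u.SkipNextTick, u.CurrentTurn)
}
```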
### D4. `adminbanish` does not check runtime status; missing race surfaces as forbidden
**Decision.** `service/adminbanish` reads the runtime row only to retrieve the `engine_endpoint`, then calls `playermappingstore.GetByRace`. A missing row maps to `error_code=forbidden`. The runtime status itself is not inspected; banish is dispatched even when the runtime is in `stopped`, `finished`, or `engine_unreachable`.
**Why.** Two threads informed the choice:

- README §Banish lists only two preconditions: «runtime exists» and «`race_name` resolves to an existing `player_mappings` row». Adding a status guard would silently extend the contract beyond what Lobby is allowed to depend on, and would make the banish flow fail differently from the documented set.
- A banish on a stopped/finished runtime is a no-op at the engine side (the container is exited or absent). The engine call will fail with `engine_unreachable`, which is the right error for the caller to see — it means «the runtime was stopped before banish could land». Pre-rejecting with a different code would hide the real state from the operator.
The `forbidden` mapping for a missing race mirrors Stage 16 D6 («empty
roster surfaces as forbidden»). The frozen error vocabulary does
not contain a `race_not_found` code, and `forbidden` is the
semantically closest match: «the platform user this race belonged
to is no longer authorised to act on the runtime».
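The D4 error mapping can be sketched as follows. The sentinel names and the fallback code are assumptions modelled on this record, not the real store or engine-client errors:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the store and engine-client
// errors the real service observes.
var (
	errMappingNotFound   = errors.New("player mapping not found")
	errEngineUnreachable = errors.New("engine unreachable")
)

// banishErrorCode maps banish failures onto the frozen vocabulary.
func banishErrorCode(err error) string {
	switch {
	case errors.Is(err, errMappingNotFound):
		// No race_not_found code exists; forbidden is the closest
		// match (Stage 16 D6 precedent).
		return "forbidden"
	case errors.Is(err, errEngineUnreachable):
		// Stopped/finished runtime: let the engine call fail and
		// surface the real state instead of pre-rejecting.
		return "engine_unreachable"
	default:
		// Assumed fallback for unclassified failures.
		return "internal"
	}
}

func main() {
	fmt.Println(banishErrorCode(errMappingNotFound))
	fmt.Println(banishErrorCode(errEngineUnreachable))
}
```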
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
**Decision.** `service/livenessreply` absorbs `runtime.ErrNotFound` into a successful `Result` with `Ready=false` and `Status=runtime.Status("")`. The Go-level error return is reserved for non-business failures only (nil context, nil receiver, store-read errors, invalid input). A handler that wraps this service answers `200` with body `{"ready": false, "status": ""}` when GM has no record for the requested game.
**Why.** README §Liveness reply specifies that the endpoint «never calls the engine; it reflects GM's own view only» and explicitly says it returns `200` even when the runtime is not running. Three response shapes were considered:

- `200` with `status="runtime_not_found"`. Mixes runtime-status values with error codes in the same field, breaking the caller's enum-match dispatch.
- `404` `runtime_not_found`. Contradicts the README §Liveness reply «return `200`» wording and forces Lobby's resume flow to add a 404 handler that means «no observation» — semantically the same as `Ready=false`.
- `200` with `status=""`. The empty status reads naturally as «GM has no observation»; Lobby's resume flow already needs to handle the `Ready=false` branch, and the empty status is exactly what «no observation» looks like in practice. Chosen for the smallest caller-side complexity.
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
**Decision.** Both `service/adminstop` and `service/adminpatch` map every error from `RTMClient.Stop` / `RTMClient.Patch` to `error_code=service_unavailable`, regardless of whether the underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a dialler-level transport error.
**Why.** The frozen error vocabulary in
`gamemaster/api/internal-openapi.yaml`
does not contain a `runtime_manager_unavailable` code. Three options
were on the table:

- Add a new code. Rejected: the OpenAPI surface is contract-frozen from Stage 06, and adding a new error code is a wire-format change that pulls every consumer into a re-validation. Stage 17 deals with service-layer code only; no contract change is in scope.
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call is a sibling-service hop, not an engine call; mixing the two in a single label confuses operators reading metric / log labels.
- Map RTM failures to `service_unavailable`. Accepted: the vocabulary already documents `service_unavailable` as «a steady-state dependency was unreachable for this call», which is exactly what an RTM outage looks like from GM's perspective.
The Stage 12 D5 decision record in
`stage12-external-clients.md`
already records that the RTM adapter wraps every non-success
outcome in `ports.ErrRTMUnavailable` without distinguishing
sub-cases; Stage 17 simply consumes the unified sentinel.
## Cross-stage consequences
- The new port surface `RuntimeRecordStore.UpdateImage` is available to every later consumer; Stage 18 and Stage 19 do not use it. Existing hand-rolled fakes carry a no-op stub.
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish` were introduced in Stage 09 / Stage 10 already; Stage 17 is their first writer.
- The telemetry counter `gamemaster.banish.outcomes` (declared in Stage 08) gets its first call site in `service/adminbanish`. No new counters are introduced for `adminstop` / `adminforce` / `adminpatch` / `livenessreply`; the README §Observability list does not mention them, and Stage 17 deliberately stays inside the declared instrument set.
- The Stage 19 REST handlers consume the five services without service-layer changes: each handler decodes the JSON envelope, fills `Input.OpSource` / `Input.SourceRef` from the `X-Galaxy-Caller` header convention, and translates `Result.ErrorCode` into the standard error envelope.