galaxy-game/gamemaster/docs/stage17-admin-operations.md
2026-05-03 07:59:03 +02:00

Stage 17 — Admin operations and Lobby-facing liveness

This decision record captures the non-obvious choices made while implementing the five Game Master admin/inspect service-layer operations and the Lobby-facing liveness reply (adminstop, adminforce, adminpatch, adminbanish, livenessreply). Stage 17 is the last service-layer stage before Stage 18 (health-events consumer) and Stage 19 (REST handlers and wiring).

Context

Per ../PLAN.md, Stage 17 ships five services that close the GM service surface:

  1. service/adminstop — orchestrator behind POST /api/v1/internal/runtimes/{game_id}/stop. Calls Runtime Manager and CASes runtime_records.status → stopped.
  2. service/adminforce — orchestrator behind POST /api/v1/internal/runtimes/{game_id}/force-next-turn. Runs the inner service/turngeneration flow synchronously, then sets runtime_records.skip_next_tick = true.
  3. service/adminpatch — orchestrator behind POST /api/v1/internal/runtimes/{game_id}/patch. Calls Runtime Manager and rotates runtime_records.current_image_ref plus current_engine_version.
  4. service/adminbanish — orchestrator behind POST /api/v1/internal/games/{game_id}/race/{race_name}/banish. Resolves the race and calls the engine /admin/race/banish.
  5. service/livenessreply — orchestrator behind GET /api/v1/internal/games/{game_id}/liveness. Reflects GM's own view of the runtime without ever calling the engine.

The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle) is Stage 13's service/registerruntime and Stage 15's service/turngeneration. Six decisions deviate from a literal reading of the README, the OpenAPI surface, or the turngeneration precedent. Each is recorded below.

Decisions

D1. RuntimeRecordStore grows a dedicated UpdateImage method

Decision. ports/runtimerecordstore.go adds a new UpdateImage(ctx, UpdateImageInput) error method with its own UpdateImageInput struct and Validate. The Postgres adapter gains a matching SQL UPDATE under a CAS guard on (game_id, status). The existing UpdateStatus is not repurposed for patch updates.

Why. UpdateStatusInput.Validate() (Stage 11) calls runtime.Transition(ExpectedFrom, To) and rejects every pair where ExpectedFrom == To. Patch deliberately keeps the runtime in running, so any attempt to feed UpdateStatus with ExpectedFrom == To == running is rejected before the SQL even runs. Three alternatives were on the table:

  • Drop the runtime.Transition invariant from UpdateStatusInput to allow self-transitions. That would weaken the CAS validator for every existing caller — register-runtime, turngeneration, health-events consumer — and reintroduce the «accidental no-op status update» class of bugs the validator was added to catch.
  • Introduce a synthetic runtime.StatusRunning → runtime.StatusRunning edge in domain/runtime/transitions.go. Same blast radius as above, only with stronger semantic baggage in the transition table.
  • Add a dedicated UpdateImage method that only writes the two image columns plus updated_at. Bounded blast radius (one new method, one new input struct, one new SQL UPDATE), preserves the CAS invariant, and matches how Stage 11 already separated UpdateScheduling from UpdateStatus for the same reason.

The third option is what shipped. Existing fakes (registerruntime, turngeneration, hot-path tests, schedulerticker) carry a no-op UpdateImage stub that returns errors.New(...) so a test that accidentally exercises the new path fails loudly.

D2. adminstop is idempotent on stopped and finished, rejects starting

Decision. service/adminstop reads the runtime row first; if Status ∈ {stopped, finished}, the service returns OutcomeSuccess without calling Runtime Manager and without publishing a runtime_snapshot_update. If Status == starting, the service returns conflict with OutcomeFailure. Every other non-terminal status (running, generation_in_progress, generation_failed, engine_unreachable) takes the regular path: RTM call → CAS → snapshot publication.

Why. The README §Stop says «CAS runtime_records.status: * → stopped», but in practice three edge cases pull the service away from a literal CAS-only implementation:

  • stopped and finished are common operator races: an admin clicks «stop» on a UI list while another admin already pressed it (or the game finished naturally). Returning conflict would force the UI to retry the read and confuse the operator. Idempotent success is the smallest-surprise behaviour and matches how Lobby's other admin-cancel flows handle terminal states.
  • starting is the active engine-init window. RTM has just been asked to start the container; an admin stop here would race the init flow and almost certainly leave the system in a partially cleaned state. The transition table in Stage 10 deliberately excludes starting → stopped for the same reason. Returning conflict lets the admin tooling surface «runtime is mid-init, retry in a moment» instead of pretending the stop succeeded.
  • The «obvious» alternative — letting the CAS validator reject starting → stopped and surface that as the natural conflict — was rejected because it relies on a validator implementation detail leaking through; the explicit pre-CAS check makes the intent obvious in the audit log and the structured logs.

The audit log records every pre-CAS rejection with outcome=failure / error_code=conflict, and every idempotent no-op with outcome=success, so operators can distinguish the cases in post-hoc analysis.
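The pre-CAS decision can be sketched as a small classification helper; the function and constant names are illustrative, but the status strings mirror the prose above:

```go
package main

// stopDecision is the result of the adminstop pre-CAS status check.
type stopDecision int

const (
	stopProceed           stopDecision = iota // RTM call → CAS → snapshot
	stopIdempotentSuccess                     // already terminal: success, no side effects
	stopConflict                              // mid-init: error_code=conflict
)

func classifyStop(status string) stopDecision {
	switch status {
	case "stopped", "finished":
		// Operator race: another admin already stopped it, or the game
		// finished naturally. Idempotent success, no RTM call, no
		// runtime_snapshot_update.
		return stopIdempotentSuccess
	case "starting":
		// Active engine-init window: stopping now would race the init
		// flow, so surface conflict and let the tooling retry.
		return stopConflict
	default:
		// running, generation_in_progress, generation_failed,
		// engine_unreachable: regular path.
		return stopProceed
	}
}
```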

D3. adminforce always sets skip_next_tick=true, even on a finishing turn

Decision. service/adminforce issues UpdateScheduling{SkipNextTick=true, NextGenerationAt=turnResult.Record.NextGenerationAt, CurrentTurn=turnResult.Record.CurrentTurn} after every successful inner turn-generation, regardless of whether Result.Finished is true.

Why. The cleaner branch — «skip the scheduling write when the turn just finished the game» — was considered and rejected:

  • turngeneration already cleared next_generation_at and updated current_turn on the finishing branch (Stage 15 completeFinished). A redundant write that re-affirms those values plus sets skip_next_tick=true does no harm: the row is already in status=finished and no scheduler tick will ever consume the flag.
  • The branchless code is shorter and the test contract is simpler («adminforce always writes the skip flag on success»). One extra conditional saves zero SQL on the production path but doubles the set of cases the test matrix has to assert.
  • The README §Force-next-turn wording «After success, set runtime_records.skip_next_tick = true» is unconditional. Adding a runtime-side branch would silently weaken that contract.

The driver op_kind=force_next_turn audit row records the eventual outcome (success / failure with the same error code that turngeneration surfaced) so audit consumers can tell apart a forced turn that finished the game from a forced turn that prepared the next regular tick.
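The branchless write can be illustrated with a small builder; the struct shapes are assumptions standing in for the real turngeneration result and UpdateScheduling input:

```go
package main

// TurnRecord stands in for the runtime row state after a turn; on the
// finishing branch, NextGenerationAt has already been cleared upstream.
type TurnRecord struct {
	NextGenerationAt string // zero value when cleared by completeFinished
	CurrentTurn      int
}

type TurnResult struct {
	Finished bool
	Record   TurnRecord
}

type UpdateSchedulingInput struct {
	SkipNextTick     bool
	NextGenerationAt string
	CurrentTurn      int
}

// schedulingWrite builds the unconditional post-turn write: the skip flag
// is always set, whether or not the turn finished the game. On a finished
// row the write merely re-affirms values no scheduler tick will consume.
func schedulingWrite(res TurnResult) UpdateSchedulingInput {
	return UpdateSchedulingInput{
		SkipNextTick:     true,
		NextGenerationAt: res.Record.NextGenerationAt,
		CurrentTurn:      res.Record.CurrentTurn,
	}
}
```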

D4. adminbanish does not check runtime status; missing race surfaces as forbidden

Decision. service/adminbanish reads the runtime row only to retrieve the engine_endpoint, then calls playermappingstore.GetByRace. A missing row maps to error_code=forbidden. The runtime status itself is not inspected; banish is dispatched even when the runtime is in stopped, finished, or engine_unreachable.

Why. Two threads informed the choice:

  • README §Banish lists only two preconditions: «runtime exists» and «race_name resolves to an existing player_mappings row». Adding a status guard would silently extend the contract beyond what Lobby is allowed to depend on, and would make the banish flow fail differently from the documented set.
  • A banish on a stopped/finished runtime is a no-op at the engine side (the container is exited or absent). The engine call will fail with engine_unreachable, which is the right error for the caller to see — it means «the runtime was stopped before banish could land». Pre-rejecting with a different code would hide the real state from the operator.

The forbidden mapping for missing race mirrors Stage 16 D6 («empty roster surfaces as forbidden»). The frozen error vocabulary does not contain a race_not_found code, and forbidden is the semantically closest match: «the platform user this race belonged to is no longer authorised to act on the runtime».
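The error mapping can be sketched as follows; the sentinel names and the `internal` fallback code are assumptions for illustration, not the real Stage 17 identifiers:

```go
package main

import "errors"

// Hypothetical sentinels standing in for the real store/client errors.
var (
	errRaceNotFound      = errors.New("race not found")
	errEngineUnreachable = errors.New("engine unreachable")
)

// banishErrorCode maps adminbanish failures onto the frozen vocabulary:
// a missing player_mappings row surfaces as forbidden (per Stage 16 D6),
// and a dead container surfaces as engine_unreachable, telling the
// operator the runtime was stopped before banish could land.
func banishErrorCode(err error) string {
	switch {
	case errors.Is(err, errRaceNotFound):
		return "forbidden"
	case errors.Is(err, errEngineUnreachable):
		return "engine_unreachable"
	default:
		return "internal" // fallback code is an assumption
	}
}
```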

D5. livenessreply returns 200 / status="" on runtime_not_found

Decision. service/livenessreply absorbs runtime.ErrNotFound into a successful Result with Ready=false and Status=runtime.Status(""). The Go-level error return is reserved for non-business failures only (nil context, nil receiver, store-read errors, invalid input). A handler that wraps this service answers 200 with body {"ready": false, "status": ""} when GM has no record for the requested game.

Why. README §Liveness reply specifies the endpoint «never calls the engine; it reflects GM's own view only» and explicitly says it returns 200 even when the runtime is not running. Three response shapes were considered:

  • 200 with status="runtime_not_found". Mixes runtime-status values with error codes in the same field, breaking the caller's enum-match dispatch.
  • 404 runtime_not_found. Contradicts the README §Liveness reply «return 200» wording and forces Lobby's resume flow to add a 404 handler that means «no observation» — semantically the same as Ready=false.
  • 200 with status="". The empty status reads naturally as «GM has no observation»; Lobby's resume flow already needs to handle the Ready=false branch and the empty status is exactly what «no observation» looks like in practice. Chosen for the smallest caller-side complexity.

D6. RTM client errors surface as service_unavailable, not a dedicated code

Decision. Both service/adminstop and service/adminpatch map every error from RTMClient.Stop / RTMClient.Patch to error_code=service_unavailable, regardless of whether the underlying failure is ErrRTMUnavailable, a wrapped HTTP 5xx, or a dialler-level transport error.

Why. The frozen error vocabulary in gamemaster/api/internal-openapi.yaml does not contain a runtime_manager_unavailable code. Three options were on the table:

  • Add a new code. Rejected: the OpenAPI surface is contract-frozen from Stage 06 and adding a new error code is a wire-format change that pulls every consumer into a re-validation. Stage 17 deals with service-layer code only; no contract change is in scope.
  • Map RTM failures to engine_unreachable. Rejected: the RTM call is a sibling-service hop, not an engine call; mixing the two in a single label confuses operators reading metric / log labels.
  • Map RTM failures to service_unavailable. Accepted: the vocabulary already documents service_unavailable as «a steady-state dependency was unreachable for this call», which is exactly what an RTM outage looks like from GM's perspective.

The Stage 12 D5 decision record in stage12-external-clients.md already records that the RTM adapter wraps every non-success outcome in ports.ErrRTMUnavailable without distinguishing sub-cases; Stage 17 simply consumes the unified sentinel.

Cross-stage consequences

  • The new port surface RuntimeRecordStore.UpdateImage is available to every later consumer, though neither Stage 18 nor Stage 19 uses it. Existing hand-rolled fakes carry a loudly failing stub.
  • OpKindStop, OpKindForceNextTurn, OpKindPatch, OpKindBanish were introduced in Stage 09 / Stage 10 already; Stage 17 is their first writer.
  • The telemetry counter gamemaster.banish.outcomes (declared in Stage 08) gets its first call site in service/adminbanish. No new counters are introduced for adminstop / adminforce / adminpatch / livenessreply; the README §Observability list does not mention them and Stage 17 deliberately stays inside the declared instrument set.
  • The Stage 19 REST handlers consume the five services without service-layer changes: each handler decodes the JSON envelope, fills Input.OpSource / Input.SourceRef from the X-Galaxy-Caller header convention, and translates Result.ErrorCode into the standard error envelope.