feat: gamemaster
This commit is contained in:
@@ -0,0 +1,264 @@
|
||||
---
|
||||
stage: 17
|
||||
title: Admin operations and Lobby-facing liveness
|
||||
---
|
||||
|
||||
# Stage 17 — Admin operations and Lobby-facing liveness
|
||||
|
||||
This decision record captures the non-obvious choices made while
|
||||
implementing the five Game Master admin/inspect service-layer
|
||||
operations and the Lobby-facing liveness reply
|
||||
(`adminstop`, `adminforce`, `adminpatch`, `adminbanish`,
|
||||
`livenessreply`). Stage 17 is the last service-layer stage before
|
||||
Stage 18 (health-events consumer) and Stage 19 (REST handlers and
|
||||
wiring).
|
||||
|
||||
## Context
|
||||
|
||||
[`../PLAN.md` Stage 17](../PLAN.md) ships five services that close
|
||||
the GM service surface:
|
||||
|
||||
1. `service/adminstop` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/stop`. Calls Runtime
|
||||
Manager and CASes `runtime_records.status → stopped`.
|
||||
2. `service/adminforce` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/force-next-turn`. Runs
|
||||
the inner `service/turngeneration` flow synchronously, then sets
|
||||
`runtime_records.skip_next_tick = true`.
|
||||
3. `service/adminpatch` — orchestrator behind
|
||||
`POST /api/v1/internal/runtimes/{game_id}/patch`. Calls Runtime
|
||||
Manager and rotates `runtime_records.current_image_ref` plus
|
||||
`current_engine_version`.
|
||||
4. `service/adminbanish` — orchestrator behind
|
||||
`POST /api/v1/internal/games/{game_id}/race/{race_name}/banish`.
|
||||
Resolves the race and calls the engine `/admin/race/banish`.
|
||||
5. `service/livenessreply` — orchestrator behind
|
||||
`GET /api/v1/internal/games/{game_id}/liveness`. Reflects GM's own
|
||||
view of the runtime without ever calling the engine.
|
||||
|
||||
The reference precedent for the orchestrator shape (`Input` /
|
||||
`Result` / `Dependencies` / `NewService` / `Handle`) is Stage 13's
|
||||
`service/registerruntime` and Stage 15's `service/turngeneration`.
|
||||
Six decisions deviate from a literal reading of the README, the
|
||||
OpenAPI surface, or the turngeneration precedent. Each is recorded
|
||||
below.
|
||||
|
||||
## Decisions
|
||||
|
||||
### D1. `RuntimeRecordStore` grows a dedicated `UpdateImage` method
|
||||
|
||||
**Decision.**
|
||||
[`ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go)
|
||||
adds a new `UpdateImage(ctx, UpdateImageInput) error` method with its
|
||||
own `UpdateImageInput` struct and `Validate`. The Postgres adapter
|
||||
gains a matching SQL UPDATE under a CAS guard on `(game_id, status)`.
|
||||
The existing `UpdateStatus` is **not** repurposed for patch updates.
|
||||
|
||||
**Why.** `UpdateStatusInput.Validate()` (Stage 11) calls
|
||||
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
|
||||
`ExpectedFrom == To`. Patch deliberately keeps the runtime in
|
||||
`running`, so any attempt to feed `UpdateStatus` with
|
||||
`ExpectedFrom == To == running` is rejected before the SQL even
|
||||
runs. Three alternatives were on the table:
|
||||
|
||||
- Drop the `runtime.Transition` invariant from `UpdateStatusInput`
|
||||
to allow self-transitions. That would weaken the CAS validator
|
||||
for every existing caller — register-runtime, turngeneration,
|
||||
health-events consumer — and reintroduce the «accidental no-op
|
||||
status update» class of bugs the validator was added to catch.
|
||||
- Introduce a synthetic `runtime.StatusRunning → runtime.StatusRunning`
|
||||
edge in `domain/runtime/transitions.go`. Same blast radius as
|
||||
above, only with stronger semantic baggage in the transition table.
|
||||
- Add a dedicated `UpdateImage` method that only writes the two
|
||||
image columns plus `updated_at`. Bounded blast radius (one new
|
||||
method, one new input struct, one new SQL UPDATE), preserves the
|
||||
CAS invariant, and matches how Stage 11 already separated
|
||||
`UpdateScheduling` from `UpdateStatus` for the same reason.
|
||||
|
||||
The third option is what shipped. Existing fakes (`registerruntime`,
|
||||
`turngeneration`, hot-path tests, schedulerticker) carry a no-op
|
||||
`UpdateImage` stub that returns `errors.New(...)` so a test that
|
||||
accidentally exercises the new path fails loudly.
|
||||
|
||||
### D2. `adminstop` is idempotent on `stopped` and `finished`, rejects `starting`
|
||||
|
||||
**Decision.**
|
||||
[`service/adminstop`](../internal/service/adminstop/service.go) reads
|
||||
the runtime row first; if `Status ∈ {stopped, finished}`, the service
|
||||
returns `OutcomeSuccess` without calling Runtime Manager and without
|
||||
publishing a `runtime_snapshot_update`. If `Status == starting`, the
|
||||
service returns `conflict` with `OutcomeFailure`. Every other
|
||||
non-terminal status (`running`, `generation_in_progress`,
|
||||
`generation_failed`, `engine_unreachable`) takes the regular path:
|
||||
RTM call → CAS → snapshot publication.
|
||||
|
||||
**Why.** The README §Stop says «CAS `runtime_records.status: * →
|
||||
stopped`» but in practice three edge cases pull the service away
|
||||
from a literal CAS-only implementation:
|
||||
|
||||
- `stopped` and `finished` are common operator races: an admin clicks
|
||||
«stop» on a UI list while another admin already pressed it (or the
|
||||
game finished naturally). Returning `conflict` would force the UI
|
||||
to retry the read and confuse the operator. Idempotent success is
|
||||
the smallest-surprise behaviour and matches how Lobby's other
|
||||
admin-cancel flows handle terminal states.
|
||||
- `starting` is the active engine-init window. RTM has just been
|
||||
asked to start the container; an admin stop here would race the
|
||||
init flow and almost certainly leave the system in a partially
|
||||
cleaned state. The transition table in Stage 10 deliberately
|
||||
excludes `starting → stopped` for the same reason. Returning
|
||||
`conflict` lets the admin tooling surface «runtime is mid-init,
|
||||
retry in a moment» instead of pretending the stop succeeded.
|
||||
- The «obvious» fourth path — letting the CAS validator reject
|
||||
`starting → stopped` and surface that as the natural conflict —
|
||||
was rejected because it depends on validator implementation
|
||||
detail leaking through; the explicit pre-CAS check makes the
|
||||
intent obvious in the audit log and the structured logs.
|
||||
|
||||
The audit log records every pre-CAS rejection with
|
||||
`outcome=failure / error_code=conflict`, and every idempotent no-op
|
||||
with `outcome=success`, so operators can distinguish the cases in
|
||||
post-hoc analysis.
|
||||
|
||||
### D3. `adminforce` always sets `skip_next_tick=true`, even on a finishing turn
|
||||
|
||||
**Decision.**
|
||||
[`service/adminforce`](../internal/service/adminforce/service.go)
|
||||
issues `UpdateScheduling{SkipNextTick=true,
|
||||
NextGenerationAt=turnResult.Record.NextGenerationAt,
|
||||
CurrentTurn=turnResult.Record.CurrentTurn}` after every successful
|
||||
inner turn-generation, regardless of whether `Result.Finished` is
|
||||
`true`.
|
||||
|
||||
**Why.** The cleaner branch — «skip the scheduling write when the
|
||||
turn just finished the game» — was considered and rejected:
|
||||
|
||||
- `turngeneration` already cleared `next_generation_at` and updated
|
||||
`current_turn` on the finishing branch (Stage 15
|
||||
`completeFinished`). A redundant write that re-affirms those
|
||||
values plus sets `skip_next_tick=true` does no harm: the row is
|
||||
already in `status=finished` and no scheduler tick will ever
|
||||
consume the flag.
|
||||
- The branchless code is shorter and the test contract is simpler
|
||||
(«adminforce always writes the skip flag on success»). One extra
|
||||
conditional saves zero SQL on the production path but doubles the
|
||||
set of cases the test matrix has to assert.
|
||||
- The README §Force-next-turn wording «After success, set
|
||||
`runtime_records.skip_next_tick = true`» is unconditional. Adding
|
||||
a runtime-side branch would silently weaken that contract.
|
||||
|
||||
The driver `op_kind=force_next_turn` audit row records the eventual
|
||||
outcome (success / failure with the same error code that
|
||||
turngeneration surfaced) so audit consumers can tell apart a forced
|
||||
turn that finished the game from a forced turn that prepared the
|
||||
next regular tick.
|
||||
|
||||
### D4. `adminbanish` does not check runtime status; missing race surfaces as `forbidden`
|
||||
|
||||
**Decision.**
|
||||
[`service/adminbanish`](../internal/service/adminbanish/service.go)
|
||||
reads the runtime row only to retrieve the `engine_endpoint`, then
|
||||
calls `playermappingstore.GetByRace`. A missing row maps to
|
||||
`error_code=forbidden`. The runtime status itself is **not**
|
||||
inspected; banish is dispatched even when the runtime is in
|
||||
`stopped`, `finished`, or `engine_unreachable`.
|
||||
|
||||
**Why.** Two threads informed the choice:
|
||||
|
||||
- README §Banish lists only two preconditions: «runtime exists»
|
||||
and «`race_name` resolves to an existing player_mappings row».
|
||||
Adding a status guard would silently extend the contract beyond
|
||||
what Lobby is allowed to depend on, and would make the banish
|
||||
flow fail differently from the documented set.
|
||||
- A banish on a stopped/finished runtime is a no-op at the engine
|
||||
side (the container is exited or absent). The engine call will
|
||||
fail with `engine_unreachable`, which is the right error for the
|
||||
caller to see — it means «the runtime was stopped before banish
|
||||
could land». Pre-rejecting with a different code would hide the
|
||||
real state from the operator.
|
||||
|
||||
The `forbidden` mapping for missing race mirrors Stage 16 D6 («empty
|
||||
roster surfaces as `forbidden`»). The frozen error vocabulary does
|
||||
not contain a `race_not_found` code, and `forbidden` is the
|
||||
semantically closest match: «the platform user this race belonged
|
||||
to is no longer authorised to act on the runtime».
|
||||
|
||||
### D5. `livenessreply` returns 200 / `status=""` on `runtime_not_found`
|
||||
|
||||
**Decision.**
|
||||
[`service/livenessreply`](../internal/service/livenessreply/service.go)
|
||||
absorbs `runtime.ErrNotFound` into a successful Result with
|
||||
`Ready=false` and `Status=runtime.Status("")`. The Go-level error
|
||||
return is reserved for non-business failures only (nil context, nil
|
||||
receiver, store-read errors, invalid input). A handler that wraps
|
||||
this service answers 200 with body `{"ready": false, "status": ""}`
|
||||
when GM has no record for the requested game.
|
||||
|
||||
**Why.** README §Liveness reply specifies the endpoint «never calls
|
||||
the engine; it reflects GM's own view only» and explicitly says it
|
||||
returns 200 even when the runtime is not running. Three response
|
||||
shapes were considered:
|
||||
|
||||
- 200 with `status="runtime_not_found"`. Mixes runtime-status
|
||||
values with error codes in the same field, breaking the
|
||||
caller's enum-match dispatch.
|
||||
- 404 `runtime_not_found`. Contradicts the README §Liveness reply
|
||||
«return `200`» wording and forces Lobby's resume flow to add a
|
||||
404 handler that means «no observation» — semantically the same
|
||||
as `Ready=false`.
|
||||
- 200 with `status=""`. The empty status reads naturally as «GM
|
||||
has no observation»; Lobby's resume flow already needs to handle
|
||||
the `Ready=false` branch and the empty status is exactly what
|
||||
«no observation» looks like in practice. Chosen for the smallest
|
||||
caller-side complexity.
|
||||
|
||||
### D6. RTM client errors surface as `service_unavailable`, not a dedicated code
|
||||
|
||||
**Decision.** Both `service/adminstop` and `service/adminpatch` map
|
||||
every error from `RTMClient.Stop` / `RTMClient.Patch` to
|
||||
`error_code=service_unavailable`, regardless of whether the
|
||||
underlying failure is `ErrRTMUnavailable`, a wrapped HTTP 5xx, or a
|
||||
dialler-level transport error.
|
||||
|
||||
**Why.** The frozen error vocabulary in
|
||||
[`gamemaster/api/internal-openapi.yaml`](../api/internal-openapi.yaml)
|
||||
does not contain a `runtime_manager_unavailable` code. Three options
|
||||
were on the table:
|
||||
|
||||
- Add a new code. Rejected: the OpenAPI surface is contract-frozen
|
||||
from Stage 06 and adding a new error code is a wire-format change
|
||||
that pulls every consumer into a re-validation. Stage 17 deals
|
||||
with service-layer code only; no contract change is in scope.
|
||||
- Map RTM failures to `engine_unreachable`. Rejected: the RTM call
|
||||
is a sibling-service hop, not an engine call; mixing the two in
|
||||
a single label confuses operators reading metric / log labels.
|
||||
- Map RTM failures to `service_unavailable`. Accepted: the
|
||||
vocabulary already documents `service_unavailable` as «a
|
||||
steady-state dependency was unreachable for this call», which is
|
||||
exactly what an RTM outage looks like from GM's perspective.
|
||||
|
||||
The Stage 12 D5 decision record in
|
||||
[`stage12-external-clients.md`](./stage12-external-clients.md)
|
||||
already records that the RTM adapter wraps every non-success
|
||||
outcome in `ports.ErrRTMUnavailable` without distinguishing
|
||||
sub-cases; Stage 17 simply consumes the unified sentinel.
|
||||
|
||||
## Cross-stage consequences
|
||||
|
||||
- The new port surface `RuntimeRecordStore.UpdateImage` is
|
||||
available to every later consumer; Stage 18 and Stage 19 do not
|
||||
use it. Existing hand-rolled fakes carry a no-op stub.
|
||||
- `OpKindStop`, `OpKindForceNextTurn`, `OpKindPatch`, `OpKindBanish`
|
||||
were introduced in Stage 09 / Stage 10 already; Stage 17 is their
|
||||
first writer.
|
||||
- The telemetry counter `gamemaster.banish.outcomes` (declared in
|
||||
Stage 08) gets its first call site in `service/adminbanish`. No
|
||||
new counters are introduced for `adminstop` / `adminforce` /
|
||||
`adminpatch` / `livenessreply`; the README §Observability list
|
||||
does not mention them and Stage 17 deliberately stays inside the
|
||||
declared instrument set.
|
||||
- The Stage 19 REST handlers consume the five services without
|
||||
service-layer changes: each handler decodes the JSON envelope,
|
||||
fills `Input.OpSource` / `Input.SourceRef` from the
|
||||
`X-Galaxy-Caller` header convention, and translates `Result.ErrorCode`
|
||||
into the standard error envelope.
|
||||
Reference in New Issue
Block a user