# Stage 18 — `runtime:health_events` consumer
This decision record captures the non-obvious choices made while
implementing the asynchronous consumer of the `runtime:health_events`
Redis Stream produced by Runtime Manager. The consumer translates RTM
observations into three effects on Game Master state:

- Updates `runtime_records.engine_health` per game with a short summary string.
- For terminal container events, applies a CAS `running → engine_unreachable`; for `probe_recovered`, applies the symmetric recovery CAS `engine_unreachable → running`.
- Publishes a debounced `runtime_snapshot_update` on `gm:lobby_events` only when the engine-health summary or the runtime status actually changed.
The reference precedent for the worker shape (`Dependencies` /
`NewWorker` / `Run` / `Shutdown` / exported `HandleMessage`) is the
Lobby gmevents consumer at `lobby/internal/worker/gmevents`. Seven
decisions deviate from a literal reading of `../PLAN.md` or are sharp
enough to surface here.
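For orientation, a minimal Go sketch of that worker shape follows. It is
an illustration, not the Stage 18 code: the field set of `Dependencies`
and the entry payload type are assumptions.

```go
package healthevents

import (
	"context"
	"sync"
)

// Dependencies carries constructor-injected collaborators (stores,
// publisher, logger, stream reader). The concrete field set is elided
// here; see the gmevents precedent for the shape.
type Dependencies struct{}

// Worker consumes runtime:health_events and applies the three effects
// described above.
type Worker struct {
	deps Dependencies

	mu          sync.RWMutex
	lastSummary map[string]string // gameID → last emitted summary (see D4)

	done chan struct{}
}

func NewWorker(deps Dependencies) *Worker {
	return &Worker{
		deps:        deps,
		lastSummary: make(map[string]string),
		done:        make(chan struct{}),
	}
}

// Run blocks until ctx is cancelled. The real loop blocking-reads the
// next batch from runtime:health_events starting at the stored offset,
// calls HandleMessage per entry, and saves the offset (see D6); the
// sketch only models cancellation.
func (w *Worker) Run(ctx context.Context) error {
	defer close(w.done)
	<-ctx.Done()
	return ctx.Err()
}

// Shutdown waits for Run to return or for the shutdown deadline.
func (w *Worker) Shutdown(ctx context.Context) error {
	select {
	case <-w.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// HandleMessage is exported so loop tests can drive a single entry.
func (w *Worker) HandleMessage(ctx context.Context, entry map[string]string) error {
	// Translate the event type, mutate the runtime row, maybe publish
	// (decisions D1–D5 below).
	return nil
}
```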
## Decisions
### D1. Event-type taxonomy expanded to seven values
**Decision.** The consumer maps all seven values published by RTM
(`rtmanager/internal/domain/health/snapshot.go`), not the six listed in
PLAN Stage 18. The added values are `container_started` and
`probe_recovered`. Both are mapped to the summary string `healthy`.
`probe_recovered` additionally attempts the recovery CAS
`engine_unreachable → running`. `container_started` does not transition
status: Game Master owns runtime startup through the register-runtime
flow, so RTM's `container_started` observation is informational at the
consumer level.
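The mapping can be pictured as a small lookup. Only `container_started`,
`probe_recovered`, and `container_oom` are named in this record, so the
other event names and the summary strings below are placeholders; the
authoritative vocabulary lives in
`rtmanager/internal/domain/health/snapshot.go`.

```go
package healthevents

// plan captures what one RTM event means for the runtime row: a summary
// string to write and, optionally, a CAS transition to attempt.
type plan struct {
	Summary  string
	From, To string // empty when the event does not transition status
}

// planFor sketches the D1 mapping. Event names other than the three this
// record quotes are placeholders for the remaining snapshot.go values.
func planFor(eventType string) (plan, bool) {
	switch eventType {
	case "container_started":
		// Informational only: Game Master owns startup via the
		// register-runtime flow, so no status transition.
		return plan{Summary: "healthy"}, true
	case "probe_recovered":
		// Uses the recovery transition reserved in
		// internal/domain/runtime/transitions.go.
		return plan{Summary: "healthy", From: "engine_unreachable", To: "running"}, true
	case "container_oom":
		// Terminal container event; the summary string is illustrative.
		return plan{Summary: "oom", From: "running", To: "engine_unreachable"}, true
	default:
		// The remaining four RTM event types map analogously.
		return plan{}, false
	}
}
```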
**Why.** The transition table in `internal/domain/runtime/transitions.go`
already declares `engine_unreachable → running` with the comment
«reserved for the Stage 18 consumer; declared here so Stage 18 needs no
transitions edit». The reserved transition is only useful when an event
in the input stream actually triggers it, and the only such event in
RTM's vocabulary is `probe_recovered`. Leaving the two extra event types
unmapped would either drop information (if ignored entirely) or keep the
recovery transition forever unreachable. Mapping them now is the minimum
diff that closes the loop.
### D2. CAS conflict on a status mutation falls back to a health-only update
**Decision.** When the worker plans a status transition (e.g.,
`running → engine_unreachable` for `container_oom`) and
`RuntimeRecordStore.UpdateStatus` returns `runtime.ErrConflict` or
`runtime.ErrInvalidTransition`, the worker logs the conflict at debug
level and falls back to `RuntimeRecordStore.UpdateEngineHealth`. The
summary column is refreshed; the status column stays under whatever the
concurrent flow holds.
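A sketch of that fallback, flattened to plain arguments and with the
sentinel errors redeclared locally so the fragment compiles on its own.
Whether `UpdateStatus` also carries the summary is an assumption here;
the real code matches `runtime.ErrConflict` / `runtime.ErrInvalidTransition`
and uses the `ports` input structs.

```go
package healthevents

import (
	"context"
	"errors"
	"log/slog"
)

// Local stand-ins for runtime.ErrConflict / runtime.ErrInvalidTransition.
var (
	errConflict          = errors.New("conflict")
	errInvalidTransition = errors.New("invalid transition")
)

// recordStore is the subset of the runtime-record port this path needs.
type recordStore interface {
	UpdateStatus(ctx context.Context, gameID, expectedFrom, to, summary string) error
	UpdateEngineHealth(ctx context.Context, gameID, summary string) error
}

// applyStatusOrHealth attempts the planned CAS (e.g. running →
// engine_unreachable for container_oom) and falls back to a health-only
// update when a concurrent flow holds the row.
func applyStatusOrHealth(ctx context.Context, store recordStore, log *slog.Logger,
	gameID, from, to, summary string) (statusChanged bool, err error) {
	if from != "" {
		casErr := store.UpdateStatus(ctx, gameID, from, to, summary)
		switch {
		case casErr == nil:
			return true, nil
		case errors.Is(casErr, errConflict), errors.Is(casErr, errInvalidTransition):
			// Turn generation or an admin operation owns the status; log
			// at debug and keep only the summary current.
			log.DebugContext(ctx, "status CAS lost; health-only fallback",
				"game_id", gameID, "err", casErr)
		default:
			return false, casErr
		}
	}
	// Unguarded by design: late events still bookkeep the summary (D3).
	return false, store.UpdateEngineHealth(ctx, gameID, summary)
}
```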
**Why.** Two flows can hold the runtime row when an RTM event arrives:
turn generation (`generation_in_progress`) and admin operations
(`stopped`, `finished`). Forcing the consumer to win over those flows
would either reintroduce stale-status writes or require expanding the
allowed-transitions table to include every non-terminal source; the
latter weakens the guard that turn generation relies on. The failure
semantics turn generation already implements (engine call timeout →
`generation_failed`) cover the case where an OOM arrives while a turn
is in flight: the engine call from turn generation will fail naturally
a moment later. The consumer's job in that window is to keep the
summary current so operators see «last known: oom» on
`gm:lobby_events`.
### D3. New port method `UpdateEngineHealth`
**Decision.** `internal/ports/runtimerecordstore.go` gains a new method
`UpdateEngineHealth(ctx, UpdateEngineHealthInput) error` with its own
input struct and `Validate`. The Postgres adapter gains a matching
`UPDATE runtime_records SET engine_health = $1, updated_at = $2 WHERE game_id = $3`.
The existing `UpdateStatus` is not repurposed for health-only updates.
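A sketch of the port fragment. Everything beyond the method name, the
input-struct-plus-`Validate` shape, and the SQL quoted above (for
example the `Now` field, mirroring the `updated_at = $2` bind) is an
assumption.

```go
package ports

import (
	"context"
	"errors"
	"time"
)

// UpdateEngineHealthInput carries a health-only update. The Now field
// is an assumption, standing in for whatever feeds updated_at.
type UpdateEngineHealthInput struct {
	GameID  string
	Summary string
	Now     time.Time
}

// Validate deliberately knows nothing about transitions: a health-only
// update is legal in every status, including stopped and finished.
func (in UpdateEngineHealthInput) Validate() error {
	if in.GameID == "" {
		return errors.New("game_id is required")
	}
	if in.Summary == "" {
		return errors.New("summary is required")
	}
	return nil
}

// RuntimeRecordStore fragment: the new method sits next to the existing
// guarded UpdateStatus (other methods elided).
type RuntimeRecordStore interface {
	UpdateEngineHealth(ctx context.Context, in UpdateEngineHealthInput) error
}
```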
**Why.** `UpdateStatusInput.Validate` calls
`runtime.Transition(ExpectedFrom, To)` and rejects every pair where
`ExpectedFrom == To` (Stage 17 D1). A health-only update keeps the
runtime in its current status, so any attempt to feed `UpdateStatus`
with `ExpectedFrom == To` is rejected before the SQL even runs. The
same precedent led Stage 17 to add `UpdateImage` rather than relax the
self-transition guard; Stage 18 follows that precedent.

In addition, the health update is not gated on a CAS at all:
late-arriving events should still bookkeep the summary regardless of
the current status (including `stopped` and `finished`). A guarded
`UpdateStatus`-shaped variant would have to enumerate every source
status the consumer might observe; an unguarded `UpdateEngineHealth`
sidesteps the question.
### D4. In-memory dedupe of last-emitted summaries per game
**Decision.** The worker keeps a `map[string]string` (gameID →
lastEmittedSummary) under a `sync.RWMutex`. A snapshot is published
when either the status transitioned in this iteration or the new
summary differs from the cached one for the same game. The cache is
process-local; on restart it is empty.
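A self-contained sketch of that cache; type and method names are
illustrative, not the actual Stage 18 identifiers.

```go
package healthevents

import "sync"

// summaryCache is the process-local dedupe from D4: gameID → last
// emitted engine-health summary. Empty after restart by design.
type summaryCache struct {
	mu   sync.RWMutex
	last map[string]string
}

func newSummaryCache() *summaryCache {
	return &summaryCache{last: make(map[string]string)}
}

// shouldPublish reports whether a snapshot must go out: always on a
// status transition, otherwise only when the summary changed.
func (c *summaryCache) shouldPublish(gameID, summary string, statusChanged bool) bool {
	if statusChanged {
		c.remember(gameID, summary)
		return true
	}
	c.mu.RLock()
	prev, ok := c.last[gameID]
	c.mu.RUnlock()
	if ok && prev == summary {
		return false // debounced: duplicate summary suppressed
	}
	c.remember(gameID, summary)
	return true
}

func (c *summaryCache) remember(gameID, summary string) {
	c.mu.Lock()
	c.last[gameID] = summary
	c.mu.Unlock()
}
```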
**Why.** `./README.md` §gm:lobby_events freezes the publication rule:
snapshots are emitted on transitions and on health-summary changes
(«debounced — duplicates are suppressed when the summary did not
change»). Stage 18 chooses an in-process map over a Redis-backed dedupe
for two reasons:

- Game Master is single-instance in v1 (`./README.md` §Non-Goals); a
  per-process map is sufficient for v1 correctness.
- Losing the cache on restart causes at most one extra snapshot per
  game right after restart. Lobby's `gmevents` consumer is idempotent
  (CAS-protected status transitions, deterministic snapshot blob), so
  the extra emission is benign.

A Redis-backed dedupe is cheap to introduce later if multi-instance
Game Master ever lands; until then the simpler choice ships less code.
### D5. Snapshot construction reads the runtime row again after the mutation
**Decision.** Whenever the worker decides to publish, it re-reads the
runtime record (`RuntimeRecordStore.Get`) and builds the
`RuntimeSnapshotUpdate` from that fresh row. The `EngineHealthSummary`,
`RuntimeStatus`, and `CurrentTurn` fields therefore reflect whatever
the database holds after the mutation, rather than what the worker
just intended to write.
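A sketch of the read-after-write, with record and envelope types cut
down to the fields this record names (the real structs carry more, and
the `PlayerTurnStats` type here is an assumption):

```go
package healthevents

import "context"

// runtimeRecord mirrors only the fields this record names.
type runtimeRecord struct {
	GameID              string
	RuntimeStatus       string
	EngineHealthSummary string
	CurrentTurn         int
}

// RuntimeSnapshotUpdate is likewise trimmed to the named fields.
type RuntimeSnapshotUpdate struct {
	GameID              string
	RuntimeStatus       string
	EngineHealthSummary string
	CurrentTurn         int
	PlayerTurnStats     map[string]int // nil on this path: no fresh engine state
}

type recordGetter interface {
	Get(ctx context.Context, gameID string) (runtimeRecord, error)
}

// buildSnapshot shows the D5 shape: both publish paths (CAS won, or
// health-only fallback) converge on one fresh read of the row.
func buildSnapshot(ctx context.Context, store recordGetter, gameID string) (RuntimeSnapshotUpdate, error) {
	rec, err := store.Get(ctx, gameID)
	if err != nil {
		return RuntimeSnapshotUpdate{}, err
	}
	return RuntimeSnapshotUpdate{
		GameID:              rec.GameID,
		RuntimeStatus:       rec.RuntimeStatus,
		EngineHealthSummary: rec.EngineHealthSummary,
		CurrentTurn:         rec.CurrentTurn,
		// PlayerTurnStats stays nil: stats wait for the next turn.
	}, nil
}
```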
**Why.** Two paths can produce the same publish decision: the CAS
succeeded (status changed, summary changed), or the CAS conflicted and
the fallback `UpdateEngineHealth` took over (status unchanged from the
worker's point of view, but possibly mutated by a concurrent flow
between the conflict and the read). A single read-after-write reduces
both paths to the same envelope-building code and keeps the snapshot
honest about what is actually in the database. `PlayerTurnStats` is
intentionally left nil: the consumer does not have a fresh engine-state
payload, so per-player stats stay empty until the next turn (this
matches `./README.md` §gm:lobby_events for status-only transitions).
### D6. Stream-offset label is `health_events`
**Decision.** The consumer uses the short label `health_events` for
`StreamOffsetStore.Load` / `Save`. The corresponding Redis key is
`gamemaster:stream_offsets:health_events`.
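For illustration, a fragment showing how the label is used around the
read loop; the `StreamOffsetStore` signatures and the empty-offset
convention are assumptions beyond the `Load` / `Save` names above.

```go
package healthevents

import "context"

// streamOffsetStore is the fragment of the offset port the worker uses;
// these signatures are assumptions.
type streamOffsetStore interface {
	Load(ctx context.Context, label string) (string, error)
	Save(ctx context.Context, label, offset string) error
}

// offsetLabel is the short consumer identifier from D6; the adapter
// expands it to gamemaster:stream_offsets:health_events.
const offsetLabel = "health_events"

// resumeFrom picks the stream position for the next blocking read.
func resumeFrom(ctx context.Context, offsets streamOffsetStore) (string, error) {
	off, err := offsets.Load(ctx, offsetLabel)
	if err != nil {
		return "", err
	}
	if off == "" {
		return "0", nil // assumption: no stored offset means read from the start
	}
	return off, nil
}
```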
**Why.** The label convention is documented in `./README.md`
§Persistence Layout / Redis runtime-coordination state: a short logical
identifier of the consumer, stable across renames of the underlying
stream key. The Lobby `gmevents` consumer follows the same shape
(`gm_lobby_events`).
### D7. Worker wiring deferred to Stage 19
**Decision.** Stage 18 ships the worker package and unit/loop tests but
does not register the worker as an `app.Component` in
`internal/app/runtime.go`. Wiring is deferred to Stage 19.
**Why.** The same pattern is already in place for the scheduler ticker
introduced at Stage 15: the worker exists in the source tree but is
not wired into `runtime.app = New(cfg, internalServer)`. Stage 19
explicitly bundles handler wiring with worker wiring (see PLAN
Stage 19), so deferring is consistent with the precedent. The
configuration values the wiring will need (stream name, block timeout,
offset-store DSN) are already loaded by `internal/config` and were
introduced in Stage 08.