Stage 18 — runtime:health_events consumer

This decision record captures the non-obvious choices made while implementing the asynchronous consumer of the runtime:health_events Redis Stream produced by Runtime Manager. The consumer translates RTM observations into three effects on Game Master state:

  1. Updates runtime_records.engine_health per game with a short summary string.
  2. For terminal container events, applies the CAS running → engine_unreachable; for probe_recovered, applies the symmetric recovery CAS engine_unreachable → running.
  3. Publishes a debounced runtime_snapshot_update on gm:lobby_events only when the engine-health summary or the runtime status actually changed.

The reference precedent for the worker shape (Dependencies / NewWorker / Run / Shutdown / exported HandleMessage) is the Lobby gmevents consumer at lobby/internal/worker/gmevents. Seven decisions deviate from a literal reading of ../PLAN.md or are sharp enough to surface here.

Decisions

D1. Event-type taxonomy expanded to seven values

Decision. The consumer maps all seven values published by RTM (rtmanager/internal/domain/health/snapshot.go), not the six listed in PLAN Stage 18. The added values are container_started and probe_recovered. Both are mapped to the summary string healthy. probe_recovered additionally attempts the recovery CAS engine_unreachable → running. container_started does not transition status — Game Master owns runtime startup through the register-runtime flow, so RTM's container_started observation is informational at the consumer level.

Why. The transition table in internal/domain/runtime/transitions.go already declares engine_unreachable → running with the comment reserved for the Stage 18 consumer; declared here so Stage 18 needs no transitions edit. The reserved transition is only useful when an event in the input stream actually triggers it; the only such event in RTM's vocabulary is probe_recovered. Leaving the two extra event types unmapped would either drop information (if ignored entirely) or keep the recovery transition forever unreachable. Mapping them now is the minimum diff that closes the loop.

D2. CAS conflict on a status mutation falls back to a health-only update

Decision. When the worker plans a status transition (e.g., running → engine_unreachable for container_oom) and RuntimeRecordStore.UpdateStatus returns runtime.ErrConflict or runtime.ErrInvalidTransition, the worker logs the conflict at debug and falls back to RuntimeRecordStore.UpdateEngineHealth. The summary column is refreshed; the status column stays under whatever the concurrent flow holds.

Why. Two flows can hold the runtime row when an RTM event arrives: turn generation (generation_in_progress) and admin operations (stopped, finished). Forcing the consumer to win over those flows would either reintroduce stale-status writes or require expanding the allowed-transitions table to include every non-terminal source — the latter weakens the guard that turn generation relies on. The failure semantics turn-generation already implements (engine call timeout → generation_failed) cover the case where an oom arrives while a turn is in flight: the engine call from turngeneration will fail naturally a moment later. The consumer's job in that window is to keep the summary current so operators see «last known: oom» on gm:lobby_events.

D3. New port method UpdateEngineHealth

Decision. internal/ports/runtimerecordstore.go gains a new method UpdateEngineHealth(ctx, UpdateEngineHealthInput) error with its own input struct and Validate. The Postgres adapter gains a matching UPDATE runtime_records SET engine_health = $1, updated_at = $2 WHERE game_id = $3. The existing UpdateStatus is not repurposed for health-only updates.

Why. UpdateStatusInput.Validate calls runtime.Transition(ExpectedFrom, To) and rejects every pair where ExpectedFrom == To (Stage 17 D1). A health-only update keeps the runtime in its current status, so any attempt to feed UpdateStatus with ExpectedFrom == To is rejected before the SQL even runs. The same precedent led Stage 17 to add UpdateImage rather than relax the self-transition guard. Stage 18 follows that precedent.

In addition, the health update is not gated on a CAS at all: late-arriving events should still bookkeep the summary regardless of the current status (including stopped and finished). A guarded UpdateStatus-shaped variant would have to enumerate every source status the consumer might observe; an unguarded UpdateEngineHealth sidesteps the question.

D4. In-memory dedupe of last-emitted summaries per game

Decision. The worker keeps a map[string]string (gameID → lastEmittedSummary) under a sync.RWMutex. A snapshot is published when either the status transitioned in this iteration or when the new summary differs from the cached one for the same game. The cache is process-local; on restart it is empty.

Why. ./README.md §gm:lobby_events freezes the publication rule: snapshots are emitted on transitions and on health-summary changes («debounced — duplicates are suppressed when the summary did not change»). Stage 18 chooses an in-process map over a Redis-backed dedupe for two reasons:

  1. Game Master is single-instance in v1 (./README.md §Non-Goals); a per-process map is sufficient for v1 correctness.
  2. Losing the cache on restart causes at most one extra snapshot per game right after restart — Lobby's gmevents consumer is idempotent (CAS-protected status transitions, deterministic snapshot blob), so the extra emission is benign.

A Redis-backed dedupe is cheap to introduce later if multi-instance Game Master ever lands; until then the simpler choice ships less code.

D5. Snapshot construction reads the runtime row again after the mutation

Decision. Whenever the worker decides to publish, it re-reads the runtime record (RuntimeRecordStore.Get) and builds the RuntimeSnapshotUpdate from that fresh row. The EngineHealthSummary, RuntimeStatus, and CurrentTurn fields therefore reflect whatever the database holds after the mutation, rather than what the worker just intended to write.

Why. Two paths can produce the same publish decision: the CAS succeeded (status changed, summary changed), or the CAS conflicted and the fallback UpdateEngineHealth took over (status unchanged from the worker's point of view, but possibly mutated by a concurrent flow between the conflict and the read). A single read-after-write reduces both paths to the same envelope-building code and keeps the snapshot honest about what is actually in the database. PlayerTurnStats is intentionally left as nil: the consumer does not have a fresh engine state payload, so per-player stats stay empty until the next turn (this matches ./README.md §gm:lobby_events for status-only transitions).

D6. Stream-offset label is health_events

Decision. The consumer uses the short label health_events for StreamOffsetStore.Load / Save. The corresponding Redis key is gamemaster:stream_offsets:health_events.

Why. The label convention is documented in ./README.md §Persistence Layout / Redis runtime-coordination state: short logical identifier of the consumer, stable across renames of the underlying stream key. The Lobby gmevents consumer follows the same shape (gm_lobby_events).

D7. Worker wiring deferred to Stage 19

Decision. Stage 18 ships the worker package and unit/loop tests but does not register the worker as an app.Component in internal/app/runtime.go. Wiring is deferred to Stage 19.

Why. The same pattern is already in place for the scheduler ticker introduced at Stage 15: the worker exists in the source tree but is not wired into runtime.app = New(cfg, internalServer). Stage 19 explicitly bundles handler wiring with worker wiring (see PLAN Stage 19), so deferring is consistent with the precedent. The configuration values the wiring will need (stream name, block timeout, offset-store DSN) are already loaded by internal/config and were introduced in Stage 08.