---
stage: 18
title: runtime:health_events consumer
---

# Stage 18 — `runtime:health_events` consumer

This decision record captures the non-obvious choices made while implementing the asynchronous consumer of the `runtime:health_events` Redis Stream produced by Runtime Manager. The consumer translates RTM observations into three effects on Game Master state:

1. Updates `runtime_records.engine_health` per game with a short summary string.
2. For terminal container events, applies a CAS `running → engine_unreachable`; for `probe_recovered`, applies the symmetric recovery CAS `engine_unreachable → running`.
3. Publishes a debounced `runtime_snapshot_update` on `gm:lobby_events` only when the engine-health summary or the runtime status actually changed.

The reference precedent for the worker shape (`Dependencies` / `NewWorker` / `Run` / `Shutdown` / exported `HandleMessage`) is the Lobby `gmevents` consumer at `lobby/internal/worker/gmevents`.

Seven decisions deviate from a literal reading of [`../PLAN.md`](../PLAN.md) or are sharp enough to surface here.

## Decisions

### D1. Event-type taxonomy expanded to seven values

**Decision.** The consumer maps all seven values published by RTM ([`rtmanager/internal/domain/health/snapshot.go`](../../rtmanager/internal/domain/health/snapshot.go)), not the six listed in PLAN Stage 18. The added values are `container_started` and `probe_recovered`. Both are mapped to the summary string `healthy`. `probe_recovered` additionally attempts the recovery CAS `engine_unreachable → running`. `container_started` does not transition status — Game Master owns runtime startup through the register-runtime flow, so RTM's `container_started` observation is informational at the consumer level.
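The mapping can be sketched as a small table from event type to planned effect. Only the three event names that this record mentions appear below; the remaining four values live in RTM's snapshot package, and the `container_oom` summary string is illustrative:

```go
package main

import "fmt"

// effect describes what the consumer plans for one RTM event type.
type effect struct {
	summary      string // written to runtime_records.engine_health
	casToUnreach bool   // plan the CAS running -> engine_unreachable
	casRecover   bool   // plan the recovery CAS engine_unreachable -> running
}

// eventEffects covers the three event types named in this record; the other
// four are declared in rtmanager/internal/domain/health/snapshot.go.
var eventEffects = map[string]effect{
	"container_started": {summary: "healthy"},                   // informational only (D1)
	"probe_recovered":   {summary: "healthy", casRecover: true}, // triggers the reserved transition
	"container_oom":     {summary: "last known: oom", casToUnreach: true},
}

func main() {
	e := eventEffects["probe_recovered"]
	fmt.Println(e.summary, e.casRecover) // healthy true
}
```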
**Why.** The transition table in [`internal/domain/runtime/transitions.go`](../internal/domain/runtime/transitions.go) already declares `engine_unreachable → running` with the comment `reserved for the Stage 18 consumer; declared here so Stage 18 needs no transitions edit`. The reserved transition is only useful when an event in the input stream actually triggers it; the only such event in RTM's vocabulary is `probe_recovered`. Leaving the two extra event types unmapped would either drop information (if ignored entirely) or keep the recovery transition forever unreachable. Mapping them now is the minimum diff that closes the loop.

### D2. CAS conflict on a status mutation falls back to a health-only update

**Decision.** When the worker plans a status transition (e.g., `running → engine_unreachable` for `container_oom`) and `RuntimeRecordStore.UpdateStatus` returns `runtime.ErrConflict` or `runtime.ErrInvalidTransition`, the worker logs the conflict at debug level and falls back to `RuntimeRecordStore.UpdateEngineHealth`. The summary column is refreshed; the status column stays under whatever the concurrent flow holds.

**Why.** Two flows can hold the runtime row when an RTM event arrives: turn generation (`generation_in_progress`) and admin operations (`stopped`, `finished`). Forcing the consumer to win over those flows would either reintroduce stale-status writes or require expanding the allowed-transitions table to include every non-terminal source — the latter weakens the guard that turn generation relies on. The failure semantics turn generation already implements (engine call timeout → `generation_failed`) cover the case where an `oom` arrives while a turn is in flight: the engine call from turn generation will fail naturally a moment later. The consumer's job in that window is to keep the summary current so operators see "last known: oom" on `gm:lobby_events`.

### D3. New port method `UpdateEngineHealth`
**Decision.** [`internal/ports/runtimerecordstore.go`](../internal/ports/runtimerecordstore.go) gains a new method `UpdateEngineHealth(ctx, UpdateEngineHealthInput) error` with its own input struct and `Validate`. The Postgres adapter gains a matching `UPDATE runtime_records SET engine_health = $1, updated_at = $2 WHERE game_id = $3`. The existing `UpdateStatus` is **not** repurposed for health-only updates.

**Why.** `UpdateStatusInput.Validate` calls `runtime.Transition(ExpectedFrom, To)` and rejects every pair where `ExpectedFrom == To` (Stage 17 D1). A health-only update keeps the runtime in its current status, so any attempt to feed `UpdateStatus` with `ExpectedFrom == To` is rejected before the SQL even runs. The same precedent led Stage 17 to add `UpdateImage` rather than relax the self-transition guard; Stage 18 follows that precedent. In addition, the health update is not gated on a CAS at all: late-arriving events should still bookkeep the summary regardless of the current status (including `stopped` and `finished`). A guarded `UpdateStatus`-shaped variant would have to enumerate every source status the consumer might observe; an unguarded `UpdateEngineHealth` sidesteps the question.

### D4. In-memory dedupe of last-emitted summaries per game

**Decision.** The worker keeps a `map[string]string` (`gameID → lastEmittedSummary`) under a `sync.RWMutex`. A snapshot is published when either the status transitioned in this iteration or the new summary differs from the cached one for the same game. The cache is process-local; on restart it is empty.

**Why.** [`./README.md` §`gm:lobby_events`](../README.md) freezes the publication rule: snapshots are emitted on transitions and on health-summary changes ("debounced — duplicates are suppressed when the summary did not change"). Stage 18 chooses an in-process map over a Redis-backed dedupe for two reasons:

1. Game Master is single-instance in v1 ([`./README.md §Non-Goals`](../README.md)); a per-process map is sufficient for v1 correctness.
2. Losing the cache on restart causes at most one extra snapshot per game right after restart — Lobby's `gmevents` consumer is idempotent (CAS-protected status transitions, deterministic snapshot blob), so the extra emission is benign.

A Redis-backed dedupe is cheap to introduce later if multi-instance Game Master ever lands; until then the simpler choice ships less code.

### D5. Snapshot construction reads the runtime row again after the mutation

**Decision.** Whenever the worker decides to publish, it re-reads the runtime record (`RuntimeRecordStore.Get`) and builds the `RuntimeSnapshotUpdate` from that fresh row. The `EngineHealthSummary`, `RuntimeStatus`, and `CurrentTurn` fields therefore reflect whatever the database holds after the mutation, rather than what the worker just intended to write.

**Why.** Two paths can produce the same publish decision: the CAS succeeded (status changed, summary changed), or the CAS conflicted and the fallback `UpdateEngineHealth` took over (status unchanged from the worker's point of view, but possibly mutated by a concurrent flow between the conflict and the read). A single read-after-write reduces both paths to the same envelope-building code and keeps the snapshot honest about what is actually in the database. `PlayerTurnStats` is intentionally left `nil`: the consumer does not have a fresh engine-state payload, so per-player stats stay empty until the next turn (this matches [`./README.md` §`gm:lobby_events`](../README.md) for status-only transitions).

### D6. Stream-offset label is `health_events`

**Decision.** The consumer uses the short label `health_events` for `StreamOffsetStore.Load` / `Save`. The corresponding Redis key is `gamemaster:stream_offsets:health_events`.
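The documented key shape can be made concrete with a tiny sketch; `offsetLabel` and `redisKey` are hypothetical names, not the worker's real identifiers:

```go
package main

import "fmt"

// offsetLabel is the short logical consumer id from D6, stable across
// renames of the underlying stream key.
const offsetLabel = "health_events"

// redisKey mirrors the documented shape gamemaster:stream_offsets:<label>.
func redisKey(label string) string {
	return "gamemaster:stream_offsets:" + label
}

func main() {
	fmt.Println(redisKey(offsetLabel)) // gamemaster:stream_offsets:health_events
}
```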
**Why.** The label convention is documented in [`./README.md §Persistence Layout / Redis runtime-coordination state`](../README.md): a short logical identifier of the consumer, stable across renames of the underlying stream key. The Lobby `gmevents` consumer follows the same shape (`gm_lobby_events`).

### D7. Worker wiring deferred to Stage 19

**Decision.** Stage 18 ships the worker package and unit/loop tests but does not register the worker as an `app.Component` in `internal/app/runtime.go`. Wiring is deferred to Stage 19.

**Why.** The same pattern is already in place for the scheduler ticker introduced at Stage 15: the worker exists in the source tree but is not wired into `runtime.app = New(cfg, internalServer)`. Stage 19 explicitly bundles handler wiring with worker wiring (see PLAN Stage 19), so deferring is consistent with the precedent. The configuration values the wiring will need (stream name, block timeout, offset-store DSN) are already loaded by `internal/config` and were introduced in Stage 08.