# Background Workers
This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):
- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
[`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
`InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
`/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
periodic TTL cleanup.

The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.
## 1. Single ownership per `event_type`
The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.
## 2. `container_disappeared` is conditional on PG state
The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply:
- record missing → suppress (the destroyed container was never owned
by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
swapped to a new container through restart or patch; the destroy is
the expected removal of the prior container id).

Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.
A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.
## 3. `die` with exit code `0` is suppressed
`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.
## 4. Inspect worker leaves `container_disappeared` to the reconciler
When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:
- the reconciler is the single authority for PG-drift reconciliation.
Adding a third source for `container_disappeared` would risk double
emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
minutes. The latency window for "Docker drops the container, RTM
notices" is therefore at most 5 minutes in v1, which is acceptable
for the kinds of drift the reconciler covers (manual `docker rm`
outside RTM, daemon restart with stale records). If a future
requirement tightens the window, promoting the inspect-side
observation to a real `container_disappeared` is a one-line change.
## 5. Probe hysteresis is in-memory and pruned per tick
The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:
- `consecutiveFailures` — incremented on every failed probe, reset on
every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
the failure persists, and triggers a single `probe_recovered` on the
first success after the threshold was crossed.

The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.
State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.
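The two-counter hysteresis can be sketched as a single state-transition method. The threshold constant and the `observe` helper are assumptions made for the sketch; the real worker's structure may differ:

```go
package main

import "fmt"

// probeState mirrors the two counters described above; the struct and
// its method are an illustrative sketch, not the worker's real code.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// failureThreshold is an assumed value for the sketch.
const failureThreshold = 3

// observe folds one probe result into the state and returns the event
// to publish, or "" when nothing crosses a hysteresis edge.
func (s *probeState) observe(ok bool) string {
	if !ok {
		s.consecutiveFailures++
		if s.consecutiveFailures >= failureThreshold && !s.failurePublished {
			s.failurePublished = true
			return "probe_failed" // emitted once while the failure persists
		}
		return ""
	}
	s.consecutiveFailures = 0
	if s.failurePublished {
		s.failurePublished = false
		return "probe_recovered" // single recovery after a published failure
	}
	return ""
}

func main() {
	var s probeState
	for _, ok := range []bool{false, false, false, false, true, true} {
		if ev := s.observe(ok); ev != "" {
			fmt.Println(ev)
		}
	}
	// prints probe_failed (on the 3rd failure), then probe_recovered
}
```

Note how `failurePublished` carries both duties: it suppresses repeated `probe_failed` emissions and arms the single `probe_recovered`.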
## 6. Probe concurrency is bounded by a fixed cap
Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`). Three reasons:
- A single slow engine cannot delay the entire cohort. Sequential
per-game probing would multiply the worst case by `len(records)`,
which is the wrong shape for what is fundamentally a fan-out
observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
cap) was rejected to avoid pathological CPU and connection bursts
if the running list ever grows beyond what RTM was sized for. 16
in-flight probes at the default 2s timeout fit a single RTM
instance well within typical OS file-descriptor and TCP
ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
single-instance and the active-game count is bounded by Lobby; a
configurable cap is something we promote to env if a real workload
demands it.

The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.
## 7. Events listener reconnects with fixed backoff
The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while ctx is alive.
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.
A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.
## 8. Health publisher remains best-effort
Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.
## 9. Stream offset labels are stable identifiers
Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels map to the storage keys `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.
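The decoupling amounts to deriving the key from the fixed label rather than from the stream name. A trivial sketch (the helper name is an assumption; the key pattern is the one documented above):

```go
package main

import "fmt"

// offsetKey builds the Redis key for a consumer's persisted offset.
// Because the label is fixed per consumer, the key survives a rename
// of the underlying stream via the *_STREAM env vars.
func offsetKey(label string) string {
	return "rtmanager:stream_offsets:" + label
}

func main() {
	fmt.Println(offsetKey("startjobs")) // rtmanager:stream_offsets:startjobs
	fmt.Println(offsetKey("stopjobs"))  // rtmanager:stream_offsets:stopjobs
}
```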
## 10. `OpSource` and `SourceRef` originate at the consumer boundary
Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.
## 11. Replay-no-op detection lives in the service layer
The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.
The wire-shape consequences:
- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
start, the existing record carries `container_id` and
`engine_endpoint`; for stop on `status=removed`, both fields are
empty strings (the record was nulled by an earlier cleanup) — the
AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
returned a zero `Record`; the consumer publishes empty
`container_id` and `engine_endpoint`.
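The three shapes can be expressed as a small classifier over the result fields. The struct is illustrative, not the real `RuntimeJobResult` payload, and the error-code value used in the demo is hypothetical:

```go
package main

import "fmt"

// jobResult carries only the fields relevant to the wire shapes.
type jobResult struct {
	Outcome        string
	ErrorCode      string
	ContainerID    string
	EngineEndpoint string
}

// classify names the shape a consumer such as Lobby would observe.
func classify(r jobResult) string {
	switch {
	case r.Outcome == "success" && r.ErrorCode == "":
		return "fresh operation"
	case r.Outcome == "success" && r.ErrorCode == "replay_no_op":
		return "idempotent replay"
	case r.Outcome == "failure" && r.ErrorCode != "":
		return "failed operation" // container_id / engine_endpoint empty
	default:
		return "out of contract"
	}
}

func main() {
	fmt.Println(classify(jobResult{Outcome: "success"}))
	fmt.Println(classify(jobResult{Outcome: "success", ErrorCode: "replay_no_op"}))
	// "some_error" is a hypothetical error code for the demo only:
	fmt.Println(classify(jobResult{Outcome: "failure", ErrorCode: "some_error"}))
}
```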
## 12. Per-message errors are absorbed; the offset always advances
The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.
## 13. `requested_at_ms` is logged-only
The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.
## 14. Reconciler: per-game lease around every write
Without any other guard, a `running → removed` mutation racing a
restart's inner stop would clobber the restart's freshly-installed
`running` record. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).
The reconciler splits its work into two phases:
- **Read pass — lockless.**
`docker.List({com.galaxy.owner=rtmanager})` followed by
`RuntimeRecords.ListByStatus(running)`. No lease is taken; both
reads are point-in-time observations of independent systems and a
stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
(`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
per-game lease, re-reads the record under the lease, and then
either applies the mutation or returns when state has changed.
A lease conflict (`acquired=false`) is logged at `Info` and the
game is silently skipped — the next tick will retry. A lease-store
error is logged at `Warn`; the rest of the pass continues.

The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.
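The lease-guarded shape shared by the three mutations can be sketched with every dependency as a callback (a simplified sketch; the real code goes through the port interfaces, not function parameters):

```go
package main

import "fmt"

// disposeOne illustrates one write-pass mutation: acquire the per-game
// lease, re-read under it, then apply or skip.
func disposeOne(gameID string,
	acquire func(string) (bool, error),
	release func(string),
	reRead func(string) (status string, err error),
	apply func(string) error) error {
	acquired, err := acquire(gameID)
	if err != nil {
		fmt.Println("warn: lease store error:", err) // the pass continues
		return nil
	}
	if !acquired {
		fmt.Println("info: lease held elsewhere, skipping", gameID) // next tick retries
		return nil
	}
	defer release(gameID)
	status, err := reRead(gameID) // the read pass was lockless; re-check under the lease
	if err != nil {
		return err
	}
	if status != "running" {
		return nil // state moved between read pass and lease: skip the mutation
	}
	return apply(gameID) // e.g. the running → removed status update + publish
}

func main() {
	applied := false
	_ = disposeOne("game-1",
		func(string) (bool, error) { return true, nil },
		func(string) {},
		func(string) (string, error) { return "running", nil },
		func(string) error { applied = true; return nil })
	fmt.Println(applied) // true: lease held and record still running
}
```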
## 15. Three drift kinds covered by the reconciler
- `adopt` — Docker reports a container labelled
`com.galaxy.owner=rtmanager` for which RTM has no record; insert a
fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
stop or remove the container (operators may have started it
manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
in Docker; mark `status=removed`, publish
`container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
exists but is in `exited`; mark `status=stopped`, publish
`container_exited` with the observed exit code. This third path
exists because the events listener sees only live events; a
container that died while RTM was offline would otherwise stay
`running` indefinitely. The drift is exposed through
`rtmanager.reconcile_drift{kind=observed_exited}` and through the
`container_exited` health event; no `operation_log` entry is
written because the audit log records explicit RTM operations, not
passive observations of Docker state.
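The three drift kinds amount to a set comparison over the two point-in-time views. A simplified sketch, assuming `docker` maps game id to container state for containers labelled `com.galaxy.owner=rtmanager` and `pg` maps game id to record status (the real reconciler works on richer records):

```go
package main

import "fmt"

// classifyDrift computes the adopt / dispose / observed_exited sets.
func classifyDrift(docker, pg map[string]string) (adopt, dispose, observedExited []string) {
	for id := range docker {
		if _, ok := pg[id]; !ok {
			adopt = append(adopt, id) // container without any record
		}
	}
	for id, status := range pg {
		if status != "running" {
			continue // only running records can drift in the other two ways
		}
		switch state, ok := docker[id]; {
		case !ok:
			dispose = append(dispose, id) // running record, container gone
		case state == "exited":
			observedExited = append(observedExited, id) // died while RTM was offline
		}
	}
	return adopt, dispose, observedExited
}

func main() {
	adopt, dispose, exited := classifyDrift(
		map[string]string{"g1": "running", "g3": "exited"},
		map[string]string{"g2": "running", "g3": "running"},
	)
	fmt.Println(adopt, dispose, exited) // [g1] [g2] [g3]
}
```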
## 16. `stopped_at = now (reconciler observation time)`
The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.
The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time` field;
that change is deferred until a consumer materialises.
## 17. Synchronous initial pass + periodic Component
`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.
The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.
## 18. Adopt through `Upsert`, race with start is benign
The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.
Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, and the reconciler reads the same value from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:
- `created_at` is preserved across upserts by the
`ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
container, same image, same hostname, same state path).
Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.
## 19. Cleanup worker delegates to the service
The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate. The service already owns:
- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
`op_source=auto_ttl`);
- the telemetry counter and structured log fields.

In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's game churn over the
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.
The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.
## 20. Sequential per-game work in reconciler and cleanup
Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.
## 21. Cross-module test boundary for the consumer integration test
[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:
- it XADDs a start envelope in the AsyncAPI wire shape (the same
shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
the persistence stores, the lease, and the notification / health
publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
`runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
`error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.