# Background Workers

This document explains the design of the seven background workers under [`../internal/worker/`](../internal/worker):

- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and [`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events` subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic `InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP `/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) — periodic TTL cleanup.

The current-state behaviour and configuration surface live in [`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring, §Reconciliation), and operational notes are in [`runtime.md`](runtime.md), [`flows.md`](flows.md), and [`runbook.md`](runbook.md). This file records the rationale.

## 1. Single ownership per `event_type`

The `runtime:health_events` vocabulary is shared across five sources; each event type has a single owner, with one deliberate exception (`container_disappeared`, whose two owners cover opposite drift directions).

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events listener, even though Docker emits a `start` action whenever the start service runs the container. The start service already publishes the event with the same wire shape; observing the action in the listener would produce two entries per real start.

## 2. `container_disappeared` is conditional on PG state

The Docker events listener inspects the runtime record before emitting `container_disappeared` for a `destroy` action. Three suppression rules apply:

- record missing → suppress (the destroyed container was never owned by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM swapped to a new container through restart or patch; the destroy is the expected removal of the prior container id).

Only a destroy that arrives for a `running` record whose `current_container_id` still equals the event id is treated as unexpected. This is the wire-side analogue of the reconciler's PG-drift check: the reconciler observes "PG=running, no Docker container" while the events listener observes "Docker says destroy, PG still says running pointing at this container". Together they cover both directions of drift.

A read failure against `runtime_records` is treated conservatively as "suppress" — the listener cannot tell whether the destroy was external or RTM-initiated, and over-emitting `container_disappeared` would lead to a real consumer (`Game Master`) escalating a false positive.
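For concreteness, a minimal sketch of that decision. The `record` shape and the status strings are assumptions for illustration; the real listener reads through the `runtime_records` port and uses its own domain types:

```go
// Sketch only: assumed record shape; the real listener uses the
// RuntimeRecords port and its own status constants.
type record struct {
	Status             string // "running" | "stopped" | "removed"
	CurrentContainerID string
}

// shouldSuppressDestroy applies the three suppression rules plus the
// conservative read-failure rule for a Docker `destroy` action.
func shouldSuppressDestroy(rec *record, readErr error, eventContainerID string) bool {
	switch {
	case readErr != nil:
		return true // cannot classify the destroy; avoid a false positive
	case rec == nil:
		return true // never tracked by RTM as a runtime
	case rec.Status != "running":
		return true // expected tail of an RTM stop or cleanup
	case rec.CurrentContainerID != eventContainerID:
		return true // container was swapped by restart or patch
	default:
		return false // unexpected external destroy → emit container_disappeared
	}
}
```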
## 3. `die` with exit code `0` is suppressed

`docker stop` (and graceful shutdowns via SIGTERM) produces a `die` event with exit code `0`. The `container_exited` contract guarantees a non-zero exit; emitting on exit `0` would shower consumers with normal-stop noise. The listener silently drops the event; the operation log already records the stop on the caller side.

## 4. Inspect worker leaves `container_disappeared` to the reconciler

When `dockerinspect` calls `InspectContainer` and the daemon returns `ports.ErrContainerNotFound`, the worker logs at `Debug` and skips the container. Two reasons:

- the reconciler is the single authority for PG-drift reconciliation. Adding a third source for `container_disappeared` would risk double emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5 minutes. The latency window for "Docker drops the container, RTM notices" is therefore at most 5 minutes in v1, which is acceptable for the kinds of drift the reconciler covers (manual `docker rm` outside RTM, daemon restart with stale records). If a future requirement tightens the window, promoting the inspect-side observation to a real `container_disappeared` is a one-line change.

## 5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a `map[string]*probeState` guarded by a mutex. Two fields live there (sketched below, after §6):

- `consecutiveFailures` — incremented on every failed probe, reset on every success;
- `failurePublished` — prevents repeated `probe_failed` emission while the failure persists, and triggers a single `probe_recovered` on the first success after the threshold was crossed.

The state is non-persistent. RTM is single-instance in v1, and a process restart that loses the counters merely re-establishes the hysteresis from scratch — the only consequence is that a probe failure already in progress at the moment of restart needs another full threshold of failures to surface. Making the state durable would add a Redis round-trip to every probe attempt without buying anything that operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the current running list and removes any state entry whose `game_id` is not in the list. A game that transitions through stopped → running again starts fresh; previously-accumulated counters do not bleed into the new lifecycle.

## 6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel semaphore (`defaultMaxConcurrency = 16`). Three reasons:

- A single slow engine cannot delay the entire cohort. Sequential per-game probing would multiply the worst case by `len(records)`, which is the wrong shape for what is fundamentally a fan-out observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a cap) was rejected to avoid pathological CPU and connection bursts if the running list ever grows beyond what RTM was sized for. 16 in-flight probes at the default 2s timeout fit a single RTM instance well within typical OS file-descriptor and TCP ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is single-instance and the active-game count is bounded by Lobby; a configurable cap is something we promote to an env var if a real workload demands it.

The same reasoning argues against parallelism in the inspect worker: inspect calls are cheap (sub-ms in the local Docker socket case) and serial execution avoids unnecessary concurrency on the daemon socket.
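The §5 hysteresis compresses into a few lines. A sketch under assumed names (`prober`, `publish`, and `threshold` are illustrative stand-ins, not the worker's real types):

```go
import "sync"

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

type prober struct {
	mu        sync.Mutex
	states    map[string]*probeState
	threshold int
	publish   func(gameID, eventType string) // stands in for the health publisher
}

// observe applies one probe result to the per-game hysteresis state.
func (p *prober) observe(gameID string, ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	st := p.states[gameID]
	if st == nil {
		st = &probeState{}
		p.states[gameID] = st
	}
	if ok {
		if st.failurePublished {
			p.publish(gameID, "probe_recovered") // exactly one recovery event
		}
		*st = probeState{} // success resets both fields
		return
	}
	st.consecutiveFailures++
	if st.consecutiveFailures >= p.threshold && !st.failurePublished {
		p.publish(gameID, "probe_failed") // fires once per failure episode
		st.failurePublished = true
	}
}

// prune runs at the start of each tick and drops games not in the running list.
func (p *prober) prune(running map[string]struct{}) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for id := range p.states {
		if _, ok := running[id]; !ok {
			delete(p.states, id)
		}
	}
}
```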
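The §6 fan-out is the standard buffered-channel semaphore. In this sketch, `Record` and the `probeOne` parameter are placeholders for the worker's record type and per-game probe:

```go
import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

type Record struct{ GameID string } // placeholder record shape

func probeAll(ctx context.Context, records []Record, probeOne func(context.Context, Record)) {
	sem := make(chan struct{}, defaultMaxConcurrency)
	var wg sync.WaitGroup
	for _, rec := range records {
		wg.Add(1)
		sem <- struct{}{} // blocks once 16 probes are in flight
		go func(rec Record) {
			defer wg.Done()
			defer func() { <-sem }()
			probeOne(ctx, rec) // each probe carries its own 2s timeout
		}(rec)
	}
	wg.Wait() // the tick completes only after every probe returns
}
```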
## 7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the SDK channel terminates on any transport error (daemon restart, socket hiccup, connection reset). The listener's outer loop handles this by re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait, indefinitely for as long as the ctx is alive.

Crashing the process on a transport error was rejected because losing a few seconds of health observations is a much smaller blast radius than losing the entire RTM process while the start/stop pipelines are running. The save-offset case is different: a lost offset replays the entire backlog and breaks correctness, while a missed health event is observation-only. A subscription error is logged at `Warn` so operators can see the reconnect activity without it dominating the log volume.

## 8. Health publisher remains best-effort

Every emission goes through `ports.HealthEventPublisher.Publish`, the same surface the start service already uses ([`adapters.md`](adapters.md) §8). A publish failure logs at `Error` and proceeds; the worker does not retry, does not adjust its in-memory hysteresis, and does not surface the failure to the caller. The operation log is the source of truth for runtime state; the event stream is a best-effort notification surface for consumers.

## 9. Stream offset labels are stable identifiers

Both consumers persist their progress through `ports.StreamOffsetStore` under fixed labels — `startjobs` for the start-jobs consumer and `stopjobs` for the stop-jobs consumer. The labels map to the `rtmanager:stream_offsets:{label}` keys and stay stable when the underlying stream key is renamed via `RTMANAGER_REDIS_START_JOBS_STREAM` / `RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the consumer at a different stream key does not lose the persisted offset.

## 10. `OpSource` and `SourceRef` originate at the consumer boundary

Every consumed envelope is translated into a `Service.Handle` call with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source `SourceRef` is the Redis Stream entry id (`message.ID`); the `operation_log` rows therefore record the originating envelope id, and the restart / patch correlation logic ([`services.md`](services.md) §13) keeps working when those services are invoked indirectly.

## 11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. `startruntime.Service` returns `Outcome=success, ErrorCode=replay_no_op` when the existing record is already `running` with the same `image_ref`; `stopruntime.Service` does the same for an already-stopped or already-removed record. The consumer copies the result fields into the `RuntimeJobResult` payload verbatim and lets Lobby observe the replay through `error_code`. The wire-shape consequences:

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For start, the existing record carries `container_id` and `engine_endpoint`; for stop on `status=removed`, both fields are empty strings (the record was nulled by an earlier cleanup) — the AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service returned a zero `Record`; the consumer publishes empty `container_id` and `engine_endpoint`.
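The §10 and §11 mechanics in one compressed sketch. Every shape here (`startInput`, `startResult`, `runtimeJobResult`) is illustrative; the real wire shape is fixed by the AsyncAPI contract:

```go
import "context"

// Assumed minimal shapes for illustration only.
type startInput struct {
	GameID, ImageRef, OpSource, SourceRef string
}
type startResult struct {
	Outcome        string // "success" | "failure"
	ErrorCode      string // "" | "replay_no_op" | ...
	ContainerID    string
	EngineEndpoint string
}
type runtimeJobResult struct {
	GameID, Outcome, ErrorCode, ContainerID, EngineEndpoint string
}

func handleOne(ctx context.Context, msgID string, in startInput,
	handle func(context.Context, startInput) startResult) runtimeJobResult {
	// §10: provenance is stamped at the consumer boundary.
	in.OpSource = "lobby_stream" // operation.OpSourceLobbyStream in the real code
	in.SourceRef = msgID         // Redis Stream entry id → operation_log correlation

	// §11: the service decides fresh vs replay; the consumer copies verbatim.
	res := handle(ctx, in)
	return runtimeJobResult{
		GameID:         in.GameID,
		Outcome:        res.Outcome,
		ErrorCode:      res.ErrorCode,   // "replay_no_op" marks an idempotent replay
		ContainerID:    res.ContainerID, // empty on failure or removed records
		EngineEndpoint: res.EngineEndpoint,
	}
}
```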
## 12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any Go-level service error, and any publish failure; `streamOffsetStore.Save` runs unconditionally after each handled message. Pinning the offset on a single transient publish failure was rejected because the durable side effects (the operation_log row, the runtime_records mutation, the Docker state) have already happened on the first pass; pinning the offset to retry the publish would duplicate audit rows for hours until an operator intervened.

The exception is `streamOffsetStore.Save` itself: a save failure returns a wrapped error from `Run`. The component supervisor in `internal/app/app.go` then exits the process and lets the operator escalate, because losing the offset would cause every subsequent restart to re-process every prior envelope.

## 13. `requested_at_ms` is logged-only

The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The consumer parses it (rejecting unparseable values) but only includes the value in structured logs — the field is "used for diagnostics, not authoritative" per the contract. The service layer ignores it; the operation_log uses `service.clock()` for `started_at` / `finished_at`, so Lobby's wall-clock skew never bleeds into RTM persistence.

## 14. Reconciler: per-game lease around every write

A `running → removed` mutation that races a restart's inner stop would clobber the restart's freshly-installed `running` record without any other guard. The reconciler therefore honours the same per-game lease that the lifecycle services hold ([`services.md`](services.md) §1). It splits its work into two phases:

- **Read pass — lockless.** `docker.List({com.galaxy.owner=rtmanager})` followed by `RuntimeRecords.ListByStatus(running)`. No lease is taken; both reads are point-in-time observations of independent systems, and a stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the per-game lease, re-reads the record under the lease, and then either applies the mutation or returns when state has changed. A lease conflict (`acquired=false`) is logged at `info` and the game is silently skipped — the next tick will retry. A lease-store error is logged at `warn`; the rest of the pass continues.

The re-read after lease acquisition is intentional: the read pass is lockless, so by the time the lease is held the runtime record may have moved. `UpdateStatus` already provides CAS via `ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt) does not, so the explicit re-read keeps the three paths uniform and makes the skip condition obvious in code review.
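A sketch of the write-pass shape for the dispose mutation. The lease and store signatures are assumptions patterned on the surfaces this section names, not the real ports:

```go
import (
	"context"
	"log/slog"
)

// Assumed minimal surfaces; the real ports live in internal/ports.
type leaseStore interface {
	Acquire(ctx context.Context, gameID string) (acquired bool, release func(), err error)
}
type recordStore interface {
	Get(ctx context.Context, gameID string) (storedRecord, error)
	UpdateStatus(ctx context.Context, in UpdateStatusInput) error
}
type storedRecord struct{ Status, CurrentContainerID string }
type UpdateStatusInput struct{ GameID, To, ExpectedFrom, ExpectedContainerID string }

type Reconciler struct {
	lease   leaseStore
	records recordStore
	log     *slog.Logger
}

// disposeOne handles "PG says running, Docker has no container" for one game.
func (r *Reconciler) disposeOne(ctx context.Context, gameID, containerID string) {
	acquired, release, err := r.lease.Acquire(ctx, gameID)
	if err != nil {
		r.log.Warn("lease store error", "game_id", gameID) // rest of the pass continues
		return
	}
	if !acquired {
		r.log.Info("lease conflict, skipping", "game_id", gameID) // next tick retries
		return
	}
	defer release()

	// Re-read under the lease: the read pass was lockless and may be stale.
	rec, err := r.records.Get(ctx, gameID)
	if err != nil || rec.Status != "running" || rec.CurrentContainerID != containerID {
		return // state moved since the read pass; nothing to do
	}

	// CAS-guarded mutation; the container_disappeared event and the
	// reconcile_dispose audit entry follow in the real worker.
	_ = r.records.UpdateStatus(ctx, UpdateStatusInput{
		GameID:              gameID,
		To:                  "removed",
		ExpectedFrom:        "running",
		ExpectedContainerID: containerID,
	})
}
```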
## 15. Three drift kinds covered by the reconciler

- `adopt` — Docker reports a container labelled `com.galaxy.owner=rtmanager` for which RTM has no record; insert a fresh `runtime_records` row with `op_kind=reconcile_adopt` and never stop or remove the container (operators may have started it manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing in Docker; mark `status=removed`, publish `container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container exists but is in `exited`; mark `status=stopped`, publish `container_exited` with the observed exit code.

This third path exists because the events listener sees only live events; a container that died while RTM was offline would otherwise stay `running` indefinitely. The drift is exposed through `rtmanager.reconcile_drift{kind=observed_exited}` and through the `container_exited` health event; no `operation_log` entry is written, because the audit log records explicit RTM operations, not passive observations of Docker state.

## 16. `stopped_at = now` (reconciler observation time)

The `observed_exited` path writes `stopped_at = now`, where `now` is the reconciler's observation time. The persistence adapter ([`postgres-migration.md`](postgres-migration.md) §8) hard-codes `stopped_at = now` for the `stopped` destination — there is no port-level knob for an explicit timestamp, and the reconciler does not read `State.FinishedAt` from Docker. The trade-off: `stopped_at` diverges from the daemon's `State.FinishedAt` by at most one tick interval (default 5 minutes). If a downstream consumer ever needs the daemon-observed exit timestamp, the upgrade path is a one-call extension of `UpdateStatusInput` with an optional `StoppedAt *time.Time` field; that change is deferred until a consumer materialises.

## 17. Synchronous initial pass + periodic `Component`

`README §Startup dependencies` step 6 demands "Reconciler runs once and blocks until done" before background workers start, but `app.App.Run` starts every registered `Component` concurrently — component ordering does not translate into start ordering. The reconciler therefore exposes a public `ReconcileNow(ctx)` method that the runtime calls synchronously between `newWiring` and `app.New`. The same `*Reconciler` is then registered as a `Component`; its `Run` only ticks (no immediate pass), so the startup work is not duplicated. The cost is one public method on the worker; the benefit is that the README invariant holds verbatim and the periodic loop is a textbook `Component`.

## 18. Adopt through `Upsert`, race with start is benign

The adopt path constructs a fresh `runtime.RuntimeRecord` (status `running`, container id and `image_ref` from labels, `started_at` from `com.galaxy.started_at_ms` or inspect, state path and Docker network from configuration, engine endpoint from the `http://galaxy-game-{game_id}:8080` rule) and calls `RuntimeRecords.Upsert` (see the sketch below).

Race scenario: the start service has called `docker.Run` but has not yet finished its own `Upsert` when the reconciler observes the container without a record. Both writers eventually arrive at PG with the same key data — the start service knows the canonical `image_ref`, and the reconciler reads it from the `com.galaxy.engine_image_ref` label that the start service itself wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the `ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same container, same image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only issues `Upsert` while holding the lease, and only after re-reading the record finds it absent. A concurrent start would block on the same lease; a concurrent stop / restart would have moved the record out of "absent" by the time the reconciler re-reads.
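A sketch of the adopt-path construction. The label keys and the endpoint rule come from this section; the `ContainerSummary` and record shapes are illustrative, and the inspect fallback for `started_at` is abbreviated to "now":

```go
import (
	"fmt"
	"strconv"
	"time"
)

// Assumed shapes for illustration; the real types live in internal/ports
// and internal/runtime.
type ContainerSummary struct {
	ID     string
	Labels map[string]string
}
type RuntimeRecord struct {
	GameID, Status, CurrentContainerID, ImageRef string
	StatePath, DockerNetwork, EngineEndpoint     string
	StartedAt                                    time.Time
}

func adoptedRecord(gameID string, c ContainerSummary, statePath, network string) RuntimeRecord {
	started := time.Now() // the real worker falls back to InspectContainer here
	if ms, err := strconv.ParseInt(c.Labels["com.galaxy.started_at_ms"], 10, 64); err == nil {
		started = time.UnixMilli(ms) // label wins when present and parseable
	}
	return RuntimeRecord{
		GameID:             gameID,
		Status:             "running",
		CurrentContainerID: c.ID,
		ImageRef:           c.Labels["com.galaxy.engine_image_ref"], // written by the start service
		StatePath:          statePath,
		DockerNetwork:      network,
		EngineEndpoint:     fmt.Sprintf("http://galaxy-game-%s:8080", gameID),
		StartedAt:          started,
	}
}
```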
## 19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists `runtime_records.status='stopped'`, filters in process by `record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls `cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each candidate (a sketch of this boundary closes this document). The service already owns:

- the per-game lease around the Docker `Remove` call;
- the `stopped → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`, `op_source=auto_ttl`);
- the telemetry counter and structured log fields.

In-memory filtering is acceptable in v1 because the cardinality of `status=stopped` rows is bounded by Lobby's active-game count plus the retention period. The dedicated `(status, last_op_at)` index drives the underlying `ListByStatus(stopped)` query, so the database does the heavy lifting; the Go-side filter is microseconds per row.

The worker uses a small `Cleaner` interface in its own package rather than depending on `*cleanupcontainer.Service` directly. This keeps the worker's tests light — no need to construct Docker, lease, operation-log, and telemetry doubles just to verify the TTL math — while the production wiring still binds the real service via a compile-time interface assertion in `internal/app/wiring.go`.

## 20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The reconciler's mutations are dominated by `Get` + `Upsert` / `UpdateStatus` round-trips against PG plus an occasional Docker `InspectContainer`; the cleanup worker's mutations are dominated by the cleanup service's `docker.Remove` call. Parallelising either would multiply the load on the Docker daemon socket and the PG pool without buying anything that v1 cardinality demands.

## 21. Cross-module test boundary for the consumer integration test

[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go) covers the contract roundtrip without importing `lobby/internal/...`:

- it XADDs a start envelope in the AsyncAPI wire shape (the same shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for the persistence stores, the lease, and the notification / health publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the `error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher and the real Lobby consumer alongside RTM lives at `integration/lobbyrtm/`, which is the home for inter-service fixtures. Keeping the in-package test free of `lobby/...` imports avoids module-internal coupling and keeps `rtmanager`'s test suite buildable on its own.
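Finally, the worker-side boundary from §19, sketched with assumed signatures (the real `Cleaner` method set and store surface may differ):

```go
import (
	"context"
	"time"
)

// Cleaner is the worker-local seam; in production it is satisfied by
// *cleanupcontainer.Service via a compile-time assertion in wiring.
type Cleaner interface {
	Handle(ctx context.Context, gameID, opSource string) error
}

type stoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

type Worker struct {
	records interface {
		ListByStatus(ctx context.Context, status string) ([]stoppedRecord, error)
	}
	cleaner   Cleaner
	retention time.Duration
}

// tick performs one TTL pass: the DB narrows to stopped rows, Go applies the TTL.
func (w *Worker) tick(ctx context.Context, now time.Time) {
	records, err := w.records.ListByStatus(ctx, "stopped") // uses the (status, last_op_at) index
	if err != nil {
		return // next tick retries
	}
	cutoff := now.Add(-w.retention)
	for _, rec := range records { // sequential on purpose (§20)
		if rec.LastOpAt.Before(cutoff) {
			_ = w.cleaner.Handle(ctx, rec.GameID, "auto_ttl") // service owns lease, CAS, audit row
		}
	}
}

// In internal/app/wiring.go the binding is checked at compile time,
// along the lines of:
//
//	var _ containercleanup.Cleaner = (*cleanupcontainer.Service)(nil)
```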