Background Workers
This document explains the design of the seven background workers
under ../internal/worker/:

- `startjobsconsumer` and `stopjobsconsumer` — async consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- `dockerevents` — Docker `/events` subscription;
- `dockerinspect` — periodic `InspectContainer` worker;
- `healthprobe` — active HTTP `/healthz` probe;
- `reconcile` — startup + periodic drift reconciliation;
- `containercleanup` — periodic TTL cleanup.
The current-state behaviour and configuration surface live in
../README.md (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
runtime.md, flows.md, and
runbook.md. This file records the rationale.
1. Single ownership per event_type
The `runtime:health_events` vocabulary is shared across five sources;
each event type has a designated owner (only `container_disappeared`
has two, split by drift direction).
| event_type | Owner |
|---|---|
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |
container_started is intentionally not duplicated by the events
listener, even though Docker emits a start action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.
2. container_disappeared is conditional on PG state
The Docker events listener inspects the runtime record before emitting
container_disappeared for a destroy action. Three suppression rules
apply:
- record missing → suppress (the destroyed container was never owned by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM swapped to a new container through restart or patch; the destroy is the expected removal of the prior container id).
Only a destroy that arrives for a running record whose
current_container_id still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.
A read failure against runtime_records is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting container_disappeared would lead
to a real consumer (Game Master) escalating a false positive.
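The three suppression rules plus the conservative read-failure default can be sketched as one predicate. This is an illustrative sketch, not the real `dockerevents` code: the `runtimeRecord` shape and the `found` flag are assumptions standing in for the actual record type and store lookup.

```go
package main

import "fmt"

// runtimeRecord is an illustrative stand-in for the runtime_records row.
type runtimeRecord struct {
	Status             string
	CurrentContainerID string
}

// shouldEmitDisappeared applies the suppression rules; found=false
// covers both "record missing" and, conservatively, a failed read.
func shouldEmitDisappeared(rec runtimeRecord, found bool, destroyedID string) bool {
	switch {
	case !found: // never tracked by RTM, or unreadable: suppress
		return false
	case rec.Status != "running": // expected tail of a stop or cleanup
		return false
	case rec.CurrentContainerID != destroyedID: // container was swapped
		return false
	default: // running record still pointing at the destroyed id
		return true
	}
}

func main() {
	rec := runtimeRecord{Status: "running", CurrentContainerID: "abc123"}
	fmt.Println(shouldEmitDisappeared(rec, true, "abc123")) // true
	fmt.Println(shouldEmitDisappeared(rec, true, "def456")) // false: swapped
}
```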
3. die with exit code 0 is suppressed
docker stop (and graceful shutdowns via SIGTERM) produces a die
event with exit code 0. The container_exited contract guarantees a
non-zero exit; emitting on exit 0 would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.
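The filter reduces to a single pairing of action and exit code; a minimal sketch, with the function name invented for illustration:

```go
package main

import "fmt"

// shouldEmitExited mirrors the contract: only a die with a non-zero
// exit code becomes container_exited; exit 0 is normal-stop noise.
func shouldEmitExited(action string, exitCode int) bool {
	return action == "die" && exitCode != 0
}

func main() {
	fmt.Println(shouldEmitExited("die", 137)) // true: abnormal exit
	fmt.Println(shouldEmitExited("die", 0))   // false: docker stop / SIGTERM
}
```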
4. Inspect worker leaves container_disappeared to the reconciler
When dockerinspect calls InspectContainer and the daemon returns
ports.ErrContainerNotFound, the worker logs at Debug and skips:
- the reconciler is the single authority for PG-drift reconciliation.
  Adding a third source for `container_disappeared` would risk double
  emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
  minutes. The latency window for "Docker drops the container, RTM
  notices" is therefore at most 5 minutes in v1, which is acceptable
  for the kinds of drift the reconciler covers (manual `docker rm`
  outside RTM, daemon restart with stale records). If a future
  requirement tightens the window, promoting the inspect-side
  observation to a real `container_disappeared` is a one-line change.
5. Probe hysteresis is in-memory and pruned per tick
The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:

- `consecutiveFailures` — incremented on every failed probe, reset on every success;
- `failurePublished` — prevents repeated `probe_failed` emission while the failure persists, and triggers a single `probe_recovered` on the first success after the threshold was crossed.
The state is non-persistent. RTM is single-instance in v1, and a process restart that loses the counters merely re-establishes the hysteresis from scratch — the only consequence is that a probe failure already in progress at the moment of restart needs another full threshold of failures to surface. Making the state durable would add a Redis round-trip to every probe attempt without buying anything that operators or downstream consumers depend on.
State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose game_id is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.
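The counters and the per-tick pruning can be sketched as follows. This is a simplified sketch: the threshold value is an assumption, the real worker guards the map with a mutex, and the event names are the only pieces taken directly from the contract.

```go
package main

import "fmt"

// failureThreshold is an assumed value; the real worker reads it
// from configuration.
const failureThreshold = 3

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe applies one probe result and returns the event to publish,
// if any: "probe_failed", "probe_recovered", or "".
func (s *probeState) observe(ok bool) string {
	if !ok {
		s.consecutiveFailures++
		if s.consecutiveFailures >= failureThreshold && !s.failurePublished {
			s.failurePublished = true
			return "probe_failed"
		}
		return ""
	}
	s.consecutiveFailures = 0
	if s.failurePublished {
		s.failurePublished = false
		return "probe_recovered"
	}
	return ""
}

// prune drops state for games no longer in the running list, so a
// stopped → running game starts with fresh counters.
func prune(states map[string]*probeState, running map[string]bool) {
	for id := range states {
		if !running[id] {
			delete(states, id)
		}
	}
}

func main() {
	s := &probeState{}
	for i := 0; i < 3; i++ {
		if ev := s.observe(false); ev != "" {
			fmt.Println(ev) // prints "probe_failed" on the third failure only
		}
	}
	fmt.Println(s.observe(true)) // prints "probe_recovered"
}
```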
6. Probe concurrency is bounded by a fixed cap
Probes inside one tick run in parallel through a buffered-channel
semaphore (defaultMaxConcurrency = 16). Three reasons:
- A single slow engine cannot delay the entire cohort. Sequential
  per-game probing would multiply the worst case by `len(records)`,
  which is the wrong shape for what is fundamentally a fan-out
  observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
  cap) was rejected to avoid pathological CPU and connection bursts if
  the running list ever grows beyond what RTM was sized for. 16
  in-flight probes at the default 2s timeout fit a single RTM instance
  well within typical OS file-descriptor and TCP ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
  single-instance and the active-game count is bounded by Lobby; a
  configurable cap is something we promote to an env var if a real
  workload demands it.
The same reasoning argues against parallelism in the inspect worker: inspect calls are cheap (sub-ms in the local Docker socket case) and serial execution avoids unnecessary concurrency on the daemon socket.
7. Events listener reconnects with fixed backoff
The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed defaultReconnectBackoff = 5s wait,
indefinitely while ctx is alive.
Crashing the process on a transport error was rejected because losing a few seconds of health observations is a much smaller blast radius than losing the entire RTM process while the start/stop pipelines are running. The save-offset case is different: a lost offset replays the entire backlog and breaks correctness, while a missed health event is observation-only.
A subscription error is logged at Warn so operators can see the
reconnect activity without it dominating the log volume.
8. Health publisher remains best-effort
Every emission goes through ports.HealthEventPublisher.Publish, the
same surface the start service already uses
(adapters.md §8). A publish failure logs at Error
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.
9. Stream offset labels are stable identifiers
Both consumers persist their progress through
ports.StreamOffsetStore under fixed labels — startjobs for the
start-jobs consumer and stopjobs for the stop-jobs consumer. The
labels map to the Redis keys rtmanager:stream_offsets:{label} and stay stable when
the underlying stream key is renamed via
RTMANAGER_REDIS_START_JOBS_STREAM /
RTMANAGER_REDIS_STOP_JOBS_STREAM, so an operator who points the
consumer at a different stream key does not lose the persisted offset.
10. OpSource and SourceRef originate at the consumer boundary
Every consumed envelope is translated into a Service.Handle call
with OpSource = operation.OpSourceLobbyStream. The opaque per-source
SourceRef is the Redis Stream entry id (message.ID); the
operation_log rows therefore record the originating envelope id, and
restart / patch correlation logic (services.md §13)
keeps working when those services are invoked indirectly.
11. Replay-no-op detection lives in the service layer
The consumer does not detect replays itself. startruntime.Service
returns Outcome=success, ErrorCode=replay_no_op when the existing
record is already running with the same image_ref;
stopruntime.Service does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the RuntimeJobResult payload verbatim and lets Lobby observe the
replay through error_code.
The wire-shape consequences:

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For start, the existing record carries `container_id` and `engine_endpoint`; for stop on `status=removed`, both fields are empty strings (the record was nulled by an earlier cleanup) — the AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service returned a zero `Record`; the consumer publishes empty `container_id` and `engine_endpoint`.
12. Per-message errors are absorbed; the offset always advances
The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; streamOffsetStore.Save runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.
The exception is streamOffsetStore.Save itself: a save failure
returns a wrapped error from Run. The component supervisor in
internal/app/app.go then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.
13. requested_at_ms is logged-only
The AsyncAPI envelopes carry requested_at_ms from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses service.clock() for started_at / finished_at
so Lobby's wall-clock skew never bleeds into RTM persistence.
14. Reconciler: per-game lease around every write
A running → removed mutation that races a restart's inner stop
would clobber the restart's freshly-installed running record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold (services.md §1).
The reconciler splits its work into two phases:
- Read pass — lockless. `docker.List({com.galaxy.owner=rtmanager})`
  followed by `RuntimeRecords.ListByStatus(running)`. No lease is
  taken; both reads are point-in-time observations of independent
  systems and a stale view here only delays a mutation by one tick.
- Write pass — lease-guarded. Every drift mutation
  (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
  per-game lease, re-reads the record under the lease, and then either
  applies the mutation or returns when state has changed. A lease
  conflict (`acquired=false`) is logged at `info` and the game is
  silently skipped — the next tick will retry. A lease-store error is
  logged at `warn`; the rest of the pass continues.
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. UpdateStatus already provides CAS via
ExpectedFrom + ExpectedContainerID, but Upsert (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.
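The write-pass shape, sketched with illustrative names (the `record` type and the four callbacks stand in for the runtime record, the lease store, and the drift mutation):

```go
package main

import "fmt"

// record is an illustrative stand-in for the runtime_records row.
type record struct{ Status string }

// disposeOne acquires the per-game lease, re-reads under it, and
// applies the mutation only if the record is still in the drifted state.
func disposeOne(gameID string,
	acquire func(string) (bool, error),
	release func(string),
	get func(string) (record, bool),
	apply func(record),
) {
	acquired, err := acquire(gameID)
	if err != nil {
		fmt.Println("lease store error, continuing pass:", err) // warn
		return
	}
	if !acquired {
		fmt.Println("lease conflict, retry next tick") // info
		return
	}
	defer release(gameID)
	rec, found := get(gameID) // re-read: the read pass was lockless
	if !found || rec.Status != "running" {
		return // state moved since the read pass; skip
	}
	apply(rec)
}

func main() {
	applied := false
	disposeOne("game-1",
		func(string) (bool, error) { return true, nil },
		func(string) {},
		func(string) (record, bool) { return record{Status: "running"}, true },
		func(record) { applied = true },
	)
	fmt.Println(applied) // true
}
```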
15. Three drift kinds covered by the reconciler
- `adopt` — Docker reports a container labelled
  `com.galaxy.owner=rtmanager` for which RTM has no record; insert a
  fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
  stop or remove the container (operators may have started it manually
  for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing in
  Docker; mark `status=removed`, publish `container_disappeared`,
  append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
  exists but is in `exited`; mark `status=stopped`, publish
  `container_exited` with the observed exit code. This third path
  exists because the events listener sees only live events; a
  container that died while RTM was offline would otherwise stay
  `running` indefinitely. The drift is exposed through
  `rtmanager.reconcile_drift{kind=observed_exited}` and through the
  `container_exited` health event; no `operation_log` entry is written
  because the audit log records explicit RTM operations, not passive
  observations of Docker state.
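The three kinds can be read as a mapping from the two observations. A sketch with invented names; it assumes the record, when present, is in `running` status (the only status the reconciler's write pass acts on):

```go
package main

import "fmt"

// classifyDrift maps the pair (running record present?, Docker
// container state) to the drift kind; "" means no drift.
func classifyDrift(hasRecord bool, containerState string) string {
	switch {
	case !hasRecord && containerState != "":
		return "adopt" // owned container, no record
	case hasRecord && containerState == "":
		return "dispose" // running record, container gone
	case hasRecord && containerState == "exited":
		return "observed_exited" // running record, dead container
	default:
		return "" // record and live container agree
	}
}

func main() {
	fmt.Println(classifyDrift(true, "exited")) // observed_exited
}
```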
16. stopped_at = now (reconciler observation time)
The observed_exited path writes stopped_at = now, where now is
the reconciler's observation time. The persistence adapter
(postgres-migration.md §8) hard-codes
stopped_at = now for the stopped destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read State.FinishedAt from Docker.
The trade-off: stopped_at diverges from the daemon's
State.FinishedAt by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
UpdateStatusInput with an optional StoppedAt *time.Time field;
that change is deferred until a consumer materialises.
17. Synchronous initial pass + periodic Component
README §Startup dependencies step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
app.App.Run starts every registered Component concurrently —
component ordering does not translate into start ordering.
The reconciler exposes a public ReconcileNow(ctx) method that the
runtime calls synchronously between newWiring and app.New. The
same *Reconciler is then registered as a Component; its Run
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
Component.
18. Adopt through Upsert, race with start is benign
The adopt path constructs a fresh runtime.RuntimeRecord (status
running, container id and image_ref from labels, started_at from
com.galaxy.started_at_ms or inspect, state path and docker network
from configuration, engine endpoint from the
http://galaxy-game-{game_id}:8080 rule) and calls
RuntimeRecords.Upsert.
Race scenario: the start service has called docker.Run but has not
yet finished its own Upsert when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
image_ref, but the reconciler reads it from the
com.galaxy.engine_image_ref label that the start service itself
wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the `ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same container, same image, same hostname, same state path).
Under the per-game lease this is doubly safe: the reconciler only
issues Upsert while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.
19. Cleanup worker delegates to the service
The TTL-cleanup worker is intentionally tiny: it lists
runtime_records.status='stopped', filters in process by
record.LastOpAt.Before(now - cfg.Container.Retention), and calls
cleanupcontainer.Service.Handle with OpSource=auto_ttl for each
candidate. The service already owns:
- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`, `op_source=auto_ttl`);
- the telemetry counter and structured log fields.
In-memory filtering is acceptable in v1 because the cardinality of
status=stopped rows is bounded by Lobby's active-game count plus
retention period. The dedicated (status, last_op_at) index drives
the underlying ListByStatus(stopped) query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.
The worker uses a small Cleaner interface in its own package rather
than depending on *cleanupcontainer.Service directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in internal/app/wiring.go.
20. Sequential per-game work in reconciler and cleanup
Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by Get + Upsert /
UpdateStatus round-trips against PG plus an occasional Docker
InspectContainer; the cleanup worker's mutations are dominated by
the cleanup service's docker.Remove call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.
21. Cross-module test boundary for the consumer integration test
../internal/worker/startjobsconsumer/integration_test.go
covers the contract roundtrip without importing
lobby/internal/...:
- it XADDs a start envelope in the AsyncAPI wire shape (the same
  shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
  the persistence stores, the lease, and the notification / health
  publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
  `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
  `error_code=replay_no_op` outcome with no further Docker calls.
The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
integration/lobbyrtm/, which is the home for inter-service
fixtures. Keeping the in-package test free of lobby/... imports
avoids module-internal coupling and keeps rtmanager's test suite
buildable on its own.