galaxy-game/rtmanager/docs/workers.md
2026-04-28 20:39:18 +02:00
Background Workers

This document explains the design of the seven background workers under ../internal/worker/.

The current-state behaviour and configuration surface live in ../README.md (§Runtime Surface, §Health Monitoring, §Reconciliation), and operational notes are in runtime.md, flows.md, and runbook.md. This file records the rationale.

1. Single ownership per event_type

The runtime:health_events vocabulary is shared across five sources; each event type is owned by exactly one of them, with container_disappeared as the deliberate exception, split between the events listener and the reconciler.

| event_type | Owner |
| --- | --- |
| container_started | internal/service/startruntime |
| container_exited | internal/worker/dockerevents |
| container_oom | internal/worker/dockerevents |
| container_disappeared | internal/worker/dockerevents (external destroy) and internal/worker/reconcile (PG-drift) |
| inspect_unhealthy | internal/worker/dockerinspect |
| probe_failed | internal/worker/healthprobe |
| probe_recovered | internal/worker/healthprobe |

container_started is intentionally not duplicated by the events listener, even though Docker emits a start action whenever the start service runs the container. The start service already publishes the event with the same wire shape; observing the action in the listener would produce two entries per real start.

2. container_disappeared is conditional on PG state

The Docker events listener inspects the runtime record before emitting container_disappeared for a destroy action. Three suppression rules apply:

  • record missing → suppress (the destroyed container was never owned by RTM as a tracked runtime, so no consumer cares);
  • record status != running → suppress (RTM already finished a stop or cleanup; the destroy is the expected tail of that operation);
  • record current_container_id != event.ContainerID → suppress (RTM swapped to a new container through restart or patch; the destroy is the expected removal of the prior container id).

Only a destroy that arrives for a running record whose current_container_id still equals the event id is treated as unexpected. This is the wire-side analogue of the reconciler's PG-drift check: the reconciler observes "PG=running, no Docker container" while the events listener observes "Docker says destroy, PG still says running pointing at this container". Together they cover both directions of drift.

A read failure against runtime_records is treated conservatively as "suppress" — the listener cannot tell whether the destroy was external or RTM-initiated, and over-emitting container_disappeared would lead to a real consumer (Game Master) escalating a false positive.

3. die with exit code 0 is suppressed

docker stop (and graceful shutdowns via SIGTERM) produces a die event with exit code 0. The container_exited contract guarantees a non-zero exit; emitting on exit 0 would shower consumers with normal-stop noise. The listener silently drops the event; the operation log already records the stop on the caller side.

4. Inspect worker leaves container_disappeared to the reconciler

When dockerinspect calls InspectContainer and the daemon returns ports.ErrContainerNotFound, the worker logs at Debug and skips:

  • the reconciler is the single authority for PG-drift reconciliation. Adding a third source for container_disappeared would risk double emission and complicate the consumer story;
  • inspect ticks every 30 seconds; the reconciler ticks every 5 minutes. The latency window for "Docker drops the container, RTM notices" is therefore at most 5 minutes in v1, which is acceptable for the kinds of drift the reconciler covers (manual docker rm outside RTM, daemon restart with stale records). If a future requirement tightens the window, promoting the inspect-side observation to a real container_disappeared is a one-line change.

5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a map[string]*probeState guarded by a mutex. Two counters live there:

  • consecutiveFailures — incremented on every failed probe, reset on every success;
  • failurePublished — prevents repeated probe_failed emission while the failure persists, and triggers a single probe_recovered on the first success after the threshold was crossed.

The state is non-persistent. RTM is single-instance in v1, and a process restart that loses the counters merely re-establishes the hysteresis from scratch — the only consequence is that a probe failure already in progress at the moment of restart needs another full threshold of failures to surface. Making the state durable would add a Redis round-trip to every probe attempt without buying anything that operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the current running list and removes any state entry whose game_id is not in the list. A game that transitions through stopped → running again starts fresh; previously-accumulated counters do not bleed into the new lifecycle.
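
The two counters and the pruning rule can be sketched as follows; probeState carries the field names from the text, while tracker, observe, and prune are illustrative names for the surrounding machinery:

```go
package main

import (
	"fmt"
	"sync"
)

// probeState holds the per-game hysteresis counters described above.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

type tracker struct {
	mu        sync.Mutex
	threshold int
	states    map[string]*probeState
}

// observe returns the event a probe result should publish: "probe_failed"
// once when the threshold is crossed, "probe_recovered" once on the first
// success afterwards, and "" otherwise.
func (t *tracker) observe(gameID string, ok bool) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	st, found := t.states[gameID]
	if !found {
		st = &probeState{}
		t.states[gameID] = st
	}
	if ok {
		st.consecutiveFailures = 0
		if st.failurePublished {
			st.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	st.consecutiveFailures++
	if st.consecutiveFailures >= t.threshold && !st.failurePublished {
		st.failurePublished = true
		return "probe_failed"
	}
	return ""
}

// prune drops state for games absent from the running list, so a game
// that stops and starts again begins with fresh counters.
func (t *tracker) prune(running map[string]bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id := range t.states {
		if !running[id] {
			delete(t.states, id)
		}
	}
}

func main() {
	tr := &tracker{threshold: 3, states: map[string]*probeState{}}
	for i := 0; i < 3; i++ {
		fmt.Println(tr.observe("g1", false)) // "", "", then "probe_failed"
	}
	fmt.Println(tr.observe("g1", true)) // "probe_recovered"
}
```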

6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel semaphore (defaultMaxConcurrency = 16). Three reasons:

  • A single slow engine cannot delay the entire cohort. Sequential per-game probing would multiply the worst case by len(records), which is the wrong shape for what is fundamentally a fan-out observation pattern.
  • An unbounded fan-out (one goroutine per record per tick without a cap) was rejected to avoid pathological CPU and connection bursts if the running list ever grows beyond what RTM was sized for. 16 in-flight probes at the default 2s timeout fit a single RTM instance well within typical OS file-descriptor and TCP ephemeral-port limits.
  • The cap is a constant rather than an env var because RTM v1 is single-instance and the active-game count is bounded by Lobby; a configurable cap is something we promote to env if a real workload demands it.

The same reasoning argues against parallelism in the inspect worker: inspect calls are cheap (sub-ms in the local Docker socket case) and serial execution avoids unnecessary concurrency on the daemon socket.

7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the SDK channel terminates on any transport error (daemon restart, socket hiccup, connection reset). The listener's outer loop handles this by re-subscribing after a fixed defaultReconnectBackoff = 5s wait, indefinitely while ctx is alive.

Crashing the process on a transport error was rejected because losing a few seconds of health observations is a much smaller blast radius than losing the entire RTM process while the start/stop pipelines are running. The save-offset case is different: a lost offset replays the entire backlog and breaks correctness, while a missed health event is observation-only.

A subscription error is logged at Warn so operators can see the reconnect activity without it dominating the log volume.

8. Health publisher remains best-effort

Every emission goes through ports.HealthEventPublisher.Publish, the same surface the start service already uses (adapters.md §8). A publish failure logs at Error and proceeds; the worker does not retry, does not adjust its in-memory hysteresis, and does not surface the failure to the caller. The operation log is the source of truth for runtime state; the event stream is a best-effort notification surface to consumers.

9. Stream offset labels are stable identifiers

Both consumers persist their progress through ports.StreamOffsetStore under fixed labels — startjobs for the start-jobs consumer and stopjobs for the stop-jobs consumer. The labels match rtmanager:stream_offsets:{label} and stay stable when the underlying stream key is renamed via RTMANAGER_REDIS_START_JOBS_STREAM / RTMANAGER_REDIS_STOP_JOBS_STREAM, so an operator who points the consumer at a different stream key does not lose the persisted offset.

10. OpSource and SourceRef originate at the consumer boundary

Every consumed envelope is translated into a Service.Handle call with OpSource = operation.OpSourceLobbyStream. The opaque per-source SourceRef is the Redis Stream entry id (message.ID); the operation_log rows therefore record the originating envelope id, and restart / patch correlation logic (services.md §13) keeps working when those services are invoked indirectly.

11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. startruntime.Service returns Outcome=success, ErrorCode=replay_no_op when the existing record is already running with the same image_ref; stopruntime.Service does the same for an already-stopped or already-removed record. The consumer copies the result fields into the RuntimeJobResult payload verbatim and lets Lobby observe the replay through error_code.

The wire-shape consequences:

  • success + empty error_code → fresh start / fresh stop;
  • success + error_code=replay_no_op → idempotent replay. For start, the existing record carries container_id and engine_endpoint; for stop on status=removed, both fields are empty strings (the record was nulled by an earlier cleanup) — the AsyncAPI contract permits empty strings on these required fields;
  • failure + non-empty error_code → the start / stop service returned a zero Record; the consumer publishes empty container_id and engine_endpoint.
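
The three cases can be sketched as a classification over the copied result fields; jobResult and classify are hypothetical names, and the struct is a trimmed view of the RuntimeJobResult payload, not its full wire shape:

```go
package main

import "fmt"

// jobResult is a trimmed, illustrative view of the RuntimeJobResult
// payload the consumer publishes; field names follow the wire shape.
type jobResult struct {
	Outcome        string `json:"outcome"`
	ErrorCode      string `json:"error_code"`
	ContainerID    string `json:"container_id"`    // may be "" on replayed stop
	EngineEndpoint string `json:"engine_endpoint"` // may be "" on replayed stop
}

// classify names the three wire-shape cases Lobby can observe.
func classify(outcome, errorCode string) string {
	switch {
	case outcome == "success" && errorCode == "":
		return "fresh"
	case outcome == "success" && errorCode == "replay_no_op":
		return "replay"
	default:
		return "failure"
	}
}

func main() {
	fmt.Println(classify("success", ""))             // fresh
	fmt.Println(classify("success", "replay_no_op")) // replay
	fmt.Println(classify("failure", "docker_error")) // failure
}
```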

12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any go-level service error, and any publish failure; streamOffsetStore.Save runs unconditionally after each handled message. Pinning the offset on a single transient publish failure was rejected because the durable side effect (operation_log row, runtime_records mutation, Docker state) has already happened on the first pass; pinning the offset to retry the publish would duplicate audit rows for hours until the operator intervened.

The exception is streamOffsetStore.Save itself: a save failure returns a wrapped error from Run. The component supervisor in internal/app/app.go then exits the process and lets the operator escalate, because losing the offset would cause every subsequent restart to re-process every prior envelope.

13. requested_at_ms is logged-only

The AsyncAPI envelopes carry requested_at_ms from Lobby. The consumer parses it (rejecting unparseable values) but only includes the value in structured logs — the field is "used for diagnostics, not authoritative" per the contract. The service layer ignores it; the operation_log uses service.clock() for started_at / finished_at so Lobby's wall-clock skew never bleeds into RTM persistence.

14. Reconciler: per-game lease around every write

A running → removed mutation that races a restart's inner stop would clobber the restart's freshly-installed running record without any other guard. The reconciler honours the same per-game lease that the lifecycle services hold (services.md §1).

The reconciler splits its work into two phases:

  • Read pass — lockless. docker.List({com.galaxy.owner=rtmanager}) followed by RuntimeRecords.ListByStatus(running). No lease is taken; both reads are point-in-time observations of independent systems and a stale view here only delays a mutation by one tick.
  • Write pass — lease-guarded. Every drift mutation (adoptOne / disposeOne / observedExitedOne) acquires the per-game lease, re-reads the record under the lease, and then either applies the mutation or returns when state has changed. A lease conflict (acquired=false) is logged at info and the game is silently skipped — the next tick will retry. A lease-store error is logged at warn; the rest of the pass continues.

The re-read after lease acquisition is intentional: the read pass is lockless, so by the time the lease is held the runtime record may have moved. UpdateStatus already provides CAS via ExpectedFrom + ExpectedContainerID, but Upsert (used for adopt) does not, so the explicit re-read keeps the three paths uniform and makes the skip condition obvious in code review.
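
The write-pass shape, using the dispose path as the example, can be sketched as below. The lease and records interfaces are hypothetical trimmed versions of the real ports, and the string result stands in for the worker's log lines:

```go
package main

import "fmt"

// Hypothetical trimmed ports; Acquire returns acquired=false on conflict.
type lease interface {
	Acquire(gameID string) (acquired bool, release func(), err error)
}
type records interface {
	Get(gameID string) (status string, found bool)
	UpdateStatus(gameID, from, to string) bool // CAS on the status column
}

// disposeOne: take the per-game lease, re-read under it, mutate or skip.
func disposeOne(gameID string, l lease, r records) string {
	acquired, release, err := l.Acquire(gameID)
	if err != nil {
		return "lease_error" // warn; the rest of the pass continues
	}
	if !acquired {
		return "lease_busy" // info; the next tick retries
	}
	defer release()
	status, found := r.Get(gameID) // re-read: the read pass was lockless
	if !found || status != "running" {
		return "skip" // state moved between read pass and lease
	}
	if !r.UpdateStatus(gameID, "running", "removed") {
		return "cas_miss"
	}
	return "disposed"
}

// In-memory fakes for the demo below.
type fakeLease struct{ busy bool }

func (f fakeLease) Acquire(string) (bool, func(), error) { return !f.busy, func() {}, nil }

type fakeRecords struct{ status string }

func (f *fakeRecords) Get(string) (string, bool) { return f.status, f.status != "" }
func (f *fakeRecords) UpdateStatus(_, from, to string) bool {
	if f.status != from {
		return false
	}
	f.status = to
	return true
}

func main() {
	r := &fakeRecords{status: "running"}
	fmt.Println(disposeOne("g1", fakeLease{}, r), r.status) // disposed removed
	fmt.Println(disposeOne("g1", fakeLease{busy: true}, r)) // lease_busy
}
```

The lease-conflict and moved-state branches both return without mutating anything, which is exactly the "silently skipped, next tick retries" behaviour described above.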

15. Three drift kinds covered by the reconciler

  • adopt — Docker reports a container labelled com.galaxy.owner=rtmanager for which RTM has no record; insert a fresh runtime_records row with op_kind=reconcile_adopt and never stop or remove the container (operators may have started it manually for diagnostics).
  • dispose — RTM has a running record whose container is missing in Docker; mark status=removed, publish container_disappeared, append op_kind=reconcile_dispose.
  • observed_exited — RTM has a running record whose container exists but is in exited; mark status=stopped, publish container_exited with the observed exit code. This third path exists because the events listener sees only live events; a container that died while RTM was offline would otherwise stay running indefinitely. The drift is exposed through rtmanager.reconcile_drift{kind=observed_exited} and through the container_exited health event; no operation_log entry is written because the audit log records explicit RTM operations, not passive observations of Docker state.

16. stopped_at = now (reconciler observation time)

The observed_exited path writes stopped_at = now, where now is the reconciler's observation time. The persistence adapter (postgres-migration.md §8) hard-codes stopped_at = now for the stopped destination — there is no port-level knob for an explicit timestamp, and the reconciler does not read State.FinishedAt from Docker.

The trade-off: stopped_at diverges from the daemon's State.FinishedAt by at most one tick interval (default 5 minutes). If a downstream consumer ever needs the daemon-observed exit timestamp, the upgrade path is a one-call extension of UpdateStatusInput with an optional StoppedAt *time.Time field; that change is deferred until a consumer materialises.

17. Synchronous initial pass + periodic Component

README §Startup dependencies step 6 demands "Reconciler runs once and blocks until done" before background workers start, but app.App.Run starts every registered Component concurrently — component ordering does not translate into start ordering.

The reconciler exposes a public ReconcileNow(ctx) method that the runtime calls synchronously between newWiring and app.New. The same *Reconciler is then registered as a Component; its Run only ticks (no immediate pass) so the startup work is not duplicated. The cost is one public method on the worker; the benefit is that the README invariant holds verbatim and the periodic loop is a textbook Component.

18. Adopt through Upsert, race with start is benign

The adopt path constructs a fresh runtime.RuntimeRecord (status running, container id and image_ref from labels, started_at from com.galaxy.started_at_ms or inspect, state path and docker network from configuration, engine endpoint from the http://galaxy-game-{game_id}:8080 rule) and calls RuntimeRecords.Upsert.

Race scenario: the start service has called docker.Run but has not yet finished its own Upsert when the reconciler observes the container without a record. Both writers eventually arrive at PG with the same key data — the start service knows the canonical image_ref, but the reconciler reads it from the com.galaxy.engine_image_ref label that the start service itself wrote. The CAS-free overwrite is therefore benign:

  • created_at is preserved across upserts by the ON CONFLICT DO UPDATE clause, so the "first time RTM saw this game" timestamp stays stable regardless of which writer lands last;
  • all other fields in this race carry identical values (same container, same image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only issues Upsert while holding the lease, and only after re-reading the record finds it absent. Concurrent start would block on the same lease; concurrent stop / restart would have moved the record out of "absent" by the time the reconciler re-reads.

19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists runtime_records.status='stopped', filters in process by record.LastOpAt.Before(now - cfg.Container.Retention), and calls cleanupcontainer.Service.Handle with OpSource=auto_ttl for each candidate. The service already owns:

  • the per-game lease around the Docker Remove call;
  • the running → removed CAS via UpdateStatus;
  • the operation_log entry (op_kind=cleanup_container, op_source=auto_ttl);
  • the telemetry counter and structured log fields.

In-memory filtering is acceptable in v1 because the cardinality of status=stopped rows is bounded by Lobby's active-game count and the retention window. The dedicated (status, last_op_at) index drives the underlying ListByStatus(stopped) query so the database does the heavy lifting; the Go-side filter is microseconds-per-row.

The worker uses a small Cleaner interface in its own package rather than depending on *cleanupcontainer.Service directly. This keeps the worker's tests light — no need to construct Docker, lease, operation-log, and telemetry doubles just to verify TTL math — while the production wiring still binds the real service via a compile-time interface assertion in internal/app/wiring.go.

20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The reconciler's mutations are dominated by Get + Upsert / UpdateStatus round-trips against PG plus an occasional Docker InspectContainer; the cleanup worker's mutations are dominated by the cleanup service's docker.Remove call. Parallelising either would multiply the load on the Docker daemon socket and the PG pool without buying anything that v1 cardinality demands.

21. Cross-module test boundary for the consumer integration test

../internal/worker/startjobsconsumer/integration_test.go covers the contract roundtrip without importing lobby/internal/...:

  • it XADDs a start envelope in the AsyncAPI wire shape (the same shape Lobby's runtimemanager.Publisher writes);
  • it runs the real startruntime.Service against in-memory fakes for the persistence stores, the lease, and the notification / health publishers, plus a gomock-backed ports.DockerClient;
  • it lets the real jobresultspublisher.Publisher write to runtime:job_results;
  • it reads the resulting entry and asserts the symmetric wire shape;
  • it then XADDs the same envelope a second time and asserts the error_code=replay_no_op outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher and the real Lobby consumer alongside RTM lives at integration/lobbyrtm/, which is the home for inter-service fixtures. Keeping the in-package test free of lobby/... imports avoids module-internal coupling and keeps rtmanager's test suite buildable on its own.