# Background Workers

This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):

- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
  [`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
  consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
  subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
  `InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
  `/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
  drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
  periodic TTL cleanup.

The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.

## 1. Single ownership per `event_type`

The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.

## 2. `container_disappeared` is conditional on PG state

The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply (sketched below):

- record missing → suppress (the destroyed container was never owned
  by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
  or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
  swapped to a new container through restart or patch; the destroy is
  the expected removal of the prior container id).

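A minimal sketch of that decision, assuming hypothetical record and
event shapes (the real types in `dockerevents` may differ):

```go
// Hypothetical shapes: stand-ins for the listener's real record and event types.
type runtimeRecord struct {
	Status             string // "running", "stopped", "removed", ...
	CurrentContainerID string
}

type destroyEvent struct {
	GameID      string
	ContainerID string
}

// shouldEmitDisappeared applies the three suppression rules: a missing
// record, a non-running record, or a record already pointing at a
// different container all mean the destroy was expected (or untracked).
func shouldEmitDisappeared(rec *runtimeRecord, ev destroyEvent) bool {
	if rec == nil {
		return false // never tracked by RTM as a runtime
	}
	if rec.Status != "running" {
		return false // expected tail of a stop / cleanup
	}
	if rec.CurrentContainerID != ev.ContainerID {
		return false // record already swapped to a newer container
	}
	return true // running record still points here: treat as external destroy
}
```
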
Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.

A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.

## 3. `die` with exit code `0` is suppressed

`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.

## 4. Inspect worker leaves `container_disappeared` to the reconciler

When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips
(sketched after this list):

- the reconciler is the single authority for PG-drift reconciliation.
  Adding a third source for `container_disappeared` would risk double
  emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
  minutes. The latency window for "Docker drops the container, RTM
  notices" is therefore at most 5 minutes in v1, which is acceptable
  for the kinds of drift the reconciler covers (manual `docker rm`
  outside RTM, daemon restart with stale records). If a future
  requirement tightens the window, promoting the inspect-side
  observation to a real `container_disappeared` is a one-line change.

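A sketch of the per-record skip, with a stand-in interface and error
sentinel in place of the real `ports` package; the signatures here are
assumptions, not the real ones:

```go
package dockerinspect

import (
	"context"
	"errors"
	"log/slog"
)

// errContainerNotFound stands in for ports.ErrContainerNotFound.
var errContainerNotFound = errors.New("container not found")

// inspector stands in for the slice of the Docker port this worker uses.
type inspector interface {
	InspectContainer(ctx context.Context, containerID string) (healthy bool, err error)
}

// inspectOne handles a single running record inside one tick: a missing
// container is the reconciler's problem, so the worker only leaves a
// Debug trace and moves on.
func inspectOne(ctx context.Context, docker inspector, log *slog.Logger, gameID, containerID string) {
	healthy, err := docker.InspectContainer(ctx, containerID)
	switch {
	case errors.Is(err, errContainerNotFound):
		log.Debug("container not found during inspect", "game_id", gameID)
	case err != nil:
		log.Warn("inspect failed", "game_id", gameID, "error", err)
	case !healthy:
		// the real worker publishes inspect_unhealthy here (omitted)
	}
}
```
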
## 5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there
(sketched below):

- `consecutiveFailures` — incremented on every failed probe, reset on
  every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
  the failure persists, and triggers a single `probe_recovered` on the
  first success after the threshold was crossed.

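A sketch of the two counters and the transition they drive; the
`observe` helper and its threshold argument are illustrative, and the
real `probeState` may carry more fields:

```go
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe folds one probe result into the per-game state and reports
// which event (if any) to publish: "probe_failed", "probe_recovered",
// or "" for no emission.
func (s *probeState) observe(ok bool, threshold int) string {
	if !ok {
		s.consecutiveFailures++
		if s.consecutiveFailures >= threshold && !s.failurePublished {
			s.failurePublished = true
			return "probe_failed" // exactly once per failure episode
		}
		return ""
	}
	s.consecutiveFailures = 0
	if s.failurePublished {
		s.failurePublished = false
		return "probe_recovered" // first success after the threshold was crossed
	}
	return ""
}
```
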
The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.

## 6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`; sketched after the list
below). Three reasons:

- A single slow engine cannot delay the entire cohort. Sequential
  per-game probing would multiply the worst case by `len(records)`,
  which is the wrong shape for what is fundamentally a fan-out
  observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
  cap) was rejected to avoid pathological CPU and connection bursts
  if the running list ever grows beyond what RTM was sized for. 16
  in-flight probes at the default 2s timeout fit a single RTM
  instance well within typical OS file-descriptor and TCP
  ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
  single-instance and the active-game count is bounded by Lobby; a
  configurable cap is something we promote to env if a real workload
  demands it.

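The fan-out itself is the standard buffered-channel-semaphore shape; a
sketch with the actual HTTP probe reduced to a callback (`probeAll` and
`probeOne` are illustrative names, `defaultMaxConcurrency` is the real
constant):

```go
package healthprobe

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

// probeAll fans out one probe per game id, never letting more than
// defaultMaxConcurrency probes run at once, and returns only when the
// whole cohort has finished. probeOne stands in for the real HTTP
// GET /healthz call with its 2s timeout.
func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	sem := make(chan struct{}, defaultMaxConcurrency) // counting semaphore
	var wg sync.WaitGroup
	for _, id := range gameIDs {
		wg.Add(1)
		sem <- struct{}{} // blocks once the cap is reached
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			probeOne(ctx, id)
		}(id)
	}
	wg.Wait() // the tick ends only when every probe has returned
}
```
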
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.

## 7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while `ctx` is alive.

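A sketch of that outer loop, with the subscription and per-event
handling reduced to a stand-in function; apart from
`defaultReconnectBackoff`, the names are illustrative:

```go
package dockerevents

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

// run re-subscribes forever: subscribe blocks while the stream is healthy
// and returns when the SDK channel dies; after a fixed backoff the loop
// tries again, until ctx is cancelled.
func run(ctx context.Context, log *slog.Logger, subscribe func(context.Context) error) {
	for {
		if err := subscribe(ctx); err != nil {
			log.Warn("events subscription ended", "error", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(defaultReconnectBackoff):
			// fixed 5s wait, then re-subscribe
		}
	}
}
```
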
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.

A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.

## 8. Health publisher remains best-effort

Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.

## 9. Stream offset labels are stable identifiers

Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels match `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.

## 10. `OpSource` and `SourceRef` originate at the consumer boundary

Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.

## 11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.

The wire-shape consequences (illustrated in the sketch after this
list):

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
  start, the existing record carries `container_id` and
  `engine_endpoint`; for stop on `status=removed`, both fields are
  empty strings (the record was nulled by an earlier cleanup) — the
  AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
  returned a zero `Record`; the consumer publishes empty
  `container_id` and `engine_endpoint`.

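As a sketch of the "copy verbatim" rule; the field names here are
illustrative, not the AsyncAPI schema:

```go
// Hypothetical shapes: the real service result and RuntimeJobResult
// payload fields may differ from these names.
type result struct {
	Outcome        string // "success" or "failure"
	ErrorCode      string // "", "replay_no_op", or a failure code
	ContainerID    string
	EngineEndpoint string
}

type jobResultPayload struct {
	GameID         string
	Outcome        string
	ErrorCode      string
	ContainerID    string // may be "" on failure or stop-after-cleanup
	EngineEndpoint string // may be "" on failure or stop-after-cleanup
}

// toPayload copies the service result verbatim: no replay detection and
// no re-interpretation happens at the consumer boundary.
func toPayload(gameID string, r result) jobResultPayload {
	return jobResultPayload{
		GameID:         gameID,
		Outcome:        r.Outcome,
		ErrorCode:      r.ErrorCode,
		ContainerID:    r.ContainerID,
		EngineEndpoint: r.EngineEndpoint,
	}
}
```
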
## 12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.

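A sketch of one loop iteration under that policy, with the decode /
handle / publish / save chain reduced to stand-in closures (names are
illustrative):

```go
package startjobsconsumer

import (
	"context"
	"fmt"
	"log/slog"
)

// handleOne sketches one loop iteration: every per-message error is
// logged and absorbed, and the offset save runs unconditionally after
// the message was handled. Only the save itself is allowed to fail the
// worker, because a lost offset replays the whole backlog.
func handleOne(ctx context.Context, msgID string, log *slog.Logger,
	handle, publish func(context.Context) error,
	save func(context.Context, string) error,
) error {
	if err := handle(ctx); err != nil {
		log.Error("handling job failed", "message_id", msgID, "error", err)
	}
	if err := publish(ctx); err != nil {
		log.Error("publishing job result failed", "message_id", msgID, "error", err)
	}
	if err := save(ctx, msgID); err != nil {
		return fmt.Errorf("save stream offset for %s: %w", msgID, err)
	}
	return nil
}
```
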
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.

## 13. `requested_at_ms` is logged-only

The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.

## 14. Reconciler: per-game lease around every write

A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).

The reconciler splits its work into two phases (the write pass is
sketched below):

- **Read pass — lockless.**
  `docker.List({com.galaxy.owner=rtmanager})` followed by
  `RuntimeRecords.ListByStatus(running)`. No lease is taken; both
  reads are point-in-time observations of independent systems and a
  stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
  (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
  per-game lease, re-reads the record under the lease, and then
  either applies the mutation or returns when state has changed.
  A lease conflict (`acquired=false`) is logged at `info` and the
  game is silently skipped — the next tick will retry. A lease-store
  error is logged at `warn`; the rest of the pass continues.

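A sketch of one lease-guarded mutation (the dispose path), using
stand-in port interfaces; the real ports carry richer inputs, CAS
expectations, the health-event publish, and the operation-log append,
all omitted here:

```go
package reconcile

import (
	"context"
	"log/slog"
)

// Stand-in ports; the real interfaces carry richer inputs and CAS fields.
type leaseStore interface {
	Acquire(ctx context.Context, gameID string) (acquired bool, release func(), err error)
}

type recordStore interface {
	Get(ctx context.Context, gameID string) (status string, found bool, err error)
	MarkRemoved(ctx context.Context, gameID string) error
}

type reconciler struct {
	lease   leaseStore
	records recordStore
	log     *slog.Logger
}

// disposeOne sketches one lease-guarded mutation: acquire, re-read, act.
func (r *reconciler) disposeOne(ctx context.Context, gameID string) error {
	acquired, release, err := r.lease.Acquire(ctx, gameID)
	if err != nil {
		r.log.Warn("lease store error", "game_id", gameID, "error", err)
		return nil // the rest of the pass continues; next tick retries
	}
	if !acquired {
		r.log.Info("lease held elsewhere, skipping", "game_id", gameID)
		return nil
	}
	defer release()

	// Re-read under the lease: the lockless read pass may be stale and
	// the record may have moved (stopped, removed, or container swapped).
	status, found, err := r.records.Get(ctx, gameID)
	if err != nil || !found || status != "running" {
		return err // state moved since the read pass; nothing to do
	}
	return r.records.MarkRemoved(ctx, gameID)
}
```
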
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.

## 15. Three drift kinds covered by the reconciler

- `adopt` — Docker reports a container labelled
  `com.galaxy.owner=rtmanager` for which RTM has no record; insert a
  fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
  stop or remove the container (operators may have started it
  manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
  in Docker; mark `status=removed`, publish
  `container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
  exists but is in `exited`; mark `status=stopped`, publish
  `container_exited` with the observed exit code. This third path
  exists because the events listener sees only live events; a
  container that died while RTM was offline would otherwise stay
  `running` indefinitely. The drift is exposed through
  `rtmanager.reconcile_drift{kind=observed_exited}` and through the
  `container_exited` health event; no `operation_log` entry is
  written because the audit log records explicit RTM operations, not
  passive observations of Docker state.

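A sketch of how a tick might classify the two lockless reads into those
three kinds. The inputs are simplified (only running records are listed
from PG, so the adopt decision is re-checked under the lease before any
write) and the names are illustrative:

```go
type driftKind string

const (
	driftAdopt          driftKind = "adopt"
	driftDispose        driftKind = "dispose"
	driftObservedExited driftKind = "observed_exited"
)

// classify maps game id → drift kind from the two lockless reads:
// dockerState holds the owned containers Docker reported and their state
// ("running", "exited", ...); pgRunning holds the game ids PG lists as
// running. Every write derived from this map is re-verified under the
// per-game lease before it lands.
func classify(dockerState map[string]string, pgRunning map[string]bool) map[string]driftKind {
	out := map[string]driftKind{}
	for gameID, state := range dockerState {
		switch {
		case !pgRunning[gameID]:
			out[gameID] = driftAdopt // owned container, no running record
		case state == "exited":
			out[gameID] = driftObservedExited // running record, exited container
		}
	}
	for gameID := range pgRunning {
		if _, ok := dockerState[gameID]; !ok {
			out[gameID] = driftDispose // running record, container missing
		}
	}
	return out
}
```
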
## 16. `stopped_at = now` (reconciler observation time)

The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.

The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-field extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time`;
that change is deferred until a consumer materialises.

## 17. Synchronous initial pass + periodic Component

`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.

The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.

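A sketch of the two entry points, assuming a tick-interval field and an
error-returning `ReconcileNow`; the real signatures may differ:

```go
package reconcile

import (
	"context"
	"log/slog"
	"time"
)

// Reconciler exposes both the synchronous startup pass and the periodic
// Component loop; interval and logger are assumed fields.
type Reconciler struct {
	interval time.Duration // periodic tick, default 5 minutes
	log      *slog.Logger
}

// ReconcileNow runs one full read + write pass. The runtime calls it
// synchronously between newWiring and app.New so the README startup
// invariant ("runs once and blocks until done") holds verbatim.
func (r *Reconciler) ReconcileNow(ctx context.Context) error {
	// read pass and lease-guarded write pass omitted in this sketch
	return nil
}

// Run implements the Component contract: it only ticks (no immediate
// pass) so the synchronous startup pass is not duplicated.
func (r *Reconciler) Run(ctx context.Context) error {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if err := r.ReconcileNow(ctx); err != nil {
				r.log.Warn("periodic reconcile failed", "error", err)
			}
		}
	}
}
```
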
## 18. Adopt through `Upsert`, race with start is benign

The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.

Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the
  `ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
  game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
  container, same image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after a re-read
finds the record absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.

## 19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate (see the sketch after this list). The service already owns:

- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
  `op_source=auto_ttl`);
- the telemetry counter and structured log fields.

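A sketch of the worker-local seam and the TTL filter; the `Cleaner`
method shape and the `expired` helper are assumptions for illustration
(the real interface may take the full record or an input struct):

```go
package containercleanup

import (
	"context"
	"time"
)

// Cleaner is the worker-local seam; production wiring binds the real
// *cleanupcontainer.Service to it.
type Cleaner interface {
	Cleanup(ctx context.Context, gameID string) error
}

type stoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

// expired applies the in-process TTL filter: a stopped record becomes a
// cleanup candidate once its last operation is older than the retention.
func expired(records []stoppedRecord, now time.Time, retention time.Duration) []string {
	cutoff := now.Add(-retention)
	var out []string
	for _, rec := range records {
		if rec.LastOpAt.Before(cutoff) {
			out = append(out, rec.GameID)
		}
	}
	return out
}
```
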
In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.

The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.

## 20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.

## 21. Cross-module test boundary for the consumer integration test

[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:

- it XADDs a start envelope in the AsyncAPI wire shape (the same
  shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
  the persistence stores, the lease, and the notification / health
  publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
  `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
  `error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.