# Background Workers

This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):

- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
  [`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
  consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
  subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
  `InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
  `/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
  drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
  periodic TTL cleanup.

The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.

## 1. Single ownership per `event_type`

The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.

## 2. `container_disappeared` is conditional on PG state

The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply (sketched below):

- record missing → suppress (the destroyed container was never owned
  by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
  or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
  swapped to a new container through restart or patch; the destroy is
  the expected removal of the prior container id).

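A minimal sketch of that decision, assuming hypothetical record and
event shapes (the real types in `dockerevents` may differ):

```go
// Hypothetical shapes: stand-ins for the listener's real record and event types.
type runtimeRecord struct {
	Status             string // "running", "stopped", "removed", ...
	CurrentContainerID string
}

type destroyEvent struct {
	GameID      string
	ContainerID string
}

// shouldEmitDisappeared applies the three suppression rules: a missing
// record, a non-running record, or a record already pointing at a
// different container all mean the destroy was expected (or untracked).
func shouldEmitDisappeared(rec *runtimeRecord, ev destroyEvent) bool {
	if rec == nil {
		return false // never tracked by RTM as a runtime
	}
	if rec.Status != "running" {
		return false // expected tail of a stop / cleanup
	}
	if rec.CurrentContainerID != ev.ContainerID {
		return false // record already swapped to a newer container
	}
	return true // running record still points here: treat as external destroy
}
```
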
Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.

A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.

## 3. `die` with exit code `0` is suppressed

`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.

## 4. Inspect worker leaves `container_disappeared` to the reconciler

When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips
(sketched after this list):

- the reconciler is the single authority for PG-drift reconciliation.
  Adding a third source for `container_disappeared` would risk double
  emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
  minutes. The latency window for "Docker drops the container, RTM
  notices" is therefore at most 5 minutes in v1, which is acceptable
  for the kinds of drift the reconciler covers (manual `docker rm`
  outside RTM, daemon restart with stale records). If a future
  requirement tightens the window, promoting the inspect-side
  observation to a real `container_disappeared` is a one-line change.

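A sketch of the per-record skip, with a stand-in interface and error
sentinel in place of the real `ports` package; the signatures here are
assumptions, not the real ones:

```go
package dockerinspect

import (
	"context"
	"errors"
	"log/slog"
)

// errContainerNotFound stands in for ports.ErrContainerNotFound.
var errContainerNotFound = errors.New("container not found")

// inspector stands in for the slice of the Docker port this worker uses.
type inspector interface {
	InspectContainer(ctx context.Context, containerID string) (healthy bool, err error)
}

// inspectOne handles a single running record inside one tick: a missing
// container is the reconciler's problem, so the worker only leaves a
// Debug trace and moves on.
func inspectOne(ctx context.Context, docker inspector, log *slog.Logger, gameID, containerID string) {
	healthy, err := docker.InspectContainer(ctx, containerID)
	switch {
	case errors.Is(err, errContainerNotFound):
		log.Debug("container not found during inspect", "game_id", gameID)
	case err != nil:
		log.Warn("inspect failed", "game_id", gameID, "error", err)
	case !healthy:
		// the real worker publishes inspect_unhealthy here (omitted)
	}
}
```
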
## 5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there
(sketched below):

- `consecutiveFailures` — incremented on every failed probe, reset on
  every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
  the failure persists, and triggers a single `probe_recovered` on the
  first success after the threshold was crossed.

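A sketch of the two counters and the transition they drive; the
`observe` helper and its threshold argument are illustrative, and the
real `probeState` may carry more fields:

```go
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe folds one probe result into the per-game state and reports
// which event (if any) to publish: "probe_failed", "probe_recovered",
// or "" for no emission.
func (s *probeState) observe(ok bool, threshold int) string {
	if !ok {
		s.consecutiveFailures++
		if s.consecutiveFailures >= threshold && !s.failurePublished {
			s.failurePublished = true
			return "probe_failed" // exactly once per failure episode
		}
		return ""
	}
	s.consecutiveFailures = 0
	if s.failurePublished {
		s.failurePublished = false
		return "probe_recovered" // first success after the threshold was crossed
	}
	return ""
}
```
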
The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.

## 6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`; sketched after the list
below). Three reasons:

- A single slow engine cannot delay the entire cohort. Sequential
  per-game probing would multiply the worst case by `len(records)`,
  which is the wrong shape for what is fundamentally a fan-out
  observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
  cap) was rejected to avoid pathological CPU and connection bursts
  if the running list ever grows beyond what RTM was sized for. 16
  in-flight probes at the default 2s timeout fit a single RTM
  instance well within typical OS file-descriptor and TCP
  ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
  single-instance and the active-game count is bounded by Lobby; a
  configurable cap is something we promote to env if a real workload
  demands it.

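The fan-out itself is the standard buffered-channel-semaphore shape; a
sketch with the actual HTTP probe reduced to a callback (`probeAll` and
`probeOne` are illustrative names, `defaultMaxConcurrency` is the real
constant):

```go
package healthprobe

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

// probeAll fans out one probe per game id, never letting more than
// defaultMaxConcurrency probes run at once, and returns only when the
// whole cohort has finished. probeOne stands in for the real HTTP
// GET /healthz call with its 2s timeout.
func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	sem := make(chan struct{}, defaultMaxConcurrency) // counting semaphore
	var wg sync.WaitGroup
	for _, id := range gameIDs {
		wg.Add(1)
		sem <- struct{}{} // blocks once the cap is reached
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			probeOne(ctx, id)
		}(id)
	}
	wg.Wait() // the tick ends only when every probe has returned
}
```
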
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.

## 7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while `ctx` is alive.

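A sketch of that outer loop, with the subscription and per-event
handling reduced to a stand-in function; apart from
`defaultReconnectBackoff`, the names are illustrative:

```go
package dockerevents

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

// run re-subscribes forever: subscribe blocks while the stream is healthy
// and returns when the SDK channel dies; after a fixed backoff the loop
// tries again, until ctx is cancelled.
func run(ctx context.Context, log *slog.Logger, subscribe func(context.Context) error) {
	for {
		if err := subscribe(ctx); err != nil {
			log.Warn("events subscription ended", "error", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(defaultReconnectBackoff):
			// fixed 5s wait, then re-subscribe
		}
	}
}
```
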
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.

A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.

## 8. Health publisher remains best-effort

Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.

## 9. Stream offset labels are stable identifiers

Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels match `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.

## 10. `OpSource` and `SourceRef` originate at the consumer boundary

Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.

## 11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.

The wire-shape consequences (illustrated in the sketch after this
list):

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
  start, the existing record carries `container_id` and
  `engine_endpoint`; for stop on `status=removed`, both fields are
  empty strings (the record was nulled by an earlier cleanup) — the
  AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
  returned a zero `Record`; the consumer publishes empty
  `container_id` and `engine_endpoint`.

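As a sketch of the "copy verbatim" rule; the field names here are
illustrative, not the AsyncAPI schema:

```go
// Hypothetical shapes: the real service result and RuntimeJobResult
// payload fields may differ from these names.
type result struct {
	Outcome        string // "success" or "failure"
	ErrorCode      string // "", "replay_no_op", or a failure code
	ContainerID    string
	EngineEndpoint string
}

type jobResultPayload struct {
	GameID         string
	Outcome        string
	ErrorCode      string
	ContainerID    string // may be "" on failure or stop-after-cleanup
	EngineEndpoint string // may be "" on failure or stop-after-cleanup
}

// toPayload copies the service result verbatim: no replay detection and
// no re-interpretation happens at the consumer boundary.
func toPayload(gameID string, r result) jobResultPayload {
	return jobResultPayload{
		GameID:         gameID,
		Outcome:        r.Outcome,
		ErrorCode:      r.ErrorCode,
		ContainerID:    r.ContainerID,
		EngineEndpoint: r.EngineEndpoint,
	}
}
```
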
## 12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.

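A sketch of one loop iteration under that policy, with the decode /
handle / publish / save chain reduced to stand-in closures (names are
illustrative):

```go
package startjobsconsumer

import (
	"context"
	"fmt"
	"log/slog"
)

// handleOne sketches one loop iteration: every per-message error is
// logged and absorbed, and the offset save runs unconditionally after
// the message was handled. Only the save itself is allowed to fail the
// worker, because a lost offset replays the whole backlog.
func handleOne(ctx context.Context, msgID string, log *slog.Logger,
	handle, publish func(context.Context) error,
	save func(context.Context, string) error,
) error {
	if err := handle(ctx); err != nil {
		log.Error("handling job failed", "message_id", msgID, "error", err)
	}
	if err := publish(ctx); err != nil {
		log.Error("publishing job result failed", "message_id", msgID, "error", err)
	}
	if err := save(ctx, msgID); err != nil {
		return fmt.Errorf("save stream offset for %s: %w", msgID, err)
	}
	return nil
}
```
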
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.

## 13. `requested_at_ms` is logged-only

The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.

## 14. Reconciler: per-game lease around every write

A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).

The reconciler splits its work into two phases (the write pass is
sketched below):

- **Read pass — lockless.**
  `docker.List({com.galaxy.owner=rtmanager})` followed by
  `RuntimeRecords.ListByStatus(running)`. No lease is taken; both
  reads are point-in-time observations of independent systems and a
  stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
  (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
  per-game lease, re-reads the record under the lease, and then
  either applies the mutation or returns when state has changed.
  A lease conflict (`acquired=false`) is logged at `info` and the
  game is silently skipped — the next tick will retry. A lease-store
  error is logged at `warn`; the rest of the pass continues.

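A sketch of one lease-guarded mutation (the dispose path), using
stand-in port interfaces; the real ports carry richer inputs, CAS
expectations, the health-event publish, and the operation-log append,
all omitted here:

```go
package reconcile

import (
	"context"
	"log/slog"
)

// Stand-in ports; the real interfaces carry richer inputs and CAS fields.
type leaseStore interface {
	Acquire(ctx context.Context, gameID string) (acquired bool, release func(), err error)
}

type recordStore interface {
	Get(ctx context.Context, gameID string) (status string, found bool, err error)
	MarkRemoved(ctx context.Context, gameID string) error
}

type reconciler struct {
	lease   leaseStore
	records recordStore
	log     *slog.Logger
}

// disposeOne sketches one lease-guarded mutation: acquire, re-read, act.
func (r *reconciler) disposeOne(ctx context.Context, gameID string) error {
	acquired, release, err := r.lease.Acquire(ctx, gameID)
	if err != nil {
		r.log.Warn("lease store error", "game_id", gameID, "error", err)
		return nil // the rest of the pass continues; next tick retries
	}
	if !acquired {
		r.log.Info("lease held elsewhere, skipping", "game_id", gameID)
		return nil
	}
	defer release()

	// Re-read under the lease: the lockless read pass may be stale and
	// the record may have moved (stopped, removed, or container swapped).
	status, found, err := r.records.Get(ctx, gameID)
	if err != nil || !found || status != "running" {
		return err // state moved since the read pass; nothing to do
	}
	return r.records.MarkRemoved(ctx, gameID)
}
```
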
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.

## 15. Three drift kinds covered by the reconciler

- `adopt` — Docker reports a container labelled
  `com.galaxy.owner=rtmanager` for which RTM has no record; insert a
  fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
  stop or remove the container (operators may have started it
  manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
  in Docker; mark `status=removed`, publish
  `container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
  exists but is in `exited`; mark `status=stopped`, publish
  `container_exited` with the observed exit code. This third path
  exists because the events listener sees only live events; a
  container that died while RTM was offline would otherwise stay
  `running` indefinitely. The drift is exposed through
  `rtmanager.reconcile_drift{kind=observed_exited}` and through the
  `container_exited` health event; no `operation_log` entry is
  written because the audit log records explicit RTM operations, not
  passive observations of Docker state.

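A sketch of how a tick might classify the two lockless reads into those
three kinds. The inputs are simplified (only running records are listed
from PG, so the adopt decision is re-checked under the lease before any
write) and the names are illustrative:

```go
type driftKind string

const (
	driftAdopt          driftKind = "adopt"
	driftDispose        driftKind = "dispose"
	driftObservedExited driftKind = "observed_exited"
)

// classify maps game id → drift kind from the two lockless reads:
// dockerState holds the owned containers Docker reported and their state
// ("running", "exited", ...); pgRunning holds the game ids PG lists as
// running. Every write derived from this map is re-verified under the
// per-game lease before it lands.
func classify(dockerState map[string]string, pgRunning map[string]bool) map[string]driftKind {
	out := map[string]driftKind{}
	for gameID, state := range dockerState {
		switch {
		case !pgRunning[gameID]:
			out[gameID] = driftAdopt // owned container, no running record
		case state == "exited":
			out[gameID] = driftObservedExited // running record, exited container
		}
	}
	for gameID := range pgRunning {
		if _, ok := dockerState[gameID]; !ok {
			out[gameID] = driftDispose // running record, container missing
		}
	}
	return out
}
```
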
## 16. `stopped_at = now` (reconciler observation time)

The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.

The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-field extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time`;
that change is deferred until a consumer materialises.

## 17. Synchronous initial pass + periodic Component

`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.

The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.

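A sketch of the two entry points, assuming a tick-interval field and an
error-returning `ReconcileNow`; the real signatures may differ:

```go
package reconcile

import (
	"context"
	"log/slog"
	"time"
)

// Reconciler exposes both the synchronous startup pass and the periodic
// Component loop; interval and logger are assumed fields.
type Reconciler struct {
	interval time.Duration // periodic tick, default 5 minutes
	log      *slog.Logger
}

// ReconcileNow runs one full read + write pass. The runtime calls it
// synchronously between newWiring and app.New so the README startup
// invariant ("runs once and blocks until done") holds verbatim.
func (r *Reconciler) ReconcileNow(ctx context.Context) error {
	// read pass and lease-guarded write pass omitted in this sketch
	return nil
}

// Run implements the Component contract: it only ticks (no immediate
// pass) so the synchronous startup pass is not duplicated.
func (r *Reconciler) Run(ctx context.Context) error {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if err := r.ReconcileNow(ctx); err != nil {
				r.log.Warn("periodic reconcile failed", "error", err)
			}
		}
	}
}
```
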
## 18. Adopt through `Upsert`, race with start is benign

The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.

Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the
  `ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
  game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
  container, same image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after a re-read
finds the record absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.

## 19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate (see the sketch after this list). The service already owns:

- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
  `op_source=auto_ttl`);
- the telemetry counter and structured log fields.

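A sketch of the worker-local seam and the TTL filter; the `Cleaner`
method shape and the `expired` helper are assumptions for illustration
(the real interface may take the full record or an input struct):

```go
package containercleanup

import (
	"context"
	"time"
)

// Cleaner is the worker-local seam; production wiring binds the real
// *cleanupcontainer.Service to it.
type Cleaner interface {
	Cleanup(ctx context.Context, gameID string) error
}

type stoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

// expired applies the in-process TTL filter: a stopped record becomes a
// cleanup candidate once its last operation is older than the retention.
func expired(records []stoppedRecord, now time.Time, retention time.Duration) []string {
	cutoff := now.Add(-retention)
	var out []string
	for _, rec := range records {
		if rec.LastOpAt.Before(cutoff) {
			out = append(out, rec.GameID)
		}
	}
	return out
}
```
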
In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.

The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.

## 20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.

## 21. Cross-module test boundary for the consumer integration test

[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:

- it XADDs a start envelope in the AsyncAPI wire shape (the same
  shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
  the persistence stores, the lease, and the notification / health
  publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
  `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
  `error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.