# Runtime and Components
The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.
```mermaid
flowchart LR
subgraph Clients
GM["Game Master"]
Admin["Admin Service"]
Lobby["Game Lobby"]
end
subgraph RTM["Runtime Manager process"]
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
StartJobs["startjobsconsumer"]
StopJobs["stopjobsconsumer"]
DockerEvents["dockerevents listener"]
HealthProbe["healthprobe worker"]
DockerInspect["dockerinspect worker"]
Reconcile["reconcile worker"]
Cleanup["containercleanup worker"]
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
IntentPublisher["notification:intents publisher"]
ResultsPublisher["runtime:job_results publisher"]
HealthPublisher["runtime:health_events publisher"]
Telemetry["Logs, traces, metrics"]
end
Docker["Docker Daemon"]
Engine["galaxy-game-{game_id} container"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases + offsets"]
LobbyHTTP["Lobby internal HTTP"]
Lobby -. runtime:start_jobs .-> StartJobs
Lobby -. runtime:stop_jobs .-> StopJobs
GM --> InternalHTTP
Admin --> InternalHTTP
StartJobs --> Services
StopJobs --> Services
InternalHTTP --> Services
Services --> Docker
Services --> Postgres
Services --> Redis
Services --> ResultsPublisher
Services --> HealthPublisher
Services --> IntentPublisher
Services -. GET diagnostic .-> LobbyHTTP
DockerEvents --> Docker
DockerInspect --> Docker
HealthProbe --> Engine
Reconcile --> Docker
Reconcile --> Postgres
Cleanup --> Postgres
Cleanup --> Services
DockerEvents --> HealthPublisher
DockerInspect --> HealthPublisher
HealthProbe --> HealthPublisher
HealthPublisher --> Redis
ResultsPublisher --> Redis
IntentPublisher --> Redis
StartJobs --> Redis
StopJobs --> Redis
InternalHTTP --> Postgres
Docker -->|create / start / stop / rm| Engine
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
InternalHTTP --> Telemetry
Services --> Telemetry
StartJobs --> Telemetry
StopJobs --> Telemetry
DockerEvents --> Telemetry
HealthProbe --> Telemetry
DockerInspect --> Telemetry
Reconcile --> Telemetry
Cleanup --> Telemetry
```
Notes:
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
  goose migrations fail, when Redis ping fails, when the Docker daemon
  ping fails, or when the configured Docker network is missing (a
  fail-fast sketch follows this list). Lobby reachability is **not**
  verified at boot: the start service's diagnostic
  `GET /api/v1/internal/games/{game_id}` call is a no-op outside of
  debug logging
  ([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup, before
  `app.App.Run` registers any other component, then re-runs
  periodically as a regular `Component`. The synchronous pass
  guarantees that orphaned containers left by a prior process are
  recorded before the events listener can ever observe a container
  with no PG record
  ([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
and Admin Service. There is no public listener — RTM does not face
end users.
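
A minimal fail-fast sketch of those boot checks, assuming the env
names from the README; the goose and network-inspect steps are elided
to comments, and all wiring here is illustrative rather than the real
`cmd/rtmanager` code:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"os"
	"time"

	"github.com/docker/docker/client"
	_ "github.com/jackc/pgx/v5/stdlib"
	"github.com/redis/go-redis/v9"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// PostgreSQL: an unreachable primary aborts startup.
	db, err := sql.Open("pgx", os.Getenv("RTMANAGER_POSTGRES_PRIMARY_DSN"))
	if err != nil {
		log.Fatalf("postgres open: %v", err)
	}
	if err := db.PingContext(ctx); err != nil {
		log.Fatalf("postgres ping: %v", err)
	}
	// goose.Up(db, "migrations") would run here; a failed migration also aborts.

	// Redis: a failed PING aborts startup.
	rdb := redis.NewClient(&redis.Options{
		Addr:     os.Getenv("RTMANAGER_REDIS_MASTER_ADDR"),
		Password: os.Getenv("RTMANAGER_REDIS_PASSWORD"),
	})
	if err := rdb.Ping(ctx).Err(); err != nil {
		log.Fatalf("redis ping: %v", err)
	}

	// Docker: a failed daemon ping aborts startup; the real binary then
	// inspects RTMANAGER_DOCKER_NETWORK and aborts when it is missing.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("docker client: %v", err)
	}
	if _, err := cli.Ping(ctx); err != nil {
		log.Fatalf("docker ping: %v", err)
	}

	log.Println("boot checks passed")
}
```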
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`
The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source`
([`services.md` §18](services.md)).
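
A sketch of how a handler might capture that header, assuming a
hypothetical `callerKey` context key and treating any value other than
`gm` or `admin` as absent (the real normalisation rules are not
specified on this page):

```go
package rest

import (
	"context"
	"net/http"
)

type callerKey struct{}

// withCaller stores the optional X-Galaxy-Caller identity in the request
// context so a handler can record it as operation_log.op_source.
func withCaller(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		caller := r.Header.Get("X-Galaxy-Caller")
		if caller != "gm" && caller != "admin" {
			caller = "" // assumption: unknown identities are dropped, not rejected
		}
		ctx := context.WithValue(r.Context(), callerKey{}, caller)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```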
Probe routes:
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` only when every check passes; otherwise
returns `503` with the canonical error envelope.
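
A sketch of that readiness contract, assuming injected check functions
and a simplified error body (the canonical error envelope lives with
the REST handlers, not here):

```go
package rest

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// readyz answers 200 only when every dependency check passes; the first
// failure yields 503 naming the failed check.
func readyz(checks map[string]func(context.Context) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		w.Header().Set("Content-Type", "application/json")
		for name, check := range checks {
			if err := check(ctx); err != nil {
				w.WriteHeader(http.StatusServiceUnavailable)
				json.NewEncoder(w).Encode(map[string]string{
					"error": "not_ready", "failed_check": name,
				})
				return
			}
		}
		json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
	}
}
```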
## Background Workers
Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
Neither the events listener nor the inspect worker emits
`container_started`; that event is owned by the start service
([`workers.md` §1](workers.md)). Nor does either of them emit
`container_disappeared` autonomously when a record is missing or
stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).
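
The `app.Component` contract is not reproduced in this document; the
sketch below assumes a minimal `Run(ctx) error` shape and shows the
periodic-tick pattern the interval-driven workers in the table share:

```go
package worker

import (
	"context"
	"log"
	"time"
)

// Component is an assumed minimal shape for app.Component.
type Component interface {
	Run(ctx context.Context) error
}

// tickerWorker runs fn once per interval until the context is cancelled.
type tickerWorker struct {
	name     string
	interval time.Duration // e.g. RTMANAGER_PROBE_INTERVAL
	fn       func(context.Context) error
}

func (w *tickerWorker) Run(ctx context.Context) error {
	t := time.NewTicker(w.interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			if err := w.fn(ctx); err != nil {
				// A single failed tick is logged and survived; it must
				// not take the process down.
				log.Printf("%s tick failed: %v", w.name, err)
			}
		}
	}
}
```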
## Lifecycle Services
The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.
| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |
All services share three behaviours captured in
[`services.md`](services.md):
- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
not by the caller — which keeps consumer and REST callers symmetric
([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
  `ContainerID`, `EngineEndpoint`) is what consumers and REST
  handlers translate into job_results / HTTP responses, as sketched
  after this list
  ([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
and three of the failure codes (`start_config_invalid`,
`image_pull_failed`, `container_start_failed`) also publish a
`runtime.*` admin notification intent
([`services.md` §4](services.md)).
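
A sketch of that canonical `Result` shape and one plausible REST
translation; the field types and the status mapping are inferences
from this page, not the real package:

```go
package service

import "net/http"

// RuntimeRecord stands in for the persisted runtime_records row.
type RuntimeRecord struct{ GameID, Status string }

// Result is the shape every lifecycle service returns (services.md §3);
// the concrete field types here are assumptions.
type Result struct {
	Outcome        string // e.g. "succeeded" / "failed"
	ErrorCode      string // failure-envelope code, empty on success
	Record         *RuntimeRecord
	ContainerID    string
	EngineEndpoint string
}

// httpStatus shows how a REST handler might map failure codes; only the
// codes themselves come from the tables above, the mapping is illustrative.
func httpStatus(r Result) int {
	switch r.ErrorCode {
	case "":
		return http.StatusOK
	case "not_found":
		return http.StatusNotFound
	case "conflict":
		return http.StatusConflict
	case "service_unavailable":
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}
```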
## Synchronous Upstream Client
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |
The Lobby client is the only synchronous outbound dependency RTM holds.
Every other interaction (Notification Service, Game Master, Admin
Service) crosses an asynchronous boundary or is initiated by the peer.
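
A sketch of the diagnostic call under those rules: the body is ignored
and every failure is absorbed at debug level. The logger wiring is an
assumption, and the fixed 2s timeout stands in for
`RTMANAGER_LOBBY_INTERNAL_TIMEOUT`:

```go
package lobby

import (
	"context"
	"fmt"
	"log/slog"
	"net/http"
	"time"
)

// diagnostic performs the GET and never fails the caller; in v1 the
// response body is ignored and errors surface only as debug logs.
func diagnostic(ctx context.Context, baseURL, gameID string) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	url := fmt.Sprintf("%s/api/v1/internal/games/%s", baseURL, gameID)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		slog.Debug("lobby diagnostic request build failed", "err", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		slog.Debug("lobby diagnostic call failed", "err", err) // absorbed, never fatal
		return
	}
	resp.Body.Close() // body is ignored in v1
	slog.Debug("lobby diagnostic call", "status", resp.StatusCode)
}
```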
## Stream Offsets
Each consumer persists its position under a fixed label so that a
process restart preserves stream progress.
| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).
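
A sketch of that decoupling, assuming go-redis and a plain `XREAD`
loop; the real consumers also handle decode failures and job-result
publication, which are elided here:

```go
package consumer

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// consume reads from the configured stream key but persists progress under
// the fixed offset key, so renaming the stream keeps the offset.
func consume(ctx context.Context, rdb *redis.Client, stream, offsetKey string,
	handle func(redis.XMessage) error) error {
	last, err := rdb.Get(ctx, offsetKey).Result()
	if err == redis.Nil {
		last = "0" // no persisted offset yet: start from the beginning
	} else if err != nil {
		return err
	}
	for {
		res, err := rdb.XRead(ctx, &redis.XReadArgs{
			Streams: []string{stream, last},
			Block:   5 * time.Second, // RTMANAGER_STREAM_BLOCK_TIMEOUT
		}).Result()
		if err == redis.Nil {
			continue // block timeout elapsed with no new entries
		}
		if err != nil {
			return err
		}
		for _, msg := range res[0].Messages {
			if err := handle(msg); err != nil {
				return err
			}
			last = msg.ID
			if err := rdb.Set(ctx, offsetKey, last, 0).Err(); err != nil {
				return err
			}
		}
	}
}
```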
The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume them
itself.
## Configuration Groups
The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:
- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
`RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
`RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
`RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
`RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
`RTMANAGER_CONTAINER_RETENTION_DAYS`,
`RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
`RTMANAGER_ENGINE_STATE_ENV_NAME`,
`RTMANAGER_GAME_STATE_DIR_MODE`,
`RTMANAGER_GAME_STATE_OWNER_UID`,
`RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
`RTMANAGER_POSTGRES_REPLICA_DSNS`,
`RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
`RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
`RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
`RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
`RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
`RTMANAGER_REDIS_STOP_JOBS_STREAM`,
`RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
`RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
`RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
`RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
`RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
`RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
`RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
`RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
`RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.
## Runtime Notes
- **Single-instance v1.** Multi-instance Runtime Manager with Redis
Streams consumer groups is explicitly out of scope for the current
iteration. The per-game lease serialises operations on one game
across the consumer + REST entry points; cross-instance
coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
  `SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
  (default `60s`); a minimal acquisition sketch follows these notes.
  The lease is **not renewed mid-operation** in v1, so a long pull of
  a multi-gigabyte image can expire the lease before the operation
  finishes; the trade-off is documented in
  [`services.md` §1](services.md). The reconciler honours the same
  lease around every drift mutation
  ([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
reconcile mutation appends one row to `rtmanager.operation_log`.
The `runtime:health_events` stream and the `notification:intents`
emissions are best-effort — a publish failure logs at `Error` and
proceeds, never rolling back the recorded operation
([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
  per-game `consecutiveFailures` and `failurePublished` counters in a
  mutex-guarded map (a state sketch follows these notes). The state is
  non-persistent: a process restart that loses the counters
  re-establishes hysteresis from scratch, and state for a game that
  transitions through `stopped → running` is pruned at the start of
  every probe tick
  ([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
`if_missing` (default), `always`, and `never`. Image labels
(`com.galaxy.cpu_quota`, `com.galaxy.memory`,
`com.galaxy.pids_limit`) drive resource limits when present; the
matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
  directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
  mode and uid/gid, but **never deletes them**. Removing the directory
  is the operator's domain (backup tooling, or a future Admin Service
  workflow). A cleanup that removes the container leaves the
  directory intact.
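
A minimal acquisition sketch for the lease semantics above, assuming
go-redis; the fencing token and guarded release are defensive
additions, not documented behaviour; only the `SET NX PX` acquire and
the absence of mid-operation renewal come from this page:

```go
package lease

import (
	"context"
	"errors"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// withGameLease serialises one operation per game: SET NX PX to acquire,
// no renewal while op runs, compare-and-delete to release.
func withGameLease(ctx context.Context, rdb *redis.Client, gameID string,
	ttl time.Duration, op func(context.Context) error) error {
	key := "rtmanager:game_lease:" + gameID
	token := uuid.NewString() // fence value so we only release our own lease
	ok, err := rdb.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return err
	}
	if !ok {
		return errors.New("conflict: lease held by another operation")
	}
	defer func() {
		// Compare-and-delete so a lease that expired and was reacquired
		// is not released by the stale holder.
		const release = `if redis.call("get", KEYS[1]) == ARGV[1] then
			return redis.call("del", KEYS[1]) end return 0`
		rdb.Eval(ctx, release, []string{key}, token)
	}()
	return op(ctx)
}
```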
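And a sketch of the probe hysteresis state, with the field names
borrowed from the note above and everything else assumed:

```go
package healthprobe

import "sync"

// hysteresis holds the non-persistent, mutex-guarded per-game probe state.
type hysteresis struct {
	mu    sync.Mutex
	state map[string]*gameProbe // keyed by game_id; pruned each probe tick
}

type gameProbe struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe folds one probe result into the state and returns the event to
// emit, if any; threshold maps to RTMANAGER_PROBE_FAILURES_THRESHOLD.
func (h *hysteresis) observe(gameID string, ok bool, threshold int) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	p := h.state[gameID]
	if p == nil {
		p = &gameProbe{}
		h.state[gameID] = p
	}
	if ok {
		event := ""
		if p.failurePublished {
			event = "probe_recovered" // first success after a published failure
		}
		*p = gameProbe{} // reset counters on any success
		return event
	}
	p.consecutiveFailures++
	if p.consecutiveFailures >= threshold && !p.failurePublished {
		p.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```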