# Runtime Manager
`Runtime Manager` (RTM) is the only Galaxy platform service permitted to interact with the
Docker daemon. It owns the lifecycle of `galaxy/game` engine containers and the technical
runtime view of running games. Other services consume RTM via two transports: an asynchronous
Redis Streams contract (used by `Game Lobby`) and a synchronous internal REST surface (used by
`Game Master` and `Admin Service`).
## References
- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §9 Runtime Manager.
- [`../TESTING.md`](../TESTING.md) §7 — testing matrix for RTM.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry point.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — REST contract.
- [`./api/runtime-jobs-asyncapi.yaml`](./api/runtime-jobs-asyncapi.yaml) — start/stop job
streams contract.
- [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml) —
`runtime:health_events` stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract (env, ports,
`/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with RTM.
## Purpose
A running Galaxy game lives in exactly one Docker container. The platform must be able to:
- create the container with the right engine version and configuration;
- supply the engine with a stable storage location for game state;
- keep the runtime status visible to platform-level services;
- replace the container in place for patch upgrades and restarts;
- remove containers that are no longer needed;
- detect and surface engine failures to whoever should react.
`Runtime Manager` is the single component that performs these actions. It deliberately does
**not** reason about platform metadata, membership, schedules, turn cutoffs, or any other
business state. Game Lobby owns platform metadata; Game Master will own runtime business state
when implemented.
## Scope
`Runtime Manager` is the source of truth for:
- the mapping `game_id -> current_container_id` for every running container;
- the durable history of every start, stop, restart, patch, and cleanup operation it performed;
- the most recent technical health observation per game (last Docker event, last successful or
failed probe, last inspect result).
`Runtime Manager` is not the source of truth for:
- any business or platform-level metadata of a game (owned by `Game Lobby`);
- runtime state visible to players or operators as game state, including the current turn,
  generation status, and the engine version registry (owned by `Game Master`);
- the engine version catalogue or which engine version a game is allowed to use (`Game Master`
is the future owner; `Game Lobby` supplies `image_ref` in v1);
- contents of the engine state directory; that is engine domain;
- backup, archival, or operator cleanup of state directories.
## Non-Goals
- Multi-instance operation in v1. Coordination is single-process; multiple replicas are an
explicit future iteration.
- Engine version arbitration. The producer (`Game Lobby` in v1, `Game Master` later) supplies `image_ref`.
- Image registry control. Pull policy is configurable, but RTM does not push, retag, or
promote images.
- TLS or mTLS on the internal listener. RTM trusts its network segment.
- Direct delivery of player-visible push notifications. RTM publishes admin-only notification
intents only for failures invisible elsewhere; everything else is delegated.
- Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached
through `unix:///var/run/docker.sock`.
## Position in the System
```mermaid
flowchart LR
Lobby["Game Lobby"]
GM["Game Master"]
Admin["Admin Service"]
Notify["Notification Service"]
RTM["Runtime Manager"]
Engine["Game Engine container"]
Docker["Docker Daemon"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases"]
Lobby -->|runtime:start_jobs / stop_jobs| RTM
RTM -->|runtime:job_results| Lobby
GM -->|internal REST| RTM
Admin -->|internal REST| RTM
RTM -->|notification:intents (admin)| Notify
RTM -->|runtime:health_events| Redis
RTM <--> Docker
Docker -->|create / start / stop / rm| Engine
RTM --> Postgres
RTM --> Redis
Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```
## Responsibility Boundaries
`Runtime Manager` is responsible for:
- accepting start, stop, restart, patch, inspect, and cleanup requests through the supported
transports and producing one durable outcome per request;
- creating Docker containers from a producer-supplied `image_ref` and binding them to the
configured Docker network and host state directory;
- enforcing the one-game-one-container invariant in its own state and on Docker;
- monitoring container health through Docker events, periodic inspect, and active HTTP probes;
- publishing technical runtime events (`runtime:job_results`, `runtime:health_events`) and
admin-only notification intents for failures that no other service can observe;
- reconciling its persistent state with Docker reality on startup and periodically;
- removing exited containers automatically by retention TTL or explicitly by admin command.
`Runtime Manager` is not responsible for:
- evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
- registering a started runtime with `Game Master` (Lobby calls GM after a successful job
result);
- mapping platform users to engine players (GM owns this mapping);
- player command routing (GM proxies player commands directly to engine);
- cleaning up host state directories;
- patching the engine version registry; the registry lives in `Game Master`.
## Container Model
### Network
Containers attach to a single user-defined Docker bridge network. The network is provisioned
**outside** RTM: docker-compose, Terraform, or an operator runbook creates `galaxy-net` (or
whatever name is configured via `RTMANAGER_DOCKER_NETWORK`).
RTM validates the network's presence at startup. A missing network is a fail-fast condition;
the process exits non-zero before opening any listener.
### DNS name and engine endpoint
Each container is created with hostname `galaxy-game-{game_id}` and is attached to the
configured network. Docker's embedded DNS resolves the hostname for any other container in the
same network.
The `engine_endpoint` published in `runtime:job_results` and visible through the inspect REST
endpoint is the full URL `http://galaxy-game-{game_id}:8080`. The port is fixed at `8080`
inside the container; RTM does not publish ports to the host.
Restart and patch keep the same DNS name. The `container_id` changes; the `engine_endpoint`
does not.
### State storage (bind mount)
Engine state lives on the host filesystem. RTM never uses Docker named volumes, so operators can
back up and inspect game state with ordinary filesystem tools.
- Host root: `RTMANAGER_GAME_STATE_ROOT` (operator-supplied, e.g. `/var/lib/galaxy/games`).
- Per-game directory: `<RTMANAGER_GAME_STATE_ROOT>/{game_id}`. RTM creates it with permissions
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and ownership `RTMANAGER_GAME_STATE_OWNER_UID`
/ `_GID` (default `0:0` — operator overrides for non-root engine).
- Bind mount: the per-game directory is mounted into the container at the path declared by
`RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- Environment: the container receives `GAME_STATE_PATH=<mount path>`, and the engine resolves its
  state path from this variable. The same value is also forwarded as `STORAGE_PATH` for
  backward compatibility — both names are accepted in v1.
RTM never deletes the host state directory. Removing it is the responsibility of operator
tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container
through the cleanup endpoint or the retention TTL leaves the directory intact.
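A minimal sketch of the directory-creation rule above, assuming illustrative names and error
handling (the real service also maps failures to `start_config_invalid`):
```go
package main // package name illustrative

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureStateDir creates <root>/<gameID> with the configured mode and ownership.
func ensureStateDir(root, gameID string, mode os.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll applies the process umask, so enforce the configured mode explicitly.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	// Default 0:0 keeps root ownership; operators override for non-root engines.
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}
```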
### Container labels
RTM applies the following labels to every container it creates:
| Label | Value | Purpose |
| --- | --- | --- |
| `com.galaxy.owner` | `rtmanager` | Filter for `docker ps` and reconcile. |
| `com.galaxy.kind` | `game-engine` | Differentiates from infra containers. |
| `com.galaxy.game_id` | `{game_id}` | Reverse lookup from container to platform game. |
| `com.galaxy.engine_image_ref` | `{image_ref}` | Cross-check against `runtime_records`. |
| `com.galaxy.started_at_ms` | `{ms}` | Unambiguous start timestamp. |
A separate set of labels is read from the resolved engine image to choose resource limits (see below).
### Resource limits
Resource limits originate in the **engine image**, not in the producer envelope or RTM config:
| Image label | Container limit | RTM fallback config |
| --- | --- | --- |
| `com.galaxy.cpu_quota` | `--cpus` value | `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`) |
| `com.galaxy.memory` | `--memory` value | `RTMANAGER_DEFAULT_MEMORY` (default `512m`) |
| `com.galaxy.pids_limit` | `--pids-limit` value | `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`) |
If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass
limits.
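A rough illustration of the label-with-fallback rule (struct and function names are assumptions,
not the real adapter):
```go
package main // illustrative only

import "strconv"

// Limits mirrors the three tunables RTM passes to docker create.
type Limits struct {
	CPUQuota  float64 // --cpus
	Memory    string  // --memory; the real adapter converts e.g. "512m" to bytes
	PidsLimit int64   // --pids-limit
}

// limitsFromImage prefers labels on the resolved engine image and falls back to
// the RTMANAGER_DEFAULT_* values for anything missing or unparseable.
func limitsFromImage(labels map[string]string, defaults Limits) Limits {
	out := defaults
	if v, err := strconv.ParseFloat(labels["com.galaxy.cpu_quota"], 64); err == nil && v > 0 {
		out.CPUQuota = v
	}
	if m := labels["com.galaxy.memory"]; m != "" {
		out.Memory = m
	}
	if v, err := strconv.ParseInt(labels["com.galaxy.pids_limit"], 10, 64); err == nil && v > 0 {
		out.PidsLimit = v
	}
	return out
}
```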
### Logging driver
Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver
and its options when creating the container:
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty; comma-separated `key=value` pairs).
RTM never reads the container's stdout itself. Operators consume engine logs via `docker logs`
or via whatever sink the configured driver feeds (fluentd, journald, etc.).
The production Docker SDK adapter that creates and starts these containers lives at
`internal/adapters/docker/`. Its design rationale — fixed engine port, partial-rollback on
`ContainerStart` failure, events-stream filter rationale, and the `mockgen`-driven service-test
fixture — is captured in [`docs/adapters.md`](docs/adapters.md).
## Runtime Surface
### Listeners
| Listener | Default address | Purpose |
| --- | --- | --- |
| `internal` HTTP | `:8096` (`RTMANAGER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Game Master` and `Admin Service`. |
There is no public listener. The internal listener is unauthenticated and assumes a trusted
network segment.
### Background workers
| Worker | Driver | Description |
| --- | --- | --- |
| `startjobs` consumer | Redis Stream `runtime:start_jobs` | Decodes start envelope and invokes the start service. |
| `stopjobs` consumer | Redis Stream `runtime:stop_jobs` | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker `/events` API | Subscribes with the label filter, emits `runtime:health_events` for container_started / exited / oom / disappeared. |
| Active HTTP probe | Periodic | `GET {engine_endpoint}/healthz` for every running runtime; emits `probe_failed` / `probe_recovered` with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits `inspect_unhealthy` when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles `runtime_records` with `docker ps` (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than `RTMANAGER_CONTAINER_RETENTION_DAYS`. |
### Startup dependencies
In start order:
1. PostgreSQL primary (DSN `RTMANAGER_POSTGRES_PRIMARY_DSN`). Goose migrations apply
synchronously before any listener opens.
2. Redis master (`RTMANAGER_REDIS_MASTER_ADDR`).
3. Docker daemon at `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`). RTM
verifies API ping and the presence of `RTMANAGER_DOCKER_NETWORK`.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs once and blocks until done.
7. Background workers start.
A failure in any step is fatal and exits the process non-zero.
### Probes
`/healthz` reports liveness — the process responds when the HTTP server is alive.
`/readyz` reports readiness — `200` only when:
- the PostgreSQL pool can ping the primary;
- the Redis master client can ping;
- the Docker client can ping;
- the configured Docker network exists.
Both probes are documented in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
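A sketch of how the readiness rule could be wired; the handler shape and dependency interfaces
are assumptions, not the production router:
```go
package main // illustrative only

import (
	"context"
	"net/http"
	"time"
)

// pinger abstracts the three dependency pings plus the network-exists check.
type pinger func(ctx context.Context) error

// readyzHandler returns 200 only when every dependency check passes.
func readyzHandler(pgPing, redisPing, dockerPing, networkExists pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for _, check := range []pinger{pgPing, redisPing, dockerPing, networkExists} {
			if err := check(ctx); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}
```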
## Lifecycles
All operations share a per-game-id Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`, default `60`). The lease serialises operations on a
single game across all entry points (stream consumers and REST handlers). v1 does not renew
the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before
the operation finishes — the trade-off is documented in
[`docs/services.md` §1](docs/services.md).
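A minimal sketch of the lease semantics, assuming a go-redis v9 client (helper names are
illustrative; the key shape and TTL match the Redis table in the Persistence Layout section):
```go
package main // illustrative only

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// acquireLease takes rtmanager:game_lease:{game_id} with SET NX PX.
// It returns false when another operation already holds the lease.
func acquireLease(ctx context.Context, rdb *redis.Client, gameID, token string, ttl time.Duration) (bool, error) {
	key := fmt.Sprintf("rtmanager:game_lease:%s", gameID)
	return rdb.SetNX(ctx, key, token, ttl).Result()
}

// releaseLease deletes the key only if this operation still owns it, so a
// lease that expired and was re-acquired is never released by the old holder.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`)

func releaseLease(ctx context.Context, rdb *redis.Client, gameID, token string) error {
	key := fmt.Sprintf("rtmanager:game_lease:%s", gameID)
	return releaseScript.Run(ctx, rdb, []string{key}, token).Err()
}
```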
### Start
**Triggers:**
- Lobby: a Redis Streams entry on `runtime:start_jobs` with envelope
`{game_id, image_ref, requested_at_ms}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/start` with body
`{image_ref}`.
**Pre-conditions:**
- `image_ref` is a non-empty string and parseable as a Docker reference.
- Configured Docker network exists.
- The lease for `{game_id}` is acquired.
**Flow on success:**
1. Read `runtime_records.{game_id}`. If `status=running` with the same `image_ref`, return
the existing record (idempotent success, `error_code=replay_no_op`).
2. Pull the image per `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`).
3. Inspect the resolved image, derive resource limits from labels.
4. Ensure the per-game state directory exists with the configured mode and ownership.
5. `docker create` with the configured network, hostname, labels, env (`GAME_STATE_PATH`,
`STORAGE_PATH`), bind mount, log driver, resource limits.
6. `docker start`.
7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`,
`current_image_ref`, `started_at`, `last_op_at`).
8. Append `operation_log` entry (`op_kind=start`, `outcome=success`, source-specific
`op_source`).
9. Publish `runtime:health_events` `container_started`.
10. For Lobby callers: publish `runtime:job_results`
`{game_id, outcome=success, container_id, engine_endpoint}`.
For REST callers: respond `200` with the runtime record.
**Failure paths:**
| Failure | PG side effect | Notification intent | Outcome to caller |
| --- | --- | --- | --- |
| Invalid `image_ref` shape, network missing | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
| Image pull error | `operation_log` failure | `runtime.image_pull_failed` | `failure / image_pull_failed` |
| `docker create` / `start` error | `operation_log` failure | `runtime.container_start_failed` | `failure / container_start_failed` |
| State directory creation error | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
A failed start never leaves a partially-running container: if `docker create` succeeded but
the subsequent step failed, RTM removes the container before recording the failure.
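The rollback rule could look roughly like the sketch below, written against a narrowed Docker
port rather than the SDK directly (interface, spec type, and error wrapping are assumptions; the
production adapter in `internal/adapters/docker/` is authoritative):
```go
package main // illustrative only

import (
	"context"
	"fmt"
)

// engineDocker is a narrowed view of the Docker adapter used by the start flow.
type engineDocker interface {
	ContainerCreate(ctx context.Context, spec ContainerSpec) (string, error)
	ContainerStart(ctx context.Context, containerID string) error
	ContainerRemove(ctx context.Context, containerID string) error
}

// ContainerSpec stands in for the image ref, network, hostname, labels, env,
// bind mount, log driver, and resource limits described above.
type ContainerSpec struct{ ImageRef, GameID string }

// createAndStart enforces the no-partial-container rule: if start fails after
// create succeeded, the container is removed before the failure is recorded.
func createAndStart(ctx context.Context, d engineDocker, spec ContainerSpec) (string, error) {
	id, err := d.ContainerCreate(ctx, spec)
	if err != nil {
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	if err := d.ContainerStart(ctx, id); err != nil {
		// Best-effort rollback; a leftover container would otherwise violate
		// the one-game-one-container invariant.
		_ = d.ContainerRemove(ctx, id)
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	return id, nil
}
```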
The production start orchestrator that implements the flow and the failure paths above lives
at `internal/service/startruntime/`. Its design rationale — why the per-game lease and the
health-events publisher live with the start service, the `Result`-shaped contract consumed by
the stream consumer and the REST handler, the rollback rule on Upsert failure, and the
`created_at`-preservation rule for re-starts — is captured in
[`docs/services.md`](docs/services.md).
### Stop
**Triggers:**
- Lobby: Redis Streams entry on `runtime:stop_jobs` with envelope
`{game_id, reason, requested_at_ms}`. `reason ∈ {orphan_cleanup, cancelled, finished,
admin_request, timeout}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/stop` with body
`{reason}`.
**Pre-conditions:**
- Lease acquired.
**Flow on success:**
1. Read `runtime_records.{game_id}`. If `status` is `stopped` or `removed`, return
idempotent success (`error_code=replay_no_op`).
2. `docker stop` with `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`). Docker fires
SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP
shutdown endpoint on the engine.
3. Update `runtime_records` (`status=stopped`, `stopped_at`, `last_op_at`).
4. Append `operation_log` entry.
5. Publish `runtime:job_results` (for Lobby) or REST `200` (for REST callers).
The container stays in `exited` state until the cleanup worker removes it (TTL) or an admin
command forces removal.
**Failure paths:**
| Failure | Outcome |
| --- | --- |
| Container not found in Docker but record `running` | Update record `status=removed`, publish `container_disappeared`, return `success` (RTM treats this as already-stopped). |
| `docker stop` returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |
### Restart
**Triggers:**
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/restart`.
Restart is **recreate**: stop + remove + run with the same `image_ref` and the same bind
mount. `container_id` changes; `engine_endpoint` is stable.
**Flow:**
1. Read `runtime_records.{game_id}`. The current `image_ref` is captured.
2. Acquire lease.
3. Run the stop flow (without releasing the lease).
4. `docker rm` the container.
5. Run the start flow with the captured `image_ref`.
6. Append a single `operation_log` entry with `op_kind=restart` and a correlation id linking
the implicit stop and start log entries.
If any inner step fails, the operation log records the partial outcome and the outer caller
receives the same failure; the runtime record converges to whatever state Docker reports.
### Patch
**Triggers:**
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/patch` with body
`{image_ref}`.
Patch is restart with a **new** `image_ref`. The engine reads its state from the bind mount
on startup, so any data written before the patch survives.
**Pre-conditions:**
- New and current image refs both parse as semver tags. `image_ref_not_semver` failure
otherwise.
- Major and minor versions are equal between current and new (`semver_patch_only` failure
otherwise).
**Flow:** identical to restart, with a new `image_ref` injected before the start step.
`operation_log` entry has `op_kind=patch`.
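A sketch of the two pre-condition checks, assuming tags of the shape `MAJOR.MINOR.PATCH` (the
helpers are illustrative; failures map to `image_ref_not_semver` and `semver_patch_only`):
```go
package main // illustrative only

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

var (
	errNotSemver       = errors.New("image_ref_not_semver")
	errSemverPatchOnly = errors.New("semver_patch_only")
)

// tagOf extracts the tag from a ref such as "registry/galaxy/game:1.4.2".
func tagOf(imageRef string) (string, error) {
	i := strings.LastIndex(imageRef, ":")
	if i < 0 || i == len(imageRef)-1 {
		return "", errNotSemver
	}
	return imageRef[i+1:], nil
}

// majorMinor parses the leading MAJOR.MINOR of a semver-shaped tag.
func majorMinor(tag string) (int, int, error) {
	parts := strings.SplitN(strings.TrimPrefix(tag, "v"), ".", 3)
	if len(parts) != 3 {
		return 0, 0, errNotSemver
	}
	major, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil {
		return 0, 0, errNotSemver
	}
	return major, minor, nil
}

// validatePatch rejects patches that change the major or minor version.
func validatePatch(currentRef, newRef string) error {
	curTag, err := tagOf(currentRef)
	if err != nil {
		return err
	}
	newTag, err := tagOf(newRef)
	if err != nil {
		return err
	}
	cMaj, cMin, err := majorMinor(curTag)
	if err != nil {
		return err
	}
	nMaj, nMin, err := majorMinor(newTag)
	if err != nil {
		return err
	}
	if cMaj != nMaj || cMin != nMin {
		return fmt.Errorf("%w: %s -> %s", errSemverPatchOnly, curTag, newTag)
	}
	return nil
}
```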
### Cleanup
**Triggers:**
- Periodic worker: every container with `runtime_records.status=stopped` and
`last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- Admin Service: `DELETE /api/v1/internal/runtimes/{game_id}/container`.
**Pre-conditions:**
- The container is not in `running` state. RTM refuses to remove a running container through
this path; stop first.
**Flow:**
1. Acquire lease.
2. `docker rm` the container.
3. Update `runtime_records` (`status=removed`, `removed_at`, `current_container_id=NULL`,
`last_op_at`).
4. Append `operation_log` entry (`op_kind=cleanup_container`,
`op_source ∈ {auto_ttl, admin_rest}`).
The host state directory is left untouched.
## Health Monitoring
Three independent sources feed `runtime:health_events` and `health_snapshots`:
1. **Docker events listener.** Subscribes to the Docker events stream and filters
container-scoped events by the `com.galaxy.owner=rtmanager` label written into every
container by the start service. Emits:
- `container_exited` (action=`die` with non-zero exit code; exit `0` is the normal
graceful stop and is suppressed).
- `container_oom` (action=`oom`).
- `container_disappeared` (action=`destroy` observed for a `runtime_records.status=running`
row whose `current_container_id` still matches the destroyed container, i.e. a destroy
RTM did not initiate).
`container_started` is emitted by the start service when it runs the container (see
`internal/service/startruntime`), not by this listener.
2. **Periodic Docker inspect** every `RTMANAGER_INSPECT_INTERVAL` (default `30s`). Emits
`inspect_unhealthy` when:
- `RestartCount` increases between observations;
- `State.Status != "running"` for a record marked running;
- `State.Health.Status == "unhealthy"` if the image declares a Docker `HEALTHCHECK`.
3. **Active HTTP probe** every `RTMANAGER_PROBE_INTERVAL` (default `15s`). Calls
`GET {engine_endpoint}/healthz` with `RTMANAGER_PROBE_TIMEOUT` (default `2s`). Emits:
- `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures
(default `3`);
- `probe_recovered` on the first success after a `probe_failed` was published.
Every emission updates `health_snapshots.{game_id}` (latest event becomes the snapshot) and
appends to `runtime:health_events`.
In v1, RTM publishes admin-only notification intents only for first-touch failures of the
start flow. All ongoing health changes (probe failures, OOMs, exits) flow through
`runtime:health_events` only. `Game Master` is the consumer that decides whether to escalate
runtime-level events into notifications.
The three workers that implement the sources above live in
`internal/worker/{dockerevents,dockerinspect,healthprobe}`. Their design rationale —
`container_started` ownership, `container_disappeared` emission rules, `die` exit-code
suppression, probe hysteresis state model, parallel-probe cap, and the events-listener
reconnect policy — is captured in [`docs/workers.md`](docs/workers.md).
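The probe hysteresis described above could be tracked with a small per-game state machine like
this sketch (the real worker lives in `internal/worker/healthprobe`):
```go
package main // illustrative only

// probeState tracks consecutive failures for one game and decides when to emit
// probe_failed / probe_recovered.
type probeState struct {
	consecutiveFailures int
	failedPublished     bool
}

// observe returns the event type to publish for one probe result, or "" for none.
func (s *probeState) observe(ok bool, threshold int) string {
	if ok {
		s.consecutiveFailures = 0
		if s.failedPublished {
			s.failedPublished = false
			return "probe_recovered"
		}
		return ""
	}
	s.consecutiveFailures++
	if !s.failedPublished && s.consecutiveFailures >= threshold {
		s.failedPublished = true
		return "probe_failed"
	}
	return ""
}
```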
## Reconciliation
RTM never assumes Docker and PostgreSQL are in sync.
At startup (blocking, before workers start) and every `RTMANAGER_RECONCILE_INTERVAL`
(default `5m`):
1. List Docker containers with label `com.galaxy.owner=rtmanager`.
2. For each running container without a matching record:
- Insert a `runtime_records` row with `status=running`, the discovered
`current_image_ref`, `engine_endpoint`, and `started_at` taken from
`com.galaxy.started_at_ms` if present (otherwise from `State.StartedAt`).
- Append `operation_log` entry with `op_kind=reconcile_adopt`,
`op_source=auto_reconcile`.
- **Never stop or remove an unrecorded container.** Operators may have started one
manually for diagnostics; RTM stays out of their way.
3. For each `runtime_records` row with `status=running` whose container is missing:
- Update `status=removed`, `removed_at=now`, `current_container_id=NULL`.
- Publish `runtime:health_events` `container_disappeared`.
- Append `operation_log` entry with `op_kind=reconcile_dispose`.
4. For each `runtime_records` row with `status=running` whose container exists but is in the
   `exited` state:
- Update `status=stopped`, `stopped_at=now` (reconciler observation time).
- Publish `runtime:health_events` `container_exited` with the observed exit code.
The reconciler implementation lives at `internal/worker/reconcile/` and the periodic
TTL-cleanup worker at `internal/worker/containercleanup/`; the cleanup worker delegates
removal to `internal/service/cleanupcontainer/`. The design rationale — the per-game
lease around every drift mutation, the third `observed_exited` path beyond the two
named cases, the synchronous `ReconcileNow` plus periodic `Component` split, and why
the cleanup worker is a thin TTL filter on top of the existing service — is captured in
[`docs/workers.md`](docs/workers.md).
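The drift cases above reduce to a small classification, sketched below (names are illustrative;
the real worker also takes the per-game lease before mutating anything):
```go
package main // illustrative only

// driftAction classifies one (record, container) pair during reconciliation.
type driftAction string

const (
	driftNone           driftAction = ""
	driftAdopt          driftAction = "adopt"           // running container, no record
	driftDispose        driftAction = "dispose"         // record running, container gone
	driftObservedExited driftAction = "observed_exited" // record running, container exited
)

func classify(recordExists, recordRunning, containerExists, containerRunning bool) driftAction {
	switch {
	case containerRunning && !recordExists:
		return driftAdopt
	case recordExists && recordRunning && !containerExists:
		return driftDispose
	case recordExists && recordRunning && containerExists && !containerRunning:
		return driftObservedExited
	default:
		return driftNone
	}
}
```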
## Trusted Surfaces
### Internal REST
The internal REST surface is consumed by `Game Master` (sync interactions for inspect,
restart, patch, stop, cleanup) and `Admin Service` (operational tooling, force-cleanup).
The listener is unauthenticated; downstream services rely on network segmentation.
| Method | Path | Operation ID | Caller |
| --- | --- | --- | --- |
| `GET` | `/healthz` | `internalHealthz` | platform probes |
| `GET` | `/readyz` | `internalReadyz` | platform probes |
| `GET` | `/api/v1/internal/runtimes` | `internalListRuntimes` | GM, Admin |
| `GET` | `/api/v1/internal/runtimes/{game_id}` | `internalGetRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/start` | `internalStartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/stop` | `internalStopRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/restart` | `internalRestartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/patch` | `internalPatchRuntime` | GM, Admin |
| `DELETE` | `/api/v1/internal/runtimes/{game_id}/container` | `internalCleanupRuntimeContainer` | Admin |
Request and response shapes are defined in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
Unknown JSON fields are rejected with `invalid_request`.
Callers identify themselves through the optional `X-Galaxy-Caller`
request header (`gm` for `Game Master`, `admin` for `Admin Service`).
The header is recorded as `op_source` in `operation_log` (`gm_rest` or
`admin_rest`); when missing or carrying any other value Runtime
Manager defaults to `op_source = admin_rest`. The header is documented
on every runtime endpoint of
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
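The header-to-`op_source` rule is small enough to show directly (a sketch; the function name is
an assumption):
```go
package main // illustrative only

import "net/http"

// opSourceFromCaller maps the optional X-Galaxy-Caller header onto the
// operation_log op_source values, defaulting to admin_rest as described above.
func opSourceFromCaller(r *http.Request) string {
	switch r.Header.Get("X-Galaxy-Caller") {
	case "gm":
		return "gm_rest"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}
```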
## Async Stream Contracts
### `runtime:start_jobs` (in)
Producer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference. Lobby resolves it from `target_engine_version` using `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `requested_at_ms` | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |
### `runtime:stop_jobs` (in)
Producer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `reason` | enum | `orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`. Recorded in `operation_log.error_code` when the reason matters; otherwise opaque. |
| `requested_at_ms` | int64 | |
### `runtime:job_results` (out)
Producer: `Runtime Manager`. Consumer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `outcome` | enum | `success`, `failure`. |
| `container_id` | string | Required for `success`. Empty on `failure`. |
| `engine_endpoint` | string | Required for `success`. Empty on `failure`. |
| `error_code` | string | Stable code. `replay_no_op` for idempotent re-runs. |
| `error_message` | string | Operator-readable detail. |
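Taken together, the three job-stream payloads could be modelled as the following Go structs
(field names and tags are illustrative; the AsyncAPI files are authoritative):
```go
package main // illustrative only

// StartJob is the runtime:start_jobs envelope produced by Game Lobby.
type StartJob struct {
	GameID        string `json:"game_id"`
	ImageRef      string `json:"image_ref"`
	RequestedAtMS int64  `json:"requested_at_ms"`
}

// StopJob is the runtime:stop_jobs envelope.
type StopJob struct {
	GameID        string `json:"game_id"`
	Reason        string `json:"reason"` // orphan_cleanup | cancelled | finished | admin_request | timeout
	RequestedAtMS int64  `json:"requested_at_ms"`
}

// JobResult is the runtime:job_results envelope consumed by Game Lobby.
type JobResult struct {
	GameID         string `json:"game_id"`
	Outcome        string `json:"outcome"` // success | failure
	ContainerID    string `json:"container_id,omitempty"`
	EngineEndpoint string `json:"engine_endpoint,omitempty"`
	ErrorCode      string `json:"error_code,omitempty"`
	ErrorMessage   string `json:"error_message,omitempty"`
}
```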
### `runtime:health_events` (out, new)
Producer: `Runtime Manager`. Consumers: `Game Master`; `Game Lobby` and `Admin Service`
are reserved as future consumers.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `container_id` | string | The container observed (may differ from current after a restart race). |
| `event_type` | enum | See below. |
| `occurred_at_ms` | int64 | UTC milliseconds. |
| `details` | json | Type-specific payload. |
`event_type` values and their `details` schemas:
| `event_type` | `details` payload |
| --- | --- |
| `container_started` | `{image_ref}` |
| `container_exited` | `{exit_code, oom: bool}` |
| `container_oom` | `{exit_code}` |
| `container_disappeared` | `{}` |
| `inspect_unhealthy` | `{restart_count, state, health}` |
| `probe_failed` | `{consecutive_failures, last_status, last_error}` |
| `probe_recovered` | `{prior_failure_count}` |
The full schema is enforced by [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml).
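In the same spirit, a health event could be decoded as below (a sketch; `details` is kept raw
because its shape depends on `event_type`):
```go
package main // illustrative only

import "encoding/json"

// HealthEvent is one runtime:health_events entry.
type HealthEvent struct {
	GameID       string          `json:"game_id"`
	ContainerID  string          `json:"container_id"`
	EventType    string          `json:"event_type"` // e.g. container_started, probe_failed
	OccurredAtMS int64           `json:"occurred_at_ms"`
	Details      json.RawMessage `json:"details"` // type-specific payload, see the table above
}
```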
## Notification Contracts
`Runtime Manager` publishes admin-only notification intents only for failures invisible to
any other service:
| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Image pull error during start | `runtime.image_pull_failed` | admin | email |
| `docker create` / `docker start` error | `runtime.container_start_failed` | admin | email |
| Configuration validation error at start (bad image_ref, missing network) | `runtime.start_config_invalid` | admin | email |
Constructors live in `galaxy/pkg/notificationintent`. Catalog entries live in
[`../notification/README.md`](../notification/README.md) and
[`../notification/api/intents-asyncapi.yaml`](../notification/api/intents-asyncapi.yaml).
All three intents share the frozen field set
`{game_id, image_ref, error_code, error_message, attempted_at_ms}`; the
`_ms` suffix on `attempted_at_ms` follows the repo-wide convention for
millisecond integer fields.
The Redis Streams publisher wrapper used to emit these intents from RTM
ships in `internal/adapters/notificationpublisher/`; the rationale for the
signature shim that drops the upstream entry id lives in
[`docs/domain-and-ports.md` §7](docs/domain-and-ports.md) and the production
wiring is documented in [`docs/adapters.md`](docs/adapters.md).
Runtime-level changes after a successful start (probe failures, OOM, container exited)
**do not** produce notifications from RTM. Game Master decides whether to escalate.
## Persistence Layout
### PostgreSQL durable state (schema `rtmanager`)
| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game, latest known runtime status. | `game_id` |
| `operation_log` | Append-only audit of every operation RTM performed. | `id` (auto) |
| `health_snapshots` | Latest health observation per game. | `game_id` |
`runtime_records` columns:
- `game_id` — primary key, references Lobby's identifier.
- `status` — `running | stopped | removed`.
- `current_container_id` — nullable when `status=removed`.
- `current_image_ref` — non-null when status is `running` or `stopped`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `state_path` — absolute host path of the bind-mounted directory.
- `docker_network` — network name observed at create time.
- `started_at`, `stopped_at`, `removed_at` — last transition timestamps.
- `last_op_at` — drives retention TTL.
- `created_at` — first time RTM saw the game.
`operation_log` columns:
- `id`, `game_id`, `op_kind` (`start | stop | restart | patch | cleanup_container |
reconcile_adopt | reconcile_dispose`), `op_source` (`lobby_stream | gm_rest | admin_rest |
auto_ttl | auto_reconcile`), `source_ref` (stream entry id, REST request id, or admin
user), `image_ref`, `container_id`, `outcome` (`success | failure`), `error_code`,
`error_message`, `started_at`, `finished_at`.
`health_snapshots` columns:
- `game_id`, `container_id`, `status`
(`healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared`),
`source` (`docker_event | inspect | probe`), `details` (jsonb), `observed_at`.
Indexes:
- `runtime_records (status, last_op_at)` — drives cleanup worker.
- `operation_log (game_id, started_at DESC)` — drives audit reads.
Migrations are embedded; a single `00001_init.sql` creates the schema (single-init pre-launch
policy from `ARCHITECTURE.md` §Persistence Backends).
### Redis runtime-coordination state
| Key shape | Purpose |
| --- | --- |
| `rtmanager:stream_offsets:{label}` | Last processed entry id per consumer (`startjobs`, `stopjobs`). Same shape as Lobby. |
| `rtmanager:game_lease:{game_id}` | Per-game lease string (`SET ... NX PX <ttl>`). TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default 60s); not renewed mid-operation in v1. The trade-off is documented in [`docs/services.md` §1](docs/services.md). |
Stream key shapes themselves are configurable:
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
## Error Model
Error envelope: `{ "error": { "code": "...", "message": "..." } }`, identical to Lobby's.
Stable error codes:
| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `not_found` | Runtime record does not exist. |
| `conflict` | Operation incompatible with current `status`. |
| `service_unavailable` | Dependency unavailable (Docker daemon, PG, Redis). |
| `internal_error` | Unspecified failure. |
| `image_pull_failed` | Image pull attempt failed. |
| `image_ref_not_semver` | Patch attempted with a tag that is not parseable semver. |
| `semver_patch_only` | Patch attempted across major/minor boundary. |
| `container_start_failed` | `docker create` / `docker start` failed. |
| `start_config_invalid` | Network missing, bind path inaccessible, or other config error. |
| `docker_unavailable` | Docker daemon ping failed. |
| `replay_no_op` | Idempotent replay; outcome is success but no work was done. |
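For reference, a Go representation of the envelope above might look like this (the example
message in the comment is hypothetical):
```go
package main // illustrative only

// errorEnvelope matches the documented shape:
//   {"error": {"code": "conflict", "message": "runtime is not stopped"}}
type errorEnvelope struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}
```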
## Configuration
All variables use the `RTMANAGER_` prefix. A missing required variable causes a fail-fast exit at startup.
### Required
- `RTMANAGER_INTERNAL_HTTP_ADDR`
- `RTMANAGER_POSTGRES_PRIMARY_DSN`
- `RTMANAGER_REDIS_MASTER_ADDR`
- `RTMANAGER_REDIS_PASSWORD`
- `RTMANAGER_DOCKER_HOST`
- `RTMANAGER_DOCKER_NETWORK`
- `RTMANAGER_GAME_STATE_ROOT`
### Configuration groups
**Listener:**
- `RTMANAGER_INTERNAL_HTTP_ADDR` (e.g. `:8096`).
- `RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `15s`).
- `RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).
**Docker:**
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`).
- `RTMANAGER_DOCKER_API_VERSION` (default empty — let SDK negotiate).
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`).
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty).
- `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`,
values `if_missing | always | never`).
**Container defaults:**
- `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`).
- `RTMANAGER_DEFAULT_MEMORY` (default `512m`).
- `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`).
- `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`).
- `RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- `RTMANAGER_ENGINE_STATE_ENV_NAME` (default `GAME_STATE_PATH`).
- `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`).
- `RTMANAGER_GAME_STATE_OWNER_UID` (default `0`).
- `RTMANAGER_GAME_STATE_OWNER_GID` (default `0`).
- `RTMANAGER_GAME_STATE_ROOT` (host path).
**Postgres:**
- `RTMANAGER_POSTGRES_PRIMARY_DSN` (`postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable`).
- `RTMANAGER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used in v1).
- `RTMANAGER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `RTMANAGER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `RTMANAGER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).
**Redis:**
- `RTMANAGER_REDIS_MASTER_ADDR`.
- `RTMANAGER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `RTMANAGER_REDIS_PASSWORD`.
- `RTMANAGER_REDIS_DB` (default `0`).
- `RTMANAGER_REDIS_OPERATION_TIMEOUT` (default `2s`).
**Streams:**
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
- `RTMANAGER_STREAM_BLOCK_TIMEOUT` (default `5s`).
**Health monitoring:**
- `RTMANAGER_INSPECT_INTERVAL` (default `30s`).
- `RTMANAGER_PROBE_INTERVAL` (default `15s`).
- `RTMANAGER_PROBE_TIMEOUT` (default `2s`).
- `RTMANAGER_PROBE_FAILURES_THRESHOLD` (default `3`).
**Reconciler / cleanup:**
- `RTMANAGER_RECONCILE_INTERVAL` (default `5m`).
- `RTMANAGER_CLEANUP_INTERVAL` (default `1h`).
**Coordination:**
- `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60`).
**Lobby internal client:**
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` (e.g. `http://lobby:8095`).
- `RTMANAGER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).
**Logging:**
- `RTMANAGER_LOG_LEVEL` (default `info`).
**Lifecycle:**
- `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`).
**Telemetry:** uses the standard OTLP env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
`OTEL_EXPORTER_OTLP_PROTOCOL`, etc.) shared with other Galaxy services.
## Observability
### Metrics (OpenTelemetry, low cardinality)
- `rtmanager.start_outcomes` — counter, labels `outcome`, `error_code`, `op_source`.
- `rtmanager.stop_outcomes` — counter, labels `outcome`, `reason`, `op_source`.
- `rtmanager.restart_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.patch_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.cleanup_outcomes` — counter, labels `outcome`, `op_source`.
- `rtmanager.docker_op_latency` — histogram, label `op` (`pull | create | start | stop | rm
| inspect | events`).
- `rtmanager.health_events` — counter, label `event_type`.
- `rtmanager.reconcile_drift` — counter, label `kind` (`adopt | dispose | observed_exited`).
- `rtmanager.runtime_records_by_status` — gauge, label `status`.
- `rtmanager.lease_acquire_latency` — histogram.
- `rtmanager.notification_intents` — counter, label `notification_type`.
### Structured logs (slog JSON to stdout)
Common fields on every entry: `service=rtmanager`, `request_id`, `trace_id`, `span_id`,
`game_id` (when known), `container_id` (when known), `op_kind`, `op_source`, `outcome`,
`error_code`.
Worker-specific fields: `stream_entry_id` (consumers), `event_type` (health), `image_ref`
(start/patch).
## Verification
Service-level (TESTING.md §7):
- Unit tests for every service-layer operation against mocked Docker.
- Adapter tests (PG, Redis, Docker) using `testcontainers-go` for PG/Redis and the Docker
daemon socket for the real Docker adapter.
- Contract tests for `internal-openapi.yaml`, `runtime-jobs-asyncapi.yaml`,
`runtime-health-asyncapi.yaml`.
Service-local integration suite under `rtmanager/integration/`:
- Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real
`galaxy/game` test image.
- Replay safety (duplicate stream entries are no-ops).
- Health observability (kill the engine externally, observe `container_disappeared`; relaunch
manually, observe reconcile adopt).
- Notification on first-touch failures (publish a start with an unresolvable image, observe
`runtime.image_pull_failed` intent and a `failure` job result).
Inter-service suite under `integration/lobbyrtm/`:
- Real Lobby + real RTM + real `galaxy/game` test image. Covers happy path, cancel, and
start-failed flows.
Manual smoke (development):
```sh
docker network create galaxy-net # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```
After start, `curl http://localhost:8096/readyz` returns `200`. Driving Lobby through its
public flow brings up `galaxy-game-{game_id}` containers; RTM logs each lifecycle transition
and publishes the corresponding stream entries.