# Runtime and Components

The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.

```mermaid
flowchart LR
    subgraph Clients
        GM["Game Master"]
        Admin["Admin Service"]
        Lobby["Game Lobby"]
    end

    subgraph RTM["Runtime Manager process"]
        InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
        StartJobs["startjobsconsumer"]
        StopJobs["stopjobsconsumer"]
        DockerEvents["dockerevents listener"]
        HealthProbe["healthprobe worker"]
        DockerInspect["dockerinspect worker"]
        Reconcile["reconcile worker"]
        Cleanup["containercleanup worker"]
        Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
        IntentPublisher["notification:intents publisher"]
        ResultsPublisher["runtime:job_results publisher"]
        HealthPublisher["runtime:health_events publisher"]
        Telemetry["Logs, traces, metrics"]
    end

    Docker["Docker Daemon"]
    Engine["galaxy-game-{game_id} container"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases + offsets"]
    LobbyHTTP["Lobby internal HTTP"]

    Lobby -. runtime:start_jobs .-> StartJobs
    Lobby -. runtime:stop_jobs .-> StopJobs
    GM --> InternalHTTP
    Admin --> InternalHTTP

    StartJobs --> Services
    StopJobs --> Services
    InternalHTTP --> Services

    Services --> Docker
    Services --> Postgres
    Services --> Redis
    Services --> ResultsPublisher
    Services --> HealthPublisher
    Services --> IntentPublisher
    Services -. GET diagnostic .-> LobbyHTTP

    DockerEvents --> Docker
    DockerInspect --> Docker
    HealthProbe --> Engine
    Reconcile --> Docker
    Reconcile --> Postgres
    Cleanup --> Postgres
    Cleanup --> Services

    DockerEvents --> HealthPublisher
    DockerInspect --> HealthPublisher
    HealthProbe --> HealthPublisher

    HealthPublisher --> Redis
    ResultsPublisher --> Redis
    IntentPublisher --> Redis

    StartJobs --> Redis
    StopJobs --> Redis
    InternalHTTP --> Postgres

    Docker -->|create / start / stop / rm| Engine
    Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]

    InternalHTTP --> Telemetry
    Services --> Telemetry
    StartJobs --> Telemetry
    StopJobs --> Telemetry
    DockerEvents --> Telemetry
    HealthProbe --> Telemetry
    DockerInspect --> Telemetry
    Reconcile --> Telemetry
    Cleanup --> Telemetry
```

Notes:

- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
  goose migrations fail, when Redis ping fails, when the Docker daemon
  ping fails, or when the configured Docker network is missing. Lobby
  reachability is **not** verified at boot — the start service's
  diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
  outside of debug logging ([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup, before
  `app.App.Run` registers any other component, then re-runs
  periodically as a regular `Component`. The synchronous pass
  guarantees that the events listener never observes an orphaned
  container from a prior process without a matching PG record
  ([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
  (`/healthz`, `/readyz`) and the trusted REST surface for Game Master
  and Admin Service. There is no public listener — RTM does not face
  end users.

## Listeners

| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |

Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:

- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`

The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source` ([`services.md` §18](services.md)).

Probe routes:

- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
  the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
  Docker daemon, then asserts the configured Docker network exists.
  Returns `{"status":"ready"}` only when every check passes; otherwise
  returns `503` with the canonical error envelope.

## Background Workers

Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).

| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |

The events listener and the inspect worker do **not** emit
`container_started` — that event is owned by the start service
([`workers.md` §1](workers.md)). Nor do they emit
`container_disappeared` autonomously when a record is missing or
stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).

## Lifecycle Services

The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.

| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |

All services share three behaviours captured in
[`services.md`](services.md):

- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
  TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
  not by the caller — which keeps consumer and REST callers symmetric
  ([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
  `ContainerID`, `EngineEndpoint`) is what consumers and REST
  handlers translate into job_results / HTTP responses
  ([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
  and three of the failure codes (`start_config_invalid`,
  `image_pull_failed`, `container_start_failed`) also publish a
  `runtime.*` admin notification intent
  ([`services.md` §4](services.md)).

## Synchronous Upstream Client

| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |

The Lobby client is the only synchronous outbound transport RTM
holds. Every other interaction (Notification Service, Game Master,
Admin Service) crosses an asynchronous boundary or is initiated by
the peer.

## Stream Offsets

Each consumer persists its position under a fixed label so a process
restart preserves stream progress.

| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |

The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).

The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume
them itself.

## Configuration Groups

The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:

- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
  `RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
  `RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
  `RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
  `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
  `RTMANAGER_CONTAINER_RETENTION_DAYS`,
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
  `RTMANAGER_ENGINE_STATE_ENV_NAME`,
  `RTMANAGER_GAME_STATE_DIR_MODE`,
  `RTMANAGER_GAME_STATE_OWNER_UID`,
  `RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
  `RTMANAGER_POSTGRES_REPLICA_DSNS`,
  `RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
  `RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
  `RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
  `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
  `RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
  `RTMANAGER_REDIS_STOP_JOBS_STREAM`,
  `RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
  `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
  `RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
  `RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
  `RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
  `RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
  `RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
  `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
  `RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.

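As an illustration, a required-only environment for a local run might look like this; every value below is a placeholder chosen for the example, not a documented default.

```shell
# Required settings — rtmanager refuses to boot without these.
# All values here are illustrative placeholders.
export RTMANAGER_INTERNAL_HTTP_ADDR=":8096"
export RTMANAGER_POSTGRES_PRIMARY_DSN="postgres://rtmanager:secret@127.0.0.1:5432/galaxy?sslmode=disable"
export RTMANAGER_REDIS_MASTER_ADDR="127.0.0.1:6379"
export RTMANAGER_REDIS_PASSWORD="secret"
export RTMANAGER_DOCKER_HOST="unix:///var/run/docker.sock"
export RTMANAGER_DOCKER_NETWORK="galaxy-net"
export RTMANAGER_GAME_STATE_ROOT="/var/lib/galaxy/game-state"
```

All other groups fall back to the defaults documented in `../README.md`.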
## Runtime Notes

- **Single-instance v1.** Multi-instance Runtime Manager with Redis
  Streams consumer groups is explicitly out of scope for the current
  iteration. The per-game lease serialises operations on one game
  across the consumer + REST entry points; cross-instance
  coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
  `SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
  (default `60s`). The lease is **not renewed mid-operation** in v1;
  long pulls of multi-GB images can therefore expire the lease
  before the operation finishes — the trade-off is documented in
  [`services.md` §1](services.md). The reconciler honours the same
  lease around every drift mutation ([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
  reconcile mutation appends one row to `rtmanager.operation_log`.
  The `runtime:health_events` stream and the `notification:intents`
  emissions are best-effort — a publish failure logs at `Error` and
  proceeds, never rolling back the recorded operation
  ([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
  per-game `consecutiveFailures` and `failurePublished` counters in a
  mutex-guarded map. State is non-persistent: a process restart that
  loses the counters re-establishes hysteresis from scratch, and
  state for a game that transitions through `stopped → running` is
  pruned at the start of every probe tick
  ([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
  `if_missing` (default), `always`, and `never`. Image labels
  (`com.galaxy.cpu_quota`, `com.galaxy.memory`,
  `com.galaxy.pids_limit`) drive resource limits when present; the
  matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
  label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
  directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
  mode and uid/gid, but **never deletes them**. Removing the directory
  is operator domain (backup tooling, a future Admin Service
  workflow). A cleanup that removes the container leaves the
  directory intact.
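The probe hysteresis described above can be sketched as a small per-game state machine; the `prober` type and its method names are hypothetical (and the mutex is omitted for brevity), but the threshold and recovery transitions follow the documented rules.

```go
package main

import "fmt"

// probeState is the in-memory hysteresis record kept per game. It is lost on
// restart, matching the non-persistent behaviour described above.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// prober tracks hysteresis state; the real worker guards the map with a
// mutex, omitted here for brevity.
type prober struct {
	threshold int
	states    map[string]*probeState
}

// observe feeds one probe result and returns the event to publish, if any:
// "probe_failed" after `threshold` consecutive failures, "probe_recovered"
// on the first success after a published failure, "" otherwise.
func (p *prober) observe(gameID string, ok bool) string {
	st := p.states[gameID]
	if st == nil {
		st = &probeState{}
		p.states[gameID] = st
	}
	if ok {
		st.consecutiveFailures = 0
		if st.failurePublished {
			st.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	st.consecutiveFailures++
	if st.consecutiveFailures >= p.threshold && !st.failurePublished {
		st.failurePublished = true
		return "probe_failed"
	}
	return ""
}

func main() {
	p := &prober{threshold: 3, states: map[string]*probeState{}}
	for _, ok := range []bool{false, false, false, true} {
		if ev := p.observe("g1", ok); ev != "" {
			fmt.Println(ev) // probe_failed, then probe_recovered
		}
	}
}
```

The `failurePublished` flag is what makes the emission edge-triggered: a runtime that stays down produces exactly one `probe_failed`, not one per tick.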