Runtime Manager

Runtime Manager (RTM) is the only Galaxy platform service permitted to interact with the Docker daemon. It owns the lifecycle of galaxy/game engine containers and the technical runtime view of running games. Other services consume RTM via two transports: an asynchronous Redis Streams contract (used by Game Lobby) and a synchronous internal REST surface (used by Game Master and Admin Service).


Purpose

A running Galaxy game lives in exactly one Docker container. The platform must be able to:

  • create the container with the right engine version and configuration;
  • supply the engine with a stable storage location for game state;
  • keep the runtime status visible to platform-level services;
  • replace the container in place for patch upgrades and restarts;
  • remove containers that are no longer needed;
  • detect and surface engine failures to whoever should react.

Runtime Manager is the single component that performs these actions. It deliberately does not reason about platform metadata, membership, schedules, turn cutoffs, or any other business state. Game Lobby owns platform metadata; Game Master will own runtime business state when implemented.

Scope

Runtime Manager is the source of truth for:

  • the mapping game_id -> current_container_id for every running container;
  • the durable history of every start, stop, restart, patch, and cleanup operation it performed;
  • the most recent technical health observation per game (last Docker event, last successful or failed probe, last inspect result).

Runtime Manager is not the source of truth for:

  • any business or platform-level metadata of a game (owned by Game Lobby);
  • runtime state visible to players or operators as game state, including current turn, generation status, engine version registry (owned by Game Master);
  • the engine version catalogue or which engine version a game is allowed to use (Game Master is the future owner; Game Lobby supplies image_ref in v1);
  • contents of the engine state directory; that is engine domain;
  • backup, archival, or operator cleanup of state directories.

Non-Goals

  • Multi-instance operation in v1. Coordination is single-process; multiple replicas are an explicit future iteration.
  • Engine version arbitration. The producer (Game Lobby in v1, Game Master later) supplies image_ref.
  • Image registry control. Pull policy is configurable, but RTM does not push, retag, or promote images.
  • TLS or mTLS on the internal listener. RTM trusts its network segment.
  • Direct delivery of player-visible push notifications. RTM publishes admin-only notification intents only for failures invisible elsewhere; everything else is delegated.
  • Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached through unix:///var/run/docker.sock.

Position in the System

```mermaid
flowchart LR
    Lobby["Game Lobby"]
    GM["Game Master"]
    Admin["Admin Service"]
    Notify["Notification Service"]
    RTM["Runtime Manager"]
    Engine["Game Engine container"]
    Docker["Docker Daemon"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases"]

    Lobby -->|runtime:start_jobs / stop_jobs| RTM
    RTM -->|runtime:job_results| Lobby
    GM -->|internal REST| RTM
    Admin -->|internal REST| RTM
    RTM -->|notification:intents (admin)| Notify
    RTM -->|runtime:health_events| Redis
    RTM <--> Docker
    Docker -->|create / start / stop / rm| Engine
    RTM --> Postgres
    RTM --> Redis
    Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```

Responsibility Boundaries

Runtime Manager is responsible for:

  • accepting start, stop, restart, patch, inspect, and cleanup requests through the supported transports and producing one durable outcome per request;
  • creating Docker containers from a producer-supplied image_ref and binding them to the configured Docker network and host state directory;
  • enforcing the one-game-one-container invariant in its own state and on Docker;
  • monitoring container health through Docker events, periodic inspect, and active HTTP probes;
  • publishing technical runtime events (runtime:job_results, runtime:health_events) and admin-only notification intents for failures that no other service can observe;
  • reconciling its persistent state with Docker reality on startup and periodically;
  • removing exited containers automatically by retention TTL or explicitly by admin command.

Runtime Manager is not responsible for:

  • evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
  • registering a started runtime with Game Master (Lobby calls GM after a successful job result);
  • mapping platform users to engine players (GM owns this mapping);
  • player command routing (GM proxies player commands directly to engine);
  • cleaning up host state directories;
  • patching the engine version registry; the registry lives in Game Master.

Container Model

Network

Containers attach to a single user-defined Docker bridge network. The network is provisioned outside RTM: docker-compose, Terraform, or an operator runbook creates galaxy-net (or whatever name is configured via RTMANAGER_DOCKER_NETWORK).

RTM validates the network's presence at startup. A missing network is a fail-fast condition; the process exits non-zero before opening any listener.

DNS name and engine endpoint

Each container is created with hostname galaxy-game-{game_id} and is attached to the configured network. Docker's embedded DNS resolves the hostname for any other container in the same network.

The engine_endpoint published in runtime:job_results and visible through the inspect REST endpoint is the full URL http://galaxy-game-{game_id}:8080. The port is fixed at 8080 inside the container; RTM does not publish ports to the host.

Restart and patch keep the same DNS name. The container_id changes; the engine_endpoint does not.

State storage (bind mount)

Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale is operator-friendly backup and inspection.

  • Host root: RTMANAGER_GAME_STATE_ROOT (operator-supplied, e.g. /var/lib/galaxy/games).
  • Per-game directory: <RTMANAGER_GAME_STATE_ROOT>/{game_id}. RTM creates it with permissions RTMANAGER_GAME_STATE_DIR_MODE (default 0750) and ownership RTMANAGER_GAME_STATE_OWNER_UID / _GID (default 0:0 — operator overrides for non-root engine).
  • Bind mount: the per-game directory is mounted into the container at the path declared by RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
  • Environment: the container receives GAME_STATE_PATH=<mount path> and resolves its state path from this variable. The same value is also passed as STORAGE_PATH for backward compatibility; both names are accepted in v1.

RTM never deletes the host state directory. Removing it is the responsibility of operator tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container through the cleanup endpoint or the retention TTL leaves the directory intact.
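
For illustration, a minimal sketch of the directory preparation and the bind specification it feeds into; the helper names (ensureStateDir, bindSpec) are hypothetical, and the mode, UID, and GID values come from the configuration group described above:

```go
package statedir

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ensureStateDir creates <root>/<gameID> with the configured mode and
// ownership and returns the absolute host path. Names and layout are
// illustrative; the real adapter lives elsewhere in the service.
func ensureStateDir(root, gameID string, mode fs.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll is subject to the process umask; re-apply the exact mode.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}

// bindSpec renders the Docker bind-mount string for the container:
// "<host dir>:<RTMANAGER_ENGINE_STATE_MOUNT_PATH>".
func bindSpec(hostDir, mountPath string) string {
	return fmt.Sprintf("%s:%s", hostDir, mountPath)
}
```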

Container labels

RTM applies the following labels to every container it creates:

| Label | Value | Purpose |
|---|---|---|
| com.galaxy.owner | rtmanager | Filter for docker ps and reconcile. |
| com.galaxy.kind | game-engine | Differentiates from infra containers. |
| com.galaxy.game_id | {game_id} | Reverse lookup from container to platform game. |
| com.galaxy.engine_image_ref | {image_ref} | Cross-check against runtime_records. |
| com.galaxy.started_at_ms | {ms} | Unambiguous start timestamp. |

Resource limits, by contrast, are read from labels on the resolved engine image rather than from the container labels above (see below).

Resource limits

Resource limits originate in the engine image, not in the producer envelope or RTM config:

| Image label | Container limit | RTM fallback config |
|---|---|---|
| com.galaxy.cpu_quota | --cpus value | RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0) |
| com.galaxy.memory | --memory value | RTMANAGER_DEFAULT_MEMORY (default 512m) |
| com.galaxy.pids_limit | --pids-limit value | RTMANAGER_DEFAULT_PIDS_LIMIT (default 512) |

If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass limits.
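
A sketch of the label-to-limit resolution under the fallback rule above; the Defaults and Limits types are illustrative, not the service's actual structs:

```go
package limits

import "strconv"

// Defaults mirrors the RTMANAGER_DEFAULT_* fallbacks.
type Defaults struct {
	CPUQuota  float64 // RTMANAGER_DEFAULT_CPU_QUOTA
	Memory    string  // RTMANAGER_DEFAULT_MEMORY, e.g. "512m"
	PidsLimit int64   // RTMANAGER_DEFAULT_PIDS_LIMIT
}

// Limits is what the container create call ultimately receives.
type Limits struct {
	CPUQuota  float64
	Memory    string
	PidsLimit int64
}

// FromImageLabels reads the com.galaxy.* labels of the resolved engine image;
// a missing or unparseable label falls back to the configured default.
func FromImageLabels(labels map[string]string, d Defaults) Limits {
	out := Limits{CPUQuota: d.CPUQuota, Memory: d.Memory, PidsLimit: d.PidsLimit}
	if v, err := strconv.ParseFloat(labels["com.galaxy.cpu_quota"], 64); err == nil && v > 0 {
		out.CPUQuota = v
	}
	if v := labels["com.galaxy.memory"]; v != "" {
		out.Memory = v // translated into a byte count when the container is created
	}
	if v, err := strconv.ParseInt(labels["com.galaxy.pids_limit"], 10, 64); err == nil && v > 0 {
		out.PidsLimit = v
	}
	return out
}
```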

Logging driver

Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver and its options when creating the container:

  • RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
  • RTMANAGER_DOCKER_LOG_OPTS (default empty; comma-separated key=value pairs).

RTM never reads the container's stdout itself. Operators consume engine logs via docker logs or via whatever sink the configured driver feeds (fluentd, journald, etc.).

The production Docker SDK adapter that creates and starts these containers lives at internal/adapters/docker/. Its design rationale — the fixed engine port, partial rollback on ContainerStart failure, the events-stream filter choice, and the mockgen-driven service-test fixture — is captured in docs/adapters.md.

Runtime Surface

Listeners

| Listener | Default address | Purpose |
|---|---|---|
| internal HTTP | :8096 (RTMANAGER_INTERNAL_HTTP_ADDR) | Probes (/healthz, /readyz) and the trusted REST surface for Game Master and Admin Service. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted network segment.

Background workers

| Worker | Driver | Description |
|---|---|---|
| startjobs consumer | Redis Stream runtime:start_jobs | Decodes start envelope and invokes the start service. |
| stopjobs consumer | Redis Stream runtime:stop_jobs | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker /events API | Subscribes with the label filter, emits runtime:health_events for container_started / exited / oom / disappeared. |
| Active HTTP probe | Periodic | GET {engine_endpoint}/healthz for every running runtime; emits probe_failed / probe_recovered with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits inspect_unhealthy when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles runtime_records with docker ps (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than RTMANAGER_CONTAINER_RETENTION_DAYS. |

Startup dependencies

In start order:

  1. PostgreSQL primary (DSN RTMANAGER_POSTGRES_PRIMARY_DSN). Goose migrations apply synchronously before any listener opens.
  2. Redis master (RTMANAGER_REDIS_MASTER_ADDR).
  3. Docker daemon at RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock). RTM verifies API ping and the presence of RTMANAGER_DOCKER_NETWORK.
  4. Telemetry exporter (OTLP grpc/http or stdout).
  5. Internal HTTP listener.
  6. Reconciler runs once and blocks until done.
  7. Background workers start.

A failure in any step is fatal and exits the process non-zero.

Probes

/healthz reports liveness — the process responds when the HTTP server is alive.

/readyz reports readiness — 200 only when:

  • the PostgreSQL pool can ping the primary;
  • the Redis master client can ping;
  • the Docker client can ping;
  • the configured Docker network exists.

Both probes are documented in ./api/internal-openapi.yaml.
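
A sketch of how the four readiness conditions could be aggregated behind /readyz; the Check type and the wiring are assumptions, only the pass/fail semantics come from this section:

```go
package probes

import (
	"context"
	"net/http"
	"time"
)

// Check is one named readiness dependency (PG ping, Redis ping, Docker ping,
// network presence). The concrete functions are wired in at startup.
type Check struct {
	Name string
	Fn   func(ctx context.Context) error
}

// ReadyzHandler returns 200 only when every check passes, 503 otherwise.
func ReadyzHandler(checks []Check) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for _, c := range checks {
			if err := c.Fn(ctx); err != nil {
				http.Error(w, c.Name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
}
```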

Lifecycles

All operations share a per-game-id Redis lease (rtmanager:game_lease:{game_id}, TTL RTMANAGER_GAME_LEASE_TTL_SECONDS, default 60). The lease serialises operations on a single game across all entry points (stream consumers and REST handlers). v1 does not renew the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before the operation finishes — the trade-off is documented in docs/services.md §1.
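
A sketch of the lease acquisition with go-redis, assuming the key shape and TTL described above; the owner token and release script are illustrative details, not the service's actual implementation:

```go
package lease

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease only if the caller still owns it.
var releaseScript = redis.NewScript(
	`if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) end return 0`)

// Acquire takes the per-game lease (SET NX PX). It returns a release func,
// or an error when another operation already holds the lease.
func Acquire(ctx context.Context, rdb *redis.Client, gameID string, ttl time.Duration) (func(context.Context) error, error) {
	key := fmt.Sprintf("rtmanager:game_lease:%s", gameID)
	token := uuid.NewString()
	ok, err := rdb.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, fmt.Errorf("game %s: lease already held", gameID)
	}
	release := func(ctx context.Context) error {
		return releaseScript.Run(ctx, rdb, []string{key}, token).Err()
	}
	return release, nil
}
```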

Start

Triggers:

  • Lobby: a Redis Streams entry on runtime:start_jobs with envelope {game_id, image_ref, requested_at_ms}.
  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/start with body {image_ref}.

Pre-conditions:

  • image_ref is a non-empty string and parseable as a Docker reference.
  • Configured Docker network exists.
  • The lease for {game_id} is acquired.

Flow on success:

  1. Read runtime_records.{game_id}. If status=running with the same image_ref, return the existing record (idempotent success, error_code=replay_no_op).
  2. Pull the image per RTMANAGER_IMAGE_PULL_POLICY (default if_missing).
  3. Inspect the resolved image, derive resource limits from labels.
  4. Ensure the per-game state directory exists with the configured mode and ownership.
  5. docker create with the configured network, hostname, labels, env (GAME_STATE_PATH, STORAGE_PATH), bind mount, log driver, resource limits.
  6. docker start.
  7. Upsert runtime_records (status=running, current_container_id, engine_endpoint, current_image_ref, started_at, last_op_at).
  8. Append operation_log entry (op_kind=start, outcome=success, source-specific op_source).
  9. Publish runtime:health_events container_started.
  10. For Lobby callers: publish runtime:job_results {game_id, outcome=success, container_id, engine_endpoint}. For REST callers: respond 200 with the runtime record.

Failure paths:

| Failure | PG side effect | Notification intent | Outcome to caller |
|---|---|---|---|
| Invalid image_ref shape, network missing | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |
| Image pull error | operation_log failure | runtime.image_pull_failed | failure / image_pull_failed |
| docker create / start error | operation_log failure | runtime.container_start_failed | failure / container_start_failed |
| State directory creation error | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |

A failed start never leaves a partially-running container: if docker create succeeded but the subsequent step failed, RTM removes the container before recording the failure.
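
A sketch of that rollback rule, written against a hypothetical narrow port rather than the Docker SDK; the interface and type names are placeholders for the real adapter:

```go
package startruntime

import "context"

// containerPort is the slice of the Docker adapter the start flow needs here.
// The real port and its SDK-backed implementation live under internal/.
type containerPort interface {
	Create(ctx context.Context, spec ContainerSpec) (containerID string, err error)
	Start(ctx context.Context, containerID string) error
	Remove(ctx context.Context, containerID string, force bool) error
}

// ContainerSpec stands in for the fully resolved create request
// (image_ref, network, hostname, labels, env, bind mount, limits).
type ContainerSpec struct{ /* ... */ }

// createAndStart enforces "a failed start never leaves a partially-running
// container": if Create succeeded but Start failed, the container is removed
// before the failure is recorded.
func createAndStart(ctx context.Context, docker containerPort, spec ContainerSpec) (string, error) {
	id, err := docker.Create(ctx, spec)
	if err != nil {
		return "", err // nothing to roll back
	}
	if err := docker.Start(ctx, id); err != nil {
		_ = docker.Remove(ctx, id, true) // best-effort rollback; failure is logged upstream
		return "", err
	}
	return id, nil
}
```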

The production start orchestrator that implements the flow and the failure paths above lives at internal/service/startruntime/. Its design rationale — why the per-game lease and the health-events publisher live with the start service, the Result-shaped contract consumed by the stream consumer and the REST handler, the rollback rule on Upsert failure, and the created_at-preservation rule for re-starts — is captured in docs/services.md.

Stop

Triggers:

  • Lobby: Redis Streams entry on runtime:stop_jobs with envelope {game_id, reason, requested_at_ms}. reason ∈ {orphan_cleanup, cancelled, finished, admin_request, timeout}.
  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/stop with body {reason}.

Pre-conditions:

  • Lease acquired.

Flow on success:

  1. Read runtime_records.{game_id}. If status is stopped or removed, return idempotent success (error_code=replay_no_op).
  2. docker stop with RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30). Docker fires SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP shutdown endpoint on the engine.
  3. Update runtime_records (status=stopped, stopped_at, last_op_at).
  4. Append operation_log entry.
  5. Publish runtime:job_results (for Lobby) or REST 200 (for REST callers).

The container stays in exited state until the cleanup worker removes it (TTL) or an admin command forces removal.

Failure paths:

| Failure | Outcome |
|---|---|
| Container not found in Docker but record running | Update record status=removed, publish container_disappeared, return success (RTM treats this as already-stopped). |
| docker stop returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |

Restart

Triggers:

  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/restart.

Restart is recreate: stop + remove + run with the same image_ref and the same bind mount. container_id changes; engine_endpoint is stable.

Flow:

  1. Read runtime_records.{game_id}. The current image_ref is captured.
  2. Acquire lease.
  3. Run the stop flow (without releasing the lease).
  4. docker rm the container.
  5. Run the start flow with the captured image_ref.
  6. Append a single operation_log entry with op_kind=restart and a correlation id linking the implicit stop and start log entries.

If any inner step fails, the operation log records the partial outcome and the outer caller receives the same failure; the runtime record converges to whatever state Docker reports.

Patch

Triggers:

  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/patch with body {image_ref}.

Patch is restart with a new image_ref. The engine reads its state from the bind mount on startup, so any data written before the patch survives.

Pre-conditions:

  • Both the new and the current image refs parse as semver tags (image_ref_not_semver failure otherwise).
  • Major and minor versions are equal between current and new (semver_patch_only failure otherwise).

Flow: identical to restart, with a new image_ref injected before the start step. operation_log entry has op_kind=patch.
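
A sketch of the two semver pre-conditions using a hand-rolled tag parse (no particular semver library is implied); the returned errors map onto the stable codes image_ref_not_semver and semver_patch_only:

```go
package patch

import (
	"errors"
	"strconv"
	"strings"
)

var (
	ErrNotSemver = errors.New("image_ref_not_semver")
	ErrNotPatch  = errors.New("semver_patch_only")
)

// tagVersion extracts MAJOR.MINOR.PATCH from the tag part of an image ref
// such as "registry.local/galaxy/game:1.4.2".
func tagVersion(imageRef string) (major, minor, patch int, err error) {
	i := strings.LastIndex(imageRef, ":")
	if i < 0 {
		return 0, 0, 0, ErrNotSemver
	}
	parts := strings.SplitN(strings.TrimPrefix(imageRef[i+1:], "v"), ".", 3)
	if len(parts) != 3 {
		return 0, 0, 0, ErrNotSemver
	}
	nums := make([]int, 3)
	for j, p := range parts {
		if nums[j], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, ErrNotSemver
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// validatePatch allows only a change of the patch component between the
// currently running image ref and the requested one.
func validatePatch(currentRef, newRef string) error {
	cMaj, cMin, _, err := tagVersion(currentRef)
	if err != nil {
		return err
	}
	nMaj, nMin, _, err := tagVersion(newRef)
	if err != nil {
		return err
	}
	if cMaj != nMaj || cMin != nMin {
		return ErrNotPatch
	}
	return nil
}
```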

Cleanup

Triggers:

  • Periodic worker: every container with runtime_records.status=stopped and last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
  • Admin Service: DELETE /api/v1/internal/runtimes/{game_id}/container.

Pre-conditions:

  • The container is not in running state. RTM refuses to remove a running container through this path; stop first.

Flow:

  1. Acquire lease.
  2. docker rm the container.
  3. Update runtime_records (status=removed, removed_at, current_container_id=NULL, last_op_at).
  4. Append operation_log entry (op_kind=cleanup_container, op_source ∈ {auto_ttl, admin_rest}).

The host state directory is left untouched.

Health Monitoring

Three independent sources feed runtime:health_events and health_snapshots:

  1. Docker events listener. Subscribes to the Docker events stream and filters container-scoped events by the com.galaxy.owner=rtmanager label written into every container by the start service. Emits:

    • container_exited (action=die with non-zero exit code; exit 0 is the normal graceful stop and is suppressed).
    • container_oom (action=oom).
    • container_disappeared (action=destroy observed for a runtime_records.status=running row whose current_container_id still matches the destroyed container, i.e. a destroy RTM did not initiate).

    container_started is emitted by the start service when it runs the container (see internal/service/startruntime), not by this listener.

  2. Periodic Docker inspect every RTMANAGER_INSPECT_INTERVAL (default 30s). Emits inspect_unhealthy when:

    • RestartCount increases between observations;
    • State.Status != "running" for a record marked running;
    • State.Health.Status == "unhealthy" if the image declares a Docker HEALTHCHECK.
  3. Active HTTP probe every RTMANAGER_PROBE_INTERVAL (default 15s). Calls GET {engine_endpoint}/healthz with RTMANAGER_PROBE_TIMEOUT (default 2s). Emits:

    • probe_failed after RTMANAGER_PROBE_FAILURES_THRESHOLD consecutive failures (default 3);
    • probe_recovered on the first success after a probe_failed was published.

Every emission updates health_snapshots.{game_id} (latest event becomes the snapshot) and appends to runtime:health_events.
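
A sketch of the per-runtime hysteresis state behind probe_failed / probe_recovered; the field and method names are illustrative:

```go
package healthprobe

// probeState tracks one runtime's consecutive probe failures so that
// probe_failed fires once at the threshold and probe_recovered fires once
// on the first success afterwards.
type probeState struct {
	consecutiveFailures int
	failedPublished     bool
}

// observe records one probe result and returns the event to emit, if any:
// "probe_failed", "probe_recovered", or "" for no emission.
func (s *probeState) observe(success bool, threshold int) string {
	if success {
		s.consecutiveFailures = 0
		if s.failedPublished {
			s.failedPublished = false
			return "probe_recovered"
		}
		return ""
	}
	s.consecutiveFailures++
	if !s.failedPublished && s.consecutiveFailures >= threshold {
		s.failedPublished = true
		return "probe_failed"
	}
	return ""
}
```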

In v1, RTM publishes admin-only notification intents only for first-touch failures of the start flow. All ongoing health changes (probe failures, OOMs, exits) flow through runtime:health_events only. Game Master is the consumer that decides whether to escalate runtime-level events into notifications.

The three workers that implement the sources above live in internal/worker/{dockerevents,dockerinspect,healthprobe}. Their design rationale — container_started ownership, container_disappeared emission rules, die exit-code suppression, probe hysteresis state model, parallel-probe cap, and the events-listener reconnect policy — is captured in docs/workers.md.

Reconciliation

RTM never assumes Docker and PostgreSQL are in sync.

At startup (blocking, before workers start) and every RTMANAGER_RECONCILE_INTERVAL (default 5m):

  1. List Docker containers with label com.galaxy.owner=rtmanager.
  2. For each running container without a matching record:
    • Insert a runtime_records row with status=running, the discovered current_image_ref, engine_endpoint, and started_at taken from com.galaxy.started_at_ms if present (otherwise from State.StartedAt).
    • Append operation_log entry with op_kind=reconcile_adopt, op_source=auto_reconcile.
    • Never stop or remove an unrecorded container. Operators may have started one manually for diagnostics; RTM stays out of their way.
  3. For each runtime_records row with status=running whose container is missing:
    • Update status=removed, removed_at=now, current_container_id=NULL.
    • Publish runtime:health_events container_disappeared.
    • Append operation_log entry with op_kind=reconcile_dispose.
  4. For each runtime_records row with status=running whose container exists but is in exited:
    • Update status=stopped, stopped_at=now (reconciler observation time).
    • Publish runtime:health_events container_exited with the observed exit code.

The reconciler implementation lives at internal/worker/reconcile/ and the periodic TTL-cleanup worker at internal/worker/containercleanup/; the cleanup worker delegates removal to internal/service/cleanupcontainer/. The design rationale — the per-game lease around every drift mutation, the third observed_exited path beyond the two named cases, the synchronous ReconcileNow plus periodic Component split, and why the cleanup worker is a thin TTL filter on top of the existing service — is captured in docs/workers.md.
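
A sketch of the drift classification behind the three cases above; the record and container shapes are simplified placeholders, and applying each bucket under the per-game lease is left to the real worker:

```go
package reconcile

// observed is the minimal view of a labelled container from docker ps.
type observed struct {
	ContainerID string
	GameID      string
	Running     bool
	ExitCode    int
}

// record is the minimal view of a runtime_records row with status=running.
type record struct {
	GameID      string
	ContainerID string
}

type drift struct {
	Adopt          []observed // running container, no record: insert + reconcile_adopt
	Dispose        []record   // record running, container gone: removed + container_disappeared
	ObservedExited []observed // record running, container exited: stopped + container_exited
}

// classify compares Docker reality with the running records and buckets the
// differences for the caller to apply.
func classify(containers []observed, running []record) drift {
	byGame := make(map[string]observed, len(containers))
	for _, c := range containers {
		byGame[c.GameID] = c
	}
	recorded := make(map[string]bool, len(running))
	var d drift
	for _, r := range running {
		recorded[r.GameID] = true
		c, ok := byGame[r.GameID]
		switch {
		case !ok:
			d.Dispose = append(d.Dispose, r)
		case !c.Running:
			d.ObservedExited = append(d.ObservedExited, c)
		}
	}
	for _, c := range containers {
		if c.Running && !recorded[c.GameID] {
			d.Adopt = append(d.Adopt, c)
		}
	}
	return d
}
```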

Trusted Surfaces

Internal REST

The internal REST surface is consumed by Game Master (sync interactions for inspect, restart, patch, stop, cleanup) and Admin Service (operational tooling, force-cleanup). The listener is unauthenticated; downstream services rely on network segmentation.

| Method | Path | Operation ID | Caller |
|---|---|---|---|
| GET | /healthz | internalHealthz | platform probes |
| GET | /readyz | internalReadyz | platform probes |
| GET | /api/v1/internal/runtimes | internalListRuntimes | GM, Admin |
| GET | /api/v1/internal/runtimes/{game_id} | internalGetRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/start | internalStartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/stop | internalStopRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/restart | internalRestartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/patch | internalPatchRuntime | GM, Admin |
| DELETE | /api/v1/internal/runtimes/{game_id}/container | internalCleanupRuntimeContainer | Admin |

Request and response shapes are defined in ./api/internal-openapi.yaml. Unknown JSON fields are rejected with invalid_request.

Callers identify themselves through the optional X-Galaxy-Caller request header (gm for Game Master, admin for Admin Service). The header is recorded as op_source in operation_log (gm_rest or admin_rest); when it is missing or carries any other value, Runtime Manager defaults to op_source = admin_rest. The header is documented on every runtime endpoint in ./api/internal-openapi.yaml.
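
A sketch of a trusted caller invoking the surface with the X-Galaxy-Caller header set; the client wiring and base URL are assumptions:

```go
package rtmclient

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// stopRuntime issues POST /api/v1/internal/runtimes/{game_id}/stop as
// Game Master, so the operation is logged with op_source=gm_rest.
func stopRuntime(ctx context.Context, hc *http.Client, baseURL, gameID, reason string) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"reason":%q}`, reason))
	url := fmt.Sprintf("%s/api/v1/internal/runtimes/%s/stop", baseURL, gameID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Galaxy-Caller", "gm")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("stop runtime %s: unexpected status %d", gameID, resp.StatusCode)
	}
	return nil
}
```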

Async Stream Contracts

runtime:start_jobs (in)

Producer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | Lobby game_id. |
| image_ref | string | Docker reference. Lobby resolves it from target_engine_version using LOBBY_ENGINE_IMAGE_TEMPLATE. |
| requested_at_ms | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |

runtime:stop_jobs (in)

Producer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| reason | enum | orphan_cleanup, cancelled, finished, admin_request, timeout. Recorded in operation_log.error_code when the reason matters; otherwise opaque. |
| requested_at_ms | int64 | |

runtime:job_results (out)

Producer: Runtime Manager. Consumer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| outcome | enum | success, failure. |
| container_id | string | Required for success. Empty on failure. |
| engine_endpoint | string | Required for success. Empty on failure. |
| error_code | string | Stable code. replay_no_op for idempotent re-runs. |
| error_message | string | Operator-readable detail. |

runtime:health_events (out, new)

Producer: Runtime Manager. Consumers: Game Master; Game Lobby and Admin Service are reserved as future consumers.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| container_id | string | The container observed (may differ from current after a restart race). |
| event_type | enum | See below. |
| occurred_at_ms | int64 | UTC milliseconds. |
| details | json | Type-specific payload. |

event_type values and their details schemas:

| event_type | details payload |
|---|---|
| container_started | {image_ref} |
| container_exited | {exit_code, oom: bool} |
| container_oom | {exit_code} |
| container_disappeared | {} |
| inspect_unhealthy | {restart_count, state, health} |
| probe_failed | {consecutive_failures, last_status, last_error} |
| probe_recovered | {prior_failure_count} |

The full schema is enforced by ./api/runtime-health-asyncapi.yaml.
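
A sketch of emitting one such entry with go-redis, assuming the field names in the table above; the exact wire encoding is governed by ./api/runtime-health-asyncapi.yaml:

```go
package healthevents

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// publish appends one runtime:health_events entry; details is the
// type-specific payload (e.g. {"exit_code": 137, "oom": true}).
func publish(ctx context.Context, rdb *redis.Client, stream, gameID, containerID, eventType string, details any) error {
	payload, err := json.Marshal(details)
	if err != nil {
		return err
	}
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream, // RTMANAGER_REDIS_HEALTH_EVENTS_STREAM
		Values: map[string]interface{}{
			"game_id":        gameID,
			"container_id":   containerID,
			"event_type":     eventType,
			"occurred_at_ms": time.Now().UnixMilli(),
			"details":        string(payload),
		},
	}).Err()
}
```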

Notification Contracts

Runtime Manager publishes admin-only notification intents only for failures invisible to any other service:

| Trigger | notification_type | Audience | Channels |
|---|---|---|---|
| Image pull error during start | runtime.image_pull_failed | admin | email |
| docker create / docker start error | runtime.container_start_failed | admin | email |
| Configuration validation error at start (bad image_ref, missing network) | runtime.start_config_invalid | admin | email |

Constructors live in galaxy/pkg/notificationintent. Catalog entries live in ../notification/README.md and ../notification/api/intents-asyncapi.yaml. All three intents share the frozen field set {game_id, image_ref, error_code, error_message, attempted_at_ms}; the _ms suffix on attempted_at_ms follows the repo-wide convention for millisecond integer fields. The Redis Streams publisher wrapper used to emit these intents from RTM ships in internal/adapters/notificationpublisher/; the rationale for the signature shim that drops the upstream entry id lives in docs/domain-and-ports.md §7 and the production wiring is documented in docs/adapters.md.

Runtime-level changes after a successful start (probe failures, OOM, container exited) do not produce notifications from RTM. Game Master decides whether to escalate.

Persistence Layout

PostgreSQL durable state (schema rtmanager)

| Table | Purpose | Key |
|---|---|---|
| runtime_records | One row per game, latest known runtime status. | game_id |
| operation_log | Append-only audit of every operation RTM performed. | id (auto) |
| health_snapshots | Latest health observation per game. | game_id |

runtime_records columns:

  • game_id — primary key, references Lobby's identifier.
  • status — running | stopped | removed.
  • current_container_id — nullable when status=removed.
  • current_image_ref — non-null when status is running or stopped.
  • engine_endpoint — http://galaxy-game-{game_id}:8080.
  • state_path — absolute host path of the bind-mounted directory.
  • docker_network — network name observed at create time.
  • started_at, stopped_at, removed_at — last transition timestamps.
  • last_op_at — drives retention TTL.
  • created_at — first time RTM saw the game.

operation_log columns:

  • id, game_id, op_kind (start | stop | restart | patch | cleanup_container | reconcile_adopt | reconcile_dispose), op_source (lobby_stream | gm_rest | admin_rest | auto_ttl | auto_reconcile), source_ref (stream entry id, REST request id, or admin user), image_ref, container_id, outcome (success | failure), error_code, error_message, started_at, finished_at.

health_snapshots columns:

  • game_id, container_id, status (healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared), source (docker_event | inspect | probe), details (jsonb), observed_at.

Indexes:

  • runtime_records (status, last_op_at) — drives cleanup worker.
  • operation_log (game_id, started_at DESC) — drives audit reads.

Migrations are embedded; 00001_init.sql is the single init migration (single-init pre-launch policy from ARCHITECTURE.md §Persistence Backends).

Redis runtime-coordination state

| Key shape | Purpose |
|---|---|
| rtmanager:stream_offsets:{label} | Last processed entry id per consumer (startjobs, stopjobs). Same shape as Lobby. |
| rtmanager:game_lease:{game_id} | Per-game lease string (SET ... NX PX <ttl>). TTL is RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60s); not renewed mid-operation in v1. The trade-off is documented in docs/services.md §1. |

Stream key shapes themselves are configurable:

  • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
  • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
  • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
  • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
  • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).

Error Model

Error envelope: { "error": { "code": "...", "message": "..." } }, identical to Lobby's.

Stable error codes:

| Code | Meaning |
|---|---|
| invalid_request | Malformed JSON, unknown fields, missing required parameter. |
| not_found | Runtime record does not exist. |
| conflict | Operation incompatible with current status. |
| service_unavailable | Dependency unavailable (Docker daemon, PG, Redis). |
| internal_error | Unspecified failure. |
| image_pull_failed | Image pull attempt failed. |
| image_ref_not_semver | Patch attempted with a tag that is not parseable semver. |
| semver_patch_only | Patch attempted across major/minor boundary. |
| container_start_failed | docker create / docker start failed. |
| start_config_invalid | Network missing, bind path inaccessible, or other config error. |
| docker_unavailable | Docker daemon ping failed. |
| replay_no_op | Idempotent replay; outcome is success but no work was done. |
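
A sketch of the envelope and a response helper, with a subset of the codes above as constants; the package and function names are illustrative:

```go
package httperr

import (
	"encoding/json"
	"net/http"
)

// Stable error codes (subset shown).
const (
	CodeInvalidRequest     = "invalid_request"
	CodeNotFound           = "not_found"
	CodeConflict           = "conflict"
	CodeImagePullFailed    = "image_pull_failed"
	CodeReplayNoOp         = "replay_no_op"
	CodeServiceUnavailable = "service_unavailable"
)

// Envelope matches { "error": { "code": "...", "message": "..." } }.
type Envelope struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// Write renders the envelope with the given HTTP status.
func Write(w http.ResponseWriter, status int, code, message string) {
	var e Envelope
	e.Error.Code = code
	e.Error.Message = message
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(e)
}
```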

Configuration

All variables use the RTMANAGER_ prefix. A missing required variable is a fail-fast condition at startup.

Required

  • RTMANAGER_INTERNAL_HTTP_ADDR
  • RTMANAGER_POSTGRES_PRIMARY_DSN
  • RTMANAGER_REDIS_MASTER_ADDR
  • RTMANAGER_REDIS_PASSWORD
  • RTMANAGER_DOCKER_HOST
  • RTMANAGER_DOCKER_NETWORK
  • RTMANAGER_GAME_STATE_ROOT

Configuration groups

Listener:

  • RTMANAGER_INTERNAL_HTTP_ADDR (e.g. :8096).
  • RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT (default 5s).
  • RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT (default 15s).
  • RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT (default 60s).

Docker:

  • RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock).
  • RTMANAGER_DOCKER_API_VERSION (default empty — let SDK negotiate).
  • RTMANAGER_DOCKER_NETWORK (default galaxy-net).
  • RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
  • RTMANAGER_DOCKER_LOG_OPTS (default empty).
  • RTMANAGER_IMAGE_PULL_POLICY (default if_missing, values if_missing | always | never).

Container defaults:

  • RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0).
  • RTMANAGER_DEFAULT_MEMORY (default 512m).
  • RTMANAGER_DEFAULT_PIDS_LIMIT (default 512).
  • RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30).
  • RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
  • RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
  • RTMANAGER_ENGINE_STATE_ENV_NAME (default GAME_STATE_PATH).
  • RTMANAGER_GAME_STATE_DIR_MODE (default 0750).
  • RTMANAGER_GAME_STATE_OWNER_UID (default 0).
  • RTMANAGER_GAME_STATE_OWNER_GID (default 0).
  • RTMANAGER_GAME_STATE_ROOT (host path).

Postgres:

  • RTMANAGER_POSTGRES_PRIMARY_DSN (postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable).
  • RTMANAGER_POSTGRES_REPLICA_DSNS (optional, comma-separated; not used in v1).
  • RTMANAGER_POSTGRES_OPERATION_TIMEOUT (default 2s).
  • RTMANAGER_POSTGRES_MAX_OPEN_CONNS (default 10).
  • RTMANAGER_POSTGRES_MAX_IDLE_CONNS (default 2).
  • RTMANAGER_POSTGRES_CONN_MAX_LIFETIME (default 30m).

Redis:

  • RTMANAGER_REDIS_MASTER_ADDR.
  • RTMANAGER_REDIS_REPLICA_ADDRS (optional, comma-separated).
  • RTMANAGER_REDIS_PASSWORD.
  • RTMANAGER_REDIS_DB (default 0).
  • RTMANAGER_REDIS_OPERATION_TIMEOUT (default 2s).

Streams:

  • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
  • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
  • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
  • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
  • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).
  • RTMANAGER_STREAM_BLOCK_TIMEOUT (default 5s).

Health monitoring:

  • RTMANAGER_INSPECT_INTERVAL (default 30s).
  • RTMANAGER_PROBE_INTERVAL (default 15s).
  • RTMANAGER_PROBE_TIMEOUT (default 2s).
  • RTMANAGER_PROBE_FAILURES_THRESHOLD (default 3).

Reconciler / cleanup:

  • RTMANAGER_RECONCILE_INTERVAL (default 5m).
  • RTMANAGER_CLEANUP_INTERVAL (default 1h).

Coordination:

  • RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60).

Lobby internal client:

  • RTMANAGER_LOBBY_INTERNAL_BASE_URL (e.g. http://lobby:8095).
  • RTMANAGER_LOBBY_INTERNAL_TIMEOUT (default 2s).

Logging:

  • RTMANAGER_LOG_LEVEL (default info).

Lifecycle:

  • RTMANAGER_SHUTDOWN_TIMEOUT (default 30s).

Telemetry: uses the standard OTLP env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, etc.) shared with other Galaxy services.

Observability

Metrics (OpenTelemetry, low cardinality)

  • rtmanager.start_outcomes — counter, labels outcome, error_code, op_source.
  • rtmanager.stop_outcomes — counter, labels outcome, reason, op_source.
  • rtmanager.restart_outcomes — counter, labels outcome, error_code.
  • rtmanager.patch_outcomes — counter, labels outcome, error_code.
  • rtmanager.cleanup_outcomes — counter, labels outcome, op_source.
  • rtmanager.docker_op_latency — histogram, label op (pull | create | start | stop | rm | inspect | events).
  • rtmanager.health_events — counter, label event_type.
  • rtmanager.reconcile_drift — counter, label kind (adopt | dispose | observed_exited).
  • rtmanager.runtime_records_by_status — gauge, label status.
  • rtmanager.lease_acquire_latency — histogram.
  • rtmanager.notification_intents — counter, label notification_type.

Structured logs (slog JSON to stdout)

Common fields on every entry: service=rtmanager, request_id, trace_id, span_id, game_id (when known), container_id (when known), op_kind, op_source, outcome, error_code.

Worker-specific fields: stream_entry_id (consumers), event_type (health), image_ref (start/patch).

Verification

Service-level (TESTING.md §7):

  • Unit tests for every service-layer operation against mocked Docker.
  • Adapter tests (PG, Redis, Docker) using testcontainers-go for PG/Redis and the Docker daemon socket for the real Docker adapter.
  • Contract tests for internal-openapi.yaml, runtime-jobs-asyncapi.yaml, runtime-health-asyncapi.yaml.

Service-local integration suite under rtmanager/integration/:

  • Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real galaxy/game test image.
  • Replay safety (duplicate stream entries are no-ops).
  • Health observability (kill the engine externally, observe container_disappeared; relaunch manually, observe reconcile adopt).
  • Notification on first-touch failures (publish a start with an unresolvable image, observe runtime.image_pull_failed intent and a failure job result).

Inter-service suite under integration/lobbyrtm/:

  • Real Lobby + real RTM + real galaxy/game test image. Covers happy path, cancel, and start-failed flows.

Manual smoke (development):

```sh
docker network create galaxy-net   # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```

After start, curl http://localhost:8096/readyz returns 200. Driving Lobby through its public flow brings up galaxy-game-{game_id} containers; RTM logs each lifecycle transition and publishes the corresponding stream entries.