Runtime Manager
Runtime Manager (RTM) is the only Galaxy platform service permitted to interact with the
Docker daemon. It owns the lifecycle of galaxy/game engine containers and the technical
runtime view of running games. Other services consume RTM via two transports: an asynchronous
Redis Streams contract (used by Game Lobby) and a synchronous internal REST surface (used by
Game Master and Admin Service).
References
- ../ARCHITECTURE.md — system architecture, §9 Runtime Manager.
- ../TESTING.md §7 — testing matrix for RTM.
- ./docs/README.md — service-local documentation entry point.
- ./api/internal-openapi.yaml — REST contract.
- ./api/runtime-jobs-asyncapi.yaml — start/stop job streams contract.
- ./api/runtime-health-asyncapi.yaml — runtime:health_events stream contract.
- ../game/README.md — game engine container contract (env, ports, /healthz).
- ../lobby/README.md — Game Lobby integration with RTM.
Purpose
A running Galaxy game lives in exactly one Docker container. The platform must be able to:
- create the container with the right engine version and configuration;
- supply the engine with a stable storage location for game state;
- keep the runtime status visible to platform-level services;
- replace the container in place for patch upgrades and restarts;
- remove containers that are no longer needed;
- detect and surface engine failures to whoever should react.
Runtime Manager is the single component that performs these actions. It deliberately does
not reason about platform metadata, membership, schedules, turn cutoffs, or any other
business state. Game Lobby owns platform metadata; Game Master will own runtime business state
when implemented.
Scope
Runtime Manager is the source of truth for:
- the mapping game_id -> current_container_id for every running container;
- the durable history of every start, stop, restart, patch, and cleanup operation it performed;
- the most recent technical health observation per game (last Docker event, last successful or failed probe, last inspect result).
Runtime Manager is not the source of truth for:
- any business or platform-level metadata of a game (owned by Game Lobby);
- runtime state visible to players or operators as game state, including current turn, generation status, engine version registry (owned by Game Master);
- the engine version catalogue or which engine version a game is allowed to use (Game Master is the future owner; Game Lobby supplies image_ref in v1);
- contents of the engine state directory; that is engine domain;
- backup, archival, or operator cleanup of state directories.
Non-Goals
- Multi-instance operation in v1. Coordination is single-process; multiple replicas are an explicit future iteration.
- Engine version arbitration. The producer (Game Lobby in v1, Game Master later) supplies image_ref.
- Image registry control. Pull policy is configurable, but RTM does not push, retag, or promote images.
- TLS or mTLS on the internal listener. RTM trusts its network segment.
- Direct delivery of player-visible push notifications. RTM publishes admin-only notification intents only for failures invisible elsewhere; everything else is delegated.
- Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached through unix:///var/run/docker.sock.
Position in the System
flowchart LR
Lobby["Game Lobby"]
GM["Game Master"]
Admin["Admin Service"]
Notify["Notification Service"]
RTM["Runtime Manager"]
Engine["Game Engine container"]
Docker["Docker Daemon"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases"]
Lobby -->|runtime:start_jobs / stop_jobs| RTM
RTM -->|runtime:job_results| Lobby
GM -->|internal REST| RTM
Admin -->|internal REST| RTM
RTM -->|notification:intents (admin)| Notify
RTM -->|runtime:health_events| Redis
RTM <--> Docker
Docker -->|create / start / stop / rm| Engine
RTM --> Postgres
RTM --> Redis
Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
Responsibility Boundaries
Runtime Manager is responsible for:
- accepting start, stop, restart, patch, inspect, and cleanup requests through the supported transports and producing one durable outcome per request;
- creating Docker containers from a producer-supplied image_ref and binding them to the configured Docker network and host state directory;
- enforcing the one-game-one-container invariant in its own state and on Docker;
- monitoring container health through Docker events, periodic inspect, and active HTTP probes;
- publishing technical runtime events (runtime:job_results, runtime:health_events) and admin-only notification intents for failures that no other service can observe;
- reconciling its persistent state with Docker reality on startup and periodically;
- removing exited containers automatically by retention TTL or explicitly by admin command.
Runtime Manager is not responsible for:
- evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
- registering a started runtime with Game Master (Lobby calls GM after a successful job result);
- mapping platform users to engine players (GM owns this mapping);
- player command routing (GM proxies player commands directly to engine);
- cleaning up host state directories;
- patching the engine version registry; the registry lives in Game Master.
Container Model
Network
Containers attach to a single user-defined Docker bridge network. The network is provisioned
outside RTM: docker-compose, Terraform, or an operator runbook creates galaxy-net (or
whatever name is configured via RTMANAGER_DOCKER_NETWORK).
RTM validates the network's presence at startup. A missing network is a fail-fast condition; the process exits non-zero before opening any listener.
DNS name and engine endpoint
Each container is created with hostname galaxy-game-{game_id} and is attached to the
configured network. Docker's embedded DNS resolves the hostname for any other container in the
same network.
The engine_endpoint published in runtime:job_results and visible through the inspect REST
endpoint is the full URL http://galaxy-game-{game_id}:8080. The port is fixed at 8080
inside the container; RTM does not publish ports to the host.
Restart and patch keep the same DNS name. The container_id changes; the engine_endpoint
does not.
State storage (bind mount)
Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale is operator-friendly backup and inspection.
- Host root: RTMANAGER_GAME_STATE_ROOT (operator-supplied, e.g. /var/lib/galaxy/games).
- Per-game directory: <RTMANAGER_GAME_STATE_ROOT>/{game_id}. RTM creates it with permissions RTMANAGER_GAME_STATE_DIR_MODE (default 0750) and ownership RTMANAGER_GAME_STATE_OWNER_UID/_GID (default 0:0 — operator overrides for non-root engine).
- Bind mount: the per-game directory is mounted into the container at the path declared by RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
- Environment: the container receives GAME_STATE_PATH=<mount path>. The engine resolves the path from this variable. The same variable is forwarded to the engine as STORAGE_PATH for backward compatibility — both names are accepted in v1.
RTM never deletes the host state directory. Removing it is the responsibility of operator tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container through the cleanup endpoint or the retention TTL leaves the directory intact.
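A minimal sketch of the per-game directory preparation and the bind/env wiring described above, using only the standard library; stateConfig and ensureStateDir are illustrative names, not the repo's production adapter:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stateConfig mirrors the configuration described above (names illustrative).
type stateConfig struct {
	Root      string      // RTMANAGER_GAME_STATE_ROOT
	DirMode   os.FileMode // RTMANAGER_GAME_STATE_DIR_MODE
	OwnerUID  int         // RTMANAGER_GAME_STATE_OWNER_UID
	OwnerGID  int         // RTMANAGER_GAME_STATE_OWNER_GID
	MountPath string      // RTMANAGER_ENGINE_STATE_MOUNT_PATH
}

// ensureStateDir creates <root>/{game_id} with the configured mode and
// ownership and returns the bind spec plus the engine env entries.
func ensureStateDir(cfg stateConfig, gameID string) (bind string, env []string, err error) {
	dir := filepath.Join(cfg.Root, gameID)
	if err := os.MkdirAll(dir, cfg.DirMode); err != nil {
		return "", nil, fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll honours the umask, so re-apply the exact mode, then ownership.
	if err := os.Chmod(dir, cfg.DirMode); err != nil {
		return "", nil, err
	}
	if err := os.Chown(dir, cfg.OwnerUID, cfg.OwnerGID); err != nil {
		return "", nil, err
	}
	bind = dir + ":" + cfg.MountPath
	env = []string{
		"GAME_STATE_PATH=" + cfg.MountPath,
		"STORAGE_PATH=" + cfg.MountPath, // v1 backward-compatibility alias
	}
	return bind, env, nil
}

func main() {
	bind, env, err := ensureStateDir(stateConfig{
		Root:      "/tmp/galaxy-games", // stand-in for RTMANAGER_GAME_STATE_ROOT
		DirMode:   0o750,
		OwnerUID:  os.Getuid(),
		OwnerGID:  os.Getgid(),
		MountPath: "/var/lib/galaxy-game",
	}, "g-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(bind, env)
}
```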
Container labels
RTM applies the following labels to every container it creates:
| Label | Value | Purpose |
|---|---|---|
| com.galaxy.owner | rtmanager | Filter for docker ps and reconcile. |
| com.galaxy.kind | game-engine | Differentiates from infra containers. |
| com.galaxy.game_id | {game_id} | Reverse lookup from container to platform game. |
| com.galaxy.engine_image_ref | {image_ref} | Cross-check against runtime_records. |
| com.galaxy.started_at_ms | {ms} | Unambiguous start timestamp. |
Separately, labels on the resolved engine image (not the container labels above) are read to choose resource limits (see below).
Resource limits
Resource limits originate in the engine image, not in the producer envelope or RTM config:
| Image label | Container limit | RTM fallback config |
|---|---|---|
| com.galaxy.cpu_quota | --cpus value | RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0) |
| com.galaxy.memory | --memory value | RTMANAGER_DEFAULT_MEMORY (default 512m) |
| com.galaxy.pids_limit | --pids-limit value | RTMANAGER_DEFAULT_PIDS_LIMIT (default 512) |
If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass limits.
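An illustrative label-first, config-fallback resolution of these limits; the label keys and defaults match the table above, while the resolveLimits helper itself is hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
)

// limits carries the three per-container limits from the table above.
type limits struct {
	CPUQuota  float64 // --cpus
	Memory    string  // --memory, passed through to Docker ("512m", "1g", ...)
	PIDsLimit int64   // --pids-limit
}

// resolveLimits prefers image labels and falls back to the configured defaults
// when a label is missing or unparseable.
func resolveLimits(imageLabels map[string]string, fallback limits) limits {
	out := fallback
	if v, ok := imageLabels["com.galaxy.cpu_quota"]; ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil && f > 0 {
			out.CPUQuota = f
		}
	}
	if v, ok := imageLabels["com.galaxy.memory"]; ok && v != "" {
		out.Memory = v
	}
	if v, ok := imageLabels["com.galaxy.pids_limit"]; ok {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil && n > 0 {
			out.PIDsLimit = n
		}
	}
	return out
}

func main() {
	defaults := limits{CPUQuota: 1.0, Memory: "512m", PIDsLimit: 512}
	fmt.Printf("%+v\n", resolveLimits(map[string]string{"com.galaxy.cpu_quota": "2.0"}, defaults))
}
```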
Logging driver
Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver and its options when creating the container:
- RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
- RTMANAGER_DOCKER_LOG_OPTS (default empty; comma-separated key=value pairs).
RTM never reads the container's stdout itself. Operators consume engine logs via docker logs
or via whatever sink the configured driver feeds (fluentd, journald, etc.).
The production Docker SDK adapter that creates and starts these containers lives at
internal/adapters/docker/. Its design rationale — fixed engine port, partial-rollback on
ContainerStart failure, events-stream filter rationale, and the mockgen-driven service-test
fixture — is captured in docs/adapters.md.
Runtime Surface
Listeners
| Listener | Default address | Purpose |
|---|---|---|
| internal HTTP | :8096 (RTMANAGER_INTERNAL_HTTP_ADDR) | Probes (/healthz, /readyz) and the trusted REST surface for Game Master and Admin Service. |
There is no public listener. The internal listener is unauthenticated and assumes a trusted network segment.
Background workers
| Worker | Driver | Description |
|---|---|---|
| startjobs consumer | Redis Stream runtime:start_jobs | Decodes start envelope and invokes the start service. |
| stopjobs consumer | Redis Stream runtime:stop_jobs | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker /events API | Subscribes with the label filter, emits runtime:health_events for container_started / exited / oom / disappeared. |
| Active HTTP probe | Periodic | GET {engine_endpoint}/healthz for every running runtime; emits probe_failed / probe_recovered with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits inspect_unhealthy when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles runtime_records with docker ps (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than RTMANAGER_CONTAINER_RETENTION_DAYS. |
Startup dependencies
In start order:
- PostgreSQL primary (DSN RTMANAGER_POSTGRES_PRIMARY_DSN). Goose migrations apply synchronously before any listener opens.
- Redis master (RTMANAGER_REDIS_MASTER_ADDR).
- Docker daemon at RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock). RTM verifies API ping and the presence of RTMANAGER_DOCKER_NETWORK.
- Telemetry exporter (OTLP grpc/http or stdout).
- Internal HTTP listener.
- Reconciler runs once and blocks until done.
- Background workers start.
A failure in any step is fatal and exits the process non-zero.
Probes
/healthz reports liveness — the process responds when the HTTP server is alive.
/readyz reports readiness — 200 only when:
- the PostgreSQL pool can ping the primary;
- the Redis master client can ping;
- the Docker client can ping;
- the configured Docker network exists.
Both probes are documented in ./api/internal-openapi.yaml.
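A minimal readiness-handler sketch matching the rules above; the Pinger interface and newReadyzHandler name are illustrative, and the authoritative handler shape is the one in ./api/internal-openapi.yaml:

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Pinger stands in for the PostgreSQL pool, the Redis master client, the
// Docker client, and the Docker-network existence check.
type Pinger interface {
	Ping(ctx context.Context) error
}

// newReadyzHandler returns 200 only when every dependency answers its ping.
func newReadyzHandler(deps map[string]Pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, dep := range deps {
			if err := dep.Ping(ctx); err != nil {
				http.Error(w, name+" not ready: "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	deps := map[string]Pinger{ /* "postgres": ..., "redis": ..., "docker": ..., "docker_network": ... */ }
	http.HandleFunc("/readyz", newReadyzHandler(deps))
	_ = http.ListenAndServe(":8096", nil) // RTMANAGER_INTERNAL_HTTP_ADDR
}
```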
Lifecycles
All operations share a per-game-id Redis lease (rtmanager:game_lease:{game_id},
TTL RTMANAGER_GAME_LEASE_TTL_SECONDS, default 60). The lease serialises operations on a
single game across all entry points (stream consumers and REST handlers). v1 does not renew
the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before
the operation finishes — the trade-off is documented in
docs/services.md §1.
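One way the lease could be implemented with plain Redis commands, shown as a sketch: acquireLease/releaseLease and the compare-and-delete script are illustrative, not the production coordination code.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease only if the caller still holds it.
const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

func acquireLease(ctx context.Context, rdb *redis.Client, gameID string, ttl time.Duration) (token string, ok bool, err error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", false, err
	}
	token = hex.EncodeToString(buf)
	key := "rtmanager:game_lease:" + gameID
	ok, err = rdb.SetNX(ctx, key, token, ttl).Result() // SET key token NX PX <ttl>
	return token, ok, err
}

func releaseLease(ctx context.Context, rdb *redis.Client, gameID, token string) error {
	// The lease may already have expired mid-operation in v1; release is best-effort.
	return rdb.Eval(ctx, releaseScript, []string{"rtmanager:game_lease:" + gameID}, token).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()
	token, ok, err := acquireLease(ctx, rdb, "g-42", 60*time.Second) // RTMANAGER_GAME_LEASE_TTL_SECONDS
	fmt.Println(ok, err)
	if ok {
		_ = releaseLease(ctx, rdb, "g-42", token)
	}
}
```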
Start
Triggers:
- Lobby: a Redis Streams entry on runtime:start_jobs with envelope {game_id, image_ref, requested_at_ms}.
- Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/start with body {image_ref}.
Pre-conditions:
- image_ref is a non-empty string and parseable as a Docker reference.
- Configured Docker network exists.
- The lease for {game_id} is acquired.
Flow on success:
- Read runtime_records.{game_id}. If status=running with the same image_ref, return the existing record (idempotent success, error_code=replay_no_op).
- Pull the image per RTMANAGER_IMAGE_PULL_POLICY (default if_missing).
- Inspect the resolved image, derive resource limits from labels.
- Ensure the per-game state directory exists with the configured mode and ownership.
- docker create with the configured network, hostname, labels, env (GAME_STATE_PATH, STORAGE_PATH), bind mount, log driver, resource limits.
- docker start.
- Upsert runtime_records (status=running, current_container_id, engine_endpoint, current_image_ref, started_at, last_op_at).
- Append operation_log entry (op_kind=start, outcome=success, source-specific op_source).
- Publish runtime:health_events container_started.
- For Lobby callers: publish runtime:job_results {game_id, outcome=success, container_id, engine_endpoint}. For REST callers: respond 200 with the runtime record.
Failure paths:
| Failure | PG side effect | Notification intent | Outcome to caller |
|---|---|---|---|
| Invalid image_ref shape, network missing | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |
| Image pull error | operation_log failure | runtime.image_pull_failed | failure / image_pull_failed |
| docker create / start error | operation_log failure | runtime.container_start_failed | failure / container_start_failed |
| State directory creation error | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |
A failed start never leaves a partially-running container: if docker create succeeded but
the subsequent step failed, RTM removes the container before recording the failure.
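A minimal sketch of that rollback rule, with a hypothetical dockerAPI interface standing in for the SDK adapter under internal/adapters/docker/:

```go
package sketch

import (
	"context"
	"fmt"
)

// dockerAPI stands in for the Docker SDK adapter (illustrative interface).
type dockerAPI interface {
	Create(ctx context.Context, gameID, imageRef string) (containerID string, err error)
	Start(ctx context.Context, containerID string) error
	Remove(ctx context.Context, containerID string) error
}

// startContainer never leaves a created-but-not-started container behind:
// if Start fails, the container is removed before the failure is reported.
func startContainer(ctx context.Context, d dockerAPI, gameID, imageRef string) (string, error) {
	id, err := d.Create(ctx, gameID, imageRef)
	if err != nil {
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	if err := d.Start(ctx, id); err != nil {
		_ = d.Remove(ctx, id) // roll back before recording the failure
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	return id, nil
}
```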
The production start orchestrator that implements the flow and the failure paths above lives
at internal/service/startruntime/. Its design rationale — why the per-game lease and the
health-events publisher live with the start service, the Result-shaped contract consumed by
the stream consumer and the REST handler, the rollback rule on Upsert failure, and the
created_at-preservation rule for re-starts — is captured in
docs/services.md.
Stop
Triggers:
- Lobby: Redis Streams entry on runtime:stop_jobs with envelope {game_id, reason, requested_at_ms}. reason ∈ {orphan_cleanup, cancelled, finished, admin_request, timeout}.
- Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/stop with body {reason}.
Pre-conditions:
- Lease acquired.
Flow on success:
- Read runtime_records.{game_id}. If status is stopped or removed, return idempotent success (error_code=replay_no_op).
- docker stop with RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30). Docker fires SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP shutdown endpoint on the engine.
- Update runtime_records (status=stopped, stopped_at, last_op_at).
- Append operation_log entry.
- Publish runtime:job_results (for Lobby) or REST 200 (for REST callers).
The container stays in exited state until the cleanup worker removes it (TTL) or an admin
command forces removal.
Failure paths:
| Failure | Outcome |
|---|---|
| Container not found in Docker but record running | Update record status=removed, publish container_disappeared, return success (RTM treats this as already-stopped). |
| docker stop returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |
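The two failure rows translate to roughly the following decision, sketched against a hypothetical stopAPI interface (errNotFound and the outcome type are illustrative):

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// errNotFound is an illustrative sentinel for "container not found in Docker".
var errNotFound = errors.New("container not found")

// stopAPI stands in for the Docker adapter; Stop sends SIGTERM and lets Docker
// escalate to SIGKILL after the timeout.
type stopAPI interface {
	Stop(ctx context.Context, containerID string, timeout time.Duration) error
}

type stopOutcome string

const (
	outcomeStopped stopOutcome = "stopped" // record moves to status=stopped
	outcomeRemoved stopOutcome = "removed" // container already gone: container_disappeared, still success
)

func stopRuntime(ctx context.Context, d stopAPI, containerID string) (stopOutcome, error) {
	err := d.Stop(ctx, containerID, 30*time.Second) // RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS
	switch {
	case err == nil:
		return outcomeStopped, nil
	case errors.Is(err, errNotFound):
		return outcomeRemoved, nil // treat as already stopped
	default:
		return "", err // failure recorded, no state change; caller may retry
	}
}
```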
Restart
Triggers:
- Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/restart.
Restart is recreate: stop + remove + run with the same image_ref and the same bind
mount. container_id changes; engine_endpoint is stable.
Flow:
- Read runtime_records.{game_id}. The current image_ref is captured.
- Acquire lease.
- Run the stop flow (without releasing the lease).
- docker rm the container.
- Run the start flow with the captured image_ref.
- Append a single operation_log entry with op_kind=restart and a correlation id linking the implicit stop and start log entries.
If any inner step fails, the operation log records the partial outcome and the outer caller receives the same failure; the runtime record converges to whatever state Docker reports.
Patch
Triggers:
- Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/patch with body {image_ref}.
Patch is restart with a new image_ref. The engine reads its state from the bind mount
on startup, so any data written before the patch survives.
Pre-conditions:
- New and current image refs both parse as semver tags (image_ref_not_semver failure otherwise).
- Major and minor versions are equal between current and new (semver_patch_only failure otherwise).
Flow: identical to restart, with a new image_ref injected before the start step.
operation_log entry has op_kind=patch.
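A sketch of the two semver pre-conditions, using a hand-rolled tag parser rather than the production validator (parseTag and validatePatch are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

var (
	errNotSemver = errors.New("image_ref_not_semver")
	errNotPatch  = errors.New("semver_patch_only")
)

// parseTag extracts major.minor.patch from the tag part of an image reference.
func parseTag(imageRef string) (major, minor, patch int, err error) {
	i := strings.LastIndex(imageRef, ":")
	if i < 0 {
		return 0, 0, 0, errNotSemver
	}
	parts := strings.SplitN(strings.TrimPrefix(imageRef[i+1:], "v"), ".", 3)
	if len(parts) != 3 {
		return 0, 0, 0, errNotSemver
	}
	nums := make([]int, 3)
	for j, p := range parts {
		if nums[j], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, errNotSemver
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// validatePatch enforces the two pre-conditions: both refs are semver tags and
// only the patch component may change.
func validatePatch(currentRef, newRef string) error {
	curMaj, curMin, _, err := parseTag(currentRef)
	if err != nil {
		return err
	}
	newMaj, newMin, _, err := parseTag(newRef)
	if err != nil {
		return err
	}
	if curMaj != newMaj || curMin != newMin {
		return errNotPatch
	}
	return nil
}

func main() {
	fmt.Println(validatePatch("galaxy/game:1.4.2", "galaxy/game:1.4.3")) // <nil>
	fmt.Println(validatePatch("galaxy/game:1.4.2", "galaxy/game:1.5.0")) // semver_patch_only
}
```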
Cleanup
Triggers:
- Periodic worker: every container with runtime_records.status=stopped and last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
- Admin Service: DELETE /api/v1/internal/runtimes/{game_id}/container.
Pre-conditions:
- The container is not in running state. RTM refuses to remove a running container through this path; stop first.
Flow:
- Acquire lease.
- docker rm the container.
- Update runtime_records (status=removed, removed_at, current_container_id=NULL, last_op_at).
- Append operation_log entry (op_kind=cleanup_container, op_source ∈ {auto_ttl, admin_rest}).
The host state directory is left untouched.
Health Monitoring
Three independent sources feed runtime:health_events and health_snapshots:
- Docker events listener. Subscribes to the Docker events stream and filters container-scoped events by the com.galaxy.owner=rtmanager label written into every container by the start service. Emits:
  - container_exited (action=die with non-zero exit code; exit 0 is the normal graceful stop and is suppressed).
  - container_oom (action=oom).
  - container_disappeared (action=destroy observed for a runtime_records.status=running row whose current_container_id still matches the destroyed container, i.e. a destroy RTM did not initiate).
  container_started is emitted by the start service when it runs the container (see internal/service/startruntime), not by this listener.
- Periodic Docker inspect every RTMANAGER_INSPECT_INTERVAL (default 30s). Emits inspect_unhealthy when:
  - RestartCount increases between observations;
  - State.Status != "running" for a record marked running;
  - State.Health.Status == "unhealthy" if the image declares a Docker HEALTHCHECK.
- Active HTTP probe every RTMANAGER_PROBE_INTERVAL (default 15s). Calls GET {engine_endpoint}/healthz with RTMANAGER_PROBE_TIMEOUT (default 2s). Emits:
  - probe_failed after RTMANAGER_PROBE_FAILURES_THRESHOLD consecutive failures (default 3);
  - probe_recovered on the first success after a probe_failed was published.
Every emission updates health_snapshots.{game_id} (latest event becomes the snapshot) and
appends to runtime:health_events.
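A minimal sketch of the probe hysteresis rule above (probe_failed only after the configured number of consecutive failures, probe_recovered on the first success afterwards); the probeTracker type is illustrative:

```go
package main

import "fmt"

// probeTracker keeps the per-game hysteresis state for the active HTTP probe.
type probeTracker struct {
	threshold        int // RTMANAGER_PROBE_FAILURES_THRESHOLD
	consecutive      int
	failurePublished bool
}

// observe returns the health event to emit for one probe result, or "".
func (t *probeTracker) observe(ok bool) string {
	if ok {
		t.consecutive = 0
		if t.failurePublished {
			t.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	t.consecutive++
	if t.consecutive >= t.threshold && !t.failurePublished {
		t.failurePublished = true
		return "probe_failed"
	}
	return ""
}

func main() {
	t := &probeTracker{threshold: 3}
	for _, ok := range []bool{false, false, false, false, true} {
		if ev := t.observe(ok); ev != "" {
			fmt.Println(ev) // probe_failed after the third failure, probe_recovered on success
		}
	}
}
```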
In v1, RTM publishes admin-only notification intents only for first-touch failures of the
start flow. All ongoing health changes (probe failures, OOMs, exits) flow through
runtime:health_events only. Game Master is the consumer that decides whether to escalate
runtime-level events into notifications.
The three workers that implement the sources above live in
internal/worker/{dockerevents,dockerinspect,healthprobe}. Their design rationale —
container_started ownership, container_disappeared emission rules, die exit-code
suppression, probe hysteresis state model, parallel-probe cap, and the events-listener
reconnect policy — is captured in docs/workers.md.
Reconciliation
RTM never assumes Docker and PostgreSQL are in sync.
At startup (blocking, before workers start) and every RTMANAGER_RECONCILE_INTERVAL
(default 5m):
- List Docker containers with label com.galaxy.owner=rtmanager.
- For each running container without a matching record:
  - Insert a runtime_records row with status=running, the discovered current_image_ref, engine_endpoint, and started_at taken from com.galaxy.started_at_ms if present (otherwise from State.StartedAt).
  - Append operation_log entry with op_kind=reconcile_adopt, op_source=auto_reconcile.
  - Never stop or remove an unrecorded container. Operators may have started one manually for diagnostics; RTM stays out of their way.
- For each runtime_records row with status=running whose container is missing:
  - Update status=removed, removed_at=now, current_container_id=NULL.
  - Publish runtime:health_events container_disappeared.
  - Append operation_log entry with op_kind=reconcile_dispose.
- For each runtime_records row with status=running whose container exists but is in exited:
  - Update status=stopped, stopped_at=now (reconciler observation time).
  - Publish runtime:health_events container_exited with the observed exit code.
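A minimal sketch of the drift decisions above, expressed over two in-memory maps; the types and the reconcile function are illustrative and not the internal/worker/reconcile implementation:

```go
package main

import "fmt"

type containerState string

const (
	stateRunning containerState = "running"
	stateExited  containerState = "exited"
)

// driftAction names the operation_log kind the reconciler would record.
type driftAction struct {
	Kind   string // reconcile_adopt | reconcile_dispose | observed_exited
	GameID string
}

// reconcile compares labelled containers (game_id -> state) with the set of
// runtime_records rows currently marked running.
func reconcile(docker map[string]containerState, runningRecords map[string]bool) []driftAction {
	var actions []driftAction
	// Running containers without a record are adopted, never stopped or removed.
	for gameID, state := range docker {
		if state == stateRunning && !runningRecords[gameID] {
			actions = append(actions, driftAction{"reconcile_adopt", gameID})
		}
	}
	for gameID := range runningRecords {
		state, exists := docker[gameID]
		switch {
		case !exists:
			actions = append(actions, driftAction{"reconcile_dispose", gameID}) // container_disappeared
		case state == stateExited:
			actions = append(actions, driftAction{"observed_exited", gameID}) // container_exited
		}
	}
	return actions
}

func main() {
	fmt.Println(reconcile(
		map[string]containerState{"g-1": stateRunning, "g-3": stateExited},
		map[string]bool{"g-2": true, "g-3": true},
	))
}
```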
The reconciler implementation lives at internal/worker/reconcile/ and the periodic
TTL-cleanup worker at internal/worker/containercleanup/; the cleanup worker delegates
removal to internal/service/cleanupcontainer/. The design rationale — the per-game
lease around every drift mutation, the third observed_exited path beyond the two
named cases, the synchronous ReconcileNow plus periodic Component split, and why
the cleanup worker is a thin TTL filter on top of the existing service — is captured in
docs/workers.md.
Trusted Surfaces
Internal REST
The internal REST surface is consumed by Game Master (sync interactions for inspect,
restart, patch, stop, cleanup) and Admin Service (operational tooling, force-cleanup).
The listener is unauthenticated; downstream services rely on network segmentation.
| Method | Path | Operation ID | Caller |
|---|---|---|---|
| GET | /healthz | internalHealthz | platform probes |
| GET | /readyz | internalReadyz | platform probes |
| GET | /api/v1/internal/runtimes | internalListRuntimes | GM, Admin |
| GET | /api/v1/internal/runtimes/{game_id} | internalGetRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/start | internalStartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/stop | internalStopRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/restart | internalRestartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/patch | internalPatchRuntime | GM, Admin |
| DELETE | /api/v1/internal/runtimes/{game_id}/container | internalCleanupRuntimeContainer | Admin |
Request and response shapes are defined in ./api/internal-openapi.yaml.
Unknown JSON fields are rejected with invalid_request.
Callers identify themselves through the optional X-Galaxy-Caller
request header (gm for Game Master, admin for Admin Service).
The header is recorded as op_source in operation_log (gm_rest or
admin_rest); when missing or carrying any other value Runtime
Manager defaults to op_source = admin_rest. The header is documented
on every runtime endpoint of
./api/internal-openapi.yaml.
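A sketch of that mapping; the opSourceFromHeader helper is illustrative:

```go
package main

import (
	"fmt"
	"net/http"
)

// opSourceFromHeader maps X-Galaxy-Caller to the op_source recorded in
// operation_log; anything missing or unrecognised defaults to admin_rest.
func opSourceFromHeader(r *http.Request) string {
	switch r.Header.Get("X-Galaxy-Caller") {
	case "gm":
		return "gm_rest"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}

func main() {
	req, _ := http.NewRequest(http.MethodPost, "/api/v1/internal/runtimes/g-42/stop", nil)
	req.Header.Set("X-Galaxy-Caller", "gm")
	fmt.Println(opSourceFromHeader(req)) // gm_rest
}
```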
Async Stream Contracts
runtime:start_jobs (in)
Producer: Game Lobby.
| Field | Type | Notes |
|---|---|---|
| game_id | string | Lobby game_id. |
| image_ref | string | Docker reference. Lobby resolves it from target_engine_version using LOBBY_ENGINE_IMAGE_TEMPLATE. |
| requested_at_ms | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |
runtime:stop_jobs (in)
Producer: Game Lobby.
| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| reason | enum | orphan_cleanup, cancelled, finished, admin_request, timeout. Recorded in operation_log.error_code when the reason matters; otherwise opaque. |
| requested_at_ms | int64 | |
runtime:job_results (out)
Producer: Runtime Manager. Consumer: Game Lobby.
| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| outcome | enum | success, failure. |
| container_id | string | Required for success. Empty on failure. |
| engine_endpoint | string | Required for success. Empty on failure. |
| error_code | string | Stable code. replay_no_op for idempotent re-runs. |
| error_message | string | Operator-readable detail. |
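A sketch of publishing one runtime:job_results entry with go-redis XADD; the jobResult struct and publishJobResult helper are illustrative, not the production publisher:

```go
package sketch

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// jobResult mirrors the field set of the runtime:job_results contract above.
type jobResult struct {
	GameID         string
	Outcome        string // success | failure
	ContainerID    string // empty on failure
	EngineEndpoint string // empty on failure
	ErrorCode      string // e.g. replay_no_op, image_pull_failed
	ErrorMessage   string
}

// publishJobResult appends one entry to the job-results stream and returns the
// entry id assigned by Redis.
func publishJobResult(ctx context.Context, rdb *redis.Client, stream string, r jobResult) (string, error) {
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream, // RTMANAGER_REDIS_JOB_RESULTS_STREAM, default runtime:job_results
		Values: map[string]interface{}{
			"game_id":         r.GameID,
			"outcome":         r.Outcome,
			"container_id":    r.ContainerID,
			"engine_endpoint": r.EngineEndpoint,
			"error_code":      r.ErrorCode,
			"error_message":   r.ErrorMessage,
		},
	}).Result()
}
```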
runtime:health_events (out)
Producer: Runtime Manager. Consumer: Game Master — confirmed in
production. Game Lobby and Admin Service are reserved as future
consumers; they do not read the stream in v1.
| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| container_id | string | The container observed (may differ from current after a restart race). |
| event_type | enum | See below. |
| occurred_at_ms | int64 | UTC milliseconds. |
| details | json | Type-specific payload. |
event_type values and their details schemas:
| event_type | details payload |
|---|---|
| container_started | {image_ref} |
| container_exited | {exit_code, oom: bool} |
| container_oom | {exit_code} |
| container_disappeared | {} |
| inspect_unhealthy | {restart_count, state, health} |
| probe_failed | {consecutive_failures, last_status, last_error} |
| probe_recovered | {prior_failure_count} |
The full schema is enforced by ./api/runtime-health-asyncapi.yaml.
Notification Contracts
Runtime Manager publishes admin-only notification intents only for failures invisible to
any other service:
| Trigger | notification_type | Audience | Channels |
|---|---|---|---|
| Image pull error during start | runtime.image_pull_failed | admin | |
| docker create / docker start error | runtime.container_start_failed | admin | |
| Configuration validation error at start (bad image_ref, missing network) | runtime.start_config_invalid | admin | |
Constructors live in galaxy/pkg/notificationintent. Catalog entries live in
../notification/README.md and
../notification/api/intents-asyncapi.yaml.
All three intents share the frozen field set
{game_id, image_ref, error_code, error_message, attempted_at_ms}; the
_ms suffix on attempted_at_ms follows the repo-wide convention for
millisecond integer fields.
The Redis Streams publisher wrapper used to emit these intents from RTM
ships in internal/adapters/notificationpublisher/; the rationale for the
signature shim that drops the upstream entry id lives in
docs/domain-and-ports.md §7 and the production
wiring is documented in docs/adapters.md.
Runtime-level changes after a successful start (probe failures, OOM, container exited) do not produce notifications from RTM. Game Master decides whether to escalate.
Persistence Layout
PostgreSQL durable state (schema rtmanager)
| Table | Purpose | Key |
|---|---|---|
| runtime_records | One row per game, latest known runtime status. | game_id |
| operation_log | Append-only audit of every operation RTM performed. | id (auto) |
| health_snapshots | Latest health observation per game. | game_id |
runtime_records columns:
- game_id — primary key, references Lobby's identifier.
- status — running | stopped | removed.
- current_container_id — nullable when status=removed.
- current_image_ref — non-null when status is running or stopped.
- engine_endpoint — http://galaxy-game-{game_id}:8080.
- state_path — absolute host path of the bind-mounted directory.
- docker_network — network name observed at create time.
- started_at, stopped_at, removed_at — last transition timestamps.
- last_op_at — drives retention TTL.
- created_at — first time RTM saw the game.
operation_log columns:
id, game_id, op_kind (start | stop | restart | patch | cleanup_container | reconcile_adopt | reconcile_dispose), op_source (lobby_stream | gm_rest | admin_rest | auto_ttl | auto_reconcile), source_ref (stream entry id, REST request id, or admin user), image_ref, container_id, outcome (success | failure), error_code, error_message, started_at, finished_at.
health_snapshots columns:
game_id, container_id, status (healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared), source (docker_event | inspect | probe), details (jsonb), observed_at.
Indexes:
- runtime_records (status, last_op_at) — drives cleanup worker.
- operation_log (game_id, started_at DESC) — drives audit reads.
Migrations are embedded, with a single 00001_init.sql (single-init pre-launch policy from ARCHITECTURE.md §Persistence Backends).
Redis runtime-coordination state
| Key shape | Purpose |
|---|---|
| rtmanager:stream_offsets:{label} | Last processed entry id per consumer (startjobs, stopjobs). Same shape as Lobby. |
| rtmanager:game_lease:{game_id} | Per-game lease string (SET ... NX PX <ttl>). TTL is RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60s); not renewed mid-operation in v1. The trade-off is documented in docs/services.md §1. |
Stream key shapes themselves are configurable:
- RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
- RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
- RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
- RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
- RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).
Error Model
Error envelope: { "error": { "code": "...", "message": "..." } }, identical to Lobby's.
Stable error codes:
| Code | Meaning |
|---|---|
| invalid_request | Malformed JSON, unknown fields, missing required parameter. |
| not_found | Runtime record does not exist. |
| conflict | Operation incompatible with current status. |
| service_unavailable | Dependency unavailable (Docker daemon, PG, Redis). |
| internal_error | Unspecified failure. |
| image_pull_failed | Image pull attempt failed. |
| image_ref_not_semver | Patch attempted with a tag that is not parseable semver. |
| semver_patch_only | Patch attempted across major/minor boundary. |
| container_start_failed | docker create / docker start failed. |
| start_config_invalid | Network missing, bind path inaccessible, or other config error. |
| docker_unavailable | Docker daemon ping failed. |
| replay_no_op | Idempotent replay; outcome is success but no work was done. |
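A sketch of writing the shared envelope from a handler; writeError and the struct names are illustrative:

```go
package sketch

import (
	"encoding/json"
	"net/http"
)

type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// writeError renders { "error": { "code": "...", "message": "..." } } with the
// given HTTP status, e.g. 404 for not_found or 409 for conflict.
func writeError(w http.ResponseWriter, status int, code, message string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(errorEnvelope{Error: errorBody{Code: code, Message: message}})
}
```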
Configuration
All variables use the RTMANAGER_ prefix. Required variables fail-fast on startup.
Required
- RTMANAGER_INTERNAL_HTTP_ADDR
- RTMANAGER_POSTGRES_PRIMARY_DSN
- RTMANAGER_REDIS_MASTER_ADDR
- RTMANAGER_REDIS_PASSWORD
- RTMANAGER_DOCKER_HOST
- RTMANAGER_DOCKER_NETWORK
- RTMANAGER_GAME_STATE_ROOT
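A sketch of the fail-fast rule, assuming a hypothetical requireEnv helper rather than the repo's actual config loader:

```go
package main

import (
	"log"
	"os"
)

// requireEnv aborts before any listener opens when a required variable is
// missing or empty, exiting non-zero.
func requireEnv(names ...string) map[string]string {
	vals := make(map[string]string, len(names))
	for _, name := range names {
		v, ok := os.LookupEnv(name)
		if !ok || v == "" {
			log.Fatalf("missing required environment variable %s", name)
		}
		vals[name] = v
	}
	return vals
}

func main() {
	cfg := requireEnv(
		"RTMANAGER_INTERNAL_HTTP_ADDR",
		"RTMANAGER_POSTGRES_PRIMARY_DSN",
		"RTMANAGER_REDIS_MASTER_ADDR",
		"RTMANAGER_REDIS_PASSWORD",
		"RTMANAGER_DOCKER_HOST",
		"RTMANAGER_DOCKER_NETWORK",
		"RTMANAGER_GAME_STATE_ROOT",
	)
	_ = cfg
}
```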
Configuration groups
Listener:
- RTMANAGER_INTERNAL_HTTP_ADDR (e.g. :8096).
- RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT (default 5s).
- RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT (default 15s).
- RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT (default 60s).
Docker:
- RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock).
- RTMANAGER_DOCKER_API_VERSION (default empty — let SDK negotiate).
- RTMANAGER_DOCKER_NETWORK (default galaxy-net).
- RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
- RTMANAGER_DOCKER_LOG_OPTS (default empty).
- RTMANAGER_IMAGE_PULL_POLICY (default if_missing, values if_missing | always | never).
Container defaults:
- RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0).
- RTMANAGER_DEFAULT_MEMORY (default 512m).
- RTMANAGER_DEFAULT_PIDS_LIMIT (default 512).
- RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30).
- RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
- RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
- RTMANAGER_ENGINE_STATE_ENV_NAME (default GAME_STATE_PATH).
- RTMANAGER_GAME_STATE_DIR_MODE (default 0750).
- RTMANAGER_GAME_STATE_OWNER_UID (default 0).
- RTMANAGER_GAME_STATE_OWNER_GID (default 0).
- RTMANAGER_GAME_STATE_ROOT (host path).
Postgres:
- RTMANAGER_POSTGRES_PRIMARY_DSN (postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable).
- RTMANAGER_POSTGRES_REPLICA_DSNS (optional, comma-separated; not used in v1).
- RTMANAGER_POSTGRES_OPERATION_TIMEOUT (default 2s).
- RTMANAGER_POSTGRES_MAX_OPEN_CONNS (default 10).
- RTMANAGER_POSTGRES_MAX_IDLE_CONNS (default 2).
- RTMANAGER_POSTGRES_CONN_MAX_LIFETIME (default 30m).
Redis:
- RTMANAGER_REDIS_MASTER_ADDR.
- RTMANAGER_REDIS_REPLICA_ADDRS (optional, comma-separated).
- RTMANAGER_REDIS_PASSWORD.
- RTMANAGER_REDIS_DB (default 0).
- RTMANAGER_REDIS_OPERATION_TIMEOUT (default 2s).
Streams:
- RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
- RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
- RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
- RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
- RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).
- RTMANAGER_STREAM_BLOCK_TIMEOUT (default 5s).
Health monitoring:
- RTMANAGER_INSPECT_INTERVAL (default 30s).
- RTMANAGER_PROBE_INTERVAL (default 15s).
- RTMANAGER_PROBE_TIMEOUT (default 2s).
- RTMANAGER_PROBE_FAILURES_THRESHOLD (default 3).
Reconciler / cleanup:
- RTMANAGER_RECONCILE_INTERVAL (default 5m).
- RTMANAGER_CLEANUP_INTERVAL (default 1h).
Coordination:
- RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60).
Lobby internal client:
- RTMANAGER_LOBBY_INTERNAL_BASE_URL (e.g. http://lobby:8095).
- RTMANAGER_LOBBY_INTERNAL_TIMEOUT (default 2s).
Logging:
- RTMANAGER_LOG_LEVEL (default info).
Lifecycle:
- RTMANAGER_SHUTDOWN_TIMEOUT (default 30s).
Telemetry: uses the standard OTLP env vars (OTEL_EXPORTER_OTLP_ENDPOINT,
OTEL_EXPORTER_OTLP_PROTOCOL, etc.) shared with other Galaxy services.
Observability
Metrics (OpenTelemetry, low cardinality)
- rtmanager.start_outcomes — counter, labels outcome, error_code, op_source.
- rtmanager.stop_outcomes — counter, labels outcome, reason, op_source.
- rtmanager.restart_outcomes — counter, labels outcome, error_code.
- rtmanager.patch_outcomes — counter, labels outcome, error_code.
- rtmanager.cleanup_outcomes — counter, labels outcome, op_source.
- rtmanager.docker_op_latency — histogram, label op (pull | create | start | stop | rm | inspect | events).
- rtmanager.health_events — counter, label event_type.
- rtmanager.reconcile_drift — counter, label kind (adopt | dispose | observed_exited).
- rtmanager.runtime_records_by_status — gauge, label status.
- rtmanager.lease_acquire_latency — histogram.
- rtmanager.notification_intents — counter, label notification_type.
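A sketch of recording one of these counters with the OpenTelemetry Go metric API; the instrument name and labels come from the list above, while the recordStartOutcome wiring is illustrative:

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordStartOutcome increments rtmanager.start_outcomes with the three
// low-cardinality labels listed above.
func recordStartOutcome(ctx context.Context, outcome, errorCode, opSource string) error {
	meter := otel.Meter("rtmanager")
	counter, err := meter.Int64Counter("rtmanager.start_outcomes")
	if err != nil {
		return err
	}
	counter.Add(ctx, 1, metric.WithAttributes(
		attribute.String("outcome", outcome),
		attribute.String("error_code", errorCode),
		attribute.String("op_source", opSource),
	))
	return nil
}
```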
Structured logs (slog JSON to stdout)
Common fields on every entry: service=rtmanager, request_id, trace_id, span_id,
game_id (when known), container_id (when known), op_kind, op_source, outcome,
error_code.
Worker-specific fields: stream_entry_id (consumers), event_type (health), image_ref
(start/patch).
Verification
Service-level (TESTING.md §7):
- Unit tests for every service-layer operation against mocked Docker.
- Adapter tests (PG, Redis, Docker) using testcontainers-go for PG/Redis and the Docker daemon socket for the real Docker adapter.
- Contract tests for internal-openapi.yaml, runtime-jobs-asyncapi.yaml, runtime-health-asyncapi.yaml.
Service-local integration suite under rtmanager/integration/:
- Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real galaxy/game test image.
- Replay safety (duplicate stream entries are no-ops).
- Health observability (kill the engine externally, observe container_disappeared; relaunch manually, observe reconcile adopt).
- Notification on first-touch failures (publish a start with an unresolvable image, observe runtime.image_pull_failed intent and a failure job result).
Inter-service suite under integration/lobbyrtm/:
- Real Lobby + real RTM + real galaxy/game test image. Covers happy path, cancel, and start-failed flows.
Manual smoke (development):
docker network create galaxy-net # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
After start, curl http://localhost:8096/readyz returns 200. Driving Lobby through its
public flow brings up galaxy-game-{game_id} containers; RTM logs each lifecycle transition
and publishes the corresponding stream entries.