Runtime Manager

Runtime Manager (RTM) is the only Galaxy platform service permitted to interact with the Docker daemon. It owns the lifecycle of galaxy/game engine containers and the technical runtime view of running games. Other services consume RTM via two transports: an asynchronous Redis Streams contract (used by Game Lobby) and a synchronous internal REST surface (used by Game Master and Admin Service).


Purpose

A running Galaxy game lives in exactly one Docker container. The platform must be able to:

  • create the container with the right engine version and configuration;
  • supply the engine with a stable storage location for game state;
  • keep the runtime status visible to platform-level services;
  • replace the container in place for patch upgrades and restarts;
  • remove containers that are no longer needed;
  • detect and surface engine failures to whoever should react.

Runtime Manager is the single component that performs these actions. It deliberately does not reason about platform metadata, membership, schedules, turn cutoffs, or any other business state. Game Lobby owns platform metadata; Game Master will own runtime business state when implemented.

Scope

Runtime Manager is the source of truth for:

  • the mapping game_id -> current_container_id for every running container;
  • the durable history of every start, stop, restart, patch, and cleanup operation it performed;
  • the most recent technical health observation per game (last Docker event, last successful or failed probe, last inspect result).

Runtime Manager is not the source of truth for:

  • any business or platform-level metadata of a game (owned by Game Lobby);
  • runtime state visible to players or operators as game state, including current turn, generation status, engine version registry (owned by Game Master);
  • the engine version catalogue or which engine version a game is allowed to use (Game Master is the future owner; Game Lobby supplies image_ref in v1);
  • contents of the engine state directory; that is engine domain;
  • backup, archival, or operator cleanup of state directories.

Non-Goals

  • Multi-instance operation in v1. Coordination is single-process; multiple replicas are an explicit future iteration.
  • Engine version arbitration. The producer (Game Lobby in v1, Game Master later) supplies image_ref.
  • Image registry control. Pull policy is configurable, but RTM does not push, retag, or promote images.
  • TLS or mTLS on the internal listener. RTM trusts its network segment.
  • Direct delivery of player-visible push notifications. RTM publishes admin-only notification intents only for failures invisible elsewhere; everything else is delegated.
  • Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached through unix:///var/run/docker.sock.

Position in the System

```mermaid
flowchart LR
    Lobby["Game Lobby"]
    GM["Game Master"]
    Admin["Admin Service"]
    Notify["Notification Service"]
    RTM["Runtime Manager"]
    Engine["Game Engine container"]
    Docker["Docker Daemon"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases"]

    Lobby -->|runtime:start_jobs / stop_jobs| RTM
    RTM -->|runtime:job_results| Lobby
    GM -->|internal REST| RTM
    Admin -->|internal REST| RTM
    RTM -->|notification:intents (admin)| Notify
    RTM -->|runtime:health_events| Redis
    RTM <--> Docker
    Docker -->|create / start / stop / rm| Engine
    RTM --> Postgres
    RTM --> Redis
    Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```

Responsibility Boundaries

Runtime Manager is responsible for:

  • accepting start, stop, restart, patch, inspect, and cleanup requests through the supported transports and producing one durable outcome per request;
  • creating Docker containers from a producer-supplied image_ref and binding them to the configured Docker network and host state directory;
  • enforcing the one-game-one-container invariant in its own state and on Docker;
  • monitoring container health through Docker events, periodic inspect, and active HTTP probes;
  • publishing technical runtime events (runtime:job_results, runtime:health_events) and admin-only notification intents for failures that no other service can observe;
  • reconciling its persistent state with Docker reality on startup and periodically;
  • removing exited containers automatically by retention TTL or explicitly by admin command.

Runtime Manager is not responsible for:

  • evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
  • registering a started runtime with Game Master (Lobby calls GM after a successful job result);
  • mapping platform users to engine players (GM owns this mapping);
  • player command routing (GM proxies player commands directly to engine);
  • cleaning up host state directories;
  • patching the engine version registry; the registry lives in Game Master.

Container Model

Network

Containers attach to a single user-defined Docker bridge network. The network is provisioned outside RTM: docker-compose, Terraform, or an operator runbook creates galaxy-net (or whatever name is configured via RTMANAGER_DOCKER_NETWORK).

RTM validates the network's presence at startup. A missing network is a fail-fast condition; the process exits non-zero before opening any listener.

DNS name and engine endpoint

Each container is created with hostname galaxy-game-{game_id} and is attached to the configured network. Docker's embedded DNS resolves the hostname for any other container in the same network.

The engine_endpoint published in runtime:job_results and visible through the inspect REST endpoint is the full URL http://galaxy-game-{game_id}:8080. The port is fixed at 8080 inside the container; RTM does not publish ports to the host.

Restart and patch keep the same DNS name. The container_id changes; the engine_endpoint does not.

State storage (bind mount)

Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale is operator-friendly backup and inspection.

  • Host root: RTMANAGER_GAME_STATE_ROOT (operator-supplied, e.g. /var/lib/galaxy/games).
  • Per-game directory: <RTMANAGER_GAME_STATE_ROOT>/{game_id}. RTM creates it with permissions RTMANAGER_GAME_STATE_DIR_MODE (default 0750) and ownership RTMANAGER_GAME_STATE_OWNER_UID / _GID (default 0:0 — operator overrides for non-root engine).
  • Bind mount: the per-game directory is mounted into the container at the path declared by RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
  • Environment: the container receives GAME_STATE_PATH=<mount path> and resolves its state path from this variable. The same value is also passed as STORAGE_PATH for backward compatibility; both names are accepted in v1.

RTM never deletes the host state directory. Removing it is the responsibility of operator tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container through the cleanup endpoint or the retention TTL leaves the directory intact.
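
For illustration, a minimal sketch of the directory preparation and the bind specification it feeds into; the helper names (ensureStateDir, bindSpec) are hypothetical, and the mode, UID, and GID values come from the configuration group described above:

```go
package statedir

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ensureStateDir creates <root>/<gameID> with the configured mode and
// ownership and returns the absolute host path. Names and layout are
// illustrative; the real adapter lives elsewhere in the service.
func ensureStateDir(root, gameID string, mode fs.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll is subject to the process umask; re-apply the exact mode.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}

// bindSpec renders the Docker bind-mount string for the container:
// "<host dir>:<RTMANAGER_ENGINE_STATE_MOUNT_PATH>".
func bindSpec(hostDir, mountPath string) string {
	return fmt.Sprintf("%s:%s", hostDir, mountPath)
}
```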

Container labels

RTM applies the following labels to every container it creates:

| Label | Value | Purpose |
|---|---|---|
| com.galaxy.owner | rtmanager | Filter for docker ps and reconcile. |
| com.galaxy.kind | game-engine | Differentiates from infra containers. |
| com.galaxy.game_id | {game_id} | Reverse lookup from container to platform game. |
| com.galaxy.engine_image_ref | {image_ref} | Cross-check against runtime_records. |
| com.galaxy.started_at_ms | {ms} | Unambiguous start timestamp. |

Resource limits, by contrast, are read from labels on the resolved engine image rather than from the container labels above (see below).

Resource limits

Resource limits originate in the engine image, not in the producer envelope or RTM config:

| Image label | Container limit | RTM fallback config |
|---|---|---|
| com.galaxy.cpu_quota | --cpus value | RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0) |
| com.galaxy.memory | --memory value | RTMANAGER_DEFAULT_MEMORY (default 512m) |
| com.galaxy.pids_limit | --pids-limit value | RTMANAGER_DEFAULT_PIDS_LIMIT (default 512) |

If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass limits.
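
A sketch of the label-to-limit resolution under the fallback rule above; the Defaults and Limits types are illustrative, not the service's actual structs:

```go
package limits

import "strconv"

// Defaults mirrors the RTMANAGER_DEFAULT_* fallbacks.
type Defaults struct {
	CPUQuota  float64 // RTMANAGER_DEFAULT_CPU_QUOTA
	Memory    string  // RTMANAGER_DEFAULT_MEMORY, e.g. "512m"
	PidsLimit int64   // RTMANAGER_DEFAULT_PIDS_LIMIT
}

// Limits is what the container create call ultimately receives.
type Limits struct {
	CPUQuota  float64
	Memory    string
	PidsLimit int64
}

// FromImageLabels reads the com.galaxy.* labels of the resolved engine image;
// a missing or unparseable label falls back to the configured default.
func FromImageLabels(labels map[string]string, d Defaults) Limits {
	out := Limits{CPUQuota: d.CPUQuota, Memory: d.Memory, PidsLimit: d.PidsLimit}
	if v, err := strconv.ParseFloat(labels["com.galaxy.cpu_quota"], 64); err == nil && v > 0 {
		out.CPUQuota = v
	}
	if v := labels["com.galaxy.memory"]; v != "" {
		out.Memory = v // translated into a byte count when the container is created
	}
	if v, err := strconv.ParseInt(labels["com.galaxy.pids_limit"], 10, 64); err == nil && v > 0 {
		out.PidsLimit = v
	}
	return out
}
```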

Logging driver

Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver and its options when creating the container:

  • RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
  • RTMANAGER_DOCKER_LOG_OPTS (default empty; comma-separated key=value pairs).

RTM never reads the container's stdout itself. Operators consume engine logs via docker logs or via whatever sink the configured driver feeds (fluentd, journald, etc.).

The production Docker SDK adapter that creates and starts these containers lives at internal/adapters/docker/. Its design rationale — the fixed engine port, partial rollback on ContainerStart failure, the events-stream filter choice, and the mockgen-driven service-test fixture — is captured in docs/adapters.md.

Runtime Surface

Listeners

| Listener | Default address | Purpose |
|---|---|---|
| internal HTTP | :8096 (RTMANAGER_INTERNAL_HTTP_ADDR) | Probes (/healthz, /readyz) and the trusted REST surface for Game Master and Admin Service. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted network segment.

Background workers

| Worker | Driver | Description |
|---|---|---|
| startjobs consumer | Redis Stream runtime:start_jobs | Decodes start envelope and invokes the start service. |
| stopjobs consumer | Redis Stream runtime:stop_jobs | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker /events API | Subscribes with the label filter, emits runtime:health_events for container_started / exited / oom / disappeared. |
| Active HTTP probe | Periodic | GET {engine_endpoint}/healthz for every running runtime; emits probe_failed / probe_recovered with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits inspect_unhealthy when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles runtime_records with docker ps (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than RTMANAGER_CONTAINER_RETENTION_DAYS. |

Startup dependencies

In start order:

  1. PostgreSQL primary (DSN RTMANAGER_POSTGRES_PRIMARY_DSN). Goose migrations apply synchronously before any listener opens.
  2. Redis master (RTMANAGER_REDIS_MASTER_ADDR).
  3. Docker daemon at RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock). RTM verifies API ping and the presence of RTMANAGER_DOCKER_NETWORK.
  4. Telemetry exporter (OTLP grpc/http or stdout).
  5. Internal HTTP listener.
  6. Reconciler runs once and blocks until done.
  7. Background workers start.

A failure in any step is fatal and exits the process non-zero.

Probes

/healthz reports liveness — the process responds when the HTTP server is alive.

/readyz reports readiness — 200 only when:

  • the PostgreSQL pool can ping the primary;
  • the Redis master client can ping;
  • the Docker client can ping;
  • the configured Docker network exists.

Both probes are documented in ./api/internal-openapi.yaml.
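
A sketch of how the four readiness conditions could be aggregated behind /readyz; the Check type and the wiring are assumptions, only the pass/fail semantics come from this section:

```go
package probes

import (
	"context"
	"net/http"
	"time"
)

// Check is one named readiness dependency (PG ping, Redis ping, Docker ping,
// network presence). The concrete functions are wired in at startup.
type Check struct {
	Name string
	Fn   func(ctx context.Context) error
}

// ReadyzHandler returns 200 only when every check passes, 503 otherwise.
func ReadyzHandler(checks []Check) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for _, c := range checks {
			if err := c.Fn(ctx); err != nil {
				http.Error(w, c.Name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
}
```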

Lifecycles

All operations share a per-game-id Redis lease (rtmanager:game_lease:{game_id}, TTL RTMANAGER_GAME_LEASE_TTL_SECONDS, default 60). The lease serialises operations on a single game across all entry points (stream consumers and REST handlers). v1 does not renew the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before the operation finishes — the trade-off is documented in docs/services.md §1.
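
A sketch of the lease acquisition with go-redis, assuming the key shape and TTL described above; the owner token and release script are illustrative details, not the service's actual implementation:

```go
package lease

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease only if the caller still owns it.
var releaseScript = redis.NewScript(
	`if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) end return 0`)

// Acquire takes the per-game lease (SET NX PX). It returns a release func,
// or an error when another operation already holds the lease.
func Acquire(ctx context.Context, rdb *redis.Client, gameID string, ttl time.Duration) (func(context.Context) error, error) {
	key := fmt.Sprintf("rtmanager:game_lease:%s", gameID)
	token := uuid.NewString()
	ok, err := rdb.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, fmt.Errorf("game %s: lease already held", gameID)
	}
	release := func(ctx context.Context) error {
		return releaseScript.Run(ctx, rdb, []string{key}, token).Err()
	}
	return release, nil
}
```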

Start

Triggers:

  • Lobby: a Redis Streams entry on runtime:start_jobs with envelope {game_id, image_ref, requested_at_ms}.
  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/start with body {image_ref}.

Pre-conditions:

  • image_ref is a non-empty string and parseable as a Docker reference.
  • Configured Docker network exists.
  • The lease for {game_id} is acquired.

Flow on success:

  1. Read runtime_records.{game_id}. If status=running with the same image_ref, return the existing record (idempotent success, error_code=replay_no_op).
  2. Pull the image per RTMANAGER_IMAGE_PULL_POLICY (default if_missing).
  3. Inspect the resolved image, derive resource limits from labels.
  4. Ensure the per-game state directory exists with the configured mode and ownership.
  5. docker create with the configured network, hostname, labels, env (GAME_STATE_PATH, STORAGE_PATH), bind mount, log driver, resource limits.
  6. docker start.
  7. Upsert runtime_records (status=running, current_container_id, engine_endpoint, current_image_ref, started_at, last_op_at).
  8. Append operation_log entry (op_kind=start, outcome=success, source-specific op_source).
  9. Publish runtime:health_events container_started.
  10. For Lobby callers: publish runtime:job_results {game_id, outcome=success, container_id, engine_endpoint}. For REST callers: respond 200 with the runtime record.

Failure paths:

| Failure | PG side effect | Notification intent | Outcome to caller |
|---|---|---|---|
| Invalid image_ref shape, network missing | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |
| Image pull error | operation_log failure | runtime.image_pull_failed | failure / image_pull_failed |
| docker create / start error | operation_log failure | runtime.container_start_failed | failure / container_start_failed |
| State directory creation error | operation_log failure | runtime.start_config_invalid | failure / start_config_invalid |

A failed start never leaves a partially-running container: if docker create succeeded but the subsequent step failed, RTM removes the container before recording the failure.
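
A sketch of that rollback rule, written against a hypothetical narrow port rather than the Docker SDK; the interface and type names are placeholders for the real adapter:

```go
package startruntime

import "context"

// containerPort is the slice of the Docker adapter the start flow needs here.
// The real port and its SDK-backed implementation live under internal/.
type containerPort interface {
	Create(ctx context.Context, spec ContainerSpec) (containerID string, err error)
	Start(ctx context.Context, containerID string) error
	Remove(ctx context.Context, containerID string, force bool) error
}

// ContainerSpec stands in for the fully resolved create request
// (image_ref, network, hostname, labels, env, bind mount, limits).
type ContainerSpec struct{ /* ... */ }

// createAndStart enforces "a failed start never leaves a partially-running
// container": if Create succeeded but Start failed, the container is removed
// before the failure is recorded.
func createAndStart(ctx context.Context, docker containerPort, spec ContainerSpec) (string, error) {
	id, err := docker.Create(ctx, spec)
	if err != nil {
		return "", err // nothing to roll back
	}
	if err := docker.Start(ctx, id); err != nil {
		_ = docker.Remove(ctx, id, true) // best-effort rollback; failure is logged upstream
		return "", err
	}
	return id, nil
}
```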

The production start orchestrator that implements the flow and the failure paths above lives at internal/service/startruntime/. Its design rationale — why the per-game lease and the health-events publisher live with the start service, the Result-shaped contract consumed by the stream consumer and the REST handler, the rollback rule on Upsert failure, and the created_at-preservation rule for re-starts — is captured in docs/services.md.

Stop

Triggers:

  • Lobby: Redis Streams entry on runtime:stop_jobs with envelope {game_id, reason, requested_at_ms}. reason ∈ {orphan_cleanup, cancelled, finished, admin_request, timeout}.
  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/stop with body {reason}.

Pre-conditions:

  • Lease acquired.

Flow on success:

  1. Read runtime_records.{game_id}. If status is stopped or removed, return idempotent success (error_code=replay_no_op).
  2. docker stop with RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30). Docker fires SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP shutdown endpoint on the engine.
  3. Update runtime_records (status=stopped, stopped_at, last_op_at).
  4. Append operation_log entry.
  5. Publish runtime:job_results (for Lobby) or REST 200 (for REST callers).

The container stays in exited state until the cleanup worker removes it (TTL) or an admin command forces removal.

Failure paths:

| Failure | Outcome |
|---|---|
| Container not found in Docker but record running | Update record status=removed, publish container_disappeared, return success (RTM treats this as already-stopped). |
| docker stop returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |

Restart

Triggers:

  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/restart.

Restart is recreate: stop + remove + run with the same image_ref and the same bind mount. container_id changes; engine_endpoint is stable.

Flow:

  1. Read runtime_records.{game_id}. The current image_ref is captured.
  2. Acquire lease.
  3. Run the stop flow (without releasing the lease).
  4. docker rm the container.
  5. Run the start flow with the captured image_ref.
  6. Append a single operation_log entry with op_kind=restart and a correlation id linking the implicit stop and start log entries.

If any inner step fails, the operation log records the partial outcome and the outer caller receives the same failure; the runtime record converges to whatever state Docker reports.

Patch

Triggers:

  • Game Master / Admin Service: POST /api/v1/internal/runtimes/{game_id}/patch with body {image_ref}.

Patch is restart with a new image_ref. The engine reads its state from the bind mount on startup, so any data written before the patch survives.

Pre-conditions:

  • Both the new and the current image refs parse as semver tags (image_ref_not_semver failure otherwise).
  • Major and minor versions are equal between current and new (semver_patch_only failure otherwise).

Flow: identical to restart, with a new image_ref injected before the start step. operation_log entry has op_kind=patch.
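
A sketch of the two semver pre-conditions using a hand-rolled tag parse (no particular semver library is implied); the returned errors map onto the stable codes image_ref_not_semver and semver_patch_only:

```go
package patch

import (
	"errors"
	"strconv"
	"strings"
)

var (
	ErrNotSemver = errors.New("image_ref_not_semver")
	ErrNotPatch  = errors.New("semver_patch_only")
)

// tagVersion extracts MAJOR.MINOR.PATCH from the tag part of an image ref
// such as "registry.local/galaxy/game:1.4.2".
func tagVersion(imageRef string) (major, minor, patch int, err error) {
	i := strings.LastIndex(imageRef, ":")
	if i < 0 {
		return 0, 0, 0, ErrNotSemver
	}
	parts := strings.SplitN(strings.TrimPrefix(imageRef[i+1:], "v"), ".", 3)
	if len(parts) != 3 {
		return 0, 0, 0, ErrNotSemver
	}
	nums := make([]int, 3)
	for j, p := range parts {
		if nums[j], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, ErrNotSemver
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// validatePatch allows only a change of the patch component between the
// currently running image ref and the requested one.
func validatePatch(currentRef, newRef string) error {
	cMaj, cMin, _, err := tagVersion(currentRef)
	if err != nil {
		return err
	}
	nMaj, nMin, _, err := tagVersion(newRef)
	if err != nil {
		return err
	}
	if cMaj != nMaj || cMin != nMin {
		return ErrNotPatch
	}
	return nil
}
```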

Cleanup

Triggers:

  • Periodic worker: every container with runtime_records.status=stopped and last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
  • Admin Service: DELETE /api/v1/internal/runtimes/{game_id}/container.

Pre-conditions:

  • The container is not in running state. RTM refuses to remove a running container through this path; stop first.

Flow:

  1. Acquire lease.
  2. docker rm the container.
  3. Update runtime_records (status=removed, removed_at, current_container_id=NULL, last_op_at).
  4. Append operation_log entry (op_kind=cleanup_container, op_source ∈ {auto_ttl, admin_rest}).

The host state directory is left untouched.

Health Monitoring

Three independent sources feed runtime:health_events and health_snapshots:

  1. Docker events listener. Subscribes to the Docker events stream and filters container-scoped events by the com.galaxy.owner=rtmanager label written into every container by the start service. Emits:

    • container_exited (action=die with non-zero exit code; exit 0 is the normal graceful stop and is suppressed).
    • container_oom (action=oom).
    • container_disappeared (action=destroy observed for a runtime_records.status=running row whose current_container_id still matches the destroyed container, i.e. a destroy RTM did not initiate).

    container_started is emitted by the start service when it runs the container (see internal/service/startruntime), not by this listener.

  2. Periodic Docker inspect every RTMANAGER_INSPECT_INTERVAL (default 30s). Emits inspect_unhealthy when:

    • RestartCount increases between observations;
    • State.Status != "running" for a record marked running;
    • State.Health.Status == "unhealthy" if the image declares a Docker HEALTHCHECK.
  3. Active HTTP probe every RTMANAGER_PROBE_INTERVAL (default 15s). Calls GET {engine_endpoint}/healthz with RTMANAGER_PROBE_TIMEOUT (default 2s). Emits:

    • probe_failed after RTMANAGER_PROBE_FAILURES_THRESHOLD consecutive failures (default 3);
    • probe_recovered on the first success after a probe_failed was published.

Every emission updates health_snapshots.{game_id} (latest event becomes the snapshot) and appends to runtime:health_events.
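
A sketch of the per-runtime hysteresis state behind probe_failed / probe_recovered; the field and method names are illustrative:

```go
package healthprobe

// probeState tracks one runtime's consecutive probe failures so that
// probe_failed fires once at the threshold and probe_recovered fires once
// on the first success afterwards.
type probeState struct {
	consecutiveFailures int
	failedPublished     bool
}

// observe records one probe result and returns the event to emit, if any:
// "probe_failed", "probe_recovered", or "" for no emission.
func (s *probeState) observe(success bool, threshold int) string {
	if success {
		s.consecutiveFailures = 0
		if s.failedPublished {
			s.failedPublished = false
			return "probe_recovered"
		}
		return ""
	}
	s.consecutiveFailures++
	if !s.failedPublished && s.consecutiveFailures >= threshold {
		s.failedPublished = true
		return "probe_failed"
	}
	return ""
}
```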

In v1, RTM publishes admin-only notification intents only for first-touch failures of the start flow. All ongoing health changes (probe failures, OOMs, exits) flow through runtime:health_events only. Game Master is the consumer that decides whether to escalate runtime-level events into notifications.

The three workers that implement the sources above live in internal/worker/{dockerevents,dockerinspect,healthprobe}. Their design rationale — container_started ownership, container_disappeared emission rules, die exit-code suppression, probe hysteresis state model, parallel-probe cap, and the events-listener reconnect policy — is captured in docs/workers.md.

Reconciliation

RTM never assumes Docker and PostgreSQL are in sync.

At startup (blocking, before workers start) and every RTMANAGER_RECONCILE_INTERVAL (default 5m):

  1. List Docker containers with label com.galaxy.owner=rtmanager.
  2. For each running container without a matching record:
    • Insert a runtime_records row with status=running, the discovered current_image_ref, engine_endpoint, and started_at taken from com.galaxy.started_at_ms if present (otherwise from State.StartedAt).
    • Append operation_log entry with op_kind=reconcile_adopt, op_source=auto_reconcile.
    • Never stop or remove an unrecorded container. Operators may have started one manually for diagnostics; RTM stays out of their way.
  3. For each runtime_records row with status=running whose container is missing:
    • Update status=removed, removed_at=now, current_container_id=NULL.
    • Publish runtime:health_events container_disappeared.
    • Append operation_log entry with op_kind=reconcile_dispose.
  4. For each runtime_records row with status=running whose container exists but is in exited:
    • Update status=stopped, stopped_at=now (reconciler observation time).
    • Publish runtime:health_events container_exited with the observed exit code.

The reconciler implementation lives at internal/worker/reconcile/ and the periodic TTL-cleanup worker at internal/worker/containercleanup/; the cleanup worker delegates removal to internal/service/cleanupcontainer/. The design rationale — the per-game lease around every drift mutation, the third observed_exited path beyond the two named cases, the synchronous ReconcileNow plus periodic Component split, and why the cleanup worker is a thin TTL filter on top of the existing service — is captured in docs/workers.md.
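
A sketch of the drift classification behind the three cases above; the record and container shapes are simplified placeholders, and applying each bucket under the per-game lease is left to the real worker:

```go
package reconcile

// observed is the minimal view of a labelled container from docker ps.
type observed struct {
	ContainerID string
	GameID      string
	Running     bool
	ExitCode    int
}

// record is the minimal view of a runtime_records row with status=running.
type record struct {
	GameID      string
	ContainerID string
}

type drift struct {
	Adopt          []observed // running container, no record: insert + reconcile_adopt
	Dispose        []record   // record running, container gone: removed + container_disappeared
	ObservedExited []observed // record running, container exited: stopped + container_exited
}

// classify compares Docker reality with the running records and buckets the
// differences for the caller to apply.
func classify(containers []observed, running []record) drift {
	byGame := make(map[string]observed, len(containers))
	for _, c := range containers {
		byGame[c.GameID] = c
	}
	recorded := make(map[string]bool, len(running))
	var d drift
	for _, r := range running {
		recorded[r.GameID] = true
		c, ok := byGame[r.GameID]
		switch {
		case !ok:
			d.Dispose = append(d.Dispose, r)
		case !c.Running:
			d.ObservedExited = append(d.ObservedExited, c)
		}
	}
	for _, c := range containers {
		if c.Running && !recorded[c.GameID] {
			d.Adopt = append(d.Adopt, c)
		}
	}
	return d
}
```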

Trusted Surfaces

Internal REST

The internal REST surface is consumed by Game Master (sync interactions for inspect, restart, patch, stop, cleanup) and Admin Service (operational tooling, force-cleanup). The listener is unauthenticated; downstream services rely on network segmentation.

| Method | Path | Operation ID | Caller |
|---|---|---|---|
| GET | /healthz | internalHealthz | platform probes |
| GET | /readyz | internalReadyz | platform probes |
| GET | /api/v1/internal/runtimes | internalListRuntimes | GM, Admin |
| GET | /api/v1/internal/runtimes/{game_id} | internalGetRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/start | internalStartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/stop | internalStopRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/restart | internalRestartRuntime | GM, Admin |
| POST | /api/v1/internal/runtimes/{game_id}/patch | internalPatchRuntime | GM, Admin |
| DELETE | /api/v1/internal/runtimes/{game_id}/container | internalCleanupRuntimeContainer | Admin |

Request and response shapes are defined in ./api/internal-openapi.yaml. Unknown JSON fields are rejected with invalid_request.

Callers identify themselves through the optional X-Galaxy-Caller request header (gm for Game Master, admin for Admin Service). The header is recorded as op_source in operation_log (gm_rest or admin_rest); when it is missing or carries any other value, Runtime Manager defaults to op_source = admin_rest. The header is documented on every runtime endpoint in ./api/internal-openapi.yaml.
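
A sketch of a trusted caller invoking the surface with the X-Galaxy-Caller header set; the client wiring and base URL are assumptions:

```go
package rtmclient

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
)

// stopRuntime issues POST /api/v1/internal/runtimes/{game_id}/stop as
// Game Master, so the operation is logged with op_source=gm_rest.
func stopRuntime(ctx context.Context, hc *http.Client, baseURL, gameID, reason string) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"reason":%q}`, reason))
	url := fmt.Sprintf("%s/api/v1/internal/runtimes/%s/stop", baseURL, gameID)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Galaxy-Caller", "gm")
	resp, err := hc.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("stop runtime %s: unexpected status %d", gameID, resp.StatusCode)
	}
	return nil
}
```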

Async Stream Contracts

runtime:start_jobs (in)

Producer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | Lobby game_id. |
| image_ref | string | Docker reference. Lobby resolves it from target_engine_version using LOBBY_ENGINE_IMAGE_TEMPLATE. |
| requested_at_ms | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |

runtime:stop_jobs (in)

Producer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| reason | enum | orphan_cleanup, cancelled, finished, admin_request, timeout. Recorded in operation_log.error_code when the reason matters; otherwise opaque. |
| requested_at_ms | int64 | |

runtime:job_results (out)

Producer: Runtime Manager. Consumer: Game Lobby.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| outcome | enum | success, failure. |
| container_id | string | Required for success. Empty on failure. |
| engine_endpoint | string | Required for success. Empty on failure. |
| error_code | string | Stable code. replay_no_op for idempotent re-runs. |
| error_message | string | Operator-readable detail. |

runtime:health_events (out, new)

Producer: Runtime Manager. Consumers: Game Master; Game Lobby and Admin Service are reserved as future consumers.

| Field | Type | Notes |
|---|---|---|
| game_id | string | |
| container_id | string | The container observed (may differ from current after a restart race). |
| event_type | enum | See below. |
| occurred_at_ms | int64 | UTC milliseconds. |
| details | json | Type-specific payload. |

event_type values and their details schemas:

| event_type | details payload |
|---|---|
| container_started | {image_ref} |
| container_exited | {exit_code, oom: bool} |
| container_oom | {exit_code} |
| container_disappeared | {} |
| inspect_unhealthy | {restart_count, state, health} |
| probe_failed | {consecutive_failures, last_status, last_error} |
| probe_recovered | {prior_failure_count} |

The full schema is enforced by ./api/runtime-health-asyncapi.yaml.
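
A sketch of emitting one such entry with go-redis, assuming the field names in the table above; the exact wire encoding is governed by ./api/runtime-health-asyncapi.yaml:

```go
package healthevents

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// publish appends one runtime:health_events entry; details is the
// type-specific payload (e.g. {"exit_code": 137, "oom": true}).
func publish(ctx context.Context, rdb *redis.Client, stream, gameID, containerID, eventType string, details any) error {
	payload, err := json.Marshal(details)
	if err != nil {
		return err
	}
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream, // RTMANAGER_REDIS_HEALTH_EVENTS_STREAM
		Values: map[string]interface{}{
			"game_id":        gameID,
			"container_id":   containerID,
			"event_type":     eventType,
			"occurred_at_ms": time.Now().UnixMilli(),
			"details":        string(payload),
		},
	}).Err()
}
```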

Notification Contracts

Runtime Manager publishes admin-only notification intents only for failures invisible to any other service:

| Trigger | notification_type | Audience | Channels |
|---|---|---|---|
| Image pull error during start | runtime.image_pull_failed | admin | email |
| docker create / docker start error | runtime.container_start_failed | admin | email |
| Configuration validation error at start (bad image_ref, missing network) | runtime.start_config_invalid | admin | email |

Constructors live in galaxy/pkg/notificationintent. Catalog entries live in ../notification/README.md and ../notification/api/intents-asyncapi.yaml. All three intents share the frozen field set {game_id, image_ref, error_code, error_message, attempted_at_ms}; the _ms suffix on attempted_at_ms follows the repo-wide convention for millisecond integer fields. The Redis Streams publisher wrapper used to emit these intents from RTM ships in internal/adapters/notificationpublisher/; the rationale for the signature shim that drops the upstream entry id lives in docs/domain-and-ports.md §7 and the production wiring is documented in docs/adapters.md.

Runtime-level changes after a successful start (probe failures, OOM, container exited) do not produce notifications from RTM. Game Master decides whether to escalate.

Persistence Layout

PostgreSQL durable state (schema rtmanager)

| Table | Purpose | Key |
|---|---|---|
| runtime_records | One row per game, latest known runtime status. | game_id |
| operation_log | Append-only audit of every operation RTM performed. | id (auto) |
| health_snapshots | Latest health observation per game. | game_id |

runtime_records columns:

  • game_id — primary key, references Lobby's identifier.
  • status — running | stopped | removed.
  • current_container_id — nullable when status=removed.
  • current_image_ref — non-null when status is running or stopped.
  • engine_endpoint — http://galaxy-game-{game_id}:8080.
  • state_path — absolute host path of the bind-mounted directory.
  • docker_network — network name observed at create time.
  • started_at, stopped_at, removed_at — last transition timestamps.
  • last_op_at — drives retention TTL.
  • created_at — first time RTM saw the game.

operation_log columns:

  • id, game_id, op_kind (start | stop | restart | patch | cleanup_container | reconcile_adopt | reconcile_dispose), op_source (lobby_stream | gm_rest | admin_rest | auto_ttl | auto_reconcile), source_ref (stream entry id, REST request id, or admin user), image_ref, container_id, outcome (success | failure), error_code, error_message, started_at, finished_at.

health_snapshots columns:

  • game_id, container_id, status (healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared), source (docker_event | inspect | probe), details (jsonb), observed_at.

Indexes:

  • runtime_records (status, last_op_at) — drives cleanup worker.
  • operation_log (game_id, started_at DESC) — drives audit reads.

Migrations are embedded; 00001_init.sql is the single init migration (single-init pre-launch policy from ARCHITECTURE.md §Persistence Backends).

Redis runtime-coordination state

| Key shape | Purpose |
|---|---|
| rtmanager:stream_offsets:{label} | Last processed entry id per consumer (startjobs, stopjobs). Same shape as Lobby. |
| rtmanager:game_lease:{game_id} | Per-game lease string (SET ... NX PX <ttl>). TTL is RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60s); not renewed mid-operation in v1. The trade-off is documented in docs/services.md §1. |

Stream key shapes themselves are configurable:

  • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
  • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
  • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
  • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
  • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).

Error Model

Error envelope: { "error": { "code": "...", "message": "..." } }, identical to Lobby's.

Stable error codes:

| Code | Meaning |
|---|---|
| invalid_request | Malformed JSON, unknown fields, missing required parameter. |
| not_found | Runtime record does not exist. |
| conflict | Operation incompatible with current status. |
| service_unavailable | Dependency unavailable (Docker daemon, PG, Redis). |
| internal_error | Unspecified failure. |
| image_pull_failed | Image pull attempt failed. |
| image_ref_not_semver | Patch attempted with a tag that is not parseable semver. |
| semver_patch_only | Patch attempted across major/minor boundary. |
| container_start_failed | docker create / docker start failed. |
| start_config_invalid | Network missing, bind path inaccessible, or other config error. |
| docker_unavailable | Docker daemon ping failed. |
| replay_no_op | Idempotent replay; outcome is success but no work was done. |
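
A sketch of the envelope and a response helper, with a subset of the codes above as constants; the package and function names are illustrative:

```go
package httperr

import (
	"encoding/json"
	"net/http"
)

// Stable error codes (subset shown).
const (
	CodeInvalidRequest     = "invalid_request"
	CodeNotFound           = "not_found"
	CodeConflict           = "conflict"
	CodeImagePullFailed    = "image_pull_failed"
	CodeReplayNoOp         = "replay_no_op"
	CodeServiceUnavailable = "service_unavailable"
)

// Envelope matches { "error": { "code": "...", "message": "..." } }.
type Envelope struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// Write renders the envelope with the given HTTP status.
func Write(w http.ResponseWriter, status int, code, message string) {
	var e Envelope
	e.Error.Code = code
	e.Error.Message = message
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(e)
}
```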

Configuration

All variables use the RTMANAGER_ prefix. A missing required variable is a fail-fast condition at startup.

Required

  • RTMANAGER_INTERNAL_HTTP_ADDR
  • RTMANAGER_POSTGRES_PRIMARY_DSN
  • RTMANAGER_REDIS_MASTER_ADDR
  • RTMANAGER_REDIS_PASSWORD
  • RTMANAGER_DOCKER_HOST
  • RTMANAGER_DOCKER_NETWORK
  • RTMANAGER_GAME_STATE_ROOT

Configuration groups

Listener:

  • RTMANAGER_INTERNAL_HTTP_ADDR (e.g. :8096).
  • RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT (default 5s).
  • RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT (default 15s).
  • RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT (default 60s).

Docker:

  • RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock).
  • RTMANAGER_DOCKER_API_VERSION (default empty — let SDK negotiate).
  • RTMANAGER_DOCKER_NETWORK (default galaxy-net).
  • RTMANAGER_DOCKER_LOG_DRIVER (default json-file).
  • RTMANAGER_DOCKER_LOG_OPTS (default empty).
  • RTMANAGER_IMAGE_PULL_POLICY (default if_missing, values if_missing | always | never).

Container defaults:

  • RTMANAGER_DEFAULT_CPU_QUOTA (default 1.0).
  • RTMANAGER_DEFAULT_MEMORY (default 512m).
  • RTMANAGER_DEFAULT_PIDS_LIMIT (default 512).
  • RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS (default 30).
  • RTMANAGER_CONTAINER_RETENTION_DAYS (default 30).
  • RTMANAGER_ENGINE_STATE_MOUNT_PATH (default /var/lib/galaxy-game).
  • RTMANAGER_ENGINE_STATE_ENV_NAME (default GAME_STATE_PATH).
  • RTMANAGER_GAME_STATE_DIR_MODE (default 0750).
  • RTMANAGER_GAME_STATE_OWNER_UID (default 0).
  • RTMANAGER_GAME_STATE_OWNER_GID (default 0).
  • RTMANAGER_GAME_STATE_ROOT (host path).

Postgres:

  • RTMANAGER_POSTGRES_PRIMARY_DSN (postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable).
  • RTMANAGER_POSTGRES_REPLICA_DSNS (optional, comma-separated; not used in v1).
  • RTMANAGER_POSTGRES_OPERATION_TIMEOUT (default 2s).
  • RTMANAGER_POSTGRES_MAX_OPEN_CONNS (default 10).
  • RTMANAGER_POSTGRES_MAX_IDLE_CONNS (default 2).
  • RTMANAGER_POSTGRES_CONN_MAX_LIFETIME (default 30m).

Redis:

  • RTMANAGER_REDIS_MASTER_ADDR.
  • RTMANAGER_REDIS_REPLICA_ADDRS (optional, comma-separated).
  • RTMANAGER_REDIS_PASSWORD.
  • RTMANAGER_REDIS_DB (default 0).
  • RTMANAGER_REDIS_OPERATION_TIMEOUT (default 2s).

Streams:

  • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs).
  • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs).
  • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results).
  • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events).
  • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents).
  • RTMANAGER_STREAM_BLOCK_TIMEOUT (default 5s).

Health monitoring:

  • RTMANAGER_INSPECT_INTERVAL (default 30s).
  • RTMANAGER_PROBE_INTERVAL (default 15s).
  • RTMANAGER_PROBE_TIMEOUT (default 2s).
  • RTMANAGER_PROBE_FAILURES_THRESHOLD (default 3).

Reconciler / cleanup:

  • RTMANAGER_RECONCILE_INTERVAL (default 5m).
  • RTMANAGER_CLEANUP_INTERVAL (default 1h).

Coordination:

  • RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60).

Lobby internal client:

  • RTMANAGER_LOBBY_INTERNAL_BASE_URL (e.g. http://lobby:8095).
  • RTMANAGER_LOBBY_INTERNAL_TIMEOUT (default 2s).

Logging:

  • RTMANAGER_LOG_LEVEL (default info).

Lifecycle:

  • RTMANAGER_SHUTDOWN_TIMEOUT (default 30s).

Telemetry: uses the standard OTLP env vars (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, etc.) shared with other Galaxy services.

Observability

Metrics (OpenTelemetry, low cardinality)

  • rtmanager.start_outcomes — counter, labels outcome, error_code, op_source.
  • rtmanager.stop_outcomes — counter, labels outcome, reason, op_source.
  • rtmanager.restart_outcomes — counter, labels outcome, error_code.
  • rtmanager.patch_outcomes — counter, labels outcome, error_code.
  • rtmanager.cleanup_outcomes — counter, labels outcome, op_source.
  • rtmanager.docker_op_latency — histogram, label op (pull | create | start | stop | rm | inspect | events).
  • rtmanager.health_events — counter, label event_type.
  • rtmanager.reconcile_drift — counter, label kind (adopt | dispose | observed_exited).
  • rtmanager.runtime_records_by_status — gauge, label status.
  • rtmanager.lease_acquire_latency — histogram.
  • rtmanager.notification_intents — counter, label notification_type.

Structured logs (slog JSON to stdout)

Common fields on every entry: service=rtmanager, request_id, trace_id, span_id, game_id (when known), container_id (when known), op_kind, op_source, outcome, error_code.

Worker-specific fields: stream_entry_id (consumers), event_type (health), image_ref (start/patch).

Verification

Service-level (TESTING.md §7):

  • Unit tests for every service-layer operation against mocked Docker.
  • Adapter tests (PG, Redis, Docker) using testcontainers-go for PG/Redis and the Docker daemon socket for the real Docker adapter.
  • Contract tests for internal-openapi.yaml, runtime-jobs-asyncapi.yaml, runtime-health-asyncapi.yaml.

Service-local integration suite under rtmanager/integration/:

  • Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real galaxy/game test image.
  • Replay safety (duplicate stream entries are no-ops).
  • Health observability (kill the engine externally, observe container_disappeared; relaunch manually, observe reconcile adopt).
  • Notification on first-touch failures (publish a start with an unresolvable image, observe runtime.image_pull_failed intent and a failure job result).

Inter-service suite under integration/lobbyrtm/:

  • Real Lobby + real RTM + real galaxy/game test image. Covers happy path, cancel, and start-failed flows.

Manual smoke (development):

```sh
docker network create galaxy-net   # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```

After start, curl http://localhost:8096/readyz returns 200. Driving Lobby through its public flow brings up galaxy-game-{game_id} containers; RTM logs each lifecycle transition and publishes the corresponding stream entries.