Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Runtime Manager.

Startup Checks

Before starting the process, confirm the following; a scripted preflight sketch follows the list:

  • RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock) reaches a Docker daemon the operator controls. RTM is the only Galaxy service permitted to interact with the Docker socket; scoping the daemon to RTM-only callers is operator domain.
  • RTMANAGER_DOCKER_NETWORK (default galaxy-net) names a user-defined bridge network that has already been created (e.g. via docker network create galaxy-net in the environment's bootstrap script). RTM validates the network at startup but never creates it. A missing network fails fast: the process exits non-zero before opening any listener.
  • RTMANAGER_GAME_STATE_ROOT is a host directory the daemon's user can read and write. Per-game subdirectories are created with RTMANAGER_GAME_STATE_DIR_MODE (default 0750) and RTMANAGER_GAME_STATE_OWNER_UID / _GID (default 0:0); set the uid/gid to match the engine container's user when running with a non-root engine.
  • RTMANAGER_POSTGRES_PRIMARY_DSN points to the PostgreSQL primary that hosts the rtmanager schema. The DSN must include search_path=rtmanager and sslmode=disable (or a real SSL mode for production). Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. The rtmanager schema and the matching rtmanagerservice role are provisioned externally (postgres-migration.md §1).
  • RTMANAGER_REDIS_MASTER_ADDR and RTMANAGER_REDIS_PASSWORD reach the Redis deployment used for the runtime-coordination state: stream consumers (runtime:start_jobs, runtime:stop_jobs), publishers (runtime:job_results, runtime:health_events, notification:intents), persisted offsets, and the per-game lease. RTM does not maintain durable business state on Redis.
  • Stream names match the producers and consumers RTM integrates with:
    • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs)
    • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs)
    • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results)
    • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events)
    • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents)
  • RTMANAGER_LOBBY_INTERNAL_BASE_URL resolves to Lobby's internal HTTP listener. RTM's start service issues a diagnostic GET /api/v1/internal/games/{game_id} per start; failure is logged at debug and does not abort the start (services.md §7).
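
These checks script cleanly. A minimal preflight sketch, assuming the stock docker, psql, redis-cli, and curl binaries; the Lobby probe is connectivity-only and every path here is an assumption to adapt:

# Preflight sketch: adapt names and paths before relying on it
set -e
docker -H "$RTMANAGER_DOCKER_HOST" network inspect "$RTMANAGER_DOCKER_NETWORK" >/dev/null
test -w "$RTMANAGER_GAME_STATE_ROOT"
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c 'SELECT 1' >/dev/null
redis-cli -u "redis://:${RTMANAGER_REDIS_PASSWORD}@${RTMANAGER_REDIS_MASTER_ADDR}" ping
curl -s -o /dev/null "$RTMANAGER_LOBBY_INTERNAL_BASE_URL" || echo 'lobby unreachable (non-fatal)'
echo 'preflight ok'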

The startup sequence runs in the order recorded in ../README.md §Startup dependencies:

  1. PostgreSQL primary opens; goose migrations apply synchronously.
  2. Redis master client opens and pings.
  3. Docker daemon ping; configured network presence check.
  4. Telemetry exporter (OTLP over gRPC or HTTP, or stdout).
  5. Internal HTTP listener.
  6. Reconciler runs once synchronously and blocks until done.
  7. Background workers start.

A failure at any step is fatal. The synchronous reconciler pass is the reason orphaned containers from a prior process never reach the periodic workers in an inconsistent state (workers.md §17).

Expected log lines on a healthy boot:

  • migrations applied,
  • postgres ping ok,
  • redis ping ok,
  • docker ping ok and docker network found,
  • telemetry exporter started,
  • internal http listening,
  • reconciler initial pass completed,
  • one worker started entry per background worker (seven expected).
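
Where these lines land depends on the log sink. Assuming the supervisor captures stdout to a file, a quick spot-check (the file path is a placeholder):

# Boot-log spot check; rtmanager-boot.log is a placeholder path
grep -E 'migrations applied|postgres ping ok|redis ping ok|docker ping ok' rtmanager-boot.log
grep -c 'worker started' rtmanager-boot.log   # expect 7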

Readiness

Use the probes according to what they actually verify:

  • GET /healthz confirms the listener is alive — no dependency check.
  • GET /readyz live-pings PostgreSQL primary, Redis master, and the Docker daemon, then asserts the configured Docker network exists. Returns {"status":"ready"} when every check passes; otherwise returns 503 with the canonical {"error":{"code":"service_unavailable","message":"…"}} envelope identifying the first failing dependency.

/readyz is the strongest readiness signal RTM exposes; unlike Lobby's /readyz, it does not rely on a one-shot boot ping. Each request hits the Docker daemon, PostgreSQL, and Redis fresh.

For a practical readiness check in production:

  1. confirm the process emitted the listener and worker startup logs;
  2. check GET /healthz and GET /readyz (a polling sketch follows this list);
  3. verify rtmanager.runtime_records_by_status{status="running"} gauge tracks the expected live game count after the first start completes;
  4. verify rtmanager.docker_op_latency histograms have at least one sample after the first lifecycle operation.
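
Step 2 lends itself to a polling loop during deploys; a sketch, with RTM_HOST as a placeholder:

# Poll /readyz until ready, give up after ~60s (RTM_HOST is a placeholder)
for _ in $(seq 1 30); do
  curl -fsS "http://${RTM_HOST}:8096/readyz" && break
  sleep 2
done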

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behaviour:

  • the per-component shutdown budget is controlled by RTMANAGER_SHUTDOWN_TIMEOUT (default 30s);
  • the internal HTTP listener drains in-flight requests before closing;
  • stream consumers stop their XREAD loops and persist the latest offset before returning; the offset survives the restart (workers.md §9);
  • the Docker events listener cancels its subscription;
  • the in-flight services release their per-game lease through the surrounding context cancellation;
  • the reconciler completes its current pass or aborts mid-write at the next lease re-acquisition.

During planned restarts:

  1. send SIGTERM;
  2. wait for the listener and component-stop logs;
  3. expect any consumer that was mid-cycle to retry from the persisted offset on the next process start;
  4. investigate only if shutdown exceeds RTMANAGER_SHUTDOWN_TIMEOUT.
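
With a bare process, steps 1–2 reduce to the sketch below; the process name and GNU tail --pid are assumptions:

# SIGTERM, then wait slightly past the shutdown budget (GNU coreutils assumed)
pid=$(pidof rtmanager)
kill -TERM "$pid"
timeout 35 tail --pid="$pid" -f /dev/null \
  && echo 'clean shutdown' \
  || echo 'exceeded shutdown budget; investigate'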

Engine Container Died

A running engine container that exits unexpectedly surfaces through three observation channels:

  • The Docker events listener emits container_exited (non-zero exit code) or container_oom (Docker action oom).
  • The active probe worker eventually emits probe_failed once the threshold is crossed.
  • The Docker inspect worker may emit inspect_unhealthy if the engine restarts under Docker's healthcheck or if Docker reports an unexpected status.

Triage:

  1. Inspect the runtime:health_events stream for the affected game_id and event_type:
    redis-cli XRANGE runtime:health_events - + COUNT 200 \
      | grep -A4 '<game_id>'
    
  2. Read the runtime record and the operation log:
    curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
    psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
      "SELECT id, op_kind, op_source, outcome, error_code, started_at
       FROM rtmanager.operation_log
       WHERE game_id = '<game_id>'
       ORDER BY started_at DESC LIMIT 20"
    
  3. If Lobby has not reacted (the game's status remains running in lobby.games), check runtime:job_results lag (see the snippet after this list) and Lobby's runtimejobresult worker. RTM publishes the result; Lobby is the consumer.
  4. If the container is already gone (docker ps -a shows no row for galaxy-game-<game_id>), the reconciler will move the record to removed on its next pass. Triggering a reconcile manually (e.g. by sending SIGHUP) is not supported. Wait out RTMANAGER_RECONCILE_INTERVAL (default 5m) or restart the process; the synchronous boot pass will handle the drift.
  5. The notification:intents stream is not the place to look for ongoing health changes. Only the three first-touch start failures (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid) produce a notification intent; probe failures, OOMs, and exits flow through health events only (../README.md §Notification Contracts).
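
For step 3, the stream tip is visible directly from Redis; Lobby's consumed position lives on the Lobby side and is not shown here:

redis-cli XLEN runtime:job_results
redis-cli XINFO STREAM runtime:job_results   # note last-generated-id for comparison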

Patch Upgrade

A patch upgrade replaces the container with a new image_ref while preserving the bind-mounted state directory.

Pre-conditions:

  • The new and current image_ref tags both parse as semver. RTM rejects non-semver tags with image_ref_not_semver.
  • The new and current major / minor versions match. A cross-major or cross-minor patch returns semver_patch_only.
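
The major/minor rule can be pre-checked before calling the endpoint. A sketch for plain x.y.z tags (both refs are examples):

# Both tags must share major.minor, or the patch is rejected with semver_patch_only
cur=1.4.1 new=1.4.2
[ "${cur%.*}" = "${new%.*}" ] && echo 'patch ok' || echo 'would be rejected: semver_patch_only'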

Driving the upgrade:

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'

Behaviour:

  • The container is stopped, removed, and recreated. The current_container_id changes; the engine_endpoint (http://galaxy-game-<game_id>:8080) is stable.
  • The engine reads its state from the bind mount on startup, so any data written before the patch survives.
  • A single operation_log row is appended with op_kind=patch and the old / new image refs.
  • A runtime:health_events container_started is emitted by the inner start (workers.md §1).

Post-patch verification:

curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>

The current_image_ref field on the runtime record reflects the new tag.
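
To assert the tag in one line, assuming jq is available and current_image_ref sits at the top level of the record JSON:

curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id> \
  | jq -r .current_image_ref   # expect the new tag, e.g. galaxy/game:1.4.2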

Manual Cleanup

The cleanup endpoint removes the container and updates the record to removed. It refuses to remove a running container — stop first.

# Stop, then clean up
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
  -d '{"reason":"admin_request"}'

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container

The host state directory under <RTMANAGER_GAME_STATE_ROOT>/<game_id> is never deleted by RTM. Removing the directory is operator domain (backup tooling, future Admin Service workflow). The operation_log records op_kind=cleanup_container with op_source=admin_rest.
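
To confirm the cleanup took effect on both sides:

docker ps -a --filter name=galaxy-game-<game_id>   # expect no matching row
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>   # status should read removed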

Reconcile Drift After Docker Daemon Restart

A Docker daemon restart drops every running engine container; PG records remain. On RTM's next boot (or its next periodic reconcile):

  1. The reconciler observes running records whose containers are missing from docker ps. It updates each record to removed, appends operation_log with op_kind=reconcile_dispose, and publishes runtime:health_events container_disappeared (workers.md §14–§15).
  2. Lobby's runtimejobresult worker does not consume the dispose event in v1, so the cascade does not auto-restart the engine. Operators trigger restarts through Lobby's user-facing flow or directly via the GM/Admin REST restart endpoint.
  3. If the operator brings up an engine container manually for diagnostics (docker run with the com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id> labels), the reconciler adopts it on the next pass: a new runtime_records row appears with op_kind=reconcile_adopt. The reconciler never stops or removes an unrecorded container — operators stay in control of manual containers (../README.md §Reconciliation).

Three drift kinds run through the same lease-guarded write pass: adopt, dispose, and the README-level observed_exited path (a record marked running whose container exists but is in the exited state). The telemetry counter rtmanager.reconcile_drift{kind} exposes the three independently (workers.md §15).
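
A manual cross-check of the same drift compares what PostgreSQL believes with what Docker holds:

# Records PG marks running
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id FROM rtmanager.runtime_records WHERE status = 'running'"

# Containers Docker actually holds; diff the two lists by game_id
docker ps -a --filter label=com.galaxy.owner=rtmanager \
  --format '{{.Names}}\t{{.Status}}'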

Testing Locally

# One-time bootstrap
docker network create galaxy-net

# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095

go run ./rtmanager/cmd/rtmanager

After start:

  • curl http://localhost:8096/healthz returns {"status":"ok"};
  • curl http://localhost:8096/readyz returns {"status":"ready"} once PG, Redis, and Docker pings pass and the configured network exists;
  • driving Lobby through its public flow (POST /api/v1/lobby/games/<id>/start) brings up galaxy-game-<game_id> containers; RTM logs each lifecycle transition.

The integration suite under rtmanager/integration/ exercises the end-to-end flows against the real Docker daemon. The default go test ./... skips it via the integration build tag; run explicitly with:

make -C rtmanager integration

The suite requires a reachable Docker daemon. Without one, the harness helpers call t.Skip and the package becomes a no-op (integration-tests.md §1).

Diagnostic Queries

Durable runtime state lives in PostgreSQL; runtime-coordination state stays in Redis. CLI snippets that help during incidents:

# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"

# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
   FROM rtmanager.operation_log
   WHERE game_id = '<game_id>'
   ORDER BY started_at DESC, id DESC
   LIMIT 20"

# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"

# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
          --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'

# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
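
# Rough lag sketch: entries newer than the persisted offset
# (exclusive XRANGE ranges need Redis >= 6.2; offset key as above)
off=$(redis-cli GET rtmanager:stream_offsets:startjobs)
redis-cli XRANGE runtime:start_jobs "(${off}" + COUNT 10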

# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100

# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>

The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw PostgreSQL and Redis access is for last-resort triage.