Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Runtime Manager.

Startup Checks

Before starting the process, confirm the following; a scripted preflight sketch follows the list:

  • RTMANAGER_DOCKER_HOST (default unix:///var/run/docker.sock) reaches a Docker daemon the operator controls. RTM is the only Galaxy service permitted to interact with the Docker socket; scoping the daemon to RTM-only callers is operator domain.
  • RTMANAGER_DOCKER_NETWORK (default galaxy-net) names a user-defined bridge network that has already been created (e.g. via docker network create galaxy-net in the environment's bootstrap script). RTM validates the network at startup but never creates it. A missing network fails fast: the process exits non-zero before opening any listener.
  • RTMANAGER_GAME_STATE_ROOT is a host directory the daemon's user can read and write. Per-game subdirectories are created with RTMANAGER_GAME_STATE_DIR_MODE (default 0750) and RTMANAGER_GAME_STATE_OWNER_UID / _GID (default 0:0); set the uid/gid to match the engine container's user when running with a non-root engine.
  • RTMANAGER_POSTGRES_PRIMARY_DSN points to the PostgreSQL primary that hosts the rtmanager schema. The DSN must include search_path=rtmanager and sslmode=disable (or a real SSL mode for production). Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. The rtmanager schema and the matching rtmanagerservice role are provisioned externally (postgres-migration.md §1).
  • RTMANAGER_REDIS_MASTER_ADDR and RTMANAGER_REDIS_PASSWORD reach the Redis deployment used for the runtime-coordination state: stream consumers (runtime:start_jobs, runtime:stop_jobs), publishers (runtime:job_results, runtime:health_events, notification:intents), persisted offsets, and the per-game lease. RTM does not maintain durable business state on Redis.
  • Stream names match the producers and consumers RTM integrates with:
    • RTMANAGER_REDIS_START_JOBS_STREAM (default runtime:start_jobs)
    • RTMANAGER_REDIS_STOP_JOBS_STREAM (default runtime:stop_jobs)
    • RTMANAGER_REDIS_JOB_RESULTS_STREAM (default runtime:job_results)
    • RTMANAGER_REDIS_HEALTH_EVENTS_STREAM (default runtime:health_events)
    • RTMANAGER_NOTIFICATION_INTENTS_STREAM (default notification:intents)
  • RTMANAGER_LOBBY_INTERNAL_BASE_URL resolves to Lobby's internal HTTP listener. RTM's start service issues a diagnostic GET /api/v1/internal/games/{game_id} per start; failure is logged at debug and does not abort the start (services.md §7).
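
These checks script cleanly. A minimal preflight sketch, assuming the stock docker, psql, redis-cli, and curl binaries; the Lobby probe is connectivity-only and every path here is an assumption to adapt:

# Preflight sketch: adapt names and paths before relying on it
set -e
docker -H "$RTMANAGER_DOCKER_HOST" network inspect "$RTMANAGER_DOCKER_NETWORK" >/dev/null
test -w "$RTMANAGER_GAME_STATE_ROOT"
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c 'SELECT 1' >/dev/null
redis-cli -u "redis://:${RTMANAGER_REDIS_PASSWORD}@${RTMANAGER_REDIS_MASTER_ADDR}" ping
curl -s -o /dev/null "$RTMANAGER_LOBBY_INTERNAL_BASE_URL" || echo 'lobby unreachable (non-fatal)'
echo 'preflight ok'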

The startup sequence runs in the order recorded in ../README.md §Startup dependencies:

  1. PostgreSQL primary opens; goose migrations apply synchronously.
  2. Redis master client opens and pings.
  3. Docker daemon ping; configured network presence check.
  4. Telemetry exporter (OTLP over gRPC or HTTP, or stdout).
  5. Internal HTTP listener.
  6. Reconciler runs once synchronously and blocks until done.
  7. Background workers start.

A failure at any step is fatal. The synchronous reconciler pass is the reason orphaned containers from a prior process never reach the periodic workers in an inconsistent state (workers.md §17).

Expected log lines on a healthy boot:

  • migrations applied,
  • postgres ping ok,
  • redis ping ok,
  • docker ping ok and docker network found,
  • telemetry exporter started,
  • internal http listening,
  • reconciler initial pass completed,
  • one worker started entry per background worker (seven expected).
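
Where these lines land depends on the log sink. Assuming the supervisor captures stdout to a file, a quick spot-check (the file path is a placeholder):

# Boot-log spot check; rtmanager-boot.log is a placeholder path
grep -E 'migrations applied|postgres ping ok|redis ping ok|docker ping ok' rtmanager-boot.log
grep -c 'worker started' rtmanager-boot.log   # expect 7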

Readiness

Use the probes according to what they actually verify:

  • GET /healthz confirms the listener is alive — no dependency check.
  • GET /readyz live-pings PostgreSQL primary, Redis master, and the Docker daemon, then asserts the configured Docker network exists. Returns {"status":"ready"} when every check passes; otherwise returns 503 with the canonical {"error":{"code":"service_unavailable","message":"…"}} envelope identifying the first failing dependency.

/readyz is the strongest readiness signal RTM exposes; unlike Lobby's /readyz, it does not rely on a one-shot boot ping. Each request hits the Docker daemon, PostgreSQL, and Redis fresh.

For a practical readiness check in production:

  1. confirm the process emitted the listener and worker startup logs;
  2. check GET /healthz and GET /readyz (a polling sketch follows this list);
  3. verify rtmanager.runtime_records_by_status{status="running"} gauge tracks the expected live game count after the first start completes;
  4. verify rtmanager.docker_op_latency histograms have at least one sample after the first lifecycle operation.
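
Step 2 lends itself to a polling loop during deploys; a sketch, with RTM_HOST as a placeholder:

# Poll /readyz until ready, give up after ~60s (RTM_HOST is a placeholder)
for _ in $(seq 1 30); do
  curl -fsS "http://${RTM_HOST}:8096/readyz" && break
  sleep 2
done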

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behaviour:

  • the per-component shutdown budget is controlled by RTMANAGER_SHUTDOWN_TIMEOUT (default 30s);
  • the internal HTTP listener drains in-flight requests before closing;
  • stream consumers stop their XREAD loops and persist the latest offset before returning; the offset survives the restart (workers.md §9);
  • the Docker events listener cancels its subscription;
  • the in-flight services release their per-game lease through the surrounding context cancellation;
  • the reconciler completes its current pass or aborts mid-write at the next lease re-acquisition.

During planned restarts:

  1. send SIGTERM;
  2. wait for the listener and component-stop logs;
  3. expect any consumer that was mid-cycle to retry from the persisted offset on the next process start;
  4. investigate only if shutdown exceeds RTMANAGER_SHUTDOWN_TIMEOUT.
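
With a bare process, steps 1–2 reduce to the sketch below; the process name and GNU tail --pid are assumptions:

# SIGTERM, then wait slightly past the shutdown budget (GNU coreutils assumed)
pid=$(pidof rtmanager)
kill -TERM "$pid"
timeout 35 tail --pid="$pid" -f /dev/null \
  && echo 'clean shutdown' \
  || echo 'exceeded shutdown budget; investigate'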

Engine Container Died

A running engine container that exits unexpectedly surfaces through three observation channels:

  • The Docker events listener emits container_exited (non-zero exit code) or container_oom (Docker action oom).
  • The active probe worker eventually emits probe_failed once the threshold is crossed.
  • The Docker inspect worker may emit inspect_unhealthy if the engine restarts under Docker's healthcheck or if Docker reports an unexpected status.

Triage:

  1. Inspect the runtime:health_events stream for the affected game_id and event_type:
    redis-cli XRANGE runtime:health_events - + COUNT 200 \
      | grep -A4 '<game_id>'
    
  2. Read the runtime record and the operation log:
    curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
    psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
      "SELECT id, op_kind, op_source, outcome, error_code, started_at
       FROM rtmanager.operation_log
       WHERE game_id = '<game_id>'
       ORDER BY started_at DESC LIMIT 20"
    
  3. If Lobby has not reacted (the game's status remains running in lobby.games), check runtime:job_results lag (see the snippet after this list) and Lobby's runtimejobresult worker. RTM publishes the result; Lobby is the consumer.
  4. If the container is already gone (docker ps -a shows no row for galaxy-game-<game_id>), the reconciler will move the record to removed on its next pass. Triggering a reconcile manually (e.g. by sending SIGHUP) is not supported. Wait out RTMANAGER_RECONCILE_INTERVAL (default 5m) or restart the process; the synchronous boot pass will handle the drift.
  5. The notification:intents stream is not the place to look for ongoing health changes. Only the three first-touch start failures (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid) produce a notification intent; probe failures, OOMs, and exits flow through health events only (../README.md §Notification Contracts).
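
For step 3, the stream tip is visible directly from Redis; Lobby's consumed position lives on the Lobby side and is not shown here:

redis-cli XLEN runtime:job_results
redis-cli XINFO STREAM runtime:job_results   # note last-generated-id for comparison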

Patch Upgrade

A patch upgrade replaces the container with a new image_ref while preserving the bind-mounted state directory.

Pre-conditions:

  • The new and current image_ref tags both parse as semver. RTM rejects non-semver tags with image_ref_not_semver.
  • The new and current major / minor versions match. A cross-major or cross-minor patch returns semver_patch_only.
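
The major/minor rule can be pre-checked before calling the endpoint. A sketch for plain x.y.z tags (both refs are examples):

# Both tags must share major.minor, or the patch is rejected with semver_patch_only
cur=1.4.1 new=1.4.2
[ "${cur%.*}" = "${new%.*}" ] && echo 'patch ok' || echo 'would be rejected: semver_patch_only'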

Driving the upgrade:

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'

Behaviour:

  • The container is stopped, removed, and recreated. The current_container_id changes; the engine_endpoint (http://galaxy-game-<game_id>:8080) is stable.
  • The engine reads its state from the bind mount on startup, so any data written before the patch survives.
  • A single operation_log row is appended with op_kind=patch and the old / new image refs.
  • A runtime:health_events container_started is emitted by the inner start (workers.md §1).

Post-patch verification:

curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>

The current_image_ref field on the runtime record reflects the new tag.
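
To assert the tag in one line, assuming jq is available and current_image_ref sits at the top level of the record JSON:

curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id> \
  | jq -r .current_image_ref   # expect the new tag, e.g. galaxy/game:1.4.2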

Manual Cleanup

The cleanup endpoint removes the container and updates the record to removed. It refuses to remove a running container — stop first.

# Stop, then clean up
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
  -d '{"reason":"admin_request"}'

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container

The host state directory under <RTMANAGER_GAME_STATE_ROOT>/<game_id> is never deleted by RTM. Removing the directory is operator domain (backup tooling, future Admin Service workflow). The operation_log records op_kind=cleanup_container with op_source=admin_rest.
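
To confirm the cleanup took effect on both sides:

docker ps -a --filter name=galaxy-game-<game_id>   # expect no matching row
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>   # status should read removed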

Reconcile Drift After Docker Daemon Restart

A Docker daemon restart drops every running engine container; PG records remain. On RTM's next boot (or its next periodic reconcile):

  1. The reconciler observes running records whose containers are missing from docker ps. It updates each record to removed, appends operation_log with op_kind=reconcile_dispose, and publishes runtime:health_events container_disappeared (workers.md §14–§15).
  2. Lobby's runtimejobresult worker does not consume the dispose event in v1, so the cascade does not auto-restart the engine. Operators trigger restarts through Lobby's user-facing flow or directly via the GM/Admin REST restart endpoint.
  3. If the operator brings up an engine container manually for diagnostics (docker run with the com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id> labels), the reconciler adopts it on the next pass: a new runtime_records row appears with op_kind=reconcile_adopt. The reconciler never stops or removes an unrecorded container — operators stay in control of manual containers (../README.md §Reconciliation).

Three drift kinds run through the same lease-guarded write pass: adopt, dispose, and the README-level observed_exited path (a record marked running whose container exists but is in the exited state). The telemetry counter rtmanager.reconcile_drift{kind} exposes the three independently (workers.md §15).
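
A manual cross-check of the same drift compares what PostgreSQL believes with what Docker holds:

# Records PG marks running
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id FROM rtmanager.runtime_records WHERE status = 'running'"

# Containers Docker actually holds; diff the two lists by game_id
docker ps -a --filter label=com.galaxy.owner=rtmanager \
  --format '{{.Names}}\t{{.Status}}'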

Testing Locally

# One-time bootstrap
docker network create galaxy-net

# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095

go run ./rtmanager/cmd/rtmanager

After start:

  • curl http://localhost:8096/healthz returns {"status":"ok"};
  • curl http://localhost:8096/readyz returns {"status":"ready"} once PG, Redis, and Docker pings pass and the configured network exists;
  • driving Lobby through its public flow (POST /api/v1/lobby/games/<id>/start) brings up galaxy-game-<game_id> containers; RTM logs each lifecycle transition.

The integration suite under rtmanager/integration/ exercises the end-to-end flows against the real Docker daemon. The default go test ./... skips it via the integration build tag; run explicitly with:

make -C rtmanager integration

The suite requires a reachable Docker daemon. Without one, the harness helpers call t.Skip and the package becomes a no-op (integration-tests.md §1).

Diagnostic Queries

Durable runtime state lives in PostgreSQL; runtime-coordination state stays in Redis. CLI snippets that help during incidents:

# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"

# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
   FROM rtmanager.operation_log
   WHERE game_id = '<game_id>'
   ORDER BY started_at DESC, id DESC
   LIMIT 20"

# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"

# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
          --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'

# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
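
# Rough lag sketch: entries newer than the persisted offset
# (exclusive XRANGE ranges need Redis >= 6.2; offset key as above)
off=$(redis-cli GET rtmanager:stream_offsets:startjobs)
redis-cli XRANGE runtime:start_jobs "(${off}" + COUNT 10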

# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100

# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>

The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw PostgreSQL and Redis access is for last-resort triage.