# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Runtime Manager.

## Startup Checks

Before starting the process, confirm the following (a pre-flight sketch covering these checks follows the startup sequence below):

- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`) reaches a Docker daemon the operator controls. RTM is the only Galaxy service permitted to interact with the Docker socket; scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a user-defined bridge network that has already been created (e.g. via `docker network create galaxy-net` in the environment's bootstrap script). RTM **validates** the network at startup but never creates it. A missing network is fail-fast: the process exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user can read and write. Per-game subdirectories are created with `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and `RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the uid/gid to match the engine container's user when running with a non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that hosts the `rtmanager` schema. The DSN must include `search_path=rtmanager` and `sslmode=disable` (or a real SSL mode for production). Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. The `rtmanager` schema and the matching `rtmanagerservice` role are provisioned externally ([`postgres-migration.md` §1](postgres-migration.md)).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach the Redis deployment used for the runtime-coordination state: stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`), publishers (`runtime:job_results`, `runtime:health_events`, `notification:intents`), persisted offsets, and the per-game lease. RTM does not keep durable business state in Redis.
- Stream names match the producers and consumers RTM integrates with:
  - `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
  - `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal HTTP listener. RTM's start service issues a diagnostic `GET /api/v1/internal/games/{game_id}` per start; a failure is logged at debug level and does not abort the start ([`services.md` §7](services.md)).

The startup sequence runs in the order recorded in [`../README.md` §Startup dependencies](../README.md#startup-dependencies):

1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs **once synchronously** and blocks until done.
7. Background workers start.

A failure at any step is fatal. The synchronous reconciler pass is the reason orphaned containers from a prior process never reach the periodic workers in an inconsistent state ([`workers.md` §17](workers.md)).
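For the checklist above, a minimal pre-flight sketch run with the same environment the process will see. It only probes reachability and the network's existence; it assumes the `docker`, `psql`, and `redis-cli` clients are installed on the host, which this runbook does not otherwise require.

```bash
# Hedged pre-flight sketch: fail fast before starting the process.
# Assumes the RTMANAGER_* variables are already exported and that the
# docker, psql, and redis-cli clients are available on this host.
set -euo pipefail

# Docker daemon reachable and the user-defined bridge network already exists
docker -H "$RTMANAGER_DOCKER_HOST" info >/dev/null
docker -H "$RTMANAGER_DOCKER_HOST" network inspect "$RTMANAGER_DOCKER_NETWORK" >/dev/null

# State root exists and is writable by the daemon's user
test -d "$RTMANAGER_GAME_STATE_ROOT" && test -w "$RTMANAGER_GAME_STATE_ROOT"

# PostgreSQL primary and Redis master answer
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c 'SELECT 1' >/dev/null
redis-cli -u "redis://:$RTMANAGER_REDIS_PASSWORD@$RTMANAGER_REDIS_MASTER_ADDR" PING
```

If any line fails, fix that dependency before starting the process; RTM would refuse to boot anyway, but the sketch gives a faster, more specific error.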
Expected log lines on a healthy boot:

- `migrations applied`
- `postgres ping ok`
- `redis ping ok`
- `docker ping ok` and `docker network found`
- `telemetry exporter started`
- `internal http listening`
- `reconciler initial pass completed`
- one `worker started` entry per background worker (seven expected)

## Readiness

Use the probes according to what they actually verify:

- `GET /healthz` confirms the listener is alive — no dependency check.
- `GET /readyz` live-pings the PostgreSQL primary, the Redis master, and the Docker daemon, then asserts the configured Docker network exists. Returns `{"status":"ready"}` when every check passes; otherwise returns `503` with the canonical `{"error":{"code":"service_unavailable","message":"…"}}` envelope identifying the first failing dependency.

`/readyz` is the strongest readiness signal RTM exposes; unlike Lobby's `/readyz`, it does **not** rely on a one-shot boot ping. Each request hits the daemon and the database fresh.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz`;
3. verify the `rtmanager.runtime_records_by_status{status="running"}` gauge tracks the expected live game count after the first start completes;
4. verify the `rtmanager.docker_op_latency` histograms have at least one sample after the first lifecycle operation.

## Shutdown

The process handles `SIGINT` and `SIGTERM`. Shutdown behaviour:

- the per-component shutdown budget is controlled by `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest offset before returning; the offset survives the restart ([`workers.md` §9](workers.md));
- the Docker events listener cancels its subscription;
- in-flight services release their per-game lease through the surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at the next lease re-acquisition.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any consumer that was mid-cycle to retry from the persisted offset on the next process start;
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.

## Engine Container Died

A running engine container that exits unexpectedly surfaces through three observation channels:

- The Docker events listener emits `container_exited` (non-zero exit code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the engine restarts under Docker's healthcheck or if Docker reports an unexpected status.

Triage:

1. Inspect the `runtime:health_events` stream for the affected `game_id` and `event_type`:

   ```bash
   redis-cli XRANGE runtime:health_events - + COUNT 200 \
     | grep -A4 'game_id\s*<game_id>'
   ```

2. Read the runtime record and the operation log:

   ```bash
   curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>

   psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
     "SELECT id, op_kind, op_source, outcome, error_code, started_at
        FROM rtmanager.operation_log
       WHERE game_id = '<game_id>'
       ORDER BY started_at DESC LIMIT 20"
   ```

3. If Lobby has not reacted (the game's status remains `running` in `lobby.games`), check `runtime:job_results` lag and Lobby's `runtimejobresult` worker. RTM publishes the result; Lobby is the consumer (see the sketch after this list).

4. If the container is already gone (`docker ps -a` shows no row for `galaxy-game-<game_id>`), the reconciler will move the record to `removed` on its next pass. Triggering the periodic reconcile manually by sending `SIGHUP` is **not** supported — wait out `RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the process; the synchronous boot pass will handle the drift.

5. The `notification:intents` stream is **not** the place to look for ongoing health changes. Only the three first-touch start failures (`runtime.image_pull_failed`, `runtime.container_start_failed`, `runtime.start_config_invalid`) produce a notification intent; probe failures, OOMs, and exits flow through health events only ([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
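For triage step 3, a minimal sketch of checking whether the result reached Lobby. The stream name and the `runtimejobresult` worker come from this runbook; `$LOBBY_POSTGRES_DSN` and the `lobby.games` column names are assumptions used for illustration only.

```bash
# Hedged sketch for triage step 3: is the job result sitting unconsumed?
# $LOBBY_POSTGRES_DSN and the lobby.games id/status columns are assumptions;
# runtime:job_results and the runtimejobresult worker come from this runbook.
redis-cli XINFO STREAM runtime:job_results               # length and last-entry id
redis-cli XREVRANGE runtime:job_results + - COUNT 10     # newest results first

# Compare against Lobby's view of the game, using whatever role can read lobby.*
psql "$LOBBY_POSTGRES_DSN" -c \
  "SELECT id, status FROM lobby.games WHERE id = '<game_id>'"
```

If the stream keeps growing while Lobby's record never changes, the problem is on Lobby's consumer side rather than in RTM's publisher.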
## Patch Upgrade

A patch upgrade replaces the container with a new `image_ref` while preserving the bind-mounted state directory.

Pre-conditions:

- The new and current `image_ref` tags both parse as semver. RTM rejects non-semver tags with `image_ref_not_semver`.
- The new and current major / minor versions match. A cross-major or cross-minor patch returns `semver_patch_only`.

Driving the upgrade:

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```

Behaviour:

- The container is stopped, removed, and recreated. The `current_container_id` changes; the `engine_endpoint` (`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and the old / new image refs.
- A `container_started` event is published to `runtime:health_events` by the inner start ([`workers.md` §1](workers.md)).

Post-patch verification:

```bash
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```

The `current_image_ref` field on the runtime record reflects the new tag.

## Manual Cleanup

The cleanup endpoint removes the container and updates the record to `removed`. It refuses to remove a `running` container — stop first.

```bash
# Stop, then clean up
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
  -d '{"reason":"admin_request"}'

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```

The host state directory under `<state_root>/<game_id>` is **never** deleted by RTM. Removing the directory is operator domain (backup tooling, future Admin Service workflow). The `operation_log` records `op_kind=cleanup_container` with `op_source=admin_rest`.

## Reconcile Drift After Docker Daemon Restart

A Docker daemon restart drops every running engine container; PG records remain. On RTM's next boot (or its next periodic reconcile):

1. The reconciler observes `running` records whose containers are missing from `docker ps`. It updates each record to `removed`, appends an `operation_log` row with `op_kind=reconcile_dispose`, and publishes a `container_disappeared` event to `runtime:health_events` ([`workers.md` §14–§15](workers.md)).

2. Lobby's `runtimejobresult` worker does not consume the dispose event in v1, so the cascade does not auto-restart the engine. Operators trigger restarts through Lobby's user-facing flow or directly via the GM/Admin REST `restart` endpoint.

3. If the operator brings up an engine container manually for diagnostics (`docker run` with the `com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels; a sketch follows this list), the reconciler **adopts** it on the next pass: a new `runtime_records` row appears with `op_kind=reconcile_adopt`. The reconciler **never stops or removes** an unrecorded container — operators stay in control of manual containers ([`../README.md` §Reconciliation](../README.md#reconciliation)).

Three drift kinds run through the same lease-guarded write pass: `adopt`, `dispose`, and the README-level path `observed_exited` (a record marked `running` whose container exists but is in the `exited` state). The telemetry counter `rtmanager.reconcile_drift{kind}` exposes the three independently ([`workers.md` §15](workers.md)).
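For step 3 above, a minimal `docker run` sketch of a manually started diagnostic engine that the reconciler will adopt. The labels, network, and `galaxy-game-<game_id>` naming convention come from this runbook; the image tag and the in-container mount point are illustrative assumptions.

```bash
# Hedged sketch: a manually started diagnostic engine the reconciler will adopt.
# The labels, network, and container name follow the runbook; the image tag and
# the in-container mount point (/data) are assumptions for illustration only.
docker run -d \
  --name galaxy-game-<game_id> \
  --network galaxy-net \
  --label com.galaxy.owner=rtmanager \
  --label com.galaxy.game_id=<game_id> \
  -v "$RTMANAGER_GAME_STATE_ROOT/<game_id>:/data" \
  galaxy/game:1.4.2
```

On the next reconcile pass the container surfaces as an `adopt` drift and gains a `runtime_records` row; RTM never stops or removes it on its own.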
## Testing Locally

```sh
# One-time bootstrap
docker network create galaxy-net

# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095

go run ./rtmanager/cmd/rtmanager
```

After start:

- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}` once PG, Redis, and Docker pings pass and the configured network exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<game_id>/start`) brings up `galaxy-game-<game_id>` containers; RTM logs each lifecycle transition (a smoke-test sketch follows the integration-suite note below).

The integration suite under `rtmanager/integration/` exercises the end-to-end flows against the real Docker daemon. The default `go test ./...` skips it via the `integration` build tag; run it explicitly with:

```sh
make -C rtmanager integration
```

The suite requires a reachable Docker daemon. Without one, the harness helpers call `t.Skip` and the package becomes a no-op ([`integration-tests.md` §1](integration-tests.md)).
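A hedged smoke-test sketch for the local flow above. Lobby's public host, port, and any required auth headers are not specified in this runbook, so they appear as placeholders; the label filter and the internal runtimes endpoint come from earlier sections.

```bash
# Hedged local smoke test: drive a start through Lobby's public flow, then watch
# for the engine container. <lobby-host> and any auth headers are placeholders;
# consult Lobby's own docs for the real values.
curl -s -X POST http://<lobby-host>/api/v1/lobby/games/<game_id>/start

# The engine container should appear shortly after, attached to galaxy-net
docker ps --filter label=com.galaxy.game_id=<game_id> \
  --format 'table {{.Names}}\t{{.Status}}'

# RTM's own view of the runtime
curl -s http://localhost:8096/api/v1/internal/runtimes/<game_id>
```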
## Diagnostic Queries

Durable runtime state lives in PostgreSQL; runtime-coordination state stays in Redis. CLI snippets that help during incidents:

```bash
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"

# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code, started_at, finished_at
     FROM rtmanager.operation_log
    WHERE game_id = '<game_id>'
    ORDER BY started_at DESC, id DESC LIMIT 20"

# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"

# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
  --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'

# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100

# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```

The gauges and counters surfaced through OpenTelemetry are the primary observability surface; raw PostgreSQL and Redis access is for last-resort triage.
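When an incident call needs a single copy-paste, the per-game queries above can be bundled into a small helper. A minimal sketch; the function name is arbitrary and every command inside only re-runs queries already listed in this section.

```bash
# Hedged convenience wrapper around the per-game queries above; rtm_snapshot is
# an arbitrary name, and each command is taken from this section.
rtm_snapshot() {
  local game_id="$1"
  psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
    "SELECT * FROM rtmanager.runtime_records WHERE game_id = '${game_id}'"
  psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
    "SELECT id, op_kind, op_source, outcome, error_code, started_at, finished_at
       FROM rtmanager.operation_log
      WHERE game_id = '${game_id}'
      ORDER BY started_at DESC, id DESC LIMIT 20"
  psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
    "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '${game_id}'"
  redis-cli TTL "rtmanager:game_lease:${game_id}"
  docker ps -a --filter "label=com.galaxy.game_id=${game_id}" \
    --format 'table {{.Names}}\t{{.Status}}'
}

# Usage: rtm_snapshot <game_id>
```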