# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and the handful of recovery paths specific to Runtime Manager.
## Startup Checks
Before starting the process, confirm:
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`) reaches a Docker daemon the operator controls. RTM is the only Galaxy service permitted to interact with the Docker socket; scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a user-defined bridge network that has already been created (e.g. via `docker network create galaxy-net` in the environment's bootstrap script). RTM validates the network at startup but never creates it. A missing network is fail-fast and the process exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user can read and write. Per-game subdirectories are created with `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and `RTMANAGER_GAME_STATE_OWNER_UID`/`_GID` (default `0:0`); set the uid/gid to match the engine container's user when running with a non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that hosts the `rtmanager` schema. The DSN must include `search_path=rtmanager` and `sslmode=disable` (or a real SSL mode for production). Embedded goose migrations apply at startup before any HTTP listener opens; a migration or ping failure terminates the process with a non-zero exit. The `rtmanager` schema and the matching `rtmanagerservice` role are provisioned externally (postgres-migration.md §1).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach the Redis deployment used for the runtime-coordination state: stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`), publishers (`runtime:job_results`, `runtime:health_events`, `notification:intents`), persisted offsets, and the per-game lease. RTM does not maintain durable business state on Redis.
- Stream names match the producers and consumers RTM integrates with:
  - `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
  - `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal HTTP listener. RTM's start service issues a diagnostic `GET /api/v1/internal/games/{game_id}` per start; failure is logged at debug and does not abort the start (services.md §7).
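RTM creates the per-game subdirectories itself at start time; occasionally the same layout is reproduced by hand, e.g. when pre-provisioning or repairing permissions after a restore. A minimal sketch under the documented defaults; the `RTMANAGER_GAME_STATE_OWNER_GID` spelling is assumed from the `_GID` suffix above, and `chown` needs root:

```shell
# Sketch, not RTM's code: lay out one game's state directory the way
# RTM would, honouring the documented mode/owner environment knobs.
prepare_state_dir() {
  root=$1
  game_id=$2
  mode=${RTMANAGER_GAME_STATE_DIR_MODE:-0750}
  uid=${RTMANAGER_GAME_STATE_OWNER_UID:-0}   # assumed variable spelling
  gid=${RTMANAGER_GAME_STATE_OWNER_GID:-0}   # assumed variable spelling
  mkdir -p "$root/$game_id"
  chmod "$mode" "$root/$game_id"
  chown "$uid:$gid" "$root/$game_id" 2>/dev/null \
    || echo "chown to $uid:$gid skipped (not root)"
}
```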
The startup sequence runs in the order recorded in
../README.md §Startup dependencies:
1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs once synchronously and blocks until done.
7. Background workers start.
A failure at any step is fatal. The synchronous reconciler pass is
the reason orphaned containers from a prior process never reach the
periodic workers in an inconsistent state
(workers.md §17).
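The fail-fast contract above can be mimicked in operator tooling: run checks in the same order and stop at the first failure. A generic sketch; the step commands are placeholders, not RTM internals:

```shell
# Sketch: run each step command in order; the first failure aborts
# the sequence with a non-zero status, mirroring RTM's fatal-step boot.
run_startup_steps() {
  for step in "$@"; do
    if $step; then
      echo "step ok: $step"
    else
      echo "step failed: $step" >&2
      return 1
    fi
  done
}
```

Called as, say, `run_startup_steps check_postgres check_redis check_docker`, it reproduces the ordering and the fail-fast exit; a real preflight script would implement each check against the actual dependency.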
Expected log lines on a healthy boot:
- `migrations applied`
- `postgres ping ok`
- `redis ping ok`
- `docker ping ok` and `docker network found`
- `telemetry exporter started`
- `internal http listening`
- `reconciler initial pass completed`
- one `worker started` entry per background worker (seven expected)
## Readiness
Use the probes according to what they actually verify:
- `GET /healthz` confirms the listener is alive — no dependency check.
- `GET /readyz` live-pings the PostgreSQL primary, the Redis master, and the Docker daemon, then asserts the configured Docker network exists. It returns `{"status":"ready"}` when every check passes; otherwise it returns `503` with the canonical `{"error":{"code":"service_unavailable","message":"…"}}` envelope identifying the first failing dependency.
`/readyz` is the strongest readiness signal RTM exposes; unlike Lobby's `/readyz`, it does not rely on a one-shot boot ping. Each request hits the daemon and the database fresh.
For a practical readiness check in production:
- confirm the process emitted the listener and worker startup logs;
- check `GET /healthz` and `GET /readyz`;
- verify the `rtmanager.runtime_records_by_status{status="running"}` gauge tracks the expected live game count after the first start completes;
- verify the `rtmanager.docker_op_latency` histograms have at least one sample after the first lifecycle operation.
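Deploy scripts usually need to block until `/readyz` goes green rather than sample it once. A hedged sketch of such a wait, generic over the probe command (in practice something like `curl -fsS http://<rtm-host>:8096/readyz`):

```shell
# Sketch: retry a probe command up to N times, one second apart,
# succeeding as soon as the probe does.
wait_until_ready() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

For example, `wait_until_ready 30 curl -fsS http://localhost:8096/readyz` gives the synchronous boot reconciler up to 30 seconds before the deploy proceeds.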
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behaviour:
- the per-component shutdown budget is controlled by `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest offset before returning; the offset survives the restart (workers.md §9);
- the Docker events listener cancels its subscription;
- the in-flight services release their per-game lease through the surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at the next lease re-acquisition.
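For scripted restarts, the budget can be enforced from the outside as well: send `SIGTERM`, poll for exit, and only escalate once `RTMANAGER_SHUTDOWN_TIMEOUT` has elapsed. A sketch assuming a plain seconds value (the real variable carries a Go-style duration such as `30s`):

```shell
# Sketch: SIGTERM a pid, wait up to a budget in whole seconds,
# report success or flag that the budget was exceeded.
graceful_stop() {
  pid=$1
  budget=${2:-30}
  kill -TERM "$pid" 2>/dev/null
  i=0
  while [ "$i" -lt "$budget" ]; do
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "stopped within budget"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "exceeded shutdown budget; investigate" >&2
  return 1
}
```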
During planned restarts:
- send `SIGTERM`;
- wait for the listener and component-stop logs;
- expect any consumer that was mid-cycle to retry from the persisted offset on the next process start;
- investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
## Engine Container Died
A running engine container that exits unexpectedly surfaces through three observation channels:
- The Docker events listener emits `container_exited` (non-zero exit code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the engine restarts under Docker's healthcheck or if Docker reports an unexpected status.
Triage:
- Inspect the `runtime:health_events` stream for the affected `game_id` and `event_type`:

  ```shell
  redis-cli XRANGE runtime:health_events - + COUNT 200 \
    | grep -A4 'game_id\s*<game_id>'
  ```

- Read the runtime record and the operation log:

  ```shell
  curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
  psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
    "SELECT id, op_kind, op_source, outcome, error_code, started_at
       FROM rtmanager.operation_log
      WHERE game_id = '<game_id>'
      ORDER BY started_at DESC LIMIT 20"
  ```

- If Lobby has not reacted (the game's status remains `running` in `lobby.games`), check `runtime:job_results` lag and Lobby's `runtimejobresult` worker. RTM publishes the result; Lobby is the consumer.
- If the container is already gone (`docker ps -a` shows no row for `galaxy-game-<game_id>`), the reconciler will move the record to `removed` on its next pass. Triggering the periodic reconcile manually by sending `SIGHUP` is not supported — wait `RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the process; the synchronous boot pass will handle the drift.
- The `notification:intents` stream is not the place to look for ongoing health changes. Only the three first-touch start failures (`runtime.image_pull_failed`, `runtime.container_start_failed`, `runtime.start_config_invalid`) produce a notification intent; probe failures, OOMs, and exits flow through health events only (../README.md §Notification Contracts).
## Patch Upgrade
A patch upgrade replaces the container with a new `image_ref` while preserving the bind-mounted state directory.
Pre-conditions:
- The new and the current `image_ref` tags both parse as semver. RTM rejects non-semver tags with `image_ref_not_semver`.
- The new and the current major/minor versions match. A cross-major or cross-minor patch returns `semver_patch_only`.
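Both gates can be reproduced client-side before issuing the request, so a doomed patch never leaves the terminal. A rough sketch over the bare tags (the part of `image_ref` after the `:`), assuming plain `MAJOR.MINOR.PATCH` with no pre-release suffixes; this is not RTM's implementation:

```shell
# Sketch: mirror the two patch gates for plain MAJOR.MINOR.PATCH tags.
# Prints the error code RTM would return, or "ok".
check_patch_tags() (
  old_tag=$1
  new_tag=$2
  for tag in "$old_tag" "$new_tag"; do
    IFS=.
    set -- $tag   # unquoted on purpose: split the tag on '.'
    [ $# -eq 3 ] || { echo image_ref_not_semver; return 1; }
    for part in "$1" "$2" "$3"; do
      case $part in
        ''|*[!0-9]*) echo image_ref_not_semver; return 1 ;;
      esac
    done
  done
  # cross-major or cross-minor: everything before the last '.' must match
  [ "${old_tag%.*}" = "${new_tag%.*}" ] || { echo semver_patch_only; return 1; }
  echo ok
)
```

For example, `check_patch_tags 1.4.1 1.4.2` prints `ok`, while `check_patch_tags 1.4.1 1.5.0` prints `semver_patch_only`.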
Driving the upgrade:
```shell
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```
Behaviour:
- The container is stopped, removed, and recreated. The `current_container_id` changes; the `engine_endpoint` (`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and the old/new image refs.
- A `runtime:health_events` `container_started` event is emitted by the inner start (workers.md §1).
Post-patch verification:
```shell
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```

The `current_image_ref` field on the runtime record reflects the new tag.
## Manual Cleanup
The cleanup endpoint removes the container and updates the record to `removed`. It refuses to remove a running container — stop first.
```shell
# Stop, then clean up
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
  -d '{"reason":"admin_request"}'

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>` is never deleted by RTM. Removing the directory is operator domain (backup tooling, future Admin Service workflow). The `operation_log` records `op_kind=cleanup_container` with `op_source=admin_rest`.
## Reconcile Drift After Docker Daemon Restart
A Docker daemon restart drops every running engine container; PG records remain. On RTM's next boot (or its next periodic reconcile):
- The reconciler observes `running` records whose containers are missing from `docker ps`. It updates each record to `removed`, appends an `operation_log` row with `op_kind=reconcile_dispose`, and publishes a `runtime:health_events` `container_disappeared` event (workers.md §14–§15).
- Lobby's `runtimejobresult` worker does not consume the dispose event in v1, so the cascade does not auto-restart the engine. Operators trigger restarts through Lobby's user-facing flow or directly via the GM/Admin REST `restart` endpoint.
- If the operator brings up an engine container manually for diagnostics (`docker run` with the `com.galaxy.owner=rtmanager` and `com.galaxy.game_id=<game_id>` labels), the reconciler adopts it on the next pass: a new `runtime_records` row appears with `op_kind=reconcile_adopt`. The reconciler never stops or removes an unrecorded container — operators stay in control of manual containers (../README.md §Reconciliation).
Three drift kinds run through the same lease-guarded write pass: adopt, dispose, and the README-level path `observed_exited` (a record marked `running` whose container exists but is in `exited`). The telemetry counter `rtmanager.reconcile_drift{kind}` exposes the three independently (workers.md §15).
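The three kinds reduce to a small decision over what the reconciler observes: the record's status versus the container's presence and state. An illustrative classification only, with assumed labels for the inputs; it is not RTM's code:

```shell
# Sketch: classify reconcile drift. "record" is the runtime record's
# status ("running", or "unrecorded" when a labelled container has no
# record at all); "container" is the observed Docker state
# ("missing", "running", "exited").
drift_kind() {
  record=$1
  container=$2
  case "$record/$container" in
    running/missing)    echo dispose ;;
    running/exited)     echo observed_exited ;;
    unrecorded/running) echo adopt ;;
    *)                  echo none ;;
  esac
}
```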
## Testing Locally
```shell
# One-time bootstrap
docker network create galaxy-net

# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095

go run ./rtmanager/cmd/rtmanager
```
After start:
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}` once the PG, Redis, and Docker pings pass and the configured network exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`) brings up `galaxy-game-<game_id>` containers; RTM logs each lifecycle transition.
The integration suite under `rtmanager/integration/` exercises the end-to-end flows against the real Docker daemon. The default `go test ./...` skips it via the `integration` build tag; run explicitly with:

```shell
make -C rtmanager integration
```

The suite requires a reachable Docker daemon. Without one, the harness helpers call `t.Skip` and the package becomes a no-op (integration-tests.md §1).
## Diagnostic Queries
Durable runtime state lives in PostgreSQL; runtime-coordination state stays in Redis. CLI snippets that help during incidents:
```shell
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"

# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
     FROM rtmanager.operation_log
    WHERE game_id = '<game_id>'
    ORDER BY started_at DESC, id DESC
    LIMIT 20"

# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"

# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
  --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'

# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100

# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```
The gauges and counters exported through OpenTelemetry are the primary observability surface; treat raw PostgreSQL and Redis access as last-resort triage.