Configuration And Contract Examples
The examples below are illustrative. Replace localhost, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.
Example .env
A minimum-viable RTMANAGER_* set for a local run against a single
Redis container plus a PostgreSQL container with the rtmanager
schema and the rtmanagerservice role provisioned. The full list
with defaults lives in ../README.md §Configuration.
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0
# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h
# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s
# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
For a production-shaped deployment, set
RTMANAGER_IMAGE_PULL_POLICY=always (forces a pull on every start so
a tag mutation is immediately visible to the next runtime),
RTMANAGER_GAME_STATE_OWNER_UID / _GID to match the engine
container's user, and configure OTEL_* against the cluster's OTLP
collector. The RTMANAGER_DOCKER_LOG_DRIVER /
RTMANAGER_DOCKER_LOG_OPTS pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).
For tests, point RTMANAGER_POSTGRES_PRIMARY_DSN and
RTMANAGER_REDIS_MASTER_ADDR at the testcontainers fixtures the
service-local harness brings up
(integration-tests.md §7).
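Before starting the service, it can help to check that an .env file carries every variable from the `# Required` group above. The following is a minimal sketch, not the service's own loader: `parse_env` and `missing_required` are hypothetical helper names, and the required-key list is copied from the example rather than from the authoritative table in ../README.md §Configuration.

```python
# Keys copied from the "# Required" group of the example .env (assumption:
# the authoritative list lives in ../README.md §Configuration).
REQUIRED_KEYS = (
    "RTMANAGER_INTERNAL_HTTP_ADDR",
    "RTMANAGER_POSTGRES_PRIMARY_DSN",
    "RTMANAGER_REDIS_MASTER_ADDR",
    "RTMANAGER_REDIS_PASSWORD",
    "RTMANAGER_DOCKER_HOST",
    "RTMANAGER_DOCKER_NETWORK",
    "RTMANAGER_GAME_STATE_ROOT",
)


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so DSN query strings stay intact.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        values[key.strip()] = value.strip()
    return values


def missing_required(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_required(parse_env(open(".env").read()))` returns an empty list when the required group is complete.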
Internal HTTP Examples
Every endpoint accepts the optional X-Galaxy-Caller header, which the
handler records as op_source in operation_log (gm → gm_rest,
admin → admin_rest; missing or unknown values default to
admin_rest in v1). Decision: services.md §18.
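The header-to-op_source mapping described above can be sketched as a small lookup with a default. This is an illustration only: `op_source_for` is a hypothetical name, and exact, case-sensitive matching of the header value is an assumption that services.md §18 may refine.

```python
# Mapping taken from the paragraph above; anything else falls back to
# admin_rest, as stated for v1.
OP_SOURCE_BY_CALLER = {"gm": "gm_rest", "admin": "admin_rest"}
DEFAULT_OP_SOURCE = "admin_rest"


def op_source_for(caller_header):
    """Map an X-Galaxy-Caller value (or None when absent) to op_source."""
    if caller_header is None:
        return DEFAULT_OP_SOURCE
    # Assumption: the value is compared exactly, without trimming or
    # case folding.
    return OP_SOURCE_BY_CALLER.get(caller_header, DEFAULT_OP_SOURCE)
```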
Probe a runtime record
curl -s -H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
Response (200 OK):
{
"game_id": "game-01HZ...",
"status": "running",
"current_container_id": "1f2a...",
"current_image_ref": "galaxy/game:1.4.0",
"engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
"state_path": "/var/lib/galaxy/games/game-01HZ...",
"docker_network": "galaxy-net",
"started_at": "2026-04-28T07:18:54Z",
"stopped_at": null,
"removed_at": null,
"last_op_at": "2026-04-28T07:18:54Z",
"created_at": "2026-04-28T07:18:54Z"
}
List all runtimes
curl -s -H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes
The response shape is {"items":[<RuntimeRecord>...]}.
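A client that consumes the list response might decode it as follows. This is a hedged sketch: `decode_runtime_list` is a hypothetical helper, and the assumption that every record carries all twelve fields (with `null` for unset timestamps) is inferred from the single-record example above, not from a schema.

```python
import json

# Field names taken from the RuntimeRecord example response above.
RUNTIME_RECORD_FIELDS = frozenset({
    "game_id", "status", "current_container_id", "current_image_ref",
    "engine_endpoint", "state_path", "docker_network", "started_at",
    "stopped_at", "removed_at", "last_op_at", "created_at",
})


def decode_runtime_list(body: str) -> list:
    """Decode {"items": [...]} and check each record's field set."""
    items = json.loads(body)["items"]
    for item in items:
        missing = RUNTIME_RECORD_FIELDS - set(item)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
    return items
```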
Start a runtime
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
-d '{"image_ref": "galaxy/game:1.4.0"}'
A 200 response returns the RuntimeRecord for the running runtime.
Failures use the canonical error envelope; for example, an invalid
image_ref returns:
{
"error": {
"code": "start_config_invalid",
"message": "image_ref shape rejected by docker reference parser"
}
}
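A caller can turn the error envelope into a typed exception. The sketch below is illustrative: `RTMError` and `raise_for_envelope` are hypothetical names, and the HTTP status that accompanies a given `code` is not specified in this document, so the function only distinguishes 2xx from everything else.

```python
import json


class RTMError(Exception):
    """Carries the code and message fields of the error envelope."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


def raise_for_envelope(status: int, body: str) -> dict:
    """Return the decoded body on 2xx, otherwise raise RTMError."""
    doc = json.loads(body) if body else {}
    if 200 <= status < 300:
        return doc
    # Assumption: non-2xx bodies follow {"error": {"code", "message"}};
    # fall back to placeholders when a field is absent.
    err = doc.get("error", {})
    raise RTMError(status, err.get("code", "unknown"), err.get("message", ""))
```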
Stop a runtime
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
-d '{"reason": "admin_request"}'
Valid reason values:
orphan_cleanup | cancelled | finished | admin_request | timeout.
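A client can reject an unknown reason before sending the request. This is a minimal sketch; `stop_body` is a hypothetical helper, and the server remains the authority on which values it accepts.

```python
import json

# Values listed above for the stop endpoint.
STOP_REASONS = (
    "orphan_cleanup", "cancelled", "finished", "admin_request", "timeout",
)


def stop_body(reason: str) -> str:
    """Build the JSON body for POST .../stop, validating the reason."""
    if reason not in STOP_REASONS:
        raise ValueError(f"unknown stop reason: {reason!r}")
    return json.dumps({"reason": reason})
```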
Restart a runtime
curl -s -X POST \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
The body is empty; restart re-uses the current image_ref.
Patch a runtime
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
Patch enforces the semver-only rule: a non-semver tag returns
image_ref_not_semver; a cross-major or cross-minor change returns
semver_patch_only.
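The patch rule can be approximated as below. This is a sketch under stated assumptions, not the service's implementation: it treats only a strict `MAJOR.MINOR.PATCH` tag as semver (no `v` prefix, pre-release, or build metadata), does not compare repositories, and allows any patch number including a downgrade. `parse_semver_tag` and `check_patch` are hypothetical names.

```python
import re

# Assumption: only plain MAJOR.MINOR.PATCH tags count as semver here.
_SEMVER_TAG = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_semver_tag(image_ref: str):
    """Return (major, minor, patch) from the tag, or None if not semver."""
    _, sep, tag = image_ref.rpartition(":")
    if not sep:
        return None
    match = _SEMVER_TAG.match(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def check_patch(current_ref: str, new_ref: str):
    """Return None when the patch is allowed, else the error code."""
    current = parse_semver_tag(current_ref)
    new = parse_semver_tag(new_ref)
    if current is None or new is None:
        return "image_ref_not_semver"
    # Major and minor must match; only the patch component may change.
    if current[:2] != new[:2]:
        return "semver_patch_only"
    return None
```

With the example above, moving from `galaxy/game:1.4.0` to `galaxy/game:1.4.2` passes, while a move to `galaxy/game:1.5.0` yields `semver_patch_only`.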
Clean up a stopped runtime container
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
Cleanup refuses a running runtime with 409 conflict; stop first.
Stream Payload Examples
Every stream key shape is configurable via RTMANAGER_REDIS_*_STREAM;
the defaults are used below. Field types and required/optional
semantics are frozen by
../api/runtime-jobs-asyncapi.yaml
and
../api/runtime-health-asyncapi.yaml.
runtime:start_jobs (Lobby → RTM)
redis-cli XADD runtime:start_jobs '*' \
game_id 'game-01HZ...' \
image_ref 'galaxy/game:1.4.0' \
requested_at_ms 1714081234567
runtime:stop_jobs (Lobby → RTM)
redis-cli XADD runtime:stop_jobs '*' \
game_id 'game-01HZ...' \
reason 'cancelled' \
requested_at_ms 1714081234567
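A producer written in code rather than `redis-cli` needs the same flat field list. The sketch below builds the field mapping and the equivalent `XADD` argument vector without depending on a Redis client library; `start_job_fields`, `stop_job_fields`, and `xadd_args` are hypothetical helper names, and the AsyncAPI files above remain the source of truth for field types.

```python
import time


def _now_ms() -> int:
    """Current wall-clock time in Unix milliseconds."""
    return int(time.time() * 1000)


def start_job_fields(game_id: str, image_ref: str, now_ms=None) -> dict:
    """Fields for runtime:start_jobs, mirroring the redis-cli example."""
    ms = _now_ms() if now_ms is None else now_ms
    return {"game_id": game_id, "image_ref": image_ref,
            "requested_at_ms": str(ms)}


def stop_job_fields(game_id: str, reason: str, now_ms=None) -> dict:
    """Fields for runtime:stop_jobs, mirroring the redis-cli example."""
    ms = _now_ms() if now_ms is None else now_ms
    return {"game_id": game_id, "reason": reason,
            "requested_at_ms": str(ms)}


def xadd_args(stream: str, fields: dict) -> list:
    """Flatten fields into XADD <stream> * k1 v1 k2 v2 ... arguments."""
    args = ["XADD", stream, "*"]
    for key, value in fields.items():
        args.extend([key, value])
    return args
```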
runtime:job_results (RTM → Lobby)
Success envelope:
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code '' \
error_message ''
Failure envelope:
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'failure' \
container_id '' \
engine_endpoint '' \
error_code 'image_pull_failed' \
error_message 'pull failed: manifest unknown'
Idempotent replay envelope (success outcome with explicit
replay_no_op):
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code 'replay_no_op' \
error_message ''
The contract permits empty container_id and engine_endpoint
strings on every value of outcome so the consumer can decode the
envelope uniformly (workers.md §11).
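Because every field may be an empty string, a consumer can decode all three envelope variants with one code path. The sketch below is an illustration: `JobResult` and `decode_job_result` are hypothetical names, and treating `success` and `failure` as the only outcomes is an assumption drawn from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    game_id: str
    outcome: str
    container_id: str
    engine_endpoint: str
    error_code: str
    error_message: str

    @property
    def is_replay(self) -> bool:
        """True for the idempotent replay envelope shown above."""
        return self.outcome == "success" and self.error_code == "replay_no_op"


def decode_job_result(fields: dict) -> JobResult:
    """Decode a runtime:job_results entry given as a str -> str mapping."""
    outcome = fields["outcome"]
    # Assumption: only these two outcomes exist in the contract.
    if outcome not in ("success", "failure"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    return JobResult(
        game_id=fields["game_id"],
        outcome=outcome,
        container_id=fields.get("container_id", ""),
        engine_endpoint=fields.get("engine_endpoint", ""),
        error_code=fields.get("error_code", ""),
        error_message=fields.get("error_message", ""),
    )
```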
runtime:health_events (RTM out)
The wire shape is the same for every event type — only the
details payload differs.
container_started:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_started' \
occurred_at_ms 1714081234567 \
details '{"image_ref":"galaxy/game:1.4.0"}'
container_exited:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_exited' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137,"oom":false}'
container_oom:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_oom' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137}'
container_disappeared:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_disappeared' \
occurred_at_ms 1714081234567 \
details '{}'
inspect_unhealthy:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'inspect_unhealthy' \
occurred_at_ms 1714081234567 \
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
probe_failed (after the threshold is crossed):
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_failed' \
occurred_at_ms 1714081234567 \
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
probe_recovered:
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_recovered' \
occurred_at_ms 1714081234567 \
details '{"prior_failure_count":3}'
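Since only `details` varies between event types, a consumer can decode the shared fields once and parse `details` as JSON. This is a hedged sketch: `decode_health_event` is a hypothetical helper, and the set of event types is limited to the seven shown above, which may not be exhaustive.

```python
import json

# Event types taken from the examples above (assumption: the AsyncAPI
# file may define more).
HEALTH_EVENT_TYPES = frozenset({
    "container_started", "container_exited", "container_oom",
    "container_disappeared", "inspect_unhealthy", "probe_failed",
    "probe_recovered",
})


def decode_health_event(fields: dict) -> dict:
    """Decode a runtime:health_events entry given as a str -> str mapping."""
    event_type = fields["event_type"]
    if event_type not in HEALTH_EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type!r}")
    return {
        "game_id": fields["game_id"],
        "container_id": fields["container_id"],
        "event_type": event_type,
        "occurred_at_ms": int(fields["occurred_at_ms"]),
        # details travels as a JSON string inside the stream entry.
        "details": json.loads(fields["details"]),
    }
```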
notification:intents (RTM admin notifications)
RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set {game_id, image_ref, error_code, error_message, attempted_at_ms}
(../README.md §Notification Contracts).
runtime.image_pull_failed:
redis-cli XADD notification:intents '*' \
envelope '{
"type": "runtime.image_pull_failed",
"producer": "rtmanager",
"idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
"audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
"payload": {
"game_id": "game-01HZ...",
"image_ref": "galaxy/game:1.4.0",
"error_code": "image_pull_failed",
"error_message": "pull failed: manifest unknown",
"attempted_at_ms": 1714081234567
}
}'
runtime.container_start_failed and runtime.start_config_invalid
share the same envelope with their respective type and
error_code values.
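The three envelopes can be produced by one builder. The sketch below rests on several assumptions that the example only suggests: that `type` is `runtime.` plus the error code, that `idempotency_key` is `type:game_id:attempted_at_ms`, and that `email_address_kind` is the type with the dot replaced by an underscore. Only the `image_pull_failed` values are confirmed by the example; check ../README.md §Notification Contracts for the other two. `build_start_failure_intent` is a hypothetical name.

```python
import json

# The three first-touch start failures named above.
START_FAILURE_CODES = (
    "image_pull_failed", "container_start_failed", "start_config_invalid",
)


def build_start_failure_intent(error_code: str, game_id: str, image_ref: str,
                               error_message: str, attempted_at_ms: int) -> dict:
    """Build the notification:intents envelope for a start failure."""
    if error_code not in START_FAILURE_CODES:
        raise ValueError(f"not a start failure code: {error_code!r}")
    intent_type = f"runtime.{error_code}"
    return {
        "type": intent_type,
        "producer": "rtmanager",
        # Assumption: key layout inferred from the single example above.
        "idempotency_key": f"{intent_type}:{game_id}:{attempted_at_ms}",
        "audience": {
            "kind": "admin_email",
            # Assumption: derived by pattern from runtime_image_pull_failed.
            "email_address_kind": intent_type.replace(".", "_"),
        },
        "payload": {
            "game_id": game_id,
            "image_ref": image_ref,
            "error_code": error_code,
            "error_message": error_message,
            "attempted_at_ms": attempted_at_ms,
        },
    }


def envelope_field(intent: dict) -> str:
    """Serialize the intent for the stream's `envelope` field."""
    return json.dumps(intent, separators=(",", ":"))
```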
Storage Inspection
Inspect a runtime record (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
Columns mirror the fields documented in
../README.md §Persistence Layout.
Inspect runtime status counts
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
Inspect the operation log for a game
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = 'game-01HZ...'
ORDER BY started_at DESC, id DESC
LIMIT 50"
Inspect the latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, container_id, status, source, observed_at, details
FROM rtmanager.health_snapshots
WHERE game_id = 'game-01HZ...'"
Inspect Redis runtime-coordination keys
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...
# Recent stream entries (newest first; XRANGE - + COUNT n would return the oldest)
redis-cli XREVRANGE runtime:start_jobs + - COUNT 20
redis-cli XREVRANGE runtime:job_results + - COUNT 20
redis-cli XREVRANGE runtime:health_events + - COUNT 50
# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events