Configuration And Contract Examples

The examples below are illustrative. Replace localhost, port numbers, IDs, and timestamps with values that match the deployment under inspection.

Example .env

A minimum-viable RTMANAGER_* set for a local run against a single Redis container plus a PostgreSQL container with the rtmanager schema and the rtmanagerservice role provisioned. The full list with defaults lives in ../README.md §Configuration.

# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games

# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s

# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0

# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h

# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60

# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s

# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none

For a production-shaped deployment, set RTMANAGER_IMAGE_PULL_POLICY=always (this forces a pull on every start, so a mutated tag is picked up by the next runtime start), set RTMANAGER_GAME_STATE_OWNER_UID / _GID to match the engine container's user, and point OTEL_* at the cluster's OTLP collector. The RTMANAGER_DOCKER_LOG_DRIVER / RTMANAGER_DOCKER_LOG_OPTS pair routes engine stdout/stderr to whichever sink the operator runs (fluentd, journald, etc.).

For tests, point RTMANAGER_POSTGRES_PRIMARY_DSN and RTMANAGER_REDIS_MASTER_ADDR at the testcontainers fixtures the service-local harness brings up (integration-tests.md §7).
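Before a local run it can help to confirm that every variable under "# Required" above is actually set. The following is a minimal sketch, not part of the service: the helper names parse_env_text and missing_required are hypothetical, and the required-key list is copied from the example .env rather than from the service's own configuration loader.

```python
# Keys listed under "# Required" in the example .env above.
REQUIRED_KEYS = (
    "RTMANAGER_INTERNAL_HTTP_ADDR",
    "RTMANAGER_POSTGRES_PRIMARY_DSN",
    "RTMANAGER_REDIS_MASTER_ADDR",
    "RTMANAGER_REDIS_PASSWORD",
    "RTMANAGER_DOCKER_HOST",
    "RTMANAGER_DOCKER_NETWORK",
    "RTMANAGER_GAME_STATE_ROOT",
)


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so DSN query strings stay intact.
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_required(env):
    """Return the required keys that are absent or empty, in listed order."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling missing_required(parse_env_text(open(".env").read())) returns an empty list when the file covers the required set.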

Internal HTTP Examples

Every endpoint accepts the optional X-Galaxy-Caller header, which the handler records as op_source in operation_log (gm maps to gm_rest, admin maps to admin_rest; missing or unknown values default to admin_rest in v1). Decision: services.md §18.
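The caller-header rule can be sketched as a small lookup with a default. This is an illustration only: the function name op_source_for is hypothetical, the gm and admin header values are taken from the curl examples in this document, and exact-match comparison of the header value is an assumption.

```python
# Header value -> op_source recorded in operation_log (assumed mapping).
CALLER_TO_OP_SOURCE = {"gm": "gm_rest", "admin": "admin_rest"}
DEFAULT_OP_SOURCE = "admin_rest"  # v1 default for missing or unknown callers


def op_source_for(headers):
    """Resolve op_source from a mapping of request headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    caller = lowered.get("x-galaxy-caller", "").strip()
    return CALLER_TO_OP_SOURCE.get(caller, DEFAULT_OP_SOURCE)
```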

Probe a runtime record

curl -s -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ...

Response (200 OK):

{
  "game_id": "game-01HZ...",
  "status": "running",
  "current_container_id": "1f2a...",
  "current_image_ref": "galaxy/game:1.4.0",
  "engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
  "state_path": "/var/lib/galaxy/games/game-01HZ...",
  "docker_network": "galaxy-net",
  "started_at": "2026-04-28T07:18:54Z",
  "stopped_at": null,
  "removed_at": null,
  "last_op_at": "2026-04-28T07:18:54Z",
  "created_at": "2026-04-28T07:18:54Z"
}

List all runtimes

curl -s -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes

The response shape is {"items":[<RuntimeRecord>...]}.

Start a runtime

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
  -d '{"image_ref": "galaxy/game:1.4.0"}'

A 200 returns the RuntimeRecord for the running runtime. Failure shapes use the canonical envelope; e.g. an invalid image_ref:

{
  "error": {
    "code": "start_config_invalid",
    "message": "image_ref shape rejected by docker reference parser"
  }
}
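A client can treat any body carrying a top-level error object as a failure. The sketch below shows one way to extract the code and message from the envelope above; the function name parse_error_envelope is hypothetical, and the sketch assumes both fields are strings, as in the example.

```python
import json


def parse_error_envelope(body):
    """Return (code, message) for an error envelope, or None otherwise."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    error = doc.get("error")
    if not isinstance(error, dict):
        return None
    code = error.get("code")
    message = error.get("message")
    if not isinstance(code, str) or not isinstance(message, str):
        return None
    return code, message
```

A RuntimeRecord body has no error key, so the function returns None for successful responses.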

Stop a runtime

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
  -d '{"reason": "admin_request"}'

Valid reason values: orphan_cleanup | cancelled | finished | admin_request | timeout.
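A caller can reject an unknown reason before sending the request. A minimal sketch, assuming the five values listed above are the complete set; the helper name stop_request_body is hypothetical.

```python
import json

# The reason values listed above.
STOP_REASONS = frozenset(
    {"orphan_cleanup", "cancelled", "finished", "admin_request", "timeout"}
)


def stop_request_body(reason):
    """Build the JSON body for the stop endpoint, validating the reason."""
    if reason not in STOP_REASONS:
        raise ValueError(f"unknown stop reason: {reason!r}")
    return json.dumps({"reason": reason})
```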

Restart a runtime

curl -s -X POST \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart

The body is empty; restart re-uses the current image_ref.

Patch a runtime

curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'

Patch enforces the semver-only rule: a non-semver tag returns image_ref_not_semver; a cross-major or cross-minor change returns semver_patch_only.
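The patch rule can be approximated client-side as follows. This is a sketch under stated assumptions, not the service's implementation: it accepts only plain MAJOR.MINOR.PATCH tags (how the service treats pre-release suffixes or a leading "v" is not documented here), it applies the same semver check to the current reference, and the function names are hypothetical.

```python
import re

_SEMVER_TAG = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def semver_of(image_ref):
    """Return (major, minor, patch) for the tag of image_ref, or None."""
    _, sep, tag = image_ref.rpartition(":")
    # No colon, or the colon belongs to a registry port such as host:5000/name.
    if not sep or "/" in tag:
        return None
    match = _SEMVER_TAG.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def check_patch(current_ref, new_ref):
    """Return None if the patch looks allowed, else the expected error code."""
    current = semver_of(current_ref)
    new = semver_of(new_ref)
    if current is None or new is None:
        return "image_ref_not_semver"
    if current[:2] != new[:2]:
        # Major or minor differs: only patch-level changes are allowed.
        return "semver_patch_only"
    return None
```

For the request above, check_patch("galaxy/game:1.4.0", "galaxy/game:1.4.2") returns None.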

Cleanup a stopped runtime container

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container

Cleanup refuses a running runtime with 409 conflict; stop the runtime first.

Stream Payload Examples

Every stream key shape is configurable via RTMANAGER_REDIS_*_STREAM; the defaults are used below. Field types and required/optional semantics are frozen by ../api/runtime-jobs-asyncapi.yaml and ../api/runtime-health-asyncapi.yaml.

runtime:start_jobs (Lobby → RTM)

redis-cli XADD runtime:start_jobs '*' \
  game_id 'game-01HZ...' \
  image_ref 'galaxy/game:1.4.0' \
  requested_at_ms 1714081234567

runtime:stop_jobs (Lobby → RTM)

redis-cli XADD runtime:stop_jobs '*' \
  game_id 'game-01HZ...' \
  reason 'cancelled' \
  requested_at_ms 1714081234567

runtime:job_results (RTM → Lobby)

Success envelope:

redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code '' \
  error_message ''

Failure envelope:

redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'failure' \
  container_id '' \
  engine_endpoint '' \
  error_code 'image_pull_failed' \
  error_message 'pull failed: manifest unknown'

Idempotent replay envelope (success outcome with explicit replay_no_op):

redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code 'replay_no_op' \
  error_message ''

The contract permits empty container_id and engine_endpoint strings for every outcome value, so the consumer can decode the envelope uniformly (workers.md §11).
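Because all three envelopes above share one field set, a consumer can decode them with a single routine. The sketch below assumes the entry arrives as a dict of string fields (as a Redis client library typically returns it), treats game_id and outcome as required, and defaults the other fields to empty strings; the authoritative required/optional rules are in ../api/runtime-jobs-asyncapi.yaml. The JobResult type and decode_job_result name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    game_id: str
    outcome: str
    container_id: str = ""
    engine_endpoint: str = ""
    error_code: str = ""
    error_message: str = ""

    @property
    def is_replay(self):
        """True for the idempotent replay envelope shown above."""
        return self.outcome == "success" and self.error_code == "replay_no_op"


def decode_job_result(fields):
    """Decode one runtime:job_results entry from its field mapping."""
    for name in ("game_id", "outcome"):
        if not fields.get(name):
            raise ValueError(f"missing field: {name}")
    return JobResult(
        game_id=fields["game_id"],
        outcome=fields["outcome"],
        container_id=fields.get("container_id", ""),
        engine_endpoint=fields.get("engine_endpoint", ""),
        error_code=fields.get("error_code", ""),
        error_message=fields.get("error_message", ""),
    )
```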

runtime:health_events (RTM out)

The wire shape is the same for every event type — only the details payload differs.

container_started:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_started' \
  occurred_at_ms 1714081234567 \
  details '{"image_ref":"galaxy/game:1.4.0"}'

container_exited:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_exited' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137,"oom":false}'

container_oom:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_oom' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137}'

container_disappeared:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_disappeared' \
  occurred_at_ms 1714081234567 \
  details '{}'

inspect_unhealthy:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'inspect_unhealthy' \
  occurred_at_ms 1714081234567 \
  details '{"restart_count":3,"state":"running","health":"unhealthy"}'

probe_failed (after the threshold is crossed):

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_failed' \
  occurred_at_ms 1714081234567 \
  details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'

probe_recovered:

redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_recovered' \
  occurred_at_ms 1714081234567 \
  details '{"prior_failure_count":3}'
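Since every event type above uses the same wire fields, one decoder can serve all of them and leave the type-specific details as a plain dict. A minimal sketch, assuming the entry arrives as a dict of string fields and that details is always a JSON object; the function name decode_health_event is hypothetical.

```python
import json
from datetime import datetime, timezone


def decode_health_event(fields):
    """Decode one runtime:health_events entry into a plain dict."""
    details = json.loads(fields["details"])
    if not isinstance(details, dict):
        raise ValueError("details must be a JSON object")
    # occurred_at_ms is a Unix timestamp in milliseconds.
    occurred_at = datetime.fromtimestamp(
        int(fields["occurred_at_ms"]) / 1000, tz=timezone.utc
    )
    return {
        "game_id": fields["game_id"],
        "container_id": fields["container_id"],
        "event_type": fields["event_type"],
        "occurred_at": occurred_at,
        "details": details,
    }
```

Callers can then branch on event_type and read only the details keys that type carries, for example exit_code and oom for container_exited.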

notification:intents (RTM admin notifications)

RTM publishes admin-only notification intents only for the three first-touch start failures. Every payload shares the frozen field set {game_id, image_ref, error_code, error_message, attempted_at_ms} (../README.md §Notification Contracts).

runtime.image_pull_failed:

redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "runtime.image_pull_failed",
    "producer": "rtmanager",
    "idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
    "audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
    "payload": {
      "game_id": "game-01HZ...",
      "image_ref": "galaxy/game:1.4.0",
      "error_code": "image_pull_failed",
      "error_message": "pull failed: manifest unknown",
      "attempted_at_ms": 1714081234567
    }
  }'

runtime.container_start_failed and runtime.start_config_invalid share the same envelope with their respective type and error_code values.
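The shared envelope can be sketched as one builder parameterised by type. Two details here are assumptions generalised from the single example above, not documented rules: the idempotency_key layout (type, game ID, and attempted_at_ms joined by colons) and deriving email_address_kind by replacing the dot in the type with an underscore. The function name build_intent_envelope is hypothetical.

```python
import json


def build_intent_envelope(intent_type, game_id, image_ref,
                          error_code, error_message, attempted_at_ms):
    """Build the JSON envelope for one admin notification intent."""
    envelope = {
        "type": intent_type,
        "producer": "rtmanager",
        # Assumed layout, inferred from the example above.
        "idempotency_key": f"{intent_type}:{game_id}:{attempted_at_ms}",
        "audience": {
            "kind": "admin_email",
            # Assumed derivation: "runtime.x" -> "runtime_x".
            "email_address_kind": intent_type.replace(".", "_"),
        },
        # The frozen payload field set.
        "payload": {
            "game_id": game_id,
            "image_ref": image_ref,
            "error_code": error_code,
            "error_message": error_message,
            "attempted_at_ms": attempted_at_ms,
        },
    }
    return json.dumps(envelope)
```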

Storage Inspection

Inspect a runtime record (PostgreSQL)

psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"

Columns mirror the fields documented in ../README.md §Persistence Layout.

Inspect runtime status counts

psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

Inspect the operation log for a game

psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
   FROM rtmanager.operation_log
   WHERE game_id = 'game-01HZ...'
   ORDER BY started_at DESC, id DESC
   LIMIT 50"

Inspect the latest health snapshot

psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id, container_id, status, source, observed_at, details
   FROM rtmanager.health_snapshots
   WHERE game_id = 'game-01HZ...'"

Inspect Redis runtime-coordination keys

# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...

# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50

# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
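When reading the per-game lease commands above, note that Redis TTL replies -2 when the key does not exist and -1 when the key exists without an expiry. The sketch below turns that reply into a short description; the function name describe_lease_ttl is hypothetical, and calling the -1 case unexpected is an assumption based on the lease having a configured TTL (RTMANAGER_GAME_LEASE_TTL_SECONDS).

```python
def describe_lease_ttl(ttl_reply):
    """Describe the integer reply of `TTL rtmanager:game_lease:<game_id>`."""
    if ttl_reply == -2:
        # Key absent: the lease only exists while an operation is in flight.
        return "no lease: no operation in flight"
    if ttl_reply == -1:
        # Key present but without expiry; assumed not to occur normally.
        return "lease without expiry: unexpected, investigate"
    return f"lease held: expires in {ttl_reply}s"
```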