# Configuration And Contract Examples

The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.

## Example `.env`

A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).

```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games

# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s

# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0

# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h

# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60

# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s

# Telemetry (disabled for local dev; enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```

For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
a tag mutation is immediately visible to the next runtime), set
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and point `OTEL_*` at the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).

For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).

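Before launching against a new environment, it can help to confirm that the variables listed under `# Required` in the example above are all set. The helper below is a minimal sketch, not part of the service: the function name `missing_required_vars` is hypothetical, and the variable list is copied from the example rather than from the authoritative list in the README.

```bash
# Hypothetical pre-flight helper (not shipped with the service).
# Prints the name of every required variable that is unset or empty.
# The list mirrors the "# Required" group of the example .env.
missing_required_vars() {
  for _name in \
    RTMANAGER_INTERNAL_HTTP_ADDR \
    RTMANAGER_POSTGRES_PRIMARY_DSN \
    RTMANAGER_REDIS_MASTER_ADDR \
    RTMANAGER_REDIS_PASSWORD \
    RTMANAGER_DOCKER_HOST \
    RTMANAGER_DOCKER_NETWORK \
    RTMANAGER_GAME_STATE_ROOT
  do
    # Indirect lookup that works in POSIX sh as well as bash.
    eval "_value=\${$_name:-}"
    [ -n "$_value" ] || printf '%s\n' "$_name"
  done
}
```

An empty result means every variable in the list is present. Note that the DSN in the example contains an unquoted `&`, so the file as written is not safe to `source` directly from a shell; load it with whatever env-file mechanism the deployment already uses.
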
## Internal HTTP Examples

Every endpoint accepts the optional `X-Galaxy-Caller` header, which the
handler records as `op_source` in `operation_log` (`gm` → `gm_rest`,
`admin` → `admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).

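The header-to-`op_source` mapping described above can be written out as a small lookup, which is handy when predicting what a given request will leave in `operation_log`. This is an illustrative sketch only: `op_source_for` is a hypothetical name, and exact, case-sensitive matching is an assumption not stated in this document.

```bash
# Illustrative mirror of the documented mapping; the real logic lives
# in the HTTP handler. Case-sensitive matching is an assumption.
op_source_for() {
  case "${1:-}" in
    gm)    echo gm_rest ;;
    admin) echo admin_rest ;;
    *)     echo admin_rest ;;  # missing or unknown caller (v1 default)
  esac
}
```

For example, `op_source_for gm` prints `gm_rest`, while calling it with no argument or with an unrecognized value prints `admin_rest`.
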
### Probe a runtime record

```bash
curl -s -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```

Response (`200 OK`):

```json
{
  "game_id": "game-01HZ...",
  "status": "running",
  "current_container_id": "1f2a...",
  "current_image_ref": "galaxy/game:1.4.0",
  "engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
  "state_path": "/var/lib/galaxy/games/game-01HZ...",
  "docker_network": "galaxy-net",
  "started_at": "2026-04-28T07:18:54Z",
  "stopped_at": null,
  "removed_at": null,
  "last_op_at": "2026-04-28T07:18:54Z",
  "created_at": "2026-04-28T07:18:54Z"
}
```

### List all runtimes

```bash
curl -s -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes
```

The response shape is `{"items":[<RuntimeRecord>...]}`.

### Start a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
  -d '{"image_ref": "galaxy/game:1.4.0"}'
```

A `200` response carries the `RuntimeRecord` for the running runtime.
Failures use the canonical error envelope; for example, an invalid
`image_ref`:

```json
{
  "error": {
    "code": "start_config_invalid",
    "message": "image_ref shape rejected by docker reference parser"
  }
}
```

### Stop a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
  -d '{"reason": "admin_request"}'
```

Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.

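When scripting stop requests, checking the `reason` locally avoids a round trip for an obvious typo. The sketch below simply encodes the list of valid values given above; `is_valid_stop_reason` is a hypothetical helper name, and the service remains the authority on what it accepts.

```bash
# Hypothetical client-side check against the documented reason values.
# Returns 0 for a listed reason, 1 otherwise.
is_valid_stop_reason() {
  case "${1:-}" in
    orphan_cleanup|cancelled|finished|admin_request|timeout) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper script might call `is_valid_stop_reason "$reason" || exit 2` before issuing the `curl` request; note that the spelling `cancelled` (double `l`) is the one listed.
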
### Restart a runtime

```bash
curl -s -X POST \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```

The body is empty; restart reuses the current `image_ref`.

### Patch a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```

Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.

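The patch rule can be approximated locally to predict which error code a request would draw. The following is a rough sketch under stated assumptions, not the service's parser: `patch_precheck` is a hypothetical name, it accepts only plain numeric `MAJOR.MINOR.PATCH` tags (no `v` prefix, pre-release, or build metadata, and it does not reject leading zeros), and it does not compare repositories.

```bash
# Hypothetical pre-check for the patch rule. Prints one of:
#   ok | image_ref_not_semver | semver_patch_only
# Assumes plain numeric MAJOR.MINOR.PATCH tags; the real parser may differ.
patch_precheck() {
  _current_tag=${1##*:}  # text after the last ':' of the current image_ref
  _target_tag=${2##*:}   # text after the last ':' of the requested image_ref
  for _tag in "$_current_tag" "$_target_tag"; do
    case "$_tag" in
      *[!0-9.]*|*.*.*.*|.*|*.|*..*) echo image_ref_not_semver; return 0 ;;
      *.*.*) ;;  # exactly three non-empty numeric parts
      *) echo image_ref_not_semver; return 0 ;;
    esac
  done
  # Same MAJOR.MINOR once the trailing ".PATCH" is stripped => patch-only change.
  if [ "${_current_tag%.*}" = "${_target_tag%.*}" ]; then
    echo ok
  else
    echo semver_patch_only
  fi
}
```

With the values used in this section, `patch_precheck galaxy/game:1.4.0 galaxy/game:1.4.2` prints `ok`. How the service treats a patch to the identical version is not covered here, so the sketch simply reports `ok` in that case.
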
### Cleanup a stopped runtime container

```bash
curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```

Cleanup refuses a `running` runtime with `409 conflict`; stop the
runtime first.

## Stream Payload Examples

Every stream key is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).

### `runtime:start_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:start_jobs '*' \
  game_id 'game-01HZ...' \
  image_ref 'galaxy/game:1.4.0' \
  requested_at_ms 1714081234567
```

### `runtime:stop_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:stop_jobs '*' \
  game_id 'game-01HZ...' \
  reason 'cancelled' \
  requested_at_ms 1714081234567
```

### `runtime:job_results` (RTM → Lobby)

Success envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code '' \
  error_message ''
```

Failure envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'failure' \
  container_id '' \
  engine_endpoint '' \
  error_code 'image_pull_failed' \
  error_message 'pull failed: manifest unknown'
```

Idempotent replay envelope (success outcome with explicit
`replay_no_op`):

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code 'replay_no_op' \
  error_message ''
```

The contract permits empty `container_id` and `engine_endpoint`
strings for every `outcome` value so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).

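Because all three envelopes share one field set, a consumer can tell them apart from `outcome` and `error_code` alone. The sketch below illustrates that decision; `classify_job_result` is a hypothetical helper, and treating any other `success` entry as a plain success is an assumption rather than something the contract states here.

```bash
# Hypothetical classifier for runtime:job_results entries.
# $1 = outcome, $2 = error_code (may be empty).
classify_job_result() {
  case "${1:-}:${2:-}" in
    success:replay_no_op) echo replay ;;   # idempotent replay, nothing changed
    success:*)            echo success ;;
    failure:*)            echo failure ;;
    *)                    echo unknown ;;  # outcome not covered by the examples
  esac
}
```

Applied to the three examples above, it prints `success`, `failure`, and `replay` respectively.
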
### `runtime:health_events` (RTM out)

The wire shape is the same for every event type; only the
`details` payload differs.

`container_started`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_started' \
  occurred_at_ms 1714081234567 \
  details '{"image_ref":"galaxy/game:1.4.0"}'
```

`container_exited`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_exited' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137,"oom":false}'
```

`container_oom`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_oom' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137}'
```

`container_disappeared`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_disappeared' \
  occurred_at_ms 1714081234567 \
  details '{}'
```

`inspect_unhealthy`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'inspect_unhealthy' \
  occurred_at_ms 1714081234567 \
  details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```

`probe_failed` (after the threshold is crossed):

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_failed' \
  occurred_at_ms 1714081234567 \
  details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```

`probe_recovered`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_recovered' \
  occurred_at_ms 1714081234567 \
  details '{"prior_failure_count":3}'
```

### `notification:intents` (RTM admin notifications)

RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message,
attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).

`runtime.image_pull_failed`:

```bash
redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "runtime.image_pull_failed",
    "producer": "rtmanager",
    "idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
    "audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
    "payload": {
      "game_id": "game-01HZ...",
      "image_ref": "galaxy/game:1.4.0",
      "error_code": "image_pull_failed",
      "error_message": "pull failed: manifest unknown",
      "attempted_at_ms": 1714081234567
    }
  }'
```

`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.

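In the example envelope, `idempotency_key` reads as the intent `type`, the `game_id`, and `attempted_at_ms` joined by colons. The helper below reproduces that layout for use in test fixtures. It is a sketch inferred from the single example above, not a documented format; `intent_idempotency_key` is a hypothetical name.

```bash
# Hypothetical builder for the key layout seen in the example envelope:
#   <type>:<game_id>:<attempted_at_ms>
# The layout is inferred from one example, not from a spec.
intent_idempotency_key() {
  printf '%s:%s:%s\n' "$1" "$2" "$3"
}
```

For instance, `intent_idempotency_key runtime.image_pull_failed "$game_id" 1714081234567` yields a key in the same shape as the one shown in the envelope.
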
## Storage Inspection

### Inspect a runtime record (PostgreSQL)

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```

Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).

### Inspect runtime status counts

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```

### Inspect the operation log for a game

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
   FROM rtmanager.operation_log
   WHERE game_id = 'game-01HZ...'
   ORDER BY started_at DESC, id DESC
   LIMIT 50"
```

### Inspect the latest health snapshot

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id, container_id, status, source, observed_at, details
   FROM rtmanager.health_snapshots
   WHERE game_id = 'game-01HZ...'"
```

### Inspect Redis runtime-coordination keys

```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...

# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50

# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```
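
The `TTL` reply for the per-game lease key needs a little interpretation: Redis returns `-2` when the key does not exist and `-1` when it exists without an expiry. The sketch below turns the reply into a readable status. `describe_lease_ttl` is a hypothetical helper, and treating `-1` as unexpected assumes every lease is written with a TTL, as `RTMANAGER_GAME_LEASE_TTL_SECONDS` suggests.

```bash
# Hypothetical helper: interpret a `redis-cli TTL` reply for a lease key.
# -2 and -1 are standard Redis TTL replies; the wording is illustrative.
describe_lease_ttl() {
  case "${1:-}" in
    -2) echo "no lease: no operation in flight" ;;
    -1) echo "lease has no expiry: unexpected for a TTL-bound lease" ;;
    ''|*[!0-9]*) echo "unrecognized TTL reply" ;;
    *)  echo "lease held: expires in ${1}s" ;;
  esac
}
```

It can be fed directly from the command shown above, for example `describe_lease_ttl "$(redis-cli TTL rtmanager:game_lease:game-01HZ...)"`.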