feat: runtime manager
# Configuration And Contract Examples

The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.

## Example `.env`

A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).

```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games

# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s

# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0

# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h

# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60

# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s

# Telemetry (disabled for local dev; enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
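
Before starting the service locally, it can help to confirm that the
variables under `# Required` are actually present in the environment.
The sketch below is a hypothetical helper (`missing_required_env` is
not part of the service); it assumes the variables have already been
loaded into the shell by whatever mechanism the deployment uses.

```bash
# Hypothetical helper, not part of the service: prints the names of the
# "Required" variables from the example above that are unset or empty,
# and returns non-zero when any are missing.
missing_required_env() {
  missing=""
  for name in \
    RTMANAGER_INTERNAL_HTTP_ADDR \
    RTMANAGER_POSTGRES_PRIMARY_DSN \
    RTMANAGER_REDIS_MASTER_ADDR \
    RTMANAGER_REDIS_PASSWORD \
    RTMANAGER_DOCKER_HOST \
    RTMANAGER_DOCKER_NETWORK \
    RTMANAGER_GAME_STATE_ROOT
  do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s\n' "$missing"
  [ -z "$missing" ]
}
```

Note that the DSN value contains an unquoted `&`, so sourcing the file
directly in a shell would not behave as intended; load it with a
dotenv-style loader or quote the value first.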

For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (this forces a pull on every
start, so a tag mutation is immediately visible to the next runtime),
set `RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and point `OTEL_*` at the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).

For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures that the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).

## Internal HTTP Examples

Every endpoint accepts the optional `X-Galaxy-Caller` header, which the
handler records as `op_source` in `operation_log` (`gm` → `gm_rest`,
`admin` → `admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
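
The header-to-`op_source` mapping described above can be sketched as a
small shell function. `op_source_for_caller` is a hypothetical name
used only for illustration, and exact, case-sensitive matching of the
header value is an assumption; the handler's real comparison rules are
not stated here.

```bash
# Illustrative only: mirrors the documented v1 mapping of the
# X-Galaxy-Caller header value to the op_source recorded in operation_log.
op_source_for_caller() {
  case "${1:-}" in
    gm) printf 'gm_rest\n' ;;
    # "admin", a missing header, and any unknown value all map to admin_rest.
    *)  printf 'admin_rest\n' ;;
  esac
}
```

For example, `op_source_for_caller gm` prints `gm_rest`, while an
empty or unrecognised value prints `admin_rest`.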

### Probe a runtime record

```bash
curl -s -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```

Response (`200 OK`):

```json
{
  "game_id": "game-01HZ...",
  "status": "running",
  "current_container_id": "1f2a...",
  "current_image_ref": "galaxy/game:1.4.0",
  "engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
  "state_path": "/var/lib/galaxy/games/game-01HZ...",
  "docker_network": "galaxy-net",
  "started_at": "2026-04-28T07:18:54Z",
  "stopped_at": null,
  "removed_at": null,
  "last_op_at": "2026-04-28T07:18:54Z",
  "created_at": "2026-04-28T07:18:54Z"
}
```

### List all runtimes

```bash
curl -s -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes
```

The response shape is `{"items":[<RuntimeRecord>...]}`.

### Start a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
  -d '{"image_ref": "galaxy/game:1.4.0"}'
```

A `200` returns the `RuntimeRecord` for the running runtime. Failure
responses use the canonical error envelope; for example, an invalid
`image_ref`:

```json
{
  "error": {
    "code": "start_config_invalid",
    "message": "image_ref shape rejected by docker reference parser"
  }
}
```

### Stop a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
  -d '{"reason": "admin_request"}'
```

Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
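
A caller script may want to reject an unsupported `reason` before
issuing the request. The following is a minimal sketch;
`is_valid_stop_reason` is a hypothetical helper, and the service
remains the authority on which values it accepts.

```bash
# Illustrative client-side check against the documented reason values.
# Returns 0 for a listed value and 1 for anything else.
is_valid_stop_reason() {
  case "${1:-}" in
    orphan_cleanup|cancelled|finished|admin_request|timeout) return 0 ;;
    *) return 1 ;;
  esac
}
```

A script could then guard the call with
`is_valid_stop_reason "$reason" || exit 1` before running `curl`.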

### Restart a runtime

```bash
curl -s -X POST \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```

The body is empty; restart reuses the current `image_ref`.

### Patch a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```

Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.
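
The patch rule can be approximated client-side to predict which error
code a request would receive. This is a sketch under stated
assumptions, not the service's validator: it treats the text after the
last `:` as the tag, accepts only plain `MAJOR.MINOR.PATCH` tags (no
`v` prefix, pre-release, or build metadata), and does not decide
whether patch downgrades are allowed. The function names are
hypothetical.

```bash
# Print the tag portion of an image reference (text after the last ':').
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Succeed when $1 is MAJOR.MINOR.PATCH with purely numeric components.
is_plain_semver() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# check_patch CURRENT_IMAGE_REF REQUESTED_IMAGE_REF
# Prints "ok" or the error code the documented rule would produce.
check_patch() {
  current_tag=$(image_tag "$1")
  requested_tag=$(image_tag "$2")
  if ! is_plain_semver "$requested_tag"; then
    printf 'image_ref_not_semver\n'
    return 1
  fi
  # ${tag%.*} strips the final ".PATCH", leaving "MAJOR.MINOR".
  if [ "${current_tag%.*}" != "${requested_tag%.*}" ]; then
    printf 'semver_patch_only\n'
    return 1
  fi
  printf 'ok\n'
}
```

With the values used in this section,
`check_patch galaxy/game:1.4.0 galaxy/game:1.4.2` prints `ok`, whereas
a requested tag of `1.5.0` prints `semver_patch_only`.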

### Cleanup a stopped runtime container

```bash
curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```

Cleanup refuses a `running` runtime with `409 conflict`; stop the
runtime first.

## Stream Payload Examples

Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).

### `runtime:start_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:start_jobs '*' \
  game_id 'game-01HZ...' \
  image_ref 'galaxy/game:1.4.0' \
  requested_at_ms 1714081234567
```

### `runtime:stop_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:stop_jobs '*' \
  game_id 'game-01HZ...' \
  reason 'cancelled' \
  requested_at_ms 1714081234567
```
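
When injecting a job by hand, `requested_at_ms` needs a current
Unix timestamp in milliseconds. A minimal sketch follows; `now_ms` is a
hypothetical helper, it relies on `date +%s` (widely available, though
not strictly POSIX), and it only has one-second resolution.

```bash
# Print the current Unix time in milliseconds (second resolution,
# padded to milliseconds by multiplying by 1000).
now_ms() {
  printf '%s\n' "$(( $(date +%s) * 1000 ))"
}
```

The result can be substituted for the literal timestamp, for example
`requested_at_ms "$(now_ms)"`.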

### `runtime:job_results` (RTM → Lobby)

Success envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code '' \
  error_message ''
```

Failure envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'failure' \
  container_id '' \
  engine_endpoint '' \
  error_code 'image_pull_failed' \
  error_message 'pull failed: manifest unknown'
```

Idempotent replay envelope (success outcome with explicit
`replay_no_op`):

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code 'replay_no_op' \
  error_message ''
```

The contract permits empty `container_id` and `engine_endpoint`
strings on every value of `outcome` so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).
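
The three envelopes above differ only in `outcome` and `error_code`,
so a consumer can classify them without inspecting the other fields.
The sketch below illustrates that idea; `classify_job_result` and its
output labels (`applied`, `failed:<code>`, `unknown`) are hypothetical
and are not part of the contract.

```bash
# classify_job_result OUTCOME ERROR_CODE
# Maps the two discriminating fields of a job_results entry to a label.
classify_job_result() {
  outcome=${1:-}
  error_code=${2:-}
  case "$outcome:$error_code" in
    success:)             printf 'applied\n' ;;
    success:replay_no_op) printf 'replay_no_op\n' ;;
    failure:*)            printf 'failed:%s\n' "$error_code" ;;
    # Any other combination is outside the examples shown above.
    *)                    printf 'unknown\n' ;;
  esac
}
```

For instance, `classify_job_result failure image_pull_failed` prints
`failed:image_pull_failed`.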

### `runtime:health_events` (RTM out)

The wire shape is the same for every event type; only the
`details` payload differs.

`container_started`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_started' \
  occurred_at_ms 1714081234567 \
  details '{"image_ref":"galaxy/game:1.4.0"}'
```

`container_exited`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_exited' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137,"oom":false}'
```

`container_oom`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_oom' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137}'
```

`container_disappeared`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_disappeared' \
  occurred_at_ms 1714081234567 \
  details '{}'
```

`inspect_unhealthy`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'inspect_unhealthy' \
  occurred_at_ms 1714081234567 \
  details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```

`probe_failed` (after the threshold is crossed):

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_failed' \
  occurred_at_ms 1714081234567 \
  details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```

`probe_recovered`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_recovered' \
  occurred_at_ms 1714081234567 \
  details '{"prior_failure_count":3}'
```
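
The relationship between `probe_failed`, `probe_recovered`, and
`RTMANAGER_PROBE_FAILURES_THRESHOLD` can be illustrated with a small
simulation. This is a sketch of one plausible reading, not the
worker's implementation: it assumes `probe_failed` fires once, when the
consecutive-failure count first reaches the threshold, and that
`probe_recovered` fires only on the first success after such an event,
reporting the full failure streak. `probe_events` is a hypothetical
name.

```bash
# probe_events THRESHOLD RESULT...
# Each RESULT is "ok" or anything else (treated as a failed probe).
# Prints the events the assumed rule would emit, one per line.
probe_events() {
  threshold=$1
  shift
  failures=0
  for result in "$@"; do
    if [ "$result" = ok ]; then
      # Recovery is reported only if a probe_failed was emitted earlier.
      if [ "$failures" -ge "$threshold" ]; then
        printf 'probe_recovered prior_failure_count=%s\n' "$failures"
      fi
      failures=0
    else
      failures=$((failures + 1))
      # Emit exactly once, at the moment the threshold is reached.
      if [ "$failures" -eq "$threshold" ]; then
        printf 'probe_failed consecutive_failures=%s\n' "$failures"
      fi
    fi
  done
}
```

Under these assumptions, `probe_events 3 fail fail ok fail` prints
nothing, because the streak never reaches the threshold.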

### `notification:intents` (RTM admin notifications)

RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message, attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).

`runtime.image_pull_failed`:

```bash
redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "runtime.image_pull_failed",
    "producer": "rtmanager",
    "idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
    "audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
    "payload": {
      "game_id": "game-01HZ...",
      "image_ref": "galaxy/game:1.4.0",
      "error_code": "image_pull_failed",
      "error_message": "pull failed: manifest unknown",
      "attempted_at_ms": 1714081234567
    }
  }'
```

`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.
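
In the example envelope, `idempotency_key` reads as
`<type>:<game_id>:<attempted_at_ms>`. That layout is inferred from the
single example above rather than stated by the contract, so treat the
following as an assumption-laden sketch; `intent_idempotency_key` is a
hypothetical helper.

```bash
# intent_idempotency_key TYPE GAME_ID ATTEMPTED_AT_MS
# Joins the three parts with ':' in the order seen in the example envelope.
intent_idempotency_key() {
  printf '%s:%s:%s\n' "$1" "$2" "$3"
}
```

Building the key from the same values that populate `payload` keeps
the two consistent when crafting a test envelope by hand.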

## Storage Inspection

### Inspect a runtime record (PostgreSQL)

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```

Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).

### Inspect runtime status counts

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```

### Inspect the operation log for a game

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
   FROM rtmanager.operation_log
   WHERE game_id = 'game-01HZ...'
   ORDER BY started_at DESC, id DESC
   LIMIT 50"
```

### Inspect the latest health snapshot

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id, container_id, status, source, observed_at, details
   FROM rtmanager.health_snapshots
   WHERE game_id = 'game-01HZ...'"
```

### Inspect Redis runtime-coordination keys

```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...

# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50

# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```