feat: runtime manager

Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
@@ -0,0 +1,429 @@
# Configuration And Contract Examples
The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.
## Example `.env`
A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).
```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0
# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h
# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s
# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
a tag mutation is immediately visible to the next runtime),
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and configure `OTEL_*` against the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).
For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).
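Before a local run it can help to confirm that the variables grouped under `# Required` in the example `.env` are actually set. The function below is a minimal sketch under that assumption; `missing_required_vars` is a hypothetical helper, not part of the service, and the authoritative list of required settings is the README's Configuration section.

```bash
# Hypothetical helper, not part of rtmanager: prints the name of every
# variable from the "# Required" group of the example .env that is
# unset or empty in the current shell.
missing_required_vars() {
  for name in \
    RTMANAGER_INTERNAL_HTTP_ADDR \
    RTMANAGER_POSTGRES_PRIMARY_DSN \
    RTMANAGER_REDIS_MASTER_ADDR \
    RTMANAGER_REDIS_PASSWORD \
    RTMANAGER_DOCKER_HOST \
    RTMANAGER_DOCKER_NETWORK \
    RTMANAGER_GAME_STATE_ROOT
  do
    # Indirect variable lookup that works in plain POSIX sh.
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
  return 0
}
```

After exporting the `.env` into the shell, an empty result from `missing_required_vars` means every variable in that group has a value.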
## Internal HTTP Examples
Every endpoint accepts the optional `X-Galaxy-Caller` header, which the
handler records as `op_source` in `operation_log` (`gm` → `gm_rest`,
`admin` → `admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
### Probe a runtime record
```bash
curl -s -H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```
Response (`200 OK`):
```json
{
  "game_id": "game-01HZ...",
  "status": "running",
  "current_container_id": "1f2a...",
  "current_image_ref": "galaxy/game:1.4.0",
  "engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
  "state_path": "/var/lib/galaxy/games/game-01HZ...",
  "docker_network": "galaxy-net",
  "started_at": "2026-04-28T07:18:54Z",
  "stopped_at": null,
  "removed_at": null,
  "last_op_at": "2026-04-28T07:18:54Z",
  "created_at": "2026-04-28T07:18:54Z"
}
```
### List all runtimes
```bash
curl -s -H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes
```
The response shape is `{"items":[<RuntimeRecord>...]}`.
### Start a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
-d '{"image_ref": "galaxy/game:1.4.0"}'
```
A `200` response returns the `RuntimeRecord` for the running runtime.
Failures use the canonical error envelope, for example an invalid
`image_ref`:
```json
{
  "error": {
    "code": "start_config_invalid",
    "message": "image_ref shape rejected by docker reference parser"
  }
}
```
### Stop a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
-d '{"reason": "admin_request"}'
```
Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
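A caller script can reject an unsupported `reason` before sending the request. The check below is a sketch based only on the value list above; `is_valid_stop_reason` is a hypothetical helper, and the service remains the authority on what it accepts.

```bash
# Hypothetical helper: succeeds (exit 0) only for the documented
# stop reasons, fails (exit 1) for anything else.
is_valid_stop_reason() {
  case "${1:-}" in
    orphan_cleanup|cancelled|finished|admin_request|timeout) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `is_valid_stop_reason "$reason" || echo "unsupported reason: $reason" >&2`.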
### Restart a runtime
```bash
curl -s -X POST \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```
The body is empty; restart re-uses the current `image_ref`.
### Patch a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.
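The patch rule can be approximated client-side as follows. This is a sketch under stated assumptions, not the service's implementation: it treats the text after the last `:` as the tag, accepts only a strict `MAJOR.MINOR.PATCH` numeric form (no pre-release or build suffixes, no digests), and compares the `MAJOR.MINOR` prefix as a string. The function names are hypothetical; the error codes are the ones documented above.

```bash
# Hypothetical helper: prints the tag of an image ref if it is a strict
# MAJOR.MINOR.PATCH version, otherwise fails.
semver_tag() {
  tag=${1##*:}
  printf '%s\n' "$tag" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' || return 1
  printf '%s\n' "$tag"
}

# Hypothetical helper: $1 is the current image_ref, $2 the patch target.
# Prints "ok" or one of the documented error codes.
check_patch_target() {
  cur=$(semver_tag "$1") || { echo image_ref_not_semver; return 1; }
  new=$(semver_tag "$2") || { echo image_ref_not_semver; return 1; }
  # Strip the PATCH component and compare MAJOR.MINOR.
  if [ "${cur%.*}" != "${new%.*}" ]; then
    echo semver_patch_only
    return 1
  fi
  echo ok
}
```

With the refs used in this section, `check_patch_target galaxy/game:1.4.0 galaxy/game:1.4.2` prints `ok`, while a target of `galaxy/game:1.5.0` prints `semver_patch_only`.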
### Cleanup a stopped runtime container
```bash
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```
Cleanup refuses a `running` runtime with `409 conflict`; stop the
runtime first.
## Stream Payload Examples
Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).
### `runtime:start_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:start_jobs '*' \
game_id 'game-01HZ...' \
image_ref 'galaxy/game:1.4.0' \
requested_at_ms 1714081234567
```
### `runtime:stop_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:stop_jobs '*' \
game_id 'game-01HZ...' \
reason 'cancelled' \
requested_at_ms 1714081234567
```
### `runtime:job_results` (RTM → Lobby)
Success envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code '' \
error_message ''
```
Failure envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'failure' \
container_id '' \
engine_endpoint '' \
error_code 'image_pull_failed' \
error_message 'pull failed: manifest unknown'
```
Idempotent replay envelope (success outcome with an explicit
`replay_no_op` error code):
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code 'replay_no_op' \
error_message ''
```
The contract permits empty `container_id` and `engine_endpoint`
strings on every value of `outcome` so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).
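A consumer can tell the three envelope shapes apart from `outcome` and `error_code` alone. The function below sketches that decision; the labels `applied`, `replay`, and `failed` are hypothetical names chosen for this illustration, and the AsyncAPI contract remains authoritative for the field semantics.

```bash
# Hypothetical helper: $1 is the outcome field, $2 the error_code field
# (may be empty). Prints an illustrative label for the envelope shape.
classify_job_result() {
  outcome=${1:-}
  error_code=${2:-}
  case "$outcome:$error_code" in
    success:)             echo applied ;;  # plain success
    success:replay_no_op) echo replay ;;   # idempotent replay
    failure:*)            echo failed ;;   # error_code carries the cause
    *)                    echo unknown ;;  # shape not shown in the examples
  esac
}
```

For the failure example above, `classify_job_result failure image_pull_failed` prints `failed`.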
### `runtime:health_events` (RTM out)
The wire shape is the same for every event type — only the
`details` payload differs.
`container_started`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_started' \
occurred_at_ms 1714081234567 \
details '{"image_ref":"galaxy/game:1.4.0"}'
```
`container_exited`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_exited' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137,"oom":false}'
```
`container_oom`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_oom' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137}'
```
`container_disappeared`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_disappeared' \
occurred_at_ms 1714081234567 \
details '{}'
```
`inspect_unhealthy`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'inspect_unhealthy' \
occurred_at_ms 1714081234567 \
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```
`probe_failed` (after the threshold is crossed):
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_failed' \
occurred_at_ms 1714081234567 \
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```
`probe_recovered`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_recovered' \
occurred_at_ms 1714081234567 \
details '{"prior_failure_count":3}'
```
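Since only `details` varies between event types, a quick lookup of the keys each example carries can be handy when reading the stream. The table below is derived solely from the examples in this section; `details_keys_for` is a hypothetical helper, and `runtime-health-asyncapi.yaml` is the authoritative source for required and optional fields.

```bash
# Hypothetical helper: prints the space-separated `details` keys that
# appear in the example for the given event_type; fails for event types
# that have no example in this section.
details_keys_for() {
  case "${1:-}" in
    container_started)     echo "image_ref" ;;
    container_exited)      echo "exit_code oom" ;;
    container_oom)         echo "exit_code" ;;
    container_disappeared) echo "" ;;
    inspect_unhealthy)     echo "restart_count state health" ;;
    probe_failed)          echo "consecutive_failures last_status last_error" ;;
    probe_recovered)       echo "prior_failure_count" ;;
    *)                     return 1 ;;
  esac
}
```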
### `notification:intents` (RTM admin notifications)
RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message,
attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
`runtime.image_pull_failed`:
```bash
redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "runtime.image_pull_failed",
    "producer": "rtmanager",
    "idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
    "audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
    "payload": {
      "game_id": "game-01HZ...",
      "image_ref": "galaxy/game:1.4.0",
      "error_code": "image_pull_failed",
      "error_message": "pull failed: manifest unknown",
      "attempted_at_ms": 1714081234567
    }
  }'
```
`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.
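The `idempotency_key` in the example appears to follow the pattern `<type>:<game_id>:<attempted_at_ms>`. That pattern is inferred from the single example above and is not confirmed by the contract text quoted here, so treat the sketch below as an assumption; `notification_idempotency_key` is a hypothetical helper.

```bash
# Hypothetical helper: joins type, game_id, and attempted_at_ms with
# colons, matching the shape of the example idempotency_key.
notification_idempotency_key() {
  printf '%s:%s:%s\n' "$1" "$2" "$3"
}
```

Usage: `notification_idempotency_key runtime.image_pull_failed "$game_id" "$attempted_at_ms"`.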
## Storage Inspection
### Inspect a runtime record (PostgreSQL)
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```
Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).
### Inspect runtime status counts
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```
### Inspect the operation log for a game
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = 'game-01HZ...'
ORDER BY started_at DESC, id DESC
LIMIT 50"
```
### Inspect the latest health snapshot
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, container_id, status, source, observed_at, details
FROM rtmanager.health_snapshots
WHERE game_id = 'game-01HZ...'"
```
### Inspect Redis runtime-coordination keys
```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...
# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50
# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```