# Operator Runbook

This runbook covers the checks that matter most during startup,
steady-state readiness, shutdown, and the handful of recovery paths
specific to Runtime Manager.

## Startup Checks

Before starting the process, confirm:

- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
  reaches a Docker daemon the operator controls. RTM is the only
  Galaxy service permitted to interact with the Docker socket;
  scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
  user-defined bridge network that has already been created (e.g.
  via `docker network create galaxy-net` in the environment's
  bootstrap script). RTM **validates** the network at startup but
  never creates it. A missing network is fail-fast: the process
  exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
  can read and write. Per-game subdirectories are created with
  `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
  `RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
  uid/gid to match the engine container's user when running with a
  non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
  that hosts the `rtmanager` schema. The DSN must include
  `search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
  for production). Embedded goose migrations apply at startup before
  any HTTP listener opens; a migration or ping failure terminates the
  process with a non-zero exit. The `rtmanager` schema and the
  matching `rtmanagerservice` role are provisioned externally
  ([`postgres-migration.md` §1](postgres-migration.md)).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
  the Redis deployment used for the runtime-coordination state:
  stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
  publishers (`runtime:job_results`, `runtime:health_events`,
  `notification:intents`), persisted offsets, and the per-game
  lease. RTM does not maintain durable business state in Redis.
- Stream names match the producers and consumers RTM integrates with:
  - `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
  - `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
  - `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
  - `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
  - `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
  HTTP listener. RTM's start service issues a diagnostic
  `GET /api/v1/internal/games/{game_id}` per start; failure is logged
  at debug level and does not abort the start
  ([`services.md` §7](services.md)).
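The variable checklist above can be wrapped in a small preflight sketch. The helper below is hypothetical (not part of RTM): it only verifies the variables are set, and optionally inspects the Docker network when `docker` is installed and `PREFLIGHT_LIVE=1`.

```shell
# Hypothetical preflight helper -- not shipped with RTM; adjust to your deployment.

# require_env VAR: fail with a message when VAR is empty or unset.
require_env() {
  eval "_val=\${$1:-}"
  if [ -z "$_val" ]; then
    echo "missing: $1" >&2
    return 1
  fi
}

# preflight: check every variable from the startup checklist.
preflight() {
  _rc=0
  for _var in \
    RTMANAGER_DOCKER_HOST \
    RTMANAGER_DOCKER_NETWORK \
    RTMANAGER_GAME_STATE_ROOT \
    RTMANAGER_POSTGRES_PRIMARY_DSN \
    RTMANAGER_REDIS_MASTER_ADDR \
    RTMANAGER_REDIS_PASSWORD \
    RTMANAGER_LOBBY_INTERNAL_BASE_URL
  do
    require_env "$_var" || _rc=1
  done
  # Optional live check: the network must already exist (RTM never creates it).
  if [ "$_rc" -eq 0 ] && [ "${PREFLIGHT_LIVE:-}" = "1" ]; then
    docker network inspect "$RTMANAGER_DOCKER_NETWORK" >/dev/null || _rc=1
  fi
  return "$_rc"
}
```

Run it before launching the process (e.g. `preflight || exit 1`) once the variable list matches your environment.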

The startup sequence runs in the order recorded in
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):

1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs **once synchronously** and blocks until done.
7. Background workers start.

A failure at any step is fatal. The synchronous reconciler pass is
the reason orphaned containers from a prior process never reach the
periodic workers in an inconsistent state
([`workers.md` §17](workers.md)).

Expected log lines on a healthy boot:

- `migrations applied`,
- `postgres ping ok`,
- `redis ping ok`,
- `docker ping ok` and `docker network found`,
- `telemetry exporter started`,
- `internal http listening`,
- `reconciler initial pass completed`,
- one `worker started` entry per background worker (seven expected).
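Assuming the boot logs were captured to a file, the markers above can be checked mechanically. A sketch (the marker strings are taken verbatim from the list; adapt if your log sink rewrites messages):

```shell
# boot_ok LOGFILE: succeed only when every healthy-boot marker is present
# and exactly seven `worker started` entries were logged.
boot_ok() {
  for _marker in \
    'migrations applied' \
    'postgres ping ok' \
    'redis ping ok' \
    'docker ping ok' \
    'docker network found' \
    'telemetry exporter started' \
    'internal http listening' \
    'reconciler initial pass completed'
  do
    grep -q "$_marker" "$1" || { echo "missing log marker: $_marker" >&2; return 1; }
  done
  [ "$(grep -c 'worker started' "$1")" -eq 7 ]
}
```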

## Readiness

Use the probes according to what they actually verify:

- `GET /healthz` confirms the listener is alive — no dependency
  check.
- `GET /readyz` live-pings the PostgreSQL primary, the Redis master,
  and the Docker daemon, then asserts the configured Docker network
  exists. It returns `{"status":"ready"}` when every check passes;
  otherwise it returns `503` with the canonical
  `{"error":{"code":"service_unavailable","message":"…"}}` envelope
  identifying the first failing dependency.
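A sed-only sketch (no `jq` dependency) for pulling the failing dependency's message out of that envelope; it assumes the `message` value contains no escaped quotes:

```shell
# failing_dep: read a /readyz error envelope on stdin, print the message.
failing_dep() {
  sed -n 's/.*"message":"\([^"]*\)".*/\1/p'
}
# Usage: curl -s http://<rtm-host>:8096/readyz | failing_dep
```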

`/readyz` is the strongest readiness signal RTM exposes; unlike
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
Each request hits the daemon and the database fresh.

For a practical readiness check in production:

1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz`;
3. verify the `rtmanager.runtime_records_by_status{status="running"}`
   gauge tracks the expected live game count after the first start
   completes;
4. verify the `rtmanager.docker_op_latency` histograms have at least
   one sample after the first lifecycle operation.
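Step 2 can be automated with a polling sketch; the host and port are assumptions carried over from the examples elsewhere in this runbook, and the metric checks (steps 3–4) stay with your telemetry backend:

```shell
# wait_ready URL [TIMEOUT_SECONDS]: poll until the probe body says ready.
wait_ready() {
  _deadline=$(( $(date +%s) + ${2:-60} ))
  while [ "$(date +%s)" -lt "$_deadline" ]; do
    _body=$(curl -fsS "$1" 2>/dev/null) && \
      [ "$_body" = '{"status":"ready"}' ] && return 0
    sleep 1
  done
  echo "not ready after ${2:-60}s" >&2
  return 1
}
# Usage: wait_ready "http://<rtm-host>:8096/readyz" 120
```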

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behaviour:

- the per-component shutdown budget is controlled by
  `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest
  offset before returning; the offset survives the restart
  ([`workers.md` §9](workers.md));
- the Docker events listener cancels its subscription;
- the in-flight services release their per-game lease through the
  surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at
  the next lease re-acquisition.

During planned restarts:

1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any consumer that was mid-cycle to retry from the persisted
   offset on the next process start;
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
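A restart-wrapper sketch for steps 1–4. How the pid is discovered is deployment-specific and assumed here; the budget mirrors `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`):

```shell
# graceful_stop PID [BUDGET_SECONDS]: send SIGTERM, then wait up to the budget.
graceful_stop() {
  _budget=${2:-30}
  kill -TERM "$1" 2>/dev/null || return 0   # already gone
  _waited=0
  while kill -0 "$1" 2>/dev/null; do
    if [ "$_waited" -ge "$_budget" ]; then
      echo "shutdown exceeded ${_budget}s -- investigate" >&2
      return 1
    fi
    sleep 1
    _waited=$((_waited + 1))
  done
}
```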

## Engine Container Died

A running engine container that exits unexpectedly surfaces through
three observation channels:

- The Docker events listener emits `container_exited` (non-zero exit
  code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the
  threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the
  engine restarts under Docker's healthcheck or if Docker reports an
  unexpected status.

Triage:

1. Inspect the `runtime:health_events` stream for the affected
   `game_id` and `event_type`:

   ```bash
   redis-cli XRANGE runtime:health_events - + COUNT 200 \
     | grep -A4 'game_id\s*<game_id>'
   ```

2. Read the runtime record and the operation log:

   ```bash
   curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
   psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
     "SELECT id, op_kind, op_source, outcome, error_code, started_at
        FROM rtmanager.operation_log
       WHERE game_id = '<game_id>'
       ORDER BY started_at DESC LIMIT 20"
   ```

3. If Lobby has not reacted (the game's status remains `running` in
   `lobby.games`), check `runtime:job_results` lag and Lobby's
   `runtimejobresult` worker. RTM publishes the result; Lobby is the
   consumer.
4. If the container is already gone (`docker ps -a` shows no row for
   `galaxy-game-<game_id>`), the reconciler will move the record to
   `removed` on its next pass. Triggering the periodic reconcile
   manually (e.g. by sending `SIGHUP`) is **not** supported — wait
   `RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the
   process; the synchronous boot pass will handle the drift.
5. The `notification:intents` stream is **not** the place to look
   for ongoing health changes. Only the three first-touch start
   failures (`runtime.image_pull_failed`,
   `runtime.container_start_failed`,
   `runtime.start_config_invalid`) produce a notification intent;
   probe failures, OOMs, and exits flow through health events only
   ([`../README.md` §Notification Contracts](../README.md#notification-contracts)).

## Patch Upgrade

A patch upgrade replaces the container with a new `image_ref` while
preserving the bind-mounted state directory.

Pre-conditions:

- The new and current `image_ref` tags both parse as semver. RTM
  rejects non-semver tags with `image_ref_not_semver`.
- The new and current major / minor versions match. A cross-major or
  cross-minor patch returns `semver_patch_only`.
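Both gates can be pre-checked locally before calling the endpoint. A sketch, assuming `name:MAJOR.MINOR.PATCH` image refs as in the example below (RTM's own validation remains authoritative):

```shell
# patch_allowed CURRENT_REF NEW_REF: mirror the semver patch-only gate.
patch_allowed() {
  _cur="${1##*:}"   # tag after the last ':'
  _new="${2##*:}"
  # Both tags must look like MAJOR.MINOR.PATCH ...
  for _tag in "$_cur" "$_new"; do
    case "$_tag" in
      *[!0-9.]*|*..*|.*|*.) return 1 ;;   # crude shape check: digits and dots only
    esac
    [ "$(echo "$_tag" | tr -dc . | wc -c)" -eq 2 ] || return 1
  done
  # ... and share MAJOR.MINOR.
  [ "${_cur%.*}" = "${_new%.*}" ]
}
```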

Driving the upgrade:

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```

Behaviour:

- The container is stopped, removed, and recreated. The
  `current_container_id` changes; the `engine_endpoint`
  (`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any
  data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and
  the old / new image refs.
- A `runtime:health_events container_started` event is emitted by the
  inner start ([`workers.md` §1](workers.md)).

Post-patch verification:

```bash
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```

The `current_image_ref` field on the runtime record reflects the new
tag.
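A grep-based sketch for asserting that the record picked up the new tag. It assumes the record serializes `current_image_ref` as a plain JSON string field; check the actual response shape first:

```shell
# image_ref_is URL EXPECTED_REF: succeed when the runtime record at URL
# reports the expected current_image_ref.
image_ref_is() {
  curl -fsS "$1" | grep -q "\"current_image_ref\"[[:space:]]*:[[:space:]]*\"$2\""
}
# Usage: image_ref_is "http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>" galaxy/game:1.4.2
```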

## Manual Cleanup

The cleanup endpoint removes the container and updates the record to
`removed`. It refuses to remove a `running` container — stop first.

```bash
# Stop, then clean up
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
  -d '{"reason":"admin_request"}'

curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```

The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
is **never** deleted by RTM. Removing the directory is operator
domain (backup tooling, future Admin Service workflow). The
`operation_log` records `op_kind=cleanup_container` with
`op_source=admin_rest`.
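When the directory does need to go, an archive-first sketch keeps a restore path. The archive location and naming are assumptions; this is operator tooling, not RTM behaviour:

```shell
# archive_state STATE_ROOT GAME_ID DEST_DIR: tar up a game's state directory
# before any manual deletion.
archive_state() {
  [ -d "$1/$2" ] || { echo "no state dir: $1/$2" >&2; return 1; }
  tar -C "$1" -czf "$3/$2-$(date -u +%Y%m%dT%H%M%SZ).state.tar.gz" "$2"
}
# Usage: archive_state "$RTMANAGER_GAME_STATE_ROOT" <game_id> /var/backups/galaxy
```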

## Reconcile Drift After Docker Daemon Restart

A Docker daemon restart drops every running engine container; the
PostgreSQL records remain. On RTM's next boot (or its next periodic
reconcile):

1. The reconciler observes `running` records whose containers are
   missing from `docker ps`. It updates each record to `removed`,
   appends an `operation_log` row with `op_kind=reconcile_dispose`,
   and publishes `runtime:health_events container_disappeared`
   ([`workers.md` §14–§15](workers.md)).
2. Lobby's `runtimejobresult` worker does not consume the dispose
   event in v1, so the cascade does not auto-restart the engine.
   Operators trigger restarts through Lobby's user-facing flow or
   directly via the GM/Admin REST `restart` endpoint.
3. If the operator brings up an engine container manually for
   diagnostics (`docker run` with the
   `com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
   the reconciler **adopts** it on the next pass: a new
   `runtime_records` row appears with `op_kind=reconcile_adopt`.
   The reconciler **never stops or removes** an unrecorded
   container — operators stay in control of manual containers
   ([`../README.md` §Reconciliation](../README.md#reconciliation)).
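For step 3, a manual diagnostic container only needs the two ownership labels to be adopted. In the template below, only the labels and the `galaxy-game-<game_id>` name come from this runbook; the image tag and mount path are illustrative assumptions:

```sh
# Template, not a runnable command: fill in <game_id> and <tag> first.
docker run -d \
  --name "galaxy-game-<game_id>" \
  --network "$RTMANAGER_DOCKER_NETWORK" \
  --label com.galaxy.owner=rtmanager \
  --label "com.galaxy.game_id=<game_id>" \
  -v "$RTMANAGER_GAME_STATE_ROOT/<game_id>:/data" \
  galaxy/game:<tag>
```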

Three drift kinds run through the same lease-guarded write pass:
`adopt`, `dispose`, and the README-level path
`observed_exited` (a record marked `running` whose container exists
but is in `exited`). The telemetry counter
`rtmanager.reconcile_drift{kind}` exposes the three independently
([`workers.md` §15](workers.md)).

## Testing Locally

```sh
# One-time bootstrap
docker network create galaxy-net

# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095

go run ./rtmanager/cmd/rtmanager
```

After start:

- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
  once the PostgreSQL, Redis, and Docker pings pass and the
  configured network exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
  brings up `galaxy-game-<game_id>` containers; RTM logs each
  lifecycle transition.

The integration suite under `rtmanager/integration/` exercises the
end-to-end flows against the real Docker daemon. The default
`go test ./...` skips it via the `integration` build tag; run it
explicitly with:

```sh
make -C rtmanager integration
```

The suite requires a reachable Docker daemon. Without one, the
harness helpers call `t.Skip` and the package becomes a no-op
([`integration-tests.md` §1](integration-tests.md)).

## Diagnostic Queries

Durable runtime state lives in PostgreSQL; runtime-coordination state
stays in Redis. CLI snippets that help during incidents:

```bash
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"

# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"

# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
     FROM rtmanager.operation_log
    WHERE game_id = '<game_id>'
    ORDER BY started_at DESC, id DESC
    LIMIT 20"

# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"

# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
  --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'

# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100

# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```
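For a quick numeric read on consumer lag, the millisecond prefixes of the stream's `last-generated-id` and the persisted offset can be compared directly. A sketch (Redis stream IDs are `<ms>-<seq>`; this ignores the sequence part):

```shell
# lag_ms LAST_GENERATED_ID PERSISTED_OFFSET_ID: approximate lag in milliseconds
# between the newest stream entry and the consumer's persisted offset.
lag_ms() {
  echo $(( ${1%%-*} - ${2%%-*} ))
}
# Example: lag_ms 1700000005000-0 1700000000000-3  ->  5000
```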

The gauges and counters surfaced through OpenTelemetry are the
primary observability surface; raw PostgreSQL and Redis access is
for last-resort triage.