# Runtime and Components

The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.

```mermaid
flowchart LR
    subgraph Clients
        GM["Game Master"]
        Admin["Admin Service"]
        Lobby["Game Lobby"]
    end

    subgraph RTM["Runtime Manager process"]
        InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
        StartJobs["startjobsconsumer"]
        StopJobs["stopjobsconsumer"]
        DockerEvents["dockerevents listener"]
        HealthProbe["healthprobe worker"]
        DockerInspect["dockerinspect worker"]
        Reconcile["reconcile worker"]
        Cleanup["containercleanup worker"]
        Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
        IntentPublisher["notification:intents publisher"]
        ResultsPublisher["runtime:job_results publisher"]
        HealthPublisher["runtime:health_events publisher"]
        Telemetry["Logs, traces, metrics"]
    end

    Docker["Docker Daemon"]
    Engine["galaxy-game-{game_id} container"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases + offsets"]
    LobbyHTTP["Lobby internal HTTP"]

    Lobby -. runtime:start_jobs .-> StartJobs
    Lobby -. runtime:stop_jobs .-> StopJobs
    GM --> InternalHTTP
    Admin --> InternalHTTP

    StartJobs --> Services
    StopJobs --> Services
    InternalHTTP --> Services

    Services --> Docker
    Services --> Postgres
    Services --> Redis
    Services --> ResultsPublisher
    Services --> HealthPublisher
    Services --> IntentPublisher
    Services -. GET diagnostic .-> LobbyHTTP

    DockerEvents --> Docker
    DockerInspect --> Docker
    HealthProbe --> Engine
    Reconcile --> Docker
    Reconcile --> Postgres
    Cleanup --> Postgres
    Cleanup --> Services

    DockerEvents --> HealthPublisher
    DockerInspect --> HealthPublisher
    HealthProbe --> HealthPublisher

    HealthPublisher --> Redis
    ResultsPublisher --> Redis
    IntentPublisher --> Redis

    StartJobs --> Redis
    StopJobs --> Redis
    InternalHTTP --> Postgres

    Docker -->|create / start / stop / rm| Engine
    Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]

    InternalHTTP --> Telemetry
    Services --> Telemetry
    StartJobs --> Telemetry
    StopJobs --> Telemetry
    DockerEvents --> Telemetry
    HealthProbe --> Telemetry
    DockerInspect --> Telemetry
    Reconcile --> Telemetry
    Cleanup --> Telemetry
```

Notes:

- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
  goose migrations fail, when Redis ping fails, when the Docker daemon
  ping fails, or when the configured Docker network is missing. Lobby
  reachability is **not** verified at boot — the start service's
  diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
  outside of debug logging ([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup, before
  `app.App.Run` registers any other component, then re-runs
  periodically as a regular `Component`. The synchronous pass
  guarantees that the events listener never observes an orphaned
  container from a prior process without a matching PG record
  ([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
  (`/healthz`, `/readyz`) and the trusted REST surface for Game Master
  and Admin Service. There is no public listener — RTM does not face
  end users.

## Listeners

| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |

Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:

- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`

The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source` ([`services.md` §18](services.md)).

Probe routes:

- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
  the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
  Docker daemon, then asserts the configured Docker network exists.
  Returns `{"status":"ready"}` only when every check passes; otherwise
  returns `503` with the canonical error envelope.

## Background Workers

Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).

| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |

The events listener and the inspect worker do **not** emit
`container_started` — that event is owned by the start service
([`workers.md` §1](workers.md)). Nor do they emit
`container_disappeared` autonomously when a record is missing or
stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).

## Lifecycle Services

The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.

| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |

All services share three behaviours captured in
[`services.md`](services.md):

- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
  TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
  not by the caller — which keeps consumer and REST callers symmetric
  ([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
  `ContainerID`, `EngineEndpoint`) is what consumers and REST
  handlers translate into job_results / HTTP responses
  ([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
  and three of the failure codes (`start_config_invalid`,
  `image_pull_failed`, `container_start_failed`) also publish a
  `runtime.*` admin notification intent
  ([`services.md` §4](services.md)).

## Synchronous Upstream Client

| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |

The Lobby client is the only synchronous outbound transport RTM
holds. Every other interaction (Notification Service, Game Master,
Admin Service) crosses an asynchronous boundary or is initiated by
the peer.

## Stream Offsets

Each consumer persists its position under a fixed label so a process
restart preserves stream progress.

| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |

The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).

The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume
them itself.

## Configuration Groups

The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:

- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
  `RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
  `RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
  `RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
  `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
  `RTMANAGER_CONTAINER_RETENTION_DAYS`,
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
  `RTMANAGER_ENGINE_STATE_ENV_NAME`,
  `RTMANAGER_GAME_STATE_DIR_MODE`,
  `RTMANAGER_GAME_STATE_OWNER_UID`,
  `RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
  `RTMANAGER_POSTGRES_REPLICA_DSNS`,
  `RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
  `RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
  `RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
  `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
  `RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
  `RTMANAGER_REDIS_STOP_JOBS_STREAM`,
  `RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
  `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
  `RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
  `RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
  `RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
  `RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
  `RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
  `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
  `RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.

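As an illustration, a required-only environment for a local run might look like this; every value below is a placeholder chosen for the example, not a documented default.

```shell
# Required settings — rtmanager refuses to boot without these.
# All values here are illustrative placeholders.
export RTMANAGER_INTERNAL_HTTP_ADDR=":8096"
export RTMANAGER_POSTGRES_PRIMARY_DSN="postgres://rtmanager:secret@127.0.0.1:5432/galaxy?sslmode=disable"
export RTMANAGER_REDIS_MASTER_ADDR="127.0.0.1:6379"
export RTMANAGER_REDIS_PASSWORD="secret"
export RTMANAGER_DOCKER_HOST="unix:///var/run/docker.sock"
export RTMANAGER_DOCKER_NETWORK="galaxy-net"
export RTMANAGER_GAME_STATE_ROOT="/var/lib/galaxy/game-state"
```

All other groups fall back to the defaults documented in `../README.md`.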
## Runtime Notes

- **Single-instance v1.** Multi-instance Runtime Manager with Redis
  Streams consumer groups is explicitly out of scope for the current
  iteration. The per-game lease serialises operations on one game
  across the consumer + REST entry points; cross-instance
  coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
  `SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
  (default `60s`). The lease is **not renewed mid-operation** in v1;
  long pulls of multi-GB images can therefore expire the lease
  before the operation finishes — the trade-off is documented in
  [`services.md` §1](services.md). The reconciler honours the same
  lease around every drift mutation ([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
  reconcile mutation appends one row to `rtmanager.operation_log`.
  The `runtime:health_events` stream and the `notification:intents`
  emissions are best-effort — a publish failure logs at `Error` and
  proceeds, never rolling back the recorded operation
  ([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
  per-game `consecutiveFailures` and `failurePublished` counters in a
  mutex-guarded map. State is non-persistent: a process restart that
  loses the counters re-establishes hysteresis from scratch, and
  state for a game that transitions through `stopped → running` is
  pruned at the start of every probe tick
  ([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
  `if_missing` (default), `always`, and `never`. Image labels
  (`com.galaxy.cpu_quota`, `com.galaxy.memory`,
  `com.galaxy.pids_limit`) drive resource limits when present; the
  matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
  label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
  directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
  mode and uid/gid, but **never deletes them**. Removing the directory
  is operator domain (backup tooling, a future Admin Service
  workflow). A cleanup that removes the container leaves the
  directory intact.
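The probe hysteresis described above can be sketched as a small per-game state machine; the `prober` type and its method names are hypothetical (and the mutex is omitted for brevity), but the threshold and recovery transitions follow the documented rules.

```go
package main

import "fmt"

// probeState is the in-memory hysteresis record kept per game. It is lost on
// restart, matching the non-persistent behaviour described above.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// prober tracks hysteresis state; the real worker guards the map with a
// mutex, omitted here for brevity.
type prober struct {
	threshold int
	states    map[string]*probeState
}

// observe feeds one probe result and returns the event to publish, if any:
// "probe_failed" after `threshold` consecutive failures, "probe_recovered"
// on the first success after a published failure, "" otherwise.
func (p *prober) observe(gameID string, ok bool) string {
	st := p.states[gameID]
	if st == nil {
		st = &probeState{}
		p.states[gameID] = st
	}
	if ok {
		st.consecutiveFailures = 0
		if st.failurePublished {
			st.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	st.consecutiveFailures++
	if st.consecutiveFailures >= p.threshold && !st.failurePublished {
		st.failurePublished = true
		return "probe_failed"
	}
	return ""
}

func main() {
	p := &prober{threshold: 3, states: map[string]*probeState{}}
	for _, ok := range []bool{false, false, false, true} {
		if ev := p.observe("g1", ok); ev != "" {
			fmt.Println(ev) // probe_failed, then probe_recovered
		}
	}
}
```

The `failurePublished` flag is what makes the emission edge-triggered: a runtime that stays down produces exactly one `probe_failed`, not one per tick.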