# Runtime Manager
`Runtime Manager` (RTM) is the only Galaxy platform service permitted to interact with the
Docker daemon. It owns the lifecycle of `galaxy/game` engine containers and the technical
runtime view of running games. Other services consume RTM via two transports: an asynchronous
Redis Streams contract (used by `Game Lobby`) and a synchronous internal REST surface (used by
`Game Master` and `Admin Service`).
## References

- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §9 Runtime Manager.
- [`../TESTING.md`](../TESTING.md) §7 — testing matrix for RTM.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry point.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — REST contract.
- [`./api/runtime-jobs-asyncapi.yaml`](./api/runtime-jobs-asyncapi.yaml) — start/stop job
  streams contract.
- [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml) —
  `runtime:health_events` stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract (env, ports,
  `/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with RTM.
## Purpose

A running Galaxy game lives in exactly one Docker container. The platform must be able to:

- create the container with the right engine version and configuration;
- supply the engine with a stable storage location for game state;
- keep the runtime status visible to platform-level services;
- replace the container in place for patch upgrades and restarts;
- remove containers that are no longer needed;
- detect and surface engine failures to whoever should react.

`Runtime Manager` is the single component that performs these actions. It deliberately does
**not** reason about platform metadata, membership, schedules, turn cutoffs, or any other
business state. Game Lobby owns platform metadata; Game Master will own runtime business state
when implemented.
## Scope

`Runtime Manager` is the source of truth for:

- the mapping `game_id -> current_container_id` for every running container;
- the durable history of every start, stop, restart, patch, and cleanup operation it performed;
- the most recent technical health observation per game (last Docker event, last successful or
  failed probe, last inspect result).

`Runtime Manager` is not the source of truth for:

- any business or platform-level metadata of a game (owned by `Game Lobby`);
- runtime state visible to players or operators as game state, including current turn and
  generation status (owned by `Game Master`);
- the engine version catalogue or which engine version a game is allowed to use (`Game Master`
  is the future owner; `Game Lobby` supplies `image_ref` in v1);
- contents of the engine state directory; that is engine domain;
- backup, archival, or operator cleanup of state directories.
## Non-Goals

- Multi-instance operation in v1. Coordination is single-process; multiple replicas are an
  explicit future iteration.
- Engine version arbitration. The producer (`Game Lobby` in v1, `Game Master` later) supplies
  `image_ref`.
- Image registry control. Pull policy is configurable, but RTM does not push, retag, or
  promote images.
- TLS or mTLS on the internal listener. RTM trusts its network segment.
- Direct delivery of player-visible push notifications. RTM publishes admin-only notification
  intents only for failures invisible elsewhere; everything else is delegated.
- Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached
  through `unix:///var/run/docker.sock`.
## Position in the System

```mermaid
flowchart LR
    Lobby["Game Lobby"]
    GM["Game Master"]
    Admin["Admin Service"]
    Notify["Notification Service"]
    RTM["Runtime Manager"]
    Engine["Game Engine container"]
    Docker["Docker Daemon"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases"]

    Lobby -->|runtime:start_jobs / stop_jobs| RTM
    RTM -->|runtime:job_results| Lobby
    GM -->|internal REST| RTM
    Admin -->|internal REST| RTM
    RTM -->|notification:intents (admin)| Notify
    RTM -->|runtime:health_events| Redis
    RTM <--> Docker
    Docker -->|create / start / stop / rm| Engine
    RTM --> Postgres
    RTM --> Redis
    Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```
## Responsibility Boundaries

`Runtime Manager` is responsible for:

- accepting start, stop, restart, patch, inspect, and cleanup requests through the supported
  transports and producing one durable outcome per request;
- creating Docker containers from a producer-supplied `image_ref` and binding them to the
  configured Docker network and host state directory;
- enforcing the one-game-one-container invariant in its own state and on Docker;
- monitoring container health through Docker events, periodic inspect, and active HTTP probes;
- publishing technical runtime events (`runtime:job_results`, `runtime:health_events`) and
  admin-only notification intents for failures that no other service can observe;
- reconciling its persistent state with Docker reality on startup and periodically;
- removing exited containers automatically by retention TTL or explicitly by admin command.

`Runtime Manager` is not responsible for:

- evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
- registering a started runtime with `Game Master` (Lobby calls GM after a successful job
  result);
- mapping platform users to engine players (GM owns this mapping);
- player command routing (GM proxies player commands directly to the engine);
- cleaning up host state directories;
- patching the engine version registry; the registry lives in `Game Master`.
## Container Model

### Network

Containers attach to a single user-defined Docker bridge network. The network is provisioned
**outside** RTM: docker-compose, Terraform, or an operator runbook creates `galaxy-net` (or
whatever name is configured via `RTMANAGER_DOCKER_NETWORK`).

RTM validates the network's presence at startup. A missing network is a fail-fast condition;
the process exits non-zero before opening any listener.
### DNS name and engine endpoint

Each container is created with hostname `galaxy-game-{game_id}` and is attached to the
configured network. Docker's embedded DNS resolves the hostname for any other container in the
same network.

The `engine_endpoint` published in `runtime:job_results` and visible through the inspect REST
endpoint is the full URL `http://galaxy-game-{game_id}:8080`. The port is fixed at `8080`
inside the container; RTM does not publish ports to the host.

Restart and patch keep the same DNS name. The `container_id` changes; the `engine_endpoint`
does not.
### State storage (bind mount)

Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale
is operator-friendly backup and inspection.

- Host root: `RTMANAGER_GAME_STATE_ROOT` (operator-supplied, e.g. `/var/lib/galaxy/games`).
- Per-game directory: `<RTMANAGER_GAME_STATE_ROOT>/{game_id}`. RTM creates it with permissions
  `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and ownership `RTMANAGER_GAME_STATE_OWNER_UID`
  / `_GID` (default `0:0` — operators override for non-root engines).
- Bind mount: the per-game directory is mounted into the container at the path declared by
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- Environment: the container receives `GAME_STATE_PATH=<mount path>`, and the engine resolves
  the path from this variable. The same value is also forwarded as `STORAGE_PATH` for backward
  compatibility — both names are accepted in v1.

RTM never deletes the host state directory. Removing it is the responsibility of operator
tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container
through the cleanup endpoint or the retention TTL leaves the directory intact.

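A minimal sketch of the directory preparation described above; `ensureStateDir` and its
parameters are illustrative names, not the production API:

```go
// Sketch only: ensure the per-game state directory before `docker create`.
package rtmsketch

import (
	"fmt"
	"os"
	"path/filepath"
)

func ensureStateDir(root, gameID string, mode os.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll is subject to the process umask; re-assert the configured mode.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	// Default ownership is 0:0; operators override for non-root engines.
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}
```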
### Container labels

RTM applies the following labels to every container it creates:

| Label | Value | Purpose |
| --- | --- | --- |
| `com.galaxy.owner` | `rtmanager` | Filter for `docker ps` and reconcile. |
| `com.galaxy.kind` | `game-engine` | Differentiates from infra containers. |
| `com.galaxy.game_id` | `{game_id}` | Reverse lookup from container to platform game. |
| `com.galaxy.engine_image_ref` | `{image_ref}` | Cross-check against `runtime_records`. |
| `com.galaxy.started_at_ms` | `{ms}` | Unambiguous start timestamp. |

Separately, labels on the resolved engine image are read to choose resource limits (see
below).
### Resource limits

Resource limits originate in the **engine image**, not in the producer envelope or RTM config:

| Image label | Container limit | RTM fallback config |
| --- | --- | --- |
| `com.galaxy.cpu_quota` | `--cpus` value | `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`) |
| `com.galaxy.memory` | `--memory` value | `RTMANAGER_DEFAULT_MEMORY` (default `512m`) |
| `com.galaxy.pids_limit` | `--pids-limit` value | `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`) |

If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass
limits.

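A minimal sketch of the label-with-fallback rule, using an illustrative `Limits` shape; the
memory value is kept as a raw string here, and validating it is left to the Docker adapter:

```go
// Sketch only: derive container limits from engine-image labels with fallbacks.
package rtmsketch

import "strconv"

type Limits struct {
	CPUQuota  float64 // --cpus
	Memory    string  // --memory, e.g. "512m"
	PidsLimit int64   // --pids-limit
}

// limitsFromLabels applies a label when present and parseable, else the fallback.
func limitsFromLabels(labels map[string]string, fallback Limits) Limits {
	out := fallback
	if v, ok := labels["com.galaxy.cpu_quota"]; ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			out.CPUQuota = f
		}
	}
	if v, ok := labels["com.galaxy.memory"]; ok && v != "" {
		out.Memory = v
	}
	if v, ok := labels["com.galaxy.pids_limit"]; ok {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			out.PidsLimit = n
		}
	}
	return out
}
```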
### Logging driver

Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver
and its options when creating the container:

- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty; comma-separated `key=value` pairs).

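A minimal sketch of parsing the comma-separated pairs into the map shape Docker's log config
expects; the strict error policy here is an assumption:

```go
// Sketch only: parse RTMANAGER_DOCKER_LOG_OPTS ("k1=v1,k2=v2") into a map.
package rtmsketch

import (
	"fmt"
	"strings"
)

func parseLogOpts(raw string) (map[string]string, error) {
	opts := map[string]string{}
	if raw == "" {
		return opts, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if !ok || k == "" {
			return nil, fmt.Errorf("invalid log opt %q", pair)
		}
		opts[strings.TrimSpace(k)] = v
	}
	return opts, nil
}
```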
RTM never reads the container's stdout itself. Operators consume engine logs via `docker logs`
or via whatever sink the configured driver feeds (fluentd, journald, etc.).

The production Docker SDK adapter that creates and starts these containers lives at
`internal/adapters/docker/`. Its design rationale — the fixed engine port, partial rollback on
`ContainerStart` failure, the events-stream filter choice, and the `mockgen`-driven
service-test fixture — is captured in [`docs/adapters.md`](docs/adapters.md).
## Runtime Surface

### Listeners

| Listener | Default address | Purpose |
| --- | --- | --- |
| `internal` HTTP | `:8096` (`RTMANAGER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Game Master` and `Admin Service`. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted
network segment.
### Background workers

| Worker | Driver | Description |
| --- | --- | --- |
| `startjobs` consumer | Redis Stream `runtime:start_jobs` | Decodes the start envelope and invokes the start service. |
| `stopjobs` consumer | Redis Stream `runtime:stop_jobs` | Decodes the stop envelope and invokes the stop service. |
| Docker events listener | Docker `/events` API | Subscribes with the label filter; emits `runtime:health_events` for exited / oom / disappeared containers (`container_started` is emitted by the start service, see Health Monitoring). |
| Active HTTP probe | Periodic | `GET {engine_endpoint}/healthz` for every running runtime; emits `probe_failed` / `probe_recovered` with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits `inspect_unhealthy` when `restart_count` grows or the status is unexpected. |
| Reconciler | Startup + periodic | Reconciles `runtime_records` with `docker ps` (see the Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than `RTMANAGER_CONTAINER_RETENTION_DAYS`. |
### Startup dependencies

In start order:

1. PostgreSQL primary (DSN `RTMANAGER_POSTGRES_PRIMARY_DSN`). Goose migrations apply
   synchronously before any listener opens.
2. Redis master (`RTMANAGER_REDIS_MASTER_ADDR`).
3. Docker daemon at `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`). RTM
   verifies the API ping and the presence of `RTMANAGER_DOCKER_NETWORK`.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs once and blocks until done.
7. Background workers start.

A failure in any step is fatal and exits the process non-zero.
### Probes

`/healthz` reports liveness — the process responds when the HTTP server is alive.

`/readyz` reports readiness — `200` only when:

- the PostgreSQL pool can ping the primary;
- the Redis master client can ping;
- the Docker client can ping;
- the configured Docker network exists.

Both probes are documented in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
## Lifecycles

All operations share a per-game-id Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`, default `60`). The lease serialises operations on a
single game across all entry points (stream consumers and REST handlers). v1 does not renew
the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before
the operation finishes — the trade-off is documented in
[`docs/services.md` §1](docs/services.md).

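A minimal sketch of lease acquisition with go-redis v9; the token scheme and error wrapping
are assumptions:

```go
// Sketch only: per-game lease via SET NX PX, matching the key shape above.
package rtmsketch

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// acquireGameLease returns true when the lease was taken, false when another
// operation on the same game currently holds it.
func acquireGameLease(ctx context.Context, rdb *redis.Client, gameID, token string, ttl time.Duration) (bool, error) {
	key := fmt.Sprintf("rtmanager:game_lease:%s", gameID)
	ok, err := rdb.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return false, fmt.Errorf("lease acquire: %w", err)
	}
	return ok, nil
}
```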
### Start

**Triggers:**

- Lobby: a Redis Streams entry on `runtime:start_jobs` with envelope
  `{game_id, image_ref, requested_at_ms}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/start` with body
  `{image_ref}`.

**Pre-conditions:**

- `image_ref` is a non-empty string and parseable as a Docker reference.
- The configured Docker network exists.
- The lease for `{game_id}` is acquired.

**Flow on success:**

1. Read `runtime_records.{game_id}`. If `status=running` with the same `image_ref`, return
   the existing record (idempotent success, `error_code=replay_no_op`).
2. Pull the image per `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`).
3. Inspect the resolved image and derive resource limits from its labels.
4. Ensure the per-game state directory exists with the configured mode and ownership.
5. `docker create` with the configured network, hostname, labels, env (`GAME_STATE_PATH`,
   `STORAGE_PATH`), bind mount, log driver, and resource limits (see the sketch after this
   list).
6. `docker start`.
7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`,
   `current_image_ref`, `started_at`, `last_op_at`).
8. Append an `operation_log` entry (`op_kind=start`, `outcome=success`, source-specific
   `op_source`).
9. Publish `runtime:health_events` `container_started`.
10. For Lobby callers: publish `runtime:job_results`
    `{game_id, outcome=success, container_id, engine_endpoint}`.
    For REST callers: respond `200` with the runtime record.

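A minimal sketch of steps 5–6 against the Docker Go SDK (the option struct names follow v26+;
older SDKs keep them in the `types` package). The env values, label set, fixed limits, and the
helper name are abbreviations and assumptions; the real adapter lives in
`internal/adapters/docker/`:

```go
// Sketch only: create, start, and roll back on start failure.
package rtmsketch

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/network"
	"github.com/docker/docker/client"
)

func createAndStart(ctx context.Context, cli *client.Client, gameID, imageRef, stateDir string) (string, error) {
	host := "galaxy-game-" + gameID
	pids := int64(512)
	cfg := &container.Config{
		Image:    imageRef,
		Hostname: host,
		Env:      []string{"GAME_STATE_PATH=/var/lib/galaxy-game", "STORAGE_PATH=/var/lib/galaxy-game"},
		Labels:   map[string]string{"com.galaxy.owner": "rtmanager", "com.galaxy.kind": "game-engine", "com.galaxy.game_id": gameID},
	}
	hostCfg := &container.HostConfig{
		Binds:     []string{stateDir + ":/var/lib/galaxy-game"},
		LogConfig: container.LogConfig{Type: "json-file"},
		Resources: container.Resources{NanoCPUs: 1_000_000_000, Memory: 512 << 20, PidsLimit: &pids},
	}
	netCfg := &network.NetworkingConfig{
		EndpointsConfig: map[string]*network.EndpointSettings{"galaxy-net": {}},
	}
	created, err := cli.ContainerCreate(ctx, cfg, hostCfg, netCfg, nil, host)
	if err != nil {
		return "", err
	}
	if err := cli.ContainerStart(ctx, created.ID, container.StartOptions{}); err != nil {
		// Partial rollback: a failed start must never leave a created container behind.
		_ = cli.ContainerRemove(ctx, created.ID, container.RemoveOptions{Force: true})
		return "", err
	}
	return created.ID, nil
}
```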
**Failure paths:**

| Failure | PG side effect | Notification intent | Outcome to caller |
| --- | --- | --- | --- |
| Invalid `image_ref` shape, network missing | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
| Image pull error | `operation_log` failure | `runtime.image_pull_failed` | `failure / image_pull_failed` |
| `docker create` / `start` error | `operation_log` failure | `runtime.container_start_failed` | `failure / container_start_failed` |
| State directory creation error | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |

A failed start never leaves a partially running container: if `docker create` succeeded but
a subsequent step failed, RTM removes the container before recording the failure.

The production start orchestrator that implements the flow and the failure paths above lives
at `internal/service/startruntime/`. Its design rationale — why the per-game lease and the
health-events publisher live with the start service, the `Result`-shaped contract consumed by
the stream consumer and the REST handler, the rollback rule on Upsert failure, and the
`created_at`-preservation rule for re-starts — is captured in
[`docs/services.md`](docs/services.md).
### Stop

**Triggers:**

- Lobby: Redis Streams entry on `runtime:stop_jobs` with envelope
  `{game_id, reason, requested_at_ms}`. `reason ∈ {orphan_cleanup, cancelled, finished,
  admin_request, timeout}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/stop` with body
  `{reason}`.

**Pre-conditions:**

- Lease acquired.

**Flow on success:**

1. Read `runtime_records.{game_id}`. If `status` is `stopped` or `removed`, return
   idempotent success (`error_code=replay_no_op`).
2. `docker stop` with `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`). Docker fires
   SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP
   shutdown endpoint on the engine (a sketch follows the failure table below).
3. Update `runtime_records` (`status=stopped`, `stopped_at`, `last_op_at`).
4. Append an `operation_log` entry.
5. Publish `runtime:job_results` (for Lobby) or respond `200` (for REST callers).

The container stays in the `exited` state until the cleanup worker removes it (TTL) or an
admin command forces removal.

**Failure paths:**

| Failure | Outcome |
| --- | --- |
| Container not found in Docker but record `running` | Update record `status=removed`, publish `container_disappeared`, return `success` (RTM treats this as already stopped). |
| `docker stop` returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |

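A minimal sketch of step 2 with the Docker Go SDK; `container.StopOptions` takes the timeout
in seconds, and the daemon, not the SDK, escalates to SIGKILL after it expires:

```go
// Sketch only: one ContainerStop call carries the whole stop semantics above.
package rtmsketch

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func stopContainer(ctx context.Context, cli *client.Client, containerID string, timeoutSeconds int) error {
	// SIGTERM now; the daemon sends SIGKILL once the timeout elapses.
	return cli.ContainerStop(ctx, containerID, container.StopOptions{Timeout: &timeoutSeconds})
}
```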
### Restart

**Triggers:**

- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/restart`.

Restart is **recreate**: stop + remove + run with the same `image_ref` and the same bind
mount. The `container_id` changes; the `engine_endpoint` is stable.

**Flow:**

1. Read `runtime_records.{game_id}` and capture the current `image_ref`.
2. Acquire the lease.
3. Run the stop flow (without releasing the lease).
4. `docker rm` the container.
5. Run the start flow with the captured `image_ref`.
6. Append a single `operation_log` entry with `op_kind=restart` and a correlation id linking
   the implicit stop and start log entries.

If any inner step fails, the operation log records the partial outcome and the outer caller
receives the same failure; the runtime record converges to whatever state Docker reports.
### Patch

**Triggers:**

- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/patch` with body
  `{image_ref}`.

Patch is a restart with a **new** `image_ref`. The engine reads its state from the bind mount
on startup, so any data written before the patch survives.

**Pre-conditions:**

- New and current image refs both parse as semver tags (`image_ref_not_semver` failure
  otherwise).
- Major and minor versions are equal between current and new (`semver_patch_only` failure
  otherwise).

**Flow:** identical to restart, with the new `image_ref` injected before the start step. The
`operation_log` entry has `op_kind=patch`. A sketch of the version guard follows.
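A minimal sketch of the two pre-conditions, assuming the version is the image tag after the
last `:` with an optional `v` prefix; real reference parsing (digests, numeric validation) is
stricter:

```go
// Sketch only: enforce image_ref_not_semver and semver_patch_only.
package rtmsketch

import (
	"fmt"
	"strings"
)

// majorMinor extracts "MAJOR.MINOR" from an image ref's tag.
func majorMinor(imageRef string) (string, error) {
	i := strings.LastIndex(imageRef, ":")
	if i < 0 {
		return "", fmt.Errorf("image_ref_not_semver: no tag in %q", imageRef)
	}
	parts := strings.SplitN(strings.TrimPrefix(imageRef[i+1:], "v"), ".", 3)
	if len(parts) != 3 {
		return "", fmt.Errorf("image_ref_not_semver: %q", imageRef[i+1:])
	}
	return parts[0] + "." + parts[1], nil
}

// checkPatchAllowed fails unless current and next share major and minor.
func checkPatchAllowed(current, next string) error {
	cur, err := majorMinor(current)
	if err != nil {
		return err
	}
	nxt, err := majorMinor(next)
	if err != nil {
		return err
	}
	if cur != nxt {
		return fmt.Errorf("semver_patch_only: %s -> %s", cur, nxt)
	}
	return nil
}
```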
### Cleanup

**Triggers:**

- Periodic worker: every container with `runtime_records.status=stopped` and
  `last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- Admin Service: `DELETE /api/v1/internal/runtimes/{game_id}/container`.

**Pre-conditions:**

- The container is not in the `running` state. RTM refuses to remove a running container
  through this path; stop first.

**Flow:**

1. Acquire the lease.
2. `docker rm` the container.
3. Update `runtime_records` (`status=removed`, `removed_at`, `current_container_id=NULL`,
   `last_op_at`).
4. Append an `operation_log` entry (`op_kind=cleanup_container`,
   `op_source ∈ {auto_ttl, admin_rest}`).

The host state directory is left untouched.
## Health Monitoring

Three independent sources feed `runtime:health_events` and `health_snapshots` (a sketch of the
probe hysteresis follows the list):

1. **Docker events listener.** Subscribes to the Docker events stream and filters
   container-scoped events by the `com.galaxy.owner=rtmanager` label written into every
   container by the start service. Emits:
   - `container_exited` (action=`die` with a non-zero exit code; exit `0` is the normal
     graceful stop and is suppressed).
   - `container_oom` (action=`oom`).
   - `container_disappeared` (action=`destroy` observed for a `runtime_records.status=running`
     row whose `current_container_id` still matches the destroyed container, i.e. a destroy
     RTM did not initiate).

   `container_started` is emitted by the start service when it runs the container (see
   `internal/service/startruntime`), not by this listener.
2. **Periodic Docker inspect** every `RTMANAGER_INSPECT_INTERVAL` (default `30s`). Emits
   `inspect_unhealthy` when:
   - `RestartCount` increases between observations;
   - `State.Status != "running"` for a record marked running;
   - `State.Health.Status == "unhealthy"` if the image declares a Docker `HEALTHCHECK`.
3. **Active HTTP probe** every `RTMANAGER_PROBE_INTERVAL` (default `15s`). Calls
   `GET {engine_endpoint}/healthz` with `RTMANAGER_PROBE_TIMEOUT` (default `2s`). Emits:
   - `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures
     (default `3`);
   - `probe_recovered` on the first success after a `probe_failed` was published.

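A minimal sketch of the hysteresis from item 3, with an illustrative state struct; the
production state model lives in `internal/worker/healthprobe`:

```go
// Sketch only: one probe-state machine per running runtime.
package rtmsketch

type probeState struct {
	consecutiveFailures int
	failedPublished     bool
}

// observe records one probe attempt and returns the event to emit, if any.
func (s *probeState) observe(success bool, threshold int) string {
	if success {
		recovered := s.failedPublished
		s.consecutiveFailures = 0
		s.failedPublished = false
		if recovered {
			return "probe_recovered" // first success after a published probe_failed
		}
		return ""
	}
	s.consecutiveFailures++
	if !s.failedPublished && s.consecutiveFailures >= threshold {
		s.failedPublished = true
		return "probe_failed" // published once per failure episode
	}
	return ""
}
```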
Every emission updates `health_snapshots.{game_id}` (the latest event becomes the snapshot)
and appends to `runtime:health_events`.

In v1, RTM publishes admin-only notification intents only for first-touch failures of the
start flow. All ongoing health changes (probe failures, OOMs, exits) flow through
`runtime:health_events` only. `Game Master` is the consumer that decides whether to escalate
runtime-level events into notifications.

The three workers that implement the sources above live in
`internal/worker/{dockerevents,dockerinspect,healthprobe}`. Their design rationale —
`container_started` ownership, `container_disappeared` emission rules, `die` exit-code
suppression, the probe hysteresis state model, the parallel-probe cap, and the events-listener
reconnect policy — is captured in [`docs/workers.md`](docs/workers.md).
## Reconciliation

RTM never assumes Docker and PostgreSQL are in sync.

At startup (blocking, before workers start) and every `RTMANAGER_RECONCILE_INTERVAL`
(default `5m`), the reconciler does the following (the drift decision is sketched after this
list):

1. List Docker containers with the label `com.galaxy.owner=rtmanager`.
2. For each running container without a matching record:
   - Insert a `runtime_records` row with `status=running`, the discovered
     `current_image_ref`, `engine_endpoint`, and `started_at` taken from
     `com.galaxy.started_at_ms` if present (otherwise from `State.StartedAt`).
   - Append an `operation_log` entry with `op_kind=reconcile_adopt`,
     `op_source=auto_reconcile`.
   - **Never stop or remove an unrecorded container.** Operators may have started one
     manually for diagnostics; RTM stays out of their way.
3. For each `runtime_records` row with `status=running` whose container is missing:
   - Update `status=removed`, `removed_at=now`, `current_container_id=NULL`.
   - Publish `runtime:health_events` `container_disappeared`.
   - Append an `operation_log` entry with `op_kind=reconcile_dispose`.
4. For each `runtime_records` row with `status=running` whose container exists but is
   `exited`:
   - Update `status=stopped`, `stopped_at=now` (reconciler observation time).
   - Publish `runtime:health_events` `container_exited` with the observed exit code.

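A minimal sketch of the drift decision implied by this list; the function and parameter names
are illustrative, and the returned strings match the `rtmanager.reconcile_drift` metric
labels:

```go
// Sketch only: classify one (record, container) pair during a reconcile pass.
package rtmsketch

// reconcileAction returns "adopt", "dispose", "observed_exited", or "none".
// containerState is the Docker-reported state, "" when the container is gone.
func reconcileAction(hasRecord bool, recordStatus, containerState string) string {
	switch {
	case !hasRecord && containerState == "running":
		return "adopt" // insert a record; never stop an unrecorded container
	case hasRecord && recordStatus == "running" && containerState == "":
		return "dispose" // mark removed, publish container_disappeared
	case hasRecord && recordStatus == "running" && containerState == "exited":
		return "observed_exited" // mark stopped, publish container_exited
	default:
		return "none" // record and Docker agree
	}
}
```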
The reconciler implementation lives at `internal/worker/reconcile/` and the periodic
TTL-cleanup worker at `internal/worker/containercleanup/`; the cleanup worker delegates
removal to `internal/service/cleanupcontainer/`. The design rationale — the per-game
lease around every drift mutation, the third `observed_exited` path beyond the two
named cases, the synchronous `ReconcileNow` plus periodic `Component` split, and why
the cleanup worker is a thin TTL filter on top of the existing service — is captured in
[`docs/workers.md`](docs/workers.md).
## Trusted Surfaces

### Internal REST

The internal REST surface is consumed by `Game Master` (sync interactions for inspect,
restart, patch, stop, cleanup) and `Admin Service` (operational tooling, force-cleanup).
The listener is unauthenticated; downstream services rely on network segmentation.

| Method | Path | Operation ID | Caller |
| --- | --- | --- | --- |
| `GET` | `/healthz` | `internalHealthz` | platform probes |
| `GET` | `/readyz` | `internalReadyz` | platform probes |
| `GET` | `/api/v1/internal/runtimes` | `internalListRuntimes` | GM, Admin |
| `GET` | `/api/v1/internal/runtimes/{game_id}` | `internalGetRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/start` | `internalStartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/stop` | `internalStopRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/restart` | `internalRestartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/patch` | `internalPatchRuntime` | GM, Admin |
| `DELETE` | `/api/v1/internal/runtimes/{game_id}/container` | `internalCleanupRuntimeContainer` | Admin |

Request and response shapes are defined in
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml). Unknown JSON fields are rejected
with `invalid_request`.

Callers identify themselves through the optional `X-Galaxy-Caller` request header (`gm` for
`Game Master`, `admin` for `Admin Service`). The header value is recorded as `op_source` in
`operation_log` (`gm_rest` or `admin_rest`); when the header is missing or carries any other
value, Runtime Manager defaults to `op_source = admin_rest`. The header is documented on every
runtime endpoint of [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).

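A minimal sketch of that mapping; the helper name is illustrative:

```go
// Sketch only: X-Galaxy-Caller header -> op_source value.
package rtmsketch

import "net/http"

func opSourceFromHeader(r *http.Request) string {
	switch r.Header.Get("X-Galaxy-Caller") {
	case "gm":
		return "gm_rest"
	case "admin":
		return "admin_rest"
	default:
		// Missing or unknown values default to admin_rest.
		return "admin_rest"
	}
}
```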
## Async Stream Contracts

### `runtime:start_jobs` (in)

Producer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference. Lobby resolves it from `target_engine_version` using `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `requested_at_ms` | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |

### `runtime:stop_jobs` (in)

Producer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `reason` | enum | `orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`. Recorded in `operation_log.error_code` when the reason matters; otherwise opaque. |
| `requested_at_ms` | int64 | |

### `runtime:job_results` (out)

Producer: `Runtime Manager`. Consumer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `outcome` | enum | `success`, `failure`. |
| `container_id` | string | Required for `success`. Empty on `failure`. |
| `engine_endpoint` | string | Required for `success`. Empty on `failure`. |
| `error_code` | string | Stable code. `replay_no_op` for idempotent re-runs. |
| `error_message` | string | Operator-readable detail. |

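Illustrative Go shapes for the three job payloads above; the JSON tags follow the field
tables, while the struct names are assumptions:

```go
// Sketch only: envelope shapes for the job streams.
package rtmsketch

type StartJob struct {
	GameID        string `json:"game_id"`
	ImageRef      string `json:"image_ref"`
	RequestedAtMs int64  `json:"requested_at_ms"`
}

type StopJob struct {
	GameID        string `json:"game_id"`
	Reason        string `json:"reason"` // orphan_cleanup | cancelled | finished | admin_request | timeout
	RequestedAtMs int64  `json:"requested_at_ms"`
}

type JobResult struct {
	GameID         string `json:"game_id"`
	Outcome        string `json:"outcome"` // success | failure
	ContainerID    string `json:"container_id,omitempty"`    // required on success
	EngineEndpoint string `json:"engine_endpoint,omitempty"` // required on success
	ErrorCode      string `json:"error_code,omitempty"`      // replay_no_op for idempotent re-runs
	ErrorMessage   string `json:"error_message,omitempty"`
}
```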
### `runtime:health_events` (out)

Producer: `Runtime Manager`. Consumer: `Game Master` — confirmed in production. `Game Lobby`
and `Admin Service` are reserved as future consumers; they do not read the stream in v1.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `container_id` | string | The container observed (may differ from current after a restart race). |
| `event_type` | enum | See below. |
| `occurred_at_ms` | int64 | UTC milliseconds. |
| `details` | json | Type-specific payload. |

`event_type` values and their `details` schemas:

| `event_type` | `details` payload |
| --- | --- |
| `container_started` | `{image_ref}` |
| `container_exited` | `{exit_code, oom: bool}` |
| `container_oom` | `{exit_code}` |
| `container_disappeared` | `{}` |
| `inspect_unhealthy` | `{restart_count, state, health}` |
| `probe_failed` | `{consecutive_failures, last_status, last_error}` |
| `probe_recovered` | `{prior_failure_count}` |

The full schema is enforced by
[`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml).
## Notification Contracts

`Runtime Manager` publishes admin-only notification intents only for failures invisible to
any other service:

| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Image pull error during start | `runtime.image_pull_failed` | admin | email |
| `docker create` / `docker start` error | `runtime.container_start_failed` | admin | email |
| Configuration validation error at start (bad `image_ref`, missing network) | `runtime.start_config_invalid` | admin | email |

Constructors live in `galaxy/pkg/notificationintent`. Catalog entries live in
[`../notification/README.md`](../notification/README.md) and
[`../notification/api/intents-asyncapi.yaml`](../notification/api/intents-asyncapi.yaml).
All three intents share the frozen field set
`{game_id, image_ref, error_code, error_message, attempted_at_ms}`; the `_ms` suffix on
`attempted_at_ms` follows the repo-wide convention for millisecond integer fields.

The Redis Streams publisher wrapper used to emit these intents from RTM ships in
`internal/adapters/notificationpublisher/`; the rationale for the signature shim that drops
the upstream entry id lives in [`docs/domain-and-ports.md` §7](docs/domain-and-ports.md) and
the production wiring is documented in [`docs/adapters.md`](docs/adapters.md).

Runtime-level changes after a successful start (probe failures, OOM, container exited)
**do not** produce notifications from RTM. Game Master decides whether to escalate.
## Persistence Layout

### PostgreSQL durable state (schema `rtmanager`)

| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game, latest known runtime status. | `game_id` |
| `operation_log` | Append-only audit of every operation RTM performed. | `id` (auto) |
| `health_snapshots` | Latest health observation per game. | `game_id` |

`runtime_records` columns:

- `game_id` — primary key, references Lobby's identifier.
- `status` — `running | stopped | removed`.
- `current_container_id` — nullable when `status=removed`.
- `current_image_ref` — non-null when status is `running` or `stopped`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `state_path` — absolute host path of the bind-mounted directory.
- `docker_network` — network name observed at create time.
- `started_at`, `stopped_at`, `removed_at` — last transition timestamps.
- `last_op_at` — drives the retention TTL.
- `created_at` — first time RTM saw the game.

`operation_log` columns:

- `id`, `game_id`, `op_kind` (`start | stop | restart | patch | cleanup_container |
  reconcile_adopt | reconcile_dispose`), `op_source` (`lobby_stream | gm_rest | admin_rest |
  auto_ttl | auto_reconcile`), `source_ref` (stream entry id, REST request id, or admin
  user), `image_ref`, `container_id`, `outcome` (`success | failure`), `error_code`,
  `error_message`, `started_at`, `finished_at`.

`health_snapshots` columns:

- `game_id`, `container_id`, `status`
  (`healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared`),
  `source` (`docker_event | inspect | probe`), `details` (jsonb), `observed_at`.

Indexes:

- `runtime_records (status, last_op_at)` — drives the cleanup worker.
- `operation_log (game_id, started_at DESC)` — drives audit reads.

Migrations are a single embedded `00001_init.sql` (the single-init pre-launch policy from
`ARCHITECTURE.md` §Persistence Backends).
### Redis runtime-coordination state

| Key shape | Purpose |
| --- | --- |
| `rtmanager:stream_offsets:{label}` | Last processed entry id per consumer (`startjobs`, `stopjobs`). Same shape as Lobby. |
| `rtmanager:game_lease:{game_id}` | Per-game lease string (`SET ... NX PX <ttl>`). TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default 60s); not renewed mid-operation in v1. The trade-off is documented in [`docs/services.md` §1](docs/services.md). |

The stream key names themselves are configurable:

- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
## Error Model

Error envelope: `{ "error": { "code": "...", "message": "..." } }`, identical to Lobby's.

Stable error codes:

| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `not_found` | Runtime record does not exist. |
| `conflict` | Operation incompatible with current `status`. |
| `service_unavailable` | Dependency unavailable (Docker daemon, PG, Redis). |
| `internal_error` | Unspecified failure. |
| `image_pull_failed` | Image pull attempt failed. |
| `image_ref_not_semver` | Patch attempted with a tag that is not parseable semver. |
| `semver_patch_only` | Patch attempted across a major/minor boundary. |
| `container_start_failed` | `docker create` / `docker start` failed. |
| `start_config_invalid` | Network missing, bind path inaccessible, or other config error. |
| `docker_unavailable` | Docker daemon ping failed. |
| `replay_no_op` | Idempotent replay; the outcome is success but no work was done. |
## Configuration

All variables use the `RTMANAGER_` prefix. Required variables fail fast on startup.

### Required

- `RTMANAGER_INTERNAL_HTTP_ADDR`
- `RTMANAGER_POSTGRES_PRIMARY_DSN`
- `RTMANAGER_REDIS_MASTER_ADDR`
- `RTMANAGER_REDIS_PASSWORD`
- `RTMANAGER_DOCKER_HOST`
- `RTMANAGER_DOCKER_NETWORK`
- `RTMANAGER_GAME_STATE_ROOT`

### Configuration groups

**Listener:**

- `RTMANAGER_INTERNAL_HTTP_ADDR` (e.g. `:8096`).
- `RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `15s`).
- `RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).

**Docker:**

- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`).
- `RTMANAGER_DOCKER_API_VERSION` (default empty — let the SDK negotiate).
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`).
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty).
- `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`; values `if_missing | always | never`).

**Container defaults:**

- `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`).
- `RTMANAGER_DEFAULT_MEMORY` (default `512m`).
- `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`).
- `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`).
- `RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- `RTMANAGER_ENGINE_STATE_ENV_NAME` (default `GAME_STATE_PATH`).
- `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`).
- `RTMANAGER_GAME_STATE_OWNER_UID` (default `0`).
- `RTMANAGER_GAME_STATE_OWNER_GID` (default `0`).
- `RTMANAGER_GAME_STATE_ROOT` (host path).

**Postgres:**

- `RTMANAGER_POSTGRES_PRIMARY_DSN` (`postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable`).
- `RTMANAGER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used in v1).
- `RTMANAGER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `RTMANAGER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `RTMANAGER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).

**Redis:**

- `RTMANAGER_REDIS_MASTER_ADDR`.
- `RTMANAGER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `RTMANAGER_REDIS_PASSWORD`.
- `RTMANAGER_REDIS_DB` (default `0`).
- `RTMANAGER_REDIS_OPERATION_TIMEOUT` (default `2s`).

**Streams:**

- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
- `RTMANAGER_STREAM_BLOCK_TIMEOUT` (default `5s`).

**Health monitoring:**

- `RTMANAGER_INSPECT_INTERVAL` (default `30s`).
- `RTMANAGER_PROBE_INTERVAL` (default `15s`).
- `RTMANAGER_PROBE_TIMEOUT` (default `2s`).
- `RTMANAGER_PROBE_FAILURES_THRESHOLD` (default `3`).

**Reconciler / cleanup:**

- `RTMANAGER_RECONCILE_INTERVAL` (default `5m`).
- `RTMANAGER_CLEANUP_INTERVAL` (default `1h`).

**Coordination:**

- `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60`).

**Lobby internal client:**

- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` (e.g. `http://lobby:8095`).
- `RTMANAGER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).

**Logging:**

- `RTMANAGER_LOG_LEVEL` (default `info`).

**Lifecycle:**

- `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`).

**Telemetry:** uses the standard OTLP env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
`OTEL_EXPORTER_OTLP_PROTOCOL`, etc.) shared with other Galaxy services.
## Observability

### Metrics (OpenTelemetry, low cardinality)

- `rtmanager.start_outcomes` — counter, labels `outcome`, `error_code`, `op_source`.
- `rtmanager.stop_outcomes` — counter, labels `outcome`, `reason`, `op_source`.
- `rtmanager.restart_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.patch_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.cleanup_outcomes` — counter, labels `outcome`, `op_source`.
- `rtmanager.docker_op_latency` — histogram, label `op` (`pull | create | start | stop | rm
  | inspect | events`).
- `rtmanager.health_events` — counter, label `event_type`.
- `rtmanager.reconcile_drift` — counter, label `kind` (`adopt | dispose | observed_exited`).
- `rtmanager.runtime_records_by_status` — gauge, label `status`.
- `rtmanager.lease_acquire_latency` — histogram.
- `rtmanager.notification_intents` — counter, label `notification_type`.

### Structured logs (slog JSON to stdout)

Common fields on every entry: `service=rtmanager`, `request_id`, `trace_id`, `span_id`,
`game_id` (when known), `container_id` (when known), `op_kind`, `op_source`, `outcome`,
`error_code`.

Worker-specific fields: `stream_entry_id` (consumers), `event_type` (health), `image_ref`
(start/patch).
## Verification

Service-level (TESTING.md §7):

- Unit tests for every service-layer operation against mocked Docker.
- Adapter tests (PG, Redis, Docker) using `testcontainers-go` for PG/Redis and the Docker
  daemon socket for the real Docker adapter.
- Contract tests for `internal-openapi.yaml`, `runtime-jobs-asyncapi.yaml`,
  `runtime-health-asyncapi.yaml`.

Service-local integration suite under `rtmanager/integration/`:

- Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real
  `galaxy/game` test image.
- Replay safety (duplicate stream entries are no-ops).
- Health observability (kill the engine externally, observe `container_disappeared`; relaunch
  manually, observe reconcile adopt).
- Notification on first-touch failures (publish a start with an unresolvable image, observe
  the `runtime.image_pull_failed` intent and a `failure` job result).

Inter-service suite under `integration/lobbyrtm/`:

- Real Lobby + real RTM + real `galaxy/game` test image. Covers happy path, cancel, and
  start-failed flows.

Manual smoke (development):

```sh
docker network create galaxy-net   # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```

After start, `curl http://localhost:8096/readyz` returns `200`. Driving Lobby through its
public flow brings up `galaxy-game-{game_id}` containers; RTM logs each lifecycle transition
and publishes the corresponding stream entries.