feat: runtime manager

Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
+28
@@ -0,0 +1,28 @@
# Makefile for galaxy/rtmanager.
#
# The `jet` target regenerates the go-jet/v2 query-builder code under
# internal/adapters/postgres/jet/ against a transient PostgreSQL container
# brought up by cmd/jetgen. Generated code is committed.
#
# The `mocks` target regenerates the gomock-driven mocks via the
# //go:generate directives that live next to the interfaces they cover:
# - internal/ports/ — port interfaces (Stage 12)
# - internal/api/internalhttp/handlers/ — REST handler service ports (Stage 16)
# Generated code is committed.
#
# The `integration` target runs the service-local end-to-end suite
# under integration/. It requires a reachable Docker daemon
# (`/var/run/docker.sock` or `DOCKER_HOST`); without one the helpers
# in integration/harness call t.Skip and the tests are no-ops.
.PHONY: jet mocks integration
jet:
go run ./cmd/jetgen
mocks:
go generate ./internal/ports/...
go generate ./internal/api/internalhttp/handlers/...
integration:
go test -tags=integration -count=1 ./integration/...
+1022
File diff suppressed because it is too large
+867
@@ -0,0 +1,867 @@
# Runtime Manager
`Runtime Manager` (RTM) is the only Galaxy platform service permitted to interact with the
Docker daemon. It owns the lifecycle of `galaxy/game` engine containers and the technical
runtime view of running games. Other services consume RTM via two transports: an asynchronous
Redis Streams contract (used by `Game Lobby`) and a synchronous internal REST surface (used by
`Game Master` and `Admin Service`).
## References
- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §9 Runtime Manager.
- [`../TESTING.md`](../TESTING.md) §7 — testing matrix for RTM.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry point.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — REST contract.
- [`./api/runtime-jobs-asyncapi.yaml`](./api/runtime-jobs-asyncapi.yaml) — start/stop job
streams contract.
- [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml) —
`runtime:health_events` stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract (env, ports,
`/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with RTM.
## Purpose
A running Galaxy game lives in exactly one Docker container. The platform must be able to:
- create the container with the right engine version and configuration;
- supply the engine with a stable storage location for game state;
- keep the runtime status visible to platform-level services;
- replace the container in place for patch upgrades and restarts;
- remove containers that are no longer needed;
- detect and surface engine failures to whoever should react.
`Runtime Manager` is the single component that performs these actions. It deliberately does
**not** reason about platform metadata, membership, schedules, turn cutoffs, or any other
business state. Game Lobby owns platform metadata; Game Master will own runtime business state
when implemented.
## Scope
`Runtime Manager` is the source of truth for:
- the mapping `game_id -> current_container_id` for every running container;
- the durable history of every start, stop, restart, patch, and cleanup operation it performed;
- the most recent technical health observation per game (last Docker event, last successful or
failed probe, last inspect result).
`Runtime Manager` is not the source of truth for:
- any business or platform-level metadata of a game (owned by `Game Lobby`);
- runtime state visible to players or operators as game state, including current turn,
generation status, engine version registry (owned by `Game Master`);
- the engine version catalogue or which engine version a game is allowed to use (`Game Master`
is the future owner; `Game Lobby` supplies `image_ref` in v1);
- contents of the engine state directory; that is engine domain;
- backup, archival, or operator cleanup of state directories.
## Non-Goals
- Multi-instance operation in v1. Coordination is single-process; multiple replicas are an
explicit future iteration.
- Engine version arbitration. The producer (`Game Lobby` in v1, `Game Master` later) supplies `image_ref`.
- Image registry control. Pull policy is configurable, but RTM does not push, retag, or
promote images.
- TLS or mTLS on the internal listener. RTM trusts its network segment.
- Direct delivery of player-visible push notifications. RTM publishes admin-only notification
intents only for failures invisible elsewhere; everything else is delegated.
- Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached
through `unix:///var/run/docker.sock`.
## Position in the System
```mermaid
flowchart LR
Lobby["Game Lobby"]
GM["Game Master"]
Admin["Admin Service"]
Notify["Notification Service"]
RTM["Runtime Manager"]
Engine["Game Engine container"]
Docker["Docker Daemon"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases"]
Lobby -->|runtime:start_jobs / stop_jobs| RTM
RTM -->|runtime:job_results| Lobby
GM -->|internal REST| RTM
Admin -->|internal REST| RTM
RTM -->|notification:intents (admin)| Notify
RTM -->|runtime:health_events| Redis
RTM <--> Docker
Docker -->|create / start / stop / rm| Engine
RTM --> Postgres
RTM --> Redis
Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```
## Responsibility Boundaries
`Runtime Manager` is responsible for:
- accepting start, stop, restart, patch, inspect, and cleanup requests through the supported
transports and producing one durable outcome per request;
- creating Docker containers from a producer-supplied `image_ref` and binding them to the
configured Docker network and host state directory;
- enforcing the one-game-one-container invariant in its own state and on Docker;
- monitoring container health through Docker events, periodic inspect, and active HTTP probes;
- publishing technical runtime events (`runtime:job_results`, `runtime:health_events`) and
admin-only notification intents for failures that no other service can observe;
- reconciling its persistent state with Docker reality on startup and periodically;
- removing exited containers automatically by retention TTL or explicitly by admin command.
`Runtime Manager` is not responsible for:
- evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
- registering a started runtime with `Game Master` (Lobby calls GM after a successful job
result);
- mapping platform users to engine players (GM owns this mapping);
- player command routing (GM proxies player commands directly to engine);
- cleaning up host state directories;
- patching the engine version registry; the registry lives in `Game Master`.
## Container Model
### Network
Containers attach to a single user-defined Docker bridge network. The network is provisioned
**outside** RTM: docker-compose, Terraform, or an operator runbook creates `galaxy-net` (or
whatever name is configured via `RTMANAGER_DOCKER_NETWORK`).
RTM validates the network's presence at startup. A missing network is a fail-fast condition;
the process exits non-zero before opening any listener.
### DNS name and engine endpoint
Each container is created with hostname `galaxy-game-{game_id}` and is attached to the
configured network. Docker's embedded DNS resolves the hostname for any other container in the
same network.
The `engine_endpoint` published in `runtime:job_results` and visible through the inspect REST
endpoint is the full URL `http://galaxy-game-{game_id}:8080`. The port is fixed at `8080`
inside the container; RTM does not publish ports to the host.
Restart and patch keep the same DNS name. The `container_id` changes; the `engine_endpoint`
does not.
### State storage (bind mount)
Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale
is operator-friendly backup and inspection.
- Host root: `RTMANAGER_GAME_STATE_ROOT` (operator-supplied, e.g. `/var/lib/galaxy/games`).
- Per-game directory: `<RTMANAGER_GAME_STATE_ROOT>/{game_id}`. RTM creates it with permissions
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and ownership `RTMANAGER_GAME_STATE_OWNER_UID`
/ `_GID` (default `0:0` — operator overrides for non-root engine).
- Bind mount: the per-game directory is mounted into the container at the path declared by
`RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- Environment: the container receives `GAME_STATE_PATH=<mount path>`. The engine resolves the
  path from this variable. The same value is also exported under the name `STORAGE_PATH` for
  backward compatibility — both names are accepted in v1.
RTM never deletes the host state directory. Removing it is the responsibility of operator
tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container
through the cleanup endpoint or the retention TTL leaves the directory intact.
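A minimal sketch of the directory-preparation step (the helper name is illustrative, not the
real adapter); `os.MkdirAll` honours the process umask, hence the explicit `Chmod`:
```go
package statefs

import (
	"fmt"
	"os"
	"path/filepath"
)

// EnsureGameStateDir creates <root>/<gameID> with the configured mode and
// ownership. It never removes anything; cleanup is operator territory.
func EnsureGameStateDir(root, gameID string, mode os.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll applies the umask; force the exact configured mode.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	// Default 0:0 keeps root ownership; operators override for non-root engines.
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}
```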
### Container labels
RTM applies the following labels to every container it creates:
| Label | Value | Purpose |
| --- | --- | --- |
| `com.galaxy.owner` | `rtmanager` | Filter for `docker ps` and reconcile. |
| `com.galaxy.kind` | `game-engine` | Differentiates from infra containers. |
| `com.galaxy.game_id` | `{game_id}` | Reverse lookup from container to platform game. |
| `com.galaxy.engine_image_ref` | `{image_ref}` | Cross-check against `runtime_records`. |
| `com.galaxy.started_at_ms` | `{ms}` | Unambiguous start timestamp. |
Separately, RTM reads labels from the resolved engine image to choose resource limits (see below).
### Resource limits
Resource limits originate in the **engine image**, not in the producer envelope or RTM config:
| Image label | Container limit | RTM fallback config |
| --- | --- | --- |
| `com.galaxy.cpu_quota` | `--cpus` value | `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`) |
| `com.galaxy.memory` | `--memory` value | `RTMANAGER_DEFAULT_MEMORY` (default `512m`) |
| `com.galaxy.pids_limit` | `--pids-limit` value | `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`) |
If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass
limits.
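A sketch of the label-with-fallback rule — the struct and function names are illustrative, and
`go-units` is assumed for memory parsing because the Docker SDK itself uses it:
```go
package limits

import (
	"strconv"

	units "github.com/docker/go-units" // RAMInBytes parses "512m", "1g", ...
)

// Limits carries the three container limits in Docker HostConfig units.
type Limits struct {
	NanoCPUs  int64 // 1 CPU = 1e9
	MemoryB   int64
	PidsLimit int64
}

// FromImageLabels derives limits from engine-image labels, falling back to
// the RTM config defaults when a label is absent or unparseable.
func FromImageLabels(labels map[string]string, defCPU float64, defMem string, defPids int64) Limits {
	l := Limits{NanoCPUs: int64(defCPU * 1e9), PidsLimit: defPids}
	l.MemoryB, _ = units.RAMInBytes(defMem)

	if v, err := strconv.ParseFloat(labels["com.galaxy.cpu_quota"], 64); err == nil && v > 0 {
		l.NanoCPUs = int64(v * 1e9)
	}
	if b, err := units.RAMInBytes(labels["com.galaxy.memory"]); err == nil && b > 0 {
		l.MemoryB = b
	}
	if p, err := strconv.ParseInt(labels["com.galaxy.pids_limit"], 10, 64); err == nil && p > 0 {
		l.PidsLimit = p
	}
	return l
}
```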
### Logging driver
Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver
and its options when creating the container:
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty; comma-separated `key=value` pairs).
RTM never reads the container's stdout itself. Operators consume engine logs via `docker logs`
or via whatever sink the configured driver feeds (fluentd, journald, etc.).
The production Docker SDK adapter that creates and starts these containers lives at
`internal/adapters/docker/`. Its design rationale — fixed engine port, partial-rollback on
`ContainerStart` failure, events-stream filter rationale, and the `mockgen`-driven service-test
fixture — is captured in [`docs/adapters.md`](docs/adapters.md).
## Runtime Surface
### Listeners
| Listener | Default address | Purpose |
| --- | --- | --- |
| `internal` HTTP | `:8096` (`RTMANAGER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Game Master` and `Admin Service`. |
There is no public listener. The internal listener is unauthenticated and assumes a trusted
network segment.
### Background workers
| Worker | Driver | Description |
| --- | --- | --- |
| `startjobs` consumer | Redis Stream `runtime:start_jobs` | Decodes start envelope and invokes the start service. |
| `stopjobs` consumer | Redis Stream `runtime:stop_jobs` | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker `/events` API | Subscribes with the label filter; emits `runtime:health_events` for exited / oom / disappeared (`container_started` is emitted by the start service). |
| Active HTTP probe | Periodic | `GET {engine_endpoint}/healthz` for every running runtime; emits `probe_failed` / `probe_recovered` with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits `inspect_unhealthy` when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles `runtime_records` with `docker ps` (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than `RTMANAGER_CONTAINER_RETENTION_DAYS`. |
### Startup dependencies
In start order:
1. PostgreSQL primary (DSN `RTMANAGER_POSTGRES_PRIMARY_DSN`). Goose migrations apply
synchronously before any listener opens.
2. Redis master (`RTMANAGER_REDIS_MASTER_ADDR`).
3. Docker daemon at `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`). RTM
verifies API ping and the presence of `RTMANAGER_DOCKER_NETWORK`.
4. Telemetry exporter (OTLP gRPC/HTTP or stdout).
5. Internal HTTP listener.
6. Reconciler runs once and blocks until done.
7. Background workers start.
A failure in any step is fatal and exits the process non-zero.
### Probes
`/healthz` reports liveness — the process responds when the HTTP server is alive.
`/readyz` reports readiness — `200` only when:
- the PostgreSQL pool can ping the primary;
- the Redis master client can ping;
- the Docker client can ping;
- the configured Docker network exists.
Both probes are documented in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
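A minimal readiness-handler sketch covering the four checks above; the `Pinger` port is an
assumption of this sketch, not the real interface in `internal/ports`:
```go
package probes

import (
	"context"
	"net/http"
	"time"
)

// Pinger abstracts the three dependency pings plus the network lookup.
type Pinger interface {
	PingPostgres(ctx context.Context) error
	PingRedis(ctx context.Context) error
	PingDocker(ctx context.Context) error
	NetworkExists(ctx context.Context, name string) error
}

func ReadyzHandler(p Pinger, network string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		checks := []func(context.Context) error{
			p.PingPostgres, p.PingRedis, p.PingDocker,
			func(ctx context.Context) error { return p.NetworkExists(ctx, network) },
		}
		w.Header().Set("Content-Type", "application/json")
		for _, check := range checks {
			if err := check(ctx); err != nil {
				w.WriteHeader(http.StatusServiceUnavailable)
				w.Write([]byte(`{"error":{"code":"service_unavailable","message":"dependency not ready"}}`))
				return
			}
		}
		w.Write([]byte(`{"status":"ready"}`))
	}
}
```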
## Lifecycles
All operations share a per-game-id Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`, default `60`). The lease serialises operations on a
single game across all entry points (stream consumers and REST handlers). v1 does not renew
the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before
the operation finishes — the trade-off is documented in
[`docs/services.md` §1](docs/services.md).
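A sketch of the lease pattern, assuming go-redis v9; the token-guarded release is the standard
companion to `SET ... NX PX` and is an assumption of this sketch, not otherwise specified here:
```go
package lease

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease key only when it still holds our token,
// so a lease that expired and was retaken elsewhere is never clobbered.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`)

// Acquire takes the per-game lease with SET NX PX.
func Acquire(ctx context.Context, rdb *redis.Client, gameID, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, "rtmanager:game_lease:"+gameID, token, ttl).Result()
}

// Release gives the lease back if (and only if) we still own it.
func Release(ctx context.Context, rdb *redis.Client, gameID, token string) error {
	return releaseScript.Run(ctx, rdb, []string{"rtmanager:game_lease:" + gameID}, token).Err()
}
```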
### Start
**Triggers:**
- Lobby: a Redis Streams entry on `runtime:start_jobs` with envelope
`{game_id, image_ref, requested_at_ms}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/start` with body
`{image_ref}`.
**Pre-conditions:**
- `image_ref` is a non-empty string and parseable as a Docker reference.
- Configured Docker network exists.
- The lease for `{game_id}` is acquired.
**Flow on success:**
1. Read `runtime_records.{game_id}`. If `status=running` with the same `image_ref`, return
the existing record (idempotent success, `error_code=replay_no_op`).
2. Pull the image per `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`).
3. Inspect the resolved image, derive resource limits from labels.
4. Ensure the per-game state directory exists with the configured mode and ownership.
5. `docker create` with the configured network, hostname, labels, env (`GAME_STATE_PATH`,
`STORAGE_PATH`), bind mount, log driver, resource limits.
6. `docker start`.
7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`,
`current_image_ref`, `started_at`, `last_op_at`).
8. Append `operation_log` entry (`op_kind=start`, `outcome=success`, source-specific
`op_source`).
9. Publish `runtime:health_events` `container_started`.
10. For Lobby callers: publish `runtime:job_results`
`{game_id, outcome=success, container_id, engine_endpoint}`.
For REST callers: respond `200` with the runtime record.
**Failure paths:**
| Failure | PG side effect | Notification intent | Outcome to caller |
| --- | --- | --- | --- |
| Invalid `image_ref` shape, network missing | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
| Image pull error | `operation_log` failure | `runtime.image_pull_failed` | `failure / image_pull_failed` |
| `docker create` / `start` error | `operation_log` failure | `runtime.container_start_failed` | `failure / container_start_failed` |
| State directory creation error | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
A failed start never leaves a partially-running container: if `docker create` succeeded but
a subsequent step failed, RTM removes the container before recording the failure.
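A sketch of that rollback rule against a hypothetical slice of the Docker port (the real
orchestrator and port interfaces live in the packages named below):
```go
package startruntime

import (
	"context"
	"fmt"
)

// ContainerSpec stands in for the full create request (image ref, network,
// hostname, labels, env, bind mount, log driver, limits).
type ContainerSpec struct {
	GameID   string
	ImageRef string
}

// ContainerRuntime is an illustrative slice of the Docker port; the real
// interfaces live in internal/ports.
type ContainerRuntime interface {
	Create(ctx context.Context, spec ContainerSpec) (string, error)
	Start(ctx context.Context, containerID string) error
	Remove(ctx context.Context, containerID string, force bool) error
}

// runContainer illustrates the partial-rollback rule: once Create succeeds,
// any later failure removes the container before the failure is recorded.
func runContainer(ctx context.Context, rt ContainerRuntime, spec ContainerSpec) (string, error) {
	id, err := rt.Create(ctx, spec)
	if err != nil {
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	if err := rt.Start(ctx, id); err != nil {
		_ = rt.Remove(ctx, id, true) // best-effort rollback; never leave a half-started container
		return "", fmt.Errorf("container_start_failed: %w", err)
	}
	return id, nil
}
```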
The production start orchestrator that implements the flow and the failure paths above lives
at `internal/service/startruntime/`. Its design rationale — why the per-game lease and the
health-events publisher live with the start service, the `Result`-shaped contract consumed by
the stream consumer and the REST handler, the rollback rule on Upsert failure, and the
`created_at`-preservation rule for re-starts — is captured in
[`docs/services.md`](docs/services.md).
### Stop
**Triggers:**
- Lobby: Redis Streams entry on `runtime:stop_jobs` with envelope
`{game_id, reason, requested_at_ms}`. `reason ∈ {orphan_cleanup, cancelled, finished,
admin_request, timeout}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/stop` with body
`{reason}`.
**Pre-conditions:**
- Lease acquired.
**Flow on success:**
1. Read `runtime_records.{game_id}`. If `status` is `stopped` or `removed`, return
idempotent success (`error_code=replay_no_op`).
2. `docker stop` with `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`). Docker fires
SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP
shutdown endpoint on the engine.
3. Update `runtime_records` (`status=stopped`, `stopped_at`, `last_op_at`).
4. Append `operation_log` entry.
5. Publish `runtime:job_results` (for Lobby) or REST `200` (for REST callers).
The container stays in `exited` state until the cleanup worker removes it (TTL) or an admin
command forces removal.
**Failure paths:**
| Failure | Outcome |
| --- | --- |
| Container not found in Docker but record `running` | Update record `status=removed`, publish `container_disappeared`, return `success` (RTM treats this as already-stopped). |
| `docker stop` returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |
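For reference, step 2 of the stop flow reduces to a single SDK call; the signature follows
recent `docker/docker` client releases and should be checked against the vendored version:
```go
package dockerstop

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// Stop issues the equivalent of `docker stop --time <secs>`: Docker sends
// SIGTERM, then SIGKILL once the timeout elapses. RTM calls no HTTP shutdown
// endpoint on the engine.
func Stop(ctx context.Context, cli *client.Client, containerID string, timeoutSeconds int) error {
	return cli.ContainerStop(ctx, containerID, container.StopOptions{Timeout: &timeoutSeconds})
}
```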
### Restart
**Triggers:**
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/restart`.
Restart is **recreate**: stop + remove + run with the same `image_ref` and the same bind
mount. `container_id` changes; `engine_endpoint` is stable.
**Flow:**
1. Read `runtime_records.{game_id}`. The current `image_ref` is captured.
2. Acquire lease.
3. Run the stop flow (without releasing the lease).
4. `docker rm` the container.
5. Run the start flow with the captured `image_ref`.
6. Append a single `operation_log` entry with `op_kind=restart` and a correlation id linking
the implicit stop and start log entries.
If any inner step fails, the operation log records the partial outcome and the outer caller
receives the same failure; the runtime record converges to whatever state Docker reports.
### Patch
**Triggers:**
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/patch` with body
`{image_ref}`.
Patch is restart with a **new** `image_ref`. The engine reads its state from the bind mount
on startup, so any data written before the patch survives.
**Pre-conditions:**
- New and current image refs both parse as semver tags (`image_ref_not_semver` failure
  otherwise).
- Major and minor versions are equal between current and new (`semver_patch_only` failure
otherwise).
**Flow:** identical to restart, with a new `image_ref` injected before the start step.
`operation_log` entry has `op_kind=patch`.
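A sketch of the two semver pre-conditions using `golang.org/x/mod/semver` (which wants a
leading `v`); the tag-extraction helper is illustrative and assumes the ref carries a tag:
```go
package patchcheck

import (
	"errors"
	"strings"

	"golang.org/x/mod/semver"
)

var (
	ErrNotSemver = errors.New("image_ref_not_semver")
	ErrPatchOnly = errors.New("semver_patch_only")
)

// ValidatePatch enforces the patch pre-conditions: both tags parse as semver
// and share the same major.minor line.
func ValidatePatch(currentRef, newRef string) error {
	cur, nxt := canonTag(currentRef), canonTag(newRef)
	if !semver.IsValid(cur) || !semver.IsValid(nxt) {
		return ErrNotSemver
	}
	if semver.MajorMinor(cur) != semver.MajorMinor(nxt) {
		return ErrPatchOnly
	}
	return nil
}

// canonTag takes the tag after the last ':' and prepends the "v" that
// x/mod/semver expects (engine tags like "1.4.7" lack it).
func canonTag(ref string) string {
	tag := ref
	if i := strings.LastIndex(ref, ":"); i >= 0 {
		tag = ref[i+1:]
	}
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag
	}
	return tag
}
```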
### Cleanup
**Triggers:**
- Periodic worker: every container with `runtime_records.status=stopped` and
`last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- Admin Service: `DELETE /api/v1/internal/runtimes/{game_id}/container`.
**Pre-conditions:**
- The container is not in `running` state. RTM refuses to remove a running container through
this path; stop first.
**Flow:**
1. Acquire lease.
2. `docker rm` the container.
3. Update `runtime_records` (`status=removed`, `removed_at`, `current_container_id=NULL`,
`last_op_at`).
4. Append `operation_log` entry (`op_kind=cleanup_container`,
`op_source ∈ {auto_ttl, admin_rest}`).
The host state directory is left untouched.
## Health Monitoring
Three independent sources feed `runtime:health_events` and `health_snapshots`:
1. **Docker events listener.** Subscribes to the Docker events stream and filters
container-scoped events by the `com.galaxy.owner=rtmanager` label written into every
container by the start service. Emits:
- `container_exited` (action=`die` with non-zero exit code; exit `0` is the normal
graceful stop and is suppressed).
- `container_oom` (action=`oom`).
- `container_disappeared` (action=`destroy` observed for a `runtime_records.status=running`
row whose `current_container_id` still matches the destroyed container, i.e. a destroy
RTM did not initiate).
`container_started` is emitted by the start service when it runs the container (see
`internal/service/startruntime`), not by this listener.
2. **Periodic Docker inspect** every `RTMANAGER_INSPECT_INTERVAL` (default `30s`). Emits
`inspect_unhealthy` when:
- `RestartCount` increases between observations;
- `State.Status != "running"` for a record marked running;
- `State.Health.Status == "unhealthy"` if the image declares a Docker `HEALTHCHECK`.
3. **Active HTTP probe** every `RTMANAGER_PROBE_INTERVAL` (default `15s`). Calls
`GET {engine_endpoint}/healthz` with `RTMANAGER_PROBE_TIMEOUT` (default `2s`). Emits:
- `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures
(default `3`);
- `probe_recovered` on the first success after a `probe_failed` was published.
Every emission updates `health_snapshots.{game_id}` (latest event becomes the snapshot) and
appends to `runtime:health_events`.
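The hysteresis state model for source 3 reduces to a small per-game tracker; this is a sketch
of the described behaviour, not the worker in `internal/worker/healthprobe`:
```go
package healthprobe

// tracker holds per-game hysteresis state: probe_failed fires once when
// consecutive failures reach the threshold; probe_recovered fires on the
// first success after a probe_failed was actually published.
type tracker struct {
	threshold int
	failures  int
	reported  bool
}

// Observe consumes one probe outcome and names the event to emit, if any.
// The caller attaches the type-specific details (consecutive_failures,
// prior_failure_count, ...) when publishing.
func (t *tracker) Observe(ok bool) (eventType string, emit bool) {
	if ok {
		recovered := t.reported
		t.failures, t.reported = 0, false
		return "probe_recovered", recovered
	}
	t.failures++
	if !t.reported && t.failures >= t.threshold {
		t.reported = true
		return "probe_failed", true
	}
	return "", false
}
```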
In v1, RTM publishes admin-only notification intents only for first-touch failures of the
start flow. All ongoing health changes (probe failures, OOMs, exits) flow through
`runtime:health_events` only. `Game Master` is the consumer that decides whether to escalate
runtime-level events into notifications.
The three workers that implement the sources above live in
`internal/worker/{dockerevents,dockerinspect,healthprobe}`. Their design rationale —
`container_started` ownership, `container_disappeared` emission rules, `die` exit-code
suppression, probe hysteresis state model, parallel-probe cap, and the events-listener
reconnect policy — is captured in [`docs/workers.md`](docs/workers.md).
## Reconciliation
RTM never assumes Docker and PostgreSQL are in sync.
At startup (blocking, before workers start) and every `RTMANAGER_RECONCILE_INTERVAL`
(default `5m`):
1. List Docker containers with label `com.galaxy.owner=rtmanager`.
2. For each running container without a matching record:
- Insert a `runtime_records` row with `status=running`, the discovered
`current_image_ref`, `engine_endpoint`, and `started_at` taken from
`com.galaxy.started_at_ms` if present (otherwise from `State.StartedAt`).
- Append `operation_log` entry with `op_kind=reconcile_adopt`,
`op_source=auto_reconcile`.
- **Never stop or remove an unrecorded container.** Operators may have started one
manually for diagnostics; RTM stays out of their way.
3. For each `runtime_records` row with `status=running` whose container is missing:
- Update `status=removed`, `removed_at=now`, `current_container_id=NULL`.
- Publish `runtime:health_events` `container_disappeared`.
- Append `operation_log` entry with `op_kind=reconcile_dispose`.
4. For each `runtime_records` row with `status=running` whose container exists but is in the
   `exited` state:
- Update `status=stopped`, `stopped_at=now` (reconciler observation time).
- Publish `runtime:health_events` `container_exited` with the observed exit code.
The reconciler implementation lives at `internal/worker/reconcile/` and the periodic
TTL-cleanup worker at `internal/worker/containercleanup/`; the cleanup worker delegates
removal to `internal/service/cleanupcontainer/`. The design rationale — the per-game
lease around every drift mutation, the third `observed_exited` path beyond the two
named cases, the synchronous `ReconcileNow` plus periodic `Component` split, and why
the cleanup worker is a thin TTL filter on top of the existing service — is captured in
[`docs/workers.md`](docs/workers.md).
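A sketch of the drift classification only — the types are illustrative, and the caller is
responsible for applying each mutation under the per-game lease:
```go
package reconcile

// Observed is what `docker ps` (filtered by com.galaxy.owner=rtmanager)
// yields for one container; Record mirrors a runtime_records row. Both are
// sketch types, not the real domain model.
type Observed struct {
	ContainerID string
	GameID      string
	Running     bool
	ExitCode    int
}

type Record struct {
	GameID      string
	Status      string // running | stopped | removed
	ContainerID string
}

type Drift struct {
	Kind   string // reconcile_adopt | reconcile_dispose | observed_exited
	GameID string
}

// Classify pairs Docker reality with runtime_records and names each drift.
// Unrecorded containers are adopted when running and never stopped or removed.
func Classify(containers []Observed, records map[string]Record) []Drift {
	var drifts []Drift
	seen := map[string]Observed{}
	for _, c := range containers {
		seen[c.GameID] = c
		if _, known := records[c.GameID]; !known && c.Running {
			drifts = append(drifts, Drift{Kind: "reconcile_adopt", GameID: c.GameID})
		}
	}
	for id, r := range records {
		if r.Status != "running" {
			continue
		}
		c, exists := seen[id]
		switch {
		case !exists:
			drifts = append(drifts, Drift{Kind: "reconcile_dispose", GameID: id})
		case !c.Running:
			drifts = append(drifts, Drift{Kind: "observed_exited", GameID: id})
		}
	}
	return drifts
}
```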
## Trusted Surfaces
### Internal REST
The internal REST surface is consumed by `Game Master` (sync interactions for inspect,
restart, patch, stop, cleanup) and `Admin Service` (operational tooling, force-cleanup).
The listener is unauthenticated; downstream services rely on network segmentation.
| Method | Path | Operation ID | Caller |
| --- | --- | --- | --- |
| `GET` | `/healthz` | `internalHealthz` | platform probes |
| `GET` | `/readyz` | `internalReadyz` | platform probes |
| `GET` | `/api/v1/internal/runtimes` | `internalListRuntimes` | GM, Admin |
| `GET` | `/api/v1/internal/runtimes/{game_id}` | `internalGetRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/start` | `internalStartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/stop` | `internalStopRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/restart` | `internalRestartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/patch` | `internalPatchRuntime` | GM, Admin |
| `DELETE` | `/api/v1/internal/runtimes/{game_id}/container` | `internalCleanupRuntimeContainer` | Admin |
Request and response shapes are defined in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
Unknown JSON fields are rejected with `invalid_request`.
Callers identify themselves through the optional `X-Galaxy-Caller`
request header (`gm` for `Game Master`, `admin` for `Admin Service`).
The header is recorded as `op_source` in `operation_log` (`gm_rest` or
`admin_rest`); when missing or carrying any other value Runtime
Manager defaults to `op_source = admin_rest`. The header is documented
on every runtime endpoint of
[`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
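The header-to-`op_source` mapping is small enough to show in full; the helper name is
illustrative:
```go
package handlers

import "net/http"

// opSourceFromCaller maps the optional X-Galaxy-Caller header onto the
// operation_log op_source values, defaulting to admin_rest for a missing
// or unknown value, exactly as the contract above specifies.
func opSourceFromCaller(r *http.Request) string {
	switch r.Header.Get("X-Galaxy-Caller") {
	case "gm":
		return "gm_rest"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}
```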
## Async Stream Contracts
### `runtime:start_jobs` (in)
Producer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference. Lobby resolves it from `target_engine_version` using `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `requested_at_ms` | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |
### `runtime:stop_jobs` (in)
Producer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `reason` | enum | `orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`. Recorded in `operation_log.error_code` when the reason matters; otherwise opaque. |
| `requested_at_ms` | int64 | |
### `runtime:job_results` (out)
Producer: `Runtime Manager`. Consumer: `Game Lobby`.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `outcome` | enum | `success`, `failure`. |
| `container_id` | string | Required for `success`. Empty on `failure`. |
| `engine_endpoint` | string | Required for `success`. Empty on `failure`. |
| `error_code` | string | Stable code. `replay_no_op` for idempotent re-runs. |
| `error_message` | string | Operator-readable detail. |
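For illustration, publishing one result entry with go-redis v9 looks like the sketch below;
stream values are flat strings, matching the field tables in this section:
```go
package streams

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// JobResult mirrors the runtime:job_results field table.
type JobResult struct {
	GameID         string
	Outcome        string // success | failure
	ContainerID    string
	EngineEndpoint string
	ErrorCode      string
	ErrorMessage   string
}

// PublishJobResult appends one outcome entry to the configured stream
// (default runtime:job_results).
func PublishJobResult(ctx context.Context, rdb *redis.Client, stream string, r JobResult) error {
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream,
		Values: map[string]interface{}{
			"game_id":         r.GameID,
			"outcome":         r.Outcome,
			"container_id":    r.ContainerID,
			"engine_endpoint": r.EngineEndpoint,
			"error_code":      r.ErrorCode,
			"error_message":   r.ErrorMessage,
		},
	}).Err()
}
```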
### `runtime:health_events` (out, new)
Producer: `Runtime Manager`. Consumers: `Game Master`; `Game Lobby` and `Admin Service`
are reserved as future consumers.
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `container_id` | string | The container observed (may differ from current after a restart race). |
| `event_type` | enum | See below. |
| `occurred_at_ms` | int64 | UTC milliseconds. |
| `details` | json | Type-specific payload. |
`event_type` values and their `details` schemas:
| `event_type` | `details` payload |
| --- | --- |
| `container_started` | `{image_ref}` |
| `container_exited` | `{exit_code, oom: bool}` |
| `container_oom` | `{exit_code}` |
| `container_disappeared` | `{}` |
| `inspect_unhealthy` | `{restart_count, state, health}` |
| `probe_failed` | `{consecutive_failures, last_status, last_error}` |
| `probe_recovered` | `{prior_failure_count}` |
The full schema is enforced by [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml).
## Notification Contracts
`Runtime Manager` publishes admin-only notification intents only for failures invisible to
any other service:
| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Image pull error during start | `runtime.image_pull_failed` | admin | email |
| `docker create` / `docker start` error | `runtime.container_start_failed` | admin | email |
| Configuration validation error at start (bad image_ref, missing network) | `runtime.start_config_invalid` | admin | email |
Constructors live in `galaxy/pkg/notificationintent`. Catalog entries live in
[`../notification/README.md`](../notification/README.md) and
[`../notification/api/intents-asyncapi.yaml`](../notification/api/intents-asyncapi.yaml).
All three intents share the frozen field set
`{game_id, image_ref, error_code, error_message, attempted_at_ms}`; the
`_ms` suffix on `attempted_at_ms` follows the repo-wide convention for
millisecond integer fields.
The Redis Streams publisher wrapper used to emit these intents from RTM
ships in `internal/adapters/notificationpublisher/`; the rationale for the
signature shim that drops the upstream entry id lives in
[`docs/domain-and-ports.md` §7](docs/domain-and-ports.md) and the production
wiring is documented in [`docs/adapters.md`](docs/adapters.md).
Runtime-level changes after a successful start (probe failures, OOM, container exited)
**do not** produce notifications from RTM. Game Master decides whether to escalate.
## Persistence Layout
### PostgreSQL durable state (schema `rtmanager`)
| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game, latest known runtime status. | `game_id` |
| `operation_log` | Append-only audit of every operation RTM performed. | `id` (auto) |
| `health_snapshots` | Latest health observation per game. | `game_id` |
`runtime_records` columns:
- `game_id` — primary key, references Lobby's identifier.
- `status` — `running | stopped | removed`.
- `current_container_id` — nullable when `status=removed`.
- `current_image_ref` — non-null when status is `running` or `stopped`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `state_path` — absolute host path of the bind-mounted directory.
- `docker_network` — network name observed at create time.
- `started_at`, `stopped_at`, `removed_at` — last transition timestamps.
- `last_op_at` — drives retention TTL.
- `created_at` — first time RTM saw the game.
`operation_log` columns:
- `id`, `game_id`, `op_kind` (`start | stop | restart | patch | cleanup_container |
reconcile_adopt | reconcile_dispose`), `op_source` (`lobby_stream | gm_rest | admin_rest |
auto_ttl | auto_reconcile`), `source_ref` (stream entry id, REST request id, or admin
user), `image_ref`, `container_id`, `outcome` (`success | failure`), `error_code`,
`error_message`, `started_at`, `finished_at`.
`health_snapshots` columns:
- `game_id`, `container_id`, `status`
(`healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared`),
`source` (`docker_event | inspect | probe`), `details` (jsonb), `observed_at`.
Indexes:
- `runtime_records (status, last_op_at)` — drives cleanup worker.
- `operation_log (game_id, started_at DESC)` — drives audit reads.
Migrations are embedded; a single `00001_init.sql` follows the single-init pre-launch policy
from `ARCHITECTURE.md` §Persistence Backends.
### Redis runtime-coordination state
| Key shape | Purpose |
| --- | --- |
| `rtmanager:stream_offsets:{label}` | Last processed entry id per consumer (`startjobs`, `stopjobs`). Same shape as Lobby's. |
| `rtmanager:game_lease:{game_id}` | Per-game lease string (`SET ... NX PX <ttl>`). TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default 60s); not renewed mid-operation in v1. The trade-off is documented in [`docs/services.md` §1](docs/services.md). |
Stream key shapes themselves are configurable:
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
## Error Model
Error envelope: `{ "error": { "code": "...", "message": "..." } }`, identical to Lobby's.
Stable error codes:
| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `not_found` | Runtime record does not exist. |
| `conflict` | Operation incompatible with current `status`. |
| `service_unavailable` | Dependency unavailable (Docker daemon, PG, Redis). |
| `internal_error` | Unspecified failure. |
| `image_pull_failed` | Image pull attempt failed. |
| `image_ref_not_semver` | Patch attempted with a tag that is not parseable semver. |
| `semver_patch_only` | Patch attempted across major/minor boundary. |
| `container_start_failed` | `docker create` / `docker start` failed. |
| `start_config_invalid` | Network missing, bind path inaccessible, or other config error. |
| `docker_unavailable` | Docker daemon ping failed. |
| `replay_no_op` | Idempotent replay; outcome is success but no work was done. |
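A sketch of the envelope on the Go side; the type and helper names are illustrative:
```go
package internalhttp

import (
	"encoding/json"
	"net/http"
)

// ErrorResponse is the wire shape of the envelope above.
type ErrorResponse struct {
	Error ErrorBody `json:"error"`
}

type ErrorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// writeError is the kind of helper a handler would use to emit one
// stable-code error with the right status.
func writeError(w http.ResponseWriter, status int, code, message string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(ErrorResponse{Error: ErrorBody{Code: code, Message: message}})
}
```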
## Configuration
All variables use the `RTMANAGER_` prefix. A missing required variable causes a fail-fast exit on startup.
### Required
- `RTMANAGER_INTERNAL_HTTP_ADDR`
- `RTMANAGER_POSTGRES_PRIMARY_DSN`
- `RTMANAGER_REDIS_MASTER_ADDR`
- `RTMANAGER_REDIS_PASSWORD`
- `RTMANAGER_DOCKER_HOST`
- `RTMANAGER_DOCKER_NETWORK`
- `RTMANAGER_GAME_STATE_ROOT`
### Configuration groups
**Listener:**
- `RTMANAGER_INTERNAL_HTTP_ADDR` (e.g. `:8096`).
- `RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `15s`).
- `RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).
**Docker:**
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`).
- `RTMANAGER_DOCKER_API_VERSION` (default empty — let SDK negotiate).
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`).
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty).
- `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`,
values `if_missing | always | never`).
**Container defaults:**
- `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`).
- `RTMANAGER_DEFAULT_MEMORY` (default `512m`).
- `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`).
- `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`).
- `RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- `RTMANAGER_ENGINE_STATE_ENV_NAME` (default `GAME_STATE_PATH`).
- `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`).
- `RTMANAGER_GAME_STATE_OWNER_UID` (default `0`).
- `RTMANAGER_GAME_STATE_OWNER_GID` (default `0`).
- `RTMANAGER_GAME_STATE_ROOT` (host path).
**Postgres:**
- `RTMANAGER_POSTGRES_PRIMARY_DSN` (`postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable`).
- `RTMANAGER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used in v1).
- `RTMANAGER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `RTMANAGER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `RTMANAGER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).
**Redis:**
- `RTMANAGER_REDIS_MASTER_ADDR`.
- `RTMANAGER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `RTMANAGER_REDIS_PASSWORD`.
- `RTMANAGER_REDIS_DB` (default `0`).
- `RTMANAGER_REDIS_OPERATION_TIMEOUT` (default `2s`).
**Streams:**
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
- `RTMANAGER_STREAM_BLOCK_TIMEOUT` (default `5s`).
**Health monitoring:**
- `RTMANAGER_INSPECT_INTERVAL` (default `30s`).
- `RTMANAGER_PROBE_INTERVAL` (default `15s`).
- `RTMANAGER_PROBE_TIMEOUT` (default `2s`).
- `RTMANAGER_PROBE_FAILURES_THRESHOLD` (default `3`).
**Reconciler / cleanup:**
- `RTMANAGER_RECONCILE_INTERVAL` (default `5m`).
- `RTMANAGER_CLEANUP_INTERVAL` (default `1h`).
**Coordination:**
- `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60`).
**Lobby internal client:**
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` (e.g. `http://lobby:8095`).
- `RTMANAGER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).
**Logging:**
- `RTMANAGER_LOG_LEVEL` (default `info`).
**Lifecycle:**
- `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`).
**Telemetry:** uses the standard OTLP env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
`OTEL_EXPORTER_OTLP_PROTOCOL`, etc.) shared with other Galaxy services.
## Observability
### Metrics (OpenTelemetry, low cardinality)
- `rtmanager.start_outcomes` — counter, labels `outcome`, `error_code`, `op_source`.
- `rtmanager.stop_outcomes` — counter, labels `outcome`, `reason`, `op_source`.
- `rtmanager.restart_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.patch_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.cleanup_outcomes` — counter, labels `outcome`, `op_source`.
- `rtmanager.docker_op_latency` — histogram, label `op` (`pull | create | start | stop | rm
| inspect | events`).
- `rtmanager.health_events` — counter, label `event_type`.
- `rtmanager.reconcile_drift` — counter, label `kind` (`adopt | dispose | observed_exited`).
- `rtmanager.runtime_records_by_status` — gauge, label `status`.
- `rtmanager.lease_acquire_latency` — histogram.
- `rtmanager.notification_intents` — counter, label `notification_type`.
### Structured logs (slog JSON to stdout)
Common fields on every entry: `service=rtmanager`, `request_id`, `trace_id`, `span_id`,
`game_id` (when known), `container_id` (when known), `op_kind`, `op_source`, `outcome`,
`error_code`.
Worker-specific fields: `stream_entry_id` (consumers), `event_type` (health), `image_ref`
(start/patch).
## Verification
Service-level (TESTING.md §7):
- Unit tests for every service-layer operation against mocked Docker.
- Adapter tests (PG, Redis, Docker) using `testcontainers-go` for PG/Redis and the Docker
daemon socket for the real Docker adapter.
- Contract tests for `internal-openapi.yaml`, `runtime-jobs-asyncapi.yaml`,
`runtime-health-asyncapi.yaml`.
Service-local integration suite under `rtmanager/integration/`:
- Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real
`galaxy/game` test image.
- Replay safety (duplicate stream entries are no-ops).
- Health observability (kill the engine externally, observe `container_disappeared`; relaunch
manually, observe reconcile adopt).
- Notification on first-touch failures (publish a start with an unresolvable image, observe
`runtime.image_pull_failed` intent and a `failure` job result).
Inter-service suite under `integration/lobbyrtm/`:
- Real Lobby + real RTM + real `galaxy/game` test image. Covers happy path, cancel, and
start-failed flows.
Manual smoke (development):
```sh
docker network create galaxy-net # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```
After start, `curl http://localhost:8096/readyz` returns `200`. Driving Lobby through its
public flow brings up `galaxy-game-{game_id}` containers; RTM logs each lifecycle transition
and publishes the corresponding stream entries.
+534
@@ -0,0 +1,534 @@
openapi: 3.0.3
info:
title: Galaxy Runtime Manager Internal REST API
version: v1
description: |
This specification documents the internal trusted REST contract of
`galaxy/rtmanager` served on `RTMANAGER_INTERNAL_HTTP_ADDR`
(default `:8096`).
The listener is not reachable from the public internet. Two caller
classes use it: `Game Master` (inspect / restart / patch / stop /
cleanup) and `Admin Service` (operational tooling, including
force-cleanup). Runtime Manager treats every caller on this port as
trusted and performs no user-level authorization; downstream services
rely on network segmentation. There is no `X-User-ID` header
contract.
Transport rules:
- request bodies are strict JSON only; unknown fields are rejected
with `invalid_request`;
- error responses use `{ "error": { "code", "message" } }`, identical
to the Lobby contract;
- stable error codes are: `invalid_request`, `not_found`, `conflict`,
`service_unavailable`, `internal_error`, `image_pull_failed`,
`image_ref_not_semver`, `semver_patch_only`,
`container_start_failed`, `start_config_invalid`,
`docker_unavailable`, `replay_no_op`.
Caller identification:
- the optional `X-Galaxy-Caller` request header carries the calling
service identity (`gm` for `Game Master`, `admin` for `Admin
Service`). Runtime Manager records the value as `op_source` in
the `operation_log` (`gm_rest` or `admin_rest`). When the header
is missing or carries an unknown value, Runtime Manager defaults
to `op_source = admin_rest`.
servers:
- url: http://localhost:8096
description: Default local internal listener for Runtime Manager.
tags:
- name: Runtimes
description: Runtime lifecycle endpoints called by Game Master and Admin Service.
- name: Probes
description: Health and readiness probes.
paths:
/healthz:
get:
tags:
- Probes
operationId: internalHealthz
summary: Internal listener health probe
responses:
"200":
description: Service is alive.
content:
application/json:
schema:
$ref: "#/components/schemas/ProbeResponse"
examples:
ok:
value:
status: ok
/readyz:
get:
tags:
- Probes
operationId: internalReadyz
summary: Internal listener readiness probe
description: |
Returns `200` only when the PostgreSQL primary, Redis master, and
Docker daemon are reachable and the configured Docker network
exists. Returns `503` with the standard error envelope otherwise.
responses:
"200":
description: Service is ready to serve traffic.
content:
application/json:
schema:
$ref: "#/components/schemas/ProbeResponse"
examples:
ready:
value:
status: ready
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes:
get:
tags:
- Runtimes
operationId: internalListRuntimes
summary: List all known runtime records
description: |
Returns the full list of runtime records known to Runtime Manager.
Pagination is not supported in v1 — the working set is bounded by
the number of games tracked by Lobby and is small enough to return
in one response.
parameters:
- $ref: "#/components/parameters/XGalaxyCallerHeader"
responses:
"200":
description: All runtime records.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimesList"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}:
get:
tags:
- Runtimes
operationId: internalGetRuntime
summary: Get one runtime record by game id
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
responses:
"200":
description: Runtime record for the game.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"404":
$ref: "#/components/responses/NotFoundError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}/start:
post:
tags:
- Runtimes
operationId: internalStartRuntime
summary: Start a game engine container
description: |
Pulls the supplied `image_ref` per the configured pull policy and
creates the engine container. Idempotent: a re-start with the same
`image_ref` for an already-running record returns `200` with the
current record and `error_code=replay_no_op` recorded in the
operation log.
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/StartRequest"
responses:
"200":
description: Runtime record after the start operation.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"400":
$ref: "#/components/responses/InvalidRequestError"
"409":
$ref: "#/components/responses/ConflictError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}/stop:
post:
tags:
- Runtimes
operationId: internalStopRuntime
summary: Stop a running game engine container
description: |
Issues `docker stop` with the configured timeout. Idempotent: stop
on a record that is already `stopped` or `removed` returns
success with `error_code=replay_no_op` recorded in the operation
log.
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/StopRequest"
responses:
"200":
description: Runtime record after the stop operation.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"400":
$ref: "#/components/responses/InvalidRequestError"
"404":
$ref: "#/components/responses/NotFoundError"
"409":
$ref: "#/components/responses/ConflictError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}/restart:
post:
tags:
- Runtimes
operationId: internalRestartRuntime
summary: Recreate a game engine container with the same image
description: |
Stops, removes, and re-runs the container with the current
`image_ref`. The container id changes; the engine endpoint stays
stable.
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
responses:
"200":
description: Runtime record after the restart operation.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"404":
$ref: "#/components/responses/NotFoundError"
"409":
$ref: "#/components/responses/ConflictError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}/patch:
post:
tags:
- Runtimes
operationId: internalPatchRuntime
summary: Recreate a game engine container with a new image
description: |
Restart with a new `image_ref`. Allowed only as a semver patch
within the same major and minor line. Cross-major or cross-minor
attempts return `409 conflict` with `error_code=semver_patch_only`.
A non-semver `image_ref` returns `400 invalid_request` with
`error_code=image_ref_not_semver`.
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/PatchRequest"
responses:
"200":
description: Runtime record after the patch operation.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"400":
$ref: "#/components/responses/InvalidRequestError"
"404":
$ref: "#/components/responses/NotFoundError"
"409":
$ref: "#/components/responses/ConflictError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
/api/v1/internal/runtimes/{game_id}/container:
delete:
tags:
- Runtimes
operationId: internalCleanupRuntimeContainer
summary: Remove an exited container
description: |
Calls `docker rm` for an already-stopped container and updates the
runtime record to `removed`. Refuses with `409 conflict` if the
record is still `running`. The host state directory is not
deleted.
parameters:
- $ref: "#/components/parameters/GameIDPath"
- $ref: "#/components/parameters/XGalaxyCallerHeader"
responses:
"200":
description: Runtime record after the cleanup operation.
content:
application/json:
schema:
$ref: "#/components/schemas/RuntimeRecord"
"404":
$ref: "#/components/responses/NotFoundError"
"409":
$ref: "#/components/responses/ConflictError"
"500":
$ref: "#/components/responses/InternalError"
"503":
$ref: "#/components/responses/ServiceUnavailableError"
components:
parameters:
GameIDPath:
name: game_id
in: path
required: true
description: Opaque stable game identifier owned by Lobby.
schema:
type: string
XGalaxyCallerHeader:
name: X-Galaxy-Caller
in: header
required: false
description: |
Identifies the calling service so Runtime Manager can record the
right `op_source` in `operation_log` (`gm_rest` for `gm`,
`admin_rest` for `admin`). Missing or unknown values default to
`admin_rest`.
schema:
type: string
enum:
- gm
- admin
schemas:
RuntimeRecord:
type: object
additionalProperties: false
required:
- game_id
- status
- state_path
- docker_network
- last_op_at
- created_at
properties:
game_id:
type: string
description: Opaque stable game identifier owned by Lobby.
status:
type: string
enum:
- running
- stopped
- removed
description: Current runtime status maintained by Runtime Manager.
current_container_id:
type: string
nullable: true
description: Docker container id; null when status is removed.
current_image_ref:
type: string
nullable: true
description: Image reference of the current container; null when status is removed.
engine_endpoint:
type: string
nullable: true
description: Stable engine URL `http://galaxy-game-{game_id}:8080`; null when status is removed.
state_path:
type: string
description: Absolute host path of the per-game bind-mounted state directory.
docker_network:
type: string
description: Docker network name observed when the container was created.
started_at:
type: string
format: date-time
nullable: true
description: UTC timestamp of the most recent successful start.
stopped_at:
type: string
format: date-time
nullable: true
description: UTC timestamp of the most recent stop.
removed_at:
type: string
format: date-time
nullable: true
description: UTC timestamp of the most recent container removal.
last_op_at:
type: string
format: date-time
description: UTC timestamp of the most recent operation; drives retention TTL.
created_at:
type: string
format: date-time
description: UTC timestamp of the first observation of this game.
RuntimesList:
type: object
additionalProperties: false
required:
- items
properties:
items:
type: array
items:
$ref: "#/components/schemas/RuntimeRecord"
StartRequest:
type: object
additionalProperties: false
required:
- image_ref
properties:
image_ref:
type: string
description: Docker reference resolved by the producer (Game Master or Admin Service).
StopRequest:
type: object
additionalProperties: false
required:
- reason
properties:
reason:
$ref: "#/components/schemas/StopReason"
PatchRequest:
type: object
additionalProperties: false
required:
- image_ref
properties:
image_ref:
type: string
description: New Docker reference within the same semver major and minor line.
StopReason:
type: string
enum:
- orphan_cleanup
- cancelled
- finished
- admin_request
- timeout
description: Reason carried in the stop envelope and recorded in the operation log.
ErrorCode:
type: string
enum:
- invalid_request
- not_found
- conflict
- service_unavailable
- internal_error
- image_pull_failed
- image_ref_not_semver
- semver_patch_only
- container_start_failed
- start_config_invalid
- docker_unavailable
- replay_no_op
description: Stable internal API error code.
ProbeResponse:
type: object
additionalProperties: false
required:
- status
properties:
status:
type: string
ErrorResponse:
type: object
additionalProperties: false
required:
- error
properties:
error:
$ref: "#/components/schemas/ErrorBody"
ErrorBody:
type: object
additionalProperties: false
required:
- code
- message
properties:
code:
$ref: "#/components/schemas/ErrorCode"
message:
type: string
description: Human-readable trusted error message.
responses:
InvalidRequestError:
description: Request validation failed.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
examples:
invalidRequest:
value:
error:
code: invalid_request
message: request is invalid
NotFoundError:
description: The requested runtime record does not exist.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
examples:
notFound:
value:
error:
code: not_found
message: runtime record not found
ConflictError:
description: The requested operation is not allowed in the current runtime state.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
examples:
conflict:
value:
error:
code: conflict
message: operation not allowed in current status
InternalError:
description: Unexpected internal service error.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
examples:
internal:
value:
error:
code: internal_error
message: internal server error
ServiceUnavailableError:
description: An upstream dependency is unavailable.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
examples:
unavailable:
value:
error:
code: service_unavailable
message: service is unavailable
+195
@@ -0,0 +1,195 @@
asyncapi: 3.1.0
info:
title: Galaxy Runtime Health Events Contract
version: 1.0.0
description: |
Stable Redis Streams contract for technical container health events
published by `Runtime Manager`. Consumers include `Game Master`;
`Game Lobby` and `Admin Service` are reserved as future consumers.
Three independent sources feed this stream: the Docker events
listener, the periodic Docker inspect worker, and the active HTTP
`/healthz` probe. Every emission also upserts the latest snapshot
into `health_snapshots` in PostgreSQL.
Polymorphism: the `details` field carries an `event_type`-specific
payload selected via `oneOf` per type. Each variant is a closed object
(no unknown fields).
The `event_type` enum is fixed in this contract; adding a new value
requires a contract bump and a coordinated consumer change.
channels:
healthEvents:
address: runtime:health_events
messages:
runtimeHealthEvent:
$ref: '#/components/messages/RuntimeHealthEvent'
operations:
publishHealthEvent:
action: send
summary: Publish one technical health event for downstream consumers.
channel:
$ref: '#/channels/healthEvents'
messages:
- $ref: '#/channels/healthEvents/messages/runtimeHealthEvent'
components:
messages:
RuntimeHealthEvent:
name: RuntimeHealthEvent
title: Runtime health event
summary: One technical health observation about a game engine container.
payload:
$ref: '#/components/schemas/RuntimeHealthEventPayload'
examples:
- name: containerStarted
summary: Engine container has been created and started.
payload:
game_id: game-123
container_id: 7c2b5d1a4f6e
event_type: container_started
occurred_at_ms: 1775121700000
details:
image_ref: registry.example.com/galaxy/game:1.4.7
- name: containerExited
summary: Engine container terminated with a non-zero exit code.
payload:
game_id: game-123
container_id: 7c2b5d1a4f6e
event_type: container_exited
occurred_at_ms: 1775121800000
details:
exit_code: 137
oom: false
- name: probeFailed
summary: Active probe observed three consecutive failures.
payload:
game_id: game-123
container_id: 7c2b5d1a4f6e
event_type: probe_failed
occurred_at_ms: 1775121810000
details:
consecutive_failures: 3
last_status: 0
last_error: "context deadline exceeded"
schemas:
RuntimeHealthEventPayload:
type: object
additionalProperties: false
required:
- game_id
- container_id
- event_type
- occurred_at_ms
- details
properties:
game_id:
type: string
description: Opaque stable game identifier owned by Lobby.
container_id:
type: string
description: Docker container id observed by Runtime Manager. May differ from the current container id after a restart race.
event_type:
$ref: '#/components/schemas/EventType'
occurred_at_ms:
type: integer
format: int64
description: UTC milliseconds when Runtime Manager observed the event.
details:
oneOf:
- $ref: '#/components/schemas/ContainerStartedDetails'
- $ref: '#/components/schemas/ContainerExitedDetails'
- $ref: '#/components/schemas/ContainerOomDetails'
- $ref: '#/components/schemas/ContainerDisappearedDetails'
- $ref: '#/components/schemas/InspectUnhealthyDetails'
- $ref: '#/components/schemas/ProbeFailedDetails'
- $ref: '#/components/schemas/ProbeRecoveredDetails'
description: Polymorphic payload selected by event_type.
EventType:
type: string
enum:
- container_started
- container_exited
- container_oom
- container_disappeared
- inspect_unhealthy
- probe_failed
- probe_recovered
description: Discriminator selecting the details variant.
ContainerStartedDetails:
type: object
additionalProperties: false
required:
- image_ref
properties:
image_ref:
type: string
description: Image reference of the started container.
ContainerExitedDetails:
type: object
additionalProperties: false
required:
- exit_code
- oom
properties:
exit_code:
type: integer
description: Exit code reported by Docker.
oom:
type: boolean
description: True when the container was killed by the OOM killer.
ContainerOomDetails:
type: object
additionalProperties: false
required:
- exit_code
properties:
exit_code:
type: integer
description: Exit code reported by Docker for the OOM event.
ContainerDisappearedDetails:
type: object
additionalProperties: false
description: Empty payload; emitted when a destroy event is observed for a record Runtime Manager did not initiate.
InspectUnhealthyDetails:
type: object
additionalProperties: false
required:
- restart_count
- state
- health
properties:
restart_count:
type: integer
description: Docker RestartCount observed at this inspection.
state:
type: string
description: Docker State.Status observed at this inspection.
health:
type: string
description: Docker State.Health.Status observed at this inspection; empty when the image declares no HEALTHCHECK.
ProbeFailedDetails:
type: object
additionalProperties: false
required:
- consecutive_failures
- last_status
- last_error
properties:
consecutive_failures:
type: integer
description: Number of consecutive probe failures that crossed the threshold.
last_status:
type: integer
description: HTTP status of the last probe attempt; 0 when the probe failed before receiving a response.
last_error:
type: string
description: Operator-readable error of the last probe attempt; empty when not applicable.
ProbeRecoveredDetails:
type: object
additionalProperties: false
required:
- prior_failure_count
properties:
prior_failure_count:
type: integer
description: Number of consecutive failures observed immediately before the recovery.
+226
View File
@@ -0,0 +1,226 @@
asyncapi: 3.1.0
info:
title: Galaxy Runtime Jobs Stream Contract
version: 1.0.0
description: |
Stable Redis Streams contract carrying runtime jobs between
`Game Lobby` and `Runtime Manager`.
`Game Lobby` is the sole producer for `runtime:start_jobs` and
`runtime:stop_jobs`. `Runtime Manager` consumes both, executes the
Docker work, and publishes one outcome per job to `runtime:job_results`,
which is consumed by `Game Lobby`'s runtime-job-result worker.
Replay safety:
- duplicate start jobs for an already-running game with the same
`image_ref` produce a `success` job result with
`error_code=replay_no_op`;
- duplicate stop jobs for an already-stopped or already-removed game
produce a `success` job result with `error_code=replay_no_op`.
The `reason` enum on `runtime:stop_jobs` is fixed in this contract.
Adding a new value requires a contract bump and a coordinated
Lobby/Runtime Manager change.
channels:
startJobs:
address: runtime:start_jobs
messages:
runtimeStartJob:
$ref: '#/components/messages/RuntimeStartJob'
stopJobs:
address: runtime:stop_jobs
messages:
runtimeStopJob:
$ref: '#/components/messages/RuntimeStopJob'
jobResults:
address: runtime:job_results
messages:
runtimeJobResult:
$ref: '#/components/messages/RuntimeJobResult'
operations:
consumeStartJob:
action: receive
summary: Receive one start job from Game Lobby and run a container.
channel:
$ref: '#/channels/startJobs'
messages:
- $ref: '#/channels/startJobs/messages/runtimeStartJob'
consumeStopJob:
action: receive
summary: Receive one stop job from Game Lobby and stop a container.
channel:
$ref: '#/channels/stopJobs'
messages:
- $ref: '#/channels/stopJobs/messages/runtimeStopJob'
publishJobResult:
action: send
summary: Publish one runtime job outcome for Game Lobby.
channel:
$ref: '#/channels/jobResults'
messages:
- $ref: '#/channels/jobResults/messages/runtimeJobResult'
components:
messages:
RuntimeStartJob:
name: RuntimeStartJob
title: Runtime start job
summary: Lobby request to start one game engine container.
payload:
$ref: '#/components/schemas/RuntimeStartJobPayload'
examples:
- name: startJob
summary: Start a game engine container with a producer-resolved image_ref.
payload:
game_id: game-123
image_ref: registry.example.com/galaxy/game:1.4.7
requested_at_ms: 1775121700000
RuntimeStopJob:
name: RuntimeStopJob
title: Runtime stop job
summary: Lobby request to stop one game engine container.
payload:
$ref: '#/components/schemas/RuntimeStopJobPayload'
examples:
- name: cancelled
summary: Stop the engine because the game was cancelled.
payload:
game_id: game-123
reason: cancelled
requested_at_ms: 1775121800000
- name: orphanCleanup
summary: Stop an engine whose Lobby metadata persistence failed.
payload:
game_id: game-456
reason: orphan_cleanup
requested_at_ms: 1775121810000
RuntimeJobResult:
name: RuntimeJobResult
title: Runtime job result
summary: Outcome of one start or stop job.
payload:
$ref: '#/components/schemas/RuntimeJobResultPayload'
examples:
- name: startSuccess
summary: Successful start, container_id and engine_endpoint are populated.
payload:
game_id: game-123
outcome: success
container_id: 7c2b5d1a4f6e
engine_endpoint: http://galaxy-game-game-123:8080
error_code: ""
error_message: ""
- name: imagePullFailed
summary: Failed start due to an image pull error.
payload:
game_id: game-789
outcome: failure
container_id: ""
engine_endpoint: ""
error_code: image_pull_failed
error_message: "manifest unknown"
- name: replayNoOp
summary: Idempotent replay; the job was a no-op.
payload:
game_id: game-123
outcome: success
container_id: 7c2b5d1a4f6e
engine_endpoint: http://galaxy-game-game-123:8080
error_code: replay_no_op
error_message: ""
schemas:
RuntimeStartJobPayload:
type: object
additionalProperties: false
required:
- game_id
- image_ref
- requested_at_ms
properties:
game_id:
type: string
description: Opaque stable game identifier owned by Lobby.
image_ref:
type: string
description: Docker reference resolved by Lobby from LOBBY_ENGINE_IMAGE_TEMPLATE.
requested_at_ms:
type: integer
format: int64
description: UTC milliseconds; used for diagnostics, not authoritative.
RuntimeStopJobPayload:
type: object
additionalProperties: false
required:
- game_id
- reason
- requested_at_ms
properties:
game_id:
type: string
description: Opaque stable game identifier owned by Lobby.
reason:
$ref: '#/components/schemas/StopReason'
requested_at_ms:
type: integer
format: int64
description: UTC milliseconds; used for diagnostics, not authoritative.
RuntimeJobResultPayload:
type: object
additionalProperties: false
required:
- game_id
- outcome
- container_id
- engine_endpoint
- error_code
- error_message
properties:
game_id:
type: string
description: Opaque stable game identifier matching the originating job.
outcome:
type: string
enum:
- success
- failure
description: High-level outcome of the runtime job.
container_id:
type: string
description: Docker container id of the engine; populated on success, empty on failure.
engine_endpoint:
type: string
description: Stable engine URL `http://galaxy-game-{game_id}:8080`; populated on success, empty on failure.
error_code:
$ref: '#/components/schemas/ErrorCode'
error_message:
type: string
description: Operator-readable detail; empty when not applicable.
StopReason:
type: string
enum:
- orphan_cleanup
- cancelled
- finished
- admin_request
- timeout
description: Reason value carried by every runtime:stop_jobs envelope.
ErrorCode:
type: string
enum:
- ""
- invalid_request
- not_found
- conflict
- service_unavailable
- internal_error
- image_pull_failed
- image_ref_not_semver
- semver_patch_only
- container_start_failed
- start_config_invalid
- docker_unavailable
- replay_no_op
description: |
Stable error code identical to the internal REST contract. The empty
string is a valid value for successful job results that did not
produce a code (the field is required to be present so consumers
can rely on the schema).
+236
View File
@@ -0,0 +1,236 @@
// Command jetgen regenerates the go-jet/v2 query-builder code under
// galaxy/rtmanager/internal/adapters/postgres/jet/ against a transient
// PostgreSQL instance.
//
// The program is intended to be invoked as `go run ./cmd/jetgen` (or via
// the `make jet` Makefile target) from within `galaxy/rtmanager`. It is
// not part of the runtime binary.
//
// Steps:
//
// 1. start a postgres:16-alpine container via testcontainers-go
// 2. open it through pkg/postgres as the superuser
// 3. CREATE ROLE rtmanagerservice and CREATE SCHEMA "rtmanager"
// AUTHORIZATION rtmanagerservice
// 4. open a second pool as rtmanagerservice with search_path=rtmanager
// and apply the embedded goose migrations
// 5. run jet's PostgreSQL generator against schema=rtmanager, writing
// into ../internal/adapters/postgres/jet
package main
import (
"context"
"errors"
"fmt"
"log"
"net/url"
"os"
"path/filepath"
"runtime"
"time"
"galaxy/postgres"
"galaxy/rtmanager/internal/adapters/postgres/migrations"
jetpostgres "github.com/go-jet/jet/v2/generator/postgres"
testcontainers "github.com/testcontainers/testcontainers-go"
tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
"github.com/testcontainers/testcontainers-go/wait"
)
const (
postgresImage = "postgres:16-alpine"
superuserName = "galaxy"
superuserPassword = "galaxy"
superuserDatabase = "galaxy_rtmanager"
serviceRole = "rtmanagerservice"
servicePassword = "rtmanagerservice"
serviceSchema = "rtmanager"
containerStartup = 90 * time.Second
defaultOpTimeout = 10 * time.Second
jetOutputDirSuffix = "internal/adapters/postgres/jet"
)
func main() {
if err := run(context.Background()); err != nil {
log.Fatalf("jetgen: %v", err)
}
}
func run(ctx context.Context) error {
outputDir, err := jetOutputDir()
if err != nil {
return err
}
container, err := tcpostgres.Run(ctx, postgresImage,
tcpostgres.WithDatabase(superuserDatabase),
tcpostgres.WithUsername(superuserName),
tcpostgres.WithPassword(superuserPassword),
testcontainers.WithWaitStrategy(
wait.ForLog("database system is ready to accept connections").
WithOccurrence(2).
WithStartupTimeout(containerStartup),
),
)
if err != nil {
return fmt.Errorf("start postgres container: %w", err)
}
defer func() {
if termErr := testcontainers.TerminateContainer(container); termErr != nil {
log.Printf("jetgen: terminate container: %v", termErr)
}
}()
baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
if err != nil {
return fmt.Errorf("resolve container dsn: %w", err)
}
if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
return err
}
scopedDSN, err := dsnForServiceRole(baseDSN)
if err != nil {
return err
}
if err := applyMigrations(ctx, scopedDSN); err != nil {
return err
}
if err := os.RemoveAll(outputDir); err != nil {
return fmt.Errorf("remove existing jet output %q: %w", outputDir, err)
}
if err := os.MkdirAll(filepath.Dir(outputDir), 0o755); err != nil {
return fmt.Errorf("ensure jet output parent: %w", err)
}
jetCfg := postgres.DefaultConfig()
jetCfg.PrimaryDSN = scopedDSN
jetCfg.OperationTimeout = defaultOpTimeout
jetDB, err := postgres.OpenPrimary(ctx, jetCfg)
if err != nil {
return fmt.Errorf("open scoped pool for jet generation: %w", err)
}
defer func() { _ = jetDB.Close() }()
if err := jetpostgres.GenerateDB(jetDB, serviceSchema, outputDir); err != nil {
return fmt.Errorf("jet generate: %w", err)
}
log.Printf("jetgen: generated jet code into %s (schema=%s)", outputDir, serviceSchema)
return nil
}
func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = baseDSN
cfg.OperationTimeout = defaultOpTimeout
db, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
return fmt.Errorf("open admin pool: %w", err)
}
defer func() { _ = db.Close() }()
statements := []string{
fmt.Sprintf(`DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = %s) THEN
CREATE ROLE %s LOGIN PASSWORD %s;
END IF;
END $$;`, sqlLiteral(serviceRole), sqlIdentifier(serviceRole), sqlLiteral(servicePassword)),
fmt.Sprintf(`CREATE SCHEMA IF NOT EXISTS %s AUTHORIZATION %s;`,
sqlIdentifier(serviceSchema), sqlIdentifier(serviceRole)),
fmt.Sprintf(`GRANT USAGE ON SCHEMA %s TO %s;`,
sqlIdentifier(serviceSchema), sqlIdentifier(serviceRole)),
}
for _, statement := range statements {
if _, err := db.ExecContext(ctx, statement); err != nil {
return fmt.Errorf("provision %q/%q: %w", serviceSchema, serviceRole, err)
}
}
return nil
}
func dsnForServiceRole(baseDSN string) (string, error) {
parsed, err := url.Parse(baseDSN)
if err != nil {
return "", fmt.Errorf("parse base dsn: %w", err)
}
values := url.Values{}
values.Set("search_path", serviceSchema)
values.Set("sslmode", "disable")
scoped := url.URL{
Scheme: parsed.Scheme,
User: url.UserPassword(serviceRole, servicePassword),
Host: parsed.Host,
Path: parsed.Path,
RawQuery: values.Encode(),
}
return scoped.String(), nil
}
func applyMigrations(ctx context.Context, dsn string) error {
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = dsn
cfg.OperationTimeout = defaultOpTimeout
db, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
return fmt.Errorf("open scoped pool: %w", err)
}
defer func() { _ = db.Close() }()
if err := postgres.Ping(ctx, db, defaultOpTimeout); err != nil {
return err
}
if err := postgres.RunMigrations(ctx, db, migrations.FS(), "."); err != nil {
return fmt.Errorf("run migrations: %w", err)
}
return nil
}
// jetOutputDir returns the absolute path that jet should write into. We
// rely on the runtime caller info to anchor it to galaxy/rtmanager
// regardless of the invoking working directory.
func jetOutputDir() (string, error) {
_, file, _, ok := runtime.Caller(0)
if !ok {
return "", errors.New("resolve runtime caller for jet output path")
}
dir := filepath.Dir(file)
// dir = .../galaxy/rtmanager/cmd/jetgen
moduleRoot := filepath.Clean(filepath.Join(dir, "..", ".."))
return filepath.Join(moduleRoot, jetOutputDirSuffix), nil
}
func sqlIdentifier(name string) string {
return `"` + escapeDoubleQuotes(name) + `"`
}
func sqlLiteral(value string) string {
return "'" + escapeSingleQuotes(value) + "'"
}
func escapeDoubleQuotes(value string) string {
out := make([]byte, 0, len(value))
for index := 0; index < len(value); index++ {
if value[index] == '"' {
out = append(out, '"', '"')
continue
}
out = append(out, value[index])
}
return string(out)
}
func escapeSingleQuotes(value string) string {
out := make([]byte, 0, len(value))
for index := 0; index < len(value); index++ {
if value[index] == '\'' {
out = append(out, '\'', '\'')
continue
}
out = append(out, value[index])
}
return string(out)
}
+47
View File
@@ -0,0 +1,47 @@
// Binary rtmanager is the runnable Runtime Manager Service process
// entrypoint.
package main
import (
"context"
"fmt"
"os"
"os/signal"
"syscall"
"galaxy/rtmanager/internal/app"
"galaxy/rtmanager/internal/config"
"galaxy/rtmanager/internal/logging"
)
func main() {
if err := run(); err != nil {
_, _ = fmt.Fprintf(os.Stderr, "rtmanager: %v\n", err)
os.Exit(1)
}
}
func run() error {
cfg, err := config.LoadFromEnv()
if err != nil {
return err
}
logger, err := logging.New(cfg.Logging.Level)
if err != nil {
return err
}
rootCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()
runtime, err := app.NewRuntime(rootCtx, cfg, logger)
if err != nil {
return err
}
defer func() {
_ = runtime.Close()
}()
return runtime.Run(rootCtx)
}
+392
View File
@@ -0,0 +1,392 @@
package rtmanager
import (
"os"
"path/filepath"
"runtime"
"testing"
"github.com/stretchr/testify/require"
"gopkg.in/yaml.v3"
)
var expectedStopReasonEnum = []string{
"orphan_cleanup",
"cancelled",
"finished",
"admin_request",
"timeout",
}
var expectedJobResultErrorCodeEnum = []string{
"",
"invalid_request",
"not_found",
"conflict",
"service_unavailable",
"internal_error",
"image_pull_failed",
"image_ref_not_semver",
"semver_patch_only",
"container_start_failed",
"start_config_invalid",
"docker_unavailable",
"replay_no_op",
}
var expectedHealthEventTypeEnum = []string{
"container_started",
"container_exited",
"container_oom",
"container_disappeared",
"inspect_unhealthy",
"probe_failed",
"probe_recovered",
}
var expectedHealthDetailsBranches = []struct {
schema string
required []string
}{
{schema: "ContainerStartedDetails", required: []string{"image_ref"}},
{schema: "ContainerExitedDetails", required: []string{"exit_code", "oom"}},
{schema: "ContainerOomDetails", required: []string{"exit_code"}},
{schema: "ContainerDisappearedDetails", required: nil},
{schema: "InspectUnhealthyDetails", required: []string{"restart_count", "state", "health"}},
{schema: "ProbeFailedDetails", required: []string{"consecutive_failures", "last_status", "last_error"}},
{schema: "ProbeRecoveredDetails", required: []string{"prior_failure_count"}},
}
func TestRuntimeJobsAsyncAPISpecLoads(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
require.Equal(t, "3.1.0", getStringValue(t, doc, "asyncapi"))
}
func TestRuntimeJobsSpecFreezesChannelAddresses(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
channels := getMapValue(t, doc, "channels")
require.Equal(t, "runtime:start_jobs",
getStringValue(t, getMapValue(t, channels, "startJobs"), "address"))
require.Equal(t, "runtime:stop_jobs",
getStringValue(t, getMapValue(t, channels, "stopJobs"), "address"))
require.Equal(t, "runtime:job_results",
getStringValue(t, getMapValue(t, channels, "jobResults"), "address"))
}
func TestRuntimeJobsSpecFreezesOperationActions(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
operations := getMapValue(t, doc, "operations")
cases := []struct {
operation string
action string
channel string
}{
{operation: "consumeStartJob", action: "receive", channel: "#/channels/startJobs"},
{operation: "consumeStopJob", action: "receive", channel: "#/channels/stopJobs"},
{operation: "publishJobResult", action: "send", channel: "#/channels/jobResults"},
}
for _, tc := range cases {
t.Run(tc.operation, func(t *testing.T) {
t.Parallel()
op := getMapValue(t, operations, tc.operation)
require.Equal(t, tc.action, getStringValue(t, op, "action"))
require.Equal(t, tc.channel,
getStringValue(t, getMapValue(t, op, "channel"), "$ref"))
})
}
}
func TestRuntimeJobsSpecFreezesMessageNames(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
messages := getMapValue(t, doc, "components", "messages")
for _, name := range []string{"RuntimeStartJob", "RuntimeStopJob", "RuntimeJobResult"} {
t.Run(name, func(t *testing.T) {
t.Parallel()
message := getMapValue(t, messages, name)
require.Equal(t, name, getStringValue(t, message, "name"))
})
}
}
func TestRuntimeJobsSpecFreezesStartJobPayload(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
payload := getMapValue(t, doc, "components", "schemas", "RuntimeStartJobPayload")
require.ElementsMatch(t,
[]string{"game_id", "image_ref", "requested_at_ms"},
getStringSlice(t, payload, "required"))
require.False(t, getBoolValue(t, payload, "additionalProperties"),
"RuntimeStartJobPayload must reject unknown fields")
}
func TestRuntimeJobsSpecFreezesStopJobPayload(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
payload := getMapValue(t, doc, "components", "schemas", "RuntimeStopJobPayload")
require.ElementsMatch(t,
[]string{"game_id", "reason", "requested_at_ms"},
getStringSlice(t, payload, "required"))
require.False(t, getBoolValue(t, payload, "additionalProperties"),
"RuntimeStopJobPayload must reject unknown fields")
reason := getMapValue(t, payload, "properties", "reason")
require.Equal(t, "#/components/schemas/StopReason",
getStringValue(t, reason, "$ref"),
"RuntimeStopJobPayload.reason must reference StopReason")
stopReason := getMapValue(t, doc, "components", "schemas", "StopReason")
require.ElementsMatch(t, expectedStopReasonEnum,
getStringSlice(t, stopReason, "enum"))
}
func TestRuntimeJobsSpecFreezesJobResultPayload(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
payload := getMapValue(t, doc, "components", "schemas", "RuntimeJobResultPayload")
require.ElementsMatch(t,
[]string{"game_id", "outcome", "container_id", "engine_endpoint", "error_code", "error_message"},
getStringSlice(t, payload, "required"))
require.False(t, getBoolValue(t, payload, "additionalProperties"),
"RuntimeJobResultPayload must reject unknown fields")
outcome := getMapValue(t, payload, "properties", "outcome")
require.ElementsMatch(t, []string{"success", "failure"},
getStringSlice(t, outcome, "enum"))
errorCode := getMapValue(t, payload, "properties", "error_code")
require.Equal(t, "#/components/schemas/ErrorCode",
getStringValue(t, errorCode, "$ref"),
"RuntimeJobResultPayload.error_code must reference ErrorCode")
errorCodeSchema := getMapValue(t, doc, "components", "schemas", "ErrorCode")
require.ElementsMatch(t, expectedJobResultErrorCodeEnum,
getStringSlice(t, errorCodeSchema, "enum"))
}
func TestRuntimeHealthAsyncAPISpecLoads(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
require.Equal(t, "3.1.0", getStringValue(t, doc, "asyncapi"))
}
func TestRuntimeHealthSpecFreezesChannelAndOperation(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
channel := getMapValue(t, doc, "channels", "healthEvents")
require.Equal(t, "runtime:health_events", getStringValue(t, channel, "address"))
operation := getMapValue(t, doc, "operations", "publishHealthEvent")
require.Equal(t, "send", getStringValue(t, operation, "action"))
require.Equal(t, "#/channels/healthEvents",
getStringValue(t, getMapValue(t, operation, "channel"), "$ref"))
message := getMapValue(t, doc, "components", "messages", "RuntimeHealthEvent")
require.Equal(t, "RuntimeHealthEvent", getStringValue(t, message, "name"))
}
func TestRuntimeHealthSpecFreezesEnvelope(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
payload := getMapValue(t, doc, "components", "schemas", "RuntimeHealthEventPayload")
require.ElementsMatch(t,
[]string{"game_id", "container_id", "event_type", "occurred_at_ms", "details"},
getStringSlice(t, payload, "required"))
require.False(t, getBoolValue(t, payload, "additionalProperties"),
"RuntimeHealthEventPayload must reject unknown fields")
eventType := getMapValue(t, payload, "properties", "event_type")
require.Equal(t, "#/components/schemas/EventType",
getStringValue(t, eventType, "$ref"),
"RuntimeHealthEventPayload.event_type must reference EventType")
}
func TestRuntimeHealthSpecFreezesEventTypeEnum(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
schema := getMapValue(t, doc, "components", "schemas", "EventType")
require.ElementsMatch(t, expectedHealthEventTypeEnum,
getStringSlice(t, schema, "enum"))
}
func TestRuntimeHealthSpecFreezesDetailsOneOfBranches(t *testing.T) {
t.Parallel()
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
details := getMapValue(t, doc, "components", "schemas", "RuntimeHealthEventPayload",
"properties", "details")
branches := getSliceValue(t, details, "oneOf")
require.Lenf(t, branches, len(expectedHealthDetailsBranches),
"details.oneOf must have %d branches", len(expectedHealthDetailsBranches))
gotRefs := make([]string, 0, len(branches))
for _, raw := range branches {
branch, ok := raw.(map[string]any)
require.True(t, ok, "details.oneOf entry must be a mapping")
gotRefs = append(gotRefs, getStringValue(t, branch, "$ref"))
}
wantRefs := make([]string, 0, len(expectedHealthDetailsBranches))
for _, branch := range expectedHealthDetailsBranches {
wantRefs = append(wantRefs, "#/components/schemas/"+branch.schema)
}
require.ElementsMatch(t, wantRefs, gotRefs)
for _, branch := range expectedHealthDetailsBranches {
t.Run(branch.schema, func(t *testing.T) {
t.Parallel()
schema := getMapValue(t, doc, "components", "schemas", branch.schema)
require.False(t, getBoolValue(t, schema, "additionalProperties"),
"%s must reject unknown fields", branch.schema)
if branch.required == nil {
_, hasRequired := schema["required"]
require.False(t, hasRequired,
"%s must not declare required fields", branch.schema)
return
}
require.ElementsMatch(t, branch.required,
getStringSlice(t, schema, "required"))
})
}
}
func loadAsyncAPISpec(t *testing.T, relativePath string) map[string]any {
t.Helper()
payload := loadTextFile(t, relativePath)
var doc map[string]any
if err := yaml.Unmarshal([]byte(payload), &doc); err != nil {
require.Failf(t, "test failed", "decode spec: %v", err)
}
return doc
}
func loadTextFile(t *testing.T, relativePath string) string {
t.Helper()
path := filepath.Join(moduleRoot(t), relativePath)
payload, err := os.ReadFile(path)
if err != nil {
require.Failf(t, "test failed", "read file %s: %v", path, err)
}
return string(payload)
}
func moduleRoot(t *testing.T) string {
t.Helper()
_, thisFile, _, ok := runtime.Caller(0)
if !ok {
require.FailNow(t, "runtime.Caller failed")
}
return filepath.Dir(thisFile)
}
func getMapValue(t *testing.T, value map[string]any, path ...string) map[string]any {
t.Helper()
current := value
for _, segment := range path {
raw, ok := current[segment]
if !ok {
require.Failf(t, "test failed", "missing map key %s", segment)
}
next, ok := raw.(map[string]any)
if !ok {
require.Failf(t, "test failed", "value at %s is not a map", segment)
}
current = next
}
return current
}
func getStringValue(t *testing.T, value map[string]any, key string) string {
t.Helper()
raw, ok := value[key]
if !ok {
require.Failf(t, "test failed", "missing key %s", key)
}
result, ok := raw.(string)
if !ok {
require.Failf(t, "test failed", "value at %s is not a string", key)
}
return result
}
func getBoolValue(t *testing.T, value map[string]any, key string) bool {
t.Helper()
raw, ok := value[key]
if !ok {
require.Failf(t, "test failed", "missing key %s", key)
}
result, ok := raw.(bool)
if !ok {
require.Failf(t, "test failed", "value at %s is not a bool", key)
}
return result
}
func getStringSlice(t *testing.T, value map[string]any, key string) []string {
t.Helper()
raw := getSliceValue(t, value, key)
result := make([]string, 0, len(raw))
for _, item := range raw {
text, ok := item.(string)
if !ok {
require.Failf(t, "test failed", "value at %s is not a string slice", key)
}
result = append(result, text)
}
return result
}
func getSliceValue(t *testing.T, value map[string]any, key string) []any {
t.Helper()
raw, ok := value[key]
if !ok {
require.Failf(t, "test failed", "missing key %s", key)
}
result, ok := raw.([]any)
if !ok {
require.Failf(t, "test failed", "value at %s is not a slice", key)
}
return result
}
+384
View File
@@ -0,0 +1,384 @@
package rtmanager
import (
"context"
"net/http"
"path/filepath"
"runtime"
"testing"
"github.com/getkin/kin-openapi/openapi3"
"github.com/stretchr/testify/require"
)
// TestInternalOpenAPISpecValidates loads internal-openapi.yaml and verifies
// it is a syntactically valid OpenAPI 3.0 document.
func TestInternalOpenAPISpecValidates(t *testing.T) {
t.Parallel()
loadInternalOpenAPISpec(t)
}
// TestInternalSpecFreezesOperationIDs verifies that every documented
// endpoint declares the exact operationId required by the Runtime Manager
// internal contract. Missing or renamed operationIds break the contract
// for Game Master and Admin Service.
func TestInternalSpecFreezesOperationIDs(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
cases := []struct {
method string
path string
operationID string
}{
{http.MethodGet, "/healthz", "internalHealthz"},
{http.MethodGet, "/readyz", "internalReadyz"},
{http.MethodGet, "/api/v1/internal/runtimes", "internalListRuntimes"},
{http.MethodGet, "/api/v1/internal/runtimes/{game_id}", "internalGetRuntime"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/start", "internalStartRuntime"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/stop", "internalStopRuntime"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/restart", "internalRestartRuntime"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/patch", "internalPatchRuntime"},
{http.MethodDelete, "/api/v1/internal/runtimes/{game_id}/container", "internalCleanupRuntimeContainer"},
}
for _, tc := range cases {
t.Run(tc.operationID, func(t *testing.T) {
t.Parallel()
op := getOperation(t, doc, tc.path, tc.method)
require.Equal(t, tc.operationID, op.OperationID)
})
}
}
// TestInternalSpecFreezesRuntimeRecordSchema verifies that RuntimeRecord
// declares the required field set documented in
// rtmanager/README.md §Persistence Layout, with the status enum frozen.
func TestInternalSpecFreezesRuntimeRecordSchema(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "RuntimeRecord")
assertRequiredFields(t, schema,
"game_id", "status", "state_path", "docker_network",
"last_op_at", "created_at",
)
for _, optional := range []string{
"current_container_id", "current_image_ref", "engine_endpoint",
"started_at", "stopped_at", "removed_at",
} {
require.Contains(t, schema.Value.Properties, optional,
"RuntimeRecord.%s must be present in properties", optional)
}
assertStringEnum(t, schema, "status", "running", "stopped", "removed")
}
// TestInternalSpecFreezesStartRequest verifies that StartRequest requires
// only image_ref and rejects unknown fields.
func TestInternalSpecFreezesStartRequest(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "StartRequest")
assertRequiredFields(t, schema, "image_ref")
require.NotNil(t, schema.Value.AdditionalProperties.Has)
require.False(t, *schema.Value.AdditionalProperties.Has,
"StartRequest must reject unknown fields")
}
// TestInternalSpecFreezesStopRequest verifies that StopRequest requires
// only reason, that reason references the StopReason schema, and that
// unknown fields are rejected.
func TestInternalSpecFreezesStopRequest(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "StopRequest")
assertRequiredFields(t, schema, "reason")
require.NotNil(t, schema.Value.AdditionalProperties.Has)
require.False(t, *schema.Value.AdditionalProperties.Has,
"StopRequest must reject unknown fields")
reason := schema.Value.Properties["reason"]
require.NotNil(t, reason, "StopRequest.reason must be present")
require.Equal(t, "#/components/schemas/StopReason", reason.Ref,
"StopRequest.reason must reference StopReason")
}
// TestInternalSpecFreezesPatchRequest verifies that PatchRequest requires
// only image_ref and rejects unknown fields.
func TestInternalSpecFreezesPatchRequest(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "PatchRequest")
assertRequiredFields(t, schema, "image_ref")
require.NotNil(t, schema.Value.AdditionalProperties.Has)
require.False(t, *schema.Value.AdditionalProperties.Has,
"PatchRequest must reject unknown fields")
}
// TestInternalSpecFreezesStopReasonEnum verifies that the stop reason enum
// matches the contract recorded in
// rtmanager/README.md §Async Stream Contracts.
func TestInternalSpecFreezesStopReasonEnum(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "StopReason")
got := make([]string, 0, len(schema.Value.Enum))
for _, value := range schema.Value.Enum {
got = append(got, value.(string))
}
require.ElementsMatch(t,
[]string{"orphan_cleanup", "cancelled", "finished", "admin_request", "timeout"},
got)
}
// TestInternalSpecFreezesErrorCodeCatalog verifies that ErrorCode contains
// every stable code declared in rtmanager/README.md §Error Model.
func TestInternalSpecFreezesErrorCodeCatalog(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "ErrorCode")
got := make([]string, 0, len(schema.Value.Enum))
for _, value := range schema.Value.Enum {
got = append(got, value.(string))
}
require.ElementsMatch(t,
[]string{
"invalid_request",
"not_found",
"conflict",
"service_unavailable",
"internal_error",
"image_pull_failed",
"image_ref_not_semver",
"semver_patch_only",
"container_start_failed",
"start_config_invalid",
"docker_unavailable",
"replay_no_op",
},
got)
}
// TestInternalSpecFreezesErrorEnvelope verifies that ErrorResponse uses the
// `{ "error": { "code", "message" } }` shape and that error.code references
// the ErrorCode enum.
func TestInternalSpecFreezesErrorEnvelope(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
envelope := componentSchemaRef(t, doc, "ErrorResponse")
assertRequiredFields(t, envelope, "error")
require.Equal(t, "#/components/schemas/ErrorBody",
envelope.Value.Properties["error"].Ref,
"ErrorResponse.error must reference ErrorBody")
body := componentSchemaRef(t, doc, "ErrorBody")
assertRequiredFields(t, body, "code", "message")
require.Equal(t, "#/components/schemas/ErrorCode",
body.Value.Properties["code"].Ref,
"ErrorBody.code must reference ErrorCode")
require.Equal(t, "string",
body.Value.Properties["message"].Value.Type.Slice()[0],
"ErrorBody.message must be a string")
}
// TestInternalSpecFreezesProbeResponses verifies that /healthz returns 200
// with the probe payload and /readyz declares both 200 and 503.
func TestInternalSpecFreezesProbeResponses(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
healthz := getOperation(t, doc, "/healthz", http.MethodGet)
assertSchemaRef(t, responseSchemaRef(t, healthz, http.StatusOK),
"#/components/schemas/ProbeResponse", "internalHealthz 200")
readyz := getOperation(t, doc, "/readyz", http.MethodGet)
assertSchemaRef(t, responseSchemaRef(t, readyz, http.StatusOK),
"#/components/schemas/ProbeResponse", "internalReadyz 200")
require.NotNil(t, readyz.Responses.Status(http.StatusServiceUnavailable),
"internalReadyz must declare a 503 response")
}
// TestInternalSpecFreezesXGalaxyCallerHeader verifies that the optional
// X-Galaxy-Caller header parameter is declared and referenced from every
// runtime operation. Removing the parameter or detaching it from any of
// the seven runtime endpoints would silently drop the only signal RTM
// uses to distinguish gm_rest from admin_rest in operation_log.
func TestInternalSpecFreezesXGalaxyCallerHeader(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
param := doc.Components.Parameters["XGalaxyCallerHeader"]
require.NotNil(t, param, "XGalaxyCallerHeader parameter must be declared")
require.NotNil(t, param.Value, "XGalaxyCallerHeader parameter must have a value")
require.Equal(t, "header", param.Value.In)
require.Equal(t, "X-Galaxy-Caller", param.Value.Name)
require.False(t, param.Value.Required, "X-Galaxy-Caller must be optional")
enum := param.Value.Schema.Value.Enum
got := make([]string, 0, len(enum))
for _, value := range enum {
got = append(got, value.(string))
}
require.ElementsMatch(t, []string{"gm", "admin"}, got)
runtimeOps := []struct {
method string
path string
}{
{http.MethodGet, "/api/v1/internal/runtimes"},
{http.MethodGet, "/api/v1/internal/runtimes/{game_id}"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/start"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/stop"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/restart"},
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/patch"},
{http.MethodDelete, "/api/v1/internal/runtimes/{game_id}/container"},
}
for _, rop := range runtimeOps {
t.Run(rop.method+" "+rop.path, func(t *testing.T) {
t.Parallel()
op := getOperation(t, doc, rop.path, rop.method)
found := false
for _, ref := range op.Parameters {
if ref.Ref == "#/components/parameters/XGalaxyCallerHeader" {
found = true
break
}
}
require.Truef(t, found,
"%s %s must reference XGalaxyCallerHeader", rop.method, rop.path)
})
}
}
// TestInternalSpecFreezesRuntimesListShape verifies that the list endpoint
// returns the items envelope expected by callers.
func TestInternalSpecFreezesRuntimesListShape(t *testing.T) {
t.Parallel()
doc := loadInternalOpenAPISpec(t)
schema := componentSchemaRef(t, doc, "RuntimesList")
assertRequiredFields(t, schema, "items")
items := schema.Value.Properties["items"]
require.NotNil(t, items, "RuntimesList.items must be declared")
require.Equal(t, "#/components/schemas/RuntimeRecord", items.Value.Items.Ref,
"RuntimesList.items[] must reference RuntimeRecord")
}
func loadInternalOpenAPISpec(t *testing.T) *openapi3.T {
t.Helper()
_, thisFile, _, ok := runtime.Caller(0)
if !ok {
require.FailNow(t, "runtime.Caller failed")
}
specPath := filepath.Join(filepath.Dir(thisFile), "api", "internal-openapi.yaml")
loader := openapi3.NewLoader()
doc, err := loader.LoadFromFile(specPath)
if err != nil {
require.Failf(t, "test failed", "load spec %s: %v", specPath, err)
}
if doc == nil {
require.Failf(t, "test failed", "load spec %s: returned nil document", specPath)
}
if err := doc.Validate(context.Background()); err != nil {
require.Failf(t, "test failed", "validate spec %s: %v", specPath, err)
}
return doc
}
func getOperation(t *testing.T, doc *openapi3.T, path, method string) *openapi3.Operation {
t.Helper()
if doc.Paths == nil {
require.FailNow(t, "spec is missing paths")
}
pathItem := doc.Paths.Value(path)
if pathItem == nil {
require.Failf(t, "test failed", "spec is missing path %s", path)
}
op := pathItem.GetOperation(method)
if op == nil {
require.Failf(t, "test failed", "spec is missing %s operation for path %s", method, path)
}
return op
}
func responseSchemaRef(t *testing.T, op *openapi3.Operation, status int) *openapi3.SchemaRef {
t.Helper()
ref := op.Responses.Status(status)
if ref == nil || ref.Value == nil {
require.Failf(t, "test failed", "operation is missing %d response", status)
}
mt := ref.Value.Content.Get("application/json")
if mt == nil || mt.Schema == nil {
require.Failf(t, "test failed", "operation is missing application/json schema for %d response", status)
}
return mt.Schema
}
func componentSchemaRef(t *testing.T, doc *openapi3.T, name string) *openapi3.SchemaRef {
t.Helper()
if doc.Components.Schemas == nil {
require.FailNow(t, "spec is missing component schemas")
}
ref := doc.Components.Schemas[name]
if ref == nil {
require.Failf(t, "test failed", "spec is missing component schema %s", name)
}
return ref
}
func assertSchemaRef(t *testing.T, schemaRef *openapi3.SchemaRef, want, name string) {
t.Helper()
require.NotNil(t, schemaRef, "%s schema ref", name)
require.Equal(t, want, schemaRef.Ref, "%s schema ref", name)
}
func assertRequiredFields(t *testing.T, schemaRef *openapi3.SchemaRef, fields ...string) {
t.Helper()
require.NotNil(t, schemaRef)
require.ElementsMatch(t, fields, schemaRef.Value.Required)
}
func assertStringEnum(t *testing.T, schemaRef *openapi3.SchemaRef, property string, values ...string) {
t.Helper()
require.NotNil(t, schemaRef)
propRef := schemaRef.Value.Properties[property]
require.NotNil(t, propRef, "schema property %s", property)
got := make([]string, 0, len(propRef.Value.Enum))
for _, v := range propRef.Value.Enum {
got = append(got, v.(string))
}
require.ElementsMatch(t, values, got)
}
+44
View File
@@ -0,0 +1,44 @@
# Runtime Manager — Service-Local Documentation
This directory hosts the service-local documentation for `Runtime
Manager`. The top-level [`../README.md`](../README.md) describes the
current-state contract (purpose, scope, lifecycles, surfaces,
configuration, observability); the documents below complement it with
focused content docs and design-rationale records.
## Content docs
- [Runtime and components](runtime.md) — process diagram, listeners,
workers, lifecycle services, stream offsets, configuration groups,
runtime invariants.
- [Flows](flows.md) — mermaid sequence diagrams for the lifecycle and
observability flows.
- [Operator runbook](runbook.md) — startup, readiness, shutdown, and
recovery scenarios.
- [Configuration and contract examples](examples.md) — `.env`,
REST request bodies, stream payloads, storage inspection snippets.
## Design rationale
- [PostgreSQL schema decisions](postgres-migration.md) — the schema
decision record consolidating the persistence-layer agreements
(tables, indexes, CAS shape, `created_at` preservation, jsonb
round-trip, schema/role provisioning split).
- [Domain and ports](domain-and-ports.md) — string-typed enums, the
four allowed runtime transitions, why `Inspect` splits into
`InspectImage` / `InspectContainer`, why `LobbyGameRecord` is
minimal, and other domain-layer choices.
- [Adapters](adapters.md) — Docker SDK adapter, Lobby internal HTTP
client, the three Redis publishers, the `mockgen` convention for
wide ports, and the unit-test strategy for HTTP-backed adapters.
- [Lifecycle services](services.md) — per-game lease semantics, the
`Result`-shaped contract, failure-mode tables, the lease-bypass
`Run` method on inner services, the `X-Galaxy-Caller` header
convention, and the canonical error code → HTTP status mapping.
- [Background workers](workers.md) — single-ownership table per
`event_type`, `container_disappeared` suppression rules, probe
hysteresis, the events listener reconnect policy, the reconciler's
per-game lease and three drift kinds.
- [Service-local integration suite](integration-tests.md) — the
`integration` build tag, the in-process `app.NewRuntime` choice,
the Lobby HTTP stub, and the test isolation strategy.
+192
View File
@@ -0,0 +1,192 @@
# Adapters
This document explains why the production adapters under
[`../internal/adapters/`](../internal/adapters) — Docker SDK,
Lobby internal HTTP client, notification-intent publisher, health-event
publisher, job-result publisher — are shaped the way they are. The
PostgreSQL stores and the Redis-coordination adapters live in
[`postgres-migration.md`](postgres-migration.md).
## 1. `mockgen` is the repo-wide convention for wide ports
The Docker port has nine methods plus eight value types in the
signatures, and most lifecycle flows (start, stop, restart, patch,
cleanup, reconcile, events, probe) exercise nearly every method. A
hand-rolled fake would either miss methods or balloon into a per-test
fixture.
`internal/adapters/docker/` therefore uses `go.uber.org/mock` mocks:
- `//go:generate` directives live next to the interface declaration in
`internal/ports/dockerclient.go`;
- generated code is committed under `internal/adapters/docker/mocks/`
(matching the `internal/adapters/postgres/jet/` discipline);
- `make -C rtmanager mocks` is the single command operators run after
a port-signature change.
The maintained `go.uber.org/mock` fork is preferred over the archived
`github.com/golang/mock`. This convention applies to wide / recorder
ports across the repository — Lobby uses the same pipeline for its
narrow recorder ports (`RuntimeManager`, `IntentPublisher`,
`GMClient`, `UserService`); see
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for the cross-service
rule.
The other two RTM ports (`LobbyInternalClient`,
`NotificationIntentPublisher`) keep inline `_test.go` fakes: small
surfaces, easy to fake by hand inside a single test file when needed.
## 2. `EngineEndpoint` is built inside the Docker adapter
The engine port is fixed at `8080`. Pushing it into `RunSpec` would
force the start service to know an engine implementation detail;
pushing it into config would give operators a knob the engine
image would not honour anyway. The Docker adapter exposes
`EnginePort = 8080` as a package constant and constructs
`RunResult.EngineEndpoint = "http://" + spec.Hostname + ":8080"`
itself.
The adapter also leaves `container.Config.ExposedPorts` empty: RTM
never publishes ports to the host. The user-defined Docker bridge
network gives every container in the network DNS access to the engine
via `galaxy-game-{game_id}:8080`.
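
A sketch of the construction, assuming illustrative names
(`engineEndpoint`, `hostname`) rather than the adapter's exact
declarations:

```go
package docker

import "fmt"

// EnginePort is fixed by the engine image contract; it is neither a
// RunSpec field nor a config knob.
const EnginePort = 8080

// engineEndpoint builds the stable DNS-based URL on the user-defined
// bridge network; no host port is ever published.
func engineEndpoint(hostname string) string {
	return fmt.Sprintf("http://%s:%d", hostname, EnginePort)
}
```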
## 3. `Run` removes the container on `ContainerStart` failure
`README.md §Lifecycles → Start` requires no orphan to remain after a
failed start path. If `ContainerCreate` succeeds but `ContainerStart`
fails, the adapter calls `ContainerRemove(force=true)` inside a fresh
`context.Background()` (with a 10s timeout) so the cleanup runs even
when the original ctx is already cancelled. The cleanup is best-effort:
a remove failure is silently discarded because the original start
failure is the actionable error returned to the caller.
The alternative — leaving rollback to the start service — would either
duplicate the same code in every caller or invite a service that forgets
to do it. Centralising the rule in the adapter keeps the port contract
simple. The start service adds an additional rollback layer for the
post-`Run` `Upsert` failure path; see [`services.md`](services.md) §5.
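
A sketch of the rollback path, with hypothetical `create`, `start`,
and `remove` helpers standing in for the Docker SDK calls:

```go
func (a *Adapter) runContainer(ctx context.Context, spec RunSpec) (string, error) {
	id, err := a.create(ctx, spec)
	if err != nil {
		return "", fmt.Errorf("container create: %w", err)
	}
	if err := a.start(ctx, id); err != nil {
		// Best-effort removal on a fresh context so the cleanup runs
		// even when the caller's ctx is already cancelled; the remove
		// error is discarded because the start failure is the
		// actionable one.
		rmCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		_ = a.remove(rmCtx, id, true) // force
		return "", fmt.Errorf("container start: %w", err)
	}
	return id, nil
}
```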
## 4. `RunSpec.Cmd` is optional
`ports.RunSpec` exposes an optional `Cmd []string`. Production callers
leave it `nil` so the engine image's own `CMD` runs;
`internal/adapters/docker/smoke_test.go` uses it to drive
`["/bin/sh","-c","sleep 60"]` against `alpine:3.21`.
The alternative — building a dedicated test image with a pre-baked
`sleep` command — would require an extra `Dockerfile` under testdata
and a build step inside the smoke test. The single new field is
documented as optional and ignored when empty; production behaviour is
unchanged.
## 5. `EventsListen` filters at the adapter boundary
The Docker `/events` API accepts a `filters` query parameter, but the
daemon treats it as a hint, not a guarantee. The adapter therefore
double-checks at the boundary: only `Type == events.ContainerEventType`
messages are passed through to the typed `<-chan ports.DockerEvent`.
Doing the filter at the SDK level would still require a defensive
recheck on the consumer side; consolidating the check in the adapter
keeps the contract crisp and the consumer free of Docker-internal type
discriminants.
The decoded event copies the actor's full `Attributes` map into
`DockerEvent.Labels`. Docker mixes container labels and runtime
attributes (`exitCode`, `image`, `name`, etc.) in the same flat map;
RTM consumers filter by the `com.galaxy.` prefix when they care about
labels, and the adapter extracts `exitCode` separately for `die`
events.
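
A sketch of the boundary recheck; the `DockerEvent` shape here is
illustrative:

```go
import "github.com/docker/docker/api/types/events"

// forward passes only container-scoped messages into the typed channel.
func forward(msgs <-chan events.Message, out chan<- DockerEvent) {
	for msg := range msgs {
		// The daemon treats the /events filters parameter as a hint,
		// so the type check is repeated at the adapter boundary.
		if msg.Type != events.ContainerEventType {
			continue
		}
		out <- DockerEvent{
			ContainerID: msg.Actor.ID,
			Action:      string(msg.Action),
			// Container labels and runtime attributes arrive in one
			// flat map; consumers filter on the com.galaxy. prefix.
			Labels: msg.Actor.Attributes,
		}
	}
}
```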
## 6. Lobby HTTP client error mapping
`ports.LobbyInternalClient.GetGame` fixes:
- `200``LobbyGameRecord` decoded tolerantly (unknown fields
ignored);
- `404``ports.ErrLobbyGameNotFound`;
- transport, timeout, or any other non-2xx → `ports.ErrLobbyUnavailable`
wrapped with the original error so callers can `errors.Is` and still
log the cause.
The start service treats `ErrLobbyUnavailable` as recoverable: it
continues without the diagnostic data because the start envelope
already carries the only required field (`image_ref`). The client
mirrors `notification/internal/adapters/userservice/client.go`: cloned
`*http.Transport`, `otelhttp.NewTransport` wrap, per-request
`context.WithTimeout`, idempotent `Close()` releasing idle connections.
JSON decoding is tolerant: unknown fields in the success body do not
break the call, so additive changes to Lobby's `GameRecord` schema do
not require an RTM release.
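
In sketch form, assuming an illustrative `mapResponse` helper inside
the client (the decode-failure mapping shown here is also an
assumption):

```go
func mapResponse(resp *http.Response) (LobbyGameRecord, error) {
	switch resp.StatusCode {
	case http.StatusOK:
		var rec LobbyGameRecord
		// Tolerant decode: unknown fields in the body are ignored.
		if err := json.NewDecoder(resp.Body).Decode(&rec); err != nil {
			return LobbyGameRecord{}, fmt.Errorf("%w: decode: %w", ErrLobbyUnavailable, err)
		}
		return rec, nil
	case http.StatusNotFound:
		return LobbyGameRecord{}, ErrLobbyGameNotFound
	default:
		// Everything else maps to recoverable unavailability, wrapped
		// so callers can errors.Is and still log the cause.
		return LobbyGameRecord{}, fmt.Errorf("%w: unexpected status %d", ErrLobbyUnavailable, resp.StatusCode)
	}
}
```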
## 7. Notification publisher wrapper signature
The wrapper drops the entry id returned by
`notificationintent.Publisher.Publish` (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §7). The adapter is a
thin shim:
- `NewPublisher(cfg)` constructs the inner publisher and forwards
validation;
- `Publish(ctx, intent)` calls the inner publisher and discards the
entry id.
The compile-time assertion `var _ ports.NotificationIntentPublisher =
(*Publisher)(nil)` lives in `publisher.go`.
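
The shim in sketch form, assuming an `Intent` type from
`pkg/notificationintent`:

```go
func (p *Publisher) Publish(ctx context.Context, intent notificationintent.Intent) error {
	// The inner publisher returns the stream entry id; no RTM consumer
	// keys on it, so it is dropped at this boundary.
	_, err := p.inner.Publish(ctx, intent)
	return err
}
```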
## 8. Health-events publisher: snapshot upsert before stream XADD
Every emission goes through
`ports.HealthEventPublisher.Publish`, which both XADDs to
`runtime:health_events` and upserts `health_snapshots`. The snapshot
upsert runs **before** the XADD: a successful Publish always leaves
the snapshot store at least as fresh as the stream, and a partial
failure leaves the snapshot a best-effort lower bound. Reversing the
order would let consumers observe a stream entry whose
`health_snapshots` row reflects the prior observation — a misleading
inversion.
The `event_type → SnapshotStatus / SnapshotSource` mapping mirrors the
table in [`../README.md` §Health Monitoring](../README.md). In
particular, `container_started` collapses to `SnapshotStatusHealthy`
and `probe_recovered` does the same (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §4).
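
A sketch of the ordering, with illustrative store and field names:

```go
func (p *HealthPublisher) Publish(ctx context.Context, ev HealthEvent) error {
	// Snapshot first: a successful Publish always leaves the snapshot
	// store at least as fresh as the stream.
	if err := p.snapshots.Upsert(ctx, snapshotFrom(ev)); err != nil {
		return fmt.Errorf("upsert health snapshot: %w", err)
	}
	// XADD second: on partial failure the snapshot is a best-effort
	// lower bound rather than a misleading inversion.
	if err := p.stream.Add(ctx, ev); err != nil {
		return fmt.Errorf("publish health event: %w", err)
	}
	return nil
}
```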
## 9. Unit-test strategy
Both HTTP-backed adapters (Docker SDK, Lobby client) use
`httptest.Server` fixtures. The Docker SDK speaks HTTP under the hood
for both unix sockets and TCP, so adapter unit tests construct a
Docker client with `client.WithHost(server.URL)` and
`client.WithHTTPClient(server.Client())`, which lets table-driven
handlers fake every Docker API endpoint without touching the real
daemon. The Docker API version is pinned to `1.45`
(`client.WithVersion("1.45")`) so the URL prefix is stable across CI
machines whose daemon advertises a different default. Production
wiring (in `internal/app/bootstrap.go`) keeps API negotiation enabled.
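
A sketch of the test wiring; the handler body is illustrative:

```go
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	// With the version pinned, request paths are stable, e.g.
	// /v1.45/containers/{id}/json for an inspect call.
	_, _ = w.Write([]byte(`{}`))
}))
defer server.Close()

cli, err := client.NewClientWithOpts(
	client.WithHost(server.URL),            // point the SDK at the fixture
	client.WithHTTPClient(server.Client()), // reuse the fixture's client
	client.WithVersion("1.45"),             // no negotiation in unit tests
)
// cli now drives every Docker API call against the fake handlers.
```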
The notification publisher uses `miniredis` directly because the
adapter's only side effect is an `XADD`, which `miniredis` reproduces
faithfully and matches every other Galaxy intent test.
## 10. Docker smoke test
`internal/adapters/docker/smoke_test.go` runs on the default
`go test ./...` invocation and calls `t.Skip` unless the local daemon
is reachable (`/var/run/docker.sock` exists or `DOCKER_HOST` is set).
The covered sequence:
1. provision a temporary user-defined bridge network;
2. assert `EnsureNetwork` for present and missing names;
3. pull `alpine:3.21` (`PullPolicyIfMissing`);
4. subscribe to events;
5. run a sleep container with the full `RunSpec` field set;
6. observe a `start` event for the new container id;
7. inspect, stop, remove, and verify `ErrContainerNotFound` is
reported afterwards.
This is the production adapter's only end-to-end check that runs from
the default `go test` pass; the broader service-local integration
suite ([`integration-tests.md`](integration-tests.md)) is gated
behind `-tags=integration`.
+167
View File
@@ -0,0 +1,167 @@
# Domain and Ports
This document explains why the `rtmanager` domain layer
([`../internal/domain/`](../internal/domain)) and the port interfaces
([`../internal/ports/`](../internal/ports)) are shaped the way they are.
The current-state types and method signatures are the source of truth in
the code; this file records the rationale so future readers do not
re-litigate the same trade-offs.
For the surrounding behaviour see
[`../README.md`](../README.md), the SQL CHECK constraints in
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql),
the wire contracts under [`../api/`](../api), and
[`postgres-migration.md`](postgres-migration.md) for the persistence
layer.
## 1. String-typed status enums
`runtime.Status`, `operation.OpKind`, `operation.OpSource`,
`operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and
`health.SnapshotSource` are all `type X string`.
The string approach wins on three counts:
- the SQL CHECK constraints already store the values as `text`, so a
string domain type maps one-to-one with no codec layer;
- it matches Lobby (`game.Status`, `membership.Status`,
`application.Status`), so reviewers do not switch encoding mental
models when crossing service boundaries;
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8`
with stringer-generated names would pay a constant lookup and make raw
SQL columns harder to read in diagnostics.
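
A sketch of the pattern for one enum; the values mirror the `status`
CHECK constraint and the REST schema:

```go
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown is the whole invariant: a single switch, no codec layer, and
// raw SQL columns stay readable in diagnostics.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	}
	return false
}
```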
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`
The PostgreSQL columns are nullable. The domain model uses plain
`string` with empty == NULL and bridges the SQL nullability inside the
adapter. Pointer fields would force every consumer to dereference
defensively even though business logic rarely cares about the
NULL/empty distinction (removed records may legitimately carry either
form depending on whether the record passed through `stopped` first).
The adapter's job is to translate `sql.NullString``string`; the rest
of the codebase reads the field as a regular value.
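
The bridge in sketch form (helper names are illustrative):

```go
// fromNull collapses SQL NULL and empty string into one domain value.
func fromNull(ns sql.NullString) string {
	if !ns.Valid {
		return ""
	}
	return ns.String
}

// toNull maps the empty string back to SQL NULL on the write path.
func toNull(s string) sql.NullString {
	return sql.NullString{String: s, Valid: s != ""}
}
```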
## 3. `*time.Time` for nullable timestamps
`StartedAt`, `StoppedAt`, `RemovedAt` retain pointer types. `time.Time{}`
is a real, comparable value in Go (`IsZero` only reports the canonical
zero time); mixing "missing" and "set to UTC zero" through plain
`time.Time` would invite bugs. The jet-generated `model.RuntimeRecords`
already declares the same fields as `*time.Time`, so the domain type
aligns with the persistence type and the adapter does not re-shape
pointers.
## 4. `EventType` and `SnapshotStatus` are deliberately distinct
`runtime-health-asyncapi.yaml.EventType` enumerates seven values; the
SQL CHECK on `health_snapshots.status` enumerates six. The two sets
overlap but are not identical:
- `container_started` is an *event*; the snapshot collapses it to
`healthy` (a successful start is observed as the container being
live, not as an ongoing event);
- `probe_recovered` is an *event*; it does not become a snapshot row of
its own — the next inspect/probe overwrites the prior `probe_failed`
with `healthy`.
Modelling them as one shared enum would require a separate "event vs
snapshot" boolean and invite accidental mismatches. Two distinct types
with explicit `IsKnown` matrices keep each surface honest at compile
time.
## 5. `Inspect` split into `InspectImage` + `InspectContainer`
Two narrow methods replace a single polymorphic `Inspect`. The surface
RTM exercises has two shapes:
- the start service inspects the *image* by reference to read resource
limits from labels;
- the periodic inspect worker, the reconciler, and the events listener
inspect *containers* by id to read state, health, restart count, and
exit code.
The inputs differ (ref vs id), and the result types differ
(`ImageInspect.Labels` is the only field used at start time, while
`ContainerInspect` carries a dozen state fields). One polymorphic
method would either split internally on input type or return a tagged
union; either is messier than two narrow methods.
## 6. `LobbyGameRecord` is intentionally minimal
`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and
`TargetEngineVersion`. The fetch is classified as ancillary diagnostics
because the start envelope already carries the only required field
(`image_ref`).
Anything more would invite RTM consumers to depend on Lobby's schema in
ways that violate the "RTM never resolves engine versions" rule.
Future fields are additive: each new field is opt-in to the consumer
and does not break existing call sites. The minimalism is also a hedge
against schema drift — Lobby's `GameRecord` is large and changes more
often than RTM needs to track.
## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`
Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so
business workflows that key on it (idempotency keys, audit
correlation) can capture it. RTM publishes admin-only failure intents
where the entry id has no consumer — failing starts do not loop back
to RTM, and notification routing keys on the producer-supplied
`idempotency_key` rather than the stream id. The adapter wraps
`pkg/notificationintent.Publisher` and discards the entry id at the
wrapper boundary.
## 8. Exactly four allowed runtime transitions
`runtime.AllowedTransitions` covers:
- `running → stopped` — graceful stop, observed exit, reconcile
observed exited;
- `running → removed``reconcile_dispose` when the container
vanished;
- `stopped → running` — restart and patch inner start;
- `stopped → removed` — cleanup TTL or admin DELETE.
Other pairs are intentionally rejected:
- `running → running` and `stopped → stopped` would mean Upsert
overwrote state without a CAS guard. Idempotent re-start / re-stop
never transitions; the service layer returns `replay_no_op` and the
record is left untouched.
- `removed → *` is forbidden because `removed` is terminal. The
reconciler creates fresh records with `reconcile_adopt` rather than
resurrecting old ones.
Encoding the table this way means a future bug where a service tries
to revive a removed record is rejected at the domain layer rather than
the adapter, which keeps the failure mode close to the offending code.
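
One way to encode the table; the map representation is illustrative,
the four pairs are the contract:

```go
var allowedTransitions = map[[2]Status]bool{
	{StatusRunning, StatusStopped}: true, // stop / observed exit
	{StatusRunning, StatusRemoved}: true, // reconcile_dispose
	{StatusStopped, StatusRunning}: true, // restart / patch inner start
	{StatusStopped, StatusRemoved}: true, // cleanup TTL / admin DELETE
}

// CanTransition rejects every other pair, including any revival of a
// terminal removed record.
func CanTransition(from, to Status) bool {
	return allowedTransitions[[2]Status{from, to}]
}
```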
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`
The same enum exists as `config.ImagePullPolicy`. Importing
`internal/config` from the ports package would couple two unrelated
layers and create a cyclic risk once the wiring layer pulls both in.
The runtime/wiring layer (in `internal/app`) is the single point that
translates between the two types — both are `string`-typed, the
value sets are identical, and the validation lives on each side
independently.
## 10. Compile-time interface assertions live with adapters
Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the
assertion lives in the adapter package (e.g.
`var _ ports.RuntimeRecordStore = (*Store)(nil)` inside
`internal/adapters/postgres/runtimerecordstore`). Putting the
assertions in the port package would force the port package to import
its own implementations and create an obvious import cycle.
## 11. `RunSpec.Validate` lives on the request type
The Docker port carries a non-trivial request type (`RunSpec`) with
eight required fields and per-mount invariants. Putting `Validate` on
the request struct keeps the rule next to the type definition, mirrors
the pattern used by `lobby/internal/ports/gmclient.go`
(`RegisterGameRequest.Validate`), and lets the adapter call it as the
first defensive check before invoking the Docker SDK.
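A sketch of the pattern with hypothetical fields and invariants (the committed `RunSpec` carries eight required fields):
```go
package ports

import (
	"errors"
	"fmt"
	"path/filepath"
)

// RunSpec sketch: fields trimmed for illustration.
type RunSpec struct {
	GameID   string
	ImageRef string
	Network  string
	Mounts   []Mount
	// ... remaining required fields elided
}

type Mount struct {
	HostPath      string
	ContainerPath string
}

// Validate keeps the invariants next to the type; the Docker adapter
// calls it first, before touching the SDK.
func (s RunSpec) Validate() error {
	if s.GameID == "" || s.ImageRef == "" || s.Network == "" {
		return errors.New("runspec: missing required field")
	}
	for _, m := range s.Mounts {
		if !filepath.IsAbs(m.HostPath) || !filepath.IsAbs(m.ContainerPath) {
			return fmt.Errorf("runspec: mount paths must be absolute: %+v", m)
		}
	}
	return nil
}
```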
+429
View File
@@ -0,0 +1,429 @@
# Configuration And Contract Examples
The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.
## Example `.env`
A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).
```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0
# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h
# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s
# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
a tag mutation is immediately visible to the next runtime),
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and configure `OTEL_*` against the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).
For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).
## Internal HTTP Examples
Every endpoint accepts the optional `X-Galaxy-Caller` header, which the
handler records as `op_source` in `operation_log` (`gm` → `gm_rest`,
`admin` → `admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
### Probe a runtime record
```bash
curl -s -H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```
Response (`200 OK`):
```json
{
"game_id": "game-01HZ...",
"status": "running",
"current_container_id": "1f2a...",
"current_image_ref": "galaxy/game:1.4.0",
"engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
"state_path": "/var/lib/galaxy/games/game-01HZ...",
"docker_network": "galaxy-net",
"started_at": "2026-04-28T07:18:54Z",
"stopped_at": null,
"removed_at": null,
"last_op_at": "2026-04-28T07:18:54Z",
"created_at": "2026-04-28T07:18:54Z"
}
```
### List all runtimes
```bash
curl -s -H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes
```
The response shape is `{"items":[<RuntimeRecord>...]}`.
### Start a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
-d '{"image_ref": "galaxy/game:1.4.0"}'
```
A `200` returns the `RuntimeRecord` for the running runtime. Failure
shapes use the canonical envelope; e.g. an invalid `image_ref`:
```json
{
"error": {
"code": "start_config_invalid",
"message": "image_ref shape rejected by docker reference parser"
}
}
```
### Stop a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
-d '{"reason": "admin_request"}'
```
Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
### Restart a runtime
```bash
curl -s -X POST \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```
The body is empty; restart re-uses the current `image_ref`.
### Patch a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.
### Cleanup a stopped runtime container
```bash
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```
Cleanup refuses a `running` runtime with `409 conflict`; stop first.
## Stream Payload Examples
Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).
### `runtime:start_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:start_jobs '*' \
game_id 'game-01HZ...' \
image_ref 'galaxy/game:1.4.0' \
requested_at_ms 1714081234567
```
### `runtime:stop_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:stop_jobs '*' \
game_id 'game-01HZ...' \
reason 'cancelled' \
requested_at_ms 1714081234567
```
### `runtime:job_results` (RTM → Lobby)
Success envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code '' \
error_message ''
```
Failure envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'failure' \
container_id '' \
engine_endpoint '' \
error_code 'image_pull_failed' \
error_message 'pull failed: manifest unknown'
```
Idempotent replay envelope (success outcome with explicit
`replay_no_op`):
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code 'replay_no_op' \
error_message ''
```
The contract permits empty `container_id` and `engine_endpoint`
strings on every value of `outcome` so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).
### `runtime:health_events` (RTM out)
The wire shape is the same for every event type — only the
`details` payload differs.
`container_started`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_started' \
occurred_at_ms 1714081234567 \
details '{"image_ref":"galaxy/game:1.4.0"}'
```
`container_exited`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_exited' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137,"oom":false}'
```
`container_oom`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_oom' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137}'
```
`container_disappeared`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_disappeared' \
occurred_at_ms 1714081234567 \
details '{}'
```
`inspect_unhealthy`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'inspect_unhealthy' \
occurred_at_ms 1714081234567 \
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```
`probe_failed` (after the threshold is crossed):
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_failed' \
occurred_at_ms 1714081234567 \
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```
`probe_recovered`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_recovered' \
occurred_at_ms 1714081234567 \
details '{"prior_failure_count":3}'
```
### `notification:intents` (RTM admin notifications)
RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message,
attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
`runtime.image_pull_failed`:
```bash
redis-cli XADD notification:intents '*' \
envelope '{
"type": "runtime.image_pull_failed",
"producer": "rtmanager",
"idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
"audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
"payload": {
"game_id": "game-01HZ...",
"image_ref": "galaxy/game:1.4.0",
"error_code": "image_pull_failed",
"error_message": "pull failed: manifest unknown",
"attempted_at_ms": 1714081234567
}
}'
```
`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.
## Storage Inspection
### Inspect a runtime record (PostgreSQL)
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```
Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).
### Inspect runtime status counts
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```
### Inspect the operation log for a game
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = 'game-01HZ...'
ORDER BY started_at DESC, id DESC
LIMIT 50"
```
### Inspect the latest health snapshot
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, container_id, status, source, observed_at, details
FROM rtmanager.health_snapshots
WHERE game_id = 'game-01HZ...'"
```
### Inspect Redis runtime-coordination keys
```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...
# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50
# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```
+305
View File
@@ -0,0 +1,305 @@
# Flows
This document collects the lifecycle and observability flows that
span Runtime Manager and its synchronous and asynchronous neighbours.
Narrative descriptions of the rules these flows enforce live in
[`../README.md`](../README.md); the diagrams here focus on the message
order across the boundary. Design-rationale records linked from each
section explain the *why*.
## Start (happy path)
```mermaid
sequenceDiagram
participant Lobby as Lobby publisher
participant Stream as runtime:start_jobs
participant Consumer as startjobsconsumer
participant Service as startruntime
participant Lease as Redis lease
participant Docker
participant PG as Postgres
participant Health as runtime:health_events
participant Results as runtime:job_results
Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
Consumer->>Stream: XREAD
Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
Service->>Docker: PullImage(image_ref) per pull policy
Service->>Docker: InspectImage → resource limits
Service->>Service: prepareStateDir(<root>/{game_id})
Service->>Docker: ContainerCreate + ContainerStart
Service->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
Service->>Health: XADD container_started
Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
REST callers (Game Master, Admin Service) drive the same service
through `POST /api/v1/internal/runtimes/{game_id}/start`; the
diagram's last two arrows collapse to an HTTP `200` response carrying
the runtime record. Sources:
[`../README.md` §Lifecycles → Start](../README.md#start),
[`services.md` §3](services.md).
## Start failure (image pull)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
participant Results as runtime:job_results
Service->>Docker: PullImage(image_ref)
Docker-->>Service: error
Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
```
The same shape applies to the configuration-validation failures
(`start_config_invalid` from `EnsureNetwork(ErrNetworkMissing)`,
`prepareStateDir`, or invalid `image_ref` shape) and the Docker
create/start failure (`container_start_failed`); only the error code
and the matching `runtime.*` notification type differ. Three failure
codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, `internal_error`
([`services.md` §4](services.md)).
## Start failure (orphan / Upsert-after-Run rollback)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
Service->>Docker: ContainerCreate + ContainerStart
Docker-->>Service: container running
Service->>PG: Upsert runtime_records
PG-->>Service: error (transport / constraint)
Note over Service: container is now an orphan<br/>(running, no PG record)
Service->>Docker: Remove(container_id) [fresh background context]
Docker-->>Service: ok or logged failure
Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
Service->>Intents: XADD runtime.container_start_failed
Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
```
The Docker adapter already removes the container when `Run` itself
fails after a successful `ContainerCreate`
([`adapters.md` §3](adapters.md)); the start service adds the
post-`Run` rollback for the `Upsert` path. A `Remove` failure is
logged but not propagated; the reconciler adopts surviving orphans on
its periodic pass ([`services.md` §5](services.md)).
## Stop
```mermaid
sequenceDiagram
participant Caller as Lobby / GM / Admin
participant Service as stopruntime
participant Lease as Redis lease
participant PG as Postgres
participant Docker
participant Results as runtime:job_results
Caller->>Service: stop(game_id, reason)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
alt status in {stopped, removed}
Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
Service-->>Caller: success / replay_no_op
else status = running
Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
Docker-->>Service: ok
Service->>PG: UpdateStatus running→stopped (CAS by container_id)
Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
Service-->>Caller: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
Lobby callers receive the outcome through `runtime:job_results`; REST
callers receive an HTTP `200`. The `reason` enum
(`orphan_cleanup | cancelled | finished | admin_request | timeout`)
is recorded in `operation_log` and is otherwise opaque to the stop
service — RTM does not branch on the reason in v1
([`services.md` §15, §17](services.md)).
## Restart
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as restartruntime
participant Stop as stopruntime.Run
participant Start as startruntime.Run
participant Docker
participant PG as Postgres
Admin->>Service: POST /restart
Service->>PG: SELECT runtime_records WHERE game_id
Note over Service: capture current image_ref
Service->>Service: acquire per-game lease (held across both inner ops)
Service->>Stop: Run(game_id) [lease bypass]
Stop->>Docker: ContainerStop
Stop->>PG: UpdateStatus running→stopped
Service->>Docker: ContainerRemove
Service->>Start: Run(game_id, image_ref) [lease bypass]
Start->>Docker: PullImage / Run
Start->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
Service-->>Admin: 200 {runtime_record}
Service->>Service: release lease
```
The lease is acquired by `restartruntime` and held across both inner
operations; `stopruntime.Run` and `startruntime.Run` are
lease-bypass entry points that skip the inner lease acquisition
([`services.md` §12](services.md)). The single `operation_log` row
uses `Input.SourceRef` as a correlation id linking the implicit stop
and start entries ([`services.md` §13](services.md)).
## Patch
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as patchruntime
participant Restart as restartruntime.Run
Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
Service->>Service: parse new image_ref + current image_ref
alt either ref not semver
Service-->>Admin: 422 image_ref_not_semver
else major or minor differ
Service-->>Admin: 422 semver_patch_only
else major.minor match, patch differs (or equal)
Service->>Restart: Run(game_id, new_image_ref)
Restart-->>Service: Result
Service-->>Admin: 200 {runtime_record}
end
```
The semver gate uses the tag fragment of the Docker reference; the
extraction strategy is recorded in [`services.md` §14](services.md).
The restart delegate already owns the lease, the inner stop/start,
the operation log, and the `runtime:health_events container_started`
emission ([`workers.md` §1](workers.md)).
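As a sketch of the gate, assuming plain `MAJOR.MINOR.PATCH` tags (pre-release suffix handling and the committed extraction strategy are elided; see [`services.md` §14](services.md)):
```go
package patchruntime

import (
	"errors"
	"strconv"
	"strings"
)

// checkPatchOnly enforces the two 422 cases named above.
func checkPatchOnly(currentRef, newRef string) error {
	cur, ok1 := tagSemver(currentRef)
	nxt, ok2 := tagSemver(newRef)
	if !ok1 || !ok2 {
		return errors.New("image_ref_not_semver")
	}
	if cur[0] != nxt[0] || cur[1] != nxt[1] {
		return errors.New("semver_patch_only")
	}
	return nil // patch differs, or the refs are equal
}

// tagSemver parses the tag fragment after the last ':' as MAJOR.MINOR.PATCH.
func tagSemver(ref string) ([3]int, bool) {
	i := strings.LastIndex(ref, ":")
	if i < 0 {
		return [3]int{}, false
	}
	parts := strings.Split(ref[i+1:], ".")
	if len(parts) != 3 {
		return [3]int{}, false
	}
	var v [3]int
	for j, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return [3]int{}, false
		}
		v[j] = n
	}
	return v, true
}
```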
## Cleanup TTL
```mermaid
sequenceDiagram
participant Worker as containercleanup worker
participant PG as Postgres
participant Service as cleanupcontainer
participant Lease as Redis lease
participant Docker
loop every RTMANAGER_CLEANUP_INTERVAL
Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
loop per game
Worker->>Service: cleanup(game_id, op_source=auto_ttl)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: re-read runtime_records WHERE game_id
alt status = running
Service-->>Worker: refused / conflict
else status in {stopped, removed}
Service->>Docker: ContainerRemove(container_id)
Service->>PG: UpdateStatus stopped→removed (CAS)
Service->>PG: INSERT operation_log (op_kind=cleanup_container)
Service-->>Worker: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
end
end
```
Admin-driven cleanup follows the same path through
`DELETE /api/v1/internal/runtimes/{game_id}/container` with
`op_source=admin_rest` instead of `auto_ttl`. The host state directory
is **never** removed by this flow
([`../README.md` §Cleanup](../README.md#cleanup),
[`services.md` §17](services.md),
[`workers.md` §19](workers.md)).
## Reconcile drift adopt
```mermaid
sequenceDiagram
participant Reconciler as reconcile worker
participant Docker
participant PG as Postgres
participant Lease as Redis lease
Note over Reconciler: read pass (lockless)
Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
Reconciler->>PG: ListByStatus(running)
Note over Reconciler: write pass (per-game lease)
loop per Docker container without matching record
Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Reconciler->>PG: re-read runtime_records WHERE game_id
alt record now exists
Reconciler-->>Reconciler: skip (state changed since read pass)
else record still missing
Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
end
Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
end
```
The reconciler **never** stops or removes an unrecorded container —
operators may have started one manually for diagnostics. The
`reconcile_dispose` and `observed_exited` paths follow the same
read-pass / write-pass split, with `dispose` updating the orphaned
record to `removed` and emitting `container_disappeared`, and
`observed_exited` updating to `stopped` and emitting `container_exited`
([`../README.md` §Reconciliation](../README.md#reconciliation),
[`workers.md` §14–§16](workers.md)).
## Health probe hysteresis
```mermaid
sequenceDiagram
participant Worker as healthprobe worker
participant State as in-memory probe state
participant Engine as galaxy-game-{id}:8080
participant Health as runtime:health_events
loop every RTMANAGER_PROBE_INTERVAL
Worker->>Worker: ListByStatus(running)
Worker->>State: prune entries for games no longer running
loop per game (semaphore cap = 16)
Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
alt success
State->>State: consecutiveFailures = 0
opt failurePublished was true
Worker->>Health: XADD probe_recovered {prior_failure_count}
State->>State: failurePublished = false
end
else failure
State->>State: consecutiveFailures++
opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
State->>State: failurePublished = true
end
end
end
end
```
Hysteresis prevents a single transient failure from emitting a
`probe_failed` event, and prevents repeated emission while the failure
persists. State is non-persistent: a process restart re-establishes
the counters from scratch; a game's state is pruned when it transitions
out of the running list ([`workers.md` §5–§6](workers.md)).
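A sketch of the per-game state machine the diagram describes; names are illustrative:
```go
package healthprobe

// probeState is the in-memory hysteresis state kept per running game.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe folds one probe result into the state and reports which
// event, if any, should be published to runtime:health_events.
func (s *probeState) observe(ok bool, threshold int) (event string, publish bool) {
	if ok {
		// prior_failure_count in the probe_recovered payload is the
		// counter value before this reset.
		s.consecutiveFailures = 0
		if s.failurePublished {
			s.failurePublished = false
			return "probe_recovered", true
		}
		return "", false
	}
	s.consecutiveFailures++
	if s.consecutiveFailures == threshold && !s.failurePublished {
		s.failurePublished = true
		return "probe_failed", true
	}
	return "", false
}
```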
+163
View File
@@ -0,0 +1,163 @@
# Service-Local Integration Suite
This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.
The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.
## 1. Build tag `integration`
The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:
```sh
make -C rtmanager integration
```
The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag, so it compiles in every build; at runtime each
helper guards its Docker-dependent paths with `t.Skip` when the
daemon is unavailable. This keeps the harness loadable from a tagless
`go vet` or IDE workflow without dragging Docker into the default
`go test` critical path.
## 2. Smoke test runs in the default `go test` pass
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and falls back on
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
the runtime context and call `runtime.Close()`; the goroutine
driving `runtime.Run` returns with `context.Canceled` and the
helper waits on it via the `runDone` channel. With a subprocess the
equivalent dance requires SIGTERM, output capture, and graceful
shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
directly through the harness-owned pool and read every Redis stream
through the harness-owned client. Both observe the exact wire shape
Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
default test output stays focused on assertions; flipping
`EnvOptions.LogToStderr` lights up the runtime's structured logs
for local debugging without requiring any subprocess plumbing.
The cross-service inter-process suite at `integration/lobbyrtm/`
re-uses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
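A sketch of that choreography, assuming the names from the prose (`app.NewRuntime`, `runtime.Run`, `runtime.Close`, the `runDone` channel) with hypothetical signatures:
```go
package harness

import (
	"context"
	"log/slog"
	"testing"

	"galaxy/rtmanager/internal/app" // assumed import path
)

// startRuntime boots the in-process runtime and wires deterministic cleanup.
func startRuntime(t *testing.T, cfg app.Config, logger *slog.Logger) {
	t.Helper()
	ctx, cancel := context.WithCancel(context.Background())
	rt, err := app.NewRuntime(ctx, cfg, logger)
	if err != nil {
		t.Fatalf("NewRuntime: %v", err)
	}
	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }() // Run signature assumed
	t.Cleanup(func() {
		cancel()       // Run returns with context.Canceled
		<-runDone      // wait for the run goroutine to exit
		_ = rt.Close() // release pools and clients deterministically
	})
}
```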
## 4. `httptest.Server` stub for the Lobby internal client
Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
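A sketch of the stub, which stays this small precisely because it refuses to be configurable:
```go
package harness

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// newLobbyStub serves a fixed 200 for every request; srv.URL feeds
// RTMANAGER_LOBBY_INTERNAL_BASE_URL.
func newLobbyStub(t *testing.T) *httptest.Server {
	t.Helper()
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			// The start service treats the fetch as ancillary diagnostics,
			// so the stub never routes and never injects errors.
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusOK)
			_, _ = w.Write([]byte(`{}`))
		}))
	t.Cleanup(srv.Close)
	return srv
}
```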
## 5. One built engine image, two semver-compatible tags
The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies — the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
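The tagging step itself is small; a sketch against the Docker SDK client, where `builtRef` stands for whatever reference the one build produced:
```go
package harness

import (
	"context"

	"github.com/docker/docker/client"
)

// tagEngineImage aliases the single built image under both test tags.
// ImageTag only adds a name; every alias shares the built digest.
func tagEngineImage(ctx context.Context, cli *client.Client, builtRef string) error {
	for _, alias := range []string{
		"galaxy/game:1.0.0-rtm-it",
		"galaxy/game:1.0.1-rtm-it",
	} {
		if err := cli.ImageTag(ctx, builtRef, alias); err != nil {
			return err
		}
	}
	return nil
}
```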
## 6. Per-test Docker network and per-test state root
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.
`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
## 7. PostgreSQL and Redis containers shared per-package
Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~12 s
difference matters in CI.
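A sketch of the once-per-package shape, with a hypothetical `startPostgresContainer` helper standing in for the testcontainers plumbing:
```go
package harness

import (
	"sync"
	"testing"
)

var (
	pgOnce sync.Once
	pgDSN  string
	pgErr  error
)

// PostgresDSN starts the shared container on first use; later tests in
// the package reuse it and rely on TruncatePostgres for isolation.
func PostgresDSN(t *testing.T) string {
	t.Helper()
	pgOnce.Do(func() {
		pgDSN, pgErr = startPostgresContainer() // hypothetical helper
	})
	if pgErr != nil {
		t.Skipf("postgres container unavailable: %v", pgErr)
	}
	return pgDSN
}
```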
## 8. Engine image cache is intentionally retained between runs
`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.
## 9. Scenario coverage
The suite covers the four end-to-end flows operators care about:
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
restart → patch → stop → cleanup. The intermediate `stop` between
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
to remove a running container per
[`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`) — duplicate start / stop entries
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`) — external `docker rm` produces
`container_disappeared`; manual `docker run` is adopted by the
reconciler.
- **notification** (`notification_test.go`) — unresolvable `image_ref`
produces `runtime.image_pull_failed` plus a `failure` job_result.
## 10. Service-local scope only
This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.
+531
View File
@@ -0,0 +1,531 @@
# PostgreSQL Schema Decisions
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
no Redis-only predecessor and no migration window. This document
records the schema decisions and the non-obvious agreements behind
them, mirroring the shape of
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
and serving the same role: a single coherent reference for "why does
the persistence layer look this way".
Use this document together with the migration script
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
and the runtime wiring
[`../internal/app/runtime.go`](../internal/app/runtime.go).
## Outcomes
- Schema `rtmanager` (provisioned externally) holds the durable
service state across three tables: `runtime_records`,
`operation_log`, `health_snapshots`. The three tables map onto the
three runtime concerns documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
current state per game, audit trail per operation, and latest
technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
Already-applied migrations exit zero — the
`pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
store, the per-game lease store, the consumer pipelines, and every
publisher (`runtime:job_results`, `runtime:health_events`,
`notification:intents`).
- The Redis adapter package
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
owns one shared `Keyspace` struct with the
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
for stream offsets and the per-game lease.
- Generated jet code under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
is committed; `make -C rtmanager jet` regenerates it via the
testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable.
The schema-per-service rule from
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
applies: each service's role is grant-restricted to its own
schema; RTM never touches Lobby's `lobby` schema or vice versa.
## Decisions
### 1. One schema, externally-provisioned `rtmanagerservice` role
**Decision.** The `rtmanager` schema and the matching
`rtmanagerservice` role are created outside the migration sequence
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
and by the integration harness; in production, by an ops init script
not in scope for any service stage). The embedded migration
`00001_init.sql` only contains DDL for the service-owned tables and
indexes and assumes it runs as the schema owner with
`search_path=rtmanager`.
**Why.** Mixing role creation, schema creation, and table DDL into
one script forces every consumer of the migration to run as a
superuser. The schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
operational split: ops provisions roles and schemas, the service
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
from its runtime role would relax the
"each service's role grants are restricted to its own schema"
defense-in-depth rule.
### 2. `runtime_records.game_id` is the natural primary key
**Decision.** `runtime_records` uses
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
column carries a CHECK constraint enforcing the
`running | stopped | removed` enum.
```sql
CREATE TABLE runtime_records (
game_id text PRIMARY KEY,
status text NOT NULL,
-- ...
CONSTRAINT runtime_records_status_chk
CHECK (status IN ('running', 'stopped', 'removed'))
);
```
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
RTM stores at most one record per game ever. A surrogate
`bigserial` would force every cross-service join to translate
through a lookup table; the natural key keeps RTM's persistence
layer pin-compatible with the streams contract (every
`runtime:start_jobs` envelope already names the `game_id`). The
status CHECK reproduces the Go-level enum from
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
as a defense-in-depth gate at the storage boundary. Decision context:
[`domain-and-ports.md`](domain-and-ports.md).
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
**Decision.** `runtime_records_status_last_op_idx` is a composite
index on `(status, last_op_at)`. The container cleanup worker scans
`status='stopped' AND last_op_at < cutoff`; the
`runtimerecordstore.ListByStatus` adapter method orders rows
`last_op_at DESC, game_id ASC`.
```sql
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
```
**Why.** Both read shapes share the same composite. The cleanup
worker drives the index from one direction (a range scan on
`last_op_at` filtered by status); `ListByStatus` drives it from the
other (equality on status, ordered by `last_op_at`). PostgreSQL
serves both shapes through one index scan once the planner picks the
index for the WHERE clause; the secondary `game_id ASC` tiebreak in
the adapter ORDER BY then costs only a cheap final sort over rows
that already arrive ordered by `last_op_at`.
A second supporting index for the cleanup worker was considered and
rejected: the workload is small (single-instance v1, a bounded count
of running games), so maintaining one composite index is strictly
cheaper than maintaining two narrow ones.
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
and is written exclusively through INSERT — there is no UPDATE
pathway, no soft-delete column, and no foreign key to
`runtime_records`. The audit index
`operation_log_game_started_idx (game_id, started_at DESC)` drives
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
results `started_at DESC, id DESC` and applies `LIMIT $2`.
```sql
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
```
**Why.** The audit's correctness invariant is "every operation RTM
performed gets exactly one row"; CASCADE deletes from
`runtime_records` would silently lose history when an admin removes
a runtime and would break the
[`../README.md` §Persistence Layout](../README.md) commitment. The
secondary `id DESC` tiebreak inside the adapter is necessary because
the audit log can write multiple rows in the same millisecond when
`reconcile_adopt` and a real operation interleave on a single tick;
without the tiebreak the test that asserts insertion-order-stable
reads becomes flaky. A non-positive `limit` is rejected before the
SQL is issued; an empty result set returns as `nil` (matching the
lobby pattern, so service-layer callers can do `len(entries) == 0`
without an extra allocation).
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
**Decision.** `operation_log` reproduces the three Go-level enums
as CHECK constraints:
```sql
CONSTRAINT operation_log_op_kind_chk
CHECK (op_kind IN (
'start', 'stop', 'restart', 'patch',
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk
CHECK (op_source IN (
'lobby_stream', 'gm_rest', 'admin_rest',
'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk
CHECK (outcome IN ('success', 'failure'))
```
The Go-level enums in
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
remain the source of truth.
**Why.** A defence-in-depth gate at the storage boundary catches any
adapter regression that would otherwise persist an unexpected
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
benefit from the enum being verifiable directly in psql without
consulting the Go source. Adding a new value requires editing two
places (the Go enum and the migration), which is the right friction
level: every new value is a wire-protocol change and deserves an
explicit migration. The alternative of using PostgreSQL's `CREATE
TYPE … AS ENUM` was rejected because extending a PG enum type needs
an `ALTER TYPE … ADD VALUE` step with its own transactional caveats
(the new value cannot be used inside the transaction that adds it),
which complicates the single-init pre-launch policy (decision §12).
### 6. `health_snapshots` is one row per game; status enum collapses event types
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
and stores the latest technical health observation per game. The
`status` column enumerates the **observed engine state**, not the
**triggering event type**:
```sql
CONSTRAINT health_snapshots_status_chk
CHECK (status IN (
'healthy', 'probe_failed', 'exited',
'oom', 'inspect_unhealthy', 'container_disappeared'
))
```
The `runtime:health_events` `event_type` enum has seven values
(`container_started`, `container_exited`, `container_oom`,
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
`probe_recovered`). The snapshot status has six — the two probe
events fold into `healthy` (after `probe_recovered`) and
`probe_failed`, and `container_started` collapses into `healthy`.
**Why.** Health snapshots answer "what state is the engine in
**right now**", not "what event was just emitted". A consumer who
wants the event firehose reads `runtime:health_events`; a consumer
who wants the latest verdict reads `health_snapshots`. The two
surfaces have different lifetimes (stream entries are bounded only
by Redis trim; snapshot rows are overwritten on every new
observation), so collapsing the seven event types into six status
states aligns the column with the consumer's mental model. The
adapter that implements this collapse lives in
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
every emission to the stream also upserts the snapshot.
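The collapse is a seven-to-six mapping; a sketch (the committed mapping lives in the publisher adapter linked above):
```go
package healtheventspublisher

// snapshotStatusByEvent folds the seven stream event types into the
// six snapshot statuses: both probe events and container_started
// resolve to a current-state verdict rather than an event name.
var snapshotStatusByEvent = map[string]string{
	"container_started":     "healthy",
	"probe_recovered":       "healthy",
	"probe_failed":          "probe_failed",
	"container_exited":      "exited",
	"container_oom":         "oom",
	"inspect_unhealthy":     "inspect_unhealthy",
	"container_disappeared": "container_disappeared",
}
```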
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
guard into a single `WHERE … AND …` clause. Status must equal the
caller's `ExpectedFrom`; when the caller supplies a non-empty
`ExpectedContainerID`, `current_container_id` must equal it as
well:
```sql
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
AND status = $4
[AND current_container_id = $5]
```
A `RowsAffected() == 0` result is ambiguous — the row may be absent
or the predicate may have failed. The adapter resolves the ambiguity
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
The probe runs only on the slow path; happy-path UPDATEs cost a
single round trip.
**Why.** The two-axis CAS is what services need: a stop driven by an
old container_id (from a stale REST request) must not clobber a
fresh `running` record installed by a concurrent restart. Status-only
CAS would collapse those two cases. The optional shape on
`ExpectedContainerID` lets reconciliation flows that legitimately
target "this game in `running` state without caring which container"
omit the second predicate. The follow-up probe matches the
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
and produces clean per-error sentinels at the service layer.
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
eight goroutines racing the same transition: exactly one returns
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
because PostgreSQL serialises concurrent UPDATEs of the same row
through row-level locking; once the first writer commits, the others
re-evaluate the CAS predicate against the updated tuple and match
zero rows.
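A sketch of the slow-path probe, assuming a `database/sql`-backed store and the domain sentinels named above:
```go
package runtimerecordstore

import (
	"context"
	"database/sql"
	"errors"

	"galaxy/rtmanager/internal/domain/runtime" // assumed import path
)

type Store struct {
	db *sql.DB
}

// resolveCASMiss disambiguates RowsAffected() == 0 after an UpdateStatus.
func (s *Store) resolveCASMiss(ctx context.Context, gameID string) error {
	var status string
	err := s.db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		gameID,
	).Scan(&status)
	if errors.Is(err, sql.ErrNoRows) {
		return runtime.ErrNotFound // row absent
	}
	if err != nil {
		return err
	}
	return runtime.ErrConflict // row present, CAS predicate failed
}
```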
### 8. Destination-driven `SET` clause on `UpdateStatus`
**Decision.** `UpdateStatus` updates a different column subset
depending on the destination status:
| Destination | Columns set |
| --- | --- |
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |
The implementation switches on `input.To` and writes the UPDATE
chain inline per branch — three short branches read better than one
parametric helper.
**Why.** Each destination has a different invariant. `stopped`
records the wall-clock at which the engine ceased serving; `removed`
nulls the container_id because the row no longer points at any
Docker resource; `running` only updates the status and the
last-op timestamp because the running invariants
(`current_container_id`, fresh `started_at`, `current_image_ref`,
`engine_endpoint`) are installed through `Upsert` on the `start`
path.
A previous draft built the SET list via `[]pg.Column` / `[]any`
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
variadic refuses a `[]postgres.Column` slice spread because the
element type does not match `jet.Column` after the type-alias
resolution. The final code switches inline per branch.
The `running` destination is implemented even though the start
service uses `Upsert` for the inner start of restart and patch.
Keeping the `running` path live preserves a one-to-one match between
`runtime.AllowedTransitions()` and the adapter's capability matrix —
otherwise a future caller exercising the `stopped → running`
transition through `UpdateStatus` would hit a runtime error inside
the adapter rather than a domain rejection. The path only updates
`status` and `last_op_at`; callers responsible for the running
invariants install them through `Upsert` first.
### 9. `created_at` preservation on `Upsert`
**Decision.** `runtimerecordstore.Upsert` is implemented as
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
column from EXCLUDED>` — `created_at` is deliberately omitted from
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
value never overwrites the stored timestamp.
```sql
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status = EXCLUDED.status,
current_container_id = EXCLUDED.current_container_id,
current_image_ref = EXCLUDED.current_image_ref,
engine_endpoint = EXCLUDED.engine_endpoint,
state_path = EXCLUDED.state_path,
docker_network = EXCLUDED.docker_network,
started_at = EXCLUDED.started_at,
stopped_at = EXCLUDED.stopped_at,
removed_at = EXCLUDED.removed_at,
last_op_at = EXCLUDED.last_op_at
-- created_at intentionally NOT updated
```
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
invariant.
**Why.** `runtime_records.created_at` records "first time RTM saw
the game". Every restart and every reconcile_adopt re-Upserts the
row with the current wall-clock as `CreatedAt` from the adapter
boundary; without the omission rule the timestamp would drift
forward. Preserving the original creation time keeps a stable
horizon for retention reasoning and matches
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
same approach for the `games.created_at` column.
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
'{}'::jsonb`. The jet-generated model declares
`Details string` (jet maps `jsonb` to `string`). The adapter:
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
`snapshot.Details` is empty, so the column never holds a non-JSON
empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
`json.RawMessage` so the caller receives verbatim bytes without
an extra round of parsing.
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
`TestUpsertAndGetRoundTrip` cover the two cases.
**Why.** The detail payload is type-specific (the keys differ
between `probe_failed` and `inspect_unhealthy`) and is opaque to
queries — the column is never element-filtered. JSONB matches the
"everything outside primary fields is JSON" pattern that the
Notification Service already established and allows a future
GIN index (e.g. for an admin search-by-key feature) without a
schema rewrite. Substituting the SQL DEFAULT for an empty
parameter avoids the trap where the database accepts `''` for
`text` but rejects it for `jsonb`.
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
**Decision.** Every time-valued column on every RTM table uses
PostgreSQL's `timestamptz`. The domain model continues to use
`time.Time`; the adapter normalises every `time.Time` parameter to
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
helper that wraps a possibly-zero `time.Time`), and re-wraps every
scanned `time.Time` with `.UTC()` (directly or via
`timeFromNullable` for nullable columns) before the value leaves
the adapter.
The architecture-wide form of this rule lives in
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model
needs a `time.Time` API the service layer can compare and do
arithmetic on.
Without explicit `.UTC()` on the bind site, the pgx driver returns
scanned values in `time.Local`, which silently breaks equality
tests, JSON formatting, and comparison against pointer fields
elsewhere in the codebase. The defensive `.UTC()` rule on both
sides eliminates the class of bug where a timezone difference
between the adapter and the test harness flips assertions
intermittently.
The same shape is used in User Service, Mail Service, and
Notification Service — RTM matches the existing convention rather
than introducing a fourth encoding path.
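A sketch of the two helpers named above, with assumed signatures built on `sql.NullTime`:
```go
package runtimerecordstore

import (
	"database/sql"
	"time"
)

// nullableTime wraps a possibly-zero time for a nullable timestamptz bind.
func nullableTime(t time.Time) sql.NullTime {
	if t.IsZero() {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true} // normalise on bind
}

// timeFromNullable unwraps a scanned nullable column back to UTC.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC() // normalise on scan
}
```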
### 12. Single-init pre-launch policy
**Decision.** `00001_init.sql` evolves in place until first
production deploy. Adding a column, an index, or a new table during
the pre-launch development window edits this file directly rather
than producing `00002_*.sql`. The runtime applies the migration on
every boot; if the schema is already at head, `pkg/postgres`'s
goose adapter exits zero.
**Why.** The schema-per-service architectural rule
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
endorses a single-init policy for pre-launch services. The
pre-launch window allows non-additive changes (column rename, type
narrowing, CHECK tightening) that a multi-step migration sequence
would force into awkward two-step rewrites. Once the service ships
to production, the next schema change becomes `00002_*.sql` and
the policy lifts; from that point onward edits to `00001_init.sql`
are rejected by code review.
This applies to RTM exactly the same way it applies to every other
PG-backed service in the workspace; the README explicitly carries
the reminder. The exit-zero behaviour for already-applied
migrations is what makes the policy operationally cheap: a
freshly-spawned replica re-applies the same `00001_init.sql` with
no work to do, no logged error, and proceeds to open its
listeners.
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
**Decision.** All three RTM PG-store packages
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
build SQL through the jet builder API
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/COALESCE/...` DSL).
Generated table models live under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
and are regenerated by `make -C rtmanager jet`. The target invokes
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
transient PostgreSQL container via testcontainers, provisions the
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
against the provisioned schema. Generated code is committed to the
repo, so build consumers do not need Docker.
Statements are run through the `database/sql` API
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
helpers preserve the codecs.go boundary translations and
domain-type mapping (status enum decoding, `time.Time` UTC
normalisation, JSONB `[]byte` → `json.RawMessage`).
PostgreSQL constructs that the jet builder does not cover natively
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
through the per-DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct
`[]byte`/string params for JSONB columns).
**Why.** Aligns with the workspace-wide convention from
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
service. Hand-rolled SQL would multiply boundary-translation paths
and require per-store query-builder helpers for what jet already
covers. Committing generated code keeps `go build ./...` working
without Docker.
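A sketch of the builder shape, using the generated `RuntimeRecords` table model (the generated table package import is elided) and jet's table-method shorthand:
```go
package runtimerecordstore

import (
	"context"
	"database/sql"

	pg "github.com/go-jet/jet/v2/postgres"
	// generated table package import elided; it provides RuntimeRecords
)

// listRunning mirrors the ListByStatus ordering described above.
func listRunning(ctx context.Context, db *sql.DB) (*sql.Rows, error) {
	stmt := RuntimeRecords.
		SELECT(RuntimeRecords.AllColumns).
		WHERE(RuntimeRecords.Status.EQ(pg.String("running"))).
		ORDER_BY(RuntimeRecords.LastOpAt.DESC(), RuntimeRecords.GameID.ASC())

	query, args := stmt.Sql()
	return db.QueryContext(ctx, query, args...)
}
```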
### 14. `redisstate` keyspace ownership and per-store subpackages
**Decision.** The
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
package owns one shared `Keyspace` struct with a
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
lives in its own subpackage:
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
for the stream offset store consumed by the start-jobs and
stop-jobs consumers;
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
for the per-game lease store consumed by every lifecycle service
and the reconciler.
Both subpackages take a `redisstate.Keyspace{}` value and use it to
build their key shapes (`rtmanager:stream_offsets:{label}`,
`rtmanager:game_lease:{game_id}`).
**Why.** Keeping the parent package as the single owner of the prefix
and the key-shape builder mirrors the way Lobby's `redisstate`
namespace centralises every key shape and supports multiple
Redis-backed adapters (stream offsets, the per-game lease) without a
restructure as the surface grows.
The per-store subpackage choice (rather than Lobby's flat
single-package shape) is driven by three considerations:
- It keeps the docker mock generator scoped to one package, since
`mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a
dev-only dep, and keeping the `streamoffsets` package
self-contained leaves room for `gamelease` to depend only on the
production `redis` client.
- Each subpackage carries its own tests, which keeps the test
surface focused on one Redis primitive rather than mixing offset
semantics with lease semantics in shared fixtures.
## Cross-References
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— the embedded schema migration.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
`//go:embed *.sql` and `FS()` exporter consumed by the runtime.
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
— the three jet-backed PG adapters and their testcontainers-driven
unit suites.
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
— committed generated jet models.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
[`../Makefile`](../Makefile) `jet` target — the regeneration
pipeline.
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
— Redis adapter package layout.
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
— runtime wiring: PG pool open + migration apply + Redis client
open + adapter assembly.
- [`../internal/config/`](../internal/config) — the config groups
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
`Coordination`).
- Companion design rationales:
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
domain shape, [`adapters.md`](adapters.md) for the redisstate
publishers and clients.
+368
View File
@@ -0,0 +1,368 @@
# Operator Runbook
This runbook covers the checks that matter most during startup,
steady-state readiness, shutdown, and the handful of recovery paths
specific to Runtime Manager.
## Startup Checks
Before starting the process, confirm:
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
reaches a Docker daemon the operator controls. RTM is the only
Galaxy service permitted to interact with the Docker socket;
scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
user-defined bridge network that has already been created (e.g.
via `docker network create galaxy-net` in the environment's
bootstrap script). RTM **validates** the network at startup but
never creates it. A missing network is fail-fast and the process
exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
can read and write. Per-game subdirectories are created with
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
uid/gid to match the engine container's user when running with a
non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
that hosts the `rtmanager` schema. The DSN must include
`search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
for production). Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. The `rtmanager` schema and the
matching `rtmanagerservice` role are provisioned externally
([`postgres-migration.md` §1](postgres-migration.md)).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
the Redis deployment used for the runtime-coordination state:
stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
publishers (`runtime:job_results`, `runtime:health_events`,
`notification:intents`), persisted offsets, and the per-game
lease. RTM does not maintain durable business state on Redis.
- Stream names match the producers and consumers RTM integrates with:
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
HTTP listener. RTM's start service issues a diagnostic
`GET /api/v1/internal/games/{game_id}` per start; failure is logged
at debug and does not abort the start
([`services.md` §7](services.md)).
The startup sequence runs in the order recorded in
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):
1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs **once synchronously** and blocks until done.
7. Background workers start.
A failure at any step is fatal. The synchronous reconciler pass is
the reason orphaned containers from a prior process never reach the
periodic workers in an inconsistent state
([`workers.md` §17](workers.md)).
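A minimal sketch of this fail-fast ordering, assuming a simple ordered step list rather than the actual `app` wiring (all names here are illustrative):
```go
package main

import (
	"context"
	"fmt"
)

// bootStep names one ordered startup dependency. Any failure is fatal
// and stops the sequence before the next step runs.
type bootStep struct {
	name string
	run  func(context.Context) error
}

// boot runs the seven steps above in order; the caller exits non-zero
// on the first error.
func boot(ctx context.Context, steps []bootStep) error {
	for _, step := range steps {
		if err := step.run(ctx); err != nil {
			return fmt.Errorf("boot step %q failed: %w", step.name, err)
		}
	}
	return nil
}
```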
Expected log lines on a healthy boot:
- `migrations applied`,
- `postgres ping ok`,
- `redis ping ok`,
- `docker ping ok` and `docker network found`,
- `telemetry exporter started`,
- `internal http listening`,
- `reconciler initial pass completed`,
- one `worker started` entry per background worker (seven expected).
## Readiness
Use the probes according to what they actually verify:
- `GET /healthz` confirms the listener is alive — no dependency
check.
- `GET /readyz` live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` when every check passes; otherwise
returns `503` with the canonical
`{"error":{"code":"service_unavailable","message":"…"}}` envelope
identifying the first failing dependency.
`/readyz` is the strongest readiness signal RTM exposes; unlike
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
Each request hits the daemon and the database fresh.
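A minimal sketch of a handler with these semantics; the check interfaces and the wiring are assumptions, not the actual handler code:
```go
package internalhttp

import (
	"context"
	"encoding/json"
	"net/http"
)

// readyCheck pairs a dependency name with a live check. Order matters:
// the first failing dependency is the one reported.
type readyCheck struct {
	name  string
	check func(context.Context) error
}

func readyzHandler(checks []readyCheck) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		for _, c := range checks {
			if err := c.check(r.Context()); err != nil {
				w.WriteHeader(http.StatusServiceUnavailable)
				_ = json.NewEncoder(w).Encode(map[string]any{
					"error": map[string]string{
						"code":    "service_unavailable",
						"message": c.name + ": " + err.Error(),
					},
				})
				return
			}
		}
		_, _ = w.Write([]byte(`{"status":"ready"}`))
	}
}
```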
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz`;
3. verify `rtmanager.runtime_records_by_status{status="running"}`
gauge tracks the expected live game count after the first start
completes;
4. verify `rtmanager.docker_op_latency` histograms have at least one
sample after the first lifecycle operation.
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behaviour:
- the per-component shutdown budget is controlled by
`RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest
offset before returning; the offset survives the restart
([`workers.md` §9](workers.md));
- the Docker events listener cancels its subscription;
- the in-flight services release their per-game lease through the
surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at
the next lease re-acquisition.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any consumer that was mid-cycle to retry from the persisted
offset on the next process start;
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
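A sketch of the signal handling and the bounded drain budget, assuming standard `net/http` wiring (helper names are illustrative):
```go
package main

import (
	"context"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// runUntilSignal blocks until SIGINT or SIGTERM, then gives the
// listener a bounded budget to drain, mirroring
// RTMANAGER_SHUTDOWN_TIMEOUT.
func runUntilSignal(srv *http.Server, shutdownTimeout time.Duration) error {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	<-ctx.Done() // a termination signal arrived

	// A fresh context carries the budget: the signal context is
	// already cancelled and would abort the drain immediately.
	drainCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	return srv.Shutdown(drainCtx) // drains in-flight requests first
}
```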
## Engine Container Died
A running engine container that exits unexpectedly surfaces through
three observation channels:
- The Docker events listener emits `container_exited` (non-zero exit
code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the
threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the
engine restarts under Docker's healthcheck or if Docker reports an
unexpected status.
Triage:
1. Inspect the `runtime:health_events` stream for the affected
`game_id` and `event_type`:
```bash
redis-cli XRANGE runtime:health_events - + COUNT 200 \
  | grep -B2 -A6 '<game_id>'
```
2. Read the runtime record and the operation log:
```bash
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code, started_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC LIMIT 20"
```
3. If Lobby has not reacted (the game's status remains `running` in
`lobby.games`), check `runtime:job_results` lag and Lobby's
`runtimejobresult` worker. RTM publishes the result; Lobby is the
consumer.
4. If the container is already gone (`docker ps -a` shows no row for
`galaxy-game-<game_id>`), the reconciler will move the record to
`removed` on its next pass. Triggering the periodic reconcile
manually by sending `SIGHUP` is **not** supported — wait for
`RTMANAGER_RECONCILE_INTERVAL` (default `5m`) to elapse or restart
the process; the synchronous boot pass will handle the drift.
5. The `notification:intents` stream is **not** the place to look
for ongoing health changes. Only the three first-touch start
failures (`runtime.image_pull_failed`,
`runtime.container_start_failed`,
`runtime.start_config_invalid`) produce a notification intent;
probe failures, OOMs, and exits flow through health events only
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
## Patch Upgrade
A patch upgrade replaces the container with a new `image_ref` while
preserving the bind-mounted state directory.
Pre-conditions:
- The new and current `image_ref` tags both parse as semver. RTM
rejects non-semver tags with `image_ref_not_semver`.
- The new and current major / minor versions match. A cross-major or
cross-minor patch returns `semver_patch_only`.
Driving the upgrade:
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Behaviour:
- The container is stopped, removed, and recreated. The
`current_container_id` changes; the `engine_endpoint`
(`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any
data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and
the old / new image refs.
- A `container_started` event is emitted to `runtime:health_events` by
the inner start ([`workers.md` §1](workers.md)).
Post-patch verification:
```bash
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```
The `current_image_ref` field on the runtime record reflects the new
tag.
## Manual Cleanup
The cleanup endpoint removes the container and updates the record to
`removed`. It refuses to remove a `running` container — stop first.
```bash
# Stop, then clean up
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
-d '{"reason":"admin_request"}'
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
is **never** deleted by RTM. Removing the directory is operator
domain (backup tooling, future Admin Service workflow). The
operation_log records `op_kind=cleanup_container` with
`op_source=admin_rest`.
## Reconcile Drift After Docker Daemon Restart
A Docker daemon restart drops every running engine container; PG
records remain. On RTM's next boot (or its next periodic reconcile):
1. The reconciler observes `running` records whose containers are
missing from `docker ps`. It updates each record to `removed`,
appends `operation_log` with `op_kind=reconcile_dispose`, and
publishes `runtime:health_events container_disappeared`
([`workers.md` §14–§15](workers.md)).
2. Lobby's `runtimejobresult` worker does not consume the dispose
event in v1, so the cascade does not auto-restart the engine.
Operators trigger restarts through Lobby's user-facing flow or
directly via the GM/Admin REST `restart` endpoint.
3. If the operator brings up an engine container manually for
diagnostics (`docker run` with the
`com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
the reconciler **adopts** it on the next pass: a new
`runtime_records` row appears with `op_kind=reconcile_adopt`.
The reconciler **never stops or removes** an unrecorded
container — operators stay in control of manual containers
([`../README.md` §Reconciliation](../README.md#reconciliation)).
Three drift kinds run through the same lease-guarded write pass:
`adopt`, `dispose`, and the README-level path
`observed_exited` (a record marked `running` whose container exists
but sits in the `exited` state). The telemetry counter
`rtmanager.reconcile_drift{kind}` exposes the three independently
([`workers.md` §15](workers.md)).
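A sketch of how the three kinds might be classified; the record and container lookups are assumptions, and the real pass runs every mutation under the per-game lease:
```go
package reconcile

// RuntimeRecord and Container stand in for the real lookups; only the
// fields the classification reads are shown.
type RuntimeRecord struct{ Status string }
type Container struct{ State string }

// classifyDrift returns the drift kind for one game, or "" when the
// record and the container agree.
func classifyDrift(rec *RuntimeRecord, ctr *Container) string {
	switch {
	case rec == nil && ctr != nil:
		return "adopt" // labelled container, no PG record
	case rec != nil && rec.Status == "running" && ctr == nil:
		return "dispose" // running record, container vanished
	case rec != nil && rec.Status == "running" && ctr != nil && ctr.State == "exited":
		return "observed_exited" // container exists but exited
	default:
		return "" // no drift
	}
}
```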
## Testing Locally
```sh
# One-time bootstrap
docker network create galaxy-net
# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
go run ./rtmanager/cmd/rtmanager
```
After start:
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
once PG, Redis, and Docker pings pass and the configured network
exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
brings up `galaxy-game-<game_id>` containers; RTM logs each
lifecycle transition.
The integration suite under `rtmanager/integration/` exercises the
end-to-end flows against the real Docker daemon. The default
`go test ./...` skips it via the `integration` build tag; run
explicitly with:
```sh
make -C rtmanager integration
```
The suite requires a reachable Docker daemon. Without one, the
harness helpers call `t.Skip` and the package becomes a no-op
([`integration-tests.md` §1](integration-tests.md)).
## Diagnostic Queries
Durable runtime state lives in PostgreSQL; runtime-coordination state
stays in Redis. CLI snippets that help during incidents:
```bash
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"
# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC, id DESC
LIMIT 20"
# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"
# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
--format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'
# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100
# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw PostgreSQL and Redis access is for
last-resort triage.
+309
View File
@@ -0,0 +1,309 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.
```mermaid
flowchart LR
subgraph Clients
GM["Game Master"]
Admin["Admin Service"]
Lobby["Game Lobby"]
end
subgraph RTM["Runtime Manager process"]
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
StartJobs["startjobsconsumer"]
StopJobs["stopjobsconsumer"]
DockerEvents["dockerevents listener"]
HealthProbe["healthprobe worker"]
DockerInspect["dockerinspect worker"]
Reconcile["reconcile worker"]
Cleanup["containercleanup worker"]
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
IntentPublisher["notification:intents publisher"]
ResultsPublisher["runtime:job_results publisher"]
HealthPublisher["runtime:health_events publisher"]
Telemetry["Logs, traces, metrics"]
end
Docker["Docker Daemon"]
Engine["galaxy-game-{game_id} container"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases + offsets"]
LobbyHTTP["Lobby internal HTTP"]
Lobby -. runtime:start_jobs .-> StartJobs
Lobby -. runtime:stop_jobs .-> StopJobs
GM --> InternalHTTP
Admin --> InternalHTTP
StartJobs --> Services
StopJobs --> Services
InternalHTTP --> Services
Services --> Docker
Services --> Postgres
Services --> Redis
Services --> ResultsPublisher
Services --> HealthPublisher
Services --> IntentPublisher
Services -. GET diagnostic .-> LobbyHTTP
DockerEvents --> Docker
DockerInspect --> Docker
HealthProbe --> Engine
Reconcile --> Docker
Reconcile --> Postgres
Cleanup --> Postgres
Cleanup --> Services
DockerEvents --> HealthPublisher
DockerInspect --> HealthPublisher
HealthProbe --> HealthPublisher
HealthPublisher --> Redis
ResultsPublisher --> Redis
IntentPublisher --> Redis
StartJobs --> Redis
StopJobs --> Redis
InternalHTTP --> Postgres
Docker -->|create / start / stop / rm| Engine
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
InternalHTTP --> Telemetry
Services --> Telemetry
StartJobs --> Telemetry
StopJobs --> Telemetry
DockerEvents --> Telemetry
HealthProbe --> Telemetry
DockerInspect --> Telemetry
Reconcile --> Telemetry
Cleanup --> Telemetry
```
Notes:
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
goose migrations fail, when Redis ping fails, when the Docker daemon
ping fails, or when the configured Docker network is missing. Lobby
reachability is **not** verified at boot — the start service's
diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
outside of debug logging
([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup before
`app.App.Run` registers any other component, then re-runs
periodically as a regular `Component`. The synchronous pass
guarantees that orphaned containers from a prior process are never
observed by the events listener without a PG record
([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
and Admin Service. There is no public listener — RTM does not face
end users.
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`
The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source`
([`services.md` §18](services.md)).
Probe routes:
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` only when every check passes; otherwise
returns `503` with the canonical error envelope.
## Background Workers
Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
The events listener and the inspect worker do **not** emit
`container_started` — that event is owned by the start service
([`workers.md` §1](workers.md)). The events listener and the inspect
worker also do not emit `container_disappeared` autonomously when a
record is missing or stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).
## Lifecycle Services
The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.
| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |
All services share three behaviours captured in
[`services.md`](services.md):
- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
not by the caller — which keeps consumer and REST callers symmetric
([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
`ContainerID`, `EngineEndpoint`) is what consumers and REST
handlers translate into job_results / HTTP responses
([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
and three of the failure codes (`start_config_invalid`,
`image_pull_failed`, `container_start_failed`) also publish a
`runtime.*` admin notification intent
([`services.md` §4](services.md)).
## Synchronous Upstream Client
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |
Lobby's outbound transport is the only synchronous client RTM holds.
Every other interaction (Notification Service, Game Master, Admin
Service) crosses an asynchronous boundary or is initiated by the peer.
## Stream Offsets
Each consumer persists its position under a fixed label so process
restart preserves stream progress.
| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).
The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume them
itself.
## Configuration Groups
The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:
- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
`RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
`RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
`RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
`RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
`RTMANAGER_CONTAINER_RETENTION_DAYS`,
`RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
`RTMANAGER_ENGINE_STATE_ENV_NAME`,
`RTMANAGER_GAME_STATE_DIR_MODE`,
`RTMANAGER_GAME_STATE_OWNER_UID`,
`RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
`RTMANAGER_POSTGRES_REPLICA_DSNS`,
`RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
`RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
`RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
`RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
`RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
`RTMANAGER_REDIS_STOP_JOBS_STREAM`,
`RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
`RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
`RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
`RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
`RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
`RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
`RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
`RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
`RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.
## Runtime Notes
- **Single-instance v1.** Multi-instance Runtime Manager with Redis
Streams consumer groups is explicitly out of scope for the current
iteration. The per-game lease serialises operations on one game
across the consumer + REST entry points; cross-instance
coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
`SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
(default `60s`). The lease is **not renewed mid-operation** in v1;
long pulls of multi-GB images can therefore expire the lease
before the operation finishes — the trade-off is documented in
[`services.md` §1](services.md). The reconciler honours the same
lease around every drift mutation
([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
reconcile mutation appends one row to `rtmanager.operation_log`.
The `runtime:health_events` stream and the `notification:intents`
emissions are best-effort — a publish failure logs at `Error` and
proceeds, never rolling back the recorded operation
([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
per-game `consecutiveFailures` and `failurePublished` counters in a
mutex-guarded map. State is non-persistent: a process restart that
loses the counters re-establishes hysteresis from scratch, and
state for a game that transitions through `stopped → running` is
pruned at the start of every probe tick
([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
`if_missing` (default), `always`, and `never`. Image labels
(`com.galaxy.cpu_quota`, `com.galaxy.memory`,
`com.galaxy.pids_limit`) drive resource limits when present; the
matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
mode and uid/gid, but **never deletes them**. Removing the directory
is operator domain (backup tooling, a future Admin Service
workflow). A cleanup that removes the container leaves the
directory intact.
+443
View File
@@ -0,0 +1,443 @@
# Lifecycle Services
This document explains the design of the five lifecycle services
(`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) under [`../internal/service/`](../internal/service)
plus the per-handler REST glue under
[`../internal/api/internalhttp/`](../internal/api/internalhttp).
The current-state behaviour (lifecycle steps, failure tables, the
per-game lease semantics, the wire contracts) lives in
[`../README.md`](../README.md), the OpenAPI spec at
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml), and the
AsyncAPI spec at
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml).
This file records the *why*.
## 1. Per-game lease lives at the service layer
Every lifecycle service acquires `rtmanager:game_lease:{game_id}` via
[`ports.GameLeaseStore`](../internal/ports/gamelease.go) before doing
any work, and releases it on the way out:
- the lease primitive serialises operations on a single game across
every entry point (stream consumers and REST handlers);
- holding the lease at the service layer keeps the consumer / REST
callers symmetric — neither acquires the lease itself, both call
the service the same way;
- the Redis-backed adapter
([`../internal/adapters/redisstate/gamelease/store.go`](../internal/adapters/redisstate/gamelease/store.go))
uses `SET NX PX` on acquire, Lua compare-and-delete on release; a
release whose caller-supplied token no longer matches is a silent
no-op.
The lease key shape is `rtmanager:game_lease:{base64url(game_id)}` so
opaque game ids may contain any characters without leaking through
the key syntax.
The lease TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60s`)
and is **not renewed mid-operation** in v1. A multi-GB image pull can
theoretically expire the lease before the start service finishes;
operators see this as a `reconcile_adopt` event later because the
container is created with the standard owner labels. A renewal helper
is deliberately deferred until a workload makes it necessary.
The reconciler ([`workers.md`](workers.md) §4) honours the same lease
around every drift mutation, which closes the
restart-vs-`reconcile_dispose` race documented in §6 below.
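A minimal sketch of the acquire/release pair described above, using `github.com/redis/go-redis/v9`; the function shapes and key handling are illustrative, not the adapter's actual API:
```go
package gamelease

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease only while the stored token still
// matches the caller's token; a mismatched release is a silent no-op.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
	return redis.call("DEL", KEYS[1])
end
return 0
`)

// acquire is SET key token NX PX ttl: it succeeds only when no other
// operation currently holds the game's lease.
func acquire(ctx context.Context, rdb *redis.Client, key, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, token, ttl).Result()
}

func release(ctx context.Context, rdb *redis.Client, key, token string) error {
	return releaseScript.Run(ctx, rdb, []string{key}, token).Err()
}
```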
## 2. Health-events publisher lands with the start service
The start service publishes `container_started` after `docker run`
returns; the events listener intentionally does **not** duplicate
the event ([`workers.md`](workers.md) §1). Centralising the publisher
on the start service avoids a "who emits what" ambiguity and lets the
publisher be a thin port wrapper rather than a worker-specific
helper.
The publisher port lives next to the snapshot-upsert rule
([`adapters.md`](adapters.md) §8): one Publish call updates both
surfaces.
## 3. `Result`-shaped contract
`Service.Handle` returns `(Result, error)`. The Go-level `error` is
reserved for system-level / programmer faults (nil context, nil
service). All business outcomes flow through `Result`:
- `Outcome=success`, `ErrorCode=""` — fresh start succeeded;
- `Outcome=success`, `ErrorCode="replay_no_op"` — idempotent replay;
- `Outcome=failure`, `ErrorCode` set — business failure
(`start_config_invalid` / `image_pull_failed` /
`container_start_failed` / `conflict` / `service_unavailable` /
`internal_error`).
The stream consumer uses `Outcome` and `ErrorCode` to populate
`runtime:job_results` directly; the REST handler maps `Outcome=failure`
plus `ErrorCode` to the matching HTTP status. Both callers are simpler
with this contract than with an `errors.Is`-driven sentinel taxonomy.
`ports.JobResult` and the two `JobOutcome*` string constants live in
the ports package next to `JobResultPublisher` so the wire shape is
defined exactly once. The constants are intentionally not aliases of
`operation.Outcome` — the audit-log enum is allowed to grow without
breaking the wire format.
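In sketch form, the shape this contract implies (field names follow the text; the concrete types and the `ErrorMessage` placement are assumptions):
```go
package startruntime

// Outcome is the business-level verdict; the Go error return stays
// reserved for programmer faults.
type Outcome string

const (
	OutcomeSuccess Outcome = "success"
	OutcomeFailure Outcome = "failure"
)

// RuntimeRecord stands in for the domain record; fields elided.
type RuntimeRecord struct{}

type Result struct {
	Outcome        Outcome
	ErrorCode      string // "" on fresh success, "replay_no_op" on replay
	ErrorMessage   string // human-readable detail (placement assumed; see §17)
	Record         *RuntimeRecord
	ContainerID    string
	EngineEndpoint string
}
```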
## 4. Start service failure-mode mapping
| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | `start_config_invalid` | `runtime.start_config_invalid` |
| Lease busy | `conflict` | — |
| Existing record running with a different image_ref | `conflict` | — |
| Get returns a non-NotFound transport error | `internal_error` | — |
| `image_ref` shape rejected by `distribution/reference` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns `ErrNetworkMissing` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns any other error | `service_unavailable` | — |
| `PullImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `InspectImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `prepareStateDir` failure | `start_config_invalid` | `runtime.start_config_invalid` |
| `Run` failure | `container_start_failed` | `runtime.container_start_failed` |
| `Upsert` failure after successful Run | `container_start_failed` | `runtime.container_start_failed` |
Three error codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, and `internal_error` are operational classes
(another caller is in flight, a dependency is down, an unclassified
fault) where the corrective action is not a configuration change. The
operator already sees them through telemetry and structured logs; an
email per occurrence would be noise.
## 5. Upsert-after-Run rollback
A `Run` that succeeded but whose `Upsert` failed leaves a running
container with no PG record. The service issues a best-effort
`docker.Remove(containerID)` in a fresh `context.Background()` (the
request context may already be cancelled) before recording the failure.
A Remove failure is logged but not propagated; the reconciler adopts
surviving orphans on its periodic pass.
The Docker adapter already removes the container when `Run` itself
returns an error after a successful `ContainerCreate` ([`adapters.md`](adapters.md) §3).
The service-layer rollback covers the additional post-`Run` Upsert
failure path.
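A hedged sketch of that rollback path; the Docker port slice, the logger, and the 10-second deadline are assumptions:
```go
package startruntime

import (
	"context"
	"log/slog"
	"time"
)

// DockerClient is the slice of the Docker port this path needs.
type DockerClient interface {
	Remove(ctx context.Context, containerID string) error
}

// rollbackAfterUpsertFailure best-effort-removes the container that
// Run created when the follow-up Upsert failed. The request context
// may already be cancelled, so the removal runs on a fresh background
// context with its own deadline.
func rollbackAfterUpsertFailure(docker DockerClient, log *slog.Logger, containerID string) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := docker.Remove(ctx, containerID); err != nil {
		// Logged, never propagated: the reconciler adopts surviving
		// orphans on its periodic pass.
		log.Error("rollback remove failed", "container_id", containerID, "error", err)
	}
}
```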
## 6. Pre-existing record handling
Only `status=running` + same `image_ref` is a `replay_no_op`.
`running` + a different `image_ref` returns `failure / conflict` (use
`patch` to change the image of a running container).
Anything else (`stopped`, `removed`, missing record) proceeds with a
fresh start that ends in `Upsert`. `Upsert` overwrites verbatim and is
not bound by the transitions table, so installing a `running` record
over a `removed` row is permitted — the `removed` terminus rule lives
in `runtime.AllowedTransitions` (which guards `UpdateStatus`), not in
`Upsert`.
`created_at` is preserved across re-starts: the start service reuses
`existing.CreatedAt` when the record was found, so the
"first time RTM saw the game" semantics from
[`postgres-migration.md`](postgres-migration.md) §9 hold even when the
start path goes through `Upsert` rather than through the runtime
adapter's `INSERT ... ON CONFLICT DO UPDATE` EXCLUDED list.
A residual `galaxy-game-{game_id}` container left over from a previous
start that was stopped but never cleaned up will fail at `docker run`
with a name conflict. The service surfaces that as
`container_start_failed`; cleanup plus the reconciler is the standard
remedy. A pre-emptive Remove inside the start service was rejected
because it would silently undo manual operator inspection on stopped
containers.
## 7. `LobbyInternalClient.GetGame` is best-effort
The fetch happens after the lease is acquired and before the Docker
work, with the configured `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
`ErrLobbyUnavailable` and `ErrLobbyGameNotFound` are logged at
`debug`; the start operation continues either way. The fetched
`Status` and `TargetEngineVersion` enrich logs only — the start
envelope already carries the only required field (`image_ref`), and
the port docstring fixes the recoverable-failure contract.
## 8. `image_ref` validation
Validation uses `github.com/distribution/reference.ParseNormalizedNamed`
before any Docker round-trip. Rejected shapes surface as
`start_config_invalid` plus a `runtime.start_config_invalid` intent.
Daemon-side rejections after a valid parse (manifest unknown,
authentication required) surface as `image_pull_failed` plus a
`runtime.image_pull_failed` intent. The split keeps operator-actionable
configuration mistakes distinct from registry-side failures.
## 9. State-directory preparer is overrideable
`Dependencies.PrepareStateDir` is a `func(gameID string) (string, error)`
injection point that defaults to `os.MkdirAll` + `os.Chmod` +
`os.Chown` against `RTMANAGER_GAME_STATE_ROOT`. Tests override it to
point at a `t.TempDir()`-style fake without exercising the real
filesystem permissions (which require either matching uid/gid or
root). This is a deliberate non-port abstraction: the start service
does no other filesystem work and the cost of a new port for one
helper is not worth the indirection.
## 10. Container env: both `GAME_STATE_PATH` and `STORAGE_PATH`
Both names are accepted by the v1 engine. The start service always
sets both; the configured `RTMANAGER_ENGINE_STATE_ENV_NAME` controls
the primary. When the operator overrides the primary to `STORAGE_PATH`,
the deduplicating map collapses the two entries into one.
## 11. Wiring layer construction
`internal/app/wiring.go` is the single point that builds every
production store, adapter, and service from `config.Config`. The
struct exposes typed fields so handlers and workers can grab the
singletons without re-wiring; an `addCloser` slice releases adapter
resources (currently the Lobby HTTP client's idle-connection pool) at
runtime shutdown. The `runtimeRecordsProbe` adapter installed during
construction registers the `rtmanager.runtime_records_by_status`
gauge documented in [`../README.md` §Observability](../README.md).
The persistence-only `CountByStatus` method on the `runtimerecordstore`
adapter is **not** part of `ports.RuntimeRecordStore` because it is
only used by the gauge probe; widening the port for one caller would
force every adapter and test fake to grow with no benefit. The adapter
exposes it directly and the wiring composes a concrete-typed wrapper.
## 12. Shared lease across composed operations (restart, patch)
Restart and patch must hold the lease across the inner
`stop → docker rm → start` sequence, otherwise a concurrent stop or
restart could observe a half-recreated runtime.
`startruntime.Service` and `stopruntime.Service` therefore expose a
second public method:
```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```
`Handle` acquires the lease, defers its release, and calls `Run`.
Restart and patch acquire the outer lease themselves and call `Run`
on the inner services. The inner services record their own
`operation_log` entries, telemetry counters, health events, and admin
notification intents identically to a top-level `Handle`.
A typed `LeaseTicket` parameter (a small internal-package zero-size
struct that only the lease store can construct) was considered and
rejected for v1: only sister services in `internal/service/` ever call
`Run`, the docstring is loud about the precondition, and the pattern
can be tightened later without breaking the public surface that
consumers and handlers rely on.
## 13. Correlation id on `source_ref`
The outer restart and patch services reuse the existing
`Input.SourceRef` as a correlation key:
- when `Input.SourceRef` is non-empty (REST request id, stream entry
id), all three entries — outer restart / patch + inner stop +
inner start — share that value;
- when empty, the outer service generates a 32-byte base64url string
via the same `NewToken` generator that produces lease tokens, and
uses it as the correlation key for all three entries.
The outer entry's `source_ref` keeps its dual semantics: actor ref
when the caller supplied one, generated correlation id otherwise. Pure
top-level operations (caller invokes start, stop, or cleanup directly)
keep the original meaning. Composed operations (restart, patch) use
the same value in three places to make audit queries trivial.
This is not the cleanest end-state — a dedicated `correlation_id`
column would carry the link without ambiguity — but it is the smallest
change that does not touch the schema. A future stage that adds the
column can rename the field and clear up the dual role in one move.
## 14. Semver validation for patch
`internal/service/patchruntime/semver.go` enforces the
patch-precondition (current and new `image_ref` parse as semver, share
major and minor):
- `extractSemverTag(imageRef)` parses with
`github.com/distribution/reference.ParseNormalizedNamed`, casts to
`reference.NamedTagged`, then validates the tag with
`golang.org/x/mod/semver.IsValid` (after prepending `v` when the tag
omits it). Failures map to `image_ref_not_semver`;
- `samePatchSeries(currentSemver, newSemver)` compares
`semver.MajorMinor` of the two canonical strings; mismatch maps to
`semver_patch_only`.
`golang.org/x/mod` is a direct require to avoid a transitive-version
surprise. `github.com/Masterminds/semver/v3` (also in the module
graph) was rejected to avoid two semver libraries on disk for the
same job; `x/mod/semver` already covers Lobby. A hand-rolled
`vMajor.Minor.Patch` parser was rejected as premature.
Pre-checks run before any inner stop or `docker rm`: a rejected patch
never disturbs the running runtime. Patch with
`new_image_ref == current_image_ref` proceeds through the recreate
flow unchanged (not `replay_no_op`: the inner start still runs); the
outer `op_kind=patch` entry records the no-op patch for audit.
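A sketch consistent with this description; the helper names follow the text above, while the bodies are assumptions:
```go
package patchruntime

import (
	"errors"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var errNotSemver = errors.New("image_ref tag is not semver")

// extractSemverTag parses the image ref, requires an explicit tag,
// and validates it as semver (x/mod/semver needs the leading v).
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", err
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemver
	}
	tag := tagged.Tag()
	if tag == "" || tag[0] != 'v' {
		tag = "v" + tag
	}
	if !semver.IsValid(tag) {
		return "", errNotSemver
	}
	return tag, nil
}

// samePatchSeries maps a major/minor mismatch to semver_patch_only.
func samePatchSeries(currentSemver, newSemver string) bool {
	return semver.MajorMinor(currentSemver) == semver.MajorMinor(newSemver)
}
```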
## 15. `StopReason` placement
The reason enum mirrors `lobby/internal/ports/runtimemanager.go`
verbatim and lives at `internal/service/stopruntime/stopreason.go`.
The stream consumer and the REST handler import `stopruntime` for
the same enum the service requires.
Inner stop calls from restart and patch always pass
`StopReasonAdminRequest`. Restart and patch are platform-internal
recreate flows; `admin_request` is the closest semantic match in the
five-value vocabulary. The actor that originated the recreate (REST
request id, admin user id) flows through the `op_source` /
`source_ref` pair, not through the stop reason.
## 16. Error code centralisation
`internal/service/startruntime/errors.go` is the canonical home for
the stable error codes returned in `Result.ErrorCode`. The other four
services (`stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) import the constants from `startruntime` rather
than redeclaring them. The package comment of `errors.go` flags the
shared usage so future readers do not chase per-service declarations.
`start_config_invalid` is reserved for start because every start
validation failure also raises an admin notification intent. The
other services use the more general `invalid_request` for input
validation failures.
## 17. Stop / restart / patch / cleanup failure tables
### `stopruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | No notification intent. |
| Lease busy | `conflict` | Lease release skipped because acquire returned false. |
| Lease error | `service_unavailable` | Redis unreachable. |
| Record missing | `not_found` | |
| Status `stopped` / `removed` | success / `replay_no_op` | Idempotent re-stop. |
| `docker.Stop` returns `ErrContainerNotFound` | success | Record transitions `running → removed`, `container_disappeared` health event published. |
| `docker.Stop` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` (CAS race) | success / `replay_no_op` | The desired state was reached by another path (reconciler / restart). |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | Record vanished mid-stop. |
| `UpdateStatus` other error | `internal_error` | |
### `restartruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | Same as stop. |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | The `image_ref` may be empty; restart cannot proceed. |
| Inner stop fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner stop failed: ". |
| `docker.Remove` fails | `service_unavailable` | Inner stop already moved record to `stopped`; runtime stays in `stopped`. Admin must call `cleanup_container` before retrying restart. |
| Inner start fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner start failed: ". |
The post-stop `docker rm` failure is the only path that leaves the
runtime in a state from which the same operation cannot recover by
itself: a residual `galaxy-game-{game_id}` container blocks a fresh
inner start (the start service surfaces this as
`container_start_failed`). The runbook entry — "call cleanup, then
restart again" — is the standard remedy.
### `patchruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | |
| Current `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| New `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| Major / minor mismatch | `semver_patch_only` | Pre-check; no inner ops fired. |
| Inner stop / `docker rm` / inner start fails | inherits inner code | Same propagation as restart. |
### `cleanupcontainer`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | success / `replay_no_op` | |
| Status `running` | `conflict` | Error message: "stop the runtime first". |
| Status `stopped` | proceed | |
| `docker.Remove` returns `ErrContainerNotFound` | success | Adapter swallows not-found into nil. |
| `docker.Remove` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` | success / `replay_no_op` | Race with reconciler dispose. |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | |
| `UpdateStatus` other error | `internal_error` | |
## 18. REST handler conventions
The internal HTTP handlers under
[`../internal/api/internalhttp/handlers/`](../internal/api/internalhttp/handlers)
follow these rules:
- **`X-Galaxy-Caller` header.** The optional header carries the
calling service identity (`gm` / `admin`); the handler records the
value as `op_source` in `operation_log` (`gm_rest` / `admin_rest`).
Missing or unknown values default to `admin_rest` because the
audit-log queries for the cleanup endpoint already filter
`op_source ∈ {auto_ttl, admin_rest}`; making the default match
the most-restricted surface keeps existing dashboards correct when
an unconfigured client hits the listener. The header is declared as
a reusable parameter (`components.parameters.XGalaxyCallerHeader`)
in the OpenAPI spec and is referenced from each runtime operation
but not from `/healthz` and `/readyz`.
- **Error code → HTTP status mapping.** One canonical table in
`handlers/common.go`:
| ErrorCode | HTTP status |
| --- | ---: |
| (success, including `replay_no_op`) | 200 |
| `invalid_request`, `start_config_invalid`, `image_ref_not_semver` | 400 |
| `not_found` | 404 |
| `conflict`, `semver_patch_only` | 409 |
| `service_unavailable`, `docker_unavailable` | 503 |
| `internal_error`, `image_pull_failed`, `container_start_failed` | 500 |
`image_pull_failed` and `container_start_failed` are operational
failures that originate inside RTM (registry / daemon problems),
not client-side validation issues; they map to `500` so callers
retry through their normal resilience paths instead of treating
the call as a 4xx that must be fixed at the source.
`docker_unavailable` is reserved for future producers; today the
start service emits `service_unavailable` for Docker-daemon
failures. Unknown error codes default to `500`.
- **List and Get bypass the service layer.** `internalListRuntimes`
and `internalGetRuntime` read directly from
`ports.RuntimeRecordStore`. Reads do not produce `operation_log`
rows, do not change Docker state, do not need the per-game lease,
and do not have a stream-side counterpart — none of the lifecycle
service machinery is justified.
- **`RuntimeRecordStore.List(ctx)` returns every record regardless
of status.** A single SELECT ordered by
`(last_op_at DESC, game_id ASC)` — the same direction the
`runtime_records_status_last_op_idx` index supports, so freshly
active games surface first. Pagination is intentionally not
modelled in v1; the working set is bounded by the games tracked
by Lobby.
- **Per-handler service ports use `mockgen`.** The handler layer
depends on five narrow interfaces — one per lifecycle service —
declared in `handlers/services.go`. Production wiring passes the
concrete `*<lifecycle>.Service` pointers (each satisfies the
matching interface implicitly); tests pass the mockgen-generated
mocks under `handlers/mocks/`.
- **Conformance test scope.** `internalhttp/conformance_test.go`
drives every documented runtime operation against a real
`internalhttp.Server` whose service deps are deterministic stubs.
The test uses `kin-openapi/routers/legacy.NewRouter`, calls
`openapi3filter.ValidateRequest` and
`openapi3filter.ValidateResponse` so both directions match the
contract. The scope is happy-path only; the failure-path response
shapes are validated by the per-handler tests.
+412
View File
@@ -0,0 +1,412 @@
# Background Workers
This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):
- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
[`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
`InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
`/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
periodic TTL cleanup.
The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.
## 1. Single ownership per `event_type`
The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.
| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |
`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.
## 2. `container_disappeared` is conditional on PG state
The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply:
- record missing → suppress (the destroyed container was never owned
by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
swapped to a new container through restart or patch; the destroy is
the expected removal of the prior container id).
Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.
A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.
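In sketch form, with the record lookup result and the event shape as assumptions:
```go
package dockerevents

// RuntimeRecord carries only the fields the suppression rules read.
type RuntimeRecord struct {
	Status             string
	CurrentContainerID string
}

// shouldEmitDisappeared applies the three suppression rules plus the
// conservative read-failure rule for a Docker destroy action.
func shouldEmitDisappeared(rec *RuntimeRecord, readErr error, eventContainerID string) bool {
	if readErr != nil {
		return false // cannot tell external from RTM-initiated: suppress
	}
	if rec == nil {
		return false // never tracked as a runtime
	}
	if rec.Status != "running" {
		return false // expected tail of a stop or cleanup
	}
	if rec.CurrentContainerID != eventContainerID {
		return false // restart/patch already swapped container ids
	}
	return true // unexpected external destroy: emit
}
```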
## 3. `die` with exit code `0` is suppressed
`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.
## 4. Inspect worker leaves `container_disappeared` to the reconciler
When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:
- the reconciler is the single authority for PG-drift reconciliation.
Adding a third source for `container_disappeared` would risk double
emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
minutes. The latency window for "Docker drops the container, RTM
notices" is therefore at most 5 minutes in v1, which is acceptable
for the kinds of drift the reconciler covers (manual `docker rm`
outside RTM, daemon restart with stale records). If a future
requirement tightens the window, promoting the inspect-side
observation to a real `container_disappeared` is a one-line change.
## 5. Probe hysteresis is in-memory and pruned per tick
The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:
- `consecutiveFailures` — incremented on every failed probe, reset on
every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
the failure persists, and triggers a single `probe_recovered` on the
first success after the threshold was crossed.
The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.
State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.
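A sketch of the per-game state machine; the `emit` callback and the threshold plumbing are assumptions:
```go
package healthprobe

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe folds one probe outcome into the hysteresis: probe_failed
// fires once when the threshold is crossed, probe_recovered once on
// the first success after that.
func (s *probeState) observe(ok bool, threshold int, emit func(eventType string)) {
	if ok {
		if s.failurePublished {
			emit("probe_recovered")
		}
		s.consecutiveFailures = 0
		s.failurePublished = false
		return
	}
	s.consecutiveFailures++
	if s.consecutiveFailures >= threshold && !s.failurePublished {
		emit("probe_failed")
		s.failurePublished = true
	}
}
```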
## 6. Probe concurrency is bounded by a fixed cap
Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`). Three reasons:
- A single slow engine cannot delay the entire cohort. Sequential
per-game probing would multiply the worst case by `len(records)`,
which is the wrong shape for what is fundamentally a fan-out
observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
cap) was rejected to avoid pathological CPU and connection bursts
if the running list ever grows beyond what RTM was sized for. 16
in-flight probes at the default 2s timeout fit a single RTM
instance well within typical OS file-descriptor and TCP
ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
single-instance and the active-game count is bounded by Lobby; a
configurable cap is something we promote to env if a real workload
demands it.
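
A minimal sketch of the buffered-channel semaphore;
`defaultMaxConcurrency` mirrors the document, while `probeAll` and
`probeOne` are hypothetical stand-ins for the real worker code:

```go
package probe

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	sem := make(chan struct{}, defaultMaxConcurrency)
	var wg sync.WaitGroup
	for _, id := range gameIDs {
		wg.Add(1)
		sem <- struct{}{} // blocks once 16 probes are in flight
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			probeOne(ctx, id)
		}(id)
	}
	wg.Wait() // the tick completes only when the whole cohort has run
}
```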
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.
## 7. Events listener reconnects with fixed backoff
The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while ctx is alive.
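
A minimal sketch of the outer loop, assuming a hypothetical
`subscribe` function that blocks until the SDK events channel
terminates:

```go
package listener

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

func run(ctx context.Context, subscribe func(context.Context) error) {
	for {
		// subscribe blocks until the events channel terminates on a
		// transport error (daemon restart, socket hiccup, reset).
		if err := subscribe(ctx); err != nil {
			slog.Warn("events subscription lost, reconnecting", "err", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(defaultReconnectBackoff):
		}
	}
}
```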
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.
A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.
## 8. Health publisher remains best-effort
Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.
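
A minimal sketch of the fire-and-forget emission; the port name
follows the document, while the `emit` helper and the untyped event
parameter are illustrative:

```go
package health

import (
	"context"
	"log/slog"
)

type HealthEventPublisher interface {
	Publish(ctx context.Context, event any) error
}

func emit(ctx context.Context, pub HealthEventPublisher, event any) {
	if err := pub.Publish(ctx, event); err != nil {
		// Log and proceed: no retry, no hysteresis adjustment, no
		// error surfaced to the caller. The operation log, not the
		// stream, is the source of truth.
		slog.Error("health event publish failed", "err", err)
	}
}
```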
## 9. Stream offset labels are stable identifiers
Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels form the key in `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.
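
A minimal sketch of the label-to-key rule; `offsetKey` is a
hypothetical helper, while the labels and key pattern are from the
contract above:

```go
package offsets

import "fmt"

const (
	labelStartJobs = "startjobs"
	labelStopJobs  = "stopjobs"
)

func offsetKey(label string) string {
	// The key depends only on the stable label, never on the
	// (renameable) stream key the consumer reads from.
	return fmt.Sprintf("rtmanager:stream_offsets:%s", label)
}
```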
## 10. `OpSource` and `SourceRef` originate at the consumer boundary
Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.
## 11. Replay-no-op detection lives in the service layer
The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.
The wire-shape consequences:
- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
start, the existing record carries `container_id` and
`engine_endpoint`; for stop on `status=removed`, both fields are
empty strings (the record was nulled by an earlier cleanup) — the
AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
returned a zero `Record`; the consumer publishes empty
`container_id` and `engine_endpoint`.
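
A minimal sketch of the verbatim copy, with hypothetical structs
standing in for the real service result and the generated payload
types:

```go
package consumer

type handleResult struct {
	Outcome        string // "success" or "failure"
	ErrorCode      string // "" | "replay_no_op" | failure code
	ContainerID    string
	EngineEndpoint string
}

type runtimeJobResult struct {
	Outcome        string `json:"outcome"`
	ErrorCode      string `json:"error_code"`
	ContainerID    string `json:"container_id"`
	EngineEndpoint string `json:"engine_endpoint"`
}

// toPayload copies the service result verbatim; empty strings are
// legal on the required fields per the AsyncAPI contract (e.g. a
// stop replay against an already-removed record).
func toPayload(r handleResult) runtimeJobResult {
	return runtimeJobResult{
		Outcome:        r.Outcome,
		ErrorCode:      r.ErrorCode,
		ContainerID:    r.ContainerID,
		EngineEndpoint: r.EngineEndpoint,
	}
}
```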
## 12. Per-message errors are absorbed; the offset always advances
The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.
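
A minimal sketch of the loop shape, with hypothetical function-typed
dependencies; only the offset-save error escapes to the supervisor:

```go
package consumer

import (
	"context"
	"fmt"
	"log/slog"
)

type message struct{ ID string }

func runLoop(
	ctx context.Context,
	messages <-chan message,
	handle func(context.Context, message) error,
	saveOffset func(context.Context, string) error,
) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case msg := <-messages:
			if err := handle(ctx, msg); err != nil {
				// Decode, service, and publish failures are logged
				// and absorbed: the durable side effects already
				// happened on the first pass.
				slog.Error("message handling failed", "id", msg.ID, "err", err)
			}
			// The offset advances unconditionally. A save failure is
			// the one fatal case: losing the offset would replay the
			// entire backlog on the next restart.
			if err := saveOffset(ctx, msg.ID); err != nil {
				return fmt.Errorf("save stream offset: %w", err)
			}
		}
	}
}
```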
## 13. `requested_at_ms` is logged-only
The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.
## 14. Reconciler: per-game lease around every write
A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).
The reconciler splits its work into two phases:
- **Read pass — lockless.**
`docker.List({com.galaxy.owner=rtmanager})` followed by
`RuntimeRecords.ListByStatus(running)`. No lease is taken; both
reads are point-in-time observations of independent systems and a
stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
(`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
per-game lease, re-reads the record under the lease, and then
either applies the mutation or returns when state has changed.
A lease conflict (`acquired=false`) is logged at `Info` and the
game is silently skipped — the next tick will retry. A lease-store
error is logged at `Warn`; the rest of the pass continues.
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.
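
A minimal sketch of the lease-guarded `disposeOne` path, with
simplified port shapes; the real code adds the CAS and the event
publication elided here:

```go
package reconciler

import (
	"context"
	"log/slog"
)

type runtimeRecord struct {
	GameID string
	Status string
}

type recordStore interface {
	Get(ctx context.Context, gameID string) (runtimeRecord, error)
}

type leaseStore interface {
	Acquire(ctx context.Context, gameID string) (acquired bool, release func(), err error)
}

type Reconciler struct {
	records recordStore
	leases  leaseStore
}

func (r *Reconciler) disposeOne(ctx context.Context, gameID string) {
	acquired, release, err := r.leases.Acquire(ctx, gameID)
	if err != nil {
		slog.Warn("lease store error, skipping", "game_id", gameID, "err", err)
		return // the rest of the pass continues
	}
	if !acquired {
		slog.Info("lease held elsewhere, skipping", "game_id", gameID)
		return // the next tick retries
	}
	defer release()
	// Re-read under the lease: the read pass was lockless, so the
	// record may have moved by the time the lease is held.
	rec, err := r.records.Get(ctx, gameID)
	if err != nil || rec.Status != "running" {
		return // state changed; nothing to dispose
	}
	// ... apply the running -> removed mutation, publish the event ...
}
```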
## 15. Three drift kinds covered by the reconciler
- `adopt` — Docker reports a container labelled
`com.galaxy.owner=rtmanager` for which RTM has no record; insert a
fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
stop or remove the container (operators may have started it
manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
in Docker; mark `status=removed`, publish
`container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
exists but is in `exited`; mark `status=stopped`, publish
`container_exited` with the observed exit code. This third path
exists because the events listener sees only live events; a
container that died while RTM was offline would otherwise stay
`running` indefinitely. The drift is exposed through
`rtmanager.reconcile_drift{kind=observed_exited}` and through the
`container_exited` health event; no `operation_log` entry is
written because the audit log records explicit RTM operations, not
passive observations of Docker state.
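
A minimal sketch of the three-way classification over the two
lockless reads, with the inputs simplified to a state map and a set:

```go
package reconciler

type drift struct {
	GameID string
	Kind   string // "adopt" | "dispose" | "observed_exited"
}

// classify compares Docker's labelled containers with PG's running
// records. dockerState maps game_id to container state ("running",
// "exited", ...); pgRunning is the set of running-record game_ids.
func classify(dockerState map[string]string, pgRunning map[string]bool) []drift {
	var out []drift
	for id := range dockerState {
		if !pgRunning[id] {
			out = append(out, drift{id, "adopt"}) // container, no record
		}
	}
	for id := range pgRunning {
		state, exists := dockerState[id]
		switch {
		case !exists:
			out = append(out, drift{id, "dispose"}) // record, no container
		case state == "exited":
			out = append(out, drift{id, "observed_exited"}) // died while RTM was offline
		}
	}
	return out
}
```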
## 16. `stopped_at = now (reconciler observation time)`
The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.
The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time` field;
that change is deferred until a consumer materialises.
## 17. Synchronous initial pass + periodic Component
`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.
The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.
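
A minimal sketch of the startup ordering, with placeholder types; the
real wiring builds the reconciler in `newWiring` and registers it
with `app.New`:

```go
package main

import (
	"context"
	"log"
	"time"
)

type reconciler struct{}

// ReconcileNow is the public one-shot pass called before app.Run.
func (r *reconciler) ReconcileNow(ctx context.Context) error { return nil }

// Run satisfies the Component contract: it only ticks (no immediate
// pass), so the startup work is not duplicated.
func (r *reconciler) Run(ctx context.Context) error {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			_ = r.ReconcileNow(ctx) // periodic pass
		}
	}
}

func main() {
	ctx := context.Background()
	r := &reconciler{}
	// README step 6: run once synchronously, block until done,
	// before any Component starts.
	if err := r.ReconcileNow(ctx); err != nil {
		log.Fatal(err)
	}
	// ... then register r as a Component and start the app ...
}
```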
## 18. Adopt through `Upsert`, race with start is benign
The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.
Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:
- `created_at` is preserved across upserts by the
`ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
container, same image, same hostname, same state path).
Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.
## 19. Cleanup worker delegates to the service
The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate. The service already owns:
- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
`op_source=auto_ttl`);
- the telemetry counter and structured log fields.
In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count over
the retention window. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.
The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.
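
A minimal sketch of the worker-local seam and the TTL filter; the
`Cleaner` name follows the document, while the record shape and the
`tick` helper are simplified stand-ins:

```go
package cleanup

import (
	"context"
	"log/slog"
	"time"
)

type record struct {
	GameID   string
	LastOpAt time.Time
}

// Cleaner is the worker-local interface; production wiring binds
// *cleanupcontainer.Service to it, tests bind a trivial fake.
type Cleaner interface {
	Handle(ctx context.Context, gameID string) error
}

func tick(ctx context.Context, stopped []record, retention time.Duration, now time.Time, c Cleaner) {
	for _, rec := range stopped { // ListByStatus(stopped) result
		if rec.LastOpAt.Before(now.Add(-retention)) {
			// Lease, CAS, audit entry, and telemetry all live in
			// the service, not here.
			if err := c.Handle(ctx, rec.GameID); err != nil {
				slog.Error("cleanup failed", "game_id", rec.GameID, "err", err)
			}
		}
	}
}
```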
## 20. Sequential per-game work in reconciler and cleanup
Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.
## 21. Cross-module test boundary for the consumer integration test
[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:
- it XADDs a start envelope in the AsyncAPI wire shape (the same
shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
the persistence stores, the lease, and the notification / health
publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
`runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
`error_code=replay_no_op` outcome with no further Docker calls.
The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.
+129
View File
@@ -1,3 +1,132 @@
module galaxy/rtmanager
go 1.26.2
require (
galaxy/notificationintent v0.0.0-00010101000000-000000000000
galaxy/postgres v0.0.0-00010101000000-000000000000
galaxy/redisconn v0.0.0-00010101000000-000000000000
github.com/alicebob/miniredis/v2 v2.37.0
github.com/containerd/errdefs v1.0.0
github.com/distribution/reference v0.6.0
github.com/docker/docker v28.5.2+incompatible
github.com/docker/go-units v0.5.0
github.com/getkin/kin-openapi v0.135.0
github.com/go-jet/jet/v2 v2.14.1
github.com/jackc/pgx/v5 v5.9.2
github.com/redis/go-redis/v9 v9.18.0
github.com/stretchr/testify v1.11.1
github.com/testcontainers/testcontainers-go v0.42.0
github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0
github.com/testcontainers/testcontainers-go/modules/redis v0.42.0
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0
go.opentelemetry.io/otel v1.43.0
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0
go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0
go.opentelemetry.io/otel/metric v1.43.0
go.opentelemetry.io/otel/sdk v1.43.0
go.opentelemetry.io/otel/sdk/metric v1.43.0
go.opentelemetry.io/otel/trace v1.43.0
go.uber.org/mock v0.6.0
golang.org/x/mod v0.35.0
gopkg.in/yaml.v3 v3.0.1
)
require (
dario.cat/mergo v1.0.2 // indirect
github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c // indirect
github.com/Microsoft/go-winio v0.6.2 // indirect
github.com/XSAM/otelsql v0.42.0 // indirect
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/cenkalti/backoff/v5 v5.0.3 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/containerd/errdefs/pkg v0.3.0 // indirect
github.com/containerd/log v0.1.0 // indirect
github.com/containerd/platforms v0.2.1 // indirect
github.com/cpuguy83/dockercfg v0.3.2 // indirect
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
github.com/docker/go-connections v0.7.0 // indirect
github.com/ebitengine/purego v0.10.0 // indirect
github.com/felixge/httpsnoop v1.0.4 // indirect
github.com/go-logr/logr v1.4.3 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-ole/go-ole v1.2.6 // indirect
github.com/go-openapi/jsonpointer v0.21.0 // indirect
github.com/go-openapi/swag v0.23.0 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 // indirect
github.com/jackc/chunkreader/v2 v2.0.1 // indirect
github.com/jackc/pgconn v1.14.3 // indirect
github.com/jackc/pgio v1.0.0 // indirect
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgproto3/v2 v2.3.3 // indirect
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect
github.com/jackc/pgtype v1.14.4 // indirect
github.com/jackc/puddle/v2 v2.2.2 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/klauspost/compress v1.18.5 // indirect
github.com/lib/pq v1.10.9 // indirect
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
github.com/magiconair/properties v1.8.10 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mdelapenya/tlscert v0.2.0 // indirect
github.com/mfridman/interpolate v0.0.2 // indirect
github.com/moby/docker-image-spec v1.3.1 // indirect
github.com/moby/go-archive v0.2.0 // indirect
github.com/moby/moby/api v1.54.2 // indirect
github.com/moby/moby/client v0.4.1 // indirect
github.com/moby/patternmatcher v0.6.1 // indirect
github.com/moby/sys/atomicwriter v0.1.0 // indirect
github.com/moby/sys/sequential v0.6.0 // indirect
github.com/moby/sys/user v0.4.0 // indirect
github.com/moby/sys/userns v0.1.0 // indirect
github.com/moby/term v0.5.2 // indirect
github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826 // indirect
github.com/morikuni/aec v1.1.0 // indirect
github.com/oasdiff/yaml v0.0.9 // indirect
github.com/oasdiff/yaml3 v0.0.9 // indirect
github.com/opencontainers/go-digest v1.0.0 // indirect
github.com/opencontainers/image-spec v1.1.1 // indirect
github.com/perimeterx/marshmallow v1.1.5 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 // indirect
github.com/pressly/goose/v3 v3.27.1 // indirect
github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0 // indirect
github.com/redis/go-redis/extra/redisotel/v9 v9.18.0 // indirect
github.com/sethvargo/go-retry v0.3.0 // indirect
github.com/shirou/gopsutil/v4 v4.26.3 // indirect
github.com/sirupsen/logrus v1.9.4 // indirect
github.com/tklauser/go-sysconf v0.3.16 // indirect
github.com/tklauser/numcpus v0.11.0 // indirect
github.com/ugorji/go/codec v1.3.1 // indirect
github.com/woodsbury/decimal128 v1.3.0 // indirect
github.com/yuin/gopher-lua v1.1.1 // indirect
github.com/yusufpapurcu/wmi v1.2.4 // indirect
go.opentelemetry.io/auto/sdk v1.2.1 // indirect
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 // indirect
go.opentelemetry.io/proto/otlp v1.10.0 // indirect
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/crypto v0.50.0 // indirect
golang.org/x/net v0.53.0 // indirect
golang.org/x/sync v0.20.0 // indirect
golang.org/x/sys v0.43.0 // indirect
golang.org/x/text v0.36.0 // indirect
golang.org/x/time v0.15.0 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529 // indirect
google.golang.org/grpc v1.80.0 // indirect
google.golang.org/protobuf v1.36.11 // indirect
)
replace galaxy/postgres => ../pkg/postgres
replace galaxy/redisconn => ../pkg/redisconn
replace galaxy/notificationintent => ../pkg/notificationintent
+475
View File
@@ -0,0 +1,475 @@
dario.cat/mergo v1.0.2 h1:85+piFYR1tMbRrLcDwR18y4UKJ3aH1Tbzi24VRW1TK8=
dario.cat/mergo v1.0.2/go.mod h1:E/hbnu0NxMFBjpMIE34DRGLWqDy0g5FuKDhCb31ngxA=
github.com/AdaLogics/go-fuzz-headers v0.0.0-20240806141605-e8a1dd7889d6 h1:He8afgbRMd7mFxO99hRNu+6tazq8nFF9lIwo9JFroBk=
github.com/AdaLogics/go-fuzz-headers v0.0.0-20240806141605-e8a1dd7889d6/go.mod h1:8o94RPi1/7XTJvwPpRSzSUedZrtlirdB3r9Z20bi2f8=
github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c h1:udKWzYgxTojEKWjV8V+WSxDXJ4NFATAsZjh8iIbsQIg=
github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c/go.mod h1:xomTg63KZ2rFqZQzSB4Vz2SUXa1BpHTVz9L5PTmPC4E=
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
github.com/Masterminds/semver/v3 v3.1.1/go.mod h1:VPu/7SZ7ePZ3QOrcuXROw5FAcLl4a0cBrbBpGY/8hQs=
github.com/Microsoft/go-winio v0.6.2 h1:F2VQgta7ecxGYO8k3ZZz3RS8fVIXVxONVUPlNERoyfY=
github.com/Microsoft/go-winio v0.6.2/go.mod h1:yd8OoFMLzJbo9gZq8j5qaps8bJ9aShtEA8Ipt1oGCvU=
github.com/XSAM/otelsql v0.42.0 h1:Li0xF4eJUxG2e0x3D4rvRlys1f27yJKvjTh7ljkUP5o=
github.com/XSAM/otelsql v0.42.0/go.mod h1:4mOrEv+cS1KmKzrvTktvJnstr5GtKSAK+QHvFR9OcpI=
github.com/alicebob/miniredis/v2 v2.37.0 h1:RheObYW32G1aiJIj81XVt78ZHJpHonHLHW7OLIshq68=
github.com/alicebob/miniredis/v2 v2.37.0/go.mod h1:TcL7YfarKPGDAthEtl5NBeHZfeUQj6OXMm/+iu5cLMM=
github.com/bsm/ginkgo/v2 v2.12.0 h1:Ny8MWAHyOepLGlLKYmXG4IEkioBysk6GpaRTLC8zwWs=
github.com/bsm/ginkgo/v2 v2.12.0/go.mod h1:SwYbGRRDovPVboqFv0tPTcG1sN61LM1Z4ARdbAV9g4c=
github.com/bsm/gomega v1.27.10 h1:yeMWxP2pV2fG3FgAODIY8EiRE3dy0aeFYt4l7wh6yKA=
github.com/bsm/gomega v1.27.10/go.mod h1:JyEr/xRbxbtgWNi8tIEVPUYZ5Dzef52k01W3YH0H+O0=
github.com/cenkalti/backoff/v4 v4.3.0 h1:MyRJ/UdXutAwSAT+s3wNd7MfTIcy71VQueUuFK343L8=
github.com/cenkalti/backoff/v4 v4.3.0/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=
github.com/cenkalti/backoff/v5 v5.0.3 h1:ZN+IMa753KfX5hd8vVaMixjnqRZ3y8CuJKRKj1xcsSM=
github.com/cenkalti/backoff/v5 v5.0.3/go.mod h1:rkhZdG3JZukswDf7f0cwqPNk4K0sa+F97BxZthm/crw=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
github.com/cockroachdb/apd v1.1.0/go.mod h1:8Sl8LxpKi29FqWXR16WEFZRNSz3SoPzUzeMeY4+DwBQ=
github.com/containerd/errdefs v1.0.0 h1:tg5yIfIlQIrxYtu9ajqY42W3lpS19XqdxRQeEwYG8PI=
github.com/containerd/errdefs v1.0.0/go.mod h1:+YBYIdtsnF4Iw6nWZhJcqGSg/dwvV7tyJ/kCkyJ2k+M=
github.com/containerd/errdefs/pkg v0.3.0 h1:9IKJ06FvyNlexW690DXuQNx2KA2cUJXx151Xdx3ZPPE=
github.com/containerd/errdefs/pkg v0.3.0/go.mod h1:NJw6s9HwNuRhnjJhM7pylWwMyAkmCQvQ4GpJHEqRLVk=
github.com/containerd/log v0.1.0 h1:TCJt7ioM2cr/tfR8GPbGf9/VRAX8D2B4PjzCpfX540I=
github.com/containerd/log v0.1.0/go.mod h1:VRRf09a7mHDIRezVKTRCrOq78v577GXq3bSa3EhrzVo=
github.com/containerd/platforms v0.2.1 h1:zvwtM3rz2YHPQsF2CHYM8+KtB5dvhISiXh5ZpSBQv6A=
github.com/containerd/platforms v0.2.1/go.mod h1:XHCb+2/hzowdiut9rkudds9bE5yJ7npe7dG/wG+uFPw=
github.com/coreos/go-systemd v0.0.0-20190321100706-95778dfbb74e/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
github.com/coreos/go-systemd v0.0.0-20190719114852-fd7a80b32e1f/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GKorA=
github.com/cpuguy83/dockercfg v0.3.2/go.mod h1:sugsbF4//dDlL/i+S+rtpIWp+5h0BHJHfjj5/jFyUJc=
github.com/creack/pty v1.1.7/go.mod h1:lj5s0c3V2DBrqTV7llrYr5NG6My20zk30Fl46Y7DoTY=
github.com/creack/pty v1.1.24 h1:bJrF4RRfyJnbTJqzRLHzcGaZK1NeM5kTC9jGgovnR1s=
github.com/creack/pty v1.1.24/go.mod h1:08sCNb52WyoAwi2QDyzUCTgcvVFhUzewun7wtTfvcwE=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM=
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f h1:lO4WD4F/rVNCu3HqELle0jiPLLBs70cWOduZpkS1E78=
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f/go.mod h1:cuUVRXasLTGF7a8hSLbxyZXjz+1KgoB3wDUb6vlszIc=
github.com/distribution/reference v0.6.0 h1:0IXCQ5g4/QMHHkarYzh5l+u8T3t73zM5QvfrDyIgxBk=
github.com/distribution/reference v0.6.0/go.mod h1:BbU0aIcezP1/5jX/8MP0YiH4SdvB5Y4f/wlDRiLyi3E=
github.com/docker/docker v28.5.2+incompatible h1:DBX0Y0zAjZbSrm1uzOkdr1onVghKaftjlSWt4AFexzM=
github.com/docker/docker v28.5.2+incompatible/go.mod h1:eEKB0N0r5NX/I1kEveEz05bcu8tLC/8azJZsviup8Sk=
github.com/docker/go-connections v0.7.0 h1:6SsRfJddP22WMrCkj19x9WKjEDTB+ahsdiGYf0mN39c=
github.com/docker/go-connections v0.7.0/go.mod h1:no1qkHdjq7kLMGUXYAduOhYPSJxxvgWBh7ogVvptn3Q=
github.com/docker/go-units v0.5.0 h1:69rxXcBk27SvSaaxTtLh/8llcHD8vYHT7WSdRZ/jvr4=
github.com/docker/go-units v0.5.0/go.mod h1:fgPhTUdO+D/Jk86RDLlptpiXQzgHJF7gydDDbaIK4Dk=
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
github.com/ebitengine/purego v0.10.0 h1:QIw4xfpWT6GWTzaW5XEKy3HXoqrJGx1ijYHzTF0/ISU=
github.com/ebitengine/purego v0.10.0/go.mod h1:iIjxzd6CiRiOG0UyXP+V1+jWqUXVjPKLAI0mRfJZTmQ=
github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
github.com/getkin/kin-openapi v0.135.0 h1:751SjYfbiwqukYuVjwYEIKNfrSwS5YpA7DZnKSwQgtg=
github.com/getkin/kin-openapi v0.135.0/go.mod h1:6dd5FJl6RdX4usBtFBaQhk9q62Yb2J0Mk5IhUO/QqFI=
github.com/go-jet/jet/v2 v2.14.1 h1:wsfD9e7CGP9h46+IFNlftfncBcmVnKddikbTtapQM3M=
github.com/go-jet/jet/v2 v2.14.1/go.mod h1:dqTAECV2Mo3S2NFjbm4vJ1aDruZjhaJ1RAAR8rGUkkc=
github.com/go-kit/log v0.1.0/go.mod h1:zbhenjAZHb184qTLMA9ZjW7ThYL0H2mk7Q6pNt4vbaY=
github.com/go-logfmt/logfmt v0.5.0/go.mod h1:wCYkCAKZfumFQihp8CzCvQ3paCTfi41vtzG1KdI/P7A=
github.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=
github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
github.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag=
github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE=
github.com/go-ole/go-ole v1.2.6 h1:/Fpf6oFPoeFik9ty7siob0G6Ke8QvQEuVcuChpwXzpY=
github.com/go-ole/go-ole v1.2.6/go.mod h1:pprOEPIfldk/42T2oK7lQ4v4JSDwmV0As9GaiUsvbm0=
github.com/go-openapi/jsonpointer v0.21.0 h1:YgdVicSA9vH5RiHs9TZW5oyafXZFc6+2Vc1rr/O9oNQ=
github.com/go-openapi/jsonpointer v0.21.0/go.mod h1:IUyH9l/+uyhIYQ/PXVA41Rexl+kOkAPDdXEYns6fzUY=
github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE=
github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ=
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
github.com/go-test/deep v1.0.8 h1:TDsG77qcSprGbC6vTN8OuXp5g+J+b5Pcguhf7Zt61VM=
github.com/go-test/deep v1.0.8/go.mod h1:5C2ZWiW0ErCdrYzpqxLbTX7MG14M9iiw8DgHncVwcsE=
github.com/gofrs/uuid v4.0.0+incompatible/go.mod h1:b2aQJv3Z4Fp6yNu3cdSllBxTCLRxnplIgP/c0N/04lM=
github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
github.com/google/go-cmp v0.5.6/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI=
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/gorilla/mux v1.8.0 h1:i40aqfkR1h2SlN9hojwV5ZA91wcXFOvkdNIeFDP5koI=
github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 h1:HWRh5R2+9EifMyIHV7ZV+MIZqgz+PMpZ14Jynv3O2Zs=
github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0/go.mod h1:JfhWUomR1baixubs02l85lZYYOm7LV6om4ceouMv45c=
github.com/jackc/chunkreader v1.0.0/go.mod h1:RT6O25fNZIuasFJRyZ4R/Y2BbhasbmZXF9QQ7T3kePo=
github.com/jackc/chunkreader/v2 v2.0.0/go.mod h1:odVSm741yZoC3dpHEUXIqA9tQRhFrgOHwnPIn9lDKlk=
github.com/jackc/chunkreader/v2 v2.0.1 h1:i+RDz65UE+mmpjTfyz0MoVTnzeYxroil2G82ki7MGG8=
github.com/jackc/chunkreader/v2 v2.0.1/go.mod h1:odVSm741yZoC3dpHEUXIqA9tQRhFrgOHwnPIn9lDKlk=
github.com/jackc/pgconn v0.0.0-20190420214824-7e0022ef6ba3/go.mod h1:jkELnwuX+w9qN5YIfX0fl88Ehu4XC3keFuOJJk9pcnA=
github.com/jackc/pgconn v0.0.0-20190824142844-760dd75542eb/go.mod h1:lLjNuW/+OfW9/pnVKPazfWOgNfH2aPem8YQ7ilXGvJE=
github.com/jackc/pgconn v0.0.0-20190831204454-2fabfa3c18b7/go.mod h1:ZJKsE/KZfsUgOEh9hBm+xYTstcNHg7UPMVJqRfQxq4s=
github.com/jackc/pgconn v1.8.0/go.mod h1:1C2Pb36bGIP9QHGBYCjnyhqu7Rv3sGshaQUvmfGIB/o=
github.com/jackc/pgconn v1.9.0/go.mod h1:YctiPyvzfU11JFxoXokUOOKQXQmDMoJL9vJzHH8/2JY=
github.com/jackc/pgconn v1.9.1-0.20210724152538-d89c8390a530/go.mod h1:4z2w8XhRbP1hYxkpTuBjTS3ne3J48K83+u0zoyvg2pI=
github.com/jackc/pgconn v1.14.3 h1:bVoTr12EGANZz66nZPkMInAV/KHD2TxH9npjXXgiB3w=
github.com/jackc/pgconn v1.14.3/go.mod h1:RZbme4uasqzybK2RK5c65VsHxoyaml09lx3tXOcO/VM=
github.com/jackc/pgio v1.0.0 h1:g12B9UwVnzGhueNavwioyEEpAmqMe1E/BN9ES+8ovkE=
github.com/jackc/pgio v1.0.0/go.mod h1:oP+2QK2wFfUWgr+gxjoBH9KGBb31Eio69xUb0w5bYf8=
github.com/jackc/pgmock v0.0.0-20190831213851-13a1b77aafa2/go.mod h1:fGZlG77KXmcq05nJLRkk0+p82V8B8Dw8KN2/V9c/OAE=
github.com/jackc/pgmock v0.0.0-20201204152224-4fe30f7445fd/go.mod h1:hrBW0Enj2AZTNpt/7Y5rr2xe/9Mn757Wtb2xeBzPv2c=
github.com/jackc/pgmock v0.0.0-20210724152146-4ad1a8207f65 h1:DadwsjnMwFjfWc9y5Wi/+Zz7xoE5ALHsRQlOctkOiHc=
github.com/jackc/pgmock v0.0.0-20210724152146-4ad1a8207f65/go.mod h1:5R2h2EEX+qri8jOWMbJCtaPWkrrNc7OHwsp2TCqp7ak=
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg=
github.com/jackc/pgproto3 v1.1.0/go.mod h1:eR5FA3leWg7p9aeAqi37XOTgTIbkABlvcPB3E5rlc78=
github.com/jackc/pgproto3/v2 v2.0.0-alpha1.0.20190420180111-c116219b62db/go.mod h1:bhq50y+xrl9n5mRYyCBFKkpRVTLYJVWeCc+mEAI3yXA=
github.com/jackc/pgproto3/v2 v2.0.0-alpha1.0.20190609003834-432c2951c711/go.mod h1:uH0AWtUmuShn0bcesswc4aBTWGvw0cAxIJp+6OB//Wg=
github.com/jackc/pgproto3/v2 v2.0.0-rc3/go.mod h1:ryONWYqW6dqSg1Lw6vXNMXoBJhpzvWKnT95C46ckYeM=
github.com/jackc/pgproto3/v2 v2.0.0-rc3.0.20190831210041-4c03ce451f29/go.mod h1:ryONWYqW6dqSg1Lw6vXNMXoBJhpzvWKnT95C46ckYeM=
github.com/jackc/pgproto3/v2 v2.0.6/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
github.com/jackc/pgproto3/v2 v2.1.1/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
github.com/jackc/pgproto3/v2 v2.3.3 h1:1HLSx5H+tXR9pW3in3zaztoEwQYRC9SQaYUHjTSUOag=
github.com/jackc/pgproto3/v2 v2.3.3/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
github.com/jackc/pgservicefile v0.0.0-20200714003250-2b9c44734f2b/go.mod h1:vsD4gTJCa9TptPL8sPkXrLZ+hDuNrZCnj29CQpr4X1E=
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 h1:iCEnooe7UlwOQYpKFhBabPMi4aNAfoODPEFNiAnClxo=
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
github.com/jackc/pgtype v0.0.0-20190421001408-4ed0de4755e0/go.mod h1:hdSHsc1V01CGwFsrv11mJRHWJ6aifDLfdV3aVjFF0zg=
github.com/jackc/pgtype v0.0.0-20190824184912-ab885b375b90/go.mod h1:KcahbBH1nCMSo2DXpzsoWOAfFkdEtEJpPbVLq8eE+mc=
github.com/jackc/pgtype v0.0.0-20190828014616-a8802b16cc59/go.mod h1:MWlu30kVJrUS8lot6TQqcg7mtthZ9T0EoIBFiJcmcyw=
github.com/jackc/pgtype v1.8.1-0.20210724151600-32e20a603178/go.mod h1:C516IlIV9NKqfsMCXTdChteoXmwgUceqaLfjg2e3NlM=
github.com/jackc/pgtype v1.14.0/go.mod h1:LUMuVrfsFfdKGLw+AFFVv6KtHOFMwRgDDzBt76IqCA4=
github.com/jackc/pgtype v1.14.4 h1:fKuNiCumbKTAIxQwXfB/nsrnkEI6bPJrrSiMKgbJ2j8=
github.com/jackc/pgtype v1.14.4/go.mod h1:aKeozOde08iifGosdJpz9MBZonJOUJxqNpPBcMJTlVA=
github.com/jackc/pgx/v4 v4.0.0-20190420224344-cc3461e65d96/go.mod h1:mdxmSJJuR08CZQyj1PVQBHy9XOp5p8/SHH6a0psbY9Y=
github.com/jackc/pgx/v4 v4.0.0-20190421002000-1b8f0016e912/go.mod h1:no/Y67Jkk/9WuGR0JG/JseM9irFbnEPbuWV2EELPNuM=
github.com/jackc/pgx/v4 v4.0.0-pre1.0.20190824185557-6972a5742186/go.mod h1:X+GQnOEnf1dqHGpw7JmHqHc1NxDoalibchSk9/RWuDc=
github.com/jackc/pgx/v4 v4.12.1-0.20210724153913-640aa07df17c/go.mod h1:1QD0+tgSXP7iUjYm9C1NxKhny7lq6ee99u/z+IHFcgs=
github.com/jackc/pgx/v4 v4.18.2/go.mod h1:Ey4Oru5tH5sB6tV7hDmfWFahwF15Eb7DNXlRKx2CkVw=
github.com/jackc/pgx/v4 v4.18.3 h1:dE2/TrEsGX3RBprb3qryqSV9Y60iZN1C6i8IrmW9/BA=
github.com/jackc/pgx/v4 v4.18.3/go.mod h1:Ey4Oru5tH5sB6tV7hDmfWFahwF15Eb7DNXlRKx2CkVw=
github.com/jackc/pgx/v5 v5.9.2 h1:3ZhOzMWnR4yJ+RW1XImIPsD1aNSz4T4fyP7zlQb56hw=
github.com/jackc/pgx/v5 v5.9.2/go.mod h1:mal1tBGAFfLHvZzaYh77YS/eC6IX9OWbRV1QIIM0Jn4=
github.com/jackc/puddle v0.0.0-20190413234325-e4ced69a3a2b/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
github.com/jackc/puddle v0.0.0-20190608224051-11cab39313c9/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
github.com/jackc/puddle v1.1.3/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
github.com/jackc/puddle v1.3.0/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
github.com/jackc/puddle/v2 v2.2.2 h1:PR8nw+E/1w0GLuRFSmiioY6UooMp6KJv0/61nB7icHo=
github.com/jackc/puddle/v2 v2.2.2/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4=
github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY=
github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y=
github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
github.com/klauspost/cpuid/v2 v2.3.0 h1:S4CRMLnYUhGeDFDqkGriYKdfoFlDnMtqTiI/sFzhA9Y=
github.com/klauspost/cpuid/v2 v2.3.0/go.mod h1:hqwkgyIinND0mEev00jJYCxPNVRVXFQeu1XKlok6oO0=
github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
github.com/konsorten/go-windows-terminal-sequences v1.0.2/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/pty v1.1.8/go.mod h1:O1sed60cT9XZ5uDucP5qwvh+TE3NnUj51EiZO/lmSfw=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
github.com/lib/pq v1.0.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
github.com/lib/pq v1.1.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
github.com/lib/pq v1.2.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
github.com/lib/pq v1.10.2/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o=
github.com/lib/pq v1.10.9 h1:YXG7RB+JIjhP29X+OtkiDnYaXQwpS4JEWq7dtCCRUEw=
github.com/lib/pq v1.10.9/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o=
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 h1:6E+4a0GO5zZEnZ81pIr0yLvtUWk2if982qA3F3QD6H4=
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0/go.mod h1:zJYVVT2jmtg6P3p1VtQj7WsuWi/y4VnjVBn7F8KPB3I=
github.com/magiconair/properties v1.8.10 h1:s31yESBquKXCV9a/ScB3ESkOjUYYv+X0rg8SYxI99mE=
github.com/magiconair/properties v1.8.10/go.mod h1:Dhd985XPs7jluiymwWYZ0G4Z61jb3vdS329zhj2hYo0=
github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0=
github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc=
github.com/mattn/go-colorable v0.1.1/go.mod h1:FuOcm+DKB9mbwrcAfNl7/TZVBZ6rcnceauSikq3lYCQ=
github.com/mattn/go-colorable v0.1.6/go.mod h1:u6P/XSegPjTcexA+o6vUJrdnUu04hMope9wVRipJSqc=
github.com/mattn/go-isatty v0.0.5/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-isatty v0.0.7/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
github.com/mattn/go-isatty v0.0.12/go.mod h1:cbi8OIDigv2wuxKPP5vlRcQ1OAZbq2CE4Kysco4FUpU=
github.com/mattn/go-isatty v0.0.21 h1:xYae+lCNBP7QuW4PUnNG61ffM4hVIfm+zUzDuSzYLGs=
github.com/mattn/go-isatty v0.0.21/go.mod h1:ZXfXG4SQHsB/w3ZeOYbR0PrPwLy+n6xiMrJlRFqopa4=
github.com/mdelapenya/tlscert v0.2.0 h1:7H81W6Z/4weDvZBNOfQte5GpIMo0lGYEeWbkGp5LJHI=
github.com/mdelapenya/tlscert v0.2.0/go.mod h1:O4njj3ELLnJjGdkN7M/vIVCpZ+Cf0L6muqOG4tLSl8o=
github.com/mfridman/interpolate v0.0.2 h1:pnuTK7MQIxxFz1Gr+rjSIx9u7qVjf5VOoM/u6BbAxPY=
github.com/mfridman/interpolate v0.0.2/go.mod h1:p+7uk6oE07mpE/Ik1b8EckO0O4ZXiGAfshKBWLUM9Xg=
github.com/moby/docker-image-spec v1.3.1 h1:jMKff3w6PgbfSa69GfNg+zN/XLhfXJGnEx3Nl2EsFP0=
github.com/moby/docker-image-spec v1.3.1/go.mod h1:eKmb5VW8vQEh/BAr2yvVNvuiJuY6UIocYsFu/DxxRpo=
github.com/moby/go-archive v0.2.0 h1:zg5QDUM2mi0JIM9fdQZWC7U8+2ZfixfTYoHL7rWUcP8=
github.com/moby/go-archive v0.2.0/go.mod h1:mNeivT14o8xU+5q1YnNrkQVpK+dnNe/K6fHqnTg4qPU=
github.com/moby/moby/api v1.54.2 h1:wiat9QAhnDQjA7wk1kh/TqHz2I1uUA7M7t9SAl/JNXg=
github.com/moby/moby/api v1.54.2/go.mod h1:+RQ6wluLwtYaTd1WnPLykIDPekkuyD/ROWQClE83pzs=
github.com/moby/moby/client v0.4.1 h1:DMQgisVoMkmMs7fp3ROSdiBnoAu8+vo3GggFl06M/wY=
github.com/moby/moby/client v0.4.1/go.mod h1:z52C9O2POPOsnxZAy//WtKcQ32P+jT/NGeXu/7nfjGQ=
github.com/moby/patternmatcher v0.6.1 h1:qlhtafmr6kgMIJjKJMDmMWq7WLkKIo23hsrpR3x084U=
github.com/moby/patternmatcher v0.6.1/go.mod h1:hDPoyOpDY7OrrMDLaYoY3hf52gNCR/YOUYxkhApJIxc=
github.com/moby/sys/atomicwriter v0.1.0 h1:kw5D/EqkBwsBFi0ss9v1VG3wIkVhzGvLklJ+w3A14Sw=
github.com/moby/sys/atomicwriter v0.1.0/go.mod h1:Ul8oqv2ZMNHOceF643P6FKPXeCmYtlQMvpizfsSoaWs=
github.com/moby/sys/sequential v0.6.0 h1:qrx7XFUd/5DxtqcoH1h438hF5TmOvzC/lspjy7zgvCU=
github.com/moby/sys/sequential v0.6.0/go.mod h1:uyv8EUTrca5PnDsdMGXhZe6CCe8U/UiTWd+lL+7b/Ko=
github.com/moby/sys/user v0.4.0 h1:jhcMKit7SA80hivmFJcbB1vqmw//wU61Zdui2eQXuMs=
github.com/moby/sys/user v0.4.0/go.mod h1:bG+tYYYJgaMtRKgEmuueC0hJEAZWwtIbZTB+85uoHjs=
github.com/moby/sys/userns v0.1.0 h1:tVLXkFOxVu9A64/yh59slHVv9ahO9UIev4JZusOLG/g=
github.com/moby/sys/userns v0.1.0/go.mod h1:IHUYgu/kao6N8YZlp9Cf444ySSvCmDlmzUcYfDHOl28=
github.com/moby/term v0.5.2 h1:6qk3FJAFDs6i/q3W/pQ97SX192qKfZgGjCQqfCJkgzQ=
github.com/moby/term v0.5.2/go.mod h1:d3djjFCrjnB+fl8NJux+EJzu0msscUP+f8it8hPkFLc=
github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826 h1:RWengNIwukTxcDr9M+97sNutRR1RKhG96O6jWumTTnw=
github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826/go.mod h1:TaXosZuwdSHYgviHp1DAtfrULt5eUgsSMsZf+YrPgl8=
github.com/morikuni/aec v1.1.0 h1:vBBl0pUnvi/Je71dsRrhMBtreIqNMYErSAbEeb8jrXQ=
github.com/morikuni/aec v1.1.0/go.mod h1:xDRgiq/iw5l+zkao76YTKzKttOp2cwPEne25HDkJnBw=
github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
github.com/ncruces/go-strftime v1.0.0/go.mod h1:Fwc5htZGVVkseilnfgOVb9mKy6w1naJmn9CehxcKcls=
github.com/oasdiff/yaml v0.0.9 h1:zQOvd2UKoozsSsAknnWoDJlSK4lC0mpmjfDsfqNwX48=
github.com/oasdiff/yaml v0.0.9/go.mod h1:8lvhgJG4xiKPj3HN5lDow4jZHPlx1i7dIwzkdAo6oAM=
github.com/oasdiff/yaml3 v0.0.9 h1:rWPrKccrdUm8J0F3sGuU+fuh9+1K/RdJlWF7O/9yw2g=
github.com/oasdiff/yaml3 v0.0.9/go.mod h1:y5+oSEHCPT/DGrS++Wc/479ERge0zTFxaF8PbGKcg2o=
github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
github.com/opencontainers/go-digest v1.0.0/go.mod h1:0JzlMkj0TRzQZfJkVvzbP0HBR3IKzErnv2BNG4W4MAM=
github.com/opencontainers/image-spec v1.1.1 h1:y0fUlFfIZhPF1W537XOLg0/fcx6zcHCJwooC2xJA040=
github.com/opencontainers/image-spec v1.1.1/go.mod h1:qpqAh3Dmcf36wStyyWU+kCeDgrGnAve2nCC8+7h8Q0M=
github.com/perimeterx/marshmallow v1.1.5 h1:a2LALqQ1BlHM8PZblsDdidgv1mWi1DgC2UmX50IvK2s=
github.com/perimeterx/marshmallow v1.1.5/go.mod h1:dsXbUu8CRzfYP5a87xpp0xq9S3u0Vchtcl8we9tYaXw=
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U=
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 h1:o4JXh1EVt9k/+g42oCprj/FisM4qX9L3sZB3upGN2ZU=
github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE=
github.com/pressly/goose/v3 v3.27.1 h1:6uEvcprBybDmW4hcz3gYujhARhye+GoWKhEWyzD5sh4=
github.com/pressly/goose/v3 v3.27.1/go.mod h1:maruOxsPnIG2yHHyo8UqKWXYKFcH7Q76csUV7+7KYoM=
github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0 h1:QY4nmPHLFAJjtT5O4OMUEOxP8WVaRNOFpcbmxT2NLZU=
github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0/go.mod h1:WH8cY/0fT41Bsf341qzo8v4nx0GCE8FykAA23IVbVmo=
github.com/redis/go-redis/extra/redisotel/v9 v9.18.0 h1:2dKdoEYBJ0CZCLPiCdvvc7luz3DPwY6hKdzjL6m1eHE=
github.com/redis/go-redis/extra/redisotel/v9 v9.18.0/go.mod h1:WzkrVG9ro9BwCQD0eJOWn6AGL4Z1CleGflM45w1hu10=
github.com/redis/go-redis/v9 v9.18.0 h1:pMkxYPkEbMPwRdenAzUNyFNrDgHx9U+DrBabWNfSRQs=
github.com/redis/go-redis/v9 v9.18.0/go.mod h1:k3ufPphLU5YXwNTUcCRXGxUoF1fqxnhFQmscfkCoDA0=
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec h1:W09IVJc94icq4NjY3clb7Lk8O1qJ8BdBEF8z0ibU0rE=
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ=
github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc=
github.com/rs/xid v1.2.1/go.mod h1:+uKXf+4Djp6Md1KODXJxgGQPKngRmWyn10oCKFzNHOQ=
github.com/rs/zerolog v1.13.0/go.mod h1:YbFCdg8HfsridGWAh22vktObvhZbQsZXe4/zB0OKkWU=
github.com/rs/zerolog v1.15.0/go.mod h1:xYTKnLHcpfU2225ny5qZjxnj9NvkumZYjJHlAThCjNc=
github.com/satori/go.uuid v1.2.0/go.mod h1:dA0hQrYB0VpLJoorglMZABFdXlWrHn1NEOzdhQKdks0=
github.com/sethvargo/go-retry v0.3.0 h1:EEt31A35QhrcRZtrYFDTBg91cqZVnFL2navjDrah2SE=
github.com/sethvargo/go-retry v0.3.0/go.mod h1:mNX17F0C/HguQMyMyJxcnU471gOZGxCLyYaFyAZraas=
github.com/shirou/gopsutil/v4 v4.26.3 h1:2ESdQt90yU3oXF/CdOlRCJxrP+Am1aBYubTMTfxJ1qc=
github.com/shirou/gopsutil/v4 v4.26.3/go.mod h1:LZ6ewCSkBqUpvSOf+LsTGnRinC6iaNUNMGBtDkJBaLQ=
github.com/shopspring/decimal v0.0.0-20180709203117-cd690d0c9e24/go.mod h1:M+9NzErvs504Cn4c5DxATwIqPbtswREoFCre64PpcG4=
github.com/shopspring/decimal v1.2.0/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFRcu2hWCYk4o=
github.com/sirupsen/logrus v1.4.1/go.mod h1:ni0Sbl8bgC9z8RoU9G6nDWqqs/fq4eDPysMBDgk/93Q=
github.com/sirupsen/logrus v1.4.2/go.mod h1:tLMulIdttU9McNUspp0xgXVQah82FyeX6MwdIuYE2rE=
github.com/sirupsen/logrus v1.9.4 h1:TsZE7l11zFCLZnZ+teH4Umoq5BhEIfIzfRDZ1Uzql2w=
github.com/sirupsen/logrus v1.9.4/go.mod h1:ftWc9WdOfJ0a92nsE2jF5u5ZwH8Bv2zdeOC42RjbV2g=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.2.0/go.mod h1:qt09Ya8vawLte6SNmTgCsAVtYtaKzEcn8ATUoHMkEqE=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
github.com/stretchr/objx v0.5.3 h1:jmXUvGomnU1o3W/V5h2VEradbpJDwGrzugQQvL0POH4=
github.com/stretchr/objx v0.5.3/go.mod h1:rDQraq+vQZU7Fde9LOZLr8Tax6zZvy4kuNKF+QYS+U0=
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
github.com/testcontainers/testcontainers-go v0.42.0 h1:He3IhTzTZOygSXLJPMX7n44XtK+qhjat1nI9cneBbUY=
github.com/testcontainers/testcontainers-go v0.42.0/go.mod h1:vZjdY1YmUA1qEForxOIOazfsrdyORJAbhi0bp8plN30=
github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0 h1:GCbb1ndrF7OTDiIvxXyItaDab4qkzTFJ48LKFdM7EIo=
github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0/go.mod h1:IRPBaI8jXdrNfD0e4Zm7Fbcgaz5shKxOQv4axiL09xs=
github.com/testcontainers/testcontainers-go/modules/redis v0.42.0 h1:id/6LH8ZeDrtAUVSuNvZUAJ1kVpb82y1pr9yweAWsRg=
github.com/tklauser/go-sysconf v0.3.16 h1:frioLaCQSsF5Cy1jgRBrzr6t502KIIwQ0MArYICU0nA=
github.com/tklauser/go-sysconf v0.3.16/go.mod h1:/qNL9xxDhc7tx3HSRsLWNnuzbVfh3e7gh/BmM179nYI=
github.com/tklauser/numcpus v0.11.0 h1:nSTwhKH5e1dMNsCdVBukSZrURJRoHbSEQjdEbY+9RXw=
github.com/tklauser/numcpus v0.11.0/go.mod h1:z+LwcLq54uWZTX0u/bGobaV34u6V7KNlTZejzM6/3MQ=
github.com/ugorji/go/codec v1.3.1 h1:waO7eEiFDwidsBN6agj1vJQ4AG7lh2yqXyOXqhgQuyY=
github.com/ugorji/go/codec v1.3.1/go.mod h1:pRBVtBSKl77K30Bv8R2P+cLSGaTtex6fsA2Wjqmfxj4=
github.com/woodsbury/decimal128 v1.3.0 h1:8pffMNWIlC0O5vbyHWFZAt5yWvWcrHA+3ovIIjVWss0=
github.com/woodsbury/decimal128 v1.3.0/go.mod h1:C5UTmyTjW3JftjUFzOVhC20BEQa2a4ZKOB5I6Zjb+ds=
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
github.com/yuin/gopher-lua v1.1.1 h1:kYKnWBjvbNP4XLT3+bPEwAXJx262OhaHDWDVOPjL46M=
github.com/yuin/gopher-lua v1.1.1/go.mod h1:GBR0iDaNXjAgGg9zfCvksxSRnQx76gclCIb7kdAd1Pw=
github.com/yusufpapurcu/wmi v1.2.4 h1:zFUKzehAFReQwLys1b/iSMl+JQGSCSjtVqQn9bBrPo0=
github.com/yusufpapurcu/wmi v1.2.4/go.mod h1:SBZ9tNy3G9/m5Oi98Zks0QjeHVDvuK0qfxQmPyzfmi0=
github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
github.com/zenazn/goji v0.9.0/go.mod h1:7S9M489iMyHBNxwZnk9/EHS098H4/F6TATF2mIxtB1Q=
go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ64=
go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0 h1:CqXxU8VOmDefoh0+ztfGaymYbhdB/tT3zs79QaZTNGY=
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0/go.mod h1:BuhAPThV8PBHBvg8ZzZ/Ok3idOdhWIodywz2xEcRbJo=
go.opentelemetry.io/otel v1.43.0 h1:mYIM03dnh5zfN7HautFE4ieIig9amkNANT+xcVxAj9I=
go.opentelemetry.io/otel v1.43.0/go.mod h1:JuG+u74mvjvcm8vj8pI5XiHy1zDeoCS2LB1spIq7Ay0=
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0 h1:8UQVDcZxOJLtX6gxtDt3vY2WTgvZqMQRzjsqiIHQdkc=
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0/go.mod h1:2lmweYCiHYpEjQ/lSJBYhj9jP1zvCvQW4BqL9dnT7FQ=
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0 h1:w1K+pCJoPpQifuVpsKamUdn9U0zM3xUziVOqsGksUrY=
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0/go.mod h1:HBy4BjzgVE8139ieRI75oXm3EcDN+6GhD88JT1Kjvxg=
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 h1:88Y4s2C8oTui1LGM6bTWkw0ICGcOLCAI5l6zsD1j20k=
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0/go.mod h1:Vl1/iaggsuRlrHf/hfPJPvVag77kKyvrLeD10kpMl+A=
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0 h1:RAE+JPfvEmvy+0LzyUA25/SGawPwIUbZ6u0Wug54sLc=
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0/go.mod h1:AGmbycVGEsRx9mXMZ75CsOyhSP6MFIcj/6dnG+vhVjk=
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0 h1:3iZJKlCZufyRzPzlQhUIWVmfltrXuGyfjREgGP3UUjc=
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0/go.mod h1:/G+nUPfhq2e+qiXMGxMwumDrP5jtzU+mWN7/sjT2rak=
go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0 h1:TC+BewnDpeiAmcscXbGMfxkO+mwYUwE/VySwvw88PfA=
go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0/go.mod h1:J/ZyF4vfPwsSr9xJSPyQ4LqtcTPULFR64KwTikGLe+A=
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0 h1:mS47AX77OtFfKG4vtp+84kuGSFZHTyxtXIN269vChY0=
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0/go.mod h1:PJnsC41lAGncJlPUniSwM81gc80GkgWJWr3cu2nKEtU=
go.opentelemetry.io/otel/metric v1.43.0 h1:d7638QeInOnuwOONPp4JAOGfbCEpYb+K6DVWvdxGzgM=
go.opentelemetry.io/otel/metric v1.43.0/go.mod h1:RDnPtIxvqlgO8GRW18W6Z/4P462ldprJtfxHxyKd2PY=
go.opentelemetry.io/otel/sdk v1.43.0 h1:pi5mE86i5rTeLXqoF/hhiBtUNcrAGHLKQdhg4h4V9Dg=
go.opentelemetry.io/otel/sdk v1.43.0/go.mod h1:P+IkVU3iWukmiit/Yf9AWvpyRDlUeBaRg6Y+C58QHzg=
go.opentelemetry.io/otel/sdk/metric v1.43.0 h1:S88dyqXjJkuBNLeMcVPRFXpRw2fuwdvfCGLEo89fDkw=
go.opentelemetry.io/otel/sdk/metric v1.43.0/go.mod h1:C/RJtwSEJ5hzTiUz5pXF1kILHStzb9zFlIEe85bhj6A=
go.opentelemetry.io/otel/trace v1.43.0 h1:BkNrHpup+4k4w+ZZ86CZoHHEkohws8AY+WTX09nk+3A=
go.opentelemetry.io/otel/trace v1.43.0/go.mod h1:/QJhyVBUUswCphDVxq+8mld+AvhXZLhe+8WVFxiFff0=
go.opentelemetry.io/proto/otlp v1.10.0 h1:IQRWgT5srOCYfiWnpqUYz9CVmbO8bFmKcwYxpuCSL2g=
go.opentelemetry.io/proto/otlp v1.10.0/go.mod h1:/CV4QoCR/S9yaPj8utp3lvQPoqMtxXdzn7ozvvozVqk=
go.uber.org/atomic v1.3.2/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
go.uber.org/atomic v1.5.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
go.uber.org/atomic v1.6.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
go.uber.org/atomic v1.11.0 h1:ZvwS0R+56ePWxUNi+Atn9dWONBPp/AUETXlHW0DxSjE=
go.uber.org/atomic v1.11.0/go.mod h1:LUxbIzbOniOlMKjJjyPfpl4v+PKK2cNJn91OQbhoJI0=
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
go.uber.org/mock v0.6.0 h1:hyF9dfmbgIX5EfOdasqLsWD6xqpNZlXblLB/Dbnwv3Y=
go.uber.org/mock v0.6.0/go.mod h1:KiVJ4BqZJaMj4svdfmHM0AUx4NJYO8ZNpPnZn1Z+BBU=
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
go.uber.org/multierr v1.3.0/go.mod h1:VgVr7evmIr6uPjLBxg28wmKNXyqE9akIJ5XnfpiKl+4=
go.uber.org/multierr v1.5.0/go.mod h1:FeouvMocqHpRaaGuG9EjoKcStLC43Zu/fmqdUMPcKYU=
go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0=
go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y=
go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
go.uber.org/zap v1.9.1/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
go.uber.org/zap v1.13.0/go.mod h1:zwrFLgMcdUuIBviXEYEH1YKNaOBnKXsx2IPda5bBwHM=
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/crypto v0.0.0-20190411191339-88737f569e3a/go.mod h1:WFFai1msRO1wXaEeE5yQxYXgSfI8pQAWXbQop6sCtWE=
golang.org/x/crypto v0.0.0-20190510104115-cbcb75029529/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20190820162420-60c769a6c586/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
golang.org/x/crypto v0.0.0-20201203163018-be400aefbc4c/go.mod h1:jdWPYTVW3xRLrWPugEBEK3UY2ZEsg3UU495nc5E+M+I=
golang.org/x/crypto v0.0.0-20210616213533-5ff15b29337e/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/crypto v0.0.0-20210711020723-a769d52b0f97/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/crypto v0.19.0/go.mod h1:Iy9bg/ha4yyC70EfRS8jz+B6ybOBKMaSxLj6P6oBDfU=
golang.org/x/crypto v0.20.0/go.mod h1:Xwo95rrVNIoSMx9wa1JroENMToLWn3RNVrTBpLHgZPQ=
golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI=
golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q=
golang.org/x/lint v0.0.0-20190930215403-16217165b5de/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
golang.org/x/mod v0.0.0-20190513183733-4bf6d317e70e/go.mod h1:mXi4GBBbnImb6dmsKGUJ2LatrhH/nqhxcFungHvyanc=
golang.org/x/mod v0.1.1-0.20191105210325-c90efee705ee/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg=
golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=
golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=
golang.org/x/mod v0.35.0 h1:Ww1D637e6Pg+Zb2KrWfHQUnH2dQRLBQyAtpr/haaJeM=
golang.org/x/mod v0.35.0/go.mod h1:+GwiRhIInF8wPm+4AoT6L0FA1QWAad3OMdTRx4tFYlU=
golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20190813141303-74dc4d7220e7/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c=
golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=
golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
golang.org/x/net v0.21.0/go.mod h1:bIjVDfnllIU7BJ2DNgfnXvpSvtn8VRwhlsaeUTyUS44=
golang.org/x/net v0.53.0 h1:d+qAbo5L0orcWAr0a9JweQpjXF19LMXJE8Ey7hwOdUA=
golang.org/x/net v0.53.0/go.mod h1:JvMuJH7rrdiCfbeHoo3fCQU24Lf5JJwT9W3sJFulfgs=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
golang.org/x/sys v0.0.0-20180905080454-ebe1bf3edb33/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190222072716-a9d3bda3a223/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20190403152447-81d4e9dc473e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190422165155-953cdadca894/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190813064441-fde4db37ae7a/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190916202348-b4ddaad3f8a3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191026070338-33540a1f6037/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200116001909-b77594299b42/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201204225414-ed752295db88/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210616094352-59db8d763f22/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI=
golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
golang.org/x/term v0.0.0-20201117132131-f5c789dd3221/go.mod h1:Nr5EML6q2oocZ2LXRh80K7BxOlk5/8JxuGnuhpl+muw=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=
golang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo=
golang.org/x/term v0.17.0/go.mod h1:lLRBjIVuehSbZlaOtGMbcMncT+aqLLLmKrsjNrUguwk=
golang.org/x/term v0.42.0 h1:UiKe+zDFmJobeJ5ggPwOshJIVt6/Ft0rcfrXZDLWAWY=
golang.org/x/term v0.42.0/go.mod h1:Dq/D+snpsbazcBG5+F9Q1n2rXV8Ma+71xEjTRufARgY=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.4/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=
golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=
golang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8=
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg=
golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164=
golang.org/x/time v0.15.0 h1:bbrp8t3bGUeFOx08pvsMYRTCVSMk89u4tKbNOZbp88U=
golang.org/x/time v0.15.0/go.mod h1:Y4YMaQmXwGQZoFaVFk4YpCt4FLQMYKZe9oeV/f4MSno=
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
golang.org/x/tools v0.0.0-20190311212946-11955173bddd/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=
golang.org/x/tools v0.0.0-20190425163242-31fd60d6bfdc/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q=
golang.org/x/tools v0.0.0-20190621195816-6e04913cbbac/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
golang.org/x/tools v0.0.0-20190823170909-c4a336ef6a2f/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191029041327-9cc4af7d6b2c/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191029190741-b9c20aec41a5/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/tools v0.0.0-20200103221440-774c71fcf114/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28=
golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc=
golang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU=
golang.org/x/xerrors v0.0.0-20190410155217-1f06c39b4373/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20190513163551-3ee3066db522/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
gonum.org/v1/gonum v0.17.0 h1:VbpOemQlsSMrYmn7T2OUvQ4dqxQXU+ouZFQsZOx50z4=
gonum.org/v1/gonum v0.17.0/go.mod h1:El3tOrEuMpv2UdMrbNlKEh9vd86bmQ6vqIcDwxEOc1E=
google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9 h1:VPWxll4HlMw1Vs/qXtN7BvhZqsS9cdAittCNvVENElA=
google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9/go.mod h1:7QBABkRtR8z+TEnmXTqIqwJLlzrZKVfAUm7tY3yGv0M=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529 h1:XF8+t6QQiS0o9ArVan/HW8Q7cycNPGsJf6GA2nXxYAg=
google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529/go.mod h1:4Hqkh8ycfw05ld/3BWL7rJOSfebL2Q+DVDeRgYgxUU8=
google.golang.org/grpc v1.80.0 h1:Xr6m2WmWZLETvUNvIUmeD5OAagMw3FiKmMlTdViWsHM=
google.golang.org/grpc v1.80.0/go.mod h1:ho/dLnxwi3EDJA4Zghp7k2Ec1+c2jqup0bFkw07bwF4=
google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE=
google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/inconshreveable/log15.v2 v2.0.0-20180818164646-67afb5ed74ec/go.mod h1:aPpfJ7XW+gOuirDoZ8gHhLh3kZ1B08FtV2bbmy7Jv3s=
gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gotest.tools/v3 v3.5.2 h1:7koQfIKdy+I8UTetycgUqXWSDwpgv193Ka+qRsmBY8Q=
gotest.tools/v3 v3.5.2/go.mod h1:LtdLGcnqToBH83WByAAi/wiwSFCArdFIUV/xxN4pcjA=
honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt0JzvZhAg=
modernc.org/libc v1.72.1 h1:db1xwJ6u1kE3KHTFTTbe2GCrczHPKzlURP0aDC4NGD0=
modernc.org/libc v1.72.1/go.mod h1:HRMiC/PhPGLIPM7GzAFCbI+oSgE3dhZ8FWftmRrHVlY=
modernc.org/mathutil v1.7.1 h1:GCZVGXdaN8gTqB1Mf/usp1Y/hSqgI2vAGGP4jZMCxOU=
modernc.org/mathutil v1.7.1/go.mod h1:4p5IwJITfppl0G4sUEDtCr4DthTaT47/N3aT6MhfgJg=
modernc.org/memory v1.11.0 h1:o4QC8aMQzmcwCK3t3Ux/ZHmwFPzE6hf2Y5LbkRs+hbI=
modernc.org/memory v1.11.0/go.mod h1:/JP4VbVC+K5sU2wZi9bHoq2MAkCnrt2r98UGeSK7Mjw=
modernc.org/sqlite v1.49.1 h1:dYGHTKcX1sJ+EQDnUzvz4TJ5GbuvhNJa8Fg6ElGx73U=
modernc.org/sqlite v1.49.1/go.mod h1:m0w8xhwYUVY3H6pSDwc3gkJ/irZT/0YEXwBlhaxQEew=
pgregory.net/rapid v1.2.0 h1:keKAYRcjm+e1F0oAuU5F5+YPAWcyxNNRK2wud503Gnk=
pgregory.net/rapid v1.2.0/go.mod h1:PY5XlDGj0+V1FCq0o192FdRhpKHGTRIWBgqjDBTrq04=
+236
View File
@@ -0,0 +1,236 @@
package harness
import (
"context"
"crypto/rand"
"encoding/hex"
"errors"
"fmt"
"os"
"os/exec"
"path/filepath"
"runtime"
"strings"
"sync"
"testing"
"time"
cerrdefs "github.com/containerd/errdefs"
"github.com/docker/docker/api/types/network"
dockerclient "github.com/docker/docker/client"
)
// Engine image tags used by the integration suite. `EngineImageRef` is
// the image we actually build from `galaxy/game/Dockerfile`;
// `PatchedEngineImageRef` is the same image content tagged at a higher
// semver patch so the patch lifecycle test exercises the
// `semver_patch_only` validation against a real image. Keeping both at
// the same digest avoids a redundant build.
const (
EngineImageRef = "galaxy/game:1.0.0-rtm-it"
PatchedEngineImageRef = "galaxy/game:1.0.1-rtm-it"
dockerNetworkPrefix = "rtmanager-it-"
dockerPingTimeout = 5 * time.Second
dockerNetworkTimeout = 30 * time.Second
imageBuildTimeout = 10 * time.Minute
)
// DockerEnv carries the per-package Docker client plus the workspace
// root used by image builds. The client is opened lazily on the first
// EnsureDocker call and closed by ShutdownDocker at TestMain exit.
type DockerEnv struct {
client *dockerclient.Client
workspaceRoot string
}
// Client returns the harness-owned Docker SDK client. Tests use it
// directly for "external actions" the harness does not wrap (e.g.,
// removing a running container behind RTM's back in `health_test`).
func (env *DockerEnv) Client() *dockerclient.Client { return env.client }
// WorkspaceRoot returns the absolute path of the galaxy/ workspace
// root. It is exported so the runtime helper can resolve the host
// game-state root relative to it if a test needs a deterministic
// location, though the default places state under `t.ArtifactDir()`.
func (env *DockerEnv) WorkspaceRoot() string { return env.workspaceRoot }
var (
dockerOnce sync.Once
dockerEnv *DockerEnv
dockerErr error
imageOnce sync.Once
imageErr error
)
// EnsureDocker opens the shared Docker SDK client and verifies the
// daemon is reachable. When the daemon is unavailable the helper calls
// `t.Skip` so suites stay green on hosts without `/var/run/docker.sock`
// or `DOCKER_HOST`.
func EnsureDocker(t testing.TB) *DockerEnv {
t.Helper()
dockerOnce.Do(func() {
dockerEnv, dockerErr = openDocker()
})
if dockerErr != nil {
t.Skipf("rtmanager integration: docker daemon unavailable: %v", dockerErr)
}
return dockerEnv
}
// EnsureEngineImage builds the `galaxy/game` engine image from the
// workspace root once per package run via `sync.Once`, then tags the
// resulting image at both `EngineImageRef` and `PatchedEngineImageRef`
// so the patch lifecycle has a second semver-valid tag to point at.
// Subsequent calls re-use the cached image. Any test that asks for the
// engine image must invoke this helper first; it is intentionally
// separate from `EnsureDocker` so suites that only need the daemon
// (e.g., a future "Docker network missing" negative test) do not pay
// the build cost.
func EnsureEngineImage(t testing.TB) string {
t.Helper()
env := EnsureDocker(t)
imageOnce.Do(func() {
imageErr = buildAndTagEngineImage(env)
})
if imageErr != nil {
t.Skipf("rtmanager integration: build galaxy/game image: %v", imageErr)
}
return EngineImageRef
}
// EnsureNetwork creates a uniquely-named Docker bridge network for the
// caller's test and registers cleanup. Each test gets its own network
// so concurrent scenarios cannot collide on the per-game DNS hostname.
func EnsureNetwork(t testing.TB) string {
t.Helper()
env := EnsureDocker(t)
name := dockerNetworkPrefix + uniqueSuffix(t)
createCtx, cancel := context.WithTimeout(context.Background(), dockerNetworkTimeout)
defer cancel()
if _, err := env.client.NetworkCreate(createCtx, name, network.CreateOptions{Driver: "bridge"}); err != nil {
t.Fatalf("rtmanager integration: create docker network %q: %v", name, err)
}
t.Cleanup(func() {
removeCtx, removeCancel := context.WithTimeout(context.Background(), dockerNetworkTimeout)
defer removeCancel()
if err := env.client.NetworkRemove(removeCtx, name); err != nil && !cerrdefs.IsNotFound(err) {
t.Logf("rtmanager integration: remove docker network %q: %v", name, err)
}
})
return name
}
// ShutdownDocker closes the shared Docker SDK client. `TestMain`
// invokes it after `m.Run`. The harness deliberately leaves the engine
// image in the local Docker cache so the next package run benefits
// from the layer cache; operators can `docker image rm` the
// `*-rtm-it` tags by hand if a stale image gets in the way.
func ShutdownDocker() {
if dockerEnv == nil {
return
}
if dockerEnv.client != nil {
_ = dockerEnv.client.Close()
}
dockerEnv = nil
}
// uniqueSuffix returns 8 hex characters of randomness suitable for a
// per-test resource name. The same helper is used in
// `internal/adapters/docker/smoke_test.go`; we duplicate it here
// because helpers defined in `_test.go` files cannot be imported from
// another package.
func uniqueSuffix(t testing.TB) string {
t.Helper()
buf := make([]byte, 4)
if _, err := rand.Read(buf); err != nil {
t.Fatalf("rtmanager integration: read random suffix: %v", err)
}
return hex.EncodeToString(buf)
}
func openDocker() (*DockerEnv, error) {
if os.Getenv("DOCKER_HOST") == "" {
if _, err := os.Stat("/var/run/docker.sock"); err != nil {
return nil, fmt.Errorf("set DOCKER_HOST or expose /var/run/docker.sock: %w", err)
}
}
client, err := dockerclient.NewClientWithOpts(
dockerclient.FromEnv,
dockerclient.WithAPIVersionNegotiation(),
)
if err != nil {
return nil, fmt.Errorf("new docker client: %w", err)
}
pingCtx, cancel := context.WithTimeout(context.Background(), dockerPingTimeout)
defer cancel()
if _, err := client.Ping(pingCtx); err != nil {
_ = client.Close()
return nil, fmt.Errorf("ping docker daemon: %w", err)
}
root, err := workspaceRoot()
if err != nil {
_ = client.Close()
return nil, fmt.Errorf("resolve workspace root: %w", err)
}
return &DockerEnv{
client: client,
workspaceRoot: root,
}, nil
}
// buildAndTagEngineImage invokes `docker build` against the workspace
// root context to materialise the `galaxy/game` image, then tags the
// resulting image at the patch tag. Shelling out to the CLI keeps the
// implementation tiny — using the SDK would require streaming a tar
// of the workspace root, which is heavy and duplicates what the CLI
// already optimises. The workspace-root build context is required by
// `galaxy/game` (see `galaxy/game/README.md` §Build).
func buildAndTagEngineImage(env *DockerEnv) error {
if env == nil {
return errors.New("nil docker env")
}
ctx, cancel := context.WithTimeout(context.Background(), imageBuildTimeout)
defer cancel()
dockerfilePath := filepath.Join("game", "Dockerfile")
cmd := exec.CommandContext(ctx, "docker", "build",
"-f", dockerfilePath,
"-t", EngineImageRef,
".",
)
cmd.Dir = env.workspaceRoot
cmd.Env = append(os.Environ(), "DOCKER_BUILDKIT=1")
output, err := cmd.CombinedOutput()
if err != nil {
return fmt.Errorf("docker build (-f %s) in %s: %w; output:\n%s",
dockerfilePath, env.workspaceRoot, err, strings.TrimSpace(string(output)))
}
if err := env.client.ImageTag(ctx, EngineImageRef, PatchedEngineImageRef); err != nil {
return fmt.Errorf("tag %s as %s: %w", EngineImageRef, PatchedEngineImageRef, err)
}
return nil
}
// workspaceRoot resolves the absolute path of the galaxy/ workspace
// root by anchoring on this file's location. The harness lives at
// `galaxy/rtmanager/integration/harness/docker.go`, so the workspace
// root is three directories up. Mirrors the `cmd/jetgen` strategy.
func workspaceRoot() (string, error) {
_, file, _, ok := runtime.Caller(0)
if !ok {
return "", errors.New("resolve runtime caller for workspace root")
}
dir := filepath.Dir(file)
// dir = .../galaxy/rtmanager/integration/harness
root := filepath.Clean(filepath.Join(dir, "..", "..", ".."))
return root, nil
}
+59
View File
@@ -0,0 +1,59 @@
package harness
import (
"encoding/json"
"net/http"
"net/http/httptest"
"strings"
"testing"
)
// LobbyStub answers the single Lobby internal request the start
// service performs ([`internal/adapters/lobbyclient`]). The start
// service treats this response as ancillary diagnostics — the start
// envelope already carries `image_ref` — so the stub returns a
// deterministic 200 OK and lets the runtime ignore the payload.
//
// The stub exists mainly to prove that the runtime configuration
// still treats the Lobby URL as required (so the ancillary fetch
// cannot silently regress to a no-op); the response body itself is
// unused by the integration assertions.
type LobbyStub struct {
Server *httptest.Server
}
// NewLobbyStub returns a started httptest.Server and registers its
// Close with `t.Cleanup`, so the stub follows the same lifecycle as
// the rest of the per-test wiring.
func NewLobbyStub(t testing.TB) *LobbyStub {
t.Helper()
mux := http.NewServeMux()
mux.HandleFunc("GET /api/v1/internal/games/{game_id}", func(w http.ResponseWriter, r *http.Request) {
gameID := strings.TrimSpace(r.PathValue("game_id"))
if gameID == "" {
writeStubError(w, http.StatusBadRequest, "invalid_request", "game_id is required")
return
}
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]string{
"game_id": gameID,
"status": "running",
"target_engine_version": "1.0.0",
})
})
server := httptest.NewServer(mux)
t.Cleanup(server.Close)
return &LobbyStub{Server: server}
}
// URL returns the base URL of the running stub.
func (stub *LobbyStub) URL() string { return stub.Server.URL }
func writeStubError(w http.ResponseWriter, status int, code, message string) {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
w.WriteHeader(status)
_ = json.NewEncoder(w).Encode(map[string]any{
"error": map[string]string{"code": code, "message": message},
})
}
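// For reference, a successful stub lookup returns a body shaped like
// this (game id echoed from the path, the rest fixed stub defaults):
//
//  {"game_id":"<game_id>","status":"running","target_engine_version":"1.0.0"}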
+224
View File
@@ -0,0 +1,224 @@
// Package harness exposes the testcontainers / Docker / image-build
// scaffolding shared by the Runtime Manager service-local integration
// suite under [`galaxy/rtmanager/integration`](..).
//
// Only `_test.go` files (and the harness itself) reference this
// package; production code paths in `cmd/rtmanager` never import it.
// The package therefore stays out of the production binary's import
// graph, identical to the in-package `pgtest` and `integration/internal/harness`
// patterns it mirrors.
package harness
import (
"context"
"database/sql"
"net/url"
"os"
"sync"
"testing"
"time"
"galaxy/postgres"
"galaxy/rtmanager/internal/adapters/postgres/migrations"
testcontainers "github.com/testcontainers/testcontainers-go"
tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
"github.com/testcontainers/testcontainers-go/wait"
)
const (
pgImage = "postgres:16-alpine"
pgSuperUser = "galaxy"
pgSuperPassword = "galaxy"
pgSuperDatabase = "galaxy_rtmanager_it"
pgServiceRole = "rtmanagerservice"
pgServicePassword = "rtmanagerservice"
pgServiceSchema = "rtmanager"
pgStartupTimeout = 90 * time.Second
// pgOperationTimeout bounds the per-statement deadline used by every
// pool the harness opens. Short enough to surface a runaway
// integration test promptly, long enough to absorb laptop-grade I/O.
pgOperationTimeout = 10 * time.Second
)
// PostgresEnv carries the per-package PostgreSQL fixture. The container
// is started lazily on the first EnsurePostgres call and torn down by
// ShutdownPostgres at TestMain exit.
type PostgresEnv struct {
container *tcpostgres.PostgresContainer
pool *sql.DB
scopedDSN string
}
// Pool returns the harness-owned `*sql.DB` scoped to the rtmanager
// schema. Tests use it to read durable state directly through the
// existing store adapters.
func (env *PostgresEnv) Pool() *sql.DB { return env.pool }
// DSN returns the rtmanager-role-scoped DSN suitable for
// `RTMANAGER_POSTGRES_PRIMARY_DSN`. Both this DSN and Pool address the
// same database; the pool is reused across tests, while the runtime
// under test opens its own pool through this DSN.
func (env *PostgresEnv) DSN() string { return env.scopedDSN }
var (
pgOnce sync.Once
pgEnv *PostgresEnv
pgErr error
)
// EnsurePostgres starts the per-package PostgreSQL container on first
// invocation and applies the embedded goose migrations. Subsequent
// invocations reuse the same container. When Docker is unavailable the
// helper calls `t.Skip` so the suite stays green on hosts without a
// daemon (mirrors the contract from `internal/adapters/postgres/internal/pgtest`).
func EnsurePostgres(t testing.TB) *PostgresEnv {
t.Helper()
pgOnce.Do(func() {
pgEnv, pgErr = startPostgres()
})
if pgErr != nil {
t.Skipf("rtmanager integration: postgres container start failed (Docker unavailable?): %v", pgErr)
}
return pgEnv
}
// TruncatePostgres wipes every Runtime Manager table through the
// shared pool, leaving the schema and indexes intact. Tests call this
// from their setup so each scenario starts from an empty state.
func TruncatePostgres(t testing.TB) {
t.Helper()
env := EnsurePostgres(t)
const stmt = `TRUNCATE TABLE runtime_records, operation_log, health_snapshots RESTART IDENTITY CASCADE`
if _, err := env.pool.ExecContext(context.Background(), stmt); err != nil {
t.Fatalf("truncate rtmanager tables: %v", err)
}
}
// ShutdownPostgres terminates the shared container and closes the pool.
// `TestMain` invokes it after `m.Run` so the container is released even
// if individual tests panic.
func ShutdownPostgres() {
if pgEnv == nil {
return
}
if pgEnv.pool != nil {
_ = pgEnv.pool.Close()
}
if pgEnv.container != nil {
_ = testcontainers.TerminateContainer(pgEnv.container)
}
pgEnv = nil
}
// RunMain is a convenience helper for the integration package
// `TestMain`: it runs the suite, captures the exit code, tears every
// shared container down, and exits. Wiring it through one helper keeps
// `TestMain` to two lines and centralises ordering.
func RunMain(m *testing.M) {
code := m.Run()
ShutdownRedis()
ShutdownPostgres()
ShutdownDocker()
os.Exit(code)
}
func startPostgres() (*PostgresEnv, error) {
ctx := context.Background()
container, err := tcpostgres.Run(ctx, pgImage,
tcpostgres.WithDatabase(pgSuperDatabase),
tcpostgres.WithUsername(pgSuperUser),
tcpostgres.WithPassword(pgSuperPassword),
testcontainers.WithWaitStrategy(
wait.ForLog("database system is ready to accept connections").
WithOccurrence(2).
WithStartupTimeout(pgStartupTimeout),
),
)
if err != nil {
return nil, err
}
baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
scopedDSN, err := scopedDSNForRole(baseDSN)
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = scopedDSN
cfg.OperationTimeout = pgOperationTimeout
pool, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := postgres.Ping(ctx, pool, pgOperationTimeout); err != nil {
_ = pool.Close()
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := postgres.RunMigrations(ctx, pool, migrations.FS(), "."); err != nil {
_ = pool.Close()
_ = testcontainers.TerminateContainer(container)
return nil, err
}
return &PostgresEnv{
container: container,
pool: pool,
scopedDSN: scopedDSN,
}, nil
}
func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = baseDSN
cfg.OperationTimeout = pgOperationTimeout
db, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
return err
}
defer func() { _ = db.Close() }()
statements := []string{
`DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'rtmanagerservice') THEN
CREATE ROLE rtmanagerservice LOGIN PASSWORD 'rtmanagerservice';
END IF;
END $$;`,
`CREATE SCHEMA IF NOT EXISTS rtmanager AUTHORIZATION rtmanagerservice;`,
`GRANT USAGE ON SCHEMA rtmanager TO rtmanagerservice;`,
}
for _, statement := range statements {
if _, err := db.ExecContext(ctx, statement); err != nil {
return err
}
}
return nil
}
func scopedDSNForRole(baseDSN string) (string, error) {
parsed, err := url.Parse(baseDSN)
if err != nil {
return "", err
}
values := url.Values{}
values.Set("search_path", pgServiceSchema)
values.Set("sslmode", "disable")
scoped := url.URL{
Scheme: parsed.Scheme,
User: url.UserPassword(pgServiceRole, pgServicePassword),
Host: parsed.Host,
Path: parsed.Path,
RawQuery: values.Encode(),
}
return scoped.String(), nil
}
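// For reference, the scoped DSN produced above is shaped like this
// (host and port come from the container's mapped endpoint and vary
// per run; note url.Values.Encode sorts the query keys):
//
//  postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:49154/galaxy_rtmanager_it?search_path=rtmanager&sslmode=disable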
+102
View File
@@ -0,0 +1,102 @@
package harness
import (
"context"
"sync"
"testing"
"github.com/redis/go-redis/v9"
testcontainers "github.com/testcontainers/testcontainers-go"
rediscontainer "github.com/testcontainers/testcontainers-go/modules/redis"
)
const redisImage = "redis:7"
// RedisEnv carries the per-package Redis fixture. The container is
// started lazily on the first EnsureRedis call and torn down by
// ShutdownRedis at TestMain exit. Both stream consumers and the
// per-game lease store hit this real Redis (miniredis would suffice
// for streams alone, but the lease semantics and eviction-by-TTL we
// rely on in `health_test` are easier to verify against a real
// daemon).
type RedisEnv struct {
container *rediscontainer.RedisContainer
addr string
}
// Addr returns the externally reachable host:port of the Redis
// container. Both the runtime under test and the harness-owned client
// connect through the same endpoint.
func (env *RedisEnv) Addr() string { return env.addr }
// NewClient opens a fresh `*redis.Client` against the harness Redis.
// Tests close their client through `t.Cleanup`; the harness keeps no
// shared client to avoid cross-test connection-pool surprises.
func (env *RedisEnv) NewClient(t testing.TB) *redis.Client {
t.Helper()
client := redis.NewClient(&redis.Options{Addr: env.addr})
t.Cleanup(func() { _ = client.Close() })
return client
}
var (
redisOnce sync.Once
redisEnv *RedisEnv
redisErr error
)
// EnsureRedis starts the per-package Redis container on first
// invocation and returns it. When Docker is unavailable the helper
// calls `t.Skip` so the suite stays green on hosts without a daemon.
func EnsureRedis(t testing.TB) *RedisEnv {
t.Helper()
redisOnce.Do(func() {
redisEnv, redisErr = startRedis()
})
if redisErr != nil {
t.Skipf("rtmanager integration: redis container start failed (Docker unavailable?): %v", redisErr)
}
return redisEnv
}
// FlushRedis drops every key on the harness Redis. Tests call it from
// their setup so streams, offset records, and leases from previous
// scenarios do not leak.
func FlushRedis(t testing.TB) {
t.Helper()
env := EnsureRedis(t)
client := redis.NewClient(&redis.Options{Addr: env.addr})
defer func() { _ = client.Close() }()
if _, err := client.FlushAll(context.Background()).Result(); err != nil {
t.Fatalf("flush rtmanager redis: %v", err)
}
}
// ShutdownRedis terminates the shared container. `TestMain` invokes it
// after `m.Run`.
func ShutdownRedis() {
if redisEnv == nil {
return
}
if redisEnv.container != nil {
_ = testcontainers.TerminateContainer(redisEnv.container)
}
redisEnv = nil
}
func startRedis() (*RedisEnv, error) {
ctx := context.Background()
container, err := rediscontainer.Run(ctx, redisImage)
if err != nil {
return nil, err
}
addr, err := container.Endpoint(ctx, "")
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
return &RedisEnv{
container: container,
addr: addr,
}, nil
}
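// Sketch of the per-test Redis wiring (NewEnv performs the same
// sequence, so scenarios normally just read env.RedisClient; the test
// name is illustrative):
//
//  func TestRedisWiring(t *testing.T) {
//      env := EnsureRedis(t) // skips when Docker is unavailable
//      FlushRedis(t)         // empty keyspace per scenario
//      client := env.NewClient(t)
//      _, _ = client.XLen(context.Background(), "runtime:job_results").Result()
//  }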
+195
View File
@@ -0,0 +1,195 @@
package harness
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"net/url"
"strings"
"testing"
"time"
)
// defaultHTTPClient backs the runtime-readiness poll and the REST
// helpers below. A short timeout is enough — every internal endpoint
// runs against an in-process listener.
var defaultHTTPClient = &http.Client{Timeout: 5 * time.Second}
// newRequest is a thin shim over `http.NewRequestWithContext` so the
// readiness poll and the REST client share one constructor.
func newRequest(ctx context.Context, method, fullURL string, body io.Reader) (*http.Request, error) {
req, err := http.NewRequestWithContext(ctx, method, fullURL, body)
if err != nil {
return nil, err
}
if body != nil {
req.Header.Set("Content-Type", "application/json; charset=utf-8")
}
req.Header.Set("Accept", "application/json")
req.Header.Set("X-Galaxy-Caller", "admin")
return req, nil
}
// REST is a tiny client for the trusted internal HTTP surface RTM
// exposes to Game Master and Admin Service. It always identifies the
// caller as `admin` (the operation_log records `admin_rest`); tests
// that need GM semantics should add an option later. v1 keeps the
// helper minimal because the integration scenarios only need
// admin-driven flows.
type REST struct {
baseURL string
httpc *http.Client
}
// NewREST builds a REST client targeting env.InternalAddr.
func NewREST(env *Env) *REST {
return &REST{
baseURL: "http://" + env.InternalAddr,
httpc: defaultHTTPClient,
}
}
// Get issues GET path and returns the response body and status code.
func (r *REST) Get(t testing.TB, path string) ([]byte, int) {
t.Helper()
return r.do(t, http.MethodGet, path, nil)
}
// Post issues POST path with body (a Go value JSON-marshaled).
func (r *REST) Post(t testing.TB, path string, body any) ([]byte, int) {
t.Helper()
return r.do(t, http.MethodPost, path, body)
}
// Delete issues DELETE path with no body.
func (r *REST) Delete(t testing.TB, path string) ([]byte, int) {
t.Helper()
return r.do(t, http.MethodDelete, path, nil)
}
// GetRuntime fetches a runtime record by game id and returns the
// decoded payload plus the status code. On a non-200 response the
// zero-value record is returned.
func (r *REST) GetRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Get(t, fmt.Sprintf("/api/v1/internal/runtimes/%s", url.PathEscape(gameID)))
var resp RuntimeRecordResponse
if status == http.StatusOK {
if err := json.Unmarshal(body, &resp); err != nil {
t.Fatalf("decode get-runtime response: %v; body=%s", err, string(body))
}
}
return resp, status
}
// StartRuntime invokes the start endpoint with imageRef.
func (r *REST) StartRuntime(t testing.TB, gameID, imageRef string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Post(t,
fmt.Sprintf("/api/v1/internal/runtimes/%s/start", url.PathEscape(gameID)),
map[string]string{"image_ref": imageRef},
)
return decodeRecord(t, body, status, "start")
}
// StopRuntime invokes the stop endpoint with reason.
func (r *REST) StopRuntime(t testing.TB, gameID, reason string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Post(t,
fmt.Sprintf("/api/v1/internal/runtimes/%s/stop", url.PathEscape(gameID)),
map[string]string{"reason": reason},
)
return decodeRecord(t, body, status, "stop")
}
// RestartRuntime invokes the restart endpoint.
func (r *REST) RestartRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Post(t,
fmt.Sprintf("/api/v1/internal/runtimes/%s/restart", url.PathEscape(gameID)),
struct{}{},
)
return decodeRecord(t, body, status, "restart")
}
// PatchRuntime invokes the patch endpoint with imageRef.
func (r *REST) PatchRuntime(t testing.TB, gameID, imageRef string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Post(t,
fmt.Sprintf("/api/v1/internal/runtimes/%s/patch", url.PathEscape(gameID)),
map[string]string{"image_ref": imageRef},
)
return decodeRecord(t, body, status, "patch")
}
// CleanupRuntime invokes the DELETE container endpoint.
func (r *REST) CleanupRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
t.Helper()
body, status := r.Delete(t,
fmt.Sprintf("/api/v1/internal/runtimes/%s/container", url.PathEscape(gameID)),
)
return decodeRecord(t, body, status, "cleanup")
}
// RuntimeRecordResponse mirrors the OpenAPI RuntimeRecord schema. Only
// the fields integration scenarios assert against live here; anything
// else the listener encodes is ignored on decode.
type RuntimeRecordResponse struct {
GameID string `json:"game_id"`
Status string `json:"status"`
CurrentContainerID *string `json:"current_container_id"`
CurrentImageRef *string `json:"current_image_ref"`
EngineEndpoint *string `json:"engine_endpoint"`
StatePath string `json:"state_path"`
DockerNetwork string `json:"docker_network"`
StartedAt *string `json:"started_at"`
StoppedAt *string `json:"stopped_at"`
RemovedAt *string `json:"removed_at"`
LastOpAt string `json:"last_op_at"`
CreatedAt string `json:"created_at"`
}
func (r *REST) do(t testing.TB, method, path string, body any) ([]byte, int) {
t.Helper()
var reader io.Reader
if body != nil {
raw, err := json.Marshal(body)
if err != nil {
t.Fatalf("marshal request body: %v", err)
}
reader = bytes.NewReader(raw)
}
req, err := newRequest(context.Background(), method, r.baseURL+path, reader)
if err != nil {
t.Fatalf("build %s %s request: %v", method, path, err)
}
resp, err := r.httpc.Do(req)
if err != nil {
t.Fatalf("execute %s %s: %v", method, path, err)
}
defer resp.Body.Close()
raw, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read %s %s response: %v", method, path, err)
}
return raw, resp.StatusCode
}
func decodeRecord(t testing.TB, body []byte, status int, op string) (RuntimeRecordResponse, int) {
t.Helper()
if status != http.StatusOK {
return RuntimeRecordResponse{}, status
}
var resp RuntimeRecordResponse
if err := json.Unmarshal(body, &resp); err != nil {
t.Fatalf("decode %s response: %v; body=%s", op, err, string(body))
}
return resp, status
}
// PathEscape trims surrounding whitespace and escapes the value for a
// URL path segment, so test files don't import `net/url` directly.
// Keeps the test source focused on scenarios.
func PathEscape(value string) string { return url.PathEscape(strings.TrimSpace(value)) }
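// Sketch of an admin-driven lifecycle through this client (`env`
// comes from NewEnv in runtime.go; the game id and stop reason are
// placeholders):
//
//  rest := NewREST(env)
//  record, status := rest.StartRuntime(t, gameID, env.EngineImageRef)
//  // assert status == 200, then exercise the rest of the lifecycle:
//  _, status = rest.StopRuntime(t, gameID, "maintenance")
//  _, status = rest.CleanupRuntime(t, gameID)
//  _ = record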
+398
View File
@@ -0,0 +1,398 @@
package harness
import (
"context"
"errors"
"io"
"log/slog"
"net/url"
"os"
"strconv"
"strings"
"sync"
"testing"
"time"
"galaxy/postgres"
"galaxy/redisconn"
"galaxy/rtmanager/internal/app"
"galaxy/rtmanager/internal/config"
"github.com/redis/go-redis/v9"
)
// Default stream keys used by the integration suite. They match
// the production defaults so the wire shapes asserted in `streams.go`
// are identical to what Game Lobby sees in `integration/lobbyrtm`.
const (
StartJobsStream = "runtime:start_jobs"
StopJobsStream = "runtime:stop_jobs"
JobResultsStream = "runtime:job_results"
HealthEventsStream = "runtime:health_events"
NotificationIntentsKey = "notification:intents"
gameStateRootSubdir = "game-state"
listenAddr = "127.0.0.1:0"
listenerWaitTimeout = 10 * time.Second
readyzPollInterval = 25 * time.Millisecond
cleanupShutdownTimeout = 30 * time.Second
)
// Env carries everything one integration scenario needs to drive the
// Runtime Manager process. Every field is exported so tests reach
// state without intermediate getters.
type Env struct {
// Cfg is the resolved Runtime Manager configuration handed to
// `app.NewRuntime`. Tests inspect it for stream key shapes,
// container defaults, and timeout knobs.
Cfg config.Config
// Runtime is the in-process Runtime Manager exposed for tests that
// need to peek at internal state (`runtime.InternalServer().Addr()`).
Runtime *app.Runtime
// Postgres holds the per-package PostgreSQL fixture.
Postgres *PostgresEnv
// Redis holds the per-package Redis fixture plus a fresh client the
// test owns.
Redis *RedisEnv
RedisClient *redis.Client
// Docker holds the per-package Docker daemon handle.
Docker *DockerEnv
// Lobby is the per-test stub HTTP server.
Lobby *LobbyStub
// Network is the unique Docker network name created for this test.
Network string
// EngineImageRef and PatchedImageRef are the two semver-compatible
// engine image tags the harness builds once per package. Patch
// scenarios point at the second tag.
EngineImageRef string
PatchedImageRef string
// GameStateRoot is the host filesystem path RTM writes per-game
// state directories under. It lives inside `t.ArtifactDir()` so
// failed scenarios leave the engine state behind for inspection.
GameStateRoot string
// InternalAddr is the bound address of RTM's internal HTTP listener
// (resolved after Run binds the port).
InternalAddr string
}
// EnvOptions carries per-test overrides of the harness defaults. Zero
// fields fall back to the defaults applied in NewEnv.
type EnvOptions struct {
// ReconcileInterval overrides the periodic reconciler interval.
// Default 500ms (so reconcile drift is observable inside a single
// scenario timeout).
ReconcileInterval time.Duration
// CleanupInterval overrides the container-cleanup interval.
CleanupInterval time.Duration
// InspectInterval overrides the Docker inspect worker interval.
InspectInterval time.Duration
// ProbeInterval / ProbeTimeout / ProbeFailuresThreshold override
// the active engine probe knobs.
ProbeInterval time.Duration
ProbeTimeout time.Duration
ProbeFailuresThreshold int
// GameLeaseTTL overrides the per-game Redis lease TTL.
GameLeaseTTL time.Duration
// StreamBlockTimeout overrides the consumer XREAD block window.
StreamBlockTimeout time.Duration
// LogToStderr makes the harness write the runtime's structured
// logs to stderr; the default discards them so test output stays
// focused on assertions.
LogToStderr bool
}
// NewEnv stands up a fresh Runtime Manager process for the calling
// test. It blocks until the internal HTTP listener is bound; tests can
// issue REST and stream requests immediately after the call returns.
//
// `t.Cleanup` runs in registration-reverse order: close the per-test
// redis client, stop and close the runtime, terminate the lobby stub,
// remove the docker network. Containers RTM created during the test
// are removed by the test's own cleanup paths or by the integration
// `health_test` external-action helpers.
func NewEnv(t *testing.T, opts EnvOptions) *Env {
t.Helper()
pg := EnsurePostgres(t)
rd := EnsureRedis(t)
dk := EnsureDocker(t)
imageRef := EnsureEngineImage(t)
TruncatePostgres(t)
FlushRedis(t)
network := EnsureNetwork(t)
lobby := NewLobbyStub(t)
stateRoot := stateRoot(t)
cfg := buildConfig(buildConfigInput{
PostgresDSN: pg.DSN(),
RedisAddr: rd.Addr(),
DockerHost: resolveDockerHost(),
Network: network,
LobbyURL: lobby.URL(),
GameStateRoot: stateRoot,
ReconcileInterval: pickDuration(opts.ReconcileInterval, 500*time.Millisecond),
CleanupInterval: pickDuration(opts.CleanupInterval, 500*time.Millisecond),
InspectInterval: pickDuration(opts.InspectInterval, 500*time.Millisecond),
ProbeInterval: pickDuration(opts.ProbeInterval, 500*time.Millisecond),
ProbeTimeout: pickDuration(opts.ProbeTimeout, time.Second),
ProbeFailures: pickInt(opts.ProbeFailuresThreshold, 2),
GameLeaseTTL: pickDuration(opts.GameLeaseTTL, 5*time.Second),
StreamBlockTimeout: pickDuration(opts.StreamBlockTimeout, 200*time.Millisecond),
})
logger := newLogger(opts.LogToStderr)
ctx, cancel := context.WithCancel(context.Background())
runtime, err := app.NewRuntime(ctx, cfg, logger)
if err != nil {
cancel()
t.Fatalf("rtmanager integration: new runtime: %v", err)
}
runDone := make(chan error, 1)
go func() {
runDone <- runtime.Run(ctx)
}()
internalAddr := waitForListener(t, runtime)
waitForReady(t, runtime, listenerWaitTimeout)
var cleanupOnce sync.Once
t.Cleanup(func() {
cleanupOnce.Do(func() {
cancel()
waitCtx, waitCancel := context.WithTimeout(context.Background(), cleanupShutdownTimeout)
defer waitCancel()
select {
case err := <-runDone:
if err != nil && !isCleanShutdownErr(err) {
t.Logf("rtmanager integration: runtime.Run returned: %v", err)
}
case <-waitCtx.Done():
t.Logf("rtmanager integration: runtime did not stop within %s", cleanupShutdownTimeout)
}
if err := runtime.Close(); err != nil {
t.Logf("rtmanager integration: runtime.Close: %v", err)
}
})
})
return &Env{
Cfg: cfg,
Runtime: runtime,
Postgres: pg,
Redis: rd,
RedisClient: rd.NewClient(t),
Docker: dk,
Lobby: lobby,
Network: network,
EngineImageRef: imageRef,
PatchedImageRef: PatchedEngineImageRef,
GameStateRoot: stateRoot,
InternalAddr: internalAddr,
}
}
type buildConfigInput struct {
PostgresDSN string
RedisAddr string
DockerHost string
Network string
LobbyURL string
GameStateRoot string
ReconcileInterval time.Duration
CleanupInterval time.Duration
InspectInterval time.Duration
ProbeInterval time.Duration
ProbeTimeout time.Duration
ProbeFailures int
GameLeaseTTL time.Duration
StreamBlockTimeout time.Duration
}
func buildConfig(in buildConfigInput) config.Config {
cfg := config.DefaultConfig()
cfg.InternalHTTP.Addr = listenAddr
cfg.Docker.Host = in.DockerHost
cfg.Docker.Network = in.Network
cfg.Docker.PullPolicy = config.ImagePullPolicyIfMissing
cfg.Postgres = config.PostgresConfig{
Conn: postgres.Config{
PrimaryDSN: in.PostgresDSN,
OperationTimeout: pgOperationTimeout,
MaxOpenConns: 5,
MaxIdleConns: 2,
ConnMaxLifetime: 30 * time.Minute,
},
}
cfg.Redis = config.RedisConfig{
Conn: redisconn.Config{
MasterAddr: in.RedisAddr,
// The harness Redis runs without requirepass; an empty password
// keeps the client from sending AUTH (assumes redisconn forwards
// Password verbatim to go-redis).
Password: "",
OperationTimeout: 2 * time.Second,
},
}
cfg.Streams.StartJobs = StartJobsStream
cfg.Streams.StopJobs = StopJobsStream
cfg.Streams.JobResults = JobResultsStream
cfg.Streams.HealthEvents = HealthEventsStream
cfg.Streams.NotificationIntents = NotificationIntentsKey
cfg.Streams.BlockTimeout = in.StreamBlockTimeout
cfg.Container.GameStateRoot = in.GameStateRoot
// Pin chown target to the current process uid/gid; the dev sandbox
// (and unprivileged dev machines) cannot chown to root.
cfg.Container.GameStateOwnerUID = os.Getuid()
cfg.Container.GameStateOwnerGID = os.Getgid()
cfg.Health.InspectInterval = in.InspectInterval
cfg.Health.ProbeInterval = in.ProbeInterval
cfg.Health.ProbeTimeout = in.ProbeTimeout
cfg.Health.ProbeFailuresThreshold = in.ProbeFailures
cfg.Cleanup.ReconcileInterval = in.ReconcileInterval
cfg.Cleanup.CleanupInterval = in.CleanupInterval
cfg.Coordination.GameLeaseTTL = in.GameLeaseTTL
cfg.Lobby = config.LobbyConfig{
BaseURL: in.LobbyURL,
Timeout: 2 * time.Second,
}
cfg.Telemetry.TracesExporter = "none"
cfg.Telemetry.MetricsExporter = "none"
return cfg
}
func newLogger(toStderr bool) *slog.Logger {
if toStderr {
return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
}
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
}
func stateRoot(t *testing.T) string {
t.Helper()
dir := t.ArtifactDir()
root := filepath.Join(dir, gameStateRootSubdir)
if err := os.MkdirAll(root, 0o755); err != nil {
t.Fatalf("rtmanager integration: create game-state root %q: %v", root, err)
}
return root
}
func resolveDockerHost() string {
if host := strings.TrimSpace(os.Getenv("DOCKER_HOST")); host != "" {
return host
}
return "unix:///var/run/docker.sock"
}
func pickDuration(value, fallback time.Duration) time.Duration {
if value > 0 {
return value
}
return fallback
}
func pickInt(value, fallback int) int {
if value > 0 {
return value
}
return fallback
}
// waitForListener spins until `runtime.InternalServer().Addr()` returns
// a non-empty value or the deadline fires. The internal listener binds
// during `runtime.Run`, which runs in its own goroutine; this helper
// is the bridge between "Run started" and "tests can use REST".
func waitForListener(t *testing.T, runtime *app.Runtime) string {
t.Helper()
deadline := time.Now().Add(listenerWaitTimeout)
for {
if runtime != nil && runtime.InternalServer() != nil {
if addr := runtime.InternalServer().Addr(); addr != "" {
return addr
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: internal HTTP listener did not bind within %s", listenerWaitTimeout)
}
time.Sleep(readyzPollInterval)
}
}
// waitForReady polls `/readyz` until it returns 200 or the deadline
// fires. RTM's readyz pings PG, Redis, and Docker; a successful
// response means every dependency is reachable through the runtime
// process.
func waitForReady(t *testing.T, runtime *app.Runtime, timeout time.Duration) {
t.Helper()
deadline := time.Now().Add(timeout)
addr := runtime.InternalServer().Addr()
probeURL := (&url.URL{Scheme: "http", Host: addr, Path: "/readyz"}).String()
for {
req, err := newRequest(context.Background(), "GET", probeURL, nil)
if err == nil {
resp, err := defaultHTTPClient.Do(req)
if err == nil {
_, _ = io.Copy(io.Discard, resp.Body)
_ = resp.Body.Close()
if resp.StatusCode == 200 {
return
}
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: /readyz did not return 200 within %s", timeout)
}
time.Sleep(readyzPollInterval)
}
}
func isCleanShutdownErr(err error) bool {
return err == nil || errors.Is(err, context.Canceled)
}
// IDFromTestName builds a unique game id from the caller's test name:
// a lowercased, sanitised prefix for readability plus a base36
// timestamp suffix. Without the suffix, back-to-back runs of the same
// test could collide on container names and state directories that
// outlive the per-test `TruncatePostgres` window.
func IDFromTestName(t *testing.T) string {
t.Helper()
// The container hostname is `galaxy-game-{game_id}` and must fit
// HOST_NAME_MAX=64 chars; runc rejects longer values with
// "sethostname: invalid argument". Cap the lowercased test-name
// component at 36 chars and append a 16-char base36 suffix so the
// total stays comfortably under the limit (12 + 36 + 1 + 16 = 65 →
// trim further if needed).
const maxNameLen = 35
suffix := strconv.FormatInt(time.Now().UnixNano(), 36)
prefix := strings.ToLower(strings.NewReplacer("/", "-", " ", "-").Replace(t.Name()))
if len(prefix) > maxNameLen {
prefix = prefix[:maxNameLen]
}
return prefix + "-" + suffix
}
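// End-to-end shape of a stream-driven scenario on this harness (a
// sketch; the test name is illustrative, the stream helpers live in
// streams.go, and the string conversion assumes the ports outcome
// constants are string-kinded):
//
//  func TestStartJobLifecycle(t *testing.T) {
//      env := NewEnv(t, EnvOptions{})
//      gameID := IDFromTestName(t)
//      XAddStartJob(t, env, gameID, env.EngineImageRef)
//      result := WaitForJobResult(t, env,
//          JobOutcomeIs(gameID, string(JobOutcomeSuccess)), 0)
//      _ = result.ContainerID // correlate with Docker / PG side-effects
//  }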
+128
View File
@@ -0,0 +1,128 @@
package harness
import (
"context"
"errors"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/domain/runtime"
"github.com/stretchr/testify/require"
)
// RuntimeRecord returns the persisted runtime record for gameID. The
// helper opens a fresh store adapter on every call (cheap; the harness
// `*sql.DB` is shared) so no adapter-level state leaks between
// scenarios.
func RuntimeRecord(t testing.TB, env *Env, gameID string) (runtime.RuntimeRecord, error) {
t.Helper()
store, err := runtimerecordstore.New(runtimerecordstore.Config{
DB: env.Postgres.Pool(),
OperationTimeout: pgOperationTimeout,
})
require.NoError(t, err)
return store.Get(context.Background(), gameID)
}
// MustRuntimeRecord asserts that the record exists and returns it.
func MustRuntimeRecord(t testing.TB, env *Env, gameID string) runtime.RuntimeRecord {
t.Helper()
record, err := RuntimeRecord(t, env, gameID)
require.NoErrorf(t, err, "load runtime record for %s", gameID)
return record
}
// EventuallyRuntimeRecord polls until predicate matches the runtime
// record for gameID, or the deadline fires. Returns the matching
// record. Used by lifecycle assertions that depend on async state
// transitions (start consumer → record).
func EventuallyRuntimeRecord(t testing.TB, env *Env, gameID string, predicate func(runtime.RuntimeRecord) bool, timeout time.Duration) runtime.RuntimeRecord {
t.Helper()
if timeout <= 0 {
timeout = defaultStreamTimeout
}
deadline := time.Now().Add(timeout)
for {
record, err := RuntimeRecord(t, env, gameID)
if err == nil && predicate(record) {
return record
}
if err != nil && !errors.Is(err, runtime.ErrNotFound) {
t.Fatalf("rtmanager integration: load runtime record: %v", err)
}
if time.Now().After(deadline) {
if err != nil {
t.Fatalf("rtmanager integration: runtime record predicate not met within %s; last err=%v",
timeout, err)
}
t.Fatalf("rtmanager integration: runtime record predicate not met within %s; last record=%+v",
timeout, record)
}
time.Sleep(defaultStreamPoll)
}
}
// OperationEntries returns up to `limit` most-recent operation_log
// entries for gameID, ordered descending by started_at.
func OperationEntries(t testing.TB, env *Env, gameID string, limit int) []operation.OperationEntry {
t.Helper()
store, err := operationlogstore.New(operationlogstore.Config{
DB: env.Postgres.Pool(),
OperationTimeout: pgOperationTimeout,
})
require.NoError(t, err)
entries, err := store.ListByGame(context.Background(), gameID, limit)
require.NoErrorf(t, err, "list operation log entries for %s", gameID)
return entries
}
// EventuallyOperationKind polls operation_log until at least one entry
// for gameID has the requested kind, or the deadline fires. Returns
// the matching entry.
func EventuallyOperationKind(t testing.TB, env *Env, gameID string, kind operation.OpKind, timeout time.Duration) operation.OperationEntry {
t.Helper()
if timeout <= 0 {
timeout = defaultStreamTimeout
}
deadline := time.Now().Add(timeout)
for {
entries := OperationEntries(t, env, gameID, 50)
for _, entry := range entries {
if entry.OpKind == kind {
return entry
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: operation_log entry with op_kind=%s not seen within %s; observed=%v",
kind, timeout, opKindSummary(entries))
}
time.Sleep(defaultStreamPoll)
}
}
// HealthSnapshot returns the latest persisted health snapshot for
// gameID, or the underlying not-found sentinel when nothing has been
// recorded yet.
func HealthSnapshot(t testing.TB, env *Env, gameID string) (health.HealthSnapshot, error) {
t.Helper()
store, err := healthsnapshotstore.New(healthsnapshotstore.Config{
DB: env.Postgres.Pool(),
OperationTimeout: pgOperationTimeout,
})
require.NoError(t, err)
return store.Get(context.Background(), gameID)
}
func opKindSummary(entries []operation.OperationEntry) []string {
out := make([]string, 0, len(entries))
for _, entry := range entries {
out = append(out, string(entry.OpKind)+"/"+string(entry.Outcome))
}
return out
}
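// Sketch of a store-side assertion (the CurrentContainerID field name
// is an assumption mirroring the REST schema; real scenarios assert
// whatever runtime.RuntimeRecord actually exposes):
//
//  record := EventuallyRuntimeRecord(t, env, gameID,
//      func(r runtime.RuntimeRecord) bool { return r.CurrentContainerID != nil },
//      0)
//  _ = record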
+334
View File
@@ -0,0 +1,334 @@
package harness
import (
"context"
"encoding/json"
"fmt"
"strconv"
"strings"
"testing"
"time"
"galaxy/rtmanager/internal/ports"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/require"
)
// Default scenario timeouts. Stream-driven assertions sit on top of
// the runtime's worker tickers (defaults of 200-500ms in
// `EnvOptions`); 30s gives every reconcile / probe / event tick more
// than enough headroom even on a slow CI runner.
const (
defaultStreamTimeout = 30 * time.Second
defaultStreamPoll = 25 * time.Millisecond
)
// XAddStartJob appends one start-job entry in the
// `runtime:start_jobs` AsyncAPI shape and returns the assigned entry
// id. Mirrors the wire shape produced by Lobby's
// `runtimemanager.Publisher` so the consumer treats the entry exactly
// like a real Lobby-published job.
func XAddStartJob(t testing.TB, env *Env, gameID, imageRef string) string {
t.Helper()
id, err := env.RedisClient.XAdd(context.Background(), &redis.XAddArgs{
Stream: env.Cfg.Streams.StartJobs,
Values: map[string]any{
"game_id": gameID,
"image_ref": imageRef,
"requested_at_ms": time.Now().UTC().UnixMilli(),
},
}).Result()
require.NoErrorf(t, err, "xadd start_jobs for game %s", gameID)
return id
}
// XAddStopJob appends one stop-job entry classified by reason. The
// reason enum is documented at `ports.StopReason`.
func XAddStopJob(t testing.TB, env *Env, gameID, reason string) string {
t.Helper()
id, err := env.RedisClient.XAdd(context.Background(), &redis.XAddArgs{
Stream: env.Cfg.Streams.StopJobs,
Values: map[string]any{
"game_id": gameID,
"reason": reason,
"requested_at_ms": time.Now().UTC().UnixMilli(),
},
}).Result()
require.NoErrorf(t, err, "xadd stop_jobs for game %s", gameID)
return id
}
// JobResultEntry is the decoded shape of one `runtime:job_results`
// stream entry. Mirrors `ports.JobResult` plus the entry id surfaced
// by Redis so tests can correlate XADD ids with results.
type JobResultEntry struct {
StreamID string
GameID string
Outcome string
ContainerID string
EngineEndpoint string
ErrorCode string
ErrorMessage string
}
// HealthEventEntry mirrors the `runtime:health_events` AsyncAPI shape
// in decoded form.
type HealthEventEntry struct {
StreamID string
GameID string
ContainerID string
EventType string
OccurredAtMs int64
Details map[string]any
}
// NotificationIntentEntry decodes one `notification:intents` entry
// that RTM publishes for first-touch start failures.
type NotificationIntentEntry struct {
StreamID string
NotificationType string
IdempotencyKey string
Payload map[string]any
}
// WaitForJobResult polls `runtime:job_results` until predicate
// matches, or the timeout fires. Returns the matching entry. The
// helper does not consume the stream (every call rescans the full
// range) because RTM's writes are append-only and the cardinality per
// test is small.
func WaitForJobResult(t testing.TB, env *Env, predicate func(JobResultEntry) bool, timeout time.Duration) JobResultEntry {
t.Helper()
if timeout <= 0 {
timeout = defaultStreamTimeout
}
deadline := time.Now().Add(timeout)
for {
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.JobResults, "-", "+").Result()
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.JobResults)
for _, entry := range entries {
decoded := decodeJobResult(entry)
if predicate(decoded) {
return decoded
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: no job_result matched within %s; observed=%v",
timeout, jobResultStreamSummary(entries))
}
time.Sleep(defaultStreamPoll)
}
}
// AllJobResults returns every entry on `runtime:job_results` in stream
// order. Useful for assertions that depend on cardinality (replay
// tests).
func AllJobResults(t testing.TB, env *Env) []JobResultEntry {
t.Helper()
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.JobResults, "-", "+").Result()
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.JobResults)
out := make([]JobResultEntry, 0, len(entries))
for _, entry := range entries {
out = append(out, decodeJobResult(entry))
}
return out
}
// WaitForHealthEvent polls `runtime:health_events` until predicate
// matches, or the timeout fires.
func WaitForHealthEvent(t testing.TB, env *Env, predicate func(HealthEventEntry) bool, timeout time.Duration) HealthEventEntry {
t.Helper()
if timeout <= 0 {
timeout = defaultStreamTimeout
}
deadline := time.Now().Add(timeout)
for {
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.HealthEvents, "-", "+").Result()
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.HealthEvents)
for _, entry := range entries {
decoded := decodeHealthEvent(t, entry)
if predicate(decoded) {
return decoded
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: no health_event matched within %s; observed=%v",
timeout, healthEventStreamSummary(entries))
}
time.Sleep(defaultStreamPoll)
}
}
// WaitForNotificationIntent polls `notification:intents` until
// predicate matches.
func WaitForNotificationIntent(t testing.TB, env *Env, predicate func(NotificationIntentEntry) bool, timeout time.Duration) NotificationIntentEntry {
t.Helper()
if timeout <= 0 {
timeout = defaultStreamTimeout
}
deadline := time.Now().Add(timeout)
for {
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.NotificationIntents, "-", "+").Result()
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.NotificationIntents)
for _, entry := range entries {
decoded := decodeNotificationIntent(t, entry)
if predicate(decoded) {
return decoded
}
}
if time.Now().After(deadline) {
t.Fatalf("rtmanager integration: no notification_intent matched within %s; observed=%v",
timeout, notificationStreamSummary(entries))
}
time.Sleep(defaultStreamPoll)
}
}
// JobOutcomeIs returns a predicate matching a job result whose game id
// and outcome equal the inputs.
func JobOutcomeIs(gameID, outcome string) func(JobResultEntry) bool {
return func(entry JobResultEntry) bool {
return entry.GameID == gameID && entry.Outcome == outcome
}
}
// JobOutcomeWithErrorCode matches a job result whose game id, outcome,
// and error_code all equal the inputs. Used by replay-no-op
// assertions.
func JobOutcomeWithErrorCode(gameID, outcome, errorCode string) func(JobResultEntry) bool {
return func(entry JobResultEntry) bool {
return entry.GameID == gameID && entry.Outcome == outcome && entry.ErrorCode == errorCode
}
}
// HealthEventTypeIs returns a predicate matching a health event whose
// game id and event_type equal the inputs.
func HealthEventTypeIs(gameID, eventType string) func(HealthEventEntry) bool {
return func(entry HealthEventEntry) bool {
return entry.GameID == gameID && entry.EventType == eventType
}
}
func decodeJobResult(message redis.XMessage) JobResultEntry {
return JobResultEntry{
StreamID: message.ID,
GameID: streamString(message.Values, "game_id"),
Outcome: streamString(message.Values, "outcome"),
ContainerID: streamString(message.Values, "container_id"),
EngineEndpoint: streamString(message.Values, "engine_endpoint"),
ErrorCode: streamString(message.Values, "error_code"),
ErrorMessage: streamString(message.Values, "error_message"),
}
}
func decodeHealthEvent(t testing.TB, message redis.XMessage) HealthEventEntry {
t.Helper()
occurredAt, _ := strconv.ParseInt(streamString(message.Values, "occurred_at_ms"), 10, 64)
entry := HealthEventEntry{
StreamID: message.ID,
GameID: streamString(message.Values, "game_id"),
ContainerID: streamString(message.Values, "container_id"),
EventType: streamString(message.Values, "event_type"),
OccurredAtMs: occurredAt,
}
rawDetails := streamString(message.Values, "details")
if rawDetails != "" {
var parsed map[string]any
if err := json.Unmarshal([]byte(rawDetails), &parsed); err == nil {
entry.Details = parsed
}
}
return entry
}
func decodeNotificationIntent(t testing.TB, message redis.XMessage) NotificationIntentEntry {
t.Helper()
entry := NotificationIntentEntry{
StreamID: message.ID,
NotificationType: streamString(message.Values, "notification_type"),
IdempotencyKey: streamString(message.Values, "idempotency_key"),
}
rawPayload := streamString(message.Values, "payload_json")
if rawPayload == "" {
rawPayload = streamString(message.Values, "payload")
}
if rawPayload != "" {
var parsed map[string]any
if err := json.Unmarshal([]byte(rawPayload), &parsed); err == nil {
entry.Payload = parsed
}
}
return entry
}
func streamString(values map[string]any, key string) string {
raw, ok := values[key]
if !ok {
return ""
}
switch typed := raw.(type) {
case string:
return typed
case []byte:
return string(typed)
default:
return fmt.Sprintf("%v", typed)
}
}
func jobResultStreamSummary(entries []redis.XMessage) []string {
out := make([]string, 0, len(entries))
for _, entry := range entries {
decoded := decodeJobResult(entry)
out = append(out, fmt.Sprintf("%s game=%s outcome=%s err=%s",
decoded.StreamID, decoded.GameID, decoded.Outcome, decoded.ErrorCode))
}
return out
}
func healthEventStreamSummary(entries []redis.XMessage) []string {
out := make([]string, 0, len(entries))
for _, entry := range entries {
out = append(out, fmt.Sprintf("%s %s %s",
entry.ID, streamString(entry.Values, "game_id"), streamString(entry.Values, "event_type")))
}
return out
}
func notificationStreamSummary(entries []redis.XMessage) []string {
out := make([]string, 0, len(entries))
for _, entry := range entries {
out = append(out, fmt.Sprintf("%s %s",
entry.ID, streamString(entry.Values, "notification_type")))
}
return out
}
// JobOutcomeSuccess and JobOutcomeFailure re-export the job-outcome
// constants from `ports` so suite authors can build predicates without
// importing `ports` themselves, keeping test source focused.
var (
JobOutcomeSuccess = ports.JobOutcomeSuccess
JobOutcomeFailure = ports.JobOutcomeFailure
)
// AssertNoJobResultBeyond fails the test if the count of entries on
// `runtime:job_results` exceeds `expectedCount`. Used by the replay
// tests to prove the second envelope was no-op.
func AssertNoJobResultBeyond(t testing.TB, env *Env, expectedCount int) {
t.Helper()
entries, err := env.RedisClient.XLen(context.Background(), env.Cfg.Streams.JobResults).Result()
require.NoError(t, err)
require.LessOrEqualf(t, entries, int64(expectedCount),
"job_results stream has more entries than expected; got=%d expected<=%d", entries, expectedCount)
}
// SanitizeContainerSummaryFor returns a diagnostic string for a
// container summary keyed by game id. Used in test failures. Key order
// follows Go map iteration, so the output is for human inspection, not
// for equality assertions.
func SanitizeContainerSummaryFor(values map[string]string, gameID string) string {
parts := make([]string, 0, len(values))
for key, value := range values {
parts = append(parts, key+"="+value)
}
return fmt.Sprintf("game=%s {%s}", gameID, strings.Join(parts, ", "))
}
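// sortedContainerSummaryFor sketches a deterministic variant of the
// helper above for tests that ever want to compare summaries textually.
// Hypothetical helper; it assumes "sort" is added to the import block.
func sortedContainerSummaryFor(values map[string]string, gameID string) string {
	parts := make([]string, 0, len(values))
	for key, value := range values {
		parts = append(parts, key+"="+value)
	}
	sort.Strings(parts) // fixed ordering regardless of map iteration
	return fmt.Sprintf("game=%s {%s}", gameID, strings.Join(parts, ", "))
}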
+303
View File
@@ -0,0 +1,303 @@
//go:build integration
// Package integration_test owns the service-local end-to-end scenarios
// for Runtime Manager. The build tag keeps the suite out of the
// default `go test ./...` run; CI invokes the suite explicitly with
// `go test -tags=integration ./rtmanager/integration/...`.
//
// Design rationale for the suite — build tag, in-process harness,
// per-test isolation, two-tag engine image — lives in
// `rtmanager/docs/integration-tests.md`. Each test stands up its own
// Runtime Manager process via `harness.NewEnv`, drives the same
// streams Game Lobby uses in `integration/lobbyrtm`, and asserts the
// resulting PostgreSQL, Redis-stream, and Docker side-effects.
package integration_test
import (
"context"
"net/http"
"testing"
"time"
"galaxy/rtmanager/integration/harness"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
"github.com/docker/docker/api/types/container"
"github.com/docker/docker/api/types/filters"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
// TestMain centralises shared-container teardown so individual
// failing tests do not leak the testcontainers postgres / redis pair.
func TestMain(m *testing.M) {
harness.RunMain(m)
}
// TestLifecycle_StartInspectStopRestartPatchCleanup drives one game
// through every supported lifecycle operation against the real engine
// image and asserts each step's PG, Redis-stream, and Docker
// side-effects.
func TestLifecycle_StartInspectStopRestartPatchCleanup(t *testing.T) {
env := harness.NewEnv(t, harness.EnvOptions{LogToStderr: true})
rest := harness.NewREST(env)
gameID := harness.IDFromTestName(t)
// Step 1 — start through the Lobby async stream contract.
startEntryID := harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
t.Logf("start_jobs xadd id=%s", startEntryID)
startResult := harness.WaitForJobResult(t, env,
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
30*time.Second,
)
require.Equal(t, "", startResult.ErrorCode, "fresh start must publish empty error_code")
require.NotEmpty(t, startResult.ContainerID, "fresh start job result must carry container_id")
require.NotEmpty(t, startResult.EngineEndpoint, "fresh start job result must carry engine_endpoint")
// PG record reflects the start.
startedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusRunning },
15*time.Second,
)
assert.Equal(t, env.EngineImageRef, startedRecord.CurrentImageRef)
assert.Equal(t, env.Network, startedRecord.DockerNetwork)
assert.Equal(t, startResult.ContainerID, startedRecord.CurrentContainerID)
assert.Equal(t, startResult.EngineEndpoint, startedRecord.EngineEndpoint)
// operation_log captures the start.
startEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindStart, 5*time.Second)
assert.Equal(t, operation.OutcomeSuccess, startEntry.Outcome)
assert.Equal(t, operation.OpSourceLobbyStream, startEntry.OpSource)
// Step 2 — inspect via the GM/Admin REST surface.
getResp, status := rest.GetRuntime(t, gameID)
require.Equal(t, http.StatusOK, status)
require.Equal(t, "running", getResp.Status)
require.NotNil(t, getResp.CurrentContainerID)
require.Equal(t, startResult.ContainerID, *getResp.CurrentContainerID)
require.NotNil(t, getResp.CurrentImageRef)
require.Equal(t, env.EngineImageRef, *getResp.CurrentImageRef)
require.NotNil(t, getResp.EngineEndpoint)
require.Equal(t, startResult.EngineEndpoint, *getResp.EngineEndpoint)
// Step 3 — stop through the Lobby async stream contract.
harness.XAddStopJob(t, env, gameID, "cancelled")
stopResult := waitForLatestStopOrStartResult(t, env, gameID)
require.Equal(t, ports.JobOutcomeSuccess, stopResult.Outcome)
require.Equal(t, "", stopResult.ErrorCode, "fresh stop must publish empty error_code")
stoppedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusStopped },
15*time.Second,
)
assert.Equal(t, startResult.ContainerID, stoppedRecord.CurrentContainerID,
"stop preserves the current container id until cleanup")
// Step 4 — restart via REST. Container id changes; engine endpoint
// stays stable.
restartResp, status := rest.RestartRuntime(t, gameID)
require.Equal(t, http.StatusOK, status)
require.Equal(t, "running", restartResp.Status)
require.NotNil(t, restartResp.CurrentContainerID)
require.NotEqual(t, startResult.ContainerID, *restartResp.CurrentContainerID,
"restart must produce a new container id")
require.NotNil(t, restartResp.EngineEndpoint)
require.Equal(t, startResult.EngineEndpoint, *restartResp.EngineEndpoint,
"restart must keep the engine endpoint stable")
restartContainerID := *restartResp.CurrentContainerID
restartEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindRestart, 5*time.Second)
assert.Equal(t, operation.OutcomeSuccess, restartEntry.Outcome)
assert.Equal(t, operation.OpSourceAdminRest, restartEntry.OpSource)
// Step 5 — patch to the second semver-compatible tag. Same image
// content, but the runtime should still record the new tag and
// recreate the container.
patchResp, status := rest.PatchRuntime(t, gameID, env.PatchedImageRef)
require.Equal(t, http.StatusOK, status)
require.Equal(t, "running", patchResp.Status)
require.NotNil(t, patchResp.CurrentImageRef)
assert.Equal(t, env.PatchedImageRef, *patchResp.CurrentImageRef)
require.NotNil(t, patchResp.CurrentContainerID)
assert.NotEqual(t, restartContainerID, *patchResp.CurrentContainerID,
"patch must recreate the container")
patchEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindPatch, 5*time.Second)
assert.Equal(t, operation.OutcomeSuccess, patchEntry.Outcome)
// Step 6 — quiesce via REST stop so cleanup is allowed (cleanup
// refuses to remove a running container per
// `rtmanager/README.md §Lifecycles → Cleanup`).
stopResp, status := rest.StopRuntime(t, gameID, "admin_request")
require.Equal(t, http.StatusOK, status)
require.Equal(t, "stopped", stopResp.Status)
// Step 7 — cleanup the container. PG record flips to removed and
// current_container_id becomes nil.
cleanupResp, status := rest.CleanupRuntime(t, gameID)
require.Equal(t, http.StatusOK, status)
require.Equal(t, "removed", cleanupResp.Status)
require.Nil(t, cleanupResp.CurrentContainerID)
cleanupEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindCleanupContainer, 5*time.Second)
assert.Equal(t, operation.OutcomeSuccess, cleanupEntry.Outcome)
assert.Equal(t, operation.OpSourceAdminRest, cleanupEntry.OpSource)
}
// TestReplay_StartJobIsNoop publishes the same start envelope twice
// and asserts that Runtime Manager produces a fresh job_result for
// the first XADD and a `replay_no_op` outcome for the second, without
// recreating the engine container.
func TestReplay_StartJobIsNoop(t *testing.T) {
env := harness.NewEnv(t, harness.EnvOptions{})
gameID := harness.IDFromTestName(t)
// First XADD: fresh start.
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
first := harness.WaitForJobResult(t, env,
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
30*time.Second,
)
require.Equal(t, "", first.ErrorCode)
// Second XADD: same envelope; the start service must short-circuit
// at the `runtime_records.status=running && image_ref` check.
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
replay := harness.WaitForJobResult(t, env,
harness.JobOutcomeWithErrorCode(gameID, ports.JobOutcomeSuccess, "replay_no_op"),
15*time.Second,
)
assert.Equal(t, first.ContainerID, replay.ContainerID,
"replay must surface the same container id as the original start")
assert.Equal(t, first.EngineEndpoint, replay.EngineEndpoint)
// Docker view: exactly one engine container exists for this game.
assertSingleEngineContainer(t, env, gameID)
// Lifecycle stream produced exactly two entries: fresh + replay.
entries := harness.AllJobResults(t, env)
require.Len(t, entries, 2)
assert.Equal(t, "", entries[0].ErrorCode)
assert.Equal(t, "replay_no_op", entries[1].ErrorCode)
}
// TestReplay_StopJobIsNoop publishes a stop envelope twice after a
// successful start and asserts the second stop surfaces as
// `replay_no_op` without altering the runtime record's `stopped_at`.
func TestReplay_StopJobIsNoop(t *testing.T) {
env := harness.NewEnv(t, harness.EnvOptions{})
gameID := harness.IDFromTestName(t)
// Bring the game to `running`. The start path publishes one entry
// to `runtime:job_results`; the stops below publish two more, so
// per-game stream order is [start, first-stop, replay-stop].
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
harness.WaitForJobResult(t, env,
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
30*time.Second,
)
// First stop: fresh. The expectedCount accounts for the start
// entry that is already on the stream.
harness.XAddStopJob(t, env, gameID, "cancelled")
first := waitForJobResultByIndex(t, env, gameID, 2)
require.Equal(t, ports.JobOutcomeSuccess, first.Outcome)
require.Equal(t, "", first.ErrorCode)
stoppedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusStopped },
15*time.Second,
)
require.NotNil(t, stoppedRecord.StoppedAt, "stopped record must carry stopped_at")
originalStoppedAt := *stoppedRecord.StoppedAt
// Second stop: replay (third entry on the per-game stream).
harness.XAddStopJob(t, env, gameID, "cancelled")
replay := waitForJobResultByIndex(t, env, gameID, 3)
require.Equal(t, ports.JobOutcomeSuccess, replay.Outcome)
assert.Equal(t, "replay_no_op", replay.ErrorCode)
// stopped_at stays anchored to the first stop.
postReplay := harness.MustRuntimeRecord(t, env, gameID)
require.Equal(t, runtime.StatusStopped, postReplay.Status)
require.NotNil(t, postReplay.StoppedAt)
assert.True(t, originalStoppedAt.Equal(*postReplay.StoppedAt),
"stopped_at must not move on a replay stop; was %s, now %s",
originalStoppedAt, *postReplay.StoppedAt)
}
// waitForLatestStopOrStartResult polls `runtime:job_results` until it
// has observed two `outcome=success` entries for gameID and returns the
// most recent one. The lifecycle scenario emits two consecutive
// successes (start then stop), so the second match is the stop result.
func waitForLatestStopOrStartResult(t *testing.T, env *harness.Env, gameID string) harness.JobResultEntry {
t.Helper()
deadline := time.Now().Add(30 * time.Second)
for {
entries := harness.AllJobResults(t, env)
// Two entries means we've observed both the start and stop
// outcomes for this game.
matched := 0
var last harness.JobResultEntry
for _, entry := range entries {
if entry.GameID == gameID && entry.Outcome == ports.JobOutcomeSuccess {
matched++
last = entry
}
}
if matched >= 2 {
return last
}
if time.Now().After(deadline) {
t.Fatalf("expected two job_results for %s, got %d", gameID, matched)
}
time.Sleep(50 * time.Millisecond)
}
}
// waitForJobResultByIndex polls the job_results stream until it has
// at least `expectedCount` entries for gameID and returns the
// expectedCount-th (1-based). Used by the replay tests to
// deterministically pick the nth per-game result.
func waitForJobResultByIndex(t *testing.T, env *harness.Env, gameID string, expectedCount int) harness.JobResultEntry {
t.Helper()
deadline := time.Now().Add(30 * time.Second)
for {
entries := harness.AllJobResults(t, env)
matches := make([]harness.JobResultEntry, 0, len(entries))
for _, entry := range entries {
if entry.GameID == gameID {
matches = append(matches, entry)
}
}
if len(matches) >= expectedCount {
return matches[expectedCount-1]
}
if time.Now().After(deadline) {
t.Fatalf("expected at least %d job_results for %s, got %d",
expectedCount, gameID, len(matches))
}
time.Sleep(50 * time.Millisecond)
}
}
// assertSingleEngineContainer queries Docker by the per-game label and
// asserts exactly one matching container exists. Catches replay
// regressions that would let RTM start two containers for the same
// game id.
func assertSingleEngineContainer(t *testing.T, env *harness.Env, gameID string) {
t.Helper()
args := filters.NewArgs(
filters.Arg("label", "com.galaxy.owner=rtmanager"),
filters.Arg("label", "com.galaxy.game_id="+gameID),
)
containers, err := env.Docker.Client().ContainerList(
context.Background(),
container.ListOptions{All: true, Filters: args},
)
require.NoError(t, err)
require.Lenf(t, containers, 1, "expected one engine container for game %s, got %d", gameID, len(containers))
}
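// assertNoEngineContainer is the natural inverse of the helper above,
// sketched for completeness under the same label scheme: after cleanup
// the per-game filter should match nothing. Hypothetical helper; the
// tests above do not call it.
func assertNoEngineContainer(t *testing.T, env *harness.Env, gameID string) {
	t.Helper()
	args := filters.NewArgs(
		filters.Arg("label", "com.galaxy.owner=rtmanager"),
		filters.Arg("label", "com.galaxy.game_id="+gameID),
	)
	containers, err := env.Docker.Client().ContainerList(
		context.Background(),
		container.ListOptions{All: true, Filters: args},
	)
	require.NoError(t, err)
	require.Emptyf(t, containers, "expected no engine containers for game %s, got %d", gameID, len(containers))
}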
+200
View File
@@ -0,0 +1,200 @@
//go:build integration
package integration_test
import (
"context"
"fmt"
"strconv"
"testing"
"time"
"galaxy/notificationintent"
"galaxy/rtmanager/integration/harness"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
"galaxy/rtmanager/internal/service/startruntime"
dockercontainer "github.com/docker/docker/api/types/container"
"github.com/docker/docker/api/types/network"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
// TestHealth_ContainerDisappearedAndAdopt verifies the two
// drift-detection paths. The Docker events listener emits
// `container_disappeared` when a tracked container is destroyed
// outside RTM, and the reconciler adopts a fresh container labelled
// `com.galaxy.owner=rtmanager` that has no PG row.
//
// `runtime_records.status=removed` is terminal per
// `runtime.AllowedTransitions`; the adoption path therefore uses a
// **fresh** game_id rather than re-adopting the disposed one. That
// matches the documented contract: reconciler adopts containers
// labelled `com.galaxy.owner=rtmanager` for which no PG row exists.
func TestHealth_ContainerDisappearedAndAdopt(t *testing.T) {
env := harness.NewEnv(t, harness.EnvOptions{
ReconcileInterval: 500 * time.Millisecond,
})
// Step 1 — bring a game to running through the start consumer.
disposalGameID := harness.IDFromTestName(t) + "-d"
harness.XAddStartJob(t, env, disposalGameID, env.EngineImageRef)
startResult := harness.WaitForJobResult(t, env,
harness.JobOutcomeIs(disposalGameID, ports.JobOutcomeSuccess),
30*time.Second,
)
originalContainerID := startResult.ContainerID
require.NotEmpty(t, originalContainerID)
// Step 2 — externally remove the container; the events listener
// should observe the destroy and publish `container_disappeared`.
removeContainer(t, env, originalContainerID)
disappeared := harness.WaitForHealthEvent(t, env,
harness.HealthEventTypeIs(disposalGameID, string(health.EventTypeContainerDisappeared)),
20*time.Second,
)
assert.Equal(t, originalContainerID, disappeared.ContainerID)
// The reconciler also marks the runtime record as removed within
// one or two ticks (`reconcile_dispose`).
harness.EventuallyRuntimeRecord(t, env, disposalGameID,
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusRemoved },
15*time.Second,
)
harness.EventuallyOperationKind(t, env, disposalGameID, operation.OpKindReconcileDispose, 5*time.Second)
// Step 3 — bring up an adoption candidate for an unseen game id
// by hand. The reconciler must label-match it, find no record,
// and insert one with status=running.
adoptionGameID := harness.IDFromTestName(t) + "-a"
manualContainerID := runManualEngineContainer(t, env, adoptionGameID)
t.Logf("manual container id=%s", manualContainerID)
adopted := harness.EventuallyRuntimeRecord(t, env, adoptionGameID,
func(r runtime.RuntimeRecord) bool {
return r.Status == runtime.StatusRunning && r.CurrentContainerID == manualContainerID
},
20*time.Second,
)
assert.Equal(t, env.EngineImageRef, adopted.CurrentImageRef)
adoptEntry := harness.EventuallyOperationKind(t, env, adoptionGameID, operation.OpKindReconcileAdopt, 5*time.Second)
assert.Equal(t, operation.OutcomeSuccess, adoptEntry.Outcome)
assert.Equal(t, operation.OpSourceAutoReconcile, adoptEntry.OpSource)
assert.Equal(t, manualContainerID, adoptEntry.ContainerID)
}
// TestNotification_ImagePullFailed drives Runtime Manager with a
// start envelope pointing at an unresolvable image reference. The
// start service must surface the failure on `runtime:job_results` and
// publish a `runtime.image_pull_failed` admin notification on
// `notification:intents`.
func TestNotification_ImagePullFailed(t *testing.T) {
env := harness.NewEnv(t, harness.EnvOptions{})
gameID := harness.IDFromTestName(t)
const missingImage = "galaxy/integration-missing:0.0.0"
harness.XAddStartJob(t, env, gameID, missingImage)
// Job result publishes a failure with the stable image_pull_failed
// code.
jobResult := harness.WaitForJobResult(t, env,
harness.JobOutcomeIs(gameID, ports.JobOutcomeFailure),
60*time.Second,
)
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, jobResult.ErrorCode)
assert.Empty(t, jobResult.ContainerID, "failure must not surface a container id")
assert.Empty(t, jobResult.EngineEndpoint, "failure must not surface an engine endpoint")
assert.NotEmpty(t, jobResult.ErrorMessage, "failure must carry an operator-readable message")
// Notification stream carries the matching admin-only intent.
intent := harness.WaitForNotificationIntent(t, env,
func(entry harness.NotificationIntentEntry) bool {
if entry.NotificationType != string(notificationintent.NotificationTypeRuntimeImagePullFailed) {
return false
}
payloadGameID, _ := entry.Payload["game_id"].(string)
return payloadGameID == gameID
},
30*time.Second,
)
require.NotNil(t, intent.Payload, "notification intent must carry a payload")
assert.Equal(t, gameID, intent.Payload["game_id"])
assert.Equal(t, missingImage, intent.Payload["image_ref"])
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, intent.Payload["error_code"])
// PG state: no running record was installed; operation_log
// captures one failed start with the stable error code.
_, err := harness.RuntimeRecord(t, env, gameID)
if err == nil {
// If an entry was upserted (rollback gap), it must not be
// running.
record := harness.MustRuntimeRecord(t, env, gameID)
assert.NotEqual(t, runtime.StatusRunning, record.Status,
"failed image pull must not leave a running record behind")
}
failureEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindStart, 5*time.Second)
assert.Equal(t, operation.OutcomeFailure, failureEntry.Outcome)
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, failureEntry.ErrorCode)
}
// removeContainer terminates and removes the container behind RTM's
// back. Force=true is required because the container is still running;
// a plain remove would be rejected until the container is stopped.
func removeContainer(t *testing.T, env *harness.Env, containerID string) {
t.Helper()
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
require.NoError(t, env.Docker.Client().ContainerRemove(ctx, containerID, dockercontainer.RemoveOptions{Force: true}))
}
// runManualEngineContainer bypasses RTM and starts an engine container
// directly through the Docker SDK. The container carries every label
// the reconciler reads at adopt time (`com.galaxy.owner`,
// `com.galaxy.kind`, `com.galaxy.game_id`, `com.galaxy.engine_image_ref`,
// `com.galaxy.started_at_ms`) plus the per-game hostname so the
// computed `engine_endpoint` matches what `rtmanager` would have
// written.
func runManualEngineContainer(t *testing.T, env *harness.Env, gameID string) string {
t.Helper()
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
hostname := "galaxy-game-" + gameID
cfg := &dockercontainer.Config{
Image: env.EngineImageRef,
Hostname: hostname,
Labels: map[string]string{
"com.galaxy.owner": "rtmanager",
"com.galaxy.kind": "game-engine",
"com.galaxy.game_id": gameID,
"com.galaxy.engine_image_ref": env.EngineImageRef,
"com.galaxy.started_at_ms": strconv.FormatInt(time.Now().UnixMilli(), 10),
},
Env: []string{
"GAME_STATE_PATH=/var/lib/galaxy-game",
"STORAGE_PATH=/var/lib/galaxy-game",
},
}
hostCfg := &dockercontainer.HostConfig{}
netCfg := &network.NetworkingConfig{
EndpointsConfig: map[string]*network.EndpointSettings{
env.Network: {Aliases: []string{hostname}},
},
}
containerName := fmt.Sprintf("galaxy-game-%s-manual", gameID)
created, err := env.Docker.Client().ContainerCreate(ctx, cfg, hostCfg, netCfg, nil, containerName)
require.NoError(t, err)
t.Cleanup(func() {
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer removeCancel()
_ = env.Docker.Client().ContainerRemove(removeCtx, created.ID, dockercontainer.RemoveOptions{Force: true})
})
require.NoError(t, env.Docker.Client().ContainerStart(ctx, created.ID, dockercontainer.StartOptions{}))
return created.ID
}
@@ -0,0 +1,493 @@
// Package docker provides the production Docker SDK adapter that
// implements `galaxy/rtmanager/internal/ports.DockerClient`. The
// adapter is the single component allowed to talk to the local Docker
// daemon; every Runtime Manager service path that needs container
// lifecycle operations goes through this surface.
//
// The adapter is intentionally narrow — it does not orchestrate, log,
// or retry. Cross-cutting concerns (lease coordination, durable state,
// notification side-effects) live in the service layer.
package docker
import (
"context"
"errors"
"fmt"
"io"
"maps"
"strings"
"sync"
"time"
cerrdefs "github.com/containerd/errdefs"
"github.com/docker/docker/api/types/container"
"github.com/docker/docker/api/types/events"
"github.com/docker/docker/api/types/filters"
"github.com/docker/docker/api/types/image"
"github.com/docker/docker/api/types/network"
dockerclient "github.com/docker/docker/client"
"github.com/docker/go-units"
"galaxy/rtmanager/internal/ports"
)
// EnginePort is the in-container HTTP port the engine listens on. The
// value is fixed by `rtmanager/README.md §Container Model` and by the
// engine's Dockerfile (`game/Dockerfile`); RTM never publishes the port
// to the host. Keeping the constant here lets the adapter own the URL
// shape so the start service does not have to know it.
const EnginePort = 8080
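// engineEndpointFor spells out the URL shape the constant implies. Run
// below builds the same string inline; this helper is a sketch for
// illustration, not committed adapter surface.
func engineEndpointFor(hostname string) string {
	return fmt.Sprintf("http://%s:%d", hostname, EnginePort)
}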
// Config groups the dependencies and per-process defaults required to
// construct a Client. The struct is value-typed so wiring code can
// build it inline without intermediate variables.
type Config struct {
// Docker stores the SDK client this adapter wraps. It must be
// non-nil; callers typically construct it via `client.NewClientWithOpts`.
Docker *dockerclient.Client
// LogDriver stores the Docker logging driver applied to every
// container the adapter creates (e.g. `json-file`).
LogDriver string
// LogOpts stores the comma-separated `key=value` driver options
// forwarded to Docker. Empty disables driver-specific options.
LogOpts string
// Clock supplies the wall-clock used for `RunResult.StartedAt`.
// Defaults to `time.Now` when nil.
Clock func() time.Time
}
// Client is the production adapter implementing `ports.DockerClient`.
// Construct it via NewClient; do not zero-initialise.
type Client struct {
docker *dockerclient.Client
logDriver string
logOpts string
clock func() time.Time
}
// NewClient constructs a Client from cfg. It returns an error if cfg
// does not carry the minimum collaborator set the adapter needs to
// function.
func NewClient(cfg Config) (*Client, error) {
if cfg.Docker == nil {
return nil, errors.New("new docker adapter: nil docker client")
}
if strings.TrimSpace(cfg.LogDriver) == "" {
return nil, errors.New("new docker adapter: log driver must not be empty")
}
clock := cfg.Clock
if clock == nil {
clock = time.Now
}
return &Client{
docker: cfg.Docker,
logDriver: cfg.LogDriver,
logOpts: cfg.LogOpts,
clock: clock,
}, nil
}
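// exampleNewClient sketches typical wiring: the SDK client comes from
// the environment and is handed to NewClient. The Docker SDK options
// are real; the driver values are illustrative defaults rather than
// configuration this package mandates.
func exampleNewClient() (*Client, error) {
	sdk, err := dockerclient.NewClientWithOpts(
		dockerclient.FromEnv,
		dockerclient.WithAPIVersionNegotiation(),
	)
	if err != nil {
		return nil, fmt.Errorf("construct docker sdk client: %w", err)
	}
	return NewClient(Config{
		Docker:    sdk,
		LogDriver: "json-file",
		LogOpts:   "max-size=1m,max-file=3",
	})
}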
// EnsureNetwork verifies the user-defined Docker network is present.
// The adapter never creates networks; provisioning is the operator's
// job per `rtmanager/README.md §Container Model`.
func (client *Client) EnsureNetwork(ctx context.Context, name string) error {
if _, err := client.docker.NetworkInspect(ctx, name, network.InspectOptions{}); err != nil {
if cerrdefs.IsNotFound(err) {
return ports.ErrNetworkMissing
}
return fmt.Errorf("ensure network %q: %w", name, err)
}
return nil
}
// PullImage pulls ref according to policy. The pull stream is drained
// to completion because the Docker SDK only finishes the underlying
// pull when the body is consumed.
func (client *Client) PullImage(ctx context.Context, ref string, policy ports.PullPolicy) error {
if !policy.IsKnown() {
return fmt.Errorf("pull image %q: unknown pull policy %q", ref, policy)
}
switch policy {
case ports.PullPolicyAlways:
return client.runPull(ctx, ref)
case ports.PullPolicyIfMissing:
if present, err := client.imagePresent(ctx, ref); err != nil {
return err
} else if present {
return nil
}
return client.runPull(ctx, ref)
case ports.PullPolicyNever:
present, err := client.imagePresent(ctx, ref)
if err != nil {
return err
}
if !present {
return ports.ErrImageNotFound
}
return nil
default:
return fmt.Errorf("pull image %q: unsupported pull policy %q", ref, policy)
}
}
// InspectImage returns image metadata for ref. RTM only reads labels
// at start time; the broader inspect struct stays accessible for
// diagnostics.
func (client *Client) InspectImage(ctx context.Context, ref string) (ports.ImageInspect, error) {
inspect, err := client.docker.ImageInspect(ctx, ref)
if err != nil {
if cerrdefs.IsNotFound(err) {
return ports.ImageInspect{}, ports.ErrImageNotFound
}
return ports.ImageInspect{}, fmt.Errorf("inspect image %q: %w", ref, err)
}
var labels map[string]string
if inspect.Config != nil {
labels = copyStringMap(inspect.Config.Labels)
}
return ports.ImageInspect{Ref: ref, Labels: labels}, nil
}
// InspectContainer returns container metadata for containerID. The
// adapter best-effort decodes Docker timestamps; malformed values map
// to the zero time so callers can rely on IsZero checks instead of
// handling parse errors.
func (client *Client) InspectContainer(ctx context.Context, containerID string) (ports.ContainerInspect, error) {
inspect, err := client.docker.ContainerInspect(ctx, containerID)
if err != nil {
if cerrdefs.IsNotFound(err) {
return ports.ContainerInspect{}, ports.ErrContainerNotFound
}
return ports.ContainerInspect{}, fmt.Errorf("inspect container %q: %w", containerID, err)
}
result := ports.ContainerInspect{ID: inspect.ID}
if inspect.ContainerJSONBase != nil {
result.RestartCount = inspect.RestartCount
if inspect.State != nil {
result.Status = string(inspect.State.Status)
result.OOMKilled = inspect.State.OOMKilled
result.ExitCode = inspect.State.ExitCode
result.StartedAt = parseDockerTime(inspect.State.StartedAt)
result.FinishedAt = parseDockerTime(inspect.State.FinishedAt)
if inspect.State.Health != nil {
result.Health = string(inspect.State.Health.Status)
}
}
}
if inspect.Config != nil {
result.ImageRef = inspect.Config.Image
result.Hostname = inspect.Config.Hostname
result.Labels = copyStringMap(inspect.Config.Labels)
}
return result, nil
}
// Run creates and starts one container according to spec. On
// `ContainerStart` failure the adapter best-effort removes the partial
// container so the start service never has to clean up after a failed
// start path.
func (client *Client) Run(ctx context.Context, spec ports.RunSpec) (ports.RunResult, error) {
if err := spec.Validate(); err != nil {
return ports.RunResult{}, fmt.Errorf("run container: %w", err)
}
memoryBytes, err := units.RAMInBytes(spec.Memory)
if err != nil {
return ports.RunResult{}, fmt.Errorf("run container %q: parse memory %q: %w", spec.Name, spec.Memory, err)
}
pidsLimit := int64(spec.PIDsLimit)
containerCfg := &container.Config{
Image: spec.Image,
Hostname: spec.Hostname,
Env: envMapToSlice(spec.Env),
Labels: copyStringMap(spec.Labels),
Cmd: append([]string(nil), spec.Cmd...),
}
hostCfg := &container.HostConfig{
Binds: bindMountsToBinds(spec.BindMounts),
LogConfig: container.LogConfig{
Type: client.logDriver,
Config: parseLogOpts(client.logOpts),
},
Resources: container.Resources{
NanoCPUs: int64(spec.CPUQuota * 1e9),
Memory: memoryBytes,
PidsLimit: &pidsLimit,
},
}
netCfg := &network.NetworkingConfig{
EndpointsConfig: map[string]*network.EndpointSettings{
spec.Network: {
Aliases: []string{spec.Hostname},
},
},
}
created, err := client.docker.ContainerCreate(ctx, containerCfg, hostCfg, netCfg, nil, spec.Name)
if err != nil {
return ports.RunResult{}, fmt.Errorf("create container %q: %w", spec.Name, err)
}
if err := client.docker.ContainerStart(ctx, created.ID, container.StartOptions{}); err != nil {
client.cleanupAfterFailedStart(created.ID)
return ports.RunResult{}, fmt.Errorf("start container %q: %w", spec.Name, err)
}
return ports.RunResult{
ContainerID: created.ID,
EngineEndpoint: fmt.Sprintf("http://%s:%d", spec.Hostname, EnginePort),
StartedAt: client.clock(),
}, nil
}
// Stop bounds graceful shutdown by timeout. A missing container is
// surfaced as ErrContainerNotFound so the service layer can treat it
// as already-stopped per `rtmanager/README.md §Lifecycles → Stop`.
func (client *Client) Stop(ctx context.Context, containerID string, timeout time.Duration) error {
seconds := max(int(timeout.Round(time.Second).Seconds()), 0)
if err := client.docker.ContainerStop(ctx, containerID, container.StopOptions{Timeout: &seconds}); err != nil {
if cerrdefs.IsNotFound(err) {
return ports.ErrContainerNotFound
}
return fmt.Errorf("stop container %q: %w", containerID, err)
}
return nil
}
// Remove removes the container without forcing kill. A missing
// container is reported as success so callers can treat the operation
// as idempotent.
func (client *Client) Remove(ctx context.Context, containerID string) error {
if err := client.docker.ContainerRemove(ctx, containerID, container.RemoveOptions{}); err != nil {
if cerrdefs.IsNotFound(err) {
return nil
}
return fmt.Errorf("remove container %q: %w", containerID, err)
}
return nil
}
// List returns container summaries that match filter. Empty Labels
// match every container; the reconciler always passes
// `com.galaxy.owner=rtmanager`.
func (client *Client) List(ctx context.Context, filter ports.ListFilter) ([]ports.ContainerSummary, error) {
args := filters.NewArgs()
for key, value := range filter.Labels {
args.Add("label", key+"="+value)
}
summaries, err := client.docker.ContainerList(ctx, container.ListOptions{All: true, Filters: args})
if err != nil {
return nil, fmt.Errorf("list containers: %w", err)
}
out := make([]ports.ContainerSummary, 0, len(summaries))
for _, summary := range summaries {
hostname := ""
if len(summary.Names) > 0 {
hostname = strings.TrimPrefix(summary.Names[0], "/")
}
out = append(out, ports.ContainerSummary{
ID: summary.ID,
ImageRef: summary.Image,
Hostname: hostname,
Labels: copyStringMap(summary.Labels),
Status: string(summary.State),
StartedAt: time.Unix(summary.Created, 0).UTC(),
})
}
return out, nil
}
// EventsListen subscribes to the Docker events stream and returns a
// typed channel of decoded container events plus an asynchronous
// error channel. The caller cancels ctx to terminate the subscription;
// the goroutine closes both channels on termination.
func (client *Client) EventsListen(ctx context.Context) (<-chan ports.DockerEvent, <-chan error, error) {
msgs, sdkErrs := client.docker.Events(ctx, events.ListOptions{})
out := make(chan ports.DockerEvent)
outErrs := make(chan error, 1)
var closeOnce sync.Once
closeAll := func() {
closeOnce.Do(func() {
close(out)
close(outErrs)
})
}
go func() {
defer closeAll()
for {
select {
case <-ctx.Done():
return
case msg, ok := <-msgs:
if !ok {
return
}
if msg.Type != events.ContainerEventType {
continue
}
select {
case <-ctx.Done():
return
case out <- decodeEvent(msg):
}
case err, ok := <-sdkErrs:
if !ok {
return
}
if err == nil {
continue
}
select {
case <-ctx.Done():
case outErrs <- err:
}
return
}
}
}()
return out, outErrs, nil
}
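// exampleConsumeEvents sketches the intended consumption pattern for
// EventsListen: range the event channel until ctx cancellation closes
// it, then drain the error channel once. Hypothetical caller, shown for
// the channel contract only.
func exampleConsumeEvents(ctx context.Context, client *Client) error {
	eventCh, errCh, err := client.EventsListen(ctx)
	if err != nil {
		return err
	}
	for event := range eventCh {
		fmt.Printf("container %s: %s\n", event.ContainerID, event.Action)
	}
	if streamErr, ok := <-errCh; ok && streamErr != nil {
		return fmt.Errorf("docker events stream: %w", streamErr)
	}
	return nil
}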
func (client *Client) cleanupAfterFailedStart(containerID string) {
cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
_ = client.docker.ContainerRemove(cleanupCtx, containerID, container.RemoveOptions{Force: true})
}
func (client *Client) imagePresent(ctx context.Context, ref string) (bool, error) {
if _, err := client.docker.ImageInspect(ctx, ref); err != nil {
if cerrdefs.IsNotFound(err) {
return false, nil
}
return false, fmt.Errorf("inspect image %q: %w", ref, err)
}
return true, nil
}
func (client *Client) runPull(ctx context.Context, ref string) error {
body, err := client.docker.ImagePull(ctx, ref, image.PullOptions{})
if err != nil {
if cerrdefs.IsNotFound(err) {
return ports.ErrImageNotFound
}
return fmt.Errorf("pull image %q: %w", ref, err)
}
defer body.Close()
if _, err := io.Copy(io.Discard, body); err != nil {
return fmt.Errorf("drain pull stream for %q: %w", ref, err)
}
return nil
}
func envMapToSlice(envMap map[string]string) []string {
if len(envMap) == 0 {
return nil
}
out := make([]string, 0, len(envMap))
for key, value := range envMap {
out = append(out, key+"="+value)
}
return out
}
func bindMountsToBinds(mounts []ports.BindMount) []string {
if len(mounts) == 0 {
return nil
}
binds := make([]string, 0, len(mounts))
for _, mount := range mounts {
bind := mount.HostPath + ":" + mount.MountPath
if mount.ReadOnly {
bind += ":ro"
}
binds = append(binds, bind)
}
return binds
}
func parseLogOpts(raw string) map[string]string {
if strings.TrimSpace(raw) == "" {
return nil
}
out := make(map[string]string)
for part := range strings.SplitSeq(raw, ",") {
entry := strings.TrimSpace(part)
if entry == "" {
continue
}
index := strings.IndexByte(entry, '=')
if index <= 0 {
continue
}
out[entry[:index]] = entry[index+1:]
}
if len(out) == 0 {
return nil
}
return out
}
func parseDockerTime(raw string) time.Time {
if raw == "" {
return time.Time{}
}
parsed, err := time.Parse(time.RFC3339Nano, raw)
if err != nil {
return time.Time{}
}
return parsed.UTC()
}
func copyStringMap(in map[string]string) map[string]string {
if in == nil {
return nil
}
out := make(map[string]string, len(in))
maps.Copy(out, in)
return out
}
func decodeEvent(msg events.Message) ports.DockerEvent {
occurredAt := time.Time{}
switch {
case msg.TimeNano != 0:
occurredAt = time.Unix(0, msg.TimeNano).UTC()
case msg.Time != 0:
occurredAt = time.Unix(msg.Time, 0).UTC()
}
exitCode := 0
if raw, ok := msg.Actor.Attributes["exitCode"]; ok {
if value, err := parseExitCode(raw); err == nil {
exitCode = value
}
}
return ports.DockerEvent{
Action: string(msg.Action),
ContainerID: msg.Actor.ID,
Labels: copyStringMap(msg.Actor.Attributes),
ExitCode: exitCode,
OccurredAt: occurredAt,
}
}
// parseExitCode decodes the decimal `exitCode` actor attribute without
// pulling in strconv. Empty and non-numeric values are rejected so a
// malformed attribute never silently decodes as exit code 0.
func parseExitCode(raw string) (int, error) {
	if raw == "" {
		return 0, fmt.Errorf("empty exit code")
	}
	value := 0
	for _, r := range raw {
		if r < '0' || r > '9' {
			return 0, fmt.Errorf("non-numeric exit code %q", raw)
		}
		value = value*10 + int(r-'0')
	}
	return value, nil
}
// Compile-time assertion: Client implements ports.DockerClient.
var _ ports.DockerClient = (*Client)(nil)
@@ -0,0 +1,561 @@
package docker
import (
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"net/http/httptest"
"net/url"
"strings"
"sync/atomic"
"testing"
"time"
dockerclient "github.com/docker/docker/client"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"galaxy/rtmanager/internal/ports"
)
// newTestClient wires an httptest.Server-backed Docker SDK client to
// our adapter. The handler is invoked for every Docker API request
// issued during the test; tests assert on path and method to route the
// response.
func newTestClient(t *testing.T, handler http.HandlerFunc) *Client {
t.Helper()
server := httptest.NewServer(handler)
t.Cleanup(server.Close)
docker, err := dockerclient.NewClientWithOpts(
dockerclient.WithHost(server.URL),
dockerclient.WithHTTPClient(server.Client()),
dockerclient.WithVersion("1.45"),
)
require.NoError(t, err)
t.Cleanup(func() { _ = docker.Close() })
client, err := NewClient(Config{
Docker: docker,
LogDriver: "json-file",
LogOpts: "max-size=1m,max-file=3",
Clock: func() time.Time { return time.Date(2026, time.April, 27, 12, 0, 0, 0, time.UTC) },
})
require.NoError(t, err)
return client
}
func writeJSON(t *testing.T, w http.ResponseWriter, status int, body any) {
t.Helper()
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(status)
require.NoError(t, json.NewEncoder(w).Encode(body))
}
func writeNotFound(t *testing.T, w http.ResponseWriter, msg string) {
t.Helper()
writeJSON(t, w, http.StatusNotFound, map[string]string{"message": msg})
}
// The Docker SDK uses the /v1.45 prefix when the client is pinned to API version 1.45.
func dockerPath(suffix string) string {
return "/v1.45" + suffix
}
func TestNewClientValidatesConfig(t *testing.T) {
t.Run("nil docker client", func(t *testing.T) {
_, err := NewClient(Config{LogDriver: "json-file"})
require.Error(t, err)
assert.Contains(t, err.Error(), "nil docker client")
})
t.Run("empty log driver", func(t *testing.T) {
docker, err := dockerclient.NewClientWithOpts(dockerclient.WithHost("tcp://127.0.0.1:65535"))
require.NoError(t, err)
t.Cleanup(func() { _ = docker.Close() })
_, err = NewClient(Config{Docker: docker, LogDriver: " "})
require.Error(t, err)
assert.Contains(t, err.Error(), "log driver")
})
}
func TestEnsureNetwork(t *testing.T) {
t.Run("present", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, dockerPath("/networks/galaxy-net"), r.URL.Path)
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "net-1", "Name": "galaxy-net"})
})
require.NoError(t, client.EnsureNetwork(context.Background(), "galaxy-net"))
})
t.Run("missing", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
writeNotFound(t, w, "no such network")
})
err := client.EnsureNetwork(context.Background(), "missing")
require.Error(t, err)
assert.ErrorIs(t, err, ports.ErrNetworkMissing)
})
t.Run("transport error", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
http.Error(w, "boom", http.StatusInternalServerError)
})
err := client.EnsureNetwork(context.Background(), "x")
require.Error(t, err)
assert.NotErrorIs(t, err, ports.ErrNetworkMissing)
})
}
func TestInspectImage(t *testing.T) {
t.Run("present", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, dockerPath("/images/galaxy/game:test/json"), r.URL.Path)
writeJSON(t, w, http.StatusOK, map[string]any{
"Id": "sha256:abc",
"Config": map[string]any{
"Labels": map[string]string{
"com.galaxy.cpu_quota": "1.0",
"com.galaxy.memory": "512m",
"com.galaxy.pids_limit": "512",
},
},
})
})
got, err := client.InspectImage(context.Background(), "galaxy/game:test")
require.NoError(t, err)
assert.Equal(t, "galaxy/game:test", got.Ref)
assert.Equal(t, "1.0", got.Labels["com.galaxy.cpu_quota"])
assert.Equal(t, "512m", got.Labels["com.galaxy.memory"])
})
t.Run("not found", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
writeNotFound(t, w, "no such image")
})
_, err := client.InspectImage(context.Background(), "galaxy/missing:tag")
require.Error(t, err)
assert.ErrorIs(t, err, ports.ErrImageNotFound)
})
}
func TestInspectContainer(t *testing.T) {
t.Run("present", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, dockerPath("/containers/cont-1/json"), r.URL.Path)
writeJSON(t, w, http.StatusOK, map[string]any{
"Id": "cont-1",
"RestartCount": 2,
"State": map[string]any{
"Status": "running",
"OOMKilled": false,
"ExitCode": 0,
"StartedAt": "2026-04-27T11:00:00.5Z",
"FinishedAt": "0001-01-01T00:00:00Z",
"Health": map[string]any{"Status": "healthy"},
},
"Config": map[string]any{
"Image": "galaxy/game:test",
"Hostname": "galaxy-game-game-1",
"Labels": map[string]string{
"com.galaxy.owner": "rtmanager",
"com.galaxy.game_id": "game-1",
},
},
})
})
got, err := client.InspectContainer(context.Background(), "cont-1")
require.NoError(t, err)
assert.Equal(t, "cont-1", got.ID)
assert.Equal(t, 2, got.RestartCount)
assert.Equal(t, "running", got.Status)
assert.Equal(t, "healthy", got.Health)
assert.Equal(t, "galaxy/game:test", got.ImageRef)
assert.Equal(t, "galaxy-game-game-1", got.Hostname)
assert.Equal(t, "rtmanager", got.Labels["com.galaxy.owner"])
assert.False(t, got.StartedAt.IsZero())
})
t.Run("not found", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
writeNotFound(t, w, "no such container")
})
_, err := client.InspectContainer(context.Background(), "missing")
require.Error(t, err)
assert.ErrorIs(t, err, ports.ErrContainerNotFound)
})
}
func TestPullImagePolicies(t *testing.T) {
t.Run("if_missing/found skips pull", func(t *testing.T) {
hits := struct {
inspect atomic.Int32
pull atomic.Int32
}{}
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
switch {
case strings.HasSuffix(r.URL.Path, "/json") && r.Method == http.MethodGet:
hits.inspect.Add(1)
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "sha256:x"})
case strings.Contains(r.URL.Path, "/images/create"):
hits.pull.Add(1)
w.WriteHeader(http.StatusOK)
default:
t.Fatalf("unexpected request %s %s", r.Method, r.URL.Path)
}
})
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyIfMissing))
assert.Equal(t, int32(1), hits.inspect.Load())
assert.Equal(t, int32(0), hits.pull.Load())
})
t.Run("if_missing/absent triggers pull", func(t *testing.T) {
hits := struct {
inspect atomic.Int32
pull atomic.Int32
}{}
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
switch {
case strings.HasSuffix(r.URL.Path, "/json") && r.Method == http.MethodGet:
hits.inspect.Add(1)
writeNotFound(t, w, "no such image")
case strings.Contains(r.URL.Path, "/images/create"):
hits.pull.Add(1)
w.WriteHeader(http.StatusOK)
_, _ = io.WriteString(w, `{"status":"Pulling..."}`+"\n"+`{"status":"Done"}`+"\n")
default:
t.Fatalf("unexpected request %s %s", r.Method, r.URL.Path)
}
})
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyIfMissing))
assert.Equal(t, int32(1), hits.inspect.Load())
assert.Equal(t, int32(1), hits.pull.Load())
})
t.Run("always pulls regardless of cache", func(t *testing.T) {
var pullCount atomic.Int32
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Contains(t, r.URL.Path, "/images/create")
pullCount.Add(1)
w.WriteHeader(http.StatusOK)
})
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyAlways))
assert.Equal(t, int32(1), pullCount.Load())
})
t.Run("never with absent image", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
writeNotFound(t, w, "no such image")
})
err := client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyNever)
require.Error(t, err)
assert.ErrorIs(t, err, ports.ErrImageNotFound)
})
t.Run("never with present image", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "x"})
})
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyNever))
})
t.Run("unknown policy", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
t.Fatal("must not call docker on unknown policy")
})
err := client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicy("invalid"))
require.Error(t, err)
})
}
func TestRunHappyPath(t *testing.T) {
calls := struct {
create atomic.Int32
start atomic.Int32
remove atomic.Int32
}{}
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
switch {
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/containers/create"):
calls.create.Add(1)
require.Equal(t, "galaxy-game-game-1", r.URL.Query().Get("name"))
writeJSON(t, w, http.StatusCreated, map[string]any{"Id": "cont-new", "Warnings": []string{}})
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/start"):
calls.start.Add(1)
require.Equal(t, dockerPath("/containers/cont-new/start"), r.URL.Path)
w.WriteHeader(http.StatusNoContent)
case r.Method == http.MethodDelete && strings.HasPrefix(r.URL.Path, dockerPath("/containers/")):
calls.remove.Add(1)
w.WriteHeader(http.StatusNoContent)
default:
t.Fatalf("unexpected %s %s", r.Method, r.URL.Path)
}
})
result, err := client.Run(context.Background(), ports.RunSpec{
Name: "galaxy-game-game-1",
Image: "galaxy/game:test",
Hostname: "galaxy-game-game-1",
Network: "galaxy-net",
Env: map[string]string{
"GAME_STATE_PATH": "/var/lib/galaxy-game",
"STORAGE_PATH": "/var/lib/galaxy-game",
},
Labels: map[string]string{"com.galaxy.owner": "rtmanager"},
LogDriver: "json-file",
BindMounts: []ports.BindMount{
{HostPath: "/var/lib/galaxy/games/game-1", MountPath: "/var/lib/galaxy-game"},
},
CPUQuota: 1.0,
Memory: "512m",
PIDsLimit: 512,
})
require.NoError(t, err)
assert.Equal(t, "cont-new", result.ContainerID)
assert.Equal(t, "http://galaxy-game-game-1:8080", result.EngineEndpoint)
assert.False(t, result.StartedAt.IsZero())
assert.Equal(t, int32(1), calls.create.Load())
assert.Equal(t, int32(1), calls.start.Load())
assert.Equal(t, int32(0), calls.remove.Load())
}
func TestRunStartFailureRemovesContainer(t *testing.T) {
calls := struct {
create atomic.Int32
start atomic.Int32
remove atomic.Int32
}{}
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
switch {
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/containers/create"):
calls.create.Add(1)
writeJSON(t, w, http.StatusCreated, map[string]any{"Id": "cont-x"})
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/start"):
calls.start.Add(1)
http.Error(w, `{"message":"insufficient host resources"}`, http.StatusInternalServerError)
case r.Method == http.MethodDelete && strings.HasPrefix(r.URL.Path, dockerPath("/containers/cont-x")):
calls.remove.Add(1)
require.Equal(t, "1", r.URL.Query().Get("force"))
w.WriteHeader(http.StatusNoContent)
default:
t.Fatalf("unexpected %s %s", r.Method, r.URL.Path)
}
})
_, err := client.Run(context.Background(), ports.RunSpec{
Name: "x",
Image: "img",
Hostname: "x",
Network: "n",
LogDriver: "json-file",
CPUQuota: 1.0,
Memory: "64m",
PIDsLimit: 64,
})
require.Error(t, err)
assert.Equal(t, int32(1), calls.create.Load())
assert.Equal(t, int32(1), calls.start.Load())
assert.Equal(t, int32(1), calls.remove.Load(), "adapter must roll back the partial container")
}
func TestRunRejectsInvalidSpec(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
t.Fatal("must not contact docker on invalid spec")
})
_, err := client.Run(context.Background(), ports.RunSpec{Name: "x"})
require.Error(t, err)
assert.Contains(t, err.Error(), "image must not be empty")
}
func TestStop(t *testing.T) {
t.Run("graceful stop", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodPost, r.Method)
require.Equal(t, dockerPath("/containers/cont-1/stop"), r.URL.Path)
require.Equal(t, "30", r.URL.Query().Get("t"))
w.WriteHeader(http.StatusNoContent)
})
require.NoError(t, client.Stop(context.Background(), "cont-1", 30*time.Second))
})
t.Run("missing container", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
writeNotFound(t, w, "no such container")
})
err := client.Stop(context.Background(), "missing", 30*time.Second)
assert.ErrorIs(t, err, ports.ErrContainerNotFound)
})
t.Run("negative timeout normalised to zero", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, "0", r.URL.Query().Get("t"))
w.WriteHeader(http.StatusNoContent)
})
require.NoError(t, client.Stop(context.Background(), "x", -5*time.Second))
})
}
func TestRemoveIsIdempotent(t *testing.T) {
t.Run("present", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodDelete, r.Method)
w.WriteHeader(http.StatusNoContent)
})
require.NoError(t, client.Remove(context.Background(), "cont-1"))
})
t.Run("missing", func(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
writeNotFound(t, w, "no such container")
})
require.NoError(t, client.Remove(context.Background(), "missing"))
})
}
func TestListAppliesLabelFilter(t *testing.T) {
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, dockerPath("/containers/json"), r.URL.Path)
require.Equal(t, "1", r.URL.Query().Get("all"))
filtersRaw := r.URL.Query().Get("filters")
require.NotEmpty(t, filtersRaw)
var args map[string]map[string]bool
require.NoError(t, json.Unmarshal([]byte(filtersRaw), &args))
require.True(t, args["label"]["com.galaxy.owner=rtmanager"])
writeJSON(t, w, http.StatusOK, []map[string]any{
{
"Id": "cont-a",
"Image": "galaxy/game:1.2.3",
"Names": []string{"/galaxy-game-game-1"},
"Labels": map[string]string{"com.galaxy.owner": "rtmanager"},
"State": "running",
"Created": int64(1700000000),
},
})
})
got, err := client.List(context.Background(), ports.ListFilter{
Labels: map[string]string{"com.galaxy.owner": "rtmanager"},
})
require.NoError(t, err)
require.Len(t, got, 1)
assert.Equal(t, "cont-a", got[0].ID)
assert.Equal(t, "galaxy/game:1.2.3", got[0].ImageRef)
assert.Equal(t, "galaxy-game-game-1", got[0].Hostname)
assert.Equal(t, "running", got[0].Status)
assert.False(t, got[0].StartedAt.IsZero())
assert.Equal(t, "rtmanager", got[0].Labels["com.galaxy.owner"])
}
func TestEventsListenDecodesContainerEvents(t *testing.T) {
done := make(chan struct{})
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, dockerPath("/events"), r.URL.Path)
flusher, ok := w.(http.Flusher)
require.True(t, ok)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
flusher.Flush()
// Container start event
writeEvent(t, w, "container", "start", "cont-1", map[string]string{
"image": "galaxy/game:1.2.3",
"name": "galaxy-game-game-1",
"com.galaxy.game_id": "game-1",
}, time.Now())
flusher.Flush()
// Container die event with exit code 137
writeEvent(t, w, "container", "die", "cont-1", map[string]string{
"exitCode": "137",
}, time.Now())
flusher.Flush()
// Image event must be filtered out by adapter
writeEvent(t, w, "image", "pull", "img", nil, time.Now())
flusher.Flush()
<-done
})
defer close(done)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
events, _, err := client.EventsListen(ctx)
require.NoError(t, err)
got := []ports.DockerEvent{}
deadline := time.After(2 * time.Second)
for len(got) < 2 {
select {
case ev, ok := <-events:
if !ok {
t.Fatalf("events channel closed; got %d events", len(got))
}
got = append(got, ev)
case <-deadline:
t.Fatalf("did not receive expected events; have %d", len(got))
}
}
require.Len(t, got, 2)
assert.Equal(t, "start", got[0].Action)
assert.Equal(t, "cont-1", got[0].ContainerID)
assert.Equal(t, "game-1", got[0].Labels["com.galaxy.game_id"])
assert.Equal(t, "die", got[1].Action)
assert.Equal(t, 137, got[1].ExitCode)
}
func writeEvent(t *testing.T, w io.Writer, eventType, action, id string, attributes map[string]string, when time.Time) {
t.Helper()
payload := map[string]any{
"Type": eventType,
"Action": action,
"Actor": map[string]any{"ID": id, "Attributes": attributes},
"time": when.Unix(),
"timeNano": when.UnixNano(),
}
data, err := json.Marshal(payload)
require.NoError(t, err)
_, err = fmt.Fprintln(w, string(data))
require.NoError(t, err)
}
// Sanity: parsing helpers.
func TestParseLogOpts(t *testing.T) {
got := parseLogOpts("max-size=1m,max-file=3, ,empty=,=novalue")
assert.Equal(t, "1m", got["max-size"])
assert.Equal(t, "3", got["max-file"])
assert.Equal(t, "", got["empty"])
_, hasNovalue := got["=novalue"]
assert.False(t, hasNovalue)
}
func TestParseDockerTime(t *testing.T) {
assert.True(t, parseDockerTime("").IsZero())
assert.True(t, parseDockerTime("not-a-date").IsZero())
parsed := parseDockerTime("2026-04-27T11:00:00.5Z")
assert.False(t, parsed.IsZero())
assert.Equal(t, time.UTC, parsed.Location())
}
func TestEnvMapToSliceDeterministicLength(t *testing.T) {
got := envMapToSlice(map[string]string{"A": "1", "B": "2"})
assert.Len(t, got, 2)
for _, kv := range got {
assert.Contains(t, []string{"A=1", "B=2"}, kv)
}
assert.Nil(t, envMapToSlice(nil))
}
// Sanity: the sentinel errors stay distinct and the errors.Is wiring stays intact.
func TestSentinelErrorsAreDistinct(t *testing.T) {
require.True(t, errors.Is(ports.ErrNetworkMissing, ports.ErrNetworkMissing))
require.False(t, errors.Is(ports.ErrNetworkMissing, ports.ErrImageNotFound))
}
func TestURLPathEscapingForCharacters(t *testing.T) {
	// The adapter passes raw identifiers through and relies on the SDK
	// to escape them; spot-check the escaping primitive on a plain and
	// a special-character input.
	assert.Equal(t, "game-1", url.PathEscape("game-1"))
	assert.Equal(t, "game%201", url.PathEscape("game 1"))
}
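// The generated MockDockerClient in the mocks package below slots into
// service-layer unit tests. The following is a hypothetical test
// fragment (not a test in this file) showing the EXPECT-style setup
// against the adapter contract:
//
//	ctrl := gomock.NewController(t)
//	client := mocks.NewMockDockerClient(ctrl)
//	client.EXPECT().
//		PullImage(gomock.Any(), "galaxy/game:1.2.3", ports.PullPolicyIfMissing).
//		Return(nil)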
@@ -0,0 +1,175 @@
// Code generated by MockGen. DO NOT EDIT.
// Source: galaxy/rtmanager/internal/ports (interfaces: DockerClient)
//
// Generated by this command:
//
// mockgen -destination=../adapters/docker/mocks/mock_dockerclient.go -package=mocks galaxy/rtmanager/internal/ports DockerClient
//
// Package mocks is a generated GoMock package.
package mocks
import (
context "context"
ports "galaxy/rtmanager/internal/ports"
reflect "reflect"
time "time"
gomock "go.uber.org/mock/gomock"
)
// MockDockerClient is a mock of DockerClient interface.
type MockDockerClient struct {
ctrl *gomock.Controller
recorder *MockDockerClientMockRecorder
isgomock struct{}
}
// MockDockerClientMockRecorder is the mock recorder for MockDockerClient.
type MockDockerClientMockRecorder struct {
mock *MockDockerClient
}
// NewMockDockerClient creates a new mock instance.
func NewMockDockerClient(ctrl *gomock.Controller) *MockDockerClient {
mock := &MockDockerClient{ctrl: ctrl}
mock.recorder = &MockDockerClientMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockDockerClient) EXPECT() *MockDockerClientMockRecorder {
return m.recorder
}
// EnsureNetwork mocks base method.
func (m *MockDockerClient) EnsureNetwork(ctx context.Context, name string) error {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "EnsureNetwork", ctx, name)
ret0, _ := ret[0].(error)
return ret0
}
// EnsureNetwork indicates an expected call of EnsureNetwork.
func (mr *MockDockerClientMockRecorder) EnsureNetwork(ctx, name any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "EnsureNetwork", reflect.TypeOf((*MockDockerClient)(nil).EnsureNetwork), ctx, name)
}
// EventsListen mocks base method.
func (m *MockDockerClient) EventsListen(ctx context.Context) (<-chan ports.DockerEvent, <-chan error, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "EventsListen", ctx)
ret0, _ := ret[0].(<-chan ports.DockerEvent)
ret1, _ := ret[1].(<-chan error)
ret2, _ := ret[2].(error)
return ret0, ret1, ret2
}
// EventsListen indicates an expected call of EventsListen.
func (mr *MockDockerClientMockRecorder) EventsListen(ctx any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "EventsListen", reflect.TypeOf((*MockDockerClient)(nil).EventsListen), ctx)
}
// InspectContainer mocks base method.
func (m *MockDockerClient) InspectContainer(ctx context.Context, containerID string) (ports.ContainerInspect, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "InspectContainer", ctx, containerID)
ret0, _ := ret[0].(ports.ContainerInspect)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// InspectContainer indicates an expected call of InspectContainer.
func (mr *MockDockerClientMockRecorder) InspectContainer(ctx, containerID any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "InspectContainer", reflect.TypeOf((*MockDockerClient)(nil).InspectContainer), ctx, containerID)
}
// InspectImage mocks base method.
func (m *MockDockerClient) InspectImage(ctx context.Context, ref string) (ports.ImageInspect, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "InspectImage", ctx, ref)
ret0, _ := ret[0].(ports.ImageInspect)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// InspectImage indicates an expected call of InspectImage.
func (mr *MockDockerClientMockRecorder) InspectImage(ctx, ref any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "InspectImage", reflect.TypeOf((*MockDockerClient)(nil).InspectImage), ctx, ref)
}
// List mocks base method.
func (m *MockDockerClient) List(ctx context.Context, filter ports.ListFilter) ([]ports.ContainerSummary, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "List", ctx, filter)
ret0, _ := ret[0].([]ports.ContainerSummary)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// List indicates an expected call of List.
func (mr *MockDockerClientMockRecorder) List(ctx, filter any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "List", reflect.TypeOf((*MockDockerClient)(nil).List), ctx, filter)
}
// PullImage mocks base method.
func (m *MockDockerClient) PullImage(ctx context.Context, ref string, policy ports.PullPolicy) error {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "PullImage", ctx, ref, policy)
ret0, _ := ret[0].(error)
return ret0
}
// PullImage indicates an expected call of PullImage.
func (mr *MockDockerClientMockRecorder) PullImage(ctx, ref, policy any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "PullImage", reflect.TypeOf((*MockDockerClient)(nil).PullImage), ctx, ref, policy)
}
// Remove mocks base method.
func (m *MockDockerClient) Remove(ctx context.Context, containerID string) error {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Remove", ctx, containerID)
ret0, _ := ret[0].(error)
return ret0
}
// Remove indicates an expected call of Remove.
func (mr *MockDockerClientMockRecorder) Remove(ctx, containerID any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Remove", reflect.TypeOf((*MockDockerClient)(nil).Remove), ctx, containerID)
}
// Run mocks base method.
func (m *MockDockerClient) Run(ctx context.Context, spec ports.RunSpec) (ports.RunResult, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Run", ctx, spec)
ret0, _ := ret[0].(ports.RunResult)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Run indicates an expected call of Run.
func (mr *MockDockerClientMockRecorder) Run(ctx, spec any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Run", reflect.TypeOf((*MockDockerClient)(nil).Run), ctx, spec)
}
// Stop mocks base method.
func (m *MockDockerClient) Stop(ctx context.Context, containerID string, timeout time.Duration) error {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Stop", ctx, containerID, timeout)
ret0, _ := ret[0].(error)
return ret0
}
// Stop indicates an expected call of Stop.
func (mr *MockDockerClientMockRecorder) Stop(ctx, containerID, timeout any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Stop", reflect.TypeOf((*MockDockerClient)(nil).Stop), ctx, containerID, timeout)
}
@@ -0,0 +1,11 @@
package mocks
import (
"galaxy/rtmanager/internal/ports"
)
// Compile-time assertion that the generated mock satisfies the port
// interface. Future signature drift between the port and the generated
// file fails the build at this line, which is more actionable than a
// runtime check from a service test.
var _ ports.DockerClient = (*MockDockerClient)(nil)
@@ -0,0 +1,202 @@
// Package docker smoke tests exercise the production adapter against a
// real Docker daemon. The tests skip when no Docker socket is reachable
// (`skipUnlessDockerAvailable`), so they run in the default
// `go test ./...` pass without a build tag.
package docker
import (
"context"
"crypto/rand"
"encoding/hex"
"errors"
"os"
"testing"
"time"
"github.com/docker/docker/api/types/network"
dockerclient "github.com/docker/docker/client"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"galaxy/rtmanager/internal/ports"
)
const (
smokeImage = "alpine:3.21"
smokeNetPrefix = "rtmanager-smoke-"
)
func skipUnlessDockerAvailable(t *testing.T) {
t.Helper()
if os.Getenv("DOCKER_HOST") == "" {
if _, err := os.Stat("/var/run/docker.sock"); err != nil {
t.Skip("docker daemon not available; set DOCKER_HOST or expose /var/run/docker.sock")
}
}
}
func newSmokeAdapter(t *testing.T) (*Client, *dockerclient.Client) {
t.Helper()
docker, err := dockerclient.NewClientWithOpts(dockerclient.FromEnv, dockerclient.WithAPIVersionNegotiation())
require.NoError(t, err)
t.Cleanup(func() { _ = docker.Close() })
pingCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if _, err := docker.Ping(pingCtx); err != nil {
// A reachable socket path may still be unusable in sandboxed
// environments (e.g., macOS sandbox blocking the colima socket).
// The smoke test can only run when the daemon answers ping, so a
// permission-denied / connection-refused error is a runtime
// "Docker unavailable" signal and skips the test.
t.Skipf("docker daemon unavailable: %v", err)
}
adapter, err := NewClient(Config{
Docker: docker,
LogDriver: "json-file",
})
require.NoError(t, err)
return adapter, docker
}
func uniqueSuffix(t *testing.T) string {
t.Helper()
buf := make([]byte, 4)
_, err := rand.Read(buf)
require.NoError(t, err)
return hex.EncodeToString(buf)
}
// TestSmokeFullLifecycle runs the adapter through every method against
// the real Docker daemon: ensure-network → pull → run → events →
// stop → remove.
func TestSmokeFullLifecycle(t *testing.T) {
skipUnlessDockerAvailable(t)
adapter, docker := newSmokeAdapter(t)
suffix := uniqueSuffix(t)
netName := smokeNetPrefix + suffix
containerName := "rtmanager-smoke-cont-" + suffix
// Step 1 — provision a temporary user-defined bridge network.
createCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
_, err := docker.NetworkCreate(createCtx, netName, network.CreateOptions{Driver: "bridge"})
require.NoError(t, err)
t.Cleanup(func() {
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer removeCancel()
_ = docker.NetworkRemove(removeCtx, netName)
})
// Step 2 — EnsureNetwork present and missing paths.
require.NoError(t, adapter.EnsureNetwork(createCtx, netName))
missingErr := adapter.EnsureNetwork(createCtx, "rtmanager-smoke-missing-"+suffix)
require.Error(t, missingErr)
assert.ErrorIs(t, missingErr, ports.ErrNetworkMissing)
// Step 3 — pull alpine via the configured policy.
pullCtx, pullCancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer pullCancel()
require.NoError(t, adapter.PullImage(pullCtx, smokeImage, ports.PullPolicyIfMissing))
// Step 4 — subscribe to events before running the container so we
// observe the start event.
listenCtx, listenCancel := context.WithCancel(context.Background())
defer listenCancel()
events, listenErrs, err := adapter.EventsListen(listenCtx)
require.NoError(t, err)
// Step 5 — run a tiny container that sleeps so we can observe it.
stateDir := t.TempDir()
runCtx, runCancel := context.WithTimeout(context.Background(), 60*time.Second)
defer runCancel()
result, err := adapter.Run(runCtx, ports.RunSpec{
Name: containerName,
Image: smokeImage,
Hostname: "smoke-" + suffix,
Network: netName,
Env: map[string]string{
"GAME_STATE_PATH": "/tmp/state",
"STORAGE_PATH": "/tmp/state",
},
Labels: map[string]string{
"com.galaxy.owner": "rtmanager",
"com.galaxy.kind": "smoke",
},
BindMounts: []ports.BindMount{
{HostPath: stateDir, MountPath: "/tmp/state"},
},
LogDriver: "json-file",
CPUQuota: 0.5,
Memory: "64m",
PIDsLimit: 32,
Cmd: []string{"/bin/sh", "-c", "sleep 60"},
})
require.NoError(t, err)
t.Cleanup(func() {
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer removeCancel()
_ = adapter.Remove(removeCtx, result.ContainerID)
})
require.NotEmpty(t, result.ContainerID)
require.Equal(t, "http://smoke-"+suffix+":8080", result.EngineEndpoint)
// Step 6 — wait for a `start` event for the new container id.
startObserved := waitForEvent(t, events, listenErrs, "start", result.ContainerID, 15*time.Second)
require.True(t, startObserved, "did not observe start event for container %s", result.ContainerID)
// Step 7 — InspectContainer returns running state.
inspectCtx, inspectCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer inspectCancel()
inspect, err := adapter.InspectContainer(inspectCtx, result.ContainerID)
require.NoError(t, err)
assert.Equal(t, "running", inspect.Status)
// Step 8 — Stop, then Remove, then InspectContainer must report
// not found.
stopCtx, stopCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer stopCancel()
require.NoError(t, adapter.Stop(stopCtx, result.ContainerID, 5*time.Second))
require.NoError(t, adapter.Remove(stopCtx, result.ContainerID))
if _, err := adapter.InspectContainer(stopCtx, result.ContainerID); !errors.Is(err, ports.ErrContainerNotFound) {
t.Fatalf("expected ErrContainerNotFound, got %v", err)
}
// Step 9 — terminate the events subscription cleanly.
listenCancel()
select {
case <-events:
case <-time.After(5 * time.Second):
t.Log("events channel did not close within timeout (best-effort)")
}
}
func waitForEvent(t *testing.T, events <-chan ports.DockerEvent, errs <-chan error, action, containerID string, timeout time.Duration) bool {
t.Helper()
deadline := time.After(timeout)
for {
select {
case ev, ok := <-events:
if !ok {
return false
}
if ev.Action == action && ev.ContainerID == containerID {
return true
}
case err := <-errs:
if err != nil {
t.Fatalf("events stream error: %v", err)
}
case <-deadline:
return false
}
}
}
@@ -0,0 +1,165 @@
// Package healtheventspublisher provides the Redis-Streams-backed
// publisher for `runtime:health_events`. Every Publish call upserts the
// latest `health_snapshots` row before XADDing the event so consumers
// observing the snapshot store can never lag the event stream by more
// than the duration of one network call.
//
// The publisher is shared across `ports.HealthEventPublisher` callers:
// the start service emits `container_started`; the probe, inspect, and
// events-listener workers emit the rest. All of them go through the
// same Publish method, so the snapshot-before-stream ordering holds
// for every event type.
package healtheventspublisher
import (
"context"
"encoding/json"
"errors"
"fmt"
"strconv"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/ports"
"github.com/redis/go-redis/v9"
)
// emptyDetails is the canonical JSON payload installed when the caller
// supplies an empty Details slice. Matches the SQL DEFAULT for
// `health_snapshots.details`.
const emptyDetails = "{}"
// Wire field names used by the Redis Streams payload. Frozen by
// `rtmanager/api/runtime-health-asyncapi.yaml`; renaming any of them
// breaks consumers.
const (
fieldGameID = "game_id"
fieldContainerID = "container_id"
fieldEventType = "event_type"
fieldOccurredAtMS = "occurred_at_ms"
fieldDetails = "details"
)
// Config groups the dependencies and stream name required to construct
// a Publisher.
type Config struct {
// Client appends entries to the Redis Stream. Must be non-nil.
Client *redis.Client
// Snapshots upserts the latest health snapshot. Must be non-nil.
Snapshots ports.HealthSnapshotStore
// Stream stores the Redis Stream key events are published to (e.g.
// `runtime:health_events`). Must not be empty.
Stream string
}
// Publisher implements `ports.HealthEventPublisher` on top of a shared
// Redis client and the production `health_snapshots` store.
type Publisher struct {
client *redis.Client
snapshots ports.HealthSnapshotStore
stream string
}
// NewPublisher constructs one Publisher from cfg. Validation errors
// surface the missing collaborator verbatim.
func NewPublisher(cfg Config) (*Publisher, error) {
if cfg.Client == nil {
return nil, errors.New("new rtmanager health events publisher: nil redis client")
}
if cfg.Snapshots == nil {
return nil, errors.New("new rtmanager health events publisher: nil snapshot store")
}
if cfg.Stream == "" {
return nil, errors.New("new rtmanager health events publisher: stream must not be empty")
}
return &Publisher{
client: cfg.Client,
snapshots: cfg.Snapshots,
stream: cfg.Stream,
}, nil
}
// Publish upserts the matching health_snapshots row and then XADDs the
// envelope to the configured Redis Stream. Both side effects are
// required; the snapshot upsert runs first so a successful Publish
// always leaves the snapshot store at least as fresh as the stream.
func (publisher *Publisher) Publish(ctx context.Context, envelope ports.HealthEventEnvelope) error {
if publisher == nil || publisher.client == nil || publisher.snapshots == nil {
return errors.New("publish health event: nil publisher")
}
if ctx == nil {
return errors.New("publish health event: nil context")
}
if err := envelope.Validate(); err != nil {
return fmt.Errorf("publish health event: %w", err)
}
details := envelope.Details
if len(details) == 0 {
details = json.RawMessage(emptyDetails)
}
status, source := snapshotMappingFor(envelope.EventType)
snapshot := health.HealthSnapshot{
GameID: envelope.GameID,
ContainerID: envelope.ContainerID,
Status: status,
Source: source,
Details: details,
ObservedAt: envelope.OccurredAt.UTC(),
}
if err := publisher.snapshots.Upsert(ctx, snapshot); err != nil {
return fmt.Errorf("publish health event: upsert snapshot: %w", err)
}
occurredAtMS := envelope.OccurredAt.UTC().UnixMilli()
values := map[string]any{
fieldGameID: envelope.GameID,
fieldContainerID: envelope.ContainerID,
fieldEventType: string(envelope.EventType),
fieldOccurredAtMS: strconv.FormatInt(occurredAtMS, 10),
fieldDetails: string(details),
}
if err := publisher.client.XAdd(ctx, &redis.XAddArgs{
Stream: publisher.stream,
Values: values,
}).Err(); err != nil {
return fmt.Errorf("publish health event: xadd: %w", err)
}
return nil
}
// snapshotMappingFor returns the SnapshotStatus and SnapshotSource that
// match eventType per `rtmanager/README.md §Health Monitoring`.
//
// `container_started` is observed when the start service successfully
// runs the container; the snapshot collapses it to `healthy`.
// `probe_recovered` collapses to `healthy` per
// `rtmanager/docs/domain-and-ports.md` §4: it does not have its own
// snapshot status; the next observation overwrites the prior
// `probe_failed` with `healthy`.
func snapshotMappingFor(eventType health.EventType) (health.SnapshotStatus, health.SnapshotSource) {
switch eventType {
case health.EventTypeContainerStarted:
return health.SnapshotStatusHealthy, health.SnapshotSourceDockerEvent
case health.EventTypeContainerExited:
return health.SnapshotStatusExited, health.SnapshotSourceDockerEvent
case health.EventTypeContainerOOM:
return health.SnapshotStatusOOM, health.SnapshotSourceDockerEvent
case health.EventTypeContainerDisappeared:
return health.SnapshotStatusContainerDisappeared, health.SnapshotSourceDockerEvent
case health.EventTypeInspectUnhealthy:
return health.SnapshotStatusInspectUnhealthy, health.SnapshotSourceInspect
case health.EventTypeProbeFailed:
return health.SnapshotStatusProbeFailed, health.SnapshotSourceProbe
case health.EventTypeProbeRecovered:
return health.SnapshotStatusHealthy, health.SnapshotSourceProbe
default:
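// Publish validates the envelope before consulting this mapping,
// so reaching this branch indicates event-type drift between the
// domain package and this adapter.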
return "", ""
}
}
// Compile-time assertion: Publisher implements
// ports.HealthEventPublisher.
var _ ports.HealthEventPublisher = (*Publisher)(nil)
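// decodedHealthEvent and consumeOnce form a hedged, illustrative sketch
// of the consumer side of this contract; nothing in the service uses
// them. A reader written against `runtime-health-asyncapi.yaml` could
// decode the frozen wire fields from one XRANGE page roughly like this.
// The comma-ok string assertions assume well-formed producer entries.
type decodedHealthEvent struct {
GameID string
ContainerID string
EventType string
OccurredAtMS int64
Details json.RawMessage
}
func consumeOnce(ctx context.Context, client *redis.Client, stream string) ([]decodedHealthEvent, error) {
entries, err := client.XRange(ctx, stream, "-", "+").Result()
if err != nil {
return nil, fmt.Errorf("read health events: %w", err)
}
out := make([]decodedHealthEvent, 0, len(entries))
for _, entry := range entries {
str := func(key string) string {
s, _ := entry.Values[key].(string)
return s
}
ms, err := strconv.ParseInt(str(fieldOccurredAtMS), 10, 64)
if err != nil {
return nil, fmt.Errorf("parse %s: %w", fieldOccurredAtMS, err)
}
out = append(out, decodedHealthEvent{
GameID: str(fieldGameID),
ContainerID: str(fieldContainerID),
EventType: str(fieldEventType),
OccurredAtMS: ms,
Details: json.RawMessage(str(fieldDetails)),
})
}
return out, nil
}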
@@ -0,0 +1,197 @@
package healtheventspublisher_test
import (
"context"
"encoding/json"
"strconv"
"sync"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/healtheventspublisher"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/ports"
"github.com/alicebob/miniredis/v2"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
// fakeSnapshots captures Upsert invocations for assertions.
type fakeSnapshots struct {
mu sync.Mutex
upserts []health.HealthSnapshot
upsertErr error
}
func (s *fakeSnapshots) Upsert(_ context.Context, snapshot health.HealthSnapshot) error {
s.mu.Lock()
defer s.mu.Unlock()
if s.upsertErr != nil {
return s.upsertErr
}
s.upserts = append(s.upserts, snapshot)
return nil
}
func (s *fakeSnapshots) Get(_ context.Context, _ string) (health.HealthSnapshot, error) {
return health.HealthSnapshot{}, nil
}
func newPublisher(t *testing.T, snapshots ports.HealthSnapshotStore) (*healtheventspublisher.Publisher, *miniredis.Miniredis, *redis.Client) {
t.Helper()
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
publisher, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{
Client: client,
Snapshots: snapshots,
Stream: "runtime:health_events",
})
require.NoError(t, err)
return publisher, server, client
}
func TestNewPublisherRejectsMissingCollaborators(t *testing.T) {
_, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{})
require.Error(t, err)
_, err = healtheventspublisher.NewPublisher(healtheventspublisher.Config{
Client: redis.NewClient(&redis.Options{Addr: "127.0.0.1:0"}),
})
require.Error(t, err)
_, err = healtheventspublisher.NewPublisher(healtheventspublisher.Config{
Client: redis.NewClient(&redis.Options{Addr: "127.0.0.1:0"}),
Snapshots: &fakeSnapshots{},
})
require.Error(t, err)
}
func TestPublishContainerStartedUpsertsHealthyAndXAdds(t *testing.T) {
snapshots := &fakeSnapshots{}
publisher, _, client := newPublisher(t, snapshots)
occurredAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
envelope := ports.HealthEventEnvelope{
GameID: "game-1",
ContainerID: "c-1",
EventType: health.EventTypeContainerStarted,
OccurredAt: occurredAt,
Details: json.RawMessage(`{"image_ref":"galaxy/game:1.2.3"}`),
}
require.NoError(t, publisher.Publish(context.Background(), envelope))
require.Len(t, snapshots.upserts, 1)
snapshot := snapshots.upserts[0]
assert.Equal(t, "game-1", snapshot.GameID)
assert.Equal(t, "c-1", snapshot.ContainerID)
assert.Equal(t, health.SnapshotStatusHealthy, snapshot.Status)
assert.Equal(t, health.SnapshotSourceDockerEvent, snapshot.Source)
assert.JSONEq(t, `{"image_ref":"galaxy/game:1.2.3"}`, string(snapshot.Details))
assert.Equal(t, occurredAt, snapshot.ObservedAt)
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
require.NoError(t, err)
require.Len(t, entries, 1)
values := entries[0].Values
assert.Equal(t, "game-1", values["game_id"])
assert.Equal(t, "c-1", values["container_id"])
assert.Equal(t, "container_started", values["event_type"])
assert.Equal(t, strconv.FormatInt(occurredAt.UnixMilli(), 10), values["occurred_at_ms"])
assert.JSONEq(t, `{"image_ref":"galaxy/game:1.2.3"}`, values["details"].(string))
}
func TestPublishMapsEveryEventTypeToASnapshot(t *testing.T) {
t.Parallel()
cases := []struct {
eventType health.EventType
expectStatus health.SnapshotStatus
expectSource health.SnapshotSource
}{
{health.EventTypeContainerStarted, health.SnapshotStatusHealthy, health.SnapshotSourceDockerEvent},
{health.EventTypeContainerExited, health.SnapshotStatusExited, health.SnapshotSourceDockerEvent},
{health.EventTypeContainerOOM, health.SnapshotStatusOOM, health.SnapshotSourceDockerEvent},
{health.EventTypeContainerDisappeared, health.SnapshotStatusContainerDisappeared, health.SnapshotSourceDockerEvent},
{health.EventTypeInspectUnhealthy, health.SnapshotStatusInspectUnhealthy, health.SnapshotSourceInspect},
{health.EventTypeProbeFailed, health.SnapshotStatusProbeFailed, health.SnapshotSourceProbe},
{health.EventTypeProbeRecovered, health.SnapshotStatusHealthy, health.SnapshotSourceProbe},
}
for _, tc := range cases {
t.Run(string(tc.eventType), func(t *testing.T) {
t.Parallel()
snapshots := &fakeSnapshots{}
publisher, _, _ := newPublisher(t, snapshots)
require.NoError(t, publisher.Publish(context.Background(), ports.HealthEventEnvelope{
GameID: "g",
ContainerID: "c",
EventType: tc.eventType,
OccurredAt: time.Now().UTC(),
Details: json.RawMessage(`{}`),
}))
require.Len(t, snapshots.upserts, 1)
assert.Equal(t, tc.expectStatus, snapshots.upserts[0].Status)
assert.Equal(t, tc.expectSource, snapshots.upserts[0].Source)
})
}
}
func TestPublishEmptyDetailsBecomesEmptyObject(t *testing.T) {
snapshots := &fakeSnapshots{}
publisher, _, client := newPublisher(t, snapshots)
envelope := ports.HealthEventEnvelope{
GameID: "g",
ContainerID: "c",
EventType: health.EventTypeContainerDisappeared,
OccurredAt: time.Now().UTC(),
}
require.NoError(t, publisher.Publish(context.Background(), envelope))
require.Len(t, snapshots.upserts, 1)
assert.JSONEq(t, "{}", string(snapshots.upserts[0].Details))
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
require.NoError(t, err)
require.Len(t, entries, 1)
assert.JSONEq(t, "{}", entries[0].Values["details"].(string))
}
func TestPublishRejectsInvalidEnvelope(t *testing.T) {
snapshots := &fakeSnapshots{}
publisher, _, client := newPublisher(t, snapshots)
require.Error(t, publisher.Publish(context.Background(), ports.HealthEventEnvelope{}))
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
require.NoError(t, err)
assert.Empty(t, entries)
assert.Empty(t, snapshots.upserts)
}
func TestPublishSurfacesSnapshotErrorWithoutXAdd(t *testing.T) {
snapshots := &fakeSnapshots{upsertErr: assertSentinelErr}
publisher, _, client := newPublisher(t, snapshots)
err := publisher.Publish(context.Background(), ports.HealthEventEnvelope{
GameID: "g",
ContainerID: "c",
EventType: health.EventTypeContainerStarted,
OccurredAt: time.Now().UTC(),
Details: json.RawMessage(`{"image_ref":"x"}`),
})
require.ErrorIs(t, err, assertSentinelErr)
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
require.NoError(t, err)
assert.Empty(t, entries, "xadd must not run when snapshot upsert fails")
}
// assertSentinelErr is a sentinel for snapshot-failure assertions.
var assertSentinelErr = sentinelError("snapshot upsert failure")
type sentinelError string
func (s sentinelError) Error() string { return string(s) }
@@ -0,0 +1,100 @@
// Package jobresultspublisher provides the Redis-Streams-backed
// publisher for `runtime:job_results`. The start-jobs and stop-jobs
// consumers call this adapter so every consumed envelope produces
// exactly one outcome entry on the result stream.
//
// The wire fields mirror the AsyncAPI schema frozen in
// `rtmanager/api/runtime-jobs-asyncapi.yaml`. Every field is XADDed
// even when empty so consumers can rely on the schema's required-field
// set.
package jobresultspublisher
import (
"context"
"errors"
"fmt"
"strings"
"galaxy/rtmanager/internal/ports"
"github.com/redis/go-redis/v9"
)
// Wire field names used by the Redis Streams payload. Frozen by
// `rtmanager/api/runtime-jobs-asyncapi.yaml`; renaming any of them
// breaks consumers.
const (
fieldGameID = "game_id"
fieldOutcome = "outcome"
fieldContainerID = "container_id"
fieldEngineEndpoint = "engine_endpoint"
fieldErrorCode = "error_code"
fieldErrorMessage = "error_message"
)
// Config groups the dependencies and stream name required to construct
// a Publisher.
type Config struct {
// Client appends entries to the Redis Stream. Must be non-nil.
Client *redis.Client
// Stream stores the Redis Stream key job results are published to
// (e.g. `runtime:job_results`). Must not be empty.
Stream string
}
// Publisher implements `ports.JobResultPublisher` on top of a shared
// Redis client.
type Publisher struct {
client *redis.Client
stream string
}
// NewPublisher constructs one Publisher from cfg. Validation errors
// surface the missing collaborator verbatim.
func NewPublisher(cfg Config) (*Publisher, error) {
if cfg.Client == nil {
return nil, errors.New("new rtmanager job results publisher: nil redis client")
}
if strings.TrimSpace(cfg.Stream) == "" {
return nil, errors.New("new rtmanager job results publisher: stream must not be empty")
}
return &Publisher{
client: cfg.Client,
stream: cfg.Stream,
}, nil
}
// Publish XADDs result to the configured Redis Stream. The wire payload
// includes every field declared as required by the AsyncAPI schema —
// empty strings are kept so consumers always see the documented keys.
func (publisher *Publisher) Publish(ctx context.Context, result ports.JobResult) error {
if publisher == nil || publisher.client == nil {
return errors.New("publish job result: nil publisher")
}
if ctx == nil {
return errors.New("publish job result: nil context")
}
if err := result.Validate(); err != nil {
return fmt.Errorf("publish job result: %w", err)
}
values := map[string]any{
fieldGameID: result.GameID,
fieldOutcome: result.Outcome,
fieldContainerID: result.ContainerID,
fieldEngineEndpoint: result.EngineEndpoint,
fieldErrorCode: result.ErrorCode,
fieldErrorMessage: result.ErrorMessage,
}
if err := publisher.client.XAdd(ctx, &redis.XAddArgs{
Stream: publisher.stream,
Values: values,
}).Err(); err != nil {
return fmt.Errorf("publish job result: xadd: %w", err)
}
return nil
}
// Compile-time assertion: Publisher implements ports.JobResultPublisher.
var _ ports.JobResultPublisher = (*Publisher)(nil)
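// consumedJobResult and decodeJobResult form a hedged, illustrative
// sketch of the consumer side (Game Lobby) of this contract; nothing in
// the service uses them. Because every required key is XADDed even when
// empty, a consumer can read each field unconditionally; the comma-ok
// assertions assume well-formed producer entries.
type consumedJobResult struct {
GameID string
Outcome string
ContainerID string
EngineEndpoint string
ErrorCode string
ErrorMessage string
}
func decodeJobResult(values map[string]any) consumedJobResult {
str := func(key string) string {
s, _ := values[key].(string)
return s
}
return consumedJobResult{
GameID: str(fieldGameID),
Outcome: str(fieldOutcome),
ContainerID: str(fieldContainerID),
EngineEndpoint: str(fieldEngineEndpoint),
ErrorCode: str(fieldErrorCode),
ErrorMessage: str(fieldErrorMessage),
}
}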
@@ -0,0 +1,142 @@
package jobresultspublisher_test
import (
"context"
"testing"
"galaxy/rtmanager/internal/adapters/jobresultspublisher"
"galaxy/rtmanager/internal/ports"
"github.com/alicebob/miniredis/v2"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func newPublisher(t *testing.T) (*jobresultspublisher.Publisher, *redis.Client) {
t.Helper()
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
publisher, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
Client: client,
Stream: "runtime:job_results",
})
require.NoError(t, err)
return publisher, client
}
func TestNewPublisherRejectsMissingCollaborators(t *testing.T) {
_, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{})
require.Error(t, err)
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
_, err = jobresultspublisher.NewPublisher(jobresultspublisher.Config{Client: client})
require.Error(t, err)
_, err = jobresultspublisher.NewPublisher(jobresultspublisher.Config{Client: client, Stream: " "})
require.Error(t, err)
}
func TestPublishRejectsInvalidResult(t *testing.T) {
publisher, _ := newPublisher(t)
require.Error(t, publisher.Publish(context.Background(), ports.JobResult{}))
require.Error(t, publisher.Publish(context.Background(), ports.JobResult{
GameID: "game-1",
Outcome: "weird",
}))
}
func TestPublishStartSuccessXAddsAllRequiredFields(t *testing.T) {
publisher, client := newPublisher(t)
result := ports.JobResult{
GameID: "game-1",
Outcome: ports.JobOutcomeSuccess,
ContainerID: "c-1",
EngineEndpoint: "http://galaxy-game-game-1:8080",
ErrorCode: "",
ErrorMessage: "",
}
require.NoError(t, publisher.Publish(context.Background(), result))
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
require.NoError(t, err)
require.Len(t, entries, 1)
values := entries[0].Values
assert.Equal(t, "game-1", values["game_id"])
assert.Equal(t, "success", values["outcome"])
assert.Equal(t, "c-1", values["container_id"])
assert.Equal(t, "http://galaxy-game-game-1:8080", values["engine_endpoint"])
assert.Equal(t, "", values["error_code"])
assert.Equal(t, "", values["error_message"])
}
func TestPublishFailureXAddsEmptyContainerAndEndpoint(t *testing.T) {
publisher, client := newPublisher(t)
result := ports.JobResult{
GameID: "game-2",
Outcome: ports.JobOutcomeFailure,
ErrorCode: "image_pull_failed",
ErrorMessage: "manifest unknown",
}
require.NoError(t, publisher.Publish(context.Background(), result))
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
require.NoError(t, err)
require.Len(t, entries, 1)
values := entries[0].Values
assert.Equal(t, "game-2", values["game_id"])
assert.Equal(t, "failure", values["outcome"])
assert.Equal(t, "", values["container_id"], "failure must publish empty container id")
assert.Equal(t, "", values["engine_endpoint"], "failure must publish empty engine endpoint")
assert.Equal(t, "image_pull_failed", values["error_code"])
assert.Equal(t, "manifest unknown", values["error_message"])
}
func TestPublishReplayNoOpKeepsContainerAndEndpoint(t *testing.T) {
publisher, client := newPublisher(t)
result := ports.JobResult{
GameID: "game-3",
Outcome: ports.JobOutcomeSuccess,
ContainerID: "c-3",
EngineEndpoint: "http://galaxy-game-game-3:8080",
ErrorCode: "replay_no_op",
}
require.NoError(t, publisher.Publish(context.Background(), result))
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
require.NoError(t, err)
require.Len(t, entries, 1)
values := entries[0].Values
assert.Equal(t, "game-3", values["game_id"])
assert.Equal(t, "success", values["outcome"])
assert.Equal(t, "c-3", values["container_id"])
assert.Equal(t, "http://galaxy-game-game-3:8080", values["engine_endpoint"])
assert.Equal(t, "replay_no_op", values["error_code"])
assert.Equal(t, "", values["error_message"])
}
func TestPublishFailsOnClosedClient(t *testing.T) {
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
publisher, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
Client: client,
Stream: "runtime:job_results",
})
require.NoError(t, err)
require.NoError(t, client.Close())
err = publisher.Publish(context.Background(), ports.JobResult{
GameID: "game-4",
Outcome: ports.JobOutcomeSuccess,
})
require.Error(t, err)
}
@@ -0,0 +1,219 @@
// Package lobbyclient provides the trusted-internal Lobby REST client
// Runtime Manager uses to fetch ancillary game metadata for diagnostics.
//
// The client is intentionally minimal: the GetGame fetch is ancillary
// diagnostics because the start envelope already carries the only
// required field (`image_ref`). A missing game surfaces as
// `ports.ErrLobbyGameNotFound` and every other failure as
// `ports.ErrLobbyUnavailable`, so callers can distinguish "not found"
// from transport faults and continue without aborting the start
// operation.
package lobbyclient
import (
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"net/http"
"net/url"
"strings"
"time"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"galaxy/rtmanager/internal/ports"
)
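// With BaseURL `http://lobby:8095` and game id `game-1`, the resolved
// request URL is `http://lobby:8095/api/v1/internal/games/game-1`.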
const (
getGamePathSuffix = "/api/v1/internal/games/%s"
)
// Config configures one HTTP-backed Lobby internal client.
type Config struct {
// BaseURL stores the absolute base URL of the Lobby internal HTTP
// listener (e.g. `http://lobby:8095`).
BaseURL string
// RequestTimeout bounds one outbound lookup request.
RequestTimeout time.Duration
}
// Client resolves Lobby game records through the trusted internal HTTP
// API.
type Client struct {
baseURL string
requestTimeout time.Duration
httpClient *http.Client
closeIdleConnections func()
}
type gameRecordEnvelope struct {
GameID string `json:"game_id"`
Status string `json:"status"`
TargetEngineVersion string `json:"target_engine_version"`
}
type errorEnvelope struct {
Error *errorBody `json:"error"`
}
type errorBody struct {
Code string `json:"code"`
Message string `json:"message"`
}
// NewClient constructs a Lobby internal client that uses
// repository-standard HTTP transport instrumentation through otelhttp.
// The cloned default transport keeps the production wiring isolated
// from caller-provided transports.
func NewClient(cfg Config) (*Client, error) {
transport, ok := http.DefaultTransport.(*http.Transport)
if !ok {
return nil, errors.New("new lobby internal client: default transport is not *http.Transport")
}
cloned := transport.Clone()
return newClient(cfg, &http.Client{Transport: otelhttp.NewTransport(cloned)}, cloned.CloseIdleConnections)
}
func newClient(cfg Config, httpClient *http.Client, closeIdleConnections func()) (*Client, error) {
switch {
case strings.TrimSpace(cfg.BaseURL) == "":
return nil, errors.New("new lobby internal client: base URL must not be empty")
case cfg.RequestTimeout <= 0:
return nil, errors.New("new lobby internal client: request timeout must be positive")
case httpClient == nil:
return nil, errors.New("new lobby internal client: http client must not be nil")
}
parsed, err := url.Parse(strings.TrimRight(strings.TrimSpace(cfg.BaseURL), "/"))
if err != nil {
return nil, fmt.Errorf("new lobby internal client: parse base URL: %w", err)
}
if parsed.Scheme == "" || parsed.Host == "" {
return nil, errors.New("new lobby internal client: base URL must be absolute")
}
return &Client{
baseURL: parsed.String(),
requestTimeout: cfg.RequestTimeout,
httpClient: httpClient,
closeIdleConnections: closeIdleConnections,
}, nil
}
// Close releases idle HTTP connections owned by the client transport.
// Call once on shutdown.
func (client *Client) Close() error {
if client == nil || client.closeIdleConnections == nil {
return nil
}
client.closeIdleConnections()
return nil
}
// GetGame returns the Lobby game record for gameID. It maps Lobby's
// `404 not_found` to `ports.ErrLobbyGameNotFound`; every other failure
// (transport, timeout, non-2xx response) maps to
// `ports.ErrLobbyUnavailable` wrapped with the original error so callers
// keep the diagnostic detail.
func (client *Client) GetGame(ctx context.Context, gameID string) (ports.LobbyGameRecord, error) {
if client == nil || client.httpClient == nil {
return ports.LobbyGameRecord{}, errors.New("lobby get game: nil client")
}
if ctx == nil {
return ports.LobbyGameRecord{}, errors.New("lobby get game: nil context")
}
if err := ctx.Err(); err != nil {
return ports.LobbyGameRecord{}, err
}
if strings.TrimSpace(gameID) == "" {
return ports.LobbyGameRecord{}, errors.New("lobby get game: game id must not be empty")
}
payload, statusCode, err := client.doRequest(ctx, http.MethodGet, fmt.Sprintf(getGamePathSuffix, url.PathEscape(gameID)))
if err != nil {
return ports.LobbyGameRecord{}, fmt.Errorf("%w: %w", ports.ErrLobbyUnavailable, err)
}
switch statusCode {
case http.StatusOK:
var envelope gameRecordEnvelope
if err := decodeJSONPayload(payload, &envelope); err != nil {
return ports.LobbyGameRecord{}, fmt.Errorf("%w: decode success response: %w", ports.ErrLobbyUnavailable, err)
}
if strings.TrimSpace(envelope.GameID) == "" {
return ports.LobbyGameRecord{}, fmt.Errorf("%w: success response missing game_id", ports.ErrLobbyUnavailable)
}
return ports.LobbyGameRecord{
GameID: envelope.GameID,
Status: envelope.Status,
TargetEngineVersion: envelope.TargetEngineVersion,
}, nil
case http.StatusNotFound:
return ports.LobbyGameRecord{}, ports.ErrLobbyGameNotFound
default:
errorCode := decodeErrorCode(payload)
if errorCode != "" {
return ports.LobbyGameRecord{}, fmt.Errorf("%w: unexpected status %d (error_code=%s)", ports.ErrLobbyUnavailable, statusCode, errorCode)
}
return ports.LobbyGameRecord{}, fmt.Errorf("%w: unexpected status %d", ports.ErrLobbyUnavailable, statusCode)
}
}
func (client *Client) doRequest(ctx context.Context, method, requestPath string) ([]byte, int, error) {
attemptCtx, cancel := context.WithTimeout(ctx, client.requestTimeout)
defer cancel()
req, err := http.NewRequestWithContext(attemptCtx, method, client.baseURL+requestPath, nil)
if err != nil {
return nil, 0, fmt.Errorf("build request: %w", err)
}
req.Header.Set("Accept", "application/json")
resp, err := client.httpClient.Do(req)
if err != nil {
return nil, 0, err
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, 0, fmt.Errorf("read response body: %w", err)
}
return body, resp.StatusCode, nil
}
// decodeJSONPayload tolerantly decodes a JSON object; unknown fields
// are ignored so additive Lobby schema changes do not break us, while
// trailing input after the first JSON value is rejected.
func decodeJSONPayload(payload []byte, target any) error {
decoder := json.NewDecoder(bytes.NewReader(payload))
if err := decoder.Decode(target); err != nil {
return err
}
if err := decoder.Decode(&struct{}{}); err != io.EOF {
if err == nil {
return errors.New("unexpected trailing JSON input")
}
return err
}
return nil
}
func decodeErrorCode(payload []byte) string {
if len(payload) == 0 {
return ""
}
var envelope errorEnvelope
if err := json.Unmarshal(payload, &envelope); err != nil {
return ""
}
if envelope.Error == nil {
return ""
}
return envelope.Error.Code
}
// Compile-time assertion: Client implements ports.LobbyInternalClient.
var _ ports.LobbyInternalClient = (*Client)(nil)
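// describeGame is a hedged, illustrative sketch of how a caller is
// expected to branch on the sentinel mapping documented on GetGame; it
// is not part of the client's API, and the returned strings are
// assumptions for illustration only.
func describeGame(ctx context.Context, client *Client, gameID string) string {
record, err := client.GetGame(ctx, gameID)
switch {
case errors.Is(err, ports.ErrLobbyGameNotFound):
// Lobby answered authoritatively: the game does not exist.
return "unknown game"
case errors.Is(err, ports.ErrLobbyUnavailable):
// Transport fault or unexpected response; the fetch is ancillary
// diagnostics, so callers continue without it.
return "lobby unavailable"
case err != nil:
// Bad input or canceled context.
return err.Error()
default:
return fmt.Sprintf("%s (target engine %s)", record.Status, record.TargetEngineVersion)
}
}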
@@ -0,0 +1,153 @@
package lobbyclient
import (
"context"
"errors"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"galaxy/rtmanager/internal/ports"
)
func newTestClient(t *testing.T, baseURL string, timeout time.Duration) *Client {
t.Helper()
client, err := NewClient(Config{BaseURL: baseURL, RequestTimeout: timeout})
require.NoError(t, err)
t.Cleanup(func() { _ = client.Close() })
return client
}
func TestNewClientValidatesConfig(t *testing.T) {
cases := map[string]Config{
"empty base url": {BaseURL: "", RequestTimeout: time.Second},
"non-absolute base url": {BaseURL: "lobby:8095", RequestTimeout: time.Second},
"non-positive timeout": {BaseURL: "http://lobby:8095", RequestTimeout: 0},
}
for name, cfg := range cases {
t.Run(name, func(t *testing.T) {
_, err := NewClient(cfg)
require.Error(t, err)
})
}
}
func TestGetGameSuccess(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
require.Equal(t, http.MethodGet, r.Method)
require.Equal(t, "/api/v1/internal/games/game-1", r.URL.Path)
require.Equal(t, "application/json", r.Header.Get("Accept"))
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte(`{
"game_id": "game-1",
"game_name": "Sample",
"status": "running",
"target_engine_version": "1.4.2",
"current_turn": 0,
"runtime_status": "running"
}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
got, err := client.GetGame(context.Background(), "game-1")
require.NoError(t, err)
assert.Equal(t, "game-1", got.GameID)
assert.Equal(t, "running", got.Status)
assert.Equal(t, "1.4.2", got.TargetEngineVersion)
}
func TestGetGameNotFound(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusNotFound)
_, _ = w.Write([]byte(`{"error":{"code":"not_found","message":"no such game"}}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
_, err := client.GetGame(context.Background(), "missing")
require.Error(t, err)
assert.True(t, errors.Is(err, ports.ErrLobbyGameNotFound))
assert.False(t, errors.Is(err, ports.ErrLobbyUnavailable))
}
func TestGetGameInternalErrorMapsToUnavailable(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusInternalServerError)
_, _ = w.Write([]byte(`{"error":{"code":"internal_error","message":"boom"}}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
_, err := client.GetGame(context.Background(), "x")
require.Error(t, err)
assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
assert.Contains(t, err.Error(), "500")
assert.Contains(t, err.Error(), "internal_error")
}
func TestGetGameTimeoutMapsToUnavailable(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
time.Sleep(150 * time.Millisecond)
_, _ = w.Write([]byte(`{}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, 50*time.Millisecond)
_, err := client.GetGame(context.Background(), "x")
require.Error(t, err)
assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
}
func TestGetGameSuccessMissingGameIDIsUnavailable(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_, _ = w.Write([]byte(`{"status":"running"}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
_, err := client.GetGame(context.Background(), "x")
require.Error(t, err)
assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
assert.Contains(t, err.Error(), "missing game_id")
}
func TestGetGameRejectsBadInput(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
t.Fatal("must not contact lobby on bad input")
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
t.Run("empty game id", func(t *testing.T) {
_, err := client.GetGame(context.Background(), " ")
require.Error(t, err)
assert.Contains(t, err.Error(), "game id")
})
t.Run("canceled context", func(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
cancel()
_, err := client.GetGame(ctx, "x")
require.Error(t, err)
assert.True(t, errors.Is(err, context.Canceled))
})
}
func TestCloseReleasesConnections(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
_, _ = w.Write([]byte(`{"game_id":"x","status":"running","target_engine_version":"1.0.0"}`))
}))
defer server.Close()
client := newTestClient(t, server.URL, time.Second)
_, err := client.GetGame(context.Background(), "x")
require.NoError(t, err)
assert.NoError(t, client.Close())
assert.NoError(t, client.Close()) // idempotent
}
@@ -0,0 +1,70 @@
// Package notificationpublisher provides the Redis-Streams-backed
// notification-intent publisher Runtime Manager uses to emit admin-only
// failure notifications. The adapter is a thin shim over
// `galaxy/notificationintent.Publisher` that drops the entry id at the
// wrapper boundary; rationale lives in
// `rtmanager/docs/domain-and-ports.md §7`.
package notificationpublisher
import (
"context"
"errors"
"fmt"
"github.com/redis/go-redis/v9"
"galaxy/notificationintent"
"galaxy/rtmanager/internal/ports"
)
// Config groups the dependencies and stream name required to
// construct a Publisher.
type Config struct {
// Client appends entries to Redis Streams. Must be non-nil.
Client *redis.Client
// Stream stores the Redis Stream key intents are published to.
// When empty, `notificationintent.DefaultIntentsStream` is used.
Stream string
}
// Publisher implements `ports.NotificationIntentPublisher` on top of
// the shared `notificationintent.Publisher`. The wrapper is the single
// point that drops the entry id returned by the underlying publisher.
type Publisher struct {
inner *notificationintent.Publisher
}
// NewPublisher constructs a Publisher from cfg. It wraps the shared
// publisher and delegates validation; transport errors and validation
// errors propagate verbatim.
func NewPublisher(cfg Config) (*Publisher, error) {
if cfg.Client == nil {
return nil, errors.New("new rtmanager notification publisher: nil redis client")
}
inner, err := notificationintent.NewPublisher(notificationintent.PublisherConfig{
Client: cfg.Client,
Stream: cfg.Stream,
})
if err != nil {
return nil, fmt.Errorf("new rtmanager notification publisher: %w", err)
}
return &Publisher{inner: inner}, nil
}
// Publish forwards intent to the underlying notificationintent
// publisher and discards the resulting Redis Stream entry id. A failed
// publish surfaces as the underlying error.
func (publisher *Publisher) Publish(ctx context.Context, intent notificationintent.Intent) error {
if publisher == nil || publisher.inner == nil {
return errors.New("publish notification intent: nil publisher")
}
if _, err := publisher.inner.Publish(ctx, intent); err != nil {
return err
}
return nil
}
// Compile-time assertion: Publisher implements
// ports.NotificationIntentPublisher.
var _ ports.NotificationIntentPublisher = (*Publisher)(nil)
@@ -0,0 +1,123 @@
package notificationpublisher
import (
"context"
"encoding/json"
"testing"
"time"
"github.com/alicebob/miniredis/v2"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"galaxy/notificationintent"
)
func newRedis(t *testing.T) (*redis.Client, *miniredis.Miniredis) {
t.Helper()
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
return client, server
}
func readStream(t *testing.T, client *redis.Client, stream string) []redis.XMessage {
t.Helper()
messages, err := client.XRange(context.Background(), stream, "-", "+").Result()
require.NoError(t, err)
return messages
}
func TestNewPublisherValidation(t *testing.T) {
t.Run("nil client", func(t *testing.T) {
_, err := NewPublisher(Config{})
require.Error(t, err)
assert.Contains(t, err.Error(), "nil redis client")
})
}
func TestPublisherWritesIntent(t *testing.T) {
client, _ := newRedis(t)
publisher, err := NewPublisher(Config{Client: client, Stream: "notification:intents"})
require.NoError(t, err)
intent, err := notificationintent.NewRuntimeImagePullFailedIntent(
notificationintent.Metadata{
IdempotencyKey: "rtmanager:start:game-1:abc",
OccurredAt: time.UnixMilli(1714200000000).UTC(),
},
notificationintent.RuntimeImagePullFailedPayload{
GameID: "game-1",
ImageRef: "galaxy/game:1.4.2",
ErrorCode: "image_pull_failed",
ErrorMessage: "registry timeout",
AttemptedAtMs: 1714200000000,
},
)
require.NoError(t, err)
require.NoError(t, publisher.Publish(context.Background(), intent))
messages := readStream(t, client, "notification:intents")
require.Len(t, messages, 1)
values := messages[0].Values
assert.Equal(t, "runtime.image_pull_failed", values["notification_type"])
assert.Equal(t, "runtime_manager", values["producer"])
assert.Equal(t, "admin_email", values["audience_kind"])
assert.Equal(t, "rtmanager:start:game-1:abc", values["idempotency_key"])
// recipient_user_ids_json must be absent for admin_email audience.
_, hasRecipients := values["recipient_user_ids_json"]
assert.False(t, hasRecipients)
payloadRaw, ok := values["payload_json"].(string)
require.True(t, ok)
var payload map[string]any
require.NoError(t, json.Unmarshal([]byte(payloadRaw), &payload))
assert.Equal(t, "game-1", payload["game_id"])
assert.Equal(t, "galaxy/game:1.4.2", payload["image_ref"])
}
func TestPublisherForwardsValidationError(t *testing.T) {
client, _ := newRedis(t)
publisher, err := NewPublisher(Config{Client: client})
require.NoError(t, err)
// Intent with a zero OccurredAt fails the shared validator.
bad := notificationintent.Intent{
NotificationType: notificationintent.NotificationTypeRuntimeImagePullFailed,
Producer: notificationintent.ProducerRuntimeManager,
AudienceKind: notificationintent.AudienceKindAdminEmail,
IdempotencyKey: "k",
PayloadJSON: `{"game_id":"g","image_ref":"r","error_code":"c","error_message":"m","attempted_at_ms":1}`,
}
require.Error(t, publisher.Publish(context.Background(), bad))
}
func TestPublisherDefaultsStreamName(t *testing.T) {
client, _ := newRedis(t)
publisher, err := NewPublisher(Config{Client: client, Stream: ""})
require.NoError(t, err)
intent, err := notificationintent.NewRuntimeContainerStartFailedIntent(
notificationintent.Metadata{
IdempotencyKey: "k",
OccurredAt: time.UnixMilli(1714200000000).UTC(),
},
notificationintent.RuntimeContainerStartFailedPayload{
GameID: "g",
ImageRef: "r",
ErrorCode: "container_start_failed",
ErrorMessage: "boom",
AttemptedAtMs: 1714200000000,
},
)
require.NoError(t, err)
require.NoError(t, publisher.Publish(context.Background(), intent))
messages := readStream(t, client, notificationintent.DefaultIntentsStream)
require.Len(t, messages, 1)
}
@@ -0,0 +1,203 @@
// Package healthsnapshotstore implements the PostgreSQL-backed adapter
// for `ports.HealthSnapshotStore`.
//
// The package owns the on-disk shape of the `health_snapshots` table
// defined in
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
// and translates the schema-agnostic `ports.HealthSnapshotStore` interface
// declared in `internal/ports/healthsnapshotstore.go` into concrete
// go-jet/v2 statements driven by the pgx driver.
//
// The `details` jsonb column round-trips as a `json.RawMessage`. Empty
// payloads are substituted with the SQL default `{}` on Upsert so the
// CHECK constraints and downstream readers never observe a non-JSON
// empty string.
package healthsnapshotstore
import (
"context"
"database/sql"
"encoding/json"
"errors"
"fmt"
"strings"
"time"
"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
pg "github.com/go-jet/jet/v2/postgres"
)
// emptyDetails is the canonical jsonb payload installed when the caller
// supplies an empty Details slice. It matches the SQL DEFAULT for the
// column.
const emptyDetails = "{}"
// Config configures one PostgreSQL-backed health-snapshot store instance.
type Config struct {
// DB stores the connection pool the store uses for every query.
DB *sql.DB
// OperationTimeout bounds one round trip.
OperationTimeout time.Duration
}
// Store persists Runtime Manager health snapshots in PostgreSQL.
type Store struct {
db *sql.DB
operationTimeout time.Duration
}
// New constructs one PostgreSQL-backed health-snapshot store from cfg.
func New(cfg Config) (*Store, error) {
if cfg.DB == nil {
return nil, errors.New("new postgres health snapshot store: db must not be nil")
}
if cfg.OperationTimeout <= 0 {
return nil, errors.New("new postgres health snapshot store: operation timeout must be positive")
}
return &Store{
db: cfg.DB,
operationTimeout: cfg.OperationTimeout,
}, nil
}
// healthSnapshotSelectColumns is the canonical SELECT list for the
// health_snapshots table, matching scanSnapshot's column order.
var healthSnapshotSelectColumns = pg.ColumnList{
pgtable.HealthSnapshots.GameID,
pgtable.HealthSnapshots.ContainerID,
pgtable.HealthSnapshots.Status,
pgtable.HealthSnapshots.Source,
pgtable.HealthSnapshots.Details,
pgtable.HealthSnapshots.ObservedAt,
}
// Upsert installs snapshot as the latest observation for snapshot.GameID.
// snapshot is validated through health.HealthSnapshot.Validate before the
// SQL is issued.
func (store *Store) Upsert(ctx context.Context, snapshot health.HealthSnapshot) error {
if store == nil || store.db == nil {
return errors.New("upsert health snapshot: nil store")
}
if err := snapshot.Validate(); err != nil {
return fmt.Errorf("upsert health snapshot: %w", err)
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "upsert health snapshot", store.operationTimeout)
if err != nil {
return err
}
defer cancel()
details := emptyDetails
if len(snapshot.Details) > 0 {
details = string(snapshot.Details)
}
stmt := pgtable.HealthSnapshots.INSERT(
pgtable.HealthSnapshots.GameID,
pgtable.HealthSnapshots.ContainerID,
pgtable.HealthSnapshots.Status,
pgtable.HealthSnapshots.Source,
pgtable.HealthSnapshots.Details,
pgtable.HealthSnapshots.ObservedAt,
).VALUES(
snapshot.GameID,
snapshot.ContainerID,
string(snapshot.Status),
string(snapshot.Source),
details,
snapshot.ObservedAt.UTC(),
).ON_CONFLICT(pgtable.HealthSnapshots.GameID).DO_UPDATE(
pg.SET(
pgtable.HealthSnapshots.ContainerID.SET(pgtable.HealthSnapshots.EXCLUDED.ContainerID),
pgtable.HealthSnapshots.Status.SET(pgtable.HealthSnapshots.EXCLUDED.Status),
pgtable.HealthSnapshots.Source.SET(pgtable.HealthSnapshots.EXCLUDED.Source),
pgtable.HealthSnapshots.Details.SET(pgtable.HealthSnapshots.EXCLUDED.Details),
pgtable.HealthSnapshots.ObservedAt.SET(pgtable.HealthSnapshots.EXCLUDED.ObservedAt),
),
)
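// The builder above renders to roughly the following SQL (a sketch;
// go-jet's exact identifier quoting and placeholder layout may differ):
//
//   INSERT INTO health_snapshots
//     (game_id, container_id, status, source, details, observed_at)
//   VALUES ($1, $2, $3, $4, $5, $6)
//   ON CONFLICT (game_id) DO UPDATE SET
//     container_id = EXCLUDED.container_id,
//     status = EXCLUDED.status,
//     source = EXCLUDED.source,
//     details = EXCLUDED.details,
//     observed_at = EXCLUDED.observed_at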
query, args := stmt.Sql()
if _, err := store.db.ExecContext(operationCtx, query, args...); err != nil {
return fmt.Errorf("upsert health snapshot: %w", err)
}
return nil
}
// Get returns the latest snapshot for gameID. It returns
// runtime.ErrNotFound when no snapshot has been recorded yet.
func (store *Store) Get(ctx context.Context, gameID string) (health.HealthSnapshot, error) {
if store == nil || store.db == nil {
return health.HealthSnapshot{}, errors.New("get health snapshot: nil store")
}
if strings.TrimSpace(gameID) == "" {
return health.HealthSnapshot{}, fmt.Errorf("get health snapshot: game id must not be empty")
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "get health snapshot", store.operationTimeout)
if err != nil {
return health.HealthSnapshot{}, err
}
defer cancel()
stmt := pg.SELECT(healthSnapshotSelectColumns).
FROM(pgtable.HealthSnapshots).
WHERE(pgtable.HealthSnapshots.GameID.EQ(pg.String(gameID)))
query, args := stmt.Sql()
row := store.db.QueryRowContext(operationCtx, query, args...)
snapshot, err := scanSnapshot(row)
if sqlx.IsNoRows(err) {
return health.HealthSnapshot{}, runtime.ErrNotFound
}
if err != nil {
return health.HealthSnapshot{}, fmt.Errorf("get health snapshot: %w", err)
}
return snapshot, nil
}
// rowScanner abstracts *sql.Row and *sql.Rows so scanSnapshot can be
// shared across both single-row reads and iterated reads.
type rowScanner interface {
Scan(dest ...any) error
}
// scanSnapshot scans one health_snapshots row from rs.
func scanSnapshot(rs rowScanner) (health.HealthSnapshot, error) {
var (
gameID string
containerID string
status string
source string
details []byte
observedAt time.Time
)
if err := rs.Scan(
&gameID,
&containerID,
&status,
&source,
&details,
&observedAt,
); err != nil {
return health.HealthSnapshot{}, err
}
return health.HealthSnapshot{
GameID: gameID,
ContainerID: containerID,
Status: health.SnapshotStatus(status),
Source: health.SnapshotSource(source),
Details: json.RawMessage(details),
ObservedAt: observedAt.UTC(),
}, nil
}
// Ensure Store satisfies the ports.HealthSnapshotStore interface at
// compile time.
var _ ports.HealthSnapshotStore = (*Store)(nil)
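// latestOrZero is a hedged, illustrative sketch of the read-path
// contract; nothing in the service uses it. A missing row surfaces as
// runtime.ErrNotFound, which callers may treat as "no observation yet"
// rather than a failure.
func latestOrZero(ctx context.Context, store *Store, gameID string) (health.HealthSnapshot, bool, error) {
snapshot, err := store.Get(ctx, gameID)
switch {
case errors.Is(err, runtime.ErrNotFound):
return health.HealthSnapshot{}, false, nil
case err != nil:
return health.HealthSnapshot{}, false, err
default:
return snapshot, true, nil
}
}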
@@ -0,0 +1,157 @@
package healthsnapshotstore_test
import (
"context"
"encoding/json"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
"galaxy/rtmanager/internal/domain/health"
"galaxy/rtmanager/internal/domain/runtime"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestMain(m *testing.M) { pgtest.RunMain(m) }
func newStore(t *testing.T) *healthsnapshotstore.Store {
t.Helper()
pgtest.TruncateAll(t)
store, err := healthsnapshotstore.New(healthsnapshotstore.Config{
DB: pgtest.Ensure(t).Pool(),
OperationTimeout: pgtest.OperationTimeout,
})
require.NoError(t, err)
return store
}
func probeFailedSnapshot(gameID string, observedAt time.Time) health.HealthSnapshot {
return health.HealthSnapshot{
GameID: gameID,
ContainerID: "container-1",
Status: health.SnapshotStatusProbeFailed,
Source: health.SnapshotSourceProbe,
Details: json.RawMessage(`{"consecutive_failures":3,"last_status":503,"last_error":"timeout"}`),
ObservedAt: observedAt,
}
}
func TestUpsertAndGetRoundTrip(t *testing.T) {
ctx := context.Background()
store := newStore(t)
snapshot := probeFailedSnapshot("game-001",
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
require.NoError(t, store.Upsert(ctx, snapshot))
got, err := store.Get(ctx, "game-001")
require.NoError(t, err)
assert.Equal(t, snapshot.GameID, got.GameID)
assert.Equal(t, snapshot.ContainerID, got.ContainerID)
assert.Equal(t, snapshot.Status, got.Status)
assert.Equal(t, snapshot.Source, got.Source)
assert.JSONEq(t, string(snapshot.Details), string(got.Details))
assert.True(t, snapshot.ObservedAt.Equal(got.ObservedAt))
assert.Equal(t, time.UTC, got.ObservedAt.Location())
}
func TestUpsertOverwritesPriorSnapshot(t *testing.T) {
ctx := context.Background()
store := newStore(t)
first := probeFailedSnapshot("game-001",
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
require.NoError(t, store.Upsert(ctx, first))
second := health.HealthSnapshot{
GameID: "game-001",
ContainerID: "container-2",
Status: health.SnapshotStatusHealthy,
Source: health.SnapshotSourceInspect,
Details: json.RawMessage(`{"restart_count":0,"state":"running"}`),
ObservedAt: first.ObservedAt.Add(time.Minute),
}
require.NoError(t, store.Upsert(ctx, second))
got, err := store.Get(ctx, "game-001")
require.NoError(t, err)
assert.Equal(t, "container-2", got.ContainerID)
assert.Equal(t, health.SnapshotStatusHealthy, got.Status)
assert.Equal(t, health.SnapshotSourceInspect, got.Source)
assert.JSONEq(t, string(second.Details), string(got.Details))
assert.True(t, second.ObservedAt.Equal(got.ObservedAt))
}
func TestGetReturnsNotFound(t *testing.T) {
ctx := context.Background()
store := newStore(t)
_, err := store.Get(ctx, "game-missing")
require.ErrorIs(t, err, runtime.ErrNotFound)
}
func TestUpsertEmptyDetailsRoundTripsAsEmptyObject(t *testing.T) {
ctx := context.Background()
store := newStore(t)
snapshot := probeFailedSnapshot("game-001",
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
snapshot.Details = nil
require.NoError(t, store.Upsert(ctx, snapshot))
got, err := store.Get(ctx, "game-001")
require.NoError(t, err)
assert.JSONEq(t, "{}", string(got.Details),
"empty json.RawMessage must round-trip as the SQL default {}, got %q",
string(got.Details))
}
func TestUpsertValidatesSnapshot(t *testing.T) {
ctx := context.Background()
store := newStore(t)
tests := []struct {
name string
mutate func(*health.HealthSnapshot)
}{
{"empty game id", func(s *health.HealthSnapshot) { s.GameID = "" }},
{"unknown status", func(s *health.HealthSnapshot) { s.Status = "exotic" }},
{"unknown source", func(s *health.HealthSnapshot) { s.Source = "exotic" }},
{"zero observed at", func(s *health.HealthSnapshot) { s.ObservedAt = time.Time{} }},
{"invalid json details", func(s *health.HealthSnapshot) {
s.Details = json.RawMessage("not json")
}},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
snapshot := probeFailedSnapshot("game-001",
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
tt.mutate(&snapshot)
err := store.Upsert(ctx, snapshot)
require.Error(t, err)
})
}
}
func TestGetRejectsEmptyGameID(t *testing.T) {
ctx := context.Background()
store := newStore(t)
_, err := store.Get(ctx, "")
require.Error(t, err)
}
func TestNewRejectsNilDB(t *testing.T) {
_, err := healthsnapshotstore.New(healthsnapshotstore.Config{OperationTimeout: time.Second})
require.Error(t, err)
}
func TestNewRejectsNonPositiveTimeout(t *testing.T) {
_, err := healthsnapshotstore.New(healthsnapshotstore.Config{
DB: pgtest.Ensure(t).Pool(),
})
require.Error(t, err)
}
@@ -0,0 +1,209 @@
// Package pgtest exposes the testcontainers-backed PostgreSQL bootstrap
// shared by every Runtime Manager PG adapter test. The package is regular
// Go code — not a `_test.go` file — so it can be imported by the
// `_test.go` files in the three sibling store packages
// (`runtimerecordstore`, `operationlogstore`, `healthsnapshotstore`).
//
// No production code in `cmd/rtmanager` or in the runtime imports this
// package. The testcontainers-go dependency therefore stays out of the
// production binary's import graph.
package pgtest
import (
"context"
"database/sql"
"net/url"
"os"
"sync"
"testing"
"time"
"galaxy/postgres"
"galaxy/rtmanager/internal/adapters/postgres/migrations"
testcontainers "github.com/testcontainers/testcontainers-go"
tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
"github.com/testcontainers/testcontainers-go/wait"
)
const (
postgresImage = "postgres:16-alpine"
superUser = "galaxy"
superPassword = "galaxy"
superDatabase = "galaxy_rtmanager"
serviceRole = "rtmanagerservice"
servicePassword = "rtmanagerservice"
serviceSchema = "rtmanager"
containerStartup = 90 * time.Second
// OperationTimeout is the per-statement timeout used by every store
// constructed via the per-package newStore helpers. Tests may pass a
// smaller value if they need to assert deadline behaviour explicitly.
OperationTimeout = 10 * time.Second
)
// Env holds the per-process container plus the *sql.DB pool already
// provisioned with the rtmanager schema, role, and migrations applied.
type Env struct {
container *tcpostgres.PostgresContainer
pool *sql.DB
}
// Pool returns the shared pool. Tests truncate per-table state before
// each run via TruncateAll.
func (env *Env) Pool() *sql.DB { return env.pool }
var (
once sync.Once
cur *Env
curErr error
)
// Ensure starts the PostgreSQL container on first invocation and applies
// the embedded goose migrations. Subsequent invocations reuse the same
// container/pool. When Docker is unavailable Ensure calls t.Skipf with
// the underlying error so the test suite still passes on machines
// without Docker.
func Ensure(t testing.TB) *Env {
t.Helper()
once.Do(func() {
cur, curErr = start()
})
if curErr != nil {
t.Skipf("postgres container start failed (Docker unavailable?): %v", curErr)
}
return cur
}
// TruncateAll wipes every Runtime Manager table inside the shared pool,
// leaving the schema and indexes intact. Use it from each test that needs
// a clean slate.
func TruncateAll(t testing.TB) {
t.Helper()
env := Ensure(t)
const stmt = `TRUNCATE TABLE runtime_records, operation_log, health_snapshots RESTART IDENTITY CASCADE`
if _, err := env.pool.ExecContext(context.Background(), stmt); err != nil {
t.Fatalf("truncate rtmanager tables: %v", err)
}
}
// Shutdown terminates the shared container and closes the pool. It is
// invoked from each test package's TestMain after `m.Run` returns so the
// container is released even when individual tests fail.
func Shutdown() {
if cur == nil {
return
}
if cur.pool != nil {
_ = cur.pool.Close()
}
if cur.container != nil {
_ = testcontainers.TerminateContainer(cur.container)
}
cur = nil
}
// RunMain is a convenience helper for each store package's TestMain: it
// runs the suite via m.Run, captures the exit code, shuts the container
// down, and exits with that code. Wiring it through one helper keeps
// every TestMain to a single line.
func RunMain(m *testing.M) {
code := m.Run()
Shutdown()
os.Exit(code)
}
func start() (*Env, error) {
ctx := context.Background()
container, err := tcpostgres.Run(ctx, postgresImage,
tcpostgres.WithDatabase(superDatabase),
tcpostgres.WithUsername(superUser),
tcpostgres.WithPassword(superPassword),
testcontainers.WithWaitStrategy(
wait.ForLog("database system is ready to accept connections").
WithOccurrence(2).
WithStartupTimeout(containerStartup),
),
)
if err != nil {
return nil, err
}
baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
scopedDSN, err := dsnForServiceRole(baseDSN)
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = scopedDSN
cfg.OperationTimeout = OperationTimeout
pool, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := postgres.Ping(ctx, pool, OperationTimeout); err != nil {
_ = pool.Close()
_ = testcontainers.TerminateContainer(container)
return nil, err
}
if err := postgres.RunMigrations(ctx, pool, migrations.FS(), "."); err != nil {
_ = pool.Close()
_ = testcontainers.TerminateContainer(container)
return nil, err
}
return &Env{container: container, pool: pool}, nil
}
func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
cfg := postgres.DefaultConfig()
cfg.PrimaryDSN = baseDSN
cfg.OperationTimeout = OperationTimeout
db, err := postgres.OpenPrimary(ctx, cfg)
if err != nil {
return err
}
defer func() { _ = db.Close() }()
statements := []string{
`DO $$ BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'rtmanagerservice') THEN
CREATE ROLE rtmanagerservice LOGIN PASSWORD 'rtmanagerservice';
END IF;
END $$;`,
`CREATE SCHEMA IF NOT EXISTS rtmanager AUTHORIZATION rtmanagerservice;`,
`GRANT USAGE ON SCHEMA rtmanager TO rtmanagerservice;`,
}
for _, statement := range statements {
if _, err := db.ExecContext(ctx, statement); err != nil {
return err
}
}
return nil
}
func dsnForServiceRole(baseDSN string) (string, error) {
parsed, err := url.Parse(baseDSN)
if err != nil {
return "", err
}
values := url.Values{}
values.Set("search_path", serviceSchema)
values.Set("sslmode", "disable")
scoped := url.URL{
Scheme: parsed.Scheme,
User: url.UserPassword(serviceRole, servicePassword),
Host: parsed.Host,
Path: parsed.Path,
RawQuery: values.Encode(),
}
return scoped.String(), nil
}
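// Canonical per-package wiring, as an illustrative sketch mirroring the
// three sibling store test packages (`somestore` is a placeholder name):
//
//	func TestMain(m *testing.M) { pgtest.RunMain(m) }
//
//	func newStore(t *testing.T) *somestore.Store {
//	    t.Helper()
//	    pgtest.TruncateAll(t) // starts the container on first use via Ensure
//	    store, err := somestore.New(somestore.Config{
//	        DB:               pgtest.Ensure(t).Pool(),
//	        OperationTimeout: pgtest.OperationTimeout,
//	    })
//	    require.NoError(t, err)
//	    return store
//	}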
@@ -0,0 +1,112 @@
// Package sqlx contains the small set of helpers shared by every Runtime
// Manager PostgreSQL adapter (runtimerecordstore, operationlogstore,
// healthsnapshotstore). The helpers centralise the boundary translations
// for nullable timestamps and the pgx SQLSTATE codes the adapters
// interpret as domain conflicts.
package sqlx
import (
"context"
"database/sql"
"errors"
"fmt"
"time"
"github.com/jackc/pgx/v5/pgconn"
)
// PgUniqueViolationCode identifies the SQLSTATE returned by PostgreSQL
// when a UNIQUE constraint is violated by INSERT or UPDATE.
const PgUniqueViolationCode = "23505"
// IsUniqueViolation reports whether err is a PostgreSQL unique-violation,
// regardless of constraint name.
func IsUniqueViolation(err error) bool {
var pgErr *pgconn.PgError
if !errors.As(err, &pgErr) {
return false
}
return pgErr.Code == PgUniqueViolationCode
}
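// One way a store can fold this into its error translation, sketched under
// the assumption that the caller maps unique violations to the domain
// conflict sentinel (runtime.ErrConflict in this service):
//
//	if _, err := db.ExecContext(ctx, query, args...); err != nil {
//	    if sqlx.IsUniqueViolation(err) {
//	        return runtime.ErrConflict
//	    }
//	    return fmt.Errorf("insert runtime record: %w", err)
//	}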
// IsNoRows reports whether err is sql.ErrNoRows.
func IsNoRows(err error) bool {
return errors.Is(err, sql.ErrNoRows)
}
// NullableTime returns t.UTC() when non-zero, otherwise nil so the column
// is bound as SQL NULL.
func NullableTime(t time.Time) any {
if t.IsZero() {
return nil
}
return t.UTC()
}
// NullableTimePtr returns t.UTC() when t is non-nil and non-zero, otherwise
// nil. Companion of NullableTime for domain types that use *time.Time to
// express absent timestamps.
func NullableTimePtr(t *time.Time) any {
if t == nil {
return nil
}
return NullableTime(*t)
}
// NullableString returns value when non-empty, otherwise nil so the column
// is bound as SQL NULL. Used for Runtime Manager columns that map empty
// domain strings to NULL (current_container_id, current_image_ref).
func NullableString(value string) any {
if value == "" {
return nil
}
return value
}
// StringFromNullable copies an optional sql.NullString into a domain
// string. NULL becomes the empty string, matching the Runtime Manager
// domain convention that empty == NULL for nullable text columns.
func StringFromNullable(value sql.NullString) string {
if !value.Valid {
return ""
}
return value.String
}
// TimeFromNullable copies an optional sql.NullTime into a domain
// time.Time, applying the global UTC normalisation rule. NULL values
// become the zero time.Time.
func TimeFromNullable(value sql.NullTime) time.Time {
if !value.Valid {
return time.Time{}
}
return value.Time.UTC()
}
// TimePtrFromNullable copies an optional sql.NullTime into a domain
// *time.Time. NULL becomes nil; non-NULL values are wrapped after UTC
// normalisation.
func TimePtrFromNullable(value sql.NullTime) *time.Time {
if !value.Valid {
return nil
}
t := value.Time.UTC()
return &t
}
// WithTimeout derives a child context bounded by timeout and prefixes
// context errors with operation. On success callers must always invoke
// the returned cancel; on error the returned cancel is nil and must not
// be called.
func WithTimeout(ctx context.Context, operation string, timeout time.Duration) (context.Context, context.CancelFunc, error) {
if ctx == nil {
return nil, nil, fmt.Errorf("%s: nil context", operation)
}
if err := ctx.Err(); err != nil {
return nil, nil, fmt.Errorf("%s: %w", operation, err)
}
if timeout <= 0 {
return nil, nil, fmt.Errorf("%s: operation timeout must be positive", operation)
}
bounded, cancel := context.WithTimeout(ctx, timeout)
return bounded, cancel, nil
}
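// Caller-side pattern, copied in spirit from the stores in this adapter
// tree (see runtimerecordstore.Get for the real thing):
//
//	operationCtx, cancel, err := sqlx.WithTimeout(ctx, "get runtime record", store.operationTimeout)
//	if err != nil {
//	    return runtime.RuntimeRecord{}, err // cancel is nil on error; do not call it
//	}
//	defer cancel()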
@@ -0,0 +1,19 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package model
import (
"time"
)
type GooseDbVersion struct {
ID int32 `sql:"primary_key"`
VersionID int64
IsApplied bool
Tstamp time.Time
}
@@ -0,0 +1,21 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package model
import (
"time"
)
type HealthSnapshots struct {
GameID string `sql:"primary_key"`
ContainerID string
Status string
Source string
Details string
ObservedAt time.Time
}
@@ -0,0 +1,27 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package model
import (
"time"
)
type OperationLog struct {
ID int64 `sql:"primary_key"`
GameID string
OpKind string
OpSource string
SourceRef string
ImageRef string
ContainerID string
Outcome string
ErrorCode string
ErrorMessage string
StartedAt time.Time
FinishedAt *time.Time
}
@@ -0,0 +1,27 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package model
import (
"time"
)
type RuntimeRecords struct {
GameID string `sql:"primary_key"`
Status string
CurrentContainerID *string
CurrentImageRef *string
EngineEndpoint string
StatePath string
DockerNetwork string
StartedAt *time.Time
StoppedAt *time.Time
RemovedAt *time.Time
LastOpAt time.Time
CreatedAt time.Time
}
@@ -0,0 +1,87 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package table
import (
"github.com/go-jet/jet/v2/postgres"
)
var GooseDbVersion = newGooseDbVersionTable("rtmanager", "goose_db_version", "")
type gooseDbVersionTable struct {
postgres.Table
// Columns
ID postgres.ColumnInteger
VersionID postgres.ColumnInteger
IsApplied postgres.ColumnBool
Tstamp postgres.ColumnTimestamp
AllColumns postgres.ColumnList
MutableColumns postgres.ColumnList
DefaultColumns postgres.ColumnList
}
type GooseDbVersionTable struct {
gooseDbVersionTable
EXCLUDED gooseDbVersionTable
}
// AS creates new GooseDbVersionTable with assigned alias
func (a GooseDbVersionTable) AS(alias string) *GooseDbVersionTable {
return newGooseDbVersionTable(a.SchemaName(), a.TableName(), alias)
}
// Schema creates new GooseDbVersionTable with assigned schema name
func (a GooseDbVersionTable) FromSchema(schemaName string) *GooseDbVersionTable {
return newGooseDbVersionTable(schemaName, a.TableName(), a.Alias())
}
// WithPrefix creates new GooseDbVersionTable with assigned table prefix
func (a GooseDbVersionTable) WithPrefix(prefix string) *GooseDbVersionTable {
return newGooseDbVersionTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}
// WithSuffix creates new GooseDbVersionTable with assigned table suffix
func (a GooseDbVersionTable) WithSuffix(suffix string) *GooseDbVersionTable {
return newGooseDbVersionTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}
func newGooseDbVersionTable(schemaName, tableName, alias string) *GooseDbVersionTable {
return &GooseDbVersionTable{
gooseDbVersionTable: newGooseDbVersionTableImpl(schemaName, tableName, alias),
EXCLUDED: newGooseDbVersionTableImpl("", "excluded", ""),
}
}
func newGooseDbVersionTableImpl(schemaName, tableName, alias string) gooseDbVersionTable {
var (
IDColumn = postgres.IntegerColumn("id")
VersionIDColumn = postgres.IntegerColumn("version_id")
IsAppliedColumn = postgres.BoolColumn("is_applied")
TstampColumn = postgres.TimestampColumn("tstamp")
allColumns = postgres.ColumnList{IDColumn, VersionIDColumn, IsAppliedColumn, TstampColumn}
mutableColumns = postgres.ColumnList{VersionIDColumn, IsAppliedColumn, TstampColumn}
defaultColumns = postgres.ColumnList{TstampColumn}
)
return gooseDbVersionTable{
Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),
//Columns
ID: IDColumn,
VersionID: VersionIDColumn,
IsApplied: IsAppliedColumn,
Tstamp: TstampColumn,
AllColumns: allColumns,
MutableColumns: mutableColumns,
DefaultColumns: defaultColumns,
}
}
@@ -0,0 +1,93 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package table
import (
"github.com/go-jet/jet/v2/postgres"
)
var HealthSnapshots = newHealthSnapshotsTable("rtmanager", "health_snapshots", "")
type healthSnapshotsTable struct {
postgres.Table
// Columns
GameID postgres.ColumnString
ContainerID postgres.ColumnString
Status postgres.ColumnString
Source postgres.ColumnString
Details postgres.ColumnString
ObservedAt postgres.ColumnTimestampz
AllColumns postgres.ColumnList
MutableColumns postgres.ColumnList
DefaultColumns postgres.ColumnList
}
type HealthSnapshotsTable struct {
healthSnapshotsTable
EXCLUDED healthSnapshotsTable
}
// AS creates new HealthSnapshotsTable with assigned alias
func (a HealthSnapshotsTable) AS(alias string) *HealthSnapshotsTable {
return newHealthSnapshotsTable(a.SchemaName(), a.TableName(), alias)
}
// Schema creates new HealthSnapshotsTable with assigned schema name
func (a HealthSnapshotsTable) FromSchema(schemaName string) *HealthSnapshotsTable {
return newHealthSnapshotsTable(schemaName, a.TableName(), a.Alias())
}
// WithPrefix creates new HealthSnapshotsTable with assigned table prefix
func (a HealthSnapshotsTable) WithPrefix(prefix string) *HealthSnapshotsTable {
return newHealthSnapshotsTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}
// WithSuffix creates new HealthSnapshotsTable with assigned table suffix
func (a HealthSnapshotsTable) WithSuffix(suffix string) *HealthSnapshotsTable {
return newHealthSnapshotsTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}
func newHealthSnapshotsTable(schemaName, tableName, alias string) *HealthSnapshotsTable {
return &HealthSnapshotsTable{
healthSnapshotsTable: newHealthSnapshotsTableImpl(schemaName, tableName, alias),
EXCLUDED: newHealthSnapshotsTableImpl("", "excluded", ""),
}
}
func newHealthSnapshotsTableImpl(schemaName, tableName, alias string) healthSnapshotsTable {
var (
GameIDColumn = postgres.StringColumn("game_id")
ContainerIDColumn = postgres.StringColumn("container_id")
StatusColumn = postgres.StringColumn("status")
SourceColumn = postgres.StringColumn("source")
DetailsColumn = postgres.StringColumn("details")
ObservedAtColumn = postgres.TimestampzColumn("observed_at")
allColumns = postgres.ColumnList{GameIDColumn, ContainerIDColumn, StatusColumn, SourceColumn, DetailsColumn, ObservedAtColumn}
mutableColumns = postgres.ColumnList{ContainerIDColumn, StatusColumn, SourceColumn, DetailsColumn, ObservedAtColumn}
defaultColumns = postgres.ColumnList{ContainerIDColumn, DetailsColumn}
)
return healthSnapshotsTable{
Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),
//Columns
GameID: GameIDColumn,
ContainerID: ContainerIDColumn,
Status: StatusColumn,
Source: SourceColumn,
Details: DetailsColumn,
ObservedAt: ObservedAtColumn,
AllColumns: allColumns,
MutableColumns: mutableColumns,
DefaultColumns: defaultColumns,
}
}
@@ -0,0 +1,111 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package table
import (
"github.com/go-jet/jet/v2/postgres"
)
var OperationLog = newOperationLogTable("rtmanager", "operation_log", "")
type operationLogTable struct {
postgres.Table
// Columns
ID postgres.ColumnInteger
GameID postgres.ColumnString
OpKind postgres.ColumnString
OpSource postgres.ColumnString
SourceRef postgres.ColumnString
ImageRef postgres.ColumnString
ContainerID postgres.ColumnString
Outcome postgres.ColumnString
ErrorCode postgres.ColumnString
ErrorMessage postgres.ColumnString
StartedAt postgres.ColumnTimestampz
FinishedAt postgres.ColumnTimestampz
AllColumns postgres.ColumnList
MutableColumns postgres.ColumnList
DefaultColumns postgres.ColumnList
}
type OperationLogTable struct {
operationLogTable
EXCLUDED operationLogTable
}
// AS creates new OperationLogTable with assigned alias
func (a OperationLogTable) AS(alias string) *OperationLogTable {
return newOperationLogTable(a.SchemaName(), a.TableName(), alias)
}
// Schema creates new OperationLogTable with assigned schema name
func (a OperationLogTable) FromSchema(schemaName string) *OperationLogTable {
return newOperationLogTable(schemaName, a.TableName(), a.Alias())
}
// WithPrefix creates new OperationLogTable with assigned table prefix
func (a OperationLogTable) WithPrefix(prefix string) *OperationLogTable {
return newOperationLogTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}
// WithSuffix creates new OperationLogTable with assigned table suffix
func (a OperationLogTable) WithSuffix(suffix string) *OperationLogTable {
return newOperationLogTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}
func newOperationLogTable(schemaName, tableName, alias string) *OperationLogTable {
return &OperationLogTable{
operationLogTable: newOperationLogTableImpl(schemaName, tableName, alias),
EXCLUDED: newOperationLogTableImpl("", "excluded", ""),
}
}
func newOperationLogTableImpl(schemaName, tableName, alias string) operationLogTable {
var (
IDColumn = postgres.IntegerColumn("id")
GameIDColumn = postgres.StringColumn("game_id")
OpKindColumn = postgres.StringColumn("op_kind")
OpSourceColumn = postgres.StringColumn("op_source")
SourceRefColumn = postgres.StringColumn("source_ref")
ImageRefColumn = postgres.StringColumn("image_ref")
ContainerIDColumn = postgres.StringColumn("container_id")
OutcomeColumn = postgres.StringColumn("outcome")
ErrorCodeColumn = postgres.StringColumn("error_code")
ErrorMessageColumn = postgres.StringColumn("error_message")
StartedAtColumn = postgres.TimestampzColumn("started_at")
FinishedAtColumn = postgres.TimestampzColumn("finished_at")
allColumns = postgres.ColumnList{IDColumn, GameIDColumn, OpKindColumn, OpSourceColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, OutcomeColumn, ErrorCodeColumn, ErrorMessageColumn, StartedAtColumn, FinishedAtColumn}
mutableColumns = postgres.ColumnList{GameIDColumn, OpKindColumn, OpSourceColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, OutcomeColumn, ErrorCodeColumn, ErrorMessageColumn, StartedAtColumn, FinishedAtColumn}
defaultColumns = postgres.ColumnList{IDColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, ErrorCodeColumn, ErrorMessageColumn}
)
return operationLogTable{
Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),
//Columns
ID: IDColumn,
GameID: GameIDColumn,
OpKind: OpKindColumn,
OpSource: OpSourceColumn,
SourceRef: SourceRefColumn,
ImageRef: ImageRefColumn,
ContainerID: ContainerIDColumn,
Outcome: OutcomeColumn,
ErrorCode: ErrorCodeColumn,
ErrorMessage: ErrorMessageColumn,
StartedAt: StartedAtColumn,
FinishedAt: FinishedAtColumn,
AllColumns: allColumns,
MutableColumns: mutableColumns,
DefaultColumns: defaultColumns,
}
}
@@ -0,0 +1,111 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package table
import (
"github.com/go-jet/jet/v2/postgres"
)
var RuntimeRecords = newRuntimeRecordsTable("rtmanager", "runtime_records", "")
type runtimeRecordsTable struct {
postgres.Table
// Columns
GameID postgres.ColumnString
Status postgres.ColumnString
CurrentContainerID postgres.ColumnString
CurrentImageRef postgres.ColumnString
EngineEndpoint postgres.ColumnString
StatePath postgres.ColumnString
DockerNetwork postgres.ColumnString
StartedAt postgres.ColumnTimestampz
StoppedAt postgres.ColumnTimestampz
RemovedAt postgres.ColumnTimestampz
LastOpAt postgres.ColumnTimestampz
CreatedAt postgres.ColumnTimestampz
AllColumns postgres.ColumnList
MutableColumns postgres.ColumnList
DefaultColumns postgres.ColumnList
}
type RuntimeRecordsTable struct {
runtimeRecordsTable
EXCLUDED runtimeRecordsTable
}
// AS creates new RuntimeRecordsTable with assigned alias
func (a RuntimeRecordsTable) AS(alias string) *RuntimeRecordsTable {
return newRuntimeRecordsTable(a.SchemaName(), a.TableName(), alias)
}
// Schema creates new RuntimeRecordsTable with assigned schema name
func (a RuntimeRecordsTable) FromSchema(schemaName string) *RuntimeRecordsTable {
return newRuntimeRecordsTable(schemaName, a.TableName(), a.Alias())
}
// WithPrefix creates new RuntimeRecordsTable with assigned table prefix
func (a RuntimeRecordsTable) WithPrefix(prefix string) *RuntimeRecordsTable {
return newRuntimeRecordsTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}
// WithSuffix creates new RuntimeRecordsTable with assigned table suffix
func (a RuntimeRecordsTable) WithSuffix(suffix string) *RuntimeRecordsTable {
return newRuntimeRecordsTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}
func newRuntimeRecordsTable(schemaName, tableName, alias string) *RuntimeRecordsTable {
return &RuntimeRecordsTable{
runtimeRecordsTable: newRuntimeRecordsTableImpl(schemaName, tableName, alias),
EXCLUDED: newRuntimeRecordsTableImpl("", "excluded", ""),
}
}
func newRuntimeRecordsTableImpl(schemaName, tableName, alias string) runtimeRecordsTable {
var (
GameIDColumn = postgres.StringColumn("game_id")
StatusColumn = postgres.StringColumn("status")
CurrentContainerIDColumn = postgres.StringColumn("current_container_id")
CurrentImageRefColumn = postgres.StringColumn("current_image_ref")
EngineEndpointColumn = postgres.StringColumn("engine_endpoint")
StatePathColumn = postgres.StringColumn("state_path")
DockerNetworkColumn = postgres.StringColumn("docker_network")
StartedAtColumn = postgres.TimestampzColumn("started_at")
StoppedAtColumn = postgres.TimestampzColumn("stopped_at")
RemovedAtColumn = postgres.TimestampzColumn("removed_at")
LastOpAtColumn = postgres.TimestampzColumn("last_op_at")
CreatedAtColumn = postgres.TimestampzColumn("created_at")
allColumns = postgres.ColumnList{GameIDColumn, StatusColumn, CurrentContainerIDColumn, CurrentImageRefColumn, EngineEndpointColumn, StatePathColumn, DockerNetworkColumn, StartedAtColumn, StoppedAtColumn, RemovedAtColumn, LastOpAtColumn, CreatedAtColumn}
mutableColumns = postgres.ColumnList{StatusColumn, CurrentContainerIDColumn, CurrentImageRefColumn, EngineEndpointColumn, StatePathColumn, DockerNetworkColumn, StartedAtColumn, StoppedAtColumn, RemovedAtColumn, LastOpAtColumn, CreatedAtColumn}
defaultColumns = postgres.ColumnList{}
)
return runtimeRecordsTable{
Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),
//Columns
GameID: GameIDColumn,
Status: StatusColumn,
CurrentContainerID: CurrentContainerIDColumn,
CurrentImageRef: CurrentImageRefColumn,
EngineEndpoint: EngineEndpointColumn,
StatePath: StatePathColumn,
DockerNetwork: DockerNetworkColumn,
StartedAt: StartedAtColumn,
StoppedAt: StoppedAtColumn,
RemovedAt: RemovedAtColumn,
LastOpAt: LastOpAtColumn,
CreatedAt: CreatedAtColumn,
AllColumns: allColumns,
MutableColumns: mutableColumns,
DefaultColumns: defaultColumns,
}
}
@@ -0,0 +1,17 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//
package table
// UseSchema sets a new schema name for all generated table SQL builder types. It is recommended to invoke
// this method only once at the beginning of the program.
func UseSchema(schema string) {
GooseDbVersion = GooseDbVersion.FromSchema(schema)
HealthSnapshots = HealthSnapshots.FromSchema(schema)
OperationLog = OperationLog.FromSchema(schema)
RuntimeRecords = RuntimeRecords.FromSchema(schema)
}
@@ -0,0 +1,106 @@
-- +goose Up
-- Initial Runtime Manager PostgreSQL schema.
--
-- Three tables cover the durable surface of the service:
-- * runtime_records — one row per game with the latest known runtime
-- status and Docker container binding;
-- * operation_log — append-only audit of every start/stop/restart/
-- patch/cleanup/reconcile_* operation RTM performed;
-- * health_snapshots — latest technical health observation per game.
--
-- Schema and the matching `rtmanagerservice` role are provisioned
-- outside this script (in tests via the pgtest bootstrap's
-- provisionRoleAndSchema, in codegen via cmd/jetgen; in production via
-- an ops init script). This migration runs as the
-- schema owner with `search_path=rtmanager` and only contains DDL for the
-- service-owned tables and indexes. ARCHITECTURE.md §Database topology
-- mandates that the per-service role's grants stay restricted to its own
-- schema; consequently this file deliberately deviates from PLAN.md
-- Stage 09's literal `CREATE SCHEMA IF NOT EXISTS rtmanager;` instruction.
-- runtime_records holds one durable record per game with the latest
-- known runtime status and Docker container binding. The status enum
-- (running | stopped | removed) is enforced by a CHECK so domain code
-- can rely on it without reading every callsite. The (status, last_op_at)
-- index drives the periodic container-cleanup worker that scans
-- `status='stopped' AND last_op_at < now() - retention` (a sketch of
-- that scan follows the index definition below).
CREATE TABLE runtime_records (
game_id text PRIMARY KEY,
status text NOT NULL,
current_container_id text,
current_image_ref text,
engine_endpoint text NOT NULL,
state_path text NOT NULL,
docker_network text NOT NULL,
started_at timestamptz,
stopped_at timestamptz,
removed_at timestamptz,
last_op_at timestamptz NOT NULL,
created_at timestamptz NOT NULL,
CONSTRAINT runtime_records_status_chk
CHECK (status IN ('running', 'stopped', 'removed'))
);
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
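-- An illustrative sketch of that cleanup scan (the 24-hour retention
-- interval is an assumption; the cleanup worker owns the real value):
--
--   SELECT game_id, current_container_id
--     FROM runtime_records
--    WHERE status = 'stopped'
--      AND last_op_at < now() - interval '24 hours'
--    ORDER BY last_op_at;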
-- operation_log is an append-only audit of every operation Runtime
-- Manager performed against a game's runtime. The (game_id, started_at
-- DESC) index drives audit reads from the GM/Admin REST surface;
-- finished_at is nullable for in-flight rows even though Stage 13+
-- always finalises the row in the same transaction. The op_kind /
-- op_source / outcome enums are enforced by CHECK constraints to keep
-- the audit schema honest without a separate Go validator.
CREATE TABLE operation_log (
id bigserial PRIMARY KEY,
game_id text NOT NULL,
op_kind text NOT NULL,
op_source text NOT NULL,
source_ref text NOT NULL DEFAULT '',
image_ref text NOT NULL DEFAULT '',
container_id text NOT NULL DEFAULT '',
outcome text NOT NULL,
error_code text NOT NULL DEFAULT '',
error_message text NOT NULL DEFAULT '',
started_at timestamptz NOT NULL,
finished_at timestamptz,
CONSTRAINT operation_log_op_kind_chk
CHECK (op_kind IN (
'start', 'stop', 'restart', 'patch',
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk
CHECK (op_source IN (
'lobby_stream', 'gm_rest', 'admin_rest',
'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk
CHECK (outcome IN ('success', 'failure'))
);
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
-- health_snapshots stores the latest technical health observation per
-- game. One row per game; later observations overwrite. The status enum
-- mirrors the `event_type` vocabulary on `runtime:health_events`
-- (collapsed to a flat status column for the latest-observation view).
CREATE TABLE health_snapshots (
game_id text PRIMARY KEY,
container_id text NOT NULL DEFAULT '',
status text NOT NULL,
source text NOT NULL,
details jsonb NOT NULL DEFAULT '{}'::jsonb,
observed_at timestamptz NOT NULL,
CONSTRAINT health_snapshots_status_chk
CHECK (status IN (
'healthy', 'probe_failed', 'exited',
'oom', 'inspect_unhealthy', 'container_disappeared'
)),
CONSTRAINT health_snapshots_source_chk
CHECK (source IN ('docker_event', 'inspect', 'probe'))
);
-- +goose Down
DROP TABLE IF EXISTS health_snapshots;
DROP TABLE IF EXISTS operation_log;
DROP TABLE IF EXISTS runtime_records;
@@ -0,0 +1,19 @@
// Package migrations exposes the embedded goose migration files used by
// Runtime Manager to provision its `rtmanager` schema in PostgreSQL.
//
// The embedded filesystem is consumed by `pkg/postgres.RunMigrations`
// during rtmanager-service startup and by `cmd/jetgen` when regenerating
// the `internal/adapters/postgres/jet/` code against a transient
// PostgreSQL instance.
package migrations
import "embed"
//go:embed *.sql
var fs embed.FS
// FS returns the embedded filesystem containing every numbered goose
// migration shipped with Runtime Manager.
func FS() embed.FS {
return fs
}
@@ -0,0 +1,235 @@
// Package operationlogstore implements the PostgreSQL-backed adapter for
// `ports.OperationLogStore`.
//
// The package owns the on-disk shape of the `operation_log` table defined
// in
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
// and translates the schema-agnostic `ports.OperationLogStore` interface
// declared in `internal/ports/operationlogstore.go` into concrete
// go-jet/v2 statements driven by the pgx driver.
//
// Append uses `INSERT ... RETURNING id` to surface the bigserial id back
// to callers; ListByGame is index-driven by `operation_log_game_started_idx`.
package operationlogstore
import (
"context"
"database/sql"
"errors"
"fmt"
"strings"
"time"
"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/ports"
pg "github.com/go-jet/jet/v2/postgres"
)
// Config configures one PostgreSQL-backed operation-log store instance.
type Config struct {
// DB stores the connection pool the store uses for every query.
DB *sql.DB
// OperationTimeout bounds one round trip.
OperationTimeout time.Duration
}
// Store persists Runtime Manager operation-log entries in PostgreSQL.
type Store struct {
db *sql.DB
operationTimeout time.Duration
}
// New constructs one PostgreSQL-backed operation-log store from cfg.
func New(cfg Config) (*Store, error) {
if cfg.DB == nil {
return nil, errors.New("new postgres operation log store: db must not be nil")
}
if cfg.OperationTimeout <= 0 {
return nil, errors.New("new postgres operation log store: operation timeout must be positive")
}
return &Store{
db: cfg.DB,
operationTimeout: cfg.OperationTimeout,
}, nil
}
// operationLogSelectColumns is the canonical SELECT list for the
// operation_log table, matching scanEntry's column order.
var operationLogSelectColumns = pg.ColumnList{
pgtable.OperationLog.ID,
pgtable.OperationLog.GameID,
pgtable.OperationLog.OpKind,
pgtable.OperationLog.OpSource,
pgtable.OperationLog.SourceRef,
pgtable.OperationLog.ImageRef,
pgtable.OperationLog.ContainerID,
pgtable.OperationLog.Outcome,
pgtable.OperationLog.ErrorCode,
pgtable.OperationLog.ErrorMessage,
pgtable.OperationLog.StartedAt,
pgtable.OperationLog.FinishedAt,
}
// Append inserts entry into the operation log and returns the generated
// bigserial id. entry is validated through operation.OperationEntry.Validate
// before the SQL is issued.
func (store *Store) Append(ctx context.Context, entry operation.OperationEntry) (int64, error) {
if store == nil || store.db == nil {
return 0, errors.New("append operation log entry: nil store")
}
if err := entry.Validate(); err != nil {
return 0, fmt.Errorf("append operation log entry: %w", err)
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "append operation log entry", store.operationTimeout)
if err != nil {
return 0, err
}
defer cancel()
stmt := pgtable.OperationLog.INSERT(
pgtable.OperationLog.GameID,
pgtable.OperationLog.OpKind,
pgtable.OperationLog.OpSource,
pgtable.OperationLog.SourceRef,
pgtable.OperationLog.ImageRef,
pgtable.OperationLog.ContainerID,
pgtable.OperationLog.Outcome,
pgtable.OperationLog.ErrorCode,
pgtable.OperationLog.ErrorMessage,
pgtable.OperationLog.StartedAt,
pgtable.OperationLog.FinishedAt,
).VALUES(
entry.GameID,
string(entry.OpKind),
string(entry.OpSource),
entry.SourceRef,
entry.ImageRef,
entry.ContainerID,
string(entry.Outcome),
entry.ErrorCode,
entry.ErrorMessage,
entry.StartedAt.UTC(),
sqlx.NullableTimePtr(entry.FinishedAt),
).RETURNING(pgtable.OperationLog.ID)
query, args := stmt.Sql()
row := store.db.QueryRowContext(operationCtx, query, args...)
var id int64
if err := row.Scan(&id); err != nil {
return 0, fmt.Errorf("append operation log entry: %w", err)
}
return id, nil
}
// ListByGame returns the most recent entries for gameID, ordered by
// started_at descending (with id descending as a tie-break) and capped
// by limit. The (game_id, started_at DESC) index drives the read.
func (store *Store) ListByGame(ctx context.Context, gameID string, limit int) ([]operation.OperationEntry, error) {
if store == nil || store.db == nil {
return nil, errors.New("list operation log entries by game: nil store")
}
if strings.TrimSpace(gameID) == "" {
return nil, fmt.Errorf("list operation log entries by game: game id must not be empty")
}
if limit <= 0 {
return nil, fmt.Errorf("list operation log entries by game: limit must be positive, got %d", limit)
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list operation log entries by game", store.operationTimeout)
if err != nil {
return nil, err
}
defer cancel()
stmt := pg.SELECT(operationLogSelectColumns).
FROM(pgtable.OperationLog).
WHERE(pgtable.OperationLog.GameID.EQ(pg.String(gameID))).
ORDER_BY(pgtable.OperationLog.StartedAt.DESC(), pgtable.OperationLog.ID.DESC()).
LIMIT(int64(limit))
query, args := stmt.Sql()
rows, err := store.db.QueryContext(operationCtx, query, args...)
if err != nil {
return nil, fmt.Errorf("list operation log entries by game: %w", err)
}
defer rows.Close()
entries := make([]operation.OperationEntry, 0)
for rows.Next() {
entry, err := scanEntry(rows)
if err != nil {
return nil, fmt.Errorf("list operation log entries by game: scan: %w", err)
}
entries = append(entries, entry)
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("list operation log entries by game: %w", err)
}
if len(entries) == 0 {
return nil, nil
}
return entries, nil
}
// rowScanner abstracts *sql.Row and *sql.Rows so scanEntry can be shared
// across both single-row reads and iterated reads.
type rowScanner interface {
Scan(dest ...any) error
}
// scanEntry scans one operation_log row from rs.
func scanEntry(rs rowScanner) (operation.OperationEntry, error) {
var (
id int64
gameID string
opKind string
opSource string
sourceRef string
imageRef string
containerID string
outcome string
errorCode string
errorMessage string
startedAt time.Time
finishedAt sql.NullTime
)
if err := rs.Scan(
&id,
&gameID,
&opKind,
&opSource,
&sourceRef,
&imageRef,
&containerID,
&outcome,
&errorCode,
&errorMessage,
&startedAt,
&finishedAt,
); err != nil {
return operation.OperationEntry{}, err
}
return operation.OperationEntry{
ID: id,
GameID: gameID,
OpKind: operation.OpKind(opKind),
OpSource: operation.OpSource(opSource),
SourceRef: sourceRef,
ImageRef: imageRef,
ContainerID: containerID,
Outcome: operation.Outcome(outcome),
ErrorCode: errorCode,
ErrorMessage: errorMessage,
StartedAt: startedAt.UTC(),
FinishedAt: sqlx.TimePtrFromNullable(finishedAt),
}, nil
}
// Ensure Store satisfies the ports.OperationLogStore interface at compile
// time.
var _ ports.OperationLogStore = (*Store)(nil)
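// Typical call pattern, as an illustrative sketch (all values are
// placeholders; the entry mirrors the shape exercised by the tests):
//
//	startedAt := time.Now().UTC()
//	finishedAt := startedAt.Add(time.Second)
//	id, err := store.Append(ctx, operation.OperationEntry{
//	    GameID:      "game-001",
//	    OpKind:      operation.OpKindStart,
//	    OpSource:    operation.OpSourceLobbyStream,
//	    ImageRef:    "galaxy/game:v1.2.3",
//	    ContainerID: "container-1",
//	    Outcome:     operation.OutcomeSuccess,
//	    StartedAt:   startedAt,
//	    FinishedAt:  &finishedAt,
//	})
//
//	entries, err := store.ListByGame(ctx, "game-001", 50)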
@@ -0,0 +1,207 @@
package operationlogstore_test
import (
"context"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
"galaxy/rtmanager/internal/domain/operation"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestMain(m *testing.M) { pgtest.RunMain(m) }
func newStore(t *testing.T) *operationlogstore.Store {
t.Helper()
pgtest.TruncateAll(t)
store, err := operationlogstore.New(operationlogstore.Config{
DB: pgtest.Ensure(t).Pool(),
OperationTimeout: pgtest.OperationTimeout,
})
require.NoError(t, err)
return store
}
func successStartEntry(gameID string, startedAt time.Time, sourceRef string) operation.OperationEntry {
finishedAt := startedAt.Add(time.Second)
return operation.OperationEntry{
GameID: gameID,
OpKind: operation.OpKindStart,
OpSource: operation.OpSourceLobbyStream,
SourceRef: sourceRef,
ImageRef: "galaxy/game:v1.2.3",
ContainerID: "container-1",
Outcome: operation.OutcomeSuccess,
StartedAt: startedAt,
FinishedAt: &finishedAt,
}
}
func TestAppendReturnsPositiveIDs(t *testing.T) {
ctx := context.Background()
store := newStore(t)
startedAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
id1, err := store.Append(ctx, successStartEntry("game-001", startedAt, "1700000000000-0"))
require.NoError(t, err)
assert.Greater(t, id1, int64(0))
id2, err := store.Append(ctx, successStartEntry("game-001", startedAt.Add(time.Minute), "1700000000001-0"))
require.NoError(t, err)
assert.Greater(t, id2, id1)
}
func TestAppendValidatesEntry(t *testing.T) {
ctx := context.Background()
store := newStore(t)
tests := []struct {
name string
mutate func(*operation.OperationEntry)
}{
{"empty game id", func(e *operation.OperationEntry) { e.GameID = "" }},
{"unknown op kind", func(e *operation.OperationEntry) { e.OpKind = "exotic" }},
{"unknown op source", func(e *operation.OperationEntry) { e.OpSource = "exotic" }},
{"unknown outcome", func(e *operation.OperationEntry) { e.Outcome = "exotic" }},
{"zero started at", func(e *operation.OperationEntry) { e.StartedAt = time.Time{} }},
{"failure without error code", func(e *operation.OperationEntry) {
e.Outcome = operation.OutcomeFailure
e.ErrorCode = ""
}},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
entry := successStartEntry("game-001",
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC), "ref")
tt.mutate(&entry)
_, err := store.Append(ctx, entry)
require.Error(t, err)
})
}
}
func TestListByGameReturnsEntriesNewestFirst(t *testing.T) {
ctx := context.Background()
store := newStore(t)
base := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
for index := range 3 {
_, err := store.Append(ctx, successStartEntry("game-001",
base.Add(time.Duration(index)*time.Minute),
"ref-game-001-"))
require.NoError(t, err)
}
// Foreign-game entry must not appear in the list.
_, err := store.Append(ctx, successStartEntry("game-other", base, "ref-other"))
require.NoError(t, err)
entries, err := store.ListByGame(ctx, "game-001", 10)
require.NoError(t, err)
require.Len(t, entries, 3)
for index := range 2 {
assert.True(t,
!entries[index].StartedAt.Before(entries[index+1].StartedAt),
"entries must be ordered started_at DESC; got %s before %s",
entries[index].StartedAt, entries[index+1].StartedAt,
)
}
}
func TestListByGameRespectsLimit(t *testing.T) {
ctx := context.Background()
store := newStore(t)
base := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
for index := range 5 {
_, err := store.Append(ctx, successStartEntry("game-001",
base.Add(time.Duration(index)*time.Minute), "ref"))
require.NoError(t, err)
}
entries, err := store.ListByGame(ctx, "game-001", 2)
require.NoError(t, err)
require.Len(t, entries, 2)
}
func TestListByGameReturnsEmptyForUnknownGame(t *testing.T) {
ctx := context.Background()
store := newStore(t)
entries, err := store.ListByGame(ctx, "game-missing", 10)
require.NoError(t, err)
assert.Empty(t, entries)
}
func TestListByGameRejectsInvalidArgs(t *testing.T) {
ctx := context.Background()
store := newStore(t)
_, err := store.ListByGame(ctx, "", 10)
require.Error(t, err)
_, err = store.ListByGame(ctx, "game-001", 0)
require.Error(t, err)
_, err = store.ListByGame(ctx, "game-001", -3)
require.Error(t, err)
}
func TestAppendRoundTripsAllFields(t *testing.T) {
ctx := context.Background()
store := newStore(t)
startedAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
finishedAt := startedAt.Add(2 * time.Second)
original := operation.OperationEntry{
GameID: "game-001",
OpKind: operation.OpKindStop,
OpSource: operation.OpSourceGMRest,
SourceRef: "request-7",
ImageRef: "galaxy/game:v2.0.0",
ContainerID: "container-X",
Outcome: operation.OutcomeFailure,
ErrorCode: "container_start_failed",
ErrorMessage: "stop deadline exceeded",
StartedAt: startedAt,
FinishedAt: &finishedAt,
}
id, err := store.Append(ctx, original)
require.NoError(t, err)
entries, err := store.ListByGame(ctx, "game-001", 10)
require.NoError(t, err)
require.Len(t, entries, 1)
got := entries[0]
assert.Equal(t, id, got.ID)
assert.Equal(t, original.GameID, got.GameID)
assert.Equal(t, original.OpKind, got.OpKind)
assert.Equal(t, original.OpSource, got.OpSource)
assert.Equal(t, original.SourceRef, got.SourceRef)
assert.Equal(t, original.ImageRef, got.ImageRef)
assert.Equal(t, original.ContainerID, got.ContainerID)
assert.Equal(t, original.Outcome, got.Outcome)
assert.Equal(t, original.ErrorCode, got.ErrorCode)
assert.Equal(t, original.ErrorMessage, got.ErrorMessage)
assert.True(t, original.StartedAt.Equal(got.StartedAt))
require.NotNil(t, got.FinishedAt)
assert.True(t, original.FinishedAt.Equal(*got.FinishedAt))
assert.Equal(t, time.UTC, got.StartedAt.Location())
assert.Equal(t, time.UTC, got.FinishedAt.Location())
}
func TestNewRejectsNilDB(t *testing.T) {
_, err := operationlogstore.New(operationlogstore.Config{OperationTimeout: time.Second})
require.Error(t, err)
}
func TestNewRejectsNonPositiveTimeout(t *testing.T) {
_, err := operationlogstore.New(operationlogstore.Config{
DB: pgtest.Ensure(t).Pool(),
})
require.Error(t, err)
}
@@ -0,0 +1,500 @@
// Package runtimerecordstore implements the PostgreSQL-backed adapter for
// `ports.RuntimeRecordStore`.
//
// The package owns the on-disk shape of the `runtime_records` table
// defined in
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
// and translates the schema-agnostic `ports.RuntimeRecordStore` interface
// declared in `internal/ports/runtimerecordstore.go` into concrete
// go-jet/v2 statements driven by the pgx driver.
//
// Lifecycle transitions (UpdateStatus) use compare-and-swap on
// `(status, current_container_id)` rather than holding a SELECT ... FOR
// UPDATE lock across the caller's logic, mirroring the pattern used by
// `lobby/internal/adapters/postgres/gamestore`.
package runtimerecordstore
import (
"context"
"database/sql"
"errors"
"fmt"
"strings"
"time"
"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
pg "github.com/go-jet/jet/v2/postgres"
)
// Config configures one PostgreSQL-backed runtime-record store instance.
// The store does not own the underlying *sql.DB lifecycle: the caller
// (typically the service runtime) opens, instruments, migrates, and
// closes the pool.
type Config struct {
// DB stores the connection pool the store uses for every query.
DB *sql.DB
// OperationTimeout bounds one round trip. The store creates a
// derived context for each operation so callers cannot starve the
// pool with an unbounded ctx.
OperationTimeout time.Duration
}
// Store persists Runtime Manager runtime records in PostgreSQL.
type Store struct {
db *sql.DB
operationTimeout time.Duration
}
// New constructs one PostgreSQL-backed runtime-record store from cfg.
func New(cfg Config) (*Store, error) {
if cfg.DB == nil {
return nil, errors.New("new postgres runtime record store: db must not be nil")
}
if cfg.OperationTimeout <= 0 {
return nil, errors.New("new postgres runtime record store: operation timeout must be positive")
}
return &Store{
db: cfg.DB,
operationTimeout: cfg.OperationTimeout,
}, nil
}
// runtimeSelectColumns is the canonical SELECT list for the runtime_records
// table, matching scanRecord's column order.
var runtimeSelectColumns = pg.ColumnList{
pgtable.RuntimeRecords.GameID,
pgtable.RuntimeRecords.Status,
pgtable.RuntimeRecords.CurrentContainerID,
pgtable.RuntimeRecords.CurrentImageRef,
pgtable.RuntimeRecords.EngineEndpoint,
pgtable.RuntimeRecords.StatePath,
pgtable.RuntimeRecords.DockerNetwork,
pgtable.RuntimeRecords.StartedAt,
pgtable.RuntimeRecords.StoppedAt,
pgtable.RuntimeRecords.RemovedAt,
pgtable.RuntimeRecords.LastOpAt,
pgtable.RuntimeRecords.CreatedAt,
}
// Get returns the record identified by gameID. It returns
// runtime.ErrNotFound when no record exists.
func (store *Store) Get(ctx context.Context, gameID string) (runtime.RuntimeRecord, error) {
if store == nil || store.db == nil {
return runtime.RuntimeRecord{}, errors.New("get runtime record: nil store")
}
if strings.TrimSpace(gameID) == "" {
return runtime.RuntimeRecord{}, fmt.Errorf("get runtime record: game id must not be empty")
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "get runtime record", store.operationTimeout)
if err != nil {
return runtime.RuntimeRecord{}, err
}
defer cancel()
stmt := pg.SELECT(runtimeSelectColumns).
FROM(pgtable.RuntimeRecords).
WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))
query, args := stmt.Sql()
row := store.db.QueryRowContext(operationCtx, query, args...)
record, err := scanRecord(row)
if sqlx.IsNoRows(err) {
return runtime.RuntimeRecord{}, runtime.ErrNotFound
}
if err != nil {
return runtime.RuntimeRecord{}, fmt.Errorf("get runtime record: %w", err)
}
return record, nil
}
// Upsert inserts record when no row exists for record.GameID and
// otherwise overwrites every mutable column verbatim. created_at is
// preserved across upserts so the "first time RTM saw the game"
// timestamp stays stable.
func (store *Store) Upsert(ctx context.Context, record runtime.RuntimeRecord) error {
if store == nil || store.db == nil {
return errors.New("upsert runtime record: nil store")
}
if err := record.Validate(); err != nil {
return fmt.Errorf("upsert runtime record: %w", err)
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "upsert runtime record", store.operationTimeout)
if err != nil {
return err
}
defer cancel()
stmt := pgtable.RuntimeRecords.INSERT(
pgtable.RuntimeRecords.GameID,
pgtable.RuntimeRecords.Status,
pgtable.RuntimeRecords.CurrentContainerID,
pgtable.RuntimeRecords.CurrentImageRef,
pgtable.RuntimeRecords.EngineEndpoint,
pgtable.RuntimeRecords.StatePath,
pgtable.RuntimeRecords.DockerNetwork,
pgtable.RuntimeRecords.StartedAt,
pgtable.RuntimeRecords.StoppedAt,
pgtable.RuntimeRecords.RemovedAt,
pgtable.RuntimeRecords.LastOpAt,
pgtable.RuntimeRecords.CreatedAt,
).VALUES(
record.GameID,
string(record.Status),
sqlx.NullableString(record.CurrentContainerID),
sqlx.NullableString(record.CurrentImageRef),
record.EngineEndpoint,
record.StatePath,
record.DockerNetwork,
sqlx.NullableTimePtr(record.StartedAt),
sqlx.NullableTimePtr(record.StoppedAt),
sqlx.NullableTimePtr(record.RemovedAt),
record.LastOpAt.UTC(),
record.CreatedAt.UTC(),
).ON_CONFLICT(pgtable.RuntimeRecords.GameID).DO_UPDATE(
pg.SET(
pgtable.RuntimeRecords.Status.SET(pgtable.RuntimeRecords.EXCLUDED.Status),
pgtable.RuntimeRecords.CurrentContainerID.SET(pgtable.RuntimeRecords.EXCLUDED.CurrentContainerID),
pgtable.RuntimeRecords.CurrentImageRef.SET(pgtable.RuntimeRecords.EXCLUDED.CurrentImageRef),
pgtable.RuntimeRecords.EngineEndpoint.SET(pgtable.RuntimeRecords.EXCLUDED.EngineEndpoint),
pgtable.RuntimeRecords.StatePath.SET(pgtable.RuntimeRecords.EXCLUDED.StatePath),
pgtable.RuntimeRecords.DockerNetwork.SET(pgtable.RuntimeRecords.EXCLUDED.DockerNetwork),
pgtable.RuntimeRecords.StartedAt.SET(pgtable.RuntimeRecords.EXCLUDED.StartedAt),
pgtable.RuntimeRecords.StoppedAt.SET(pgtable.RuntimeRecords.EXCLUDED.StoppedAt),
pgtable.RuntimeRecords.RemovedAt.SET(pgtable.RuntimeRecords.EXCLUDED.RemovedAt),
pgtable.RuntimeRecords.LastOpAt.SET(pgtable.RuntimeRecords.EXCLUDED.LastOpAt),
),
)
query, args := stmt.Sql()
if _, err := store.db.ExecContext(operationCtx, query, args...); err != nil {
return fmt.Errorf("upsert runtime record: %w", err)
}
return nil
}
// UpdateStatus applies one status transition with a compare-and-swap
// guard on (status, current_container_id). input.Validate runs before
// any SQL is issued.
func (store *Store) UpdateStatus(ctx context.Context, input ports.UpdateStatusInput) error {
if store == nil || store.db == nil {
return errors.New("update runtime status: nil store")
}
if err := input.Validate(); err != nil {
return err
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "update runtime status", store.operationTimeout)
if err != nil {
return err
}
defer cancel()
now := input.Now.UTC()
stmt, err := buildUpdateStatusStatement(input, now)
if err != nil {
return err
}
query, args := stmt.Sql()
result, err := store.db.ExecContext(operationCtx, query, args...)
if err != nil {
return fmt.Errorf("update runtime status: %w", err)
}
affected, err := result.RowsAffected()
if err != nil {
return fmt.Errorf("update runtime status: rows affected: %w", err)
}
if affected == 0 {
return store.classifyMissingUpdate(operationCtx, input.GameID)
}
return nil
}
// classifyMissingUpdate distinguishes ErrNotFound from ErrConflict after
// an UPDATE that affected zero rows. A row that is absent yields
// ErrNotFound; a row whose status or container_id does not match the
// CAS predicate yields ErrConflict.
func (store *Store) classifyMissingUpdate(ctx context.Context, gameID string) error {
probe := pg.SELECT(pgtable.RuntimeRecords.Status).
FROM(pgtable.RuntimeRecords).
WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))
probeQuery, probeArgs := probe.Sql()
var current string
row := store.db.QueryRowContext(ctx, probeQuery, probeArgs...)
if err := row.Scan(&current); err != nil {
if sqlx.IsNoRows(err) {
return runtime.ErrNotFound
}
return fmt.Errorf("update runtime status: probe: %w", err)
}
return runtime.ErrConflict
}
// buildUpdateStatusStatement assembles the UPDATE statement applied for
// one runtime-status transition.
//
// status, last_op_at are always updated. The remaining columns are
// driven by the destination:
//
// - StatusStopped: stopped_at is captured at Now.
// - StatusRemoved: removed_at is captured at Now and current_container_id
// is NULLed (the container is gone; the prior id remains observable
// through operation_log).
// - StatusRunning: only status + last_op_at change. Fresh started_at
// and current_container_id are installed via Upsert before any
// stopped → running transition reaches this path; the path exists
// so runtime.AllowedTransitions stays one-to-one with the adapter
// capability matrix even though v1 services use Upsert for this
// case.
func buildUpdateStatusStatement(input ports.UpdateStatusInput, now time.Time) (pg.UpdateStatement, error) {
statusValue := pg.String(string(input.To))
nowValue := pg.TimestampzT(now)
var stmt pg.UpdateStatement
switch input.To {
case runtime.StatusStopped:
stmt = pgtable.RuntimeRecords.UPDATE(
pgtable.RuntimeRecords.Status,
pgtable.RuntimeRecords.LastOpAt,
pgtable.RuntimeRecords.StoppedAt,
).SET(
statusValue,
nowValue,
nowValue,
)
case runtime.StatusRemoved:
stmt = pgtable.RuntimeRecords.UPDATE(
pgtable.RuntimeRecords.Status,
pgtable.RuntimeRecords.LastOpAt,
pgtable.RuntimeRecords.RemovedAt,
pgtable.RuntimeRecords.CurrentContainerID,
).SET(
statusValue,
nowValue,
nowValue,
pg.NULL,
)
case runtime.StatusRunning:
stmt = pgtable.RuntimeRecords.UPDATE(
pgtable.RuntimeRecords.Status,
pgtable.RuntimeRecords.LastOpAt,
).SET(
statusValue,
nowValue,
)
default:
return nil, fmt.Errorf("update runtime status: destination status %q is unsupported", input.To)
}
whereExpr := pg.AND(
pgtable.RuntimeRecords.GameID.EQ(pg.String(input.GameID)),
pgtable.RuntimeRecords.Status.EQ(pg.String(string(input.ExpectedFrom))),
)
if input.ExpectedContainerID != "" {
whereExpr = pg.AND(
whereExpr,
pgtable.RuntimeRecords.CurrentContainerID.EQ(pg.String(input.ExpectedContainerID)),
)
}
return stmt.WHERE(whereExpr), nil
}
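// For orientation only, an illustrative sketch of the rendered SQL for a
// stopped transition (go-jet emits the exact text; this is not copied
// from it):
//
//	UPDATE rtmanager.runtime_records
//	   SET status = $1, last_op_at = $2, stopped_at = $3
//	 WHERE game_id = $4
//	   AND status = $5
//	   AND current_container_id = $6; -- only when ExpectedContainerID != ""
//
// Zero affected rows feed classifyMissingUpdate, which resolves the
// ambiguity into runtime.ErrNotFound or runtime.ErrConflict.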
// ListByStatus returns every record currently indexed under status.
// Ordering is last_op_at DESC, game_id ASC, so the freshest activity
// comes first; the status predicate is served by the
// `runtime_records_status_last_op_idx` index.
func (store *Store) ListByStatus(ctx context.Context, status runtime.Status) ([]runtime.RuntimeRecord, error) {
if store == nil || store.db == nil {
return nil, errors.New("list runtime records by status: nil store")
}
if !status.IsKnown() {
return nil, fmt.Errorf("list runtime records by status: status %q is unsupported", status)
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list runtime records by status", store.operationTimeout)
if err != nil {
return nil, err
}
defer cancel()
stmt := pg.SELECT(runtimeSelectColumns).
FROM(pgtable.RuntimeRecords).
WHERE(pgtable.RuntimeRecords.Status.EQ(pg.String(string(status)))).
ORDER_BY(pgtable.RuntimeRecords.LastOpAt.DESC(), pgtable.RuntimeRecords.GameID.ASC())
query, args := stmt.Sql()
rows, err := store.db.QueryContext(operationCtx, query, args...)
if err != nil {
return nil, fmt.Errorf("list runtime records by status: %w", err)
}
defer rows.Close()
records := make([]runtime.RuntimeRecord, 0)
for rows.Next() {
record, err := scanRecord(rows)
if err != nil {
return nil, fmt.Errorf("list runtime records by status: scan: %w", err)
}
records = append(records, record)
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("list runtime records by status: %w", err)
}
if len(records) == 0 {
return nil, nil
}
return records, nil
}
// List returns every runtime record currently stored. Ordering matches
// ListByStatus — last_op_at DESC, game_id ASC — so the REST list
// endpoint sees the freshest activity first.
func (store *Store) List(ctx context.Context) ([]runtime.RuntimeRecord, error) {
if store == nil || store.db == nil {
return nil, errors.New("list runtime records: nil store")
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list runtime records", store.operationTimeout)
if err != nil {
return nil, err
}
defer cancel()
stmt := pg.SELECT(runtimeSelectColumns).
FROM(pgtable.RuntimeRecords).
ORDER_BY(pgtable.RuntimeRecords.LastOpAt.DESC(), pgtable.RuntimeRecords.GameID.ASC())
query, args := stmt.Sql()
rows, err := store.db.QueryContext(operationCtx, query, args...)
if err != nil {
return nil, fmt.Errorf("list runtime records: %w", err)
}
defer rows.Close()
records := make([]runtime.RuntimeRecord, 0)
for rows.Next() {
record, err := scanRecord(rows)
if err != nil {
return nil, fmt.Errorf("list runtime records: scan: %w", err)
}
records = append(records, record)
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("list runtime records: %w", err)
}
if len(records) == 0 {
return nil, nil
}
return records, nil
}
// CountByStatus returns the number of records indexed under each status.
// Statuses with zero records are present in the result with a zero
// count so callers (e.g. the telemetry gauge) can publish a stable
// label set on every reading.
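//
// An illustrative publisher (assuming a prometheus GaugeVec named
// gauge; not part of this package):
//
//	counts, err := store.CountByStatus(ctx)
//	if err != nil {
//		return err
//	}
//	for _, status := range runtime.AllStatuses() {
//		gauge.WithLabelValues(string(status)).Set(float64(counts[status]))
//	}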
func (store *Store) CountByStatus(ctx context.Context) (map[runtime.Status]int, error) {
if store == nil || store.db == nil {
return nil, errors.New("count runtime records by status: nil store")
}
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "count runtime records by status", store.operationTimeout)
if err != nil {
return nil, err
}
defer cancel()
countAlias := pg.COUNT(pg.STAR).AS("count")
stmt := pg.SELECT(pgtable.RuntimeRecords.Status, countAlias).
FROM(pgtable.RuntimeRecords).
GROUP_BY(pgtable.RuntimeRecords.Status)
query, args := stmt.Sql()
rows, err := store.db.QueryContext(operationCtx, query, args...)
if err != nil {
return nil, fmt.Errorf("count runtime records by status: %w", err)
}
defer rows.Close()
counts := make(map[runtime.Status]int, len(runtime.AllStatuses()))
for _, status := range runtime.AllStatuses() {
counts[status] = 0
}
for rows.Next() {
var status string
var count int
if err := rows.Scan(&status, &count); err != nil {
return nil, fmt.Errorf("count runtime records by status: scan: %w", err)
}
counts[runtime.Status(status)] = count
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("count runtime records by status: %w", err)
}
return counts, nil
}
// rowScanner abstracts *sql.Row and *sql.Rows so scanRecord can be shared
// across both single-row reads and iterated reads.
type rowScanner interface {
Scan(dest ...any) error
}
// scanRecord scans one runtime_records row from rs. Returns sql.ErrNoRows
// verbatim so callers can distinguish "no row" from a hard error.
func scanRecord(rs rowScanner) (runtime.RuntimeRecord, error) {
var (
gameID string
status string
currentContainerID sql.NullString
currentImageRef sql.NullString
engineEndpoint string
statePath string
dockerNetwork string
startedAt sql.NullTime
stoppedAt sql.NullTime
removedAt sql.NullTime
lastOpAt time.Time
createdAt time.Time
)
if err := rs.Scan(
&gameID,
&status,
&currentContainerID,
&currentImageRef,
&engineEndpoint,
&statePath,
&dockerNetwork,
&startedAt,
&stoppedAt,
&removedAt,
&lastOpAt,
&createdAt,
); err != nil {
return runtime.RuntimeRecord{}, err
}
return runtime.RuntimeRecord{
GameID: gameID,
Status: runtime.Status(status),
CurrentContainerID: sqlx.StringFromNullable(currentContainerID),
CurrentImageRef: sqlx.StringFromNullable(currentImageRef),
EngineEndpoint: engineEndpoint,
StatePath: statePath,
DockerNetwork: dockerNetwork,
StartedAt: sqlx.TimePtrFromNullable(startedAt),
StoppedAt: sqlx.TimePtrFromNullable(stoppedAt),
RemovedAt: sqlx.TimePtrFromNullable(removedAt),
LastOpAt: lastOpAt.UTC(),
CreatedAt: createdAt.UTC(),
}, nil
}
// Ensure Store satisfies the ports.RuntimeRecordStore interface at
// compile time.
var _ ports.RuntimeRecordStore = (*Store)(nil)
@@ -0,0 +1,420 @@
package runtimerecordstore_test
import (
"context"
"errors"
"sync"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestMain(m *testing.M) { pgtest.RunMain(m) }
func newStore(t *testing.T) *runtimerecordstore.Store {
t.Helper()
pgtest.TruncateAll(t)
store, err := runtimerecordstore.New(runtimerecordstore.Config{
DB: pgtest.Ensure(t).Pool(),
OperationTimeout: pgtest.OperationTimeout,
})
require.NoError(t, err)
return store
}
func runningRecord(t *testing.T, gameID, containerID, imageRef string) runtime.RuntimeRecord {
t.Helper()
now := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
started := now
return runtime.RuntimeRecord{
GameID: gameID,
Status: runtime.StatusRunning,
CurrentContainerID: containerID,
CurrentImageRef: imageRef,
EngineEndpoint: "http://galaxy-game-" + gameID + ":8080",
StatePath: "/var/lib/galaxy/games/" + gameID,
DockerNetwork: "galaxy-net",
StartedAt: &started,
LastOpAt: now,
CreatedAt: now,
}
}
func TestUpsertAndGetRoundTrip(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
got, err := store.Get(ctx, record.GameID)
require.NoError(t, err)
assert.Equal(t, record.GameID, got.GameID)
assert.Equal(t, record.Status, got.Status)
assert.Equal(t, record.CurrentContainerID, got.CurrentContainerID)
assert.Equal(t, record.CurrentImageRef, got.CurrentImageRef)
assert.Equal(t, record.EngineEndpoint, got.EngineEndpoint)
assert.Equal(t, record.StatePath, got.StatePath)
assert.Equal(t, record.DockerNetwork, got.DockerNetwork)
require.NotNil(t, got.StartedAt)
assert.True(t, record.StartedAt.Equal(*got.StartedAt))
assert.Equal(t, time.UTC, got.StartedAt.Location())
assert.Equal(t, time.UTC, got.LastOpAt.Location())
assert.Equal(t, time.UTC, got.CreatedAt.Location())
assert.Nil(t, got.StoppedAt)
assert.Nil(t, got.RemovedAt)
}
func TestGetReturnsNotFound(t *testing.T) {
ctx := context.Background()
store := newStore(t)
_, err := store.Get(ctx, "game-missing")
require.ErrorIs(t, err, runtime.ErrNotFound)
}
func TestUpsertOverwritesMutableColumnsPreservesCreatedAt(t *testing.T) {
ctx := context.Background()
store := newStore(t)
original := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, original))
updated := original
updated.CurrentContainerID = "container-2"
updated.CurrentImageRef = "galaxy/game:v1.2.4"
newStarted := original.LastOpAt.Add(time.Minute)
updated.StartedAt = &newStarted
updated.LastOpAt = newStarted
// Fresh CreatedAt simulates a caller passing "now"; the store must
// preserve the original CreatedAt value on conflict.
updated.CreatedAt = newStarted
require.NoError(t, store.Upsert(ctx, updated))
got, err := store.Get(ctx, original.GameID)
require.NoError(t, err)
assert.Equal(t, "container-2", got.CurrentContainerID)
assert.Equal(t, "galaxy/game:v1.2.4", got.CurrentImageRef)
assert.True(t, got.LastOpAt.Equal(newStarted))
assert.True(t, got.CreatedAt.Equal(original.CreatedAt),
"created_at must be preserved across upserts: got %s, want %s",
got.CreatedAt, original.CreatedAt)
}
func TestUpdateStatusRunningToStopped(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
now := record.LastOpAt.Add(2 * time.Minute)
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusRunning,
ExpectedContainerID: record.CurrentContainerID,
To: runtime.StatusStopped,
Now: now,
}))
got, err := store.Get(ctx, record.GameID)
require.NoError(t, err)
assert.Equal(t, runtime.StatusStopped, got.Status)
require.NotNil(t, got.StoppedAt)
assert.True(t, now.Equal(*got.StoppedAt))
assert.True(t, now.Equal(got.LastOpAt))
// container id is preserved on stop; cleanup later NULLs it.
assert.Equal(t, record.CurrentContainerID, got.CurrentContainerID)
}
func TestUpdateStatusRunningToRemovedClearsContainerID(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
now := record.LastOpAt.Add(time.Minute)
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusRemoved,
Now: now,
}))
got, err := store.Get(ctx, record.GameID)
require.NoError(t, err)
assert.Equal(t, runtime.StatusRemoved, got.Status)
require.NotNil(t, got.RemovedAt)
assert.True(t, now.Equal(*got.RemovedAt))
assert.True(t, now.Equal(got.LastOpAt))
assert.Empty(t, got.CurrentContainerID, "current_container_id must be NULL after removal")
}
func TestUpdateStatusStoppedToRemoved(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
stopAt := record.LastOpAt.Add(time.Minute)
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
Now: stopAt,
}))
removeAt := stopAt.Add(time.Hour)
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusStopped,
To: runtime.StatusRemoved,
Now: removeAt,
}))
got, err := store.Get(ctx, record.GameID)
require.NoError(t, err)
assert.Equal(t, runtime.StatusRemoved, got.Status)
require.NotNil(t, got.RemovedAt)
assert.True(t, removeAt.Equal(*got.RemovedAt))
assert.True(t, removeAt.Equal(got.LastOpAt))
require.NotNil(t, got.StoppedAt, "stopped_at must remain populated through removal")
assert.True(t, stopAt.Equal(*got.StoppedAt))
assert.Empty(t, got.CurrentContainerID)
}
func TestUpdateStatusReturnsConflictOnFromMismatch(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusStopped, // wrong
To: runtime.StatusRemoved,
Now: record.LastOpAt.Add(time.Minute),
})
require.ErrorIs(t, err, runtime.ErrConflict)
}
func TestUpdateStatusReturnsConflictOnContainerIDMismatch(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusRunning,
ExpectedContainerID: "container-other",
To: runtime.StatusStopped,
Now: record.LastOpAt.Add(time.Minute),
})
require.ErrorIs(t, err, runtime.ErrConflict)
}
func TestUpdateStatusReturnsNotFoundForMissing(t *testing.T) {
ctx := context.Background()
store := newStore(t)
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: "game-missing",
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
Now: time.Now().UTC(),
})
require.ErrorIs(t, err, runtime.ErrNotFound)
}
func TestUpdateStatusValidatesInputBeforeStore(t *testing.T) {
ctx := context.Background()
store := newStore(t)
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: "game-001",
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
// Now intentionally zero — validation must reject.
})
require.Error(t, err)
}
// TestUpdateStatusConcurrentCAS asserts the CAS guard: when two callers
// race to apply the running → stopped transition on the same row,
// exactly one wins (returns nil) and the other observes
// runtime.ErrConflict.
func TestUpdateStatusConcurrentCAS(t *testing.T) {
ctx := context.Background()
store := newStore(t)
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
require.NoError(t, store.Upsert(ctx, record))
const concurrency = 8
results := make([]error, concurrency)
var wg sync.WaitGroup
wg.Add(concurrency)
for index := range concurrency {
go func() {
defer wg.Done()
results[index] = store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: record.GameID,
ExpectedFrom: runtime.StatusRunning,
ExpectedContainerID: record.CurrentContainerID,
To: runtime.StatusStopped,
Now: record.LastOpAt.Add(time.Duration(index+1) * time.Second),
})
}()
}
wg.Wait()
wins, conflicts := 0, 0
for _, err := range results {
switch {
case err == nil:
wins++
case errors.Is(err, runtime.ErrConflict):
conflicts++
default:
t.Errorf("unexpected error from concurrent UpdateStatus: %v", err)
}
}
assert.Equal(t, 1, wins, "exactly one caller must win the CAS race")
assert.Equal(t, concurrency-1, conflicts, "the rest must observe runtime.ErrConflict")
}
func TestListByStatusReturnsExpectedRecords(t *testing.T) {
ctx := context.Background()
store := newStore(t)
a := runningRecord(t, "game-aaa", "container-a", "galaxy/game:v1.2.3")
b := runningRecord(t, "game-bbb", "container-b", "galaxy/game:v1.2.3")
c := runningRecord(t, "game-ccc", "container-c", "galaxy/game:v1.2.3")
for _, r := range []runtime.RuntimeRecord{a, b, c} {
require.NoError(t, store.Upsert(ctx, r))
}
stopAt := a.LastOpAt.Add(time.Minute)
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: b.GameID,
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
Now: stopAt,
}))
running, err := store.ListByStatus(ctx, runtime.StatusRunning)
require.NoError(t, err)
gotIDs := map[string]struct{}{}
for _, r := range running {
gotIDs[r.GameID] = struct{}{}
}
assert.Contains(t, gotIDs, a.GameID)
assert.Contains(t, gotIDs, c.GameID)
assert.NotContains(t, gotIDs, b.GameID)
stopped, err := store.ListByStatus(ctx, runtime.StatusStopped)
require.NoError(t, err)
require.Len(t, stopped, 1)
assert.Equal(t, b.GameID, stopped[0].GameID)
}
func TestListByStatusRejectsUnknown(t *testing.T) {
ctx := context.Background()
store := newStore(t)
_, err := store.ListByStatus(ctx, runtime.Status("exotic"))
require.Error(t, err)
}
func TestListReturnsEveryStatus(t *testing.T) {
ctx := context.Background()
store := newStore(t)
a := runningRecord(t, "game-aaa", "container-a", "galaxy/game:v1.2.3")
b := runningRecord(t, "game-bbb", "container-b", "galaxy/game:v1.2.3")
c := runningRecord(t, "game-ccc", "container-c", "galaxy/game:v1.2.3")
for _, r := range []runtime.RuntimeRecord{a, b, c} {
require.NoError(t, store.Upsert(ctx, r))
}
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: b.GameID,
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
Now: b.LastOpAt.Add(time.Minute),
}))
all, err := store.List(ctx)
require.NoError(t, err)
require.Len(t, all, 3)
gotIDs := map[string]runtime.Status{}
for _, r := range all {
gotIDs[r.GameID] = r.Status
}
assert.Equal(t, runtime.StatusRunning, gotIDs[a.GameID])
assert.Equal(t, runtime.StatusStopped, gotIDs[b.GameID])
assert.Equal(t, runtime.StatusRunning, gotIDs[c.GameID])
}
func TestListReturnsNilWhenEmpty(t *testing.T) {
ctx := context.Background()
store := newStore(t)
all, err := store.List(ctx)
require.NoError(t, err)
assert.Nil(t, all)
}
func TestCountByStatusReturnsAllBuckets(t *testing.T) {
ctx := context.Background()
store := newStore(t)
a := runningRecord(t, "game-1", "container-1", "galaxy/game:v1.2.3")
b := runningRecord(t, "game-2", "container-2", "galaxy/game:v1.2.3")
c := runningRecord(t, "game-3", "container-3", "galaxy/game:v1.2.3")
for _, r := range []runtime.RuntimeRecord{a, b, c} {
require.NoError(t, store.Upsert(ctx, r))
}
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
GameID: b.GameID,
ExpectedFrom: runtime.StatusRunning,
To: runtime.StatusStopped,
Now: b.LastOpAt.Add(time.Minute),
}))
counts, err := store.CountByStatus(ctx)
require.NoError(t, err)
for _, status := range runtime.AllStatuses() {
_, ok := counts[status]
assert.True(t, ok, "status %q must appear in counts even when zero", status)
}
assert.Equal(t, 2, counts[runtime.StatusRunning])
assert.Equal(t, 1, counts[runtime.StatusStopped])
assert.Equal(t, 0, counts[runtime.StatusRemoved])
}
func TestNewRejectsNilDB(t *testing.T) {
_, err := runtimerecordstore.New(runtimerecordstore.Config{OperationTimeout: time.Second})
require.Error(t, err)
}
func TestNewRejectsNonPositiveTimeout(t *testing.T) {
_, err := runtimerecordstore.New(runtimerecordstore.Config{
DB: pgtest.Ensure(t).Pool(),
})
require.Error(t, err)
}
@@ -0,0 +1,117 @@
// Package gamelease implements the Redis-backed adapter for
// `ports.GameLeaseStore`.
//
// The lease guards every lifecycle operation Runtime Manager runs
// against one game (start, stop, restart, patch, cleanup, plus the
// reconciler's drift mutations). Acquisition uses `SET NX PX <ttl>`
// with a random caller token; release runs a Lua compare-and-delete
// so a holder that lost the lease through TTL expiry cannot wipe
// another caller's claim.
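//
// A typical holder flow, sketched with illustrative names (token
// generation and TTL policy belong to the caller):
//
//	acquired, err := leases.TryAcquire(ctx, gameID, token, 30*time.Second)
//	if err != nil {
//		return err // transport failure, not a lost race
//	}
//	if !acquired {
//		return runtime.ErrConflict // another holder owns the lease
//	}
//	defer func() { _ = leases.Release(ctx, gameID, token) }()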
package gamelease
import (
"context"
"errors"
"fmt"
"strings"
"time"
"galaxy/rtmanager/internal/adapters/redisstate"
"galaxy/rtmanager/internal/ports"
"github.com/redis/go-redis/v9"
)
// releaseScript removes the per-game lease only when the supplied token
// still owns it. Compare-and-delete prevents a TTL-expired holder from
// clearing another caller's claim.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
return redis.call("DEL", KEYS[1])
end
return 0
`)
// Config configures one Redis-backed game lease store instance. The
// store does not own the redis client lifecycle; the caller (typically
// the service runtime) opens and closes it.
type Config struct {
// Client stores the Redis client the store uses for every command.
Client *redis.Client
}
// Store persists the per-game lifecycle lease in Redis.
type Store struct {
client *redis.Client
keys redisstate.Keyspace
}
// New constructs one Redis-backed game lease store from cfg.
func New(cfg Config) (*Store, error) {
if cfg.Client == nil {
return nil, errors.New("new rtmanager game lease store: nil redis client")
}
return &Store{
client: cfg.Client,
keys: redisstate.Keyspace{},
}, nil
}
// TryAcquire attempts to acquire the per-game lease for gameID owned by
// token for ttl. The returned acquired flag is true on a successful
// claim and false when another caller still owns the lease. A non-nil
// error reports a transport failure and must not be confused with a
// missed lease.
func (store *Store) TryAcquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error) {
if store == nil || store.client == nil {
return false, errors.New("try acquire game lease: nil store")
}
if ctx == nil {
return false, errors.New("try acquire game lease: nil context")
}
if strings.TrimSpace(gameID) == "" {
return false, errors.New("try acquire game lease: game id must not be empty")
}
if strings.TrimSpace(token) == "" {
return false, errors.New("try acquire game lease: token must not be empty")
}
if ttl <= 0 {
return false, errors.New("try acquire game lease: ttl must be positive")
}
acquired, err := store.client.SetNX(ctx, store.keys.GameLease(gameID), token, ttl).Result()
if err != nil {
return false, fmt.Errorf("try acquire game lease: %w", err)
}
return acquired, nil
}
// Release removes the per-game lease for gameID only when token still
// matches the stored owner value. A token mismatch is a silent no-op.
func (store *Store) Release(ctx context.Context, gameID, token string) error {
if store == nil || store.client == nil {
return errors.New("release game lease: nil store")
}
if ctx == nil {
return errors.New("release game lease: nil context")
}
if strings.TrimSpace(gameID) == "" {
return errors.New("release game lease: game id must not be empty")
}
if strings.TrimSpace(token) == "" {
return errors.New("release game lease: token must not be empty")
}
if err := releaseScript.Run(
ctx,
store.client,
[]string{store.keys.GameLease(gameID)},
token,
).Err(); err != nil {
return fmt.Errorf("release game lease: %w", err)
}
return nil
}
// Compile-time assertion: Store implements ports.GameLeaseStore.
var _ ports.GameLeaseStore = (*Store)(nil)
@@ -0,0 +1,133 @@
package gamelease_test
import (
"context"
"testing"
"time"
"galaxy/rtmanager/internal/adapters/redisstate"
"galaxy/rtmanager/internal/adapters/redisstate/gamelease"
"github.com/alicebob/miniredis/v2"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func newLeaseStore(t *testing.T) (*gamelease.Store, *miniredis.Miniredis) {
t.Helper()
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
store, err := gamelease.New(gamelease.Config{Client: client})
require.NoError(t, err)
return store, server
}
func TestNewRejectsNilClient(t *testing.T) {
_, err := gamelease.New(gamelease.Config{})
require.Error(t, err)
}
func TestTryAcquireSetsKeyAndTTL(t *testing.T) {
store, server := newLeaseStore(t)
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
require.NoError(t, err)
assert.True(t, acquired)
key := redisstate.Keyspace{}.GameLease("game-1")
assert.True(t, server.Exists(key), "key %q must exist after TryAcquire", key)
stored, err := server.Get(key)
require.NoError(t, err)
assert.Equal(t, "token-A", stored)
// TTL must be positive (miniredis returns the remaining duration).
ttl := server.TTL(key)
assert.Greater(t, ttl, time.Duration(0))
}
func TestTryAcquireReturnsFalseWhenAlreadyHeld(t *testing.T) {
store, _ := newLeaseStore(t)
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
require.NoError(t, err)
require.True(t, acquired)
acquired, err = store.TryAcquire(context.Background(), "game-1", "token-B", time.Minute)
require.NoError(t, err)
assert.False(t, acquired)
}
func TestReleaseRemovesKeyForOwnerToken(t *testing.T) {
store, server := newLeaseStore(t)
_, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
require.NoError(t, err)
require.NoError(t, store.Release(context.Background(), "game-1", "token-A"))
key := redisstate.Keyspace{}.GameLease("game-1")
assert.False(t, server.Exists(key), "key %q must be deleted after Release", key)
}
func TestReleaseIsNoOpForForeignToken(t *testing.T) {
store, server := newLeaseStore(t)
_, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
require.NoError(t, err)
require.NoError(t, store.Release(context.Background(), "game-1", "token-B"))
key := redisstate.Keyspace{}.GameLease("game-1")
assert.True(t, server.Exists(key), "key %q must still exist when foreign token is released", key)
stored, err := server.Get(key)
require.NoError(t, err)
assert.Equal(t, "token-A", stored)
}
func TestTryAcquireSucceedsAfterTTLExpiry(t *testing.T) {
store, server := newLeaseStore(t)
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
require.NoError(t, err)
require.True(t, acquired)
server.FastForward(2 * time.Minute)
acquired, err = store.TryAcquire(context.Background(), "game-1", "token-B", time.Minute)
require.NoError(t, err)
assert.True(t, acquired)
}
func TestTryAcquireRejectsInvalidArguments(t *testing.T) {
store, _ := newLeaseStore(t)
_, err := store.TryAcquire(context.Background(), "", "token", time.Minute)
require.Error(t, err)
_, err = store.TryAcquire(context.Background(), "game-1", "", time.Minute)
require.Error(t, err)
_, err = store.TryAcquire(context.Background(), "game-1", "token", 0)
require.Error(t, err)
}
func TestReleaseRejectsInvalidArguments(t *testing.T) {
store, _ := newLeaseStore(t)
require.Error(t, store.Release(context.Background(), "", "token"))
require.Error(t, store.Release(context.Background(), "game-1", ""))
}
func TestKeyspaceGameLeaseIsPrefixedAndEncoded(t *testing.T) {
key := redisstate.Keyspace{}.GameLease("game with spaces")
assert.NotEmpty(t, key)
assert.Contains(t, key, "rtmanager:game_lease:")
suffix := key[len("rtmanager:game_lease:"):]
// base64url-encoded suffix must not contain the original spaces.
assert.NotContains(t, suffix, " ")
}
@@ -0,0 +1,44 @@
// Package redisstate hosts the Runtime Manager Redis adapters that share
// a single keyspace. Each sibling subpackage (e.g. `streamoffsets`)
// implements one port and uses Keyspace to compose its keys, so the
// Redis namespace stays documented in one place and under one prefix.
//
// The package itself only declares the keyspace; concrete stores live in
// nested packages so dependencies (testcontainers, miniredis) stay out
// of consumer build graphs that do not need them.
package redisstate
import "encoding/base64"
// defaultPrefix is the mandatory `rtmanager:` namespace prefix shared by
// every Runtime Manager Redis key.
const defaultPrefix = "rtmanager:"
// Keyspace builds the Runtime Manager Redis keys. The namespace covers
// the stream consumer offsets and the per-game lifecycle lease in v1.
//
// Dynamic key segments are encoded with base64url so raw key structure
// does not depend on caller-provided characters; this matches the
// encoding chosen by `lobby/internal/adapters/redisstate.Keyspace`.
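//
// Illustrative output (base64url raw encoding of "game-42"):
//
//	Keyspace{}.GameLease("game-42") // "rtmanager:game_lease:Z2FtZS00Mg"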
type Keyspace struct{}
// StreamOffset returns the Redis key that stores the last successfully
// processed entry id for one Redis Stream consumer. The streamLabel is
// the short logical identifier of the consumer (e.g. `start_jobs`,
// `stop_jobs`), not the full stream name; it stays stable when the
// underlying stream key is renamed.
func (Keyspace) StreamOffset(streamLabel string) string {
return defaultPrefix + "stream_offsets:" + encodeKeyComponent(streamLabel)
}
// GameLease returns the Redis key that stores the per-game lifecycle
// lease guarding start / stop / restart / patch / cleanup operations
// against the same game. The gameID is base64url-encoded so callers can
// pass any opaque identifier without escaping raw key characters.
func (Keyspace) GameLease(gameID string) string {
return defaultPrefix + "game_lease:" + encodeKeyComponent(gameID)
}
func encodeKeyComponent(value string) string {
return base64.RawURLEncoding.EncodeToString([]byte(value))
}
@@ -0,0 +1,94 @@
// Package streamoffsets implements the Redis-backed adapter for
// `ports.StreamOffsetStore`.
//
// The start-jobs and stop-jobs consumers call Load on startup to
// resume from the persisted offset and Save after every successful
// message handling. Keys are produced by
// `redisstate.Keyspace.StreamOffset`, mirroring the lobby pattern.
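//
// A resume/commit sketch (stream label and entryID are illustrative):
//
//	last, found, err := offsets.Load(ctx, "start_jobs")
//	if err != nil {
//		return err
//	}
//	from := "0-0" // read floor when no offset is stored yet
//	if found {
//		from = last
//	}
//	// consume stream entries after `from`; after each handled entry:
//	err = offsets.Save(ctx, "start_jobs", entryID)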
package streamoffsets
import (
"context"
"errors"
"fmt"
"strings"
"galaxy/rtmanager/internal/adapters/redisstate"
"galaxy/rtmanager/internal/ports"
"github.com/redis/go-redis/v9"
)
// Config configures one Redis-backed stream-offset store instance. The
// store does not own the redis client lifecycle; the caller (typically
// the service runtime) opens and closes it.
type Config struct {
// Client stores the Redis client the store uses for every command.
Client *redis.Client
}
// Store persists Runtime Manager stream consumer offsets in Redis.
type Store struct {
client *redis.Client
keys redisstate.Keyspace
}
// New constructs one Redis-backed stream-offset store from cfg.
func New(cfg Config) (*Store, error) {
if cfg.Client == nil {
return nil, errors.New("new rtmanager stream offset store: nil redis client")
}
return &Store{
client: cfg.Client,
keys: redisstate.Keyspace{},
}, nil
}
// Load returns the last processed entry id for streamLabel when one is
// stored. A missing key returns ("", false, nil).
func (store *Store) Load(ctx context.Context, streamLabel string) (string, bool, error) {
if store == nil || store.client == nil {
return "", false, errors.New("load rtmanager stream offset: nil store")
}
if ctx == nil {
return "", false, errors.New("load rtmanager stream offset: nil context")
}
if strings.TrimSpace(streamLabel) == "" {
return "", false, errors.New("load rtmanager stream offset: stream label must not be empty")
}
value, err := store.client.Get(ctx, store.keys.StreamOffset(streamLabel)).Result()
switch {
case errors.Is(err, redis.Nil):
return "", false, nil
case err != nil:
return "", false, fmt.Errorf("load rtmanager stream offset: %w", err)
}
return value, true, nil
}
// Save stores entryID as the new offset for streamLabel. The key has no
// TTL — offsets are durable and only overwritten by subsequent Saves.
func (store *Store) Save(ctx context.Context, streamLabel, entryID string) error {
if store == nil || store.client == nil {
return errors.New("save rtmanager stream offset: nil store")
}
if ctx == nil {
return errors.New("save rtmanager stream offset: nil context")
}
if strings.TrimSpace(streamLabel) == "" {
return errors.New("save rtmanager stream offset: stream label must not be empty")
}
if strings.TrimSpace(entryID) == "" {
return errors.New("save rtmanager stream offset: entry id must not be empty")
}
if err := store.client.Set(ctx, store.keys.StreamOffset(streamLabel), entryID, 0).Err(); err != nil {
return fmt.Errorf("save rtmanager stream offset: %w", err)
}
return nil
}
// Ensure Store satisfies the ports.StreamOffsetStore interface at
// compile time.
var _ ports.StreamOffsetStore = (*Store)(nil)
@@ -0,0 +1,86 @@
package streamoffsets_test
import (
"context"
"testing"
"galaxy/rtmanager/internal/adapters/redisstate"
"galaxy/rtmanager/internal/adapters/redisstate/streamoffsets"
"github.com/alicebob/miniredis/v2"
"github.com/redis/go-redis/v9"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func newOffsetStore(t *testing.T) (*streamoffsets.Store, *miniredis.Miniredis) {
t.Helper()
server := miniredis.RunT(t)
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
t.Cleanup(func() { _ = client.Close() })
store, err := streamoffsets.New(streamoffsets.Config{Client: client})
require.NoError(t, err)
return store, server
}
func TestNewRejectsNilClient(t *testing.T) {
_, err := streamoffsets.New(streamoffsets.Config{})
require.Error(t, err)
}
func TestLoadMissingReturnsNotFound(t *testing.T) {
store, _ := newOffsetStore(t)
id, found, err := store.Load(context.Background(), "start_jobs")
require.NoError(t, err)
assert.False(t, found)
assert.Empty(t, id)
}
func TestSaveLoadRoundTrip(t *testing.T) {
store, server := newOffsetStore(t)
require.NoError(t, store.Save(context.Background(), "start_jobs", "1700000000000-0"))
id, found, err := store.Load(context.Background(), "start_jobs")
require.NoError(t, err)
assert.True(t, found)
assert.Equal(t, "1700000000000-0", id)
// The persisted key must follow the rtmanager keyspace prefix.
expectedKey := redisstate.Keyspace{}.StreamOffset("start_jobs")
assert.True(t, server.Exists(expectedKey),
"key %q must exist after Save", expectedKey)
}
func TestSaveOverwritesPriorValue(t *testing.T) {
store, _ := newOffsetStore(t)
require.NoError(t, store.Save(context.Background(), "start_jobs", "100-0"))
require.NoError(t, store.Save(context.Background(), "start_jobs", "200-0"))
id, found, err := store.Load(context.Background(), "start_jobs")
require.NoError(t, err)
assert.True(t, found)
assert.Equal(t, "200-0", id)
}
func TestLoadAndSaveRejectInvalidArguments(t *testing.T) {
store, _ := newOffsetStore(t)
require.Error(t, store.Save(context.Background(), "", "100-0"))
require.Error(t, store.Save(context.Background(), "start_jobs", ""))
_, _, err := store.Load(context.Background(), "")
require.Error(t, err)
}
func TestKeyspaceStreamOffsetIsPrefixed(t *testing.T) {
key := redisstate.Keyspace{}.StreamOffset("start_jobs")
assert.NotEmpty(t, key)
assert.Contains(t, key, "rtmanager:stream_offsets:")
// base64url-encoded label must not contain raw colons or spaces.
suffix := key[len("rtmanager:stream_offsets:"):]
assert.NotContains(t, suffix, ":")
assert.NotContains(t, suffix, " ")
}
@@ -0,0 +1,367 @@
package internalhttp
import (
"bytes"
"context"
"errors"
"io"
"net/http"
"net/http/httptest"
"path/filepath"
"runtime"
"strings"
"sync"
"testing"
"time"
"galaxy/rtmanager/internal/api/internalhttp/handlers"
domainruntime "galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
"galaxy/rtmanager/internal/service/cleanupcontainer"
"galaxy/rtmanager/internal/service/patchruntime"
"galaxy/rtmanager/internal/service/restartruntime"
"galaxy/rtmanager/internal/service/startruntime"
"galaxy/rtmanager/internal/service/stopruntime"
"github.com/getkin/kin-openapi/openapi3"
"github.com/getkin/kin-openapi/openapi3filter"
"github.com/getkin/kin-openapi/routers"
"github.com/getkin/kin-openapi/routers/legacy"
"github.com/stretchr/testify/require"
)
// TestInternalRESTConformance loads the OpenAPI specification, drives
// every runtime operation against the live internal HTTP listener
// backed by stub services, and validates each response body against
// the spec via `openapi3filter.ValidateResponse`. The test catches
// drift between the wire shape produced by the handler layer and the
// frozen contract; failure-path response shapes are validated by the
// per-handler tests in `handlers/<op>_test.go`.
func TestInternalRESTConformance(t *testing.T) {
t.Parallel()
doc := loadConformanceSpec(t)
router, err := legacy.NewRouter(doc)
require.NoError(t, err)
deps := newConformanceDeps(t)
server, err := NewServer(newConformanceConfig(), Dependencies{
Logger: nil,
Telemetry: nil,
Readiness: nil,
RuntimeRecords: deps.records,
StartRuntime: deps.start,
StopRuntime: deps.stop,
RestartRuntime: deps.restart,
PatchRuntime: deps.patch,
CleanupContainer: deps.cleanup,
})
require.NoError(t, err)
cases := []conformanceCase{
{
name: "internalListRuntimes",
method: http.MethodGet,
path: "/api/v1/internal/runtimes",
},
{
name: "internalGetRuntime",
method: http.MethodGet,
path: "/api/v1/internal/runtimes/" + conformanceGameID,
},
{
name: "internalStartRuntime",
method: http.MethodPost,
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/start",
contentType: "application/json",
body: `{"image_ref":"galaxy/game:v1.2.3"}`,
},
{
name: "internalStopRuntime",
method: http.MethodPost,
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/stop",
contentType: "application/json",
body: `{"reason":"admin_request"}`,
},
{
name: "internalRestartRuntime",
method: http.MethodPost,
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/restart",
},
{
name: "internalPatchRuntime",
method: http.MethodPost,
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/patch",
contentType: "application/json",
body: `{"image_ref":"galaxy/game:v1.2.4"}`,
},
{
name: "internalCleanupRuntimeContainer",
method: http.MethodDelete,
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/container",
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
runConformanceCase(t, server.handler, router, tc)
})
}
}
// conformanceGameID is the {game_id} path value used in every per-game
// conformance request.
const conformanceGameID = "game-conformance"
// conformanceServerURL mirrors the canonical `servers[0].url` entry in
// `rtmanager/api/internal-openapi.yaml`. The legacy router matches
// requests against this prefix; updating the spec's server URL
// requires updating this constant.
const conformanceServerURL = "http://localhost:8096"
// conformanceCase describes one request the conformance test drives.
type conformanceCase struct {
name string
method string
path string
contentType string
body string
}
func runConformanceCase(t *testing.T, handler http.Handler, router routers.Router, tc conformanceCase) {
t.Helper()
// Drive the handler with the path-only form so the listener's
// http.ServeMux matches the registered routes (which use raw paths,
// without the OpenAPI server URL prefix).
var bodyReader io.Reader
if tc.body != "" {
bodyReader = strings.NewReader(tc.body)
}
request := httptest.NewRequest(tc.method, tc.path, bodyReader)
if tc.contentType != "" {
request.Header.Set("Content-Type", tc.contentType)
}
request.Header.Set("X-Galaxy-Caller", "admin")
recorder := httptest.NewRecorder()
handler.ServeHTTP(recorder, request)
require.Equalf(t, http.StatusOK, recorder.Code, "operation %s returned %d: %s", tc.name, recorder.Code, recorder.Body.String())
// kin-openapi's legacy router requires the request URL to match a
// `servers[].url` entry; rebuild the validation request with the
// canonical local server URL declared in the spec.
validationURL := conformanceServerURL + tc.path
validationRequest := httptest.NewRequest(tc.method, validationURL, bodyReaderFor(tc.body))
if tc.contentType != "" {
validationRequest.Header.Set("Content-Type", tc.contentType)
}
validationRequest.Header.Set("X-Galaxy-Caller", "admin")
route, pathParams, err := router.FindRoute(validationRequest)
require.NoError(t, err)
requestInput := &openapi3filter.RequestValidationInput{
Request: validationRequest,
PathParams: pathParams,
Route: route,
Options: &openapi3filter.Options{
IncludeResponseStatus: true,
},
}
require.NoError(t, openapi3filter.ValidateRequest(context.Background(), requestInput))
responseInput := &openapi3filter.ResponseValidationInput{
RequestValidationInput: requestInput,
Status: recorder.Code,
Header: recorder.Header(),
Options: &openapi3filter.Options{
IncludeResponseStatus: true,
},
}
responseInput.SetBodyBytes(recorder.Body.Bytes())
require.NoError(t, openapi3filter.ValidateResponse(context.Background(), responseInput))
}
func loadConformanceSpec(t *testing.T) *openapi3.T {
t.Helper()
_, thisFile, _, ok := runtime.Caller(0)
require.True(t, ok)
specPath := filepath.Join(filepath.Dir(thisFile), "..", "..", "..", "api", "internal-openapi.yaml")
loader := openapi3.NewLoader()
doc, err := loader.LoadFromFile(specPath)
require.NoError(t, err)
require.NoError(t, doc.Validate(context.Background()))
return doc
}
func bodyReaderFor(raw string) io.Reader {
if raw == "" {
return http.NoBody
}
return bytes.NewBufferString(raw)
}
// conformanceDeps groups the stub collaborators handed to the listener.
type conformanceDeps struct {
records *conformanceRecords
start *conformanceStart
stop *conformanceStop
restart *conformanceRestart
patch *conformancePatch
cleanup *conformanceCleanup
}
func newConformanceDeps(t *testing.T) *conformanceDeps {
t.Helper()
return &conformanceDeps{
records: newConformanceRecords(),
start: &conformanceStart{},
stop: &conformanceStop{},
restart: &conformanceRestart{},
patch: &conformancePatch{},
cleanup: &conformanceCleanup{},
}
}
func newConformanceConfig() Config {
return Config{
Addr: ":0",
ReadHeaderTimeout: time.Second,
ReadTimeout: time.Second,
WriteTimeout: time.Second,
IdleTimeout: time.Second,
}
}
// conformanceRecord builds a canonical running record used by every
// stub service.
func conformanceRecord() domainruntime.RuntimeRecord {
started := time.Date(2026, 4, 26, 13, 0, 0, 0, time.UTC)
return domainruntime.RuntimeRecord{
GameID: conformanceGameID,
Status: domainruntime.StatusRunning,
CurrentContainerID: "container-conformance",
CurrentImageRef: "galaxy/game:v1.2.3",
EngineEndpoint: "http://galaxy-game-" + conformanceGameID + ":8080",
StatePath: "/var/lib/galaxy/" + conformanceGameID,
DockerNetwork: "galaxy-engine",
StartedAt: &started,
LastOpAt: started,
CreatedAt: started,
}
}
// conformanceRecords is an in-memory record store seeded with one
// canonical record so the get / list endpoints have something to
// return.
type conformanceRecords struct {
mu sync.Mutex
stored map[string]domainruntime.RuntimeRecord
}
func newConformanceRecords() *conformanceRecords {
return &conformanceRecords{
stored: map[string]domainruntime.RuntimeRecord{
conformanceGameID: conformanceRecord(),
},
}
}
func (s *conformanceRecords) Get(_ context.Context, gameID string) (domainruntime.RuntimeRecord, error) {
s.mu.Lock()
defer s.mu.Unlock()
record, ok := s.stored[gameID]
if !ok {
return domainruntime.RuntimeRecord{}, domainruntime.ErrNotFound
}
return record, nil
}
func (s *conformanceRecords) Upsert(_ context.Context, _ domainruntime.RuntimeRecord) error {
return errors.New("not used in conformance test")
}
func (s *conformanceRecords) UpdateStatus(_ context.Context, _ ports.UpdateStatusInput) error {
return errors.New("not used in conformance test")
}
func (s *conformanceRecords) ListByStatus(_ context.Context, _ domainruntime.Status) ([]domainruntime.RuntimeRecord, error) {
return nil, errors.New("not used in conformance test")
}
func (s *conformanceRecords) List(_ context.Context) ([]domainruntime.RuntimeRecord, error) {
s.mu.Lock()
defer s.mu.Unlock()
out := make([]domainruntime.RuntimeRecord, 0, len(s.stored))
for _, record := range s.stored {
out = append(out, record)
}
return out, nil
}
// conformanceStart is the stub StartService used by the conformance
// test. Every Handle call returns the canonical record.
type conformanceStart struct{}
func (s *conformanceStart) Handle(_ context.Context, _ startruntime.Input) (startruntime.Result, error) {
return startruntime.Result{
Record: conformanceRecord(),
Outcome: "success",
}, nil
}
type conformanceStop struct{}
func (s *conformanceStop) Handle(_ context.Context, _ stopruntime.Input) (stopruntime.Result, error) {
rec := conformanceRecord()
rec.Status = domainruntime.StatusStopped
stopped := rec.LastOpAt.Add(time.Second)
rec.StoppedAt = &stopped
rec.LastOpAt = stopped
return stopruntime.Result{Record: rec, Outcome: "success"}, nil
}
type conformanceRestart struct{}
func (s *conformanceRestart) Handle(_ context.Context, _ restartruntime.Input) (restartruntime.Result, error) {
return restartruntime.Result{Record: conformanceRecord(), Outcome: "success"}, nil
}
type conformancePatch struct{}
func (s *conformancePatch) Handle(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
rec := conformanceRecord()
if in.NewImageRef != "" {
rec.CurrentImageRef = in.NewImageRef
}
return patchruntime.Result{Record: rec, Outcome: "success"}, nil
}
type conformanceCleanup struct{}
func (s *conformanceCleanup) Handle(_ context.Context, _ cleanupcontainer.Input) (cleanupcontainer.Result, error) {
rec := conformanceRecord()
rec.Status = domainruntime.StatusRemoved
rec.CurrentContainerID = ""
removed := rec.LastOpAt.Add(time.Minute)
rec.RemovedAt = &removed
rec.LastOpAt = removed
return cleanupcontainer.Result{Record: rec, Outcome: "success"}, nil
}
// Compile-time guards: the stubs must satisfy the handler-level
// service ports plus ports.RuntimeRecordStore so the listener accepts
// them.
var (
_ handlers.StartService = (*conformanceStart)(nil)
_ handlers.StopService = (*conformanceStop)(nil)
_ handlers.RestartService = (*conformanceRestart)(nil)
_ handlers.PatchService = (*conformancePatch)(nil)
_ handlers.CleanupService = (*conformanceCleanup)(nil)
_ ports.RuntimeRecordStore = (*conformanceRecords)(nil)
)
@@ -0,0 +1,55 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/service/cleanupcontainer"
"galaxy/rtmanager/internal/service/startruntime"
)
// newCleanupHandler returns the handler for
// `DELETE /api/v1/internal/runtimes/{game_id}/container`. The OpenAPI
// spec declares no request body for this operation; any client-provided
// body is ignored.
func newCleanupHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.cleanup")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.CleanupContainer == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"cleanup container service is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
result, err := deps.CleanupContainer.Handle(request.Context(), cleanupcontainer.Input{
GameID: gameID,
OpSource: resolveOpSource(request),
SourceRef: requestSourceRef(request),
})
if err != nil {
logger.ErrorContext(request.Context(), "cleanup container service errored",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"cleanup container service failed",
)
return
}
if result.Outcome == operation.OutcomeFailure {
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
}
}
@@ -0,0 +1,238 @@
package handlers
import (
"encoding/json"
"errors"
"io"
"log/slog"
"net/http"
"strings"
"time"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/service/startruntime"
)
// JSONContentType is the Content-Type used by every internal REST
// response. Exported so the listener-level tests can match it without
// re-declaring the constant.
const JSONContentType = "application/json; charset=utf-8"
// gameIDPathParam is the name of the {game_id} path variable shared by
// every per-game runtime endpoint.
const gameIDPathParam = "game_id"
// callerHeader is the HTTP header that distinguishes Game Master from
// Admin Service in the operation log. Documented in
// `rtmanager/api/internal-openapi.yaml` and
// `rtmanager/docs/services.md` §18.
const callerHeader = "X-Galaxy-Caller"
// errorCodeDockerUnavailable mirrors the OpenAPI error code value. The
// lifecycle services do not currently emit it (they use
// `service_unavailable` for Docker daemon failures); the handler layer
// maps it to 503 anyway so future producers do not require a handler
// change.
const errorCodeDockerUnavailable = "docker_unavailable"
// errorBody mirrors the `error` element of the OpenAPI ErrorResponse
// schema.
type errorBody struct {
Code string `json:"code"`
Message string `json:"message"`
}
// errorResponse mirrors the OpenAPI ErrorResponse envelope.
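//
// For illustration, a not-found failure encodes as:
//
//	{"error":{"code":"not_found","message":"runtime record not found"}}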
type errorResponse struct {
Error errorBody `json:"error"`
}
// runtimeRecordResponse mirrors the OpenAPI RuntimeRecord schema.
// Required fields use plain strings; nullable fields use pointers so an
// absent value encodes as the JSON literal `null` (matches the
// `nullable: true` declaration in the spec). Times are RFC3339 UTC.
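//
// For illustration, a removed record encodes roughly as (fields
// abbreviated):
//
//	{"game_id":"g1","status":"removed","current_container_id":null,
//	 "removed_at":"2026-04-26T13:01:00Z","last_op_at":"2026-04-26T13:01:00Z",...}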
type runtimeRecordResponse struct {
GameID string `json:"game_id"`
Status string `json:"status"`
CurrentContainerID *string `json:"current_container_id"`
CurrentImageRef *string `json:"current_image_ref"`
EngineEndpoint *string `json:"engine_endpoint"`
StatePath string `json:"state_path"`
DockerNetwork string `json:"docker_network"`
StartedAt *string `json:"started_at"`
StoppedAt *string `json:"stopped_at"`
RemovedAt *string `json:"removed_at"`
LastOpAt string `json:"last_op_at"`
CreatedAt string `json:"created_at"`
}
// runtimesListResponse mirrors the OpenAPI RuntimesList schema. Items
// is always non-nil so the JSON form carries `[]` rather than `null`
// for an empty result.
type runtimesListResponse struct {
Items []runtimeRecordResponse `json:"items"`
}
// encodeRuntimeRecord turns a domain RuntimeRecord into its wire shape.
func encodeRuntimeRecord(record runtime.RuntimeRecord) runtimeRecordResponse {
resp := runtimeRecordResponse{
GameID: record.GameID,
Status: string(record.Status),
StatePath: record.StatePath,
DockerNetwork: record.DockerNetwork,
LastOpAt: record.LastOpAt.UTC().Format(time.RFC3339Nano),
CreatedAt: record.CreatedAt.UTC().Format(time.RFC3339Nano),
}
if record.CurrentContainerID != "" {
v := record.CurrentContainerID
resp.CurrentContainerID = &v
}
if record.CurrentImageRef != "" {
v := record.CurrentImageRef
resp.CurrentImageRef = &v
}
if record.EngineEndpoint != "" {
v := record.EngineEndpoint
resp.EngineEndpoint = &v
}
if record.StartedAt != nil {
v := record.StartedAt.UTC().Format(time.RFC3339Nano)
resp.StartedAt = &v
}
if record.StoppedAt != nil {
v := record.StoppedAt.UTC().Format(time.RFC3339Nano)
resp.StoppedAt = &v
}
if record.RemovedAt != nil {
v := record.RemovedAt.UTC().Format(time.RFC3339Nano)
resp.RemovedAt = &v
}
return resp
}
// encodeRuntimesList builds the wire shape returned by the list handler.
// records may be nil (empty store); the result still carries an empty
// items slice so the JSON form is `{"items":[]}`.
func encodeRuntimesList(records []runtime.RuntimeRecord) runtimesListResponse {
resp := runtimesListResponse{
Items: make([]runtimeRecordResponse, 0, len(records)),
}
for _, record := range records {
resp.Items = append(resp.Items, encodeRuntimeRecord(record))
}
return resp
}
// writeJSON writes payload as a JSON response with the given status code.
func writeJSON(writer http.ResponseWriter, statusCode int, payload any) {
writer.Header().Set("Content-Type", JSONContentType)
writer.WriteHeader(statusCode)
_ = json.NewEncoder(writer).Encode(payload)
}
// writeError writes the canonical error envelope at statusCode.
func writeError(writer http.ResponseWriter, statusCode int, code, message string) {
writeJSON(writer, statusCode, errorResponse{
Error: errorBody{Code: code, Message: message},
})
}
// writeFailure writes the canonical error envelope using the HTTP
// status mapped from code. Used by every lifecycle handler when its
// service returns `Outcome=failure`.
func writeFailure(writer http.ResponseWriter, code, message string) {
writeError(writer, mapErrorCodeToStatus(code), code, message)
}
// mapErrorCodeToStatus maps a stable error code to the HTTP status
// declared by `rtmanager/api/internal-openapi.yaml`. Unknown codes
// degrade to 500 so a future error code that ships ahead of its
// handler-layer mapping still produces a structurally valid response.
func mapErrorCodeToStatus(code string) int {
switch code {
case startruntime.ErrorCodeInvalidRequest,
startruntime.ErrorCodeStartConfigInvalid,
startruntime.ErrorCodeImageRefNotSemver:
return http.StatusBadRequest
case startruntime.ErrorCodeNotFound:
return http.StatusNotFound
case startruntime.ErrorCodeConflict,
startruntime.ErrorCodeSemverPatchOnly:
return http.StatusConflict
case startruntime.ErrorCodeServiceUnavailable,
errorCodeDockerUnavailable:
return http.StatusServiceUnavailable
case startruntime.ErrorCodeImagePullFailed,
startruntime.ErrorCodeContainerStartFailed,
startruntime.ErrorCodeInternal:
return http.StatusInternalServerError
default:
return http.StatusInternalServerError
}
}
// decodeStrictJSON decodes one request body into target with strict
// JSON semantics: unknown fields are rejected and trailing content is
// rejected. Mirrors the helper used by lobby's internal HTTP layer.
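//
// Both an unknown field and trailing content fail (illustrative, with
// a hypothetical target req):
//
//	decodeStrictJSON(strings.NewReader(`{"image_ref":"x","bogus":1}`), &req) // error
//	decodeStrictJSON(strings.NewReader(`{"reason":"r"} {}`), &req)           // error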
func decodeStrictJSON(body io.Reader, target any) error {
decoder := json.NewDecoder(body)
decoder.DisallowUnknownFields()
if err := decoder.Decode(target); err != nil {
return err
}
if decoder.More() {
return errors.New("unexpected trailing content after JSON body")
}
return nil
}
// extractGameID pulls the {game_id} path variable from request. An empty
// or whitespace-only value writes a `400 invalid_request` and returns
// ok=false so callers can short-circuit.
func extractGameID(writer http.ResponseWriter, request *http.Request) (string, bool) {
raw := request.PathValue(gameIDPathParam)
if strings.TrimSpace(raw) == "" {
writeError(writer, http.StatusBadRequest,
startruntime.ErrorCodeInvalidRequest,
"game id is required",
)
return "", false
}
return raw, true
}
// resolveOpSource maps the X-Galaxy-Caller header to an
// `operation.OpSource`. Missing or unknown values default to
// `OpSourceAdminRest`, matching the contract documented in
// `rtmanager/api/internal-openapi.yaml`.
func resolveOpSource(request *http.Request) operation.OpSource {
switch strings.ToLower(strings.TrimSpace(request.Header.Get(callerHeader))) {
case "gm":
return operation.OpSourceGMRest
default:
return operation.OpSourceAdminRest
}
}
// requestSourceRef returns an opaque per-request reference recorded in
// `operation_log.source_ref`. v1 reads the `X-Request-ID` header when
// present so callers may correlate REST requests with audit rows; the
// listener does not currently install a request-id middleware so the
// header path is the only source.
func requestSourceRef(request *http.Request) string {
if v := strings.TrimSpace(request.Header.Get("X-Request-ID")); v != "" {
return v
}
return ""
}
// loggerFor returns a logger annotated with the operation tag. Each
// handler scopes its logs by op so operators filtering on
// `op=internal_rest.start` see exactly the lifecycle they care about.
func loggerFor(parent *slog.Logger, op string) *slog.Logger {
if parent == nil {
parent = slog.Default()
}
return parent.With("component", "internal_http.handlers", "op", op)
}
@@ -0,0 +1,197 @@
package handlers
import (
"context"
"encoding/json"
"errors"
"io"
"net/http"
"net/http/httptest"
"strings"
"sync"
"testing"
"time"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/ports"
"github.com/stretchr/testify/require"
)
// fixedClock is the wall-clock used to build canonical sample records
// across the handler tests. UTC Sunday 1pm 2026-04-26 is far enough in
// the future to be obvious in test output.
var fixedClock = time.Date(2026, 4, 26, 13, 0, 0, 0, time.UTC)
// sampleRunningRecord returns a canonical running record used by every
// happy-path test in this package.
func sampleRunningRecord(t *testing.T) runtime.RuntimeRecord {
t.Helper()
started := fixedClock
return runtime.RuntimeRecord{
GameID: "game-test",
Status: runtime.StatusRunning,
CurrentContainerID: "container-test",
CurrentImageRef: "galaxy/game:v1.2.3",
EngineEndpoint: "http://galaxy-game-game-test:8080",
StatePath: "/var/lib/galaxy/game-test",
DockerNetwork: "galaxy-engine",
StartedAt: &started,
LastOpAt: fixedClock,
CreatedAt: fixedClock,
}
}
// sampleStoppedRecord returns a canonical stopped record useful for
// cleanup-handler and list-handler tests.
func sampleStoppedRecord(t *testing.T) runtime.RuntimeRecord {
t.Helper()
started := fixedClock
stopped := fixedClock.Add(time.Minute)
return runtime.RuntimeRecord{
GameID: "game-stopped",
Status: runtime.StatusStopped,
CurrentContainerID: "container-stopped",
CurrentImageRef: "galaxy/game:v1.2.3",
EngineEndpoint: "http://galaxy-game-game-stopped:8080",
StatePath: "/var/lib/galaxy/game-stopped",
DockerNetwork: "galaxy-engine",
StartedAt: &started,
StoppedAt: &stopped,
LastOpAt: stopped,
CreatedAt: fixedClock,
}
}
// drive routes one request through a full mux configured by Register.
// It returns the captured ResponseRecorder so tests can assert on
// status, headers, and body.
func drive(t *testing.T, deps Dependencies, method, path string, headers http.Header, body io.Reader) *httptest.ResponseRecorder {
t.Helper()
mux := http.NewServeMux()
Register(mux, deps)
request := httptest.NewRequest(method, path, body)
for key, values := range headers {
for _, value := range values {
request.Header.Add(key, value)
}
}
recorder := httptest.NewRecorder()
mux.ServeHTTP(recorder, request)
return recorder
}
// decodeRecordResponse asserts that the response carried a 200 with
// the canonical content type and decodes the record body.
func decodeRecordResponse(t *testing.T, rec *httptest.ResponseRecorder) runtimeRecordResponse {
t.Helper()
require.Equalf(t, http.StatusOK, rec.Code, "expected 200, got body: %s", rec.Body.String())
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
var resp runtimeRecordResponse
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
return resp
}
// decodeErrorBody asserts the canonical error envelope and decodes it.
func decodeErrorBody(t *testing.T, rec *httptest.ResponseRecorder, wantStatus int) errorBody {
t.Helper()
require.Equalf(t, wantStatus, rec.Code, "expected %d, got body: %s", wantStatus, rec.Body.String())
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
var resp errorResponse
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
return resp.Error
}
// fakeRuntimeRecords is an in-memory ports.RuntimeRecordStore used by
// list / get tests. It is intentionally minimal — services use their
// own fakes in `internal/service/<op>/service_test.go` and do not
// share this helper.
type fakeRuntimeRecords struct {
mu sync.Mutex
stored map[string]runtime.RuntimeRecord
listErr error
getErr error
}
func newFakeRuntimeRecords() *fakeRuntimeRecords {
return &fakeRuntimeRecords{stored: map[string]runtime.RuntimeRecord{}}
}
func (s *fakeRuntimeRecords) put(record runtime.RuntimeRecord) {
s.mu.Lock()
defer s.mu.Unlock()
s.stored[record.GameID] = record
}
func (s *fakeRuntimeRecords) Get(_ context.Context, gameID string) (runtime.RuntimeRecord, error) {
s.mu.Lock()
defer s.mu.Unlock()
if s.getErr != nil {
return runtime.RuntimeRecord{}, s.getErr
}
record, ok := s.stored[gameID]
if !ok {
return runtime.RuntimeRecord{}, runtime.ErrNotFound
}
return record, nil
}
func (s *fakeRuntimeRecords) Upsert(_ context.Context, _ runtime.RuntimeRecord) error {
return errors.New("not used in handler tests")
}
func (s *fakeRuntimeRecords) UpdateStatus(_ context.Context, _ ports.UpdateStatusInput) error {
return errors.New("not used in handler tests")
}
func (s *fakeRuntimeRecords) ListByStatus(_ context.Context, _ runtime.Status) ([]runtime.RuntimeRecord, error) {
return nil, errors.New("not used in handler tests")
}
func (s *fakeRuntimeRecords) List(_ context.Context) ([]runtime.RuntimeRecord, error) {
s.mu.Lock()
defer s.mu.Unlock()
if s.listErr != nil {
return nil, s.listErr
}
if len(s.stored) == 0 {
return nil, nil
}
records := make([]runtime.RuntimeRecord, 0, len(s.stored))
for _, record := range s.stored {
records = append(records, record)
}
return records, nil
}
// jsonHeaders returns the default headers used by tests that send a
// JSON body.
func jsonHeaders() http.Header {
h := http.Header{}
h.Set("Content-Type", "application/json")
return h
}
// withCaller adds the X-Galaxy-Caller header to h and returns h. The
// helper exists to keep test cases readable when the header is the
// only difference between two table rows.
func withCaller(h http.Header, value string) http.Header {
if h == nil {
h = http.Header{}
}
h.Set(callerHeader, value)
return h
}
// strReader builds an io.Reader from raw JSON.
func strReader(raw string) io.Reader {
return strings.NewReader(raw)
}
// Compile-time assertion that the in-memory fake satisfies the port.
var _ ports.RuntimeRecordStore = (*fakeRuntimeRecords)(nil)
@@ -0,0 +1,55 @@
package handlers
import (
"errors"
"net/http"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/service/startruntime"
)
// newGetHandler returns the handler for
// `GET /api/v1/internal/runtimes/{game_id}`. The handler reads
// directly from the runtime record store and translates
// `runtime.ErrNotFound` to `404 not_found`. Like list, it does not
// run through the service layer and does not produce an operation_log
// row.
func newGetHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.get")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.RuntimeRecords == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"runtime records store is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
record, err := deps.RuntimeRecords.Get(request.Context(), gameID)
if errors.Is(err, runtime.ErrNotFound) {
writeError(writer, http.StatusNotFound,
startruntime.ErrorCodeNotFound,
"runtime record not found",
)
return
}
if err != nil {
logger.ErrorContext(request.Context(), "get runtime record",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"failed to read runtime record",
)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(record))
}
}
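// extractGameID is shared by every per-game handler in this package but
// is not shown in this hunk. A minimal sketch of its likely shape under
// the Go 1.22 method-aware mux patterns used by Register (the helper
// name matches the call sites; the body and error payload here are
// assumptions, not the committed implementation):
//
//	func extractGameID(writer http.ResponseWriter, request *http.Request) (string, bool) {
//		gameID := request.PathValue("game_id")
//		if gameID == "" {
//			writeError(writer, http.StatusBadRequest,
//				startruntime.ErrorCodeInvalidRequest,
//				"game_id path parameter is required",
//			)
//			return "", false
//		}
//		return gameID, true
//	}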
@@ -0,0 +1,69 @@
package handlers
import (
"log/slog"
"net/http"
"galaxy/rtmanager/internal/ports"
)
// Route paths registered by Register. The values match the routes
// frozen by `rtmanager/api/internal-openapi.yaml` and pinned by
// `rtmanager/contract_openapi_test.go`.
const (
listRuntimesPath = "/api/v1/internal/runtimes"
getRuntimePath = "/api/v1/internal/runtimes/{game_id}"
startRuntimePath = "/api/v1/internal/runtimes/{game_id}/start"
stopRuntimePath = "/api/v1/internal/runtimes/{game_id}/stop"
restartRuntimePath = "/api/v1/internal/runtimes/{game_id}/restart"
patchRuntimePath = "/api/v1/internal/runtimes/{game_id}/patch"
cleanupRuntimePath = "/api/v1/internal/runtimes/{game_id}/container"
)
// Dependencies bundles the collaborators required to serve the GM/Admin
// REST surface. Any service may be nil for tests that exercise a
// subset of the surface; in that case the unwired routes return
// `500 internal_error` (mirroring lobby's "service is not wired"
// pattern).
type Dependencies struct {
// Logger receives structured logs scoped per handler. nil falls back
// to slog.Default.
Logger *slog.Logger
// RuntimeRecords backs the read-only list and get handlers. They do
// not produce operation_log rows because they do not mutate state.
RuntimeRecords ports.RuntimeRecordStore
// StartRuntime executes the start lifecycle operation. Production
// wiring passes `*startruntime.Service` (the concrete service
// satisfies StartService).
StartRuntime StartService
// StopRuntime executes the stop lifecycle operation.
StopRuntime StopService
// RestartRuntime executes the restart lifecycle operation.
RestartRuntime RestartService
// PatchRuntime executes the patch lifecycle operation.
PatchRuntime PatchService
// CleanupContainer executes the cleanup_container lifecycle
// operation.
CleanupContainer CleanupService
}
// Register attaches every internal REST route to mux using deps. Each
// route reads its dependency lazily so a partially-wired Dependencies
// (e.g., a probe-only listener test) does not crash; missing
// dependencies surface as `500 internal_error`. Routes use Go 1.22
// method-aware mux patterns.
func Register(mux *http.ServeMux, deps Dependencies) {
mux.HandleFunc("GET "+listRuntimesPath, newListHandler(deps))
mux.HandleFunc("GET "+getRuntimePath, newGetHandler(deps))
mux.HandleFunc("POST "+startRuntimePath, newStartHandler(deps))
mux.HandleFunc("POST "+stopRuntimePath, newStopHandler(deps))
mux.HandleFunc("POST "+restartRuntimePath, newRestartHandler(deps))
mux.HandleFunc("POST "+patchRuntimePath, newPatchHandler(deps))
mux.HandleFunc("DELETE "+cleanupRuntimePath, newCleanupHandler(deps))
}
@@ -0,0 +1,610 @@
package handlers
import (
"context"
"net/http"
"testing"
"galaxy/rtmanager/internal/api/internalhttp/handlers/mocks"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/domain/runtime"
"galaxy/rtmanager/internal/service/cleanupcontainer"
"galaxy/rtmanager/internal/service/patchruntime"
"galaxy/rtmanager/internal/service/restartruntime"
"galaxy/rtmanager/internal/service/startruntime"
"galaxy/rtmanager/internal/service/stopruntime"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"go.uber.org/mock/gomock"
)
// Tests for the mutating handlers (start, stop, restart, patch,
// cleanup). Each handler delegates to one lifecycle service through a
// narrow `mockgen`-backed interface; the handler layer is responsible
// for input parsing, the `X-Galaxy-Caller` → `op_source` mapping, and
// the canonical `ErrorCode` → HTTP status table documented in
// `rtmanager/docs/services.md` §18.
// --- start ---
func TestStartHandlerReturnsRecordOnSuccess(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
record := sampleRunningRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
assert.Equal(t, "game-test", in.GameID)
assert.Equal(t, "galaxy/game:v1.2.3", in.ImageRef)
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
return startruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
deps := Dependencies{StartRuntime: mock}
rec := drive(t, deps, http.MethodPost, "/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "game-test", resp.GameID)
assert.Equal(t, "running", resp.Status)
}
func TestStartHandlerReturnsRecordOnReplayNoOp(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
record := sampleRunningRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.Any()).
Return(startruntime.Result{
Record: record,
Outcome: operation.OutcomeSuccess,
ErrorCode: startruntime.ErrorCodeReplayNoOp,
}, nil)
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "game-test", resp.GameID)
}
func TestStartHandlerMapsServiceFailures(t *testing.T) {
t.Parallel()
cases := []struct {
name string
errorCode string
wantStatus int
}{
{"start_config_invalid", startruntime.ErrorCodeStartConfigInvalid, http.StatusBadRequest},
{"image_pull_failed", startruntime.ErrorCodeImagePullFailed, http.StatusInternalServerError},
{"container_start_failed", startruntime.ErrorCodeContainerStartFailed, http.StatusInternalServerError},
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.Any()).
Return(startruntime.Result{
Outcome: operation.OutcomeFailure,
ErrorCode: tc.errorCode,
ErrorMessage: "synthetic " + tc.name,
}, nil)
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
body := decodeErrorBody(t, rec, tc.wantStatus)
assert.Equal(t, tc.errorCode, body.Code)
assert.Equal(t, "synthetic "+tc.name, body.Message)
})
}
}
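// Taken together, the failure tables in this file pin down the canonical
// ErrorCode → HTTP status mapping. The committed mapping is
// mapErrorCodeToStatus (named in the start handler's doc comment); the
// body below is a sketch reconstructed from the tables, not the
// committed code. Codes not listed in any table fall through to 500:
//
//	func mapErrorCodeToStatus(code string) int {
//		switch code {
//		case startruntime.ErrorCodeInvalidRequest,
//			startruntime.ErrorCodeStartConfigInvalid,
//			startruntime.ErrorCodeImageRefNotSemver:
//			return http.StatusBadRequest
//		case startruntime.ErrorCodeNotFound:
//			return http.StatusNotFound
//		case startruntime.ErrorCodeConflict,
//			startruntime.ErrorCodeSemverPatchOnly:
//			return http.StatusConflict
//		case startruntime.ErrorCodeServiceUnavailable:
//			return http.StatusServiceUnavailable
//		default:
//			return http.StatusInternalServerError
//		}
//	}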
func TestStartHandlerRejectsUnknownJSONFields(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"x","extra":"y"}`),
)
body := decodeErrorBody(t, rec, http.StatusBadRequest)
assert.Equal(t, "invalid_request", body.Code)
}
func TestStartHandlerRejectsMalformedJSON(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":`),
)
body := decodeErrorBody(t, rec, http.StatusBadRequest)
assert.Equal(t, "invalid_request", body.Code)
}
func TestStartHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
t.Parallel()
cases := []struct {
header string
want operation.OpSource
hdrLabel string
}{
{"gm", operation.OpSourceGMRest, "gm"},
{"GM", operation.OpSourceGMRest, "uppercase gm"},
{"admin", operation.OpSourceAdminRest, "admin"},
{"unknown", operation.OpSourceAdminRest, "unknown value"},
{"", operation.OpSourceAdminRest, "missing header"},
}
for _, tc := range cases {
t.Run(tc.hdrLabel, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
record := sampleRunningRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
assert.Equal(t, tc.want, in.OpSource)
return startruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
headers := jsonHeaders()
if tc.header != "" {
headers = withCaller(headers, tc.header)
}
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
headers,
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
require.Equal(t, http.StatusOK, rec.Code)
})
}
}
func TestStartHandlerForwardsXRequestIDAsSourceRef(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
assert.Equal(t, "req-42", in.SourceRef)
return startruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
})
headers := jsonHeaders()
headers.Set("X-Request-ID", "req-42")
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
headers,
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
require.Equal(t, http.StatusOK, rec.Code)
}
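// The two header-driven tests above fix the behaviour of the request
// helpers used by every mutating handler. Illustrative sketches follow;
// the helper names match the handler call sites, while the bodies are
// assumptions consistent with these tests:
//
//	func resolveOpSource(request *http.Request) operation.OpSource {
//		if strings.EqualFold(request.Header.Get(callerHeader), "gm") {
//			return operation.OpSourceGMRest
//		}
//		return operation.OpSourceAdminRest // "admin", unknown values, missing header
//	}
//
//	func requestSourceRef(request *http.Request) string {
//		return request.Header.Get("X-Request-ID")
//	}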
func TestStartHandlerReturnsInternalErrorWhenServiceErrors(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStartService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.Any()).
Return(startruntime.Result{}, assert.AnError)
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
func TestStartHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/start",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
// --- stop ---
func TestStopHandlerReturnsRecordOnSuccess(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStopService(ctrl)
record := sampleStoppedRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(stopruntime.Input{})).
DoAndReturn(func(_ context.Context, in stopruntime.Input) (stopruntime.Result, error) {
assert.Equal(t, "game-test", in.GameID)
assert.Equal(t, stopruntime.StopReasonAdminRequest, in.Reason)
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
return stopruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/stop",
jsonHeaders(),
strReader(`{"reason":"admin_request"}`),
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "stopped", resp.Status)
}
func TestStopHandlerMapsServiceFailures(t *testing.T) {
t.Parallel()
cases := []struct {
name string
errorCode string
wantStatus int
}{
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
{"invalid_request", startruntime.ErrorCodeInvalidRequest, http.StatusBadRequest},
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStopService(ctrl)
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(stopruntime.Result{
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
}, nil)
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/stop",
jsonHeaders(),
strReader(`{"reason":"admin_request"}`),
)
body := decodeErrorBody(t, rec, tc.wantStatus)
assert.Equal(t, tc.errorCode, body.Code)
})
}
}
func TestStopHandlerRejectsUnknownJSONFields(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStopService(ctrl)
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/stop",
jsonHeaders(),
strReader(`{"reason":"admin_request","extra":1}`),
)
body := decodeErrorBody(t, rec, http.StatusBadRequest)
assert.Equal(t, "invalid_request", body.Code)
}
func TestStopHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockStopService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(stopruntime.Input{})).
DoAndReturn(func(_ context.Context, in stopruntime.Input) (stopruntime.Result, error) {
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
return stopruntime.Result{Record: sampleStoppedRecord(t), Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/stop",
withCaller(jsonHeaders(), "gm"),
strReader(`{"reason":"cancelled"}`),
)
require.Equal(t, http.StatusOK, rec.Code)
}
func TestStopHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/stop",
jsonHeaders(),
strReader(`{"reason":"admin_request"}`),
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
// --- restart ---
func TestRestartHandlerReturnsRecordOnSuccess(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockRestartService(ctrl)
record := sampleRunningRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(restartruntime.Input{})).
DoAndReturn(func(_ context.Context, in restartruntime.Input) (restartruntime.Result, error) {
assert.Equal(t, "game-test", in.GameID)
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
return restartruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "running", resp.Status)
}
func TestRestartHandlerMapsServiceFailures(t *testing.T) {
t.Parallel()
cases := []struct {
name string
errorCode string
wantStatus int
}{
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockRestartService(ctrl)
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(restartruntime.Result{
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
}, nil)
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
)
body := decodeErrorBody(t, rec, tc.wantStatus)
assert.Equal(t, tc.errorCode, body.Code)
})
}
}
func TestRestartHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockRestartService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(restartruntime.Input{})).
DoAndReturn(func(_ context.Context, in restartruntime.Input) (restartruntime.Result, error) {
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
return restartruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/restart",
withCaller(http.Header{}, "gm"), nil,
)
require.Equal(t, http.StatusOK, rec.Code)
}
func TestRestartHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
// --- patch ---
func TestPatchHandlerReturnsRecordOnSuccess(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockPatchService(ctrl)
record := sampleRunningRecord(t)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(patchruntime.Input{})).
DoAndReturn(func(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
assert.Equal(t, "game-test", in.GameID)
assert.Equal(t, "galaxy/game:v1.2.4", in.NewImageRef)
return patchruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/patch",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "running", resp.Status)
}
func TestPatchHandlerMapsServiceFailures(t *testing.T) {
t.Parallel()
cases := []struct {
name string
errorCode string
wantStatus int
}{
{"image_ref_not_semver", startruntime.ErrorCodeImageRefNotSemver, http.StatusBadRequest},
{"semver_patch_only", startruntime.ErrorCodeSemverPatchOnly, http.StatusConflict},
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockPatchService(ctrl)
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(patchruntime.Result{
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
}, nil)
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/patch",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
)
body := decodeErrorBody(t, rec, tc.wantStatus)
assert.Equal(t, tc.errorCode, body.Code)
})
}
}
func TestPatchHandlerRejectsUnknownJSONFields(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockPatchService(ctrl)
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/patch",
jsonHeaders(),
strReader(`{"image_ref":"x","unexpected":true}`),
)
body := decodeErrorBody(t, rec, http.StatusBadRequest)
assert.Equal(t, "invalid_request", body.Code)
}
func TestPatchHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockPatchService(ctrl)
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(patchruntime.Input{})).
DoAndReturn(func(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
return patchruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/patch",
withCaller(jsonHeaders(), "gm"),
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
)
require.Equal(t, http.StatusOK, rec.Code)
}
func TestPatchHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodPost,
"/api/v1/internal/runtimes/game-test/patch",
jsonHeaders(),
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
// --- cleanup ---
func TestCleanupHandlerReturnsRecordOnSuccess(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockCleanupService(ctrl)
record := sampleStoppedRecord(t)
record.Status = runtime.StatusRemoved
record.CurrentContainerID = ""
removed := record.LastOpAt
record.RemovedAt = &removed
mock.EXPECT().
Handle(gomock.Any(), gomock.AssignableToTypeOf(cleanupcontainer.Input{})).
DoAndReturn(func(_ context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error) {
assert.Equal(t, "game-stopped", in.GameID)
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
return cleanupcontainer.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
})
rec := drive(t, Dependencies{CleanupContainer: mock}, http.MethodDelete,
"/api/v1/internal/runtimes/game-stopped/container", nil, nil,
)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "removed", resp.Status)
assert.Nil(t, resp.CurrentContainerID, "container id must be null after cleanup")
}
func TestCleanupHandlerMapsServiceFailures(t *testing.T) {
t.Parallel()
cases := []struct {
name string
errorCode string
wantStatus int
}{
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()
ctrl := gomock.NewController(t)
mock := mocks.NewMockCleanupService(ctrl)
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(cleanupcontainer.Result{
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
}, nil)
rec := drive(t, Dependencies{CleanupContainer: mock}, http.MethodDelete,
"/api/v1/internal/runtimes/game-test/container", nil, nil,
)
body := decodeErrorBody(t, rec, tc.wantStatus)
assert.Equal(t, tc.errorCode, body.Code)
})
}
}
func TestCleanupHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodDelete,
"/api/v1/internal/runtimes/game-test/container", nil, nil,
)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
@@ -0,0 +1,115 @@
package handlers
import (
"encoding/json"
"errors"
"net/http"
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
// Tests for the read-only handlers (`internalListRuntimes`,
// `internalGetRuntime`). These bypass the service layer and read
// directly from `ports.RuntimeRecordStore` — see
// `rtmanager/docs/services.md` §18.
func TestListHandlerReturnsEmptyItemsForEmptyStore(t *testing.T) {
t.Parallel()
deps := Dependencies{RuntimeRecords: newFakeRuntimeRecords()}
rec := drive(t, deps, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
require.Equal(t, http.StatusOK, rec.Code)
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
var resp runtimesListResponse
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
require.NotNil(t, resp.Items, "items must never be nil")
assert.Empty(t, resp.Items)
}
func TestListHandlerReturnsEveryStoredRecord(t *testing.T) {
t.Parallel()
store := newFakeRuntimeRecords()
store.put(sampleRunningRecord(t))
store.put(sampleStoppedRecord(t))
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
require.Equal(t, http.StatusOK, rec.Code)
var resp runtimesListResponse
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
require.Len(t, resp.Items, 2)
gotIDs := map[string]string{}
for _, item := range resp.Items {
gotIDs[item.GameID] = item.Status
}
assert.Equal(t, "running", gotIDs["game-test"])
assert.Equal(t, "stopped", gotIDs["game-stopped"])
}
func TestListHandlerReturnsInternalErrorWhenStoreFails(t *testing.T) {
t.Parallel()
store := newFakeRuntimeRecords()
store.listErr = errors.New("postgres exploded")
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
func TestListHandlerReturnsInternalErrorWhenStoreNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
func TestGetHandlerReturnsTheRecord(t *testing.T) {
t.Parallel()
store := newFakeRuntimeRecords()
record := sampleRunningRecord(t)
store.put(record)
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
resp := decodeRecordResponse(t, rec)
assert.Equal(t, "game-test", resp.GameID)
assert.Equal(t, "running", resp.Status)
if assert.NotNil(t, resp.CurrentImageRef) {
assert.Equal(t, "galaxy/game:v1.2.3", *resp.CurrentImageRef)
}
}
func TestGetHandlerReturnsNotFoundForMissingRecord(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{RuntimeRecords: newFakeRuntimeRecords()}, http.MethodGet, "/api/v1/internal/runtimes/game-missing", nil, nil)
body := decodeErrorBody(t, rec, http.StatusNotFound)
assert.Equal(t, "not_found", body.Code)
}
func TestGetHandlerReturnsInternalErrorWhenStoreFails(t *testing.T) {
t.Parallel()
store := newFakeRuntimeRecords()
store.getErr = errors.New("transport blew up")
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
func TestGetHandlerReturnsInternalErrorWhenStoreNotWired(t *testing.T) {
t.Parallel()
rec := drive(t, Dependencies{}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
assert.Equal(t, "internal_error", body.Code)
}
@@ -0,0 +1,38 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/service/startruntime"
)
// newListHandler returns the handler for `GET /api/v1/internal/runtimes`.
// The handler reads directly from `ports.RuntimeRecordStore.List` —
// this surface is read-only and does not produce operation_log rows
// (rationale: see `rtmanager/docs/services.md` §18).
func newListHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.list")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.RuntimeRecords == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"runtime records store is not wired",
)
return
}
records, err := deps.RuntimeRecords.List(request.Context())
if err != nil {
logger.ErrorContext(request.Context(), "list runtime records",
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"failed to list runtime records",
)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimesList(records))
}
}
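// encodeRuntimesList is defined in this package's encoding file. The
// "items must never be nil" assertion in the list tests implies it
// coerces an empty result to an empty slice; a sketch consistent with
// that contract (not the committed code):
//
//	func encodeRuntimesList(records []runtime.RuntimeRecord) runtimesListResponse {
//		items := make([]runtimeRecordResponse, 0, len(records))
//		for _, record := range records {
//			items = append(items, encodeRuntimeRecord(record))
//		}
//		return runtimesListResponse{Items: items}
//	}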
@@ -0,0 +1,217 @@
// Code generated by MockGen. DO NOT EDIT.
// Source: galaxy/rtmanager/internal/api/internalhttp/handlers (interfaces: StartService,StopService,RestartService,PatchService,CleanupService)
//
// Generated by this command:
//
// mockgen -destination=mocks/mock_services.go -package=mocks galaxy/rtmanager/internal/api/internalhttp/handlers StartService,StopService,RestartService,PatchService,CleanupService
//
// Package mocks is a generated GoMock package.
package mocks
import (
context "context"
cleanupcontainer "galaxy/rtmanager/internal/service/cleanupcontainer"
patchruntime "galaxy/rtmanager/internal/service/patchruntime"
restartruntime "galaxy/rtmanager/internal/service/restartruntime"
startruntime "galaxy/rtmanager/internal/service/startruntime"
stopruntime "galaxy/rtmanager/internal/service/stopruntime"
reflect "reflect"
gomock "go.uber.org/mock/gomock"
)
// MockStartService is a mock of StartService interface.
type MockStartService struct {
ctrl *gomock.Controller
recorder *MockStartServiceMockRecorder
isgomock struct{}
}
// MockStartServiceMockRecorder is the mock recorder for MockStartService.
type MockStartServiceMockRecorder struct {
mock *MockStartService
}
// NewMockStartService creates a new mock instance.
func NewMockStartService(ctrl *gomock.Controller) *MockStartService {
mock := &MockStartService{ctrl: ctrl}
mock.recorder = &MockStartServiceMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockStartService) EXPECT() *MockStartServiceMockRecorder {
return m.recorder
}
// Handle mocks base method.
func (m *MockStartService) Handle(ctx context.Context, in startruntime.Input) (startruntime.Result, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Handle", ctx, in)
ret0, _ := ret[0].(startruntime.Result)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Handle indicates an expected call of Handle.
func (mr *MockStartServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockStartService)(nil).Handle), ctx, in)
}
// MockStopService is a mock of StopService interface.
type MockStopService struct {
ctrl *gomock.Controller
recorder *MockStopServiceMockRecorder
isgomock struct{}
}
// MockStopServiceMockRecorder is the mock recorder for MockStopService.
type MockStopServiceMockRecorder struct {
mock *MockStopService
}
// NewMockStopService creates a new mock instance.
func NewMockStopService(ctrl *gomock.Controller) *MockStopService {
mock := &MockStopService{ctrl: ctrl}
mock.recorder = &MockStopServiceMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockStopService) EXPECT() *MockStopServiceMockRecorder {
return m.recorder
}
// Handle mocks base method.
func (m *MockStopService) Handle(ctx context.Context, in stopruntime.Input) (stopruntime.Result, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Handle", ctx, in)
ret0, _ := ret[0].(stopruntime.Result)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Handle indicates an expected call of Handle.
func (mr *MockStopServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockStopService)(nil).Handle), ctx, in)
}
// MockRestartService is a mock of RestartService interface.
type MockRestartService struct {
ctrl *gomock.Controller
recorder *MockRestartServiceMockRecorder
isgomock struct{}
}
// MockRestartServiceMockRecorder is the mock recorder for MockRestartService.
type MockRestartServiceMockRecorder struct {
mock *MockRestartService
}
// NewMockRestartService creates a new mock instance.
func NewMockRestartService(ctrl *gomock.Controller) *MockRestartService {
mock := &MockRestartService{ctrl: ctrl}
mock.recorder = &MockRestartServiceMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockRestartService) EXPECT() *MockRestartServiceMockRecorder {
return m.recorder
}
// Handle mocks base method.
func (m *MockRestartService) Handle(ctx context.Context, in restartruntime.Input) (restartruntime.Result, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Handle", ctx, in)
ret0, _ := ret[0].(restartruntime.Result)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Handle indicates an expected call of Handle.
func (mr *MockRestartServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockRestartService)(nil).Handle), ctx, in)
}
// MockPatchService is a mock of PatchService interface.
type MockPatchService struct {
ctrl *gomock.Controller
recorder *MockPatchServiceMockRecorder
isgomock struct{}
}
// MockPatchServiceMockRecorder is the mock recorder for MockPatchService.
type MockPatchServiceMockRecorder struct {
mock *MockPatchService
}
// NewMockPatchService creates a new mock instance.
func NewMockPatchService(ctrl *gomock.Controller) *MockPatchService {
mock := &MockPatchService{ctrl: ctrl}
mock.recorder = &MockPatchServiceMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockPatchService) EXPECT() *MockPatchServiceMockRecorder {
return m.recorder
}
// Handle mocks base method.
func (m *MockPatchService) Handle(ctx context.Context, in patchruntime.Input) (patchruntime.Result, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Handle", ctx, in)
ret0, _ := ret[0].(patchruntime.Result)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Handle indicates an expected call of Handle.
func (mr *MockPatchServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockPatchService)(nil).Handle), ctx, in)
}
// MockCleanupService is a mock of CleanupService interface.
type MockCleanupService struct {
ctrl *gomock.Controller
recorder *MockCleanupServiceMockRecorder
isgomock struct{}
}
// MockCleanupServiceMockRecorder is the mock recorder for MockCleanupService.
type MockCleanupServiceMockRecorder struct {
mock *MockCleanupService
}
// NewMockCleanupService creates a new mock instance.
func NewMockCleanupService(ctrl *gomock.Controller) *MockCleanupService {
mock := &MockCleanupService{ctrl: ctrl}
mock.recorder = &MockCleanupServiceMockRecorder{mock}
return mock
}
// EXPECT returns an object that allows the caller to indicate expected use.
func (m *MockCleanupService) EXPECT() *MockCleanupServiceMockRecorder {
return m.recorder
}
// Handle mocks base method.
func (m *MockCleanupService) Handle(ctx context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error) {
m.ctrl.T.Helper()
ret := m.ctrl.Call(m, "Handle", ctx, in)
ret0, _ := ret[0].(cleanupcontainer.Result)
ret1, _ := ret[1].(error)
return ret0, ret1
}
// Handle indicates an expected call of Handle.
func (mr *MockCleanupServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
mr.mock.ctrl.T.Helper()
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockCleanupService)(nil).Handle), ctx, in)
}
@@ -0,0 +1,71 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/service/patchruntime"
"galaxy/rtmanager/internal/service/startruntime"
)
// patchRequestBody mirrors the OpenAPI PatchRequest schema. The
// service layer validates `image_ref` shape (semver, distribution
// reference) and surfaces `image_ref_not_semver` /
// `semver_patch_only` as needed.
type patchRequestBody struct {
ImageRef string `json:"image_ref"`
}
// newPatchHandler returns the handler for
// `POST /api/v1/internal/runtimes/{game_id}/patch`.
func newPatchHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.patch")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.PatchRuntime == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"patch runtime service is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
var body patchRequestBody
if err := decodeStrictJSON(request.Body, &body); err != nil {
writeError(writer, http.StatusBadRequest,
startruntime.ErrorCodeInvalidRequest,
err.Error(),
)
return
}
result, err := deps.PatchRuntime.Handle(request.Context(), patchruntime.Input{
GameID: gameID,
NewImageRef: body.ImageRef,
OpSource: resolveOpSource(request),
SourceRef: requestSourceRef(request),
})
if err != nil {
logger.ErrorContext(request.Context(), "patch runtime service errored",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"patch runtime service failed",
)
return
}
if result.Outcome == operation.OutcomeFailure {
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
}
}
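// decodeStrictJSON backs the unknown-field and malformed-JSON rejections
// exercised by the handler tests. A minimal sketch of the strict
// decoding involved (assumption: the committed helper may additionally
// bound the body size or reject trailing data):
//
//	func decodeStrictJSON(body io.Reader, dst any) error {
//		decoder := json.NewDecoder(body)
//		decoder.DisallowUnknownFields()
//		if err := decoder.Decode(dst); err != nil {
//			return fmt.Errorf("decode request body: %w", err)
//		}
//		return nil
//	}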
@@ -0,0 +1,55 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/service/restartruntime"
"galaxy/rtmanager/internal/service/startruntime"
)
// newRestartHandler returns the handler for
// `POST /api/v1/internal/runtimes/{game_id}/restart`. The OpenAPI spec
// declares no request body for this operation; any client-provided
// body is ignored.
func newRestartHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.restart")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.RestartRuntime == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"restart runtime service is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
result, err := deps.RestartRuntime.Handle(request.Context(), restartruntime.Input{
GameID: gameID,
OpSource: resolveOpSource(request),
SourceRef: requestSourceRef(request),
})
if err != nil {
logger.ErrorContext(request.Context(), "restart runtime service errored",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"restart runtime service failed",
)
return
}
if result.Outcome == operation.OutcomeFailure {
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
}
}
@@ -0,0 +1,54 @@
// Package handlers ships the GM/Admin-facing internal REST surface of
// Runtime Manager. The package is consumed by
// `galaxy/rtmanager/internal/api/internalhttp`; each handler delegates
// to one of the lifecycle services in `internal/service/`
// (`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
// `cleanupcontainer`) or reads directly from `ports.RuntimeRecordStore`
// (list / get).
//
// The interfaces declared in this file mirror the single `Handle`
// method exposed by every concrete lifecycle service. Production wiring
// passes the concrete service pointers; tests pass `mockgen`-generated
// mocks. The narrow shape keeps the handler layer free of service
// internals (lease tokens, telemetry, durable side effects) and matches
// the repo-wide `mockgen` convention for wide / recorder ports.
package handlers
import (
"context"
"galaxy/rtmanager/internal/service/cleanupcontainer"
"galaxy/rtmanager/internal/service/patchruntime"
"galaxy/rtmanager/internal/service/restartruntime"
"galaxy/rtmanager/internal/service/startruntime"
"galaxy/rtmanager/internal/service/stopruntime"
)
//go:generate go run go.uber.org/mock/mockgen -destination=mocks/mock_services.go -package=mocks galaxy/rtmanager/internal/api/internalhttp/handlers StartService,StopService,RestartService,PatchService,CleanupService
// StartService is the narrow port the start handler depends on. It
// matches the public Handle method of `startruntime.Service`; the
// concrete service satisfies the interface implicitly.
type StartService interface {
Handle(ctx context.Context, in startruntime.Input) (startruntime.Result, error)
}
// StopService is the narrow port the stop handler depends on.
type StopService interface {
Handle(ctx context.Context, in stopruntime.Input) (stopruntime.Result, error)
}
// RestartService is the narrow port the restart handler depends on.
type RestartService interface {
Handle(ctx context.Context, in restartruntime.Input) (restartruntime.Result, error)
}
// PatchService is the narrow port the patch handler depends on.
type PatchService interface {
Handle(ctx context.Context, in patchruntime.Input) (patchruntime.Result, error)
}
// CleanupService is the narrow port the cleanup handler depends on.
type CleanupService interface {
Handle(ctx context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error)
}
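// Because satisfaction is implicit, compile-time assertions are enough
// to catch drift between a lifecycle service and its port. Whether the
// repo declares them is not visible in this diff; illustratively:
//
//	var _ StartService = (*startruntime.Service)(nil)
//	var _ StopService = (*stopruntime.Service)(nil)
//	var _ RestartService = (*restartruntime.Service)(nil)
//	var _ PatchService = (*patchruntime.Service)(nil)
//	var _ CleanupService = (*cleanupcontainer.Service)(nil)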
@@ -0,0 +1,71 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/service/startruntime"
)
// startRequestBody mirrors the OpenAPI StartRequest schema. Only
// `image_ref` is accepted; unknown fields are rejected by
// decodeStrictJSON.
type startRequestBody struct {
ImageRef string `json:"image_ref"`
}
// newStartHandler returns the handler for
// `POST /api/v1/internal/runtimes/{game_id}/start`. The handler
// delegates the entire lifecycle to `startruntime.Service`; failure
// codes are mapped to HTTP statuses via mapErrorCodeToStatus.
func newStartHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.start")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.StartRuntime == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"start runtime service is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
var body startRequestBody
if err := decodeStrictJSON(request.Body, &body); err != nil {
writeError(writer, http.StatusBadRequest,
startruntime.ErrorCodeInvalidRequest,
err.Error(),
)
return
}
result, err := deps.StartRuntime.Handle(request.Context(), startruntime.Input{
GameID: gameID,
ImageRef: body.ImageRef,
OpSource: resolveOpSource(request),
SourceRef: requestSourceRef(request),
})
if err != nil {
logger.ErrorContext(request.Context(), "start runtime service errored",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"start runtime service failed",
)
return
}
if result.Outcome == operation.OutcomeFailure {
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
}
}
@@ -0,0 +1,70 @@
package handlers
import (
"net/http"
"galaxy/rtmanager/internal/domain/operation"
"galaxy/rtmanager/internal/service/startruntime"
"galaxy/rtmanager/internal/service/stopruntime"
)
// stopRequestBody mirrors the OpenAPI StopRequest schema. The reason
// enum is validated at the service layer (`stopruntime.Input.Validate`);
// unknown values surface as `invalid_request`.
type stopRequestBody struct {
Reason string `json:"reason"`
}
// newStopHandler returns the handler for
// `POST /api/v1/internal/runtimes/{game_id}/stop`.
func newStopHandler(deps Dependencies) http.HandlerFunc {
logger := loggerFor(deps.Logger, "internal_rest.stop")
return func(writer http.ResponseWriter, request *http.Request) {
if deps.StopRuntime == nil {
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"stop runtime service is not wired",
)
return
}
gameID, ok := extractGameID(writer, request)
if !ok {
return
}
var body stopRequestBody
if err := decodeStrictJSON(request.Body, &body); err != nil {
writeError(writer, http.StatusBadRequest,
startruntime.ErrorCodeInvalidRequest,
err.Error(),
)
return
}
result, err := deps.StopRuntime.Handle(request.Context(), stopruntime.Input{
GameID: gameID,
Reason: stopruntime.StopReason(body.Reason),
OpSource: resolveOpSource(request),
SourceRef: requestSourceRef(request),
})
if err != nil {
logger.ErrorContext(request.Context(), "stop runtime service errored",
"game_id", gameID,
"err", err.Error(),
)
writeError(writer, http.StatusInternalServerError,
startruntime.ErrorCodeInternal,
"stop runtime service failed",
)
return
}
if result.Outcome == operation.OutcomeFailure {
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
return
}
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
}
}
@@ -0,0 +1,363 @@
// Package internalhttp provides the trusted internal HTTP listener used
// by the runnable Runtime Manager process. It exposes `/healthz` and
// `/readyz` plus the GM/Admin REST surface backed by the lifecycle
// services in `internal/service/`.
package internalhttp
import (
"context"
"encoding/json"
"errors"
"fmt"
"log/slog"
"net"
"net/http"
"strconv"
"sync"
"time"
"galaxy/rtmanager/internal/api/internalhttp/handlers"
"galaxy/rtmanager/internal/ports"
"galaxy/rtmanager/internal/telemetry"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel/attribute"
)
const jsonContentType = "application/json; charset=utf-8"
// errorCodeServiceUnavailable mirrors the stable error code declared in
// `rtmanager/api/internal-openapi.yaml` (§ Error Model).
const errorCodeServiceUnavailable = "service_unavailable"
// HealthzPath and ReadyzPath are the internal probe routes documented in
// `rtmanager/api/internal-openapi.yaml`.
const (
HealthzPath = "/healthz"
ReadyzPath = "/readyz"
)
// ReadinessProbe reports whether the dependencies the listener guards
// (PostgreSQL, Redis, Docker) are reachable. A non-nil error is reported
// to the caller as `503 service_unavailable` with the wrapped message.
type ReadinessProbe interface {
Check(ctx context.Context) error
}
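// The runtime supplies the concrete probe. One shape that satisfies the
// interface, as a sketch (names hypothetical; the real probe and its
// dependency set live in the runtime wiring, not in this package):
//
//	type fanOutProbe struct {
//		checks map[string]func(context.Context) error // e.g. "postgres", "redis", "docker"
//	}
//
//	func (probe fanOutProbe) Check(ctx context.Context) error {
//		var errs []error
//		for name, check := range probe.checks {
//			if err := check(ctx); err != nil {
//				errs = append(errs, fmt.Errorf("%s: %w", name, err))
//			}
//		}
//		return errors.Join(errs...)
//	}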
// Config describes the trusted internal HTTP listener owned by Runtime
// Manager.
type Config struct {
// Addr is the TCP listen address used by the internal HTTP server.
Addr string
// ReadHeaderTimeout bounds how long the listener may spend reading
// request headers before the server rejects the connection.
ReadHeaderTimeout time.Duration
// ReadTimeout bounds how long the listener may spend reading one
// request.
ReadTimeout time.Duration
// WriteTimeout bounds how long the listener may spend writing one
// response.
WriteTimeout time.Duration
// IdleTimeout bounds how long the listener keeps an idle keep-alive
// connection open.
IdleTimeout time.Duration
}
// Validate reports whether cfg contains a usable internal HTTP listener
// configuration.
func (cfg Config) Validate() error {
switch {
case cfg.Addr == "":
return errors.New("internal HTTP addr must not be empty")
case cfg.ReadHeaderTimeout <= 0:
return errors.New("internal HTTP read header timeout must be positive")
case cfg.ReadTimeout <= 0:
return errors.New("internal HTTP read timeout must be positive")
case cfg.WriteTimeout <= 0:
return errors.New("internal HTTP write timeout must be positive")
case cfg.IdleTimeout <= 0:
return errors.New("internal HTTP idle timeout must be positive")
default:
return nil
}
}
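// For orientation, a fully-populated Config that passes Validate (the
// values are illustrative; production values come from the service
// configuration, not from this package):
//
//	cfg := Config{
//		Addr:              ":8081",
//		ReadHeaderTimeout: 5 * time.Second,
//		ReadTimeout:       15 * time.Second,
//		WriteTimeout:      15 * time.Second,
//		IdleTimeout:       60 * time.Second,
//	}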
// Dependencies describes the collaborators used by the internal HTTP
// transport layer. The listener still works when the lifecycle service
// fields are zero — handlers register but each returns
// `500 internal_error` until the runtime wires the real services.
type Dependencies struct {
// Logger writes structured listener lifecycle logs. When nil,
// slog.Default is used.
Logger *slog.Logger
// Telemetry records low-cardinality probe metrics and lifecycle
// events.
Telemetry *telemetry.Runtime
// Readiness reports whether PG / Redis / Docker are reachable. A
// nil readiness probe makes `/readyz` always answer `200`; the
// runtime always supplies a real probe in production wiring.
Readiness ReadinessProbe
// RuntimeRecords backs the read-only list/get handlers. When nil
// those routes return `500 internal_error`.
RuntimeRecords ports.RuntimeRecordStore
// StartRuntime, StopRuntime, RestartRuntime, PatchRuntime, and
// CleanupContainer back the lifecycle handlers. Each accepts a
// narrow interface so tests can pass `mockgen`-generated mocks;
// production wiring passes the concrete `*<lifecycle>.Service`
// pointer.
StartRuntime handlers.StartService
StopRuntime handlers.StopService
RestartRuntime handlers.RestartService
PatchRuntime handlers.PatchService
CleanupContainer handlers.CleanupService
}
// Server owns the trusted internal HTTP listener exposed by Runtime
// Manager.
type Server struct {
cfg Config
handler http.Handler
logger *slog.Logger
metrics *telemetry.Runtime
stateMu sync.RWMutex
server *http.Server
listener net.Listener
}
// NewServer constructs one trusted internal HTTP server for cfg and deps.
func NewServer(cfg Config, deps Dependencies) (*Server, error) {
if err := cfg.Validate(); err != nil {
return nil, fmt.Errorf("new internal HTTP server: %w", err)
}
logger := deps.Logger
if logger == nil {
logger = slog.Default()
}
return &Server{
cfg: cfg,
handler: newHandler(deps, logger),
logger: logger.With("component", "internal_http"),
metrics: deps.Telemetry,
}, nil
}
// Addr returns the currently bound listener address after Run is called.
// It returns an empty string if the server has not yet bound a listener.
func (server *Server) Addr() string {
server.stateMu.RLock()
defer server.stateMu.RUnlock()
if server.listener == nil {
return ""
}
return server.listener.Addr().String()
}
// Run binds the configured listener and serves the internal HTTP surface
// until Shutdown closes the server.
func (server *Server) Run(ctx context.Context) error {
if ctx == nil {
return errors.New("run internal HTTP server: nil context")
}
if err := ctx.Err(); err != nil {
return err
}
listener, err := net.Listen("tcp", server.cfg.Addr)
if err != nil {
return fmt.Errorf("run internal HTTP server: listen on %q: %w", server.cfg.Addr, err)
}
httpServer := &http.Server{
Handler: server.handler,
ReadHeaderTimeout: server.cfg.ReadHeaderTimeout,
ReadTimeout: server.cfg.ReadTimeout,
WriteTimeout: server.cfg.WriteTimeout,
IdleTimeout: server.cfg.IdleTimeout,
}
server.stateMu.Lock()
server.server = httpServer
server.listener = listener
server.stateMu.Unlock()
server.logger.Info("rtmanager internal HTTP server started", "addr", listener.Addr().String())
defer func() {
server.stateMu.Lock()
server.server = nil
server.listener = nil
server.stateMu.Unlock()
}()
err = httpServer.Serve(listener)
switch {
case err == nil:
return nil
case errors.Is(err, http.ErrServerClosed):
server.logger.Info("rtmanager internal HTTP server stopped")
return nil
default:
return fmt.Errorf("run internal HTTP server: serve on %q: %w", server.cfg.Addr, err)
}
}
// Shutdown gracefully stops the internal HTTP server within ctx.
func (server *Server) Shutdown(ctx context.Context) error {
if ctx == nil {
return errors.New("shutdown internal HTTP server: nil context")
}
server.stateMu.RLock()
httpServer := server.server
server.stateMu.RUnlock()
if httpServer == nil {
return nil
}
if err := httpServer.Shutdown(ctx); err != nil && !errors.Is(err, http.ErrServerClosed) {
return fmt.Errorf("shutdown internal HTTP server: %w", err)
}
return nil
}
func newHandler(deps Dependencies, logger *slog.Logger) http.Handler {
mux := http.NewServeMux()
mux.HandleFunc("GET "+HealthzPath, handleHealthz)
mux.HandleFunc("GET "+ReadyzPath, handleReadyz(deps.Readiness, logger))
handlers.Register(mux, handlers.Dependencies{
Logger: logger,
RuntimeRecords: deps.RuntimeRecords,
StartRuntime: deps.StartRuntime,
StopRuntime: deps.StopRuntime,
RestartRuntime: deps.RestartRuntime,
PatchRuntime: deps.PatchRuntime,
CleanupContainer: deps.CleanupContainer,
})
metrics := deps.Telemetry
options := []otelhttp.Option{}
if metrics != nil {
options = append(options,
otelhttp.WithTracerProvider(metrics.TracerProvider()),
otelhttp.WithMeterProvider(metrics.MeterProvider()),
)
}
return otelhttp.NewHandler(withObservability(mux, metrics), "rtmanager.internal_http", options...)
}
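// withObservability wraps next so every request is measured exactly once
// with a low-cardinality route label: matched requests use the mux
// pattern exposed by request.Pattern, while 404 / 405 responses and
// unmatched requests collapse into the fixed "not_found",
// "method_not_allowed", and "unmatched" buckets.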
func withObservability(next http.Handler, metrics *telemetry.Runtime) http.Handler {
return http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
startedAt := time.Now()
recorder := &statusRecorder{
ResponseWriter: writer,
statusCode: http.StatusOK,
}
next.ServeHTTP(recorder, request)
route := request.Pattern
switch recorder.statusCode {
case http.StatusMethodNotAllowed:
route = "method_not_allowed"
case http.StatusNotFound:
route = "not_found"
case 0:
route = "unmatched"
}
if route == "" {
route = "unmatched"
}
if metrics != nil {
metrics.RecordInternalHTTPRequest(
request.Context(),
[]attribute.KeyValue{
attribute.String("route", route),
attribute.String("method", request.Method),
attribute.String("status_code", strconv.Itoa(recorder.statusCode)),
},
time.Since(startedAt),
)
}
})
}
func handleHealthz(writer http.ResponseWriter, _ *http.Request) {
writeStatusResponse(writer, http.StatusOK, "ok")
}
func handleReadyz(probe ReadinessProbe, logger *slog.Logger) http.HandlerFunc {
return func(writer http.ResponseWriter, request *http.Request) {
if probe == nil {
writeStatusResponse(writer, http.StatusOK, "ready")
return
}
if err := probe.Check(request.Context()); err != nil {
logger.WarnContext(request.Context(), "rtmanager readiness probe failed",
"err", err.Error(),
)
writeServiceUnavailable(writer, err.Error())
return
}
writeStatusResponse(writer, http.StatusOK, "ready")
}
}
func writeStatusResponse(writer http.ResponseWriter, statusCode int, status string) {
writer.Header().Set("Content-Type", jsonContentType)
writer.WriteHeader(statusCode)
_ = json.NewEncoder(writer).Encode(statusResponse{Status: status})
}
func writeServiceUnavailable(writer http.ResponseWriter, message string) {
writer.Header().Set("Content-Type", jsonContentType)
writer.WriteHeader(http.StatusServiceUnavailable)
_ = json.NewEncoder(writer).Encode(errorResponse{
Error: errorBody{
Code: errorCodeServiceUnavailable,
Message: message,
},
})
}
type statusResponse struct {
Status string `json:"status"`
}
type errorBody struct {
Code string `json:"code"`
Message string `json:"message"`
}
type errorResponse struct {
Error errorBody `json:"error"`
}
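// statusRecorder captures the status code the wrapped handler writes so
// withObservability can label its metrics. statusCode starts at
// http.StatusOK because a handler that never calls WriteHeader produces
// an implicit 200.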
type statusRecorder struct {
http.ResponseWriter
statusCode int
}
func (recorder *statusRecorder) WriteHeader(statusCode int) {
recorder.statusCode = statusCode
recorder.ResponseWriter.WriteHeader(statusCode)
}
+115
View File
@@ -0,0 +1,115 @@
package internalhttp
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"strings"
"testing"
"time"
"github.com/stretchr/testify/require"
)
func newTestConfig() Config {
return Config{
Addr: ":0",
ReadHeaderTimeout: time.Second,
ReadTimeout: time.Second,
WriteTimeout: time.Second,
IdleTimeout: time.Second,
}
}
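// stubReadiness implements ReadinessProbe with a fixed result.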
type stubReadiness struct {
err error
}
func (probe stubReadiness) Check(_ context.Context) error {
return probe.err
}
func newTestServer(t *testing.T, deps Dependencies) http.Handler {
t.Helper()
server, err := NewServer(newTestConfig(), deps)
require.NoError(t, err)
return server.handler
}
func TestHealthzReturnsOK(t *testing.T) {
t.Parallel()
handler := newTestServer(t, Dependencies{})
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, HealthzPath, nil)
handler.ServeHTTP(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
require.Equal(t, jsonContentType, rec.Header().Get("Content-Type"))
var body statusResponse
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
require.Equal(t, "ok", body.Status)
}
func TestReadyzReturnsReadyWhenProbeIsNil(t *testing.T) {
t.Parallel()
handler := newTestServer(t, Dependencies{})
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
handler.ServeHTTP(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var body statusResponse
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
require.Equal(t, "ready", body.Status)
}
func TestReadyzReturnsReadyWhenProbeSucceeds(t *testing.T) {
t.Parallel()
handler := newTestServer(t, Dependencies{Readiness: stubReadiness{}})
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
handler.ServeHTTP(rec, req)
require.Equal(t, http.StatusOK, rec.Code)
var body statusResponse
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
require.Equal(t, "ready", body.Status)
}
func TestReadyzReturnsServiceUnavailableWhenProbeFails(t *testing.T) {
t.Parallel()
handler := newTestServer(t, Dependencies{
Readiness: stubReadiness{err: errors.New("postgres ping: connection refused")},
})
rec := httptest.NewRecorder()
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
handler.ServeHTTP(rec, req)
require.Equal(t, http.StatusServiceUnavailable, rec.Code)
require.Equal(t, jsonContentType, rec.Header().Get("Content-Type"))
var body errorResponse
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
require.Equal(t, errorCodeServiceUnavailable, body.Error.Code)
require.True(t, strings.Contains(body.Error.Message, "postgres"))
}
func TestNewServerRejectsInvalidConfig(t *testing.T) {
t.Parallel()
_, err := NewServer(Config{}, Dependencies{})
require.Error(t, err)
}
+170
View File
@@ -0,0 +1,170 @@
// Package app wires the Runtime Manager process lifecycle and
// coordinates component startup and graceful shutdown.
package app
import (
"context"
"errors"
"fmt"
"sync"
"galaxy/rtmanager/internal/config"
)
// Component is a long-lived Runtime Manager subsystem that participates
// in coordinated startup and graceful shutdown.
type Component interface {
// Run starts the component and blocks until it stops.
Run(context.Context) error
// Shutdown stops the component within the provided timeout-bounded
// context.
Shutdown(context.Context) error
}
// App owns the process-level lifecycle of Runtime Manager and its
// registered components.
type App struct {
cfg config.Config
components []Component
}
// New constructs App with a defensive copy of the supplied components.
func New(cfg config.Config, components ...Component) *App {
clonedComponents := append([]Component(nil), components...)
return &App{
cfg: cfg,
components: clonedComponents,
}
}
// Run starts all configured components, waits for cancellation or the
// first component failure, and then executes best-effort graceful
// shutdown.
func (app *App) Run(ctx context.Context) error {
if ctx == nil {
return errors.New("run rtmanager app: nil context")
}
if err := app.validate(); err != nil {
return err
}
if len(app.components) == 0 {
<-ctx.Done()
return nil
}
runCtx, cancel := context.WithCancel(ctx)
defer cancel()
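// Buffer the results channel so every component goroutine can deliver
// its outcome without blocking, even after Run has stopped receiving.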
results := make(chan componentResult, len(app.components))
var runWaitGroup sync.WaitGroup
for index, component := range app.components {
runWaitGroup.Add(1)
go func(componentIndex int, component Component) {
defer runWaitGroup.Done()
results <- componentResult{
index: componentIndex,
err: component.Run(runCtx),
}
}(index, component)
}
var runErr error
select {
case <-ctx.Done():
case result := <-results:
runErr = classifyComponentResult(ctx, result)
}
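// Cancel runCtx before invoking Shutdown so still-running components
// observe cancellation promptly.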
cancel()
shutdownErr := app.shutdownComponents()
waitErr := app.waitForComponents(&runWaitGroup)
return errors.Join(runErr, shutdownErr, waitErr)
}
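// componentResult pairs a component's registration index with its Run
// outcome.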
type componentResult struct {
index int
err error
}
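// validate rejects a non-positive shutdown timeout and nil components
// before any component starts.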
func (app *App) validate() error {
if app.cfg.ShutdownTimeout <= 0 {
return fmt.Errorf("run rtmanager app: shutdown timeout must be positive, got %s", app.cfg.ShutdownTimeout)
}
for index, component := range app.components {
if component == nil {
return fmt.Errorf("run rtmanager app: component %d is nil", index)
}
}
return nil
}
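// classifyComponentResult separates expected exits (parent context
// already done, or context.Canceled surfaced during shutdown) from
// genuine failures and premature clean exits.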
func classifyComponentResult(parentCtx context.Context, result componentResult) error {
switch {
case result.err == nil:
if parentCtx.Err() != nil {
return nil
}
return fmt.Errorf("run rtmanager app: component %d exited without error before shutdown", result.index)
case errors.Is(result.err, context.Canceled) && parentCtx.Err() != nil:
return nil
default:
return fmt.Errorf("run rtmanager app: component %d: %w", result.index, result.err)
}
}
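// shutdownComponents invokes Shutdown on every component concurrently
// and joins all failures into one error.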
func (app *App) shutdownComponents() error {
var shutdownWaitGroup sync.WaitGroup
errs := make(chan error, len(app.components))
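// Each component gets its own timeout-bounded context so one slow
// shutdown cannot consume the budget of the others.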
for index, component := range app.components {
shutdownWaitGroup.Add(1)
go func(componentIndex int, component Component) {
defer shutdownWaitGroup.Done()
shutdownCtx, cancel := context.WithTimeout(context.Background(), app.cfg.ShutdownTimeout)
defer cancel()
if err := component.Shutdown(shutdownCtx); err != nil {
errs <- fmt.Errorf("shutdown rtmanager component %d: %w", componentIndex, err)
}
}(index, component)
}
shutdownWaitGroup.Wait()
close(errs)
var joined error
for err := range errs {
joined = errors.Join(joined, err)
}
return joined
}
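// waitForComponents bounds the wait for every Run goroutine by the
// shutdown timeout so a wedged component cannot hang process exit.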
func (app *App) waitForComponents(runWaitGroup *sync.WaitGroup) error {
done := make(chan struct{})
go func() {
runWaitGroup.Wait()
close(done)
}()
waitCtx, cancel := context.WithTimeout(context.Background(), app.cfg.ShutdownTimeout)
defer cancel()
select {
case <-done:
return nil
case <-waitCtx.Done():
return fmt.Errorf("wait for rtmanager components: %w", waitCtx.Err())
}
}
+137
View File
@@ -0,0 +1,137 @@
package app
import (
"context"
"errors"
"sync/atomic"
"testing"
"time"
"galaxy/rtmanager/internal/config"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
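// fakeComponent is a scriptable Component double that counts Run and
// Shutdown invocations.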
type fakeComponent struct {
runErr error
shutdownErr error
runHook func(context.Context) error
shutdownHook func(context.Context) error
runCount atomic.Int32
downCount atomic.Int32
blockForCtx bool
}
func (component *fakeComponent) Run(ctx context.Context) error {
component.runCount.Add(1)
if component.runHook != nil {
return component.runHook(ctx)
}
if component.blockForCtx {
<-ctx.Done()
return ctx.Err()
}
return component.runErr
}
func (component *fakeComponent) Shutdown(ctx context.Context) error {
component.downCount.Add(1)
if component.shutdownHook != nil {
return component.shutdownHook(ctx)
}
return component.shutdownErr
}
func newCfg() config.Config {
return config.Config{ShutdownTimeout: time.Second}
}
func TestAppRunWithoutComponentsBlocksUntilContextDone(t *testing.T) {
t.Parallel()
app := New(newCfg())
ctx, cancel := context.WithCancel(context.Background())
cancel()
require.NoError(t, app.Run(ctx))
}
func TestAppRunReturnsOnContextCancel(t *testing.T) {
t.Parallel()
component := &fakeComponent{blockForCtx: true}
app := New(newCfg(), component)
ctx, cancel := context.WithCancel(context.Background())
go func() {
time.Sleep(10 * time.Millisecond)
cancel()
}()
require.NoError(t, app.Run(ctx))
assert.EqualValues(t, 1, component.runCount.Load())
assert.EqualValues(t, 1, component.downCount.Load())
}
func TestAppRunPropagatesComponentFailure(t *testing.T) {
t.Parallel()
failure := errors.New("boom")
component := &fakeComponent{runErr: failure}
app := New(newCfg(), component)
err := app.Run(context.Background())
require.Error(t, err)
require.ErrorIs(t, err, failure)
assert.EqualValues(t, 1, component.downCount.Load())
}
func TestAppRunFailsOnNilContext(t *testing.T) {
t.Parallel()
app := New(newCfg())
var ctx context.Context
require.Error(t, app.Run(ctx))
}
func TestAppRunFailsOnNonPositiveShutdownTimeout(t *testing.T) {
t.Parallel()
app := New(config.Config{}, &fakeComponent{})
require.Error(t, app.Run(context.Background()))
}
func TestAppRunFailsOnNilComponent(t *testing.T) {
t.Parallel()
app := New(newCfg(), nil)
require.Error(t, app.Run(context.Background()))
}
func TestAppRunFlagsCleanExitBeforeShutdown(t *testing.T) {
t.Parallel()
component := &fakeComponent{}
app := New(newCfg(), component)
err := app.Run(context.Background())
require.Error(t, err)
require.ErrorContains(t, err, "exited without error")
}
+85
View File
@@ -0,0 +1,85 @@
package app
import (
"context"
"errors"
"fmt"
"time"
"galaxy/redisconn"
"galaxy/rtmanager/internal/config"
"galaxy/rtmanager/internal/telemetry"
"github.com/docker/docker/client"
"github.com/redis/go-redis/v9"
)
// newRedisClient builds the master Redis client from cfg via the shared
// `pkg/redisconn` helper. Replica clients are not opened in this iteration
// per ARCHITECTURE.md §Persistence Backends; they will be wired when read
// routing is introduced.
func newRedisClient(cfg config.RedisConfig) *redis.Client {
return redisconn.NewMasterClient(cfg.Conn)
}
// instrumentRedisClient attaches the OpenTelemetry tracing and metrics
// instrumentation to client when telemetryRuntime is available. The
// actual instrumentation lives in `pkg/redisconn` so every Galaxy service
// shares one surface.
func instrumentRedisClient(redisClient *redis.Client, telemetryRuntime *telemetry.Runtime) error {
if redisClient == nil {
return errors.New("instrument redis client: nil client")
}
if telemetryRuntime == nil {
return nil
}
return redisconn.Instrument(redisClient,
redisconn.WithTracerProvider(telemetryRuntime.TracerProvider()),
redisconn.WithMeterProvider(telemetryRuntime.MeterProvider()),
)
}
// pingRedis performs a single Redis PING bounded by
// cfg.Conn.OperationTimeout to confirm that the configured Redis endpoint
// is reachable at startup.
func pingRedis(ctx context.Context, cfg config.RedisConfig, redisClient *redis.Client) error {
return redisconn.Ping(ctx, redisClient, cfg.Conn.OperationTimeout)
}
// newDockerClient constructs a Docker SDK client for cfg.Host with an
// optional API version override. The bootstrap layer opens and pings
// the client; the production Docker adapter wraps it for the service
// layer.
func newDockerClient(cfg config.DockerConfig) (*client.Client, error) {
options := []client.Opt{client.WithHost(cfg.Host)}
if cfg.APIVersion == "" {
options = append(options, client.WithAPIVersionNegotiation())
} else {
options = append(options, client.WithVersion(cfg.APIVersion))
}
docker, err := client.NewClientWithOpts(options...)
if err != nil {
return nil, fmt.Errorf("new docker client: %w", err)
}
return docker, nil
}
// pingDocker bounds one Docker daemon ping under timeout and returns a
// wrapped error so startup failures are easy to spot in service logs.
func pingDocker(ctx context.Context, dockerClient *client.Client, timeout time.Duration) error {
if dockerClient == nil {
return errors.New("ping docker: nil client")
}
if timeout <= 0 {
return errors.New("ping docker: timeout must be positive")
}
pingCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
if _, err := dockerClient.Ping(pingCtx); err != nil {
return fmt.Errorf("ping docker: %w", err)
}
return nil
}
+82
View File
@@ -0,0 +1,82 @@
package app
import (
"context"
"testing"
"time"
"galaxy/redisconn"
"galaxy/rtmanager/internal/config"
"github.com/alicebob/miniredis/v2"
"github.com/stretchr/testify/require"
)
func newTestRedisCfg(addr string) config.RedisConfig {
return config.RedisConfig{
Conn: redisconn.Config{
MasterAddr: addr,
Password: "test",
OperationTimeout: time.Second,
},
}
}
func TestPingRedisSucceedsAgainstMiniredis(t *testing.T) {
t.Parallel()
server := miniredis.RunT(t)
redisCfg := newTestRedisCfg(server.Addr())
client := newRedisClient(redisCfg)
t.Cleanup(func() { _ = client.Close() })
require.NoError(t, pingRedis(context.Background(), redisCfg, client))
}
func TestPingRedisReturnsErrorWhenClosed(t *testing.T) {
t.Parallel()
server := miniredis.RunT(t)
redisCfg := newTestRedisCfg(server.Addr())
client := newRedisClient(redisCfg)
require.NoError(t, client.Close())
require.Error(t, pingRedis(context.Background(), redisCfg, client))
}
func TestNewDockerClientHonoursHostOverride(t *testing.T) {
t.Parallel()
docker, err := newDockerClient(config.DockerConfig{
Host: "unix:///var/run/docker.sock",
APIVersion: "1.43",
Network: "galaxy-net",
LogDriver: "json-file",
PullPolicy: config.ImagePullPolicyIfMissing,
})
require.NoError(t, err)
require.NotNil(t, docker)
require.NoError(t, docker.Close())
}
func TestPingDockerRejectsNilClient(t *testing.T) {
t.Parallel()
require.Error(t, pingDocker(context.Background(), nil, time.Second))
}
func TestPingDockerRejectsNonPositiveTimeout(t *testing.T) {
t.Parallel()
docker, err := newDockerClient(config.DockerConfig{
Host: "unix:///var/run/docker.sock",
Network: "galaxy-net",
LogDriver: "json-file",
})
require.NoError(t, err)
t.Cleanup(func() { _ = docker.Close() })
require.Error(t, pingDocker(context.Background(), docker, 0))
}
+262
View File
@@ -0,0 +1,262 @@
package app
import (
"context"
"database/sql"
"errors"
"fmt"
"log/slog"
"time"
"galaxy/postgres"
"galaxy/redisconn"
"galaxy/rtmanager/internal/adapters/postgres/migrations"
"galaxy/rtmanager/internal/api/internalhttp"
"galaxy/rtmanager/internal/config"
"galaxy/rtmanager/internal/telemetry"
dockerclient "github.com/docker/docker/client"
"github.com/redis/go-redis/v9"
)
// Runtime owns the runnable Runtime Manager process plus the cleanup
// functions that release runtime resources after shutdown.
type Runtime struct {
cfg config.Config
app *App
wiring *wiring
internalServer *internalhttp.Server
cleanupFns []func() error
}
// NewRuntime constructs the runnable Runtime Manager process from cfg.
//
// PostgreSQL migrations apply strictly before the internal HTTP listener
// becomes ready. The runtime opens one shared `*redis.Client`, one
// `*sql.DB`, one Docker SDK client, and one OpenTelemetry runtime; all
// are released in reverse construction order on shutdown.
func NewRuntime(ctx context.Context, cfg config.Config, logger *slog.Logger) (*Runtime, error) {
if ctx == nil {
return nil, errors.New("new rtmanager runtime: nil context")
}
if err := cfg.Validate(); err != nil {
return nil, fmt.Errorf("new rtmanager runtime: %w", err)
}
if logger == nil {
logger = slog.Default()
}
runtime := &Runtime{
cfg: cfg,
}
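// cleanupOnError releases everything acquired so far and joins any
// cleanup failure with the original error.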
cleanupOnError := func(err error) (*Runtime, error) {
if cleanupErr := runtime.Close(); cleanupErr != nil {
return nil, fmt.Errorf("%w; cleanup: %w", err, cleanupErr)
}
return nil, err
}
telemetryRuntime, err := telemetry.NewProcess(ctx, telemetry.ProcessConfig{
ServiceName: cfg.Telemetry.ServiceName,
TracesExporter: cfg.Telemetry.TracesExporter,
MetricsExporter: cfg.Telemetry.MetricsExporter,
TracesProtocol: cfg.Telemetry.TracesProtocol,
MetricsProtocol: cfg.Telemetry.MetricsProtocol,
StdoutTracesEnabled: cfg.Telemetry.StdoutTracesEnabled,
StdoutMetricsEnabled: cfg.Telemetry.StdoutMetricsEnabled,
}, logger)
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: telemetry: %w", err))
}
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
shutdownCtx, cancel := context.WithTimeout(context.Background(), cfg.ShutdownTimeout)
defer cancel()
return telemetryRuntime.Shutdown(shutdownCtx)
})
redisClient := newRedisClient(cfg.Redis)
if err := instrumentRedisClient(redisClient, telemetryRuntime); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
}
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
err := redisClient.Close()
if errors.Is(err, redis.ErrClosed) {
return nil
}
return err
})
if err := pingRedis(ctx, cfg.Redis, redisClient); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
}
pgPool, err := postgres.OpenPrimary(ctx, cfg.Postgres.Conn,
postgres.WithTracerProvider(telemetryRuntime.TracerProvider()),
postgres.WithMeterProvider(telemetryRuntime.MeterProvider()),
)
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: open postgres: %w", err))
}
runtime.cleanupFns = append(runtime.cleanupFns, pgPool.Close)
unregisterPGStats, err := postgres.InstrumentDBStats(pgPool,
postgres.WithMeterProvider(telemetryRuntime.MeterProvider()),
)
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: instrument postgres: %w", err))
}
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
return unregisterPGStats()
})
if err := postgres.Ping(ctx, pgPool, cfg.Postgres.Conn.OperationTimeout); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: ping postgres: %w", err))
}
if err := postgres.RunMigrations(ctx, pgPool, migrations.FS(), "."); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: run postgres migrations: %w", err))
}
dockerClient, err := newDockerClient(cfg.Docker)
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
}
runtime.cleanupFns = append(runtime.cleanupFns, dockerClient.Close)
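// DockerConfig carries no per-call timeout, so the Postgres operation
// timeout bounds the startup Docker ping as well.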
if err := pingDocker(ctx, dockerClient, cfg.Postgres.Conn.OperationTimeout); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
}
wiring, err := newWiring(cfg, redisClient, pgPool, dockerClient, time.Now, logger, telemetryRuntime)
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: wiring: %w", err))
}
runtime.wiring = wiring
runtime.cleanupFns = append(runtime.cleanupFns, wiring.close)
if err := wiring.registerTelemetryGauges(); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: register telemetry gauges: %w", err))
}
if err := wiring.reconciler.ReconcileNow(ctx); err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: initial reconcile: %w", err))
}
probe := newReadinessProbe(pgPool, redisClient, dockerClient, cfg)
internalServer, err := internalhttp.NewServer(internalhttp.Config{
Addr: cfg.InternalHTTP.Addr,
ReadHeaderTimeout: cfg.InternalHTTP.ReadHeaderTimeout,
ReadTimeout: cfg.InternalHTTP.ReadTimeout,
WriteTimeout: cfg.InternalHTTP.WriteTimeout,
IdleTimeout: cfg.InternalHTTP.IdleTimeout,
}, internalhttp.Dependencies{
Logger: logger,
Telemetry: telemetryRuntime,
Readiness: probe,
RuntimeRecords: wiring.runtimeRecordStore,
StartRuntime: wiring.startRuntimeService,
StopRuntime: wiring.stopRuntimeService,
RestartRuntime: wiring.restartRuntimeService,
PatchRuntime: wiring.patchRuntimeService,
CleanupContainer: wiring.cleanupContainerService,
})
if err != nil {
return cleanupOnError(fmt.Errorf("new rtmanager runtime: internal HTTP server: %w", err))
}
runtime.internalServer = internalServer
runtime.app = New(cfg,
internalServer,
wiring.startJobsConsumer,
wiring.stopJobsConsumer,
wiring.dockerEventsListener,
wiring.healthProbeWorker,
wiring.dockerInspectWorker,
wiring.reconciler,
wiring.containerCleanupWorker,
)
return runtime, nil
}
// InternalServer returns the internal HTTP server owned by runtime. It is
// primarily exposed for tests; production code should not depend on it.
func (runtime *Runtime) InternalServer() *internalhttp.Server {
if runtime == nil {
return nil
}
return runtime.internalServer
}
// Run starts every registered component, including the internal HTTP
// listener, and blocks until ctx is canceled or one component fails.
func (runtime *Runtime) Run(ctx context.Context) error {
if runtime == nil {
return errors.New("run rtmanager runtime: nil runtime")
}
if ctx == nil {
return errors.New("run rtmanager runtime: nil context")
}
if runtime.app == nil {
return errors.New("run rtmanager runtime: nil app")
}
return runtime.app.Run(ctx)
}
// Close releases every runtime dependency in reverse construction order.
// Close is safe to call multiple times.
func (runtime *Runtime) Close() error {
if runtime == nil {
return nil
}
var joined error
for index := len(runtime.cleanupFns) - 1; index >= 0; index-- {
if err := runtime.cleanupFns[index](); err != nil {
joined = errors.Join(joined, err)
}
}
runtime.cleanupFns = nil
return joined
}
// readinessProbe pings every steady-state dependency the listener
// guards: the PostgreSQL primary, the Redis master, and the Docker
// daemon. The configured Docker network is verified once at startup
// rather than on every readiness check.
type readinessProbe struct {
pgPool *sql.DB
redisClient *redis.Client
dockerClient *dockerclient.Client
postgresTimeout time.Duration
redisTimeout time.Duration
dockerTimeout time.Duration
}
func newReadinessProbe(pgPool *sql.DB, redisClient *redis.Client, dockerClient *dockerclient.Client, cfg config.Config) *readinessProbe {
return &readinessProbe{
pgPool: pgPool,
redisClient: redisClient,
dockerClient: dockerClient,
postgresTimeout: cfg.Postgres.Conn.OperationTimeout,
redisTimeout: cfg.Redis.Conn.OperationTimeout,
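// DockerConfig defines no per-call timeout; reuse the Postgres one.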
dockerTimeout: cfg.Postgres.Conn.OperationTimeout,
}
}
// Check pings PostgreSQL, Redis, and Docker. The first failing
// dependency aborts the check so callers see a single, actionable
// error.
func (probe *readinessProbe) Check(ctx context.Context) error {
if err := postgres.Ping(ctx, probe.pgPool, probe.postgresTimeout); err != nil {
return err
}
if err := redisconn.Ping(ctx, probe.redisClient, probe.redisTimeout); err != nil {
return err
}
return pingDocker(ctx, probe.dockerClient, probe.dockerTimeout)
}
+541
View File
@@ -0,0 +1,541 @@
package app
import (
"context"
"database/sql"
"errors"
"fmt"
"log/slog"
"net/http"
"time"
"galaxy/rtmanager/internal/adapters/docker"
"galaxy/rtmanager/internal/adapters/healtheventspublisher"
"galaxy/rtmanager/internal/adapters/jobresultspublisher"
"galaxy/rtmanager/internal/adapters/lobbyclient"
"galaxy/rtmanager/internal/adapters/notificationpublisher"
"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
"galaxy/rtmanager/internal/adapters/redisstate/gamelease"
"galaxy/rtmanager/internal/adapters/redisstate/streamoffsets"
"galaxy/rtmanager/internal/config"
"galaxy/rtmanager/internal/ports"
"galaxy/rtmanager/internal/service/cleanupcontainer"
"galaxy/rtmanager/internal/service/patchruntime"
"galaxy/rtmanager/internal/service/restartruntime"
"galaxy/rtmanager/internal/service/startruntime"
"galaxy/rtmanager/internal/service/stopruntime"
"galaxy/rtmanager/internal/telemetry"
"galaxy/rtmanager/internal/worker/containercleanup"
"galaxy/rtmanager/internal/worker/dockerevents"
"galaxy/rtmanager/internal/worker/dockerinspect"
"galaxy/rtmanager/internal/worker/healthprobe"
"galaxy/rtmanager/internal/worker/reconcile"
"galaxy/rtmanager/internal/worker/startjobsconsumer"
"galaxy/rtmanager/internal/worker/stopjobsconsumer"
dockerclient "github.com/docker/docker/client"
"github.com/redis/go-redis/v9"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
// wiring owns the process-level singletons constructed once during
// `NewRuntime` and consumed by every worker and HTTP handler.
//
// The struct keeps typed fields so the runtime bootstrap can hand the
// store / adapter / service singletons to handlers and workers without
// re-resolving them.
type wiring struct {
cfg config.Config
redisClient *redis.Client
pgPool *sql.DB
dockerClient *dockerclient.Client
clock func() time.Time
logger *slog.Logger
telemetry *telemetry.Runtime
// Persistence stores.
runtimeRecordStore *runtimerecordstore.Store
operationLogStore *operationlogstore.Store
healthSnapshotStore *healthsnapshotstore.Store
streamOffsetStore *streamoffsets.Store
gameLeaseStore *gamelease.Store
// External adapters.
dockerAdapter *docker.Client
lobbyClient *lobbyclient.Client
notificationPublisher *notificationpublisher.Publisher
healthEventsPublisher *healtheventspublisher.Publisher
jobResultsPublisher *jobresultspublisher.Publisher
// Service layer.
startRuntimeService *startruntime.Service
stopRuntimeService *stopruntime.Service
restartRuntimeService *restartruntime.Service
patchRuntimeService *patchruntime.Service
cleanupContainerService *cleanupcontainer.Service
// Worker layer.
startJobsConsumer *startjobsconsumer.Consumer
stopJobsConsumer *stopjobsconsumer.Consumer
dockerEventsListener *dockerevents.Listener
healthProbeWorker *healthprobe.Worker
dockerInspectWorker *dockerinspect.Worker
reconciler *reconcile.Reconciler
containerCleanupWorker *containercleanup.Worker
// closers releases adapter-level resources at runtime shutdown.
closers []func() error
}
// newWiring constructs the process-level dependency set, the persistence
// stores, the external adapters, and the service layer. It validates
// every required collaborator so callers can rely on them being non-nil.
func newWiring(
cfg config.Config,
redisClient *redis.Client,
pgPool *sql.DB,
dockerClient *dockerclient.Client,
clock func() time.Time,
logger *slog.Logger,
telemetryRuntime *telemetry.Runtime,
) (*wiring, error) {
if redisClient == nil {
return nil, errors.New("new rtmanager wiring: nil redis client")
}
if pgPool == nil {
return nil, errors.New("new rtmanager wiring: nil postgres pool")
}
if dockerClient == nil {
return nil, errors.New("new rtmanager wiring: nil docker client")
}
if clock == nil {
clock = time.Now
}
if logger == nil {
logger = slog.Default()
}
if telemetryRuntime == nil {
return nil, errors.New("new rtmanager wiring: nil telemetry runtime")
}
w := &wiring{
cfg: cfg,
redisClient: redisClient,
pgPool: pgPool,
dockerClient: dockerClient,
clock: clock,
logger: logger,
telemetry: telemetryRuntime,
}
if err := w.buildPersistence(); err != nil {
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
}
if err := w.buildAdapters(); err != nil {
_ = w.close()
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
}
if err := w.buildServices(); err != nil {
_ = w.close()
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
}
if err := w.buildWorkers(); err != nil {
_ = w.close()
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
}
return w, nil
}
func (w *wiring) buildPersistence() error {
runtimeStore, err := runtimerecordstore.New(runtimerecordstore.Config{
DB: w.pgPool,
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
})
if err != nil {
return fmt.Errorf("runtime record store: %w", err)
}
w.runtimeRecordStore = runtimeStore
operationStore, err := operationlogstore.New(operationlogstore.Config{
DB: w.pgPool,
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
})
if err != nil {
return fmt.Errorf("operation log store: %w", err)
}
w.operationLogStore = operationStore
snapshotStore, err := healthsnapshotstore.New(healthsnapshotstore.Config{
DB: w.pgPool,
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
})
if err != nil {
return fmt.Errorf("health snapshot store: %w", err)
}
w.healthSnapshotStore = snapshotStore
offsetStore, err := streamoffsets.New(streamoffsets.Config{Client: w.redisClient})
if err != nil {
return fmt.Errorf("stream offset store: %w", err)
}
w.streamOffsetStore = offsetStore
leaseStore, err := gamelease.New(gamelease.Config{Client: w.redisClient})
if err != nil {
return fmt.Errorf("game lease store: %w", err)
}
w.gameLeaseStore = leaseStore
return nil
}
func (w *wiring) buildAdapters() error {
dockerAdapter, err := docker.NewClient(docker.Config{
Docker: w.dockerClient,
LogDriver: w.cfg.Docker.LogDriver,
LogOpts: w.cfg.Docker.LogOpts,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("docker adapter: %w", err)
}
w.dockerAdapter = dockerAdapter
lobby, err := lobbyclient.NewClient(lobbyclient.Config{
BaseURL: w.cfg.Lobby.BaseURL,
RequestTimeout: w.cfg.Lobby.Timeout,
})
if err != nil {
return fmt.Errorf("lobby client: %w", err)
}
w.lobbyClient = lobby
w.closers = append(w.closers, lobby.Close)
notificationPub, err := notificationpublisher.NewPublisher(notificationpublisher.Config{
Client: w.redisClient,
Stream: w.cfg.Streams.NotificationIntents,
})
if err != nil {
return fmt.Errorf("notification publisher: %w", err)
}
w.notificationPublisher = notificationPub
healthPub, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{
Client: w.redisClient,
Snapshots: w.healthSnapshotStore,
Stream: w.cfg.Streams.HealthEvents,
})
if err != nil {
return fmt.Errorf("health events publisher: %w", err)
}
w.healthEventsPublisher = healthPub
jobResultsPub, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
Client: w.redisClient,
Stream: w.cfg.Streams.JobResults,
})
if err != nil {
return fmt.Errorf("job results publisher: %w", err)
}
w.jobResultsPublisher = jobResultsPub
return nil
}
func (w *wiring) buildServices() error {
startService, err := startruntime.NewService(startruntime.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
Docker: w.dockerAdapter,
Leases: w.gameLeaseStore,
HealthEvents: w.healthEventsPublisher,
Notifications: w.notificationPublisher,
Lobby: w.lobbyClient,
Container: w.cfg.Container,
DockerCfg: w.cfg.Docker,
Coordination: w.cfg.Coordination,
Telemetry: w.telemetry,
Logger: w.logger,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("start runtime service: %w", err)
}
w.startRuntimeService = startService
stopService, err := stopruntime.NewService(stopruntime.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
Docker: w.dockerAdapter,
Leases: w.gameLeaseStore,
HealthEvents: w.healthEventsPublisher,
Container: w.cfg.Container,
Coordination: w.cfg.Coordination,
Telemetry: w.telemetry,
Logger: w.logger,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("stop runtime service: %w", err)
}
w.stopRuntimeService = stopService
restartService, err := restartruntime.NewService(restartruntime.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
Docker: w.dockerAdapter,
Leases: w.gameLeaseStore,
StopService: stopService,
StartService: startService,
Coordination: w.cfg.Coordination,
Telemetry: w.telemetry,
Logger: w.logger,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("restart runtime service: %w", err)
}
w.restartRuntimeService = restartService
patchService, err := patchruntime.NewService(patchruntime.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
Docker: w.dockerAdapter,
Leases: w.gameLeaseStore,
StopService: stopService,
StartService: startService,
Coordination: w.cfg.Coordination,
Telemetry: w.telemetry,
Logger: w.logger,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("patch runtime service: %w", err)
}
w.patchRuntimeService = patchService
cleanupService, err := cleanupcontainer.NewService(cleanupcontainer.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
Docker: w.dockerAdapter,
Leases: w.gameLeaseStore,
Coordination: w.cfg.Coordination,
Telemetry: w.telemetry,
Logger: w.logger,
Clock: w.clock,
})
if err != nil {
return fmt.Errorf("cleanup container service: %w", err)
}
w.cleanupContainerService = cleanupService
return nil
}
// buildWorkers constructs every long-lived background worker: the
// asynchronous Lobby ↔ RTM stream consumers, the Docker events
// listener, the health probe and inspect workers, the reconciler, and
// the container-cleanup worker. All participate in the process
// lifecycle as `app.Component`s; `internal/app/runtime.go` passes them
// into `app.New` alongside the internal HTTP server.
func (w *wiring) buildWorkers() error {
startConsumer, err := startjobsconsumer.NewConsumer(startjobsconsumer.Config{
Client: w.redisClient,
Stream: w.cfg.Streams.StartJobs,
BlockTimeout: w.cfg.Streams.BlockTimeout,
StartService: w.startRuntimeService,
JobResults: w.jobResultsPublisher,
OffsetStore: w.streamOffsetStore,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("start jobs consumer: %w", err)
}
w.startJobsConsumer = startConsumer
stopConsumer, err := stopjobsconsumer.NewConsumer(stopjobsconsumer.Config{
Client: w.redisClient,
Stream: w.cfg.Streams.StopJobs,
BlockTimeout: w.cfg.Streams.BlockTimeout,
StopService: w.stopRuntimeService,
JobResults: w.jobResultsPublisher,
OffsetStore: w.streamOffsetStore,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("stop jobs consumer: %w", err)
}
w.stopJobsConsumer = stopConsumer
eventsListener, err := dockerevents.NewListener(dockerevents.Dependencies{
Docker: w.dockerAdapter,
RuntimeRecords: w.runtimeRecordStore,
HealthEvents: w.healthEventsPublisher,
Telemetry: w.telemetry,
Clock: w.clock,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("docker events listener: %w", err)
}
w.dockerEventsListener = eventsListener
probeHTTPClient, err := newProbeHTTPClient(w.telemetry)
if err != nil {
return fmt.Errorf("health probe http client: %w", err)
}
probeWorker, err := healthprobe.NewWorker(healthprobe.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
HealthEvents: w.healthEventsPublisher,
HTTPClient: probeHTTPClient,
Telemetry: w.telemetry,
Interval: w.cfg.Health.ProbeInterval,
ProbeTimeout: w.cfg.Health.ProbeTimeout,
FailuresThreshold: w.cfg.Health.ProbeFailuresThreshold,
Clock: w.clock,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("health probe worker: %w", err)
}
w.healthProbeWorker = probeWorker
inspectWorker, err := dockerinspect.NewWorker(dockerinspect.Dependencies{
Docker: w.dockerAdapter,
RuntimeRecords: w.runtimeRecordStore,
HealthEvents: w.healthEventsPublisher,
Telemetry: w.telemetry,
Interval: w.cfg.Health.InspectInterval,
Clock: w.clock,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("docker inspect worker: %w", err)
}
w.dockerInspectWorker = inspectWorker
reconciler, err := reconcile.NewReconciler(reconcile.Dependencies{
Docker: w.dockerAdapter,
RuntimeRecords: w.runtimeRecordStore,
OperationLogs: w.operationLogStore,
HealthEvents: w.healthEventsPublisher,
Leases: w.gameLeaseStore,
Telemetry: w.telemetry,
DockerCfg: w.cfg.Docker,
ContainerCfg: w.cfg.Container,
Coordination: w.cfg.Coordination,
Interval: w.cfg.Cleanup.ReconcileInterval,
Clock: w.clock,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("reconciler: %w", err)
}
w.reconciler = reconciler
cleanupWorker, err := containercleanup.NewWorker(containercleanup.Dependencies{
RuntimeRecords: w.runtimeRecordStore,
Cleanup: w.cleanupContainerService,
Retention: w.cfg.Container.Retention,
Interval: w.cfg.Cleanup.CleanupInterval,
Clock: w.clock,
Logger: w.logger,
})
if err != nil {
return fmt.Errorf("container cleanup worker: %w", err)
}
w.containerCleanupWorker = cleanupWorker
return nil
}
// newProbeHTTPClient constructs the otelhttp-instrumented HTTP client
// the active health probe uses to call engine `/healthz`. It clones
// http.DefaultTransport so production wiring never mutates the shared
// process-wide transport (mirrors the lobby internal client).
func newProbeHTTPClient(telemetryRuntime *telemetry.Runtime) (*http.Client, error) {
transport, ok := http.DefaultTransport.(*http.Transport)
if !ok {
return nil, errors.New("default http transport is not *http.Transport")
}
cloned := transport.Clone()
instrumented := otelhttp.NewTransport(cloned,
otelhttp.WithTracerProvider(telemetryRuntime.TracerProvider()),
otelhttp.WithMeterProvider(telemetryRuntime.MeterProvider()),
)
return &http.Client{Transport: instrumented}, nil
}
// registerTelemetryGauges installs the runtime-records-by-status gauge
// callback so the telemetry runtime can observe the persistent store
// without holding a strong reference to the wiring.
func (w *wiring) registerTelemetryGauges() error {
probe := newRuntimeRecordsProbe(w.runtimeRecordStore)
return w.telemetry.RegisterGauges(telemetry.GaugeDependencies{
RuntimeRecordsByStatus: probe,
Logger: w.logger,
})
}
// close releases adapter-level resources owned by the wiring layer.
// Returns the joined error of every closer; the caller is expected to
// invoke this once during process shutdown.
func (w *wiring) close() error {
var joined error
for index := len(w.closers) - 1; index >= 0; index-- {
if err := w.closers[index](); err != nil {
joined = errors.Join(joined, err)
}
}
w.closers = nil
return joined
}
// runtimeRecordsProbe adapts runtimerecordstore.Store to
// telemetry.RuntimeRecordsByStatusProbe by translating the typed status
// keys into the string keys the gauge expects.
type runtimeRecordsProbe struct {
store *runtimerecordstore.Store
}
func newRuntimeRecordsProbe(store *runtimerecordstore.Store) *runtimeRecordsProbe {
return &runtimeRecordsProbe{store: store}
}
func (p *runtimeRecordsProbe) CountByStatus(ctx context.Context) (map[string]int, error) {
if p == nil || p.store == nil {
return nil, errors.New("runtime records probe: nil store")
}
counts, err := p.store.CountByStatus(ctx)
if err != nil {
return nil, err
}
out := make(map[string]int, len(counts))
for status, count := range counts {
out[string(status)] = count
}
return out, nil
}
// Compile-time assertions that the constructed adapters satisfy the
// expected port surfaces; these prevent silent regressions when a
// port shape changes.
var (
_ ports.RuntimeRecordStore = (*runtimerecordstore.Store)(nil)
_ ports.OperationLogStore = (*operationlogstore.Store)(nil)
_ ports.HealthSnapshotStore = (*healthsnapshotstore.Store)(nil)
_ ports.StreamOffsetStore = (*streamoffsets.Store)(nil)
_ ports.GameLeaseStore = (*gamelease.Store)(nil)
_ ports.DockerClient = (*docker.Client)(nil)
_ ports.LobbyInternalClient = (*lobbyclient.Client)(nil)
_ ports.NotificationIntentPublisher = (*notificationpublisher.Publisher)(nil)
_ ports.HealthEventPublisher = (*healtheventspublisher.Publisher)(nil)
_ ports.JobResultPublisher = (*jobresultspublisher.Publisher)(nil)
_ Component = (*reconcile.Reconciler)(nil)
_ Component = (*containercleanup.Worker)(nil)
_ containercleanup.Cleaner = (*cleanupcontainer.Service)(nil)
)
+632
View File
@@ -0,0 +1,632 @@
// Package config loads the Runtime Manager process configuration from
// environment variables.
package config
import (
"fmt"
"strings"
"time"
"galaxy/postgres"
"galaxy/redisconn"
"galaxy/rtmanager/internal/telemetry"
)
const (
envPrefix = "RTMANAGER"
shutdownTimeoutEnvVar = "RTMANAGER_SHUTDOWN_TIMEOUT"
logLevelEnvVar = "RTMANAGER_LOG_LEVEL"
internalHTTPAddrEnvVar = "RTMANAGER_INTERNAL_HTTP_ADDR"
internalHTTPReadHeaderTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_READ_HEADER_TIMEOUT"
internalHTTPReadTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT"
internalHTTPWriteTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT"
internalHTTPIdleTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT"
dockerHostEnvVar = "RTMANAGER_DOCKER_HOST"
dockerAPIVersionEnvVar = "RTMANAGER_DOCKER_API_VERSION"
dockerNetworkEnvVar = "RTMANAGER_DOCKER_NETWORK"
dockerLogDriverEnvVar = "RTMANAGER_DOCKER_LOG_DRIVER"
dockerLogOptsEnvVar = "RTMANAGER_DOCKER_LOG_OPTS"
imagePullPolicyEnvVar = "RTMANAGER_IMAGE_PULL_POLICY"
defaultCPUQuotaEnvVar = "RTMANAGER_DEFAULT_CPU_QUOTA"
defaultMemoryEnvVar = "RTMANAGER_DEFAULT_MEMORY"
defaultPIDsLimitEnvVar = "RTMANAGER_DEFAULT_PIDS_LIMIT"
containerStopTimeoutSecondsEnvVar = "RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS"
containerRetentionDaysEnvVar = "RTMANAGER_CONTAINER_RETENTION_DAYS"
engineStateMountPathEnvVar = "RTMANAGER_ENGINE_STATE_MOUNT_PATH"
engineStateEnvNameEnvVar = "RTMANAGER_ENGINE_STATE_ENV_NAME"
gameStateDirModeEnvVar = "RTMANAGER_GAME_STATE_DIR_MODE"
gameStateOwnerUIDEnvVar = "RTMANAGER_GAME_STATE_OWNER_UID"
gameStateOwnerGIDEnvVar = "RTMANAGER_GAME_STATE_OWNER_GID"
gameStateRootEnvVar = "RTMANAGER_GAME_STATE_ROOT"
startJobsStreamEnvVar = "RTMANAGER_REDIS_START_JOBS_STREAM"
stopJobsStreamEnvVar = "RTMANAGER_REDIS_STOP_JOBS_STREAM"
jobResultsStreamEnvVar = "RTMANAGER_REDIS_JOB_RESULTS_STREAM"
healthEventsStreamEnvVar = "RTMANAGER_REDIS_HEALTH_EVENTS_STREAM"
notificationIntentsStreamEnv = "RTMANAGER_NOTIFICATION_INTENTS_STREAM"
streamBlockTimeoutEnvVar = "RTMANAGER_STREAM_BLOCK_TIMEOUT"
inspectIntervalEnvVar = "RTMANAGER_INSPECT_INTERVAL"
probeIntervalEnvVar = "RTMANAGER_PROBE_INTERVAL"
probeTimeoutEnvVar = "RTMANAGER_PROBE_TIMEOUT"
probeFailuresThresholdEnvVar = "RTMANAGER_PROBE_FAILURES_THRESHOLD"
reconcileIntervalEnvVar = "RTMANAGER_RECONCILE_INTERVAL"
cleanupIntervalEnvVar = "RTMANAGER_CLEANUP_INTERVAL"
gameLeaseTTLSecondsEnvVar = "RTMANAGER_GAME_LEASE_TTL_SECONDS"
lobbyInternalBaseURLEnvVar = "RTMANAGER_LOBBY_INTERNAL_BASE_URL"
lobbyInternalTimeoutEnvVar = "RTMANAGER_LOBBY_INTERNAL_TIMEOUT"
otelServiceNameEnvVar = "OTEL_SERVICE_NAME"
otelTracesExporterEnvVar = "OTEL_TRACES_EXPORTER"
otelMetricsExporterEnvVar = "OTEL_METRICS_EXPORTER"
otelExporterOTLPProtocolEnvVar = "OTEL_EXPORTER_OTLP_PROTOCOL"
otelExporterOTLPTracesProtocolEnvVar = "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"
otelExporterOTLPMetricsProtocolEnvVar = "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL"
otelStdoutTracesEnabledEnvVar = "RTMANAGER_OTEL_STDOUT_TRACES_ENABLED"
otelStdoutMetricsEnabledEnvVar = "RTMANAGER_OTEL_STDOUT_METRICS_ENABLED"
defaultShutdownTimeout = 30 * time.Second
defaultLogLevel = "info"
defaultInternalHTTPAddr = ":8096"
defaultReadHeaderTimeout = 2 * time.Second
defaultReadTimeout = 5 * time.Second
defaultWriteTimeout = 15 * time.Second
defaultIdleTimeout = 60 * time.Second
defaultDockerHost = "unix:///var/run/docker.sock"
defaultDockerNetwork = "galaxy-net"
defaultDockerLogDriver = "json-file"
defaultImagePullPolicy = ImagePullPolicyIfMissing
defaultCPUQuota = 1.0
defaultMemory = "512m"
defaultPIDsLimit = 512
defaultContainerStopTimeout = 30 * time.Second
defaultContainerRetention = 30 * 24 * time.Hour
defaultEngineStateMountPath = "/var/lib/galaxy-game"
defaultEngineStateEnvName = "GAME_STATE_PATH"
defaultGameStateDirMode = 0o750
defaultStartJobsStream = "runtime:start_jobs"
defaultStopJobsStream = "runtime:stop_jobs"
defaultJobResultsStream = "runtime:job_results"
defaultHealthEventsStream = "runtime:health_events"
defaultNotificationIntentsKey = "notification:intents"
defaultStreamBlockTimeout = 5 * time.Second
defaultInspectInterval = 30 * time.Second
defaultProbeInterval = 15 * time.Second
defaultProbeTimeout = 2 * time.Second
defaultProbeFailuresThreshold = 3
defaultReconcileInterval = 5 * time.Minute
defaultCleanupInterval = time.Hour
defaultGameLeaseTTL = 60 * time.Second
defaultLobbyInternalTimeout = 2 * time.Second
defaultOTelServiceName = "galaxy-rtmanager"
)
// ImagePullPolicy enumerates the supported image pull policies. The start
// service validates a producer-supplied `image_ref` against this policy at
// start time.
type ImagePullPolicy string
// Supported pull policies, frozen by `rtmanager/README.md` §Configuration.
const (
ImagePullPolicyIfMissing ImagePullPolicy = "if_missing"
ImagePullPolicyAlways ImagePullPolicy = "always"
ImagePullPolicyNever ImagePullPolicy = "never"
)
// Validate reports whether p is one of the frozen pull policies.
func (p ImagePullPolicy) Validate() error {
switch p {
case ImagePullPolicyIfMissing, ImagePullPolicyAlways, ImagePullPolicyNever:
return nil
default:
return fmt.Errorf("image pull policy %q must be one of %q, %q, %q",
p, ImagePullPolicyIfMissing, ImagePullPolicyAlways, ImagePullPolicyNever)
}
}
// Config stores the full Runtime Manager process configuration.
type Config struct {
// ShutdownTimeout bounds graceful shutdown of every long-lived
// component.
ShutdownTimeout time.Duration
// Logging configures the process-wide structured logger.
Logging LoggingConfig
// InternalHTTP configures the trusted internal HTTP listener that
// serves probes and the GM/Admin REST surface.
InternalHTTP InternalHTTPConfig
// Docker configures the Docker SDK client RTM uses to drive the local
// Docker daemon.
Docker DockerConfig
// Postgres configures the PostgreSQL-backed durable store consumed via
// `pkg/postgres`.
Postgres PostgresConfig
// Redis configures the shared Redis connection topology consumed via
// `pkg/redisconn`.
Redis RedisConfig
// Streams stores the stable Redis Stream names RTM reads from and
// writes to.
Streams StreamsConfig
// Container stores the per-container defaults applied at start time
// when the resolved image does not declare its own labels.
Container ContainerConfig
// Health configures the periodic health-monitoring workers (events
// listener, inspect, active probe).
Health HealthConfig
// Cleanup configures the reconciler and container-cleanup workers.
Cleanup CleanupConfig
// Coordination configures the per-game Redis lease used to serialise
// operations across all entry points.
Coordination CoordinationConfig
// Lobby configures the synchronous Lobby internal REST client used by
// the start service for ancillary lookups.
Lobby LobbyConfig
// Telemetry configures the process-wide OpenTelemetry runtime.
Telemetry TelemetryConfig
}
// LoggingConfig configures the process-wide structured logger.
type LoggingConfig struct {
// Level stores the process log level accepted by log/slog.
Level string
}
// InternalHTTPConfig configures the trusted internal HTTP listener.
type InternalHTTPConfig struct {
// Addr stores the TCP listen address.
Addr string
// ReadHeaderTimeout bounds request-header reading.
ReadHeaderTimeout time.Duration
// ReadTimeout bounds reading one request.
ReadTimeout time.Duration
// WriteTimeout bounds writing one response.
WriteTimeout time.Duration
// IdleTimeout bounds how long keep-alive connections stay open.
IdleTimeout time.Duration
}
// Validate reports whether cfg stores a usable internal HTTP listener
// configuration.
func (cfg InternalHTTPConfig) Validate() error {
switch {
case strings.TrimSpace(cfg.Addr) == "":
return fmt.Errorf("internal HTTP addr must not be empty")
case !isTCPAddr(cfg.Addr):
return fmt.Errorf("internal HTTP addr %q must use host:port form", cfg.Addr)
case cfg.ReadHeaderTimeout <= 0:
return fmt.Errorf("internal HTTP read header timeout must be positive")
case cfg.ReadTimeout <= 0:
return fmt.Errorf("internal HTTP read timeout must be positive")
case cfg.WriteTimeout <= 0:
return fmt.Errorf("internal HTTP write timeout must be positive")
case cfg.IdleTimeout <= 0:
return fmt.Errorf("internal HTTP idle timeout must be positive")
default:
return nil
}
}
// DockerConfig configures the Docker SDK client.
type DockerConfig struct {
// Host stores the Docker daemon endpoint (e.g.
// `unix:///var/run/docker.sock`).
Host string
// APIVersion overrides the Docker API version. Empty lets the SDK
// negotiate.
APIVersion string
// Network stores the user-defined Docker bridge network containers
// attach to. Provisioned outside RTM; missing network is a fail-fast
// condition at startup.
Network string
// LogDriver stores the Docker logging driver applied to engine
// containers.
LogDriver string
// LogOpts stores the comma-separated `key=value` driver options.
LogOpts string
// PullPolicy stores the configured image pull policy.
PullPolicy ImagePullPolicy
}
// Validate reports whether cfg stores a usable Docker configuration.
func (cfg DockerConfig) Validate() error {
switch {
case strings.TrimSpace(cfg.Host) == "":
return fmt.Errorf("docker host must not be empty")
case strings.TrimSpace(cfg.Network) == "":
return fmt.Errorf("docker network must not be empty")
case strings.TrimSpace(cfg.LogDriver) == "":
return fmt.Errorf("docker log driver must not be empty")
}
return cfg.PullPolicy.Validate()
}
// PostgresConfig configures the PostgreSQL-backed durable store consumed
// via `pkg/postgres`.
type PostgresConfig struct {
// Conn carries the primary plus replica DSN topology and pool tuning.
Conn postgres.Config
}
// Validate reports whether cfg stores a usable PostgreSQL configuration.
func (cfg PostgresConfig) Validate() error {
return cfg.Conn.Validate()
}
// RedisConfig configures the Runtime Manager Redis connection topology.
type RedisConfig struct {
// Conn carries the connection topology (master, replicas, password,
// db, per-call timeout).
Conn redisconn.Config
}
// Validate reports whether cfg stores a usable Redis configuration.
func (cfg RedisConfig) Validate() error {
return cfg.Conn.Validate()
}
// StreamsConfig stores the stable Redis Stream names used by Runtime
// Manager.
type StreamsConfig struct {
// StartJobs stores the Redis Streams key Lobby writes start jobs to.
StartJobs string
// StopJobs stores the Redis Streams key Lobby writes stop jobs to.
StopJobs string
// JobResults stores the Redis Streams key RTM writes job outcomes
// to.
JobResults string
// HealthEvents stores the Redis Streams key RTM publishes
// technical health events to.
HealthEvents string
// NotificationIntents stores the Redis Streams key RTM publishes
// admin-only notification intents to.
NotificationIntents string
// BlockTimeout bounds the maximum blocking read window for stream
// consumers.
BlockTimeout time.Duration
}
// Validate reports whether cfg stores usable stream names.
func (cfg StreamsConfig) Validate() error {
switch {
case strings.TrimSpace(cfg.StartJobs) == "":
return fmt.Errorf("redis start jobs stream must not be empty")
case strings.TrimSpace(cfg.StopJobs) == "":
return fmt.Errorf("redis stop jobs stream must not be empty")
case strings.TrimSpace(cfg.JobResults) == "":
return fmt.Errorf("redis job results stream must not be empty")
case strings.TrimSpace(cfg.HealthEvents) == "":
return fmt.Errorf("redis health events stream must not be empty")
case strings.TrimSpace(cfg.NotificationIntents) == "":
return fmt.Errorf("redis notification intents stream must not be empty")
case cfg.BlockTimeout <= 0:
return fmt.Errorf("redis stream block timeout must be positive")
default:
return nil
}
}
// ContainerConfig stores the per-container defaults applied at start
// time. Resource defaults apply when the resolved engine image does not
// expose `com.galaxy.cpu_quota` / `com.galaxy.memory` /
// `com.galaxy.pids_limit` labels.
type ContainerConfig struct {
// DefaultCPUQuota is the fallback `--cpus` value applied when the
// image does not declare `com.galaxy.cpu_quota`.
DefaultCPUQuota float64
// DefaultMemory is the fallback `--memory` value applied when the
// image does not declare `com.galaxy.memory`.
DefaultMemory string
// DefaultPIDsLimit is the fallback `--pids-limit` value applied
// when the image does not declare `com.galaxy.pids_limit`.
DefaultPIDsLimit int
// StopTimeout bounds graceful container stop before Docker fires
// SIGKILL.
StopTimeout time.Duration
// Retention stores the TTL after which `status=stopped` containers
// are removed by the cleanup worker.
Retention time.Duration
// EngineStateMountPath is the in-container path the per-game state
// directory is bind-mounted to.
EngineStateMountPath string
// EngineStateEnvName is the env-var name forwarded to the engine
// pointing at EngineStateMountPath.
EngineStateEnvName string
// GameStateDirMode stores the unix permissions applied to the
// per-game state directory on creation.
GameStateDirMode uint32
// GameStateOwnerUID stores the unix uid applied to the per-game
// state directory on creation.
GameStateOwnerUID int
// GameStateOwnerGID stores the unix gid applied to the per-game
// state directory on creation.
GameStateOwnerGID int
// GameStateRoot is the host path under which per-game state
// directories are created.
GameStateRoot string
}
// Validate reports whether cfg stores usable container defaults.
func (cfg ContainerConfig) Validate() error {
switch {
case cfg.DefaultCPUQuota <= 0:
return fmt.Errorf("default cpu quota must be positive")
case strings.TrimSpace(cfg.DefaultMemory) == "":
return fmt.Errorf("default memory must not be empty")
case cfg.DefaultPIDsLimit <= 0:
return fmt.Errorf("default pids limit must be positive")
case cfg.StopTimeout <= 0:
return fmt.Errorf("container stop timeout must be positive")
case cfg.Retention <= 0:
return fmt.Errorf("container retention must be positive")
case strings.TrimSpace(cfg.EngineStateMountPath) == "":
return fmt.Errorf("engine state mount path must not be empty")
case strings.TrimSpace(cfg.EngineStateEnvName) == "":
return fmt.Errorf("engine state env name must not be empty")
case cfg.GameStateDirMode == 0:
return fmt.Errorf("game state dir mode must be non-zero")
case strings.TrimSpace(cfg.GameStateRoot) == "":
return fmt.Errorf("game state root must not be empty")
case !strings.HasPrefix(strings.TrimSpace(cfg.GameStateRoot), "/"):
return fmt.Errorf("game state root %q must be an absolute path", cfg.GameStateRoot)
default:
return nil
}
}
// HealthConfig configures the periodic health-monitoring workers
// (Docker events listener, periodic inspect, active probe).
type HealthConfig struct {
// InspectInterval is the period between two periodic Docker inspect
// passes.
InspectInterval time.Duration
// ProbeInterval is the period between two engine `/healthz` probe
// rounds.
ProbeInterval time.Duration
// ProbeTimeout bounds one engine `/healthz` request.
ProbeTimeout time.Duration
// ProbeFailuresThreshold is the consecutive-failure count that
// triggers a `probe_failed` event.
ProbeFailuresThreshold int
}
// Validate reports whether cfg stores usable health-monitoring settings.
func (cfg HealthConfig) Validate() error {
switch {
case cfg.InspectInterval <= 0:
return fmt.Errorf("inspect interval must be positive")
case cfg.ProbeInterval <= 0:
return fmt.Errorf("probe interval must be positive")
case cfg.ProbeTimeout <= 0:
return fmt.Errorf("probe timeout must be positive")
case cfg.ProbeFailuresThreshold <= 0:
return fmt.Errorf("probe failures threshold must be positive")
default:
return nil
}
}
// CleanupConfig configures the reconciler and container-cleanup workers.
type CleanupConfig struct {
// ReconcileInterval is the period between two reconciler passes.
ReconcileInterval time.Duration
// CleanupInterval is the period between two container-cleanup
// passes.
CleanupInterval time.Duration
}
// Validate reports whether cfg stores usable cleanup settings.
func (cfg CleanupConfig) Validate() error {
switch {
case cfg.ReconcileInterval <= 0:
return fmt.Errorf("reconcile interval must be positive")
case cfg.CleanupInterval <= 0:
return fmt.Errorf("cleanup interval must be positive")
default:
return nil
}
}
// CoordinationConfig configures the per-game Redis lease.
type CoordinationConfig struct {
// GameLeaseTTL bounds the per-game lease lifetime renewed every
// half-TTL while an operation runs.
GameLeaseTTL time.Duration
}
// Validate reports whether cfg stores a usable lease configuration.
func (cfg CoordinationConfig) Validate() error {
if cfg.GameLeaseTTL <= 0 {
return fmt.Errorf("game lease ttl must be positive")
}
return nil
}
// LobbyConfig configures the synchronous Lobby internal REST client.
type LobbyConfig struct {
// BaseURL stores the trusted Lobby internal listener base URL.
BaseURL string
// Timeout bounds one Lobby internal request.
Timeout time.Duration
}
// Validate reports whether cfg stores a usable Lobby client
// configuration.
func (cfg LobbyConfig) Validate() error {
switch {
case strings.TrimSpace(cfg.BaseURL) == "":
return fmt.Errorf("lobby internal base url must not be empty")
case !isHTTPURL(cfg.BaseURL):
return fmt.Errorf("lobby internal base url %q must be an absolute http(s) URL", cfg.BaseURL)
case cfg.Timeout <= 0:
return fmt.Errorf("lobby internal timeout must be positive")
default:
return nil
}
}
// TelemetryConfig configures the Runtime Manager OpenTelemetry runtime.
type TelemetryConfig struct {
// ServiceName overrides the default OpenTelemetry service name.
ServiceName string
// TracesExporter selects the external traces exporter. Supported
// values are `none` and `otlp`.
TracesExporter string
// MetricsExporter selects the external metrics exporter. Supported
// values are `none` and `otlp`.
MetricsExporter string
// TracesProtocol selects the OTLP traces protocol when
// TracesExporter is `otlp`.
TracesProtocol string
// MetricsProtocol selects the OTLP metrics protocol when
// MetricsExporter is `otlp`.
MetricsProtocol string
// StdoutTracesEnabled enables the additional stdout trace exporter
// used for local development and debugging.
StdoutTracesEnabled bool
// StdoutMetricsEnabled enables the additional stdout metric
// exporter used for local development and debugging.
StdoutMetricsEnabled bool
}
// Validate reports whether cfg contains a supported OpenTelemetry
// configuration.
func (cfg TelemetryConfig) Validate() error {
return telemetry.ProcessConfig{
ServiceName: cfg.ServiceName,
TracesExporter: cfg.TracesExporter,
MetricsExporter: cfg.MetricsExporter,
TracesProtocol: cfg.TracesProtocol,
MetricsProtocol: cfg.MetricsProtocol,
StdoutTracesEnabled: cfg.StdoutTracesEnabled,
StdoutMetricsEnabled: cfg.StdoutMetricsEnabled,
}.Validate()
}
// DefaultConfig returns the default Runtime Manager process configuration.
func DefaultConfig() Config {
return Config{
ShutdownTimeout: defaultShutdownTimeout,
Logging: LoggingConfig{
Level: defaultLogLevel,
},
InternalHTTP: InternalHTTPConfig{
Addr: defaultInternalHTTPAddr,
ReadHeaderTimeout: defaultReadHeaderTimeout,
ReadTimeout: defaultReadTimeout,
WriteTimeout: defaultWriteTimeout,
IdleTimeout: defaultIdleTimeout,
},
Docker: DockerConfig{
Host: defaultDockerHost,
Network: defaultDockerNetwork,
LogDriver: defaultDockerLogDriver,
PullPolicy: defaultImagePullPolicy,
},
Postgres: PostgresConfig{
Conn: postgres.DefaultConfig(),
},
Redis: RedisConfig{
Conn: redisconn.DefaultConfig(),
},
Streams: StreamsConfig{
StartJobs: defaultStartJobsStream,
StopJobs: defaultStopJobsStream,
JobResults: defaultJobResultsStream,
HealthEvents: defaultHealthEventsStream,
NotificationIntents: defaultNotificationIntentsKey,
BlockTimeout: defaultStreamBlockTimeout,
},
Container: ContainerConfig{
DefaultCPUQuota: defaultCPUQuota,
DefaultMemory: defaultMemory,
DefaultPIDsLimit: defaultPIDsLimit,
StopTimeout: defaultContainerStopTimeout,
Retention: defaultContainerRetention,
EngineStateMountPath: defaultEngineStateMountPath,
EngineStateEnvName: defaultEngineStateEnvName,
GameStateDirMode: defaultGameStateDirMode,
},
Health: HealthConfig{
InspectInterval: defaultInspectInterval,
ProbeInterval: defaultProbeInterval,
ProbeTimeout: defaultProbeTimeout,
ProbeFailuresThreshold: defaultProbeFailuresThreshold,
},
Cleanup: CleanupConfig{
ReconcileInterval: defaultReconcileInterval,
CleanupInterval: defaultCleanupInterval,
},
Coordination: CoordinationConfig{
GameLeaseTTL: defaultGameLeaseTTL,
},
Lobby: LobbyConfig{
Timeout: defaultLobbyInternalTimeout,
},
Telemetry: TelemetryConfig{
ServiceName: defaultOTelServiceName,
TracesExporter: "none",
MetricsExporter: "none",
},
}
}
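// Illustrative usage: start from DefaultConfig, fill the two fields that
// have no default, then validate. Whether Validate passes from here also
// depends on the embedded Postgres/Redis defaults; LoadFromEnv is the
// supported entry point.
func exampleConfig() (Config, error) {
	cfg := DefaultConfig()
	cfg.Container.GameStateRoot = "/var/lib/galaxy/games" // required, absolute
	cfg.Lobby.BaseURL = "http://lobby:8095"               // required, http(s)
	if err := cfg.Validate(); err != nil {
		return Config{}, err
	}
	return cfg, nil
}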
@@ -0,0 +1,142 @@
package config
import (
"strings"
"testing"
"time"
"github.com/stretchr/testify/require"
)
func validEnv(t *testing.T) {
t.Helper()
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy?search_path=rtmanager&sslmode=disable")
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "/var/lib/galaxy/games")
t.Setenv("RTMANAGER_LOBBY_INTERNAL_BASE_URL", "http://lobby:8095")
}
func TestLoadFromEnvAcceptsDefaults(t *testing.T) {
validEnv(t)
cfg, err := LoadFromEnv()
require.NoError(t, err)
require.Equal(t, ":8096", cfg.InternalHTTP.Addr)
require.Equal(t, "unix:///var/run/docker.sock", cfg.Docker.Host)
require.Equal(t, "galaxy-net", cfg.Docker.Network)
require.Equal(t, "json-file", cfg.Docker.LogDriver)
require.Equal(t, ImagePullPolicyIfMissing, cfg.Docker.PullPolicy)
require.Equal(t, "runtime:start_jobs", cfg.Streams.StartJobs)
require.Equal(t, "runtime:stop_jobs", cfg.Streams.StopJobs)
require.Equal(t, "runtime:job_results", cfg.Streams.JobResults)
require.Equal(t, "runtime:health_events", cfg.Streams.HealthEvents)
require.Equal(t, "notification:intents", cfg.Streams.NotificationIntents)
require.Equal(t, 30*time.Second, cfg.Container.StopTimeout)
require.Equal(t, 30*24*time.Hour, cfg.Container.Retention)
require.Equal(t, "/var/lib/galaxy-game", cfg.Container.EngineStateMountPath)
require.Equal(t, "GAME_STATE_PATH", cfg.Container.EngineStateEnvName)
require.EqualValues(t, 0o750, cfg.Container.GameStateDirMode)
require.Equal(t, 60*time.Second, cfg.Coordination.GameLeaseTTL)
require.Equal(t, "http://lobby:8095", cfg.Lobby.BaseURL)
require.Equal(t, 2*time.Second, cfg.Lobby.Timeout)
require.Equal(t, "galaxy-rtmanager", cfg.Telemetry.ServiceName)
}
func TestLoadFromEnvHonoursOverrides(t *testing.T) {
validEnv(t)
t.Setenv("RTMANAGER_INTERNAL_HTTP_ADDR", ":9000")
t.Setenv("RTMANAGER_DOCKER_NETWORK", "custom-net")
t.Setenv("RTMANAGER_IMAGE_PULL_POLICY", "always")
t.Setenv("RTMANAGER_REDIS_START_JOBS_STREAM", "custom:start_jobs")
t.Setenv("RTMANAGER_GAME_LEASE_TTL_SECONDS", "120")
t.Setenv("RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS", "45")
t.Setenv("RTMANAGER_CONTAINER_RETENTION_DAYS", "7")
t.Setenv("RTMANAGER_GAME_STATE_DIR_MODE", "0700")
cfg, err := LoadFromEnv()
require.NoError(t, err)
require.Equal(t, ":9000", cfg.InternalHTTP.Addr)
require.Equal(t, "custom-net", cfg.Docker.Network)
require.Equal(t, ImagePullPolicyAlways, cfg.Docker.PullPolicy)
require.Equal(t, "custom:start_jobs", cfg.Streams.StartJobs)
require.Equal(t, 120*time.Second, cfg.Coordination.GameLeaseTTL)
require.Equal(t, 45*time.Second, cfg.Container.StopTimeout)
require.Equal(t, 7*24*time.Hour, cfg.Container.Retention)
require.EqualValues(t, 0o700, cfg.Container.GameStateDirMode)
}
func TestLoadFromEnvRejectsUnknownPullPolicy(t *testing.T) {
validEnv(t)
t.Setenv("RTMANAGER_IMAGE_PULL_POLICY", "weekly")
_, err := LoadFromEnv()
require.Error(t, err)
require.Contains(t, err.Error(), "image pull policy")
}
func TestLoadFromEnvRequiresGameStateRoot(t *testing.T) {
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy")
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
t.Setenv("RTMANAGER_LOBBY_INTERNAL_BASE_URL", "http://lobby:8095")
_, err := LoadFromEnv()
require.Error(t, err)
require.Contains(t, err.Error(), "RTMANAGER_GAME_STATE_ROOT")
}
func TestLoadFromEnvRequiresLobbyBaseURL(t *testing.T) {
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy")
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "/var/lib/galaxy/games")
_, err := LoadFromEnv()
require.Error(t, err)
require.Contains(t, err.Error(), "RTMANAGER_LOBBY_INTERNAL_BASE_URL")
}
func TestLoadFromEnvRejectsRelativeStateRoot(t *testing.T) {
validEnv(t)
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "relative/path")
_, err := LoadFromEnv()
require.Error(t, err)
require.Contains(t, err.Error(), "absolute path")
}
func TestLoadFromEnvRejectsBadLogLevel(t *testing.T) {
validEnv(t)
t.Setenv("RTMANAGER_LOG_LEVEL", "verbose")
_, err := LoadFromEnv()
require.Error(t, err)
require.Contains(t, err.Error(), "RTMANAGER_LOG_LEVEL")
}
func TestImagePullPolicyValidate(t *testing.T) {
require.NoError(t, ImagePullPolicyIfMissing.Validate())
require.NoError(t, ImagePullPolicyAlways.Validate())
require.NoError(t, ImagePullPolicyNever.Validate())
require.Error(t, ImagePullPolicy("monthly").Validate())
}
func TestInternalHTTPValidateRejectsBadAddr(t *testing.T) {
cfg := DefaultConfig().InternalHTTP
cfg.Addr = "not-an-addr"
err := cfg.Validate()
require.Error(t, err)
require.Contains(t, err.Error(), "host:port")
}
func TestStreamsValidateRequiresAllNames(t *testing.T) {
cfg := DefaultConfig().Streams
cfg.StartJobs = " "
err := cfg.Validate()
require.Error(t, err)
require.Contains(t, err.Error(), "start jobs")
}
@@ -0,0 +1,319 @@
package config
import (
"fmt"
"os"
"strconv"
"strings"
"time"
"galaxy/postgres"
"galaxy/redisconn"
)
// LoadFromEnv builds Config from environment variables and validates the
// resulting configuration.
func LoadFromEnv() (Config, error) {
cfg := DefaultConfig()
var err error
cfg.ShutdownTimeout, err = durationEnv(shutdownTimeoutEnvVar, cfg.ShutdownTimeout)
if err != nil {
return Config{}, err
}
cfg.Logging.Level = stringEnv(logLevelEnvVar, cfg.Logging.Level)
cfg.InternalHTTP.Addr = stringEnv(internalHTTPAddrEnvVar, cfg.InternalHTTP.Addr)
cfg.InternalHTTP.ReadHeaderTimeout, err = durationEnv(internalHTTPReadHeaderTimeoutEnvVar, cfg.InternalHTTP.ReadHeaderTimeout)
if err != nil {
return Config{}, err
}
cfg.InternalHTTP.ReadTimeout, err = durationEnv(internalHTTPReadTimeoutEnvVar, cfg.InternalHTTP.ReadTimeout)
if err != nil {
return Config{}, err
}
cfg.InternalHTTP.WriteTimeout, err = durationEnv(internalHTTPWriteTimeoutEnvVar, cfg.InternalHTTP.WriteTimeout)
if err != nil {
return Config{}, err
}
cfg.InternalHTTP.IdleTimeout, err = durationEnv(internalHTTPIdleTimeoutEnvVar, cfg.InternalHTTP.IdleTimeout)
if err != nil {
return Config{}, err
}
cfg.Docker.Host = stringEnv(dockerHostEnvVar, cfg.Docker.Host)
cfg.Docker.APIVersion = stringEnv(dockerAPIVersionEnvVar, cfg.Docker.APIVersion)
cfg.Docker.Network = stringEnv(dockerNetworkEnvVar, cfg.Docker.Network)
cfg.Docker.LogDriver = stringEnv(dockerLogDriverEnvVar, cfg.Docker.LogDriver)
cfg.Docker.LogOpts = stringEnv(dockerLogOptsEnvVar, cfg.Docker.LogOpts)
if raw, ok := os.LookupEnv(imagePullPolicyEnvVar); ok {
cfg.Docker.PullPolicy = ImagePullPolicy(strings.TrimSpace(raw))
}
pgConn, err := postgres.LoadFromEnv(envPrefix)
if err != nil {
return Config{}, err
}
cfg.Postgres.Conn = pgConn
redisConn, err := redisconn.LoadFromEnv(envPrefix)
if err != nil {
return Config{}, err
}
cfg.Redis.Conn = redisConn
cfg.Streams.StartJobs = stringEnv(startJobsStreamEnvVar, cfg.Streams.StartJobs)
cfg.Streams.StopJobs = stringEnv(stopJobsStreamEnvVar, cfg.Streams.StopJobs)
cfg.Streams.JobResults = stringEnv(jobResultsStreamEnvVar, cfg.Streams.JobResults)
cfg.Streams.HealthEvents = stringEnv(healthEventsStreamEnvVar, cfg.Streams.HealthEvents)
cfg.Streams.NotificationIntents = stringEnv(notificationIntentsStreamEnv, cfg.Streams.NotificationIntents)
cfg.Streams.BlockTimeout, err = durationEnv(streamBlockTimeoutEnvVar, cfg.Streams.BlockTimeout)
if err != nil {
return Config{}, err
}
cfg.Container.DefaultCPUQuota, err = floatEnv(defaultCPUQuotaEnvVar, cfg.Container.DefaultCPUQuota)
if err != nil {
return Config{}, err
}
cfg.Container.DefaultMemory = stringEnv(defaultMemoryEnvVar, cfg.Container.DefaultMemory)
cfg.Container.DefaultPIDsLimit, err = intEnv(defaultPIDsLimitEnvVar, cfg.Container.DefaultPIDsLimit)
if err != nil {
return Config{}, err
}
cfg.Container.StopTimeout, err = secondsEnv(containerStopTimeoutSecondsEnvVar, cfg.Container.StopTimeout)
if err != nil {
return Config{}, err
}
cfg.Container.Retention, err = daysEnv(containerRetentionDaysEnvVar, cfg.Container.Retention)
if err != nil {
return Config{}, err
}
cfg.Container.EngineStateMountPath = stringEnv(engineStateMountPathEnvVar, cfg.Container.EngineStateMountPath)
cfg.Container.EngineStateEnvName = stringEnv(engineStateEnvNameEnvVar, cfg.Container.EngineStateEnvName)
cfg.Container.GameStateDirMode, err = octalUint32Env(gameStateDirModeEnvVar, cfg.Container.GameStateDirMode)
if err != nil {
return Config{}, err
}
cfg.Container.GameStateOwnerUID, err = intEnv(gameStateOwnerUIDEnvVar, cfg.Container.GameStateOwnerUID)
if err != nil {
return Config{}, err
}
cfg.Container.GameStateOwnerGID, err = intEnv(gameStateOwnerGIDEnvVar, cfg.Container.GameStateOwnerGID)
if err != nil {
return Config{}, err
}
root, ok := os.LookupEnv(gameStateRootEnvVar)
if !ok || strings.TrimSpace(root) == "" {
return Config{}, fmt.Errorf("%s must be set", gameStateRootEnvVar)
}
cfg.Container.GameStateRoot = strings.TrimSpace(root)
cfg.Health.InspectInterval, err = durationEnv(inspectIntervalEnvVar, cfg.Health.InspectInterval)
if err != nil {
return Config{}, err
}
cfg.Health.ProbeInterval, err = durationEnv(probeIntervalEnvVar, cfg.Health.ProbeInterval)
if err != nil {
return Config{}, err
}
cfg.Health.ProbeTimeout, err = durationEnv(probeTimeoutEnvVar, cfg.Health.ProbeTimeout)
if err != nil {
return Config{}, err
}
cfg.Health.ProbeFailuresThreshold, err = intEnv(probeFailuresThresholdEnvVar, cfg.Health.ProbeFailuresThreshold)
if err != nil {
return Config{}, err
}
cfg.Cleanup.ReconcileInterval, err = durationEnv(reconcileIntervalEnvVar, cfg.Cleanup.ReconcileInterval)
if err != nil {
return Config{}, err
}
cfg.Cleanup.CleanupInterval, err = durationEnv(cleanupIntervalEnvVar, cfg.Cleanup.CleanupInterval)
if err != nil {
return Config{}, err
}
cfg.Coordination.GameLeaseTTL, err = secondsEnv(gameLeaseTTLSecondsEnvVar, cfg.Coordination.GameLeaseTTL)
if err != nil {
return Config{}, err
}
lobbyURL, ok := os.LookupEnv(lobbyInternalBaseURLEnvVar)
if !ok || strings.TrimSpace(lobbyURL) == "" {
return Config{}, fmt.Errorf("%s must be set", lobbyInternalBaseURLEnvVar)
}
cfg.Lobby.BaseURL = strings.TrimSpace(lobbyURL)
cfg.Lobby.Timeout, err = durationEnv(lobbyInternalTimeoutEnvVar, cfg.Lobby.Timeout)
if err != nil {
return Config{}, err
}
cfg.Telemetry.ServiceName = stringEnv(otelServiceNameEnvVar, cfg.Telemetry.ServiceName)
cfg.Telemetry.TracesExporter = normalizeExporterValue(stringEnv(otelTracesExporterEnvVar, cfg.Telemetry.TracesExporter))
cfg.Telemetry.MetricsExporter = normalizeExporterValue(stringEnv(otelMetricsExporterEnvVar, cfg.Telemetry.MetricsExporter))
cfg.Telemetry.TracesProtocol = normalizeProtocolValue(
os.Getenv(otelExporterOTLPTracesProtocolEnvVar),
os.Getenv(otelExporterOTLPProtocolEnvVar),
cfg.Telemetry.TracesProtocol,
)
cfg.Telemetry.MetricsProtocol = normalizeProtocolValue(
os.Getenv(otelExporterOTLPMetricsProtocolEnvVar),
os.Getenv(otelExporterOTLPProtocolEnvVar),
cfg.Telemetry.MetricsProtocol,
)
cfg.Telemetry.StdoutTracesEnabled, err = boolEnv(otelStdoutTracesEnabledEnvVar, cfg.Telemetry.StdoutTracesEnabled)
if err != nil {
return Config{}, err
}
cfg.Telemetry.StdoutMetricsEnabled, err = boolEnv(otelStdoutMetricsEnabledEnvVar, cfg.Telemetry.StdoutMetricsEnabled)
if err != nil {
return Config{}, err
}
if err := cfg.Validate(); err != nil {
return Config{}, err
}
return cfg, nil
}
func stringEnv(name string, fallback string) string {
value, ok := os.LookupEnv(name)
if !ok {
return fallback
}
return strings.TrimSpace(value)
}
func durationEnv(name string, fallback time.Duration) (time.Duration, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := time.ParseDuration(strings.TrimSpace(value))
if err != nil {
return 0, fmt.Errorf("%s: parse duration: %w", name, err)
}
return parsed, nil
}
func secondsEnv(name string, fallback time.Duration) (time.Duration, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.Atoi(strings.TrimSpace(value))
if err != nil {
return 0, fmt.Errorf("%s: parse seconds: %w", name, err)
}
if parsed <= 0 {
return 0, fmt.Errorf("%s: must be positive", name)
}
return time.Duration(parsed) * time.Second, nil
}
func daysEnv(name string, fallback time.Duration) (time.Duration, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.Atoi(strings.TrimSpace(value))
if err != nil {
return 0, fmt.Errorf("%s: parse days: %w", name, err)
}
if parsed <= 0 {
return 0, fmt.Errorf("%s: must be positive", name)
}
return time.Duration(parsed) * 24 * time.Hour, nil
}
func intEnv(name string, fallback int) (int, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.Atoi(strings.TrimSpace(value))
if err != nil {
return 0, fmt.Errorf("%s: parse int: %w", name, err)
}
return parsed, nil
}
func floatEnv(name string, fallback float64) (float64, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
if err != nil {
return 0, fmt.Errorf("%s: parse float: %w", name, err)
}
return parsed, nil
}
func boolEnv(name string, fallback bool) (bool, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.ParseBool(strings.TrimSpace(value))
if err != nil {
return false, fmt.Errorf("%s: parse bool: %w", name, err)
}
return parsed, nil
}
func octalUint32Env(name string, fallback uint32) (uint32, error) {
value, ok := os.LookupEnv(name)
if !ok {
return fallback, nil
}
parsed, err := strconv.ParseUint(strings.TrimSpace(value), 8, 32)
if err != nil {
return 0, fmt.Errorf("%s: parse octal: %w", name, err)
}
return uint32(parsed), nil
}
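// Illustrative: octalUint32Env parses in base 8, so "0700" and "700"
// both decode to 0o700 (448 decimal), matching the dir-mode override
// assertion in the tests above.
//
//	mode, _ := strconv.ParseUint("0700", 8, 32) // mode == 0o700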
func normalizeExporterValue(value string) string {
trimmed := strings.TrimSpace(value)
switch trimmed {
case "", "none":
return "none"
default:
return trimmed
}
}
func normalizeProtocolValue(primary string, fallback string, defaultValue string) string {
primary = strings.TrimSpace(primary)
if primary != "" {
return primary
}
fallback = strings.TrimSpace(fallback)
if fallback != "" {
return fallback
}
return strings.TrimSpace(defaultValue)
}
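// Illustrative precedence of normalizeProtocolValue: the signal-specific
// variable wins, then the generic OTEL_EXPORTER_OTLP_PROTOCOL, then the
// configured default.
//
//	normalizeProtocolValue("grpc", "http/protobuf", "")  // "grpc"
//	normalizeProtocolValue("", "http/protobuf", "grpc")  // "http/protobuf"
//	normalizeProtocolValue("", "", "grpc")               // "grpc"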
@@ -0,0 +1,93 @@
package config
import (
"fmt"
"log/slog"
"net"
"net/url"
"strings"
)
// Validate reports whether cfg stores a usable Runtime Manager process
// configuration.
func (cfg Config) Validate() error {
if cfg.ShutdownTimeout <= 0 {
return fmt.Errorf("%s must be positive", shutdownTimeoutEnvVar)
}
if err := validateSlogLevel(cfg.Logging.Level); err != nil {
return fmt.Errorf("%s: %w", logLevelEnvVar, err)
}
if err := cfg.InternalHTTP.Validate(); err != nil {
return err
}
if err := cfg.Docker.Validate(); err != nil {
return err
}
if err := cfg.Postgres.Validate(); err != nil {
return err
}
if err := cfg.Redis.Validate(); err != nil {
return err
}
if err := cfg.Streams.Validate(); err != nil {
return err
}
if err := cfg.Container.Validate(); err != nil {
return err
}
if err := cfg.Health.Validate(); err != nil {
return err
}
if err := cfg.Cleanup.Validate(); err != nil {
return err
}
if err := cfg.Coordination.Validate(); err != nil {
return err
}
if err := cfg.Lobby.Validate(); err != nil {
return err
}
if err := cfg.Telemetry.Validate(); err != nil {
return err
}
return nil
}
func validateSlogLevel(level string) error {
var slogLevel slog.Level
if err := slogLevel.UnmarshalText([]byte(strings.TrimSpace(level))); err != nil {
return fmt.Errorf("invalid slog level %q: %w", level, err)
}
return nil
}
func isTCPAddr(value string) bool {
host, port, err := net.SplitHostPort(strings.TrimSpace(value))
if err != nil {
return false
}
if port == "" {
return false
}
if host == "" {
return true
}
return !strings.Contains(host, " ")
}
func isHTTPURL(value string) bool {
parsed, err := url.Parse(strings.TrimSpace(value))
if err != nil {
return false
}
if parsed.Scheme != "http" && parsed.Scheme != "https" {
return false
}
return parsed.Host != ""
}
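// Illustrative behaviour of the two helpers above:
//
//	isTCPAddr(":8096")             // true: empty host, explicit port
//	isTCPAddr("localhost:6379")    // true
//	isTCPAddr("not-an-addr")       // false: no host:port split
//	isHTTPURL("http://lobby:8095") // true
//	isHTTPURL("lobby:8095")        // false: scheme is not http(s)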
@@ -0,0 +1,231 @@
// Package health defines the technical-health domain types owned by
// Runtime Manager.
//
// EventType matches the `event_type` enum frozen in
// `galaxy/rtmanager/api/runtime-health-asyncapi.yaml`. SnapshotStatus
// matches the SQL CHECK on `health_snapshots.status` and is intentionally
// narrower than EventType (the snapshot table collapses
// `container_started → healthy` and drops `probe_recovered` per
// `galaxy/rtmanager/README.md §Health Monitoring`).
package health
import (
"encoding/json"
"fmt"
"strings"
"time"
)
// EventType identifies one entry on the `runtime:health_events` Redis
// Stream. Used by the health-event publishers and consumers.
type EventType string
const (
// EventTypeContainerStarted reports a successful container start.
EventTypeContainerStarted EventType = "container_started"
// EventTypeContainerExited reports a non-zero Docker `die` event.
EventTypeContainerExited EventType = "container_exited"
// EventTypeContainerOOM reports a Docker `oom` event.
EventTypeContainerOOM EventType = "container_oom"
// EventTypeContainerDisappeared reports that the listener observed
// a `destroy` event for a record Runtime Manager did not initiate.
EventTypeContainerDisappeared EventType = "container_disappeared"
// EventTypeInspectUnhealthy reports an unexpected outcome of the
// periodic Docker inspect (RestartCount growth, unexpected status,
// declared HEALTHCHECK reporting unhealthy).
EventTypeInspectUnhealthy EventType = "inspect_unhealthy"
// EventTypeProbeFailed reports that the active HTTP probe crossed
// the configured failure threshold.
EventTypeProbeFailed EventType = "probe_failed"
// EventTypeProbeRecovered reports the first probe success after a
// `probe_failed` event was published.
EventTypeProbeRecovered EventType = "probe_recovered"
)
// IsKnown reports whether eventType belongs to the frozen event-type
// vocabulary.
func (eventType EventType) IsKnown() bool {
switch eventType {
case EventTypeContainerStarted,
EventTypeContainerExited,
EventTypeContainerOOM,
EventTypeContainerDisappeared,
EventTypeInspectUnhealthy,
EventTypeProbeFailed,
EventTypeProbeRecovered:
return true
default:
return false
}
}
// AllEventTypes returns the frozen list of every event-type value.
func AllEventTypes() []EventType {
return []EventType{
EventTypeContainerStarted,
EventTypeContainerExited,
EventTypeContainerOOM,
EventTypeContainerDisappeared,
EventTypeInspectUnhealthy,
EventTypeProbeFailed,
EventTypeProbeRecovered,
}
}
// SnapshotStatus identifies one latest-observation status value stored
// in the `health_snapshots.status` column. It is deliberately narrower
// than EventType: the table collapses `container_started → healthy` and
// never persists `probe_recovered`. Recovery travels only as a
// `runtime:health_events` entry; the snapshot simply records `healthy`
// on the next observation (see the illustrative mapping after
// AllSnapshotStatuses).
type SnapshotStatus string
const (
// SnapshotStatusHealthy reports that the most recent observation
// found the container live and the engine probe responsive.
SnapshotStatusHealthy SnapshotStatus = "healthy"
// SnapshotStatusProbeFailed reports that the active probe crossed
// the failure threshold.
SnapshotStatusProbeFailed SnapshotStatus = "probe_failed"
// SnapshotStatusExited reports that the container exited.
SnapshotStatusExited SnapshotStatus = "exited"
// SnapshotStatusOOM reports that the container was killed by the
// OOM killer.
SnapshotStatusOOM SnapshotStatus = "oom"
// SnapshotStatusInspectUnhealthy reports that the periodic inspect
// observed an unexpected state.
SnapshotStatusInspectUnhealthy SnapshotStatus = "inspect_unhealthy"
// SnapshotStatusContainerDisappeared reports that Docker no longer
// reports the container.
SnapshotStatusContainerDisappeared SnapshotStatus = "container_disappeared"
)
// IsKnown reports whether status belongs to the frozen snapshot-status
// vocabulary.
func (status SnapshotStatus) IsKnown() bool {
switch status {
case SnapshotStatusHealthy,
SnapshotStatusProbeFailed,
SnapshotStatusExited,
SnapshotStatusOOM,
SnapshotStatusInspectUnhealthy,
SnapshotStatusContainerDisappeared:
return true
default:
return false
}
}
// AllSnapshotStatuses returns the frozen list of every snapshot-status
// value.
func AllSnapshotStatuses() []SnapshotStatus {
return []SnapshotStatus{
SnapshotStatusHealthy,
SnapshotStatusProbeFailed,
SnapshotStatusExited,
SnapshotStatusOOM,
SnapshotStatusInspectUnhealthy,
SnapshotStatusContainerDisappeared,
}
}
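// Illustrative sketch of the event→snapshot collapse described on
// SnapshotStatus above; snapshotStatusFor is a hypothetical helper, not
// part of this package's declared API.
func snapshotStatusFor(eventType EventType) (SnapshotStatus, bool) {
	switch eventType {
	case EventTypeContainerStarted:
		return SnapshotStatusHealthy, true // collapsed: container_started → healthy
	case EventTypeContainerExited:
		return SnapshotStatusExited, true
	case EventTypeContainerOOM:
		return SnapshotStatusOOM, true
	case EventTypeContainerDisappeared:
		return SnapshotStatusContainerDisappeared, true
	case EventTypeInspectUnhealthy:
		return SnapshotStatusInspectUnhealthy, true
	case EventTypeProbeFailed:
		return SnapshotStatusProbeFailed, true
	case EventTypeProbeRecovered:
		return "", false // dropped: never persisted as a snapshot status
	default:
		return "", false
	}
}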
// SnapshotSource identifies the observation source that produced one
// snapshot. Matches the SQL CHECK on `health_snapshots.source`.
type SnapshotSource string
const (
// SnapshotSourceDockerEvent reports that the latest observation
// arrived through the Docker events listener.
SnapshotSourceDockerEvent SnapshotSource = "docker_event"
// SnapshotSourceInspect reports that the latest observation arrived
// through the periodic Docker inspect worker.
SnapshotSourceInspect SnapshotSource = "inspect"
// SnapshotSourceProbe reports that the latest observation arrived
// through the active HTTP probe.
SnapshotSourceProbe SnapshotSource = "probe"
)
// IsKnown reports whether source belongs to the frozen snapshot-source
// vocabulary.
func (source SnapshotSource) IsKnown() bool {
switch source {
case SnapshotSourceDockerEvent,
SnapshotSourceInspect,
SnapshotSourceProbe:
return true
default:
return false
}
}
// AllSnapshotSources returns the frozen list of every snapshot-source
// value.
func AllSnapshotSources() []SnapshotSource {
return []SnapshotSource{
SnapshotSourceDockerEvent,
SnapshotSourceInspect,
SnapshotSourceProbe,
}
}
// HealthSnapshot stores the latest technical-health observation for one
// game. One row per game_id; later observations overwrite.
type HealthSnapshot struct {
// GameID identifies the platform game.
GameID string
// ContainerID stores the Docker container id observed by the
// snapshot source. Empty when the source could not associate a
// container (e.g., reconciler dispose for a record whose container
// is already gone).
ContainerID string
// Status stores the latest observed snapshot status.
Status SnapshotStatus
// Source stores the observation source that produced this entry.
Source SnapshotSource
// Details stores the source-specific JSON detail payload. Adapters
// store and retrieve it verbatim. Empty / nil values are persisted
// as the SQL default `{}`.
Details json.RawMessage
// ObservedAt stores the wall-clock at which the source captured the
// observation.
ObservedAt time.Time
}
// Validate reports whether snapshot satisfies the snapshot invariants
// implied by the SQL CHECK constraints.
func (snapshot HealthSnapshot) Validate() error {
if strings.TrimSpace(snapshot.GameID) == "" {
return fmt.Errorf("game id must not be empty")
}
if !snapshot.Status.IsKnown() {
return fmt.Errorf("status %q is unsupported", snapshot.Status)
}
if !snapshot.Source.IsKnown() {
return fmt.Errorf("source %q is unsupported", snapshot.Source)
}
if snapshot.ObservedAt.IsZero() {
return fmt.Errorf("observed at must not be zero")
}
if len(snapshot.Details) > 0 && !json.Valid(snapshot.Details) {
return fmt.Errorf("details must be valid JSON when non-empty")
}
return nil
}
@@ -0,0 +1,133 @@
package health
import (
"encoding/json"
"testing"
"time"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
func TestEventTypeIsKnown(t *testing.T) {
for _, eventType := range AllEventTypes() {
assert.Truef(t, eventType.IsKnown(), "expected %q known", eventType)
}
assert.False(t, EventType("").IsKnown())
assert.False(t, EventType("paused").IsKnown())
}
func TestAllEventTypesCoverFrozenSet(t *testing.T) {
assert.ElementsMatch(t,
[]EventType{
EventTypeContainerStarted,
EventTypeContainerExited,
EventTypeContainerOOM,
EventTypeContainerDisappeared,
EventTypeInspectUnhealthy,
EventTypeProbeFailed,
EventTypeProbeRecovered,
},
AllEventTypes(),
)
}
func TestSnapshotStatusIsKnown(t *testing.T) {
for _, status := range AllSnapshotStatuses() {
assert.Truef(t, status.IsKnown(), "expected %q known", status)
}
assert.False(t, SnapshotStatus("").IsKnown())
assert.False(t, SnapshotStatus("starting").IsKnown())
assert.False(t, SnapshotStatus("probe_recovered").IsKnown(),
"snapshot status must not include event-only values")
assert.False(t, SnapshotStatus("container_started").IsKnown(),
"snapshot status must not include event-only values")
}
func TestAllSnapshotStatusesCoverFrozenSet(t *testing.T) {
assert.ElementsMatch(t,
[]SnapshotStatus{
SnapshotStatusHealthy,
SnapshotStatusProbeFailed,
SnapshotStatusExited,
SnapshotStatusOOM,
SnapshotStatusInspectUnhealthy,
SnapshotStatusContainerDisappeared,
},
AllSnapshotStatuses(),
)
}
func TestSnapshotSourceIsKnown(t *testing.T) {
for _, source := range AllSnapshotSources() {
assert.Truef(t, source.IsKnown(), "expected %q known", source)
}
assert.False(t, SnapshotSource("").IsKnown())
assert.False(t, SnapshotSource("manual").IsKnown())
}
func TestAllSnapshotSourcesCoverFrozenSet(t *testing.T) {
assert.ElementsMatch(t,
[]SnapshotSource{
SnapshotSourceDockerEvent,
SnapshotSourceInspect,
SnapshotSourceProbe,
},
AllSnapshotSources(),
)
}
func sampleSnapshot() HealthSnapshot {
return HealthSnapshot{
GameID: "game-test",
ContainerID: "container-1",
Status: SnapshotStatusHealthy,
Source: SnapshotSourceProbe,
Details: json.RawMessage(`{"prior_failure_count":0}`),
ObservedAt: time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC),
}
}
func TestHealthSnapshotValidateHappy(t *testing.T) {
require.NoError(t, sampleSnapshot().Validate())
}
func TestHealthSnapshotValidateAcceptsEmptyDetails(t *testing.T) {
snapshot := sampleSnapshot()
snapshot.Details = nil
assert.NoError(t, snapshot.Validate())
}
func TestHealthSnapshotValidateAcceptsEmptyContainerID(t *testing.T) {
snapshot := sampleSnapshot()
snapshot.ContainerID = ""
assert.NoError(t, snapshot.Validate())
}
func TestHealthSnapshotValidateRejects(t *testing.T) {
tests := []struct {
name string
mutate func(*HealthSnapshot)
}{
{"empty game id", func(s *HealthSnapshot) { s.GameID = "" }},
{"unknown status", func(s *HealthSnapshot) { s.Status = "exotic" }},
{"unknown source", func(s *HealthSnapshot) { s.Source = "exotic" }},
{"zero observed at", func(s *HealthSnapshot) { s.ObservedAt = time.Time{} }},
{"invalid details json", func(s *HealthSnapshot) {
s.Details = json.RawMessage("not-json")
}},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
snapshot := sampleSnapshot()
tt.mutate(&snapshot)
assert.Error(t, snapshot.Validate())
})
}
}
@@ -0,0 +1,245 @@
// Package operation defines the runtime-operation audit-log domain types
// owned by Runtime Manager.
//
// One OperationEntry maps to one row of the `operation_log` PostgreSQL
// table (see
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`).
// The OpKind / OpSource / Outcome enums match the SQL CHECK constraints
// verbatim and feed the telemetry counters declared in
// `galaxy/rtmanager/README.md §Observability`.
package operation
import (
"fmt"
"strings"
"time"
)
// OpKind identifies the kind of operation Runtime Manager performed.
type OpKind string
const (
// OpKindStart records a start lifecycle operation.
OpKindStart OpKind = "start"
// OpKindStop records a stop lifecycle operation.
OpKindStop OpKind = "stop"
// OpKindRestart records a restart lifecycle operation
// (recreate with the same image_ref).
OpKindRestart OpKind = "restart"
// OpKindPatch records a semver-patch lifecycle operation
// (recreate with a new image_ref).
OpKindPatch OpKind = "patch"
// OpKindCleanupContainer records a container removal performed by
// the cleanup TTL worker or the admin DELETE endpoint.
OpKindCleanupContainer OpKind = "cleanup_container"
// OpKindReconcileAdopt records that the reconciler discovered an
// unrecorded container labelled `com.galaxy.owner=rtmanager` and
// inserted a runtime record for it.
OpKindReconcileAdopt OpKind = "reconcile_adopt"
// OpKindReconcileDispose records that the reconciler observed a
// running record whose container is missing in Docker and marked it
// as removed.
OpKindReconcileDispose OpKind = "reconcile_dispose"
)
// IsKnown reports whether kind belongs to the frozen op-kind vocabulary.
func (kind OpKind) IsKnown() bool {
switch kind {
case OpKindStart,
OpKindStop,
OpKindRestart,
OpKindPatch,
OpKindCleanupContainer,
OpKindReconcileAdopt,
OpKindReconcileDispose:
return true
default:
return false
}
}
// AllOpKinds returns the frozen list of every op-kind value. The slice
// order is stable across calls.
func AllOpKinds() []OpKind {
return []OpKind{
OpKindStart,
OpKindStop,
OpKindRestart,
OpKindPatch,
OpKindCleanupContainer,
OpKindReconcileAdopt,
OpKindReconcileDispose,
}
}
// OpSource identifies where one operation entered Runtime Manager.
type OpSource string
const (
// OpSourceLobbyStream identifies entries triggered by the
// `runtime:start_jobs` or `runtime:stop_jobs` Redis Stream consumer.
OpSourceLobbyStream OpSource = "lobby_stream"
// OpSourceGMRest identifies entries triggered by Game Master through
// the internal REST surface.
OpSourceGMRest OpSource = "gm_rest"
// OpSourceAdminRest identifies entries triggered by Admin Service
// through the internal REST surface.
OpSourceAdminRest OpSource = "admin_rest"
// OpSourceAutoTTL identifies entries triggered by the periodic
// container-cleanup worker.
OpSourceAutoTTL OpSource = "auto_ttl"
// OpSourceAutoReconcile identifies entries triggered by the
// reconciler at startup or on its periodic interval.
OpSourceAutoReconcile OpSource = "auto_reconcile"
)
// IsKnown reports whether source belongs to the frozen op-source
// vocabulary.
func (source OpSource) IsKnown() bool {
switch source {
case OpSourceLobbyStream,
OpSourceGMRest,
OpSourceAdminRest,
OpSourceAutoTTL,
OpSourceAutoReconcile:
return true
default:
return false
}
}
// AllOpSources returns the frozen list of every op-source value. The
// slice order is stable across calls.
func AllOpSources() []OpSource {
return []OpSource{
OpSourceLobbyStream,
OpSourceGMRest,
OpSourceAdminRest,
OpSourceAutoTTL,
OpSourceAutoReconcile,
}
}
// Outcome reports the high-level outcome of one operation.
type Outcome string
const (
// OutcomeSuccess reports that the operation completed without
// surfacing an error.
OutcomeSuccess Outcome = "success"
// OutcomeFailure reports that the operation surfaced a stable error
// code recorded in OperationEntry.ErrorCode.
OutcomeFailure Outcome = "failure"
)
// IsKnown reports whether outcome belongs to the frozen outcome
// vocabulary.
func (outcome Outcome) IsKnown() bool {
switch outcome {
case OutcomeSuccess, OutcomeFailure:
return true
default:
return false
}
}
// AllOutcomes returns the frozen list of every outcome value.
func AllOutcomes() []Outcome {
return []Outcome{OutcomeSuccess, OutcomeFailure}
}
// OperationEntry stores one append-only audit row of the `operation_log`
// table. ID is zero on records that have not been persisted yet; the
// store assigns it from the table's bigserial column. FinishedAt is a
// pointer because the column is nullable for in-flight rows, even though
// the lifecycle services normally finalise the row in the same
// transaction.
type OperationEntry struct {
// ID identifies the persisted row. Zero before persistence.
ID int64
// GameID identifies the platform game this operation acted on.
GameID string
// OpKind classifies what the operation did.
OpKind OpKind
// OpSource classifies how the operation entered Runtime Manager.
OpSource OpSource
// SourceRef stores an opaque per-source reference such as a Redis
// Stream entry id, a REST request id, or an admin user id. Empty
// when the source does not provide one.
SourceRef string
// ImageRef stores the engine image reference associated with the
// operation, when applicable. Empty for operations that do not
// touch an image (e.g., cleanup_container).
ImageRef string
// ContainerID stores the Docker container id observed at the time
// of the operation, when applicable.
ContainerID string
// Outcome reports whether the operation succeeded or failed.
Outcome Outcome
// ErrorCode stores the stable error code on failure. Empty on
// success.
ErrorCode string
// ErrorMessage stores the operator-readable detail on failure.
// Empty on success.
ErrorMessage string
// StartedAt stores the wall-clock at which the operation began.
StartedAt time.Time
// FinishedAt stores the wall-clock at which the operation
// finalised. Nil for in-flight rows.
FinishedAt *time.Time
}
// Validate reports whether entry satisfies the operation-log invariants
// implied by the SQL CHECK constraints and the README §Persistence
// Layout.
func (entry OperationEntry) Validate() error {
if strings.TrimSpace(entry.GameID) == "" {
return fmt.Errorf("game id must not be empty")
}
if !entry.OpKind.IsKnown() {
return fmt.Errorf("op kind %q is unsupported", entry.OpKind)
}
if !entry.OpSource.IsKnown() {
return fmt.Errorf("op source %q is unsupported", entry.OpSource)
}
if !entry.Outcome.IsKnown() {
return fmt.Errorf("outcome %q is unsupported", entry.Outcome)
}
if entry.StartedAt.IsZero() {
return fmt.Errorf("started at must not be zero")
}
if entry.FinishedAt != nil {
if entry.FinishedAt.IsZero() {
return fmt.Errorf("finished at must not be zero when present")
}
if entry.FinishedAt.Before(entry.StartedAt) {
return fmt.Errorf("finished at must not be before started at")
}
}
if entry.Outcome == OutcomeFailure && strings.TrimSpace(entry.ErrorCode) == "" {
return fmt.Errorf("error code must not be empty for failure entries")
}
return nil
}
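// Illustrative: a minimal failure entry that passes Validate. The game
// id and error code values are made up for the example.
func exampleFailureEntry(now time.Time) OperationEntry {
	finished := now.Add(2 * time.Second)
	return OperationEntry{
		GameID:     "game-42",
		OpKind:     OpKindStart,
		OpSource:   OpSourceLobbyStream,
		Outcome:    OutcomeFailure,
		ErrorCode:  "image_pull_failed", // failure entries must carry a code
		StartedAt:  now,
		FinishedAt: &finished,
	}
}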

Some files were not shown because too many files have changed in this diff.