feat: runtime manager
@@ -0,0 +1,28 @@
# Makefile for galaxy/rtmanager.
#
# The `jet` target regenerates the go-jet/v2 query-builder code under
# internal/adapters/postgres/jet/ against a transient PostgreSQL container
# brought up by cmd/jetgen. Generated code is committed.
#
# The `mocks` target regenerates the gomock-driven mocks via the
# //go:generate directives that live next to the interfaces they cover:
# - internal/ports/ — port interfaces (Stage 12)
# - internal/api/internalhttp/handlers/ — REST handler service ports (Stage 16)
# Generated code is committed.
#
# The `integration` target runs the service-local end-to-end suite
# under integration/. It requires a reachable Docker daemon
# (`/var/run/docker.sock` or `DOCKER_HOST`); without one the helpers
# in integration/harness call t.Skip and the tests are no-ops.

.PHONY: jet mocks integration

jet:
	go run ./cmd/jetgen

mocks:
	go generate ./internal/ports/...
	go generate ./internal/api/internalhttp/handlers/...

integration:
	go test -tags=integration -count=1 ./integration/...
@@ -0,0 +1,867 @@
# Runtime Manager

`Runtime Manager` (RTM) is the only Galaxy platform service permitted to interact with the
Docker daemon. It owns the lifecycle of `galaxy/game` engine containers and the technical
runtime view of running games. Other services consume RTM via two transports: an asynchronous
Redis Streams contract (used by `Game Lobby`) and a synchronous internal REST surface (used by
`Game Master` and `Admin Service`).

## References

- [`../ARCHITECTURE.md`](../ARCHITECTURE.md) — system architecture, §9 Runtime Manager.
- [`../TESTING.md`](../TESTING.md) §7 — testing matrix for RTM.
- [`./docs/README.md`](./docs/README.md) — service-local documentation entry point.
- [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml) — REST contract.
- [`./api/runtime-jobs-asyncapi.yaml`](./api/runtime-jobs-asyncapi.yaml) — start/stop job
  streams contract.
- [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml) —
  `runtime:health_events` stream contract.
- [`../game/README.md`](../game/README.md) — game engine container contract (env, ports,
  `/healthz`).
- [`../lobby/README.md`](../lobby/README.md) — Game Lobby integration with RTM.

## Purpose

A running Galaxy game lives in exactly one Docker container. The platform must be able to:

- create the container with the right engine version and configuration;
- supply the engine with a stable storage location for game state;
- keep the runtime status visible to platform-level services;
- replace the container in place for patch upgrades and restarts;
- remove containers that are no longer needed;
- detect and surface engine failures to whoever should react.

`Runtime Manager` is the single component that performs these actions. It deliberately does
**not** reason about platform metadata, membership, schedules, turn cutoffs, or any other
business state. Game Lobby owns platform metadata; Game Master will own runtime business state
when implemented.

## Scope

`Runtime Manager` is the source of truth for:

- the mapping `game_id -> current_container_id` for every running container;
- the durable history of every start, stop, restart, patch, and cleanup operation it performed;
- the most recent technical health observation per game (last Docker event, last successful or
  failed probe, last inspect result).

`Runtime Manager` is not the source of truth for:

- any business or platform-level metadata of a game (owned by `Game Lobby`);
- runtime state visible to players or operators as game state, including current turn,
  generation status, engine version registry (owned by `Game Master`);
- the engine version catalogue or which engine version a game is allowed to use (`Game Master`
  is the future owner; `Game Lobby` supplies `image_ref` in v1);
- contents of the engine state directory; that is engine domain;
- backup, archival, or operator cleanup of state directories.

## Non-Goals

- Multi-instance operation in v1. Coordination is single-process; multiple replicas are an
  explicit future iteration.
- Engine version arbitration. The producer (`Game Lobby` in v1, `Game Master` later) supplies `image_ref`.
- Image registry control. Pull policy is configurable, but RTM does not push, retag, or
  promote images.
- TLS or mTLS on the internal listener. RTM trusts its network segment.
- Direct delivery of player-visible push notifications. RTM publishes admin-only notification
  intents only for failures invisible elsewhere; everything else is delegated.
- Kubernetes, Docker Swarm, or other orchestrators. v1 targets a single Docker daemon reached
  through `unix:///var/run/docker.sock`.

## Position in the System

```mermaid
flowchart LR
    Lobby["Game Lobby"]
    GM["Game Master"]
    Admin["Admin Service"]
    Notify["Notification Service"]
    RTM["Runtime Manager"]
    Engine["Game Engine container"]
    Docker["Docker Daemon"]
    Postgres["PostgreSQL\nschema rtmanager"]
    Redis["Redis\nstreams + leases"]

    Lobby -->|runtime:start_jobs / stop_jobs| RTM
    RTM -->|runtime:job_results| Lobby
    GM -->|internal REST| RTM
    Admin -->|internal REST| RTM
    RTM -->|notification:intents (admin)| Notify
    RTM -->|runtime:health_events| Redis
    RTM <--> Docker
    Docker -->|create / start / stop / rm| Engine
    RTM --> Postgres
    RTM --> Redis
    Engine -.bind mount.- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
```

## Responsibility Boundaries

`Runtime Manager` is responsible for:

- accepting start, stop, restart, patch, inspect, and cleanup requests through the supported
  transports and producing one durable outcome per request;
- creating Docker containers from a producer-supplied `image_ref` and binding them to the
  configured Docker network and host state directory;
- enforcing the one-game-one-container invariant in its own state and on Docker;
- monitoring container health through Docker events, periodic inspect, and active HTTP probes;
- publishing technical runtime events (`runtime:job_results`, `runtime:health_events`) and
  admin-only notification intents for failures that no other service can observe;
- reconciling its persistent state with Docker reality on startup and periodically;
- removing exited containers automatically by retention TTL or explicitly by admin command.

`Runtime Manager` is not responsible for:

- evaluating whether a game is allowed to start (Lobby validates roster, schedule, etc.);
- registering a started runtime with `Game Master` (Lobby calls GM after a successful job
  result);
- mapping platform users to engine players (GM owns this mapping);
- player command routing (GM proxies player commands directly to engine);
- cleaning up host state directories;
- patching the engine version registry; the registry lives in `Game Master`.

## Container Model

### Network

Containers attach to a single user-defined Docker bridge network. The network is provisioned
**outside** RTM: docker-compose, Terraform, or an operator runbook creates `galaxy-net` (or
whatever name is configured via `RTMANAGER_DOCKER_NETWORK`).

RTM validates the network's presence at startup. A missing network is a fail-fast condition;
the process exits non-zero before opening any listener.

### DNS name and engine endpoint

Each container is created with hostname `galaxy-game-{game_id}` and is attached to the
configured network. Docker's embedded DNS resolves the hostname for any other container in the
same network.

The `engine_endpoint` published in `runtime:job_results` and visible through the inspect REST
endpoint is the full URL `http://galaxy-game-{game_id}:8080`. The port is fixed at `8080`
inside the container; RTM does not publish ports to the host.

Restart and patch keep the same DNS name. The `container_id` changes; the `engine_endpoint`
does not.

### State storage (bind mount)

Engine state lives on the host filesystem. RTM never uses Docker named volumes — the rationale
is operator-friendly backup and inspection.

- Host root: `RTMANAGER_GAME_STATE_ROOT` (operator-supplied, e.g. `/var/lib/galaxy/games`).
- Per-game directory: `<RTMANAGER_GAME_STATE_ROOT>/{game_id}`. RTM creates it with permissions
  `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and ownership `RTMANAGER_GAME_STATE_OWNER_UID`
  / `_GID` (default `0:0` — operator overrides for non-root engine).
- Bind mount: the per-game directory is mounted into the container at the path declared by
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- Environment: the container receives `GAME_STATE_PATH=<mount path>`. The engine resolves the
  path from this variable. The same value is also forwarded to the engine as `STORAGE_PATH` for
  backward compatibility — both names are accepted in v1.

RTM never deletes the host state directory. Removing it is the responsibility of operator
tooling (backup, manual cleanup, or future Admin Service workflows). Removing the container
through the cleanup endpoint or the retention TTL leaves the directory intact.
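
A minimal sketch of the per-game directory preparation described above, using only the Go
standard library. The helper name and error wrapping are illustrative assumptions, not the
production code:

```go
package statefs

import (
	"fmt"
	"os"
	"path/filepath"
)

// EnsureGameStateDir creates <root>/<gameID> with the configured mode and
// ownership before the bind mount is handed to Docker.
func EnsureGameStateDir(root, gameID string, mode os.FileMode, uid, gid int) (string, error) {
	dir := filepath.Join(root, gameID)
	if err := os.MkdirAll(dir, mode); err != nil {
		return "", fmt.Errorf("create state dir: %w", err)
	}
	// MkdirAll is subject to the process umask, so re-apply the exact mode.
	if err := os.Chmod(dir, mode); err != nil {
		return "", fmt.Errorf("chmod state dir: %w", err)
	}
	// Default ownership 0:0; operators override UID/GID for non-root engines.
	if err := os.Chown(dir, uid, gid); err != nil {
		return "", fmt.Errorf("chown state dir: %w", err)
	}
	return dir, nil
}
```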

### Container labels

RTM applies the following labels to every container it creates:

| Label | Value | Purpose |
| --- | --- | --- |
| `com.galaxy.owner` | `rtmanager` | Filter for `docker ps` and reconcile. |
| `com.galaxy.kind` | `game-engine` | Differentiates from infra containers. |
| `com.galaxy.game_id` | `{game_id}` | Reverse lookup from container to platform game. |
| `com.galaxy.engine_image_ref` | `{image_ref}` | Cross-check against `runtime_records`. |
| `com.galaxy.started_at_ms` | `{ms}` | Unambiguous start timestamp. |

Separately, RTM reads labels from the resolved engine image to choose resource limits (see below).

### Resource limits

Resource limits originate in the **engine image**, not in the producer envelope or RTM config:

| Image label | Container limit | RTM fallback config |
| --- | --- | --- |
| `com.galaxy.cpu_quota` | `--cpus` value | `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`) |
| `com.galaxy.memory` | `--memory` value | `RTMANAGER_DEFAULT_MEMORY` (default `512m`) |
| `com.galaxy.pids_limit` | `--pids-limit` value | `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`) |

If a label is missing or unparseable, RTM uses the matching fallback. Producers never pass
limits.
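
A sketch of that precedence (image label first, RTM fallback config second); the type and
function names are illustrative, not the adapter's real API:

```go
package limits

import "strconv"

// Limits resolved for one container.
type Limits struct {
	CPUQuota  float64 // maps to --cpus
	Memory    string  // maps to --memory, e.g. "512m"
	PidsLimit int64   // maps to --pids-limit
}

// FromImageLabels applies the label-over-fallback precedence described above.
// Missing or unparseable labels silently fall back to the configured defaults.
func FromImageLabels(labels map[string]string, fallback Limits) Limits {
	out := fallback
	if v, ok := labels["com.galaxy.cpu_quota"]; ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			out.CPUQuota = f
		}
	}
	if v, ok := labels["com.galaxy.memory"]; ok && v != "" {
		out.Memory = v
	}
	if v, ok := labels["com.galaxy.pids_limit"]; ok {
		if n, err := strconv.ParseInt(v, 10, 64); err == nil {
			out.PidsLimit = n
		}
	}
	return out
}
```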

### Logging driver

Engine container stdout / stderr are routed by Docker's logging driver. RTM passes the driver
and its options when creating the container:

- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty; comma-separated `key=value` pairs).

RTM never reads the container's stdout itself. Operators consume engine logs via `docker logs`
or via whatever sink the configured driver feeds (fluentd, journald, etc.).

The production Docker SDK adapter that creates and starts these containers lives at
`internal/adapters/docker/`. Its design rationale — fixed engine port, partial-rollback on
`ContainerStart` failure, events-stream filter rationale, and the `mockgen`-driven service-test
fixture — is captured in [`docs/adapters.md`](docs/adapters.md).
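
For orientation, a compact sketch of a create-and-start call against the Docker Go SDK
(`github.com/docker/docker`), wiring the hostname, network, labels, env, bind mount, and log
driver described in this section. Option struct names follow recent SDK releases (older
releases keep some of them in the `types` package); the real adapter in
`internal/adapters/docker/` is more complete, and the function shown here is an assumption,
not its API:

```go
package dockersketch

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/network"
	"github.com/docker/docker/client"
)

// createAndStart creates one engine container and starts it, removing the
// container again if start fails so nothing half-created is left behind.
func createAndStart(ctx context.Context, cli *client.Client,
	gameID, imageRef, netName, statePath, mountPath string) (string, error) {
	hostname := "galaxy-game-" + gameID
	cfg := &container.Config{
		Hostname: hostname,
		Image:    imageRef,
		Env: []string{
			"GAME_STATE_PATH=" + mountPath,
			"STORAGE_PATH=" + mountPath, // legacy alias, see State storage
		},
		Labels: map[string]string{
			"com.galaxy.owner":   "rtmanager",
			"com.galaxy.kind":    "game-engine",
			"com.galaxy.game_id": gameID,
		},
	}
	hostCfg := &container.HostConfig{
		NetworkMode: container.NetworkMode(netName),
		Binds:       []string{statePath + ":" + mountPath},
		LogConfig:   container.LogConfig{Type: "json-file"},
	}
	netCfg := &network.NetworkingConfig{
		EndpointsConfig: map[string]*network.EndpointSettings{netName: {}},
	}
	resp, err := cli.ContainerCreate(ctx, cfg, hostCfg, netCfg, nil, hostname)
	if err != nil {
		return "", fmt.Errorf("create: %w", err)
	}
	if err := cli.ContainerStart(ctx, resp.ID, container.StartOptions{}); err != nil {
		// Partial rollback on ContainerStart failure, as described above.
		_ = cli.ContainerRemove(ctx, resp.ID, container.RemoveOptions{Force: true})
		return "", fmt.Errorf("start: %w", err)
	}
	return resp.ID, nil
}
```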

## Runtime Surface

### Listeners

| Listener | Default address | Purpose |
| --- | --- | --- |
| `internal` HTTP | `:8096` (`RTMANAGER_INTERNAL_HTTP_ADDR`) | Probes (`/healthz`, `/readyz`) and the trusted REST surface for `Game Master` and `Admin Service`. |

There is no public listener. The internal listener is unauthenticated and assumes a trusted
network segment.

### Background workers

| Worker | Driver | Description |
| --- | --- | --- |
| `startjobs` consumer | Redis Stream `runtime:start_jobs` | Decodes start envelope and invokes the start service. |
| `stopjobs` consumer | Redis Stream `runtime:stop_jobs` | Decodes stop envelope and invokes the stop service. |
| Docker events listener | Docker `/events` API | Subscribes with the label filter, emits `runtime:health_events` for container exited / oom / disappeared. |
| Active HTTP probe | Periodic | `GET {engine_endpoint}/healthz` for every running runtime; emits `probe_failed` / `probe_recovered` with hysteresis. |
| Periodic Docker inspect | Periodic | Refreshes inspect data; emits `inspect_unhealthy` when restart_count grows or status is unexpected. |
| Reconciler | Startup + periodic | Reconciles `runtime_records` with `docker ps` (see Reconciliation section). |
| Container cleanup | Periodic | Removes exited containers older than `RTMANAGER_CONTAINER_RETENTION_DAYS`. |

### Startup dependencies

In start order:

1. PostgreSQL primary (DSN `RTMANAGER_POSTGRES_PRIMARY_DSN`). Goose migrations apply
   synchronously before any listener opens.
2. Redis master (`RTMANAGER_REDIS_MASTER_ADDR`).
3. Docker daemon at `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`). RTM
   verifies API ping and the presence of `RTMANAGER_DOCKER_NETWORK`.
4. Telemetry exporter (OTLP gRPC/HTTP or stdout).
5. Internal HTTP listener.
6. Reconciler runs once and blocks until done.
7. Background workers start.

A failure in any step is fatal and exits the process non-zero.

### Probes

`/healthz` reports liveness — the process responds when the HTTP server is alive.

`/readyz` reports readiness — `200` only when:

- the PostgreSQL pool can ping the primary;
- the Redis master client can ping;
- the Docker client can ping;
- the configured Docker network exists.

Both probes are documented in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).

## Lifecycles

All operations share a per-game-id Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`, default `60`). The lease serialises operations on a
single game across all entry points (stream consumers and REST handlers). v1 does not renew
the lease mid-operation; long pulls of multi-GB images can therefore expire the lease before
the operation finishes — the trade-off is documented in
[`docs/services.md` §1](docs/services.md).
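
A sketch of the lease acquisition, assuming `go-redis` v9 and `google/uuid` as the clients;
the compare-and-delete release guards against deleting a lease that has already expired and
been re-acquired by another operation. Names are illustrative:

```go
package lease

import (
	"context"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// Acquire takes the per-game lease with SET NX PX and returns a release func.
func Acquire(ctx context.Context, rdb *redis.Client, gameID string, ttl time.Duration) (func(context.Context) error, error) {
	key := "rtmanager:game_lease:" + gameID
	token := uuid.NewString()
	ok, err := rdb.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, fmt.Errorf("lease for game %s already held", gameID)
	}
	// Delete only if the stored token is still ours.
	const release = `if redis.call("GET", KEYS[1]) == ARGV[1] then return redis.call("DEL", KEYS[1]) else return 0 end`
	return func(ctx context.Context) error {
		return rdb.Eval(ctx, release, []string{key}, token).Err()
	}, nil
}
```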

### Start

**Triggers:**

- Lobby: a Redis Streams entry on `runtime:start_jobs` with envelope
  `{game_id, image_ref, requested_at_ms}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/start` with body
  `{image_ref}`.

**Pre-conditions:**

- `image_ref` is a non-empty string and parseable as a Docker reference.
- Configured Docker network exists.
- The lease for `{game_id}` is acquired.

**Flow on success:**

1. Read `runtime_records.{game_id}`. If `status=running` with the same `image_ref`, return
   the existing record (idempotent success, `error_code=replay_no_op`).
2. Pull the image per `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`).
3. Inspect the resolved image, derive resource limits from labels.
4. Ensure the per-game state directory exists with the configured mode and ownership.
5. `docker create` with the configured network, hostname, labels, env (`GAME_STATE_PATH`,
   `STORAGE_PATH`), bind mount, log driver, resource limits.
6. `docker start`.
7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`,
   `current_image_ref`, `started_at`, `last_op_at`).
8. Append `operation_log` entry (`op_kind=start`, `outcome=success`, source-specific
   `op_source`).
9. Publish `runtime:health_events` `container_started`.
10. For Lobby callers: publish `runtime:job_results`
    `{game_id, outcome=success, container_id, engine_endpoint}`.
    For REST callers: respond `200` with the runtime record.

**Failure paths:**

| Failure | PG side effect | Notification intent | Outcome to caller |
| --- | --- | --- | --- |
| Invalid `image_ref` shape, network missing | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |
| Image pull error | `operation_log` failure | `runtime.image_pull_failed` | `failure / image_pull_failed` |
| `docker create` / `start` error | `operation_log` failure | `runtime.container_start_failed` | `failure / container_start_failed` |
| State directory creation error | `operation_log` failure | `runtime.start_config_invalid` | `failure / start_config_invalid` |

A failed start never leaves a partially-running container: if `docker create` succeeded but
the subsequent step failed, RTM removes the container before recording the failure.
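
A sketch of that rule against a hypothetical slice of the Docker port: once a container has
been created, every later failure (start or the record upsert) removes it again before the
failure is recorded. Interface and helper names are assumptions for illustration only:

```go
package startsketch

import "context"

// DockerPort is a hypothetical subset of the real Docker port.
type DockerPort interface {
	CreateContainer(ctx context.Context, gameID, imageRef string) (string, error)
	StartContainer(ctx context.Context, id string) error
	RemoveContainer(ctx context.Context, id string, force bool) error
}

// startOnce returns the container id on success; on any failure after create
// the container is removed so no partially-running container survives.
func startOnce(ctx context.Context, d DockerPort, persist func(containerID string) error,
	gameID, imageRef string) (string, error) {
	cid, err := d.CreateContainer(ctx, gameID, imageRef)
	if err != nil {
		return "", err // nothing to roll back yet
	}
	rollback := func(cause error) (string, error) {
		_ = d.RemoveContainer(ctx, cid, true) // best-effort rollback
		return "", cause
	}
	if err := d.StartContainer(ctx, cid); err != nil {
		return rollback(err)
	}
	if err := persist(cid); err != nil { // e.g. the runtime_records upsert
		return rollback(err)
	}
	return cid, nil
}
```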

The production start orchestrator that implements the flow and the failure paths above lives
at `internal/service/startruntime/`. Its design rationale — why the per-game lease and the
health-events publisher live with the start service, the `Result`-shaped contract consumed by
the stream consumer and the REST handler, the rollback rule on Upsert failure, and the
`created_at`-preservation rule for re-starts — is captured in
[`docs/services.md`](docs/services.md).

### Stop

**Triggers:**

- Lobby: Redis Streams entry on `runtime:stop_jobs` with envelope
  `{game_id, reason, requested_at_ms}`. `reason ∈ {orphan_cleanup, cancelled, finished,
  admin_request, timeout}`.
- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/stop` with body
  `{reason}`.

**Pre-conditions:**

- Lease acquired.

**Flow on success:**

1. Read `runtime_records.{game_id}`. If `status` is `stopped` or `removed`, return
   idempotent success (`error_code=replay_no_op`).
2. `docker stop` with `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`). Docker fires
   SIGKILL if the engine ignores SIGTERM beyond the timeout. RTM does not call any HTTP
   shutdown endpoint on the engine.
3. Update `runtime_records` (`status=stopped`, `stopped_at`, `last_op_at`).
4. Append `operation_log` entry.
5. Publish `runtime:job_results` (for Lobby) or REST `200` (for REST callers).

The container stays in `exited` state until the cleanup worker removes it (TTL) or an admin
command forces removal.

**Failure paths:**

| Failure | Outcome |
| --- | --- |
| Container not found in Docker but record `running` | Update record `status=removed`, publish `container_disappeared`, return `success` (RTM treats this as already-stopped). |
| `docker stop` returns non-zero, container still alive | Failure recorded, no state change. Caller may retry. |

### Restart

**Triggers:**

- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/restart`.

Restart is **recreate**: stop + remove + run with the same `image_ref` and the same bind
mount. `container_id` changes; `engine_endpoint` is stable.

**Flow:**

1. Read `runtime_records.{game_id}`. The current `image_ref` is captured.
2. Acquire lease.
3. Run the stop flow (without releasing the lease).
4. `docker rm` the container.
5. Run the start flow with the captured `image_ref`.
6. Append a single `operation_log` entry with `op_kind=restart` and a correlation id linking
   the implicit stop and start log entries.

If any inner step fails, the operation log records the partial outcome and the outer caller
receives the same failure; the runtime record converges to whatever state Docker reports.

### Patch

**Triggers:**

- Game Master / Admin Service: `POST /api/v1/internal/runtimes/{game_id}/patch` with body
  `{image_ref}`.

Patch is restart with a **new** `image_ref`. The engine reads its state from the bind mount
on startup, so any data written before the patch survives.

**Pre-conditions:**

- New and current image refs both parse as semver tags (`image_ref_not_semver` failure
  otherwise).
- Major and minor versions are equal between current and new (`semver_patch_only` failure
  otherwise); a sketch of this gate follows the list.
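
A sketch of the gate using `golang.org/x/mod/semver`; the tag extraction here is deliberately
simplified (everything after the last `:`), whereas the production code parses the Docker
reference properly:

```go
package patchcheck

import (
	"fmt"
	"strings"

	"golang.org/x/mod/semver"
)

// checkPatchOnly enforces the two pre-conditions above.
func checkPatchOnly(currentRef, newRef string) error {
	cur, next := tagOf(currentRef), tagOf(newRef)
	if !semver.IsValid(cur) || !semver.IsValid(next) {
		return fmt.Errorf("image_ref_not_semver")
	}
	if semver.MajorMinor(cur) != semver.MajorMinor(next) {
		return fmt.Errorf("semver_patch_only")
	}
	return nil
}

// tagOf extracts the tag and adds the "v" prefix x/mod/semver expects.
func tagOf(ref string) string {
	tag := ref[strings.LastIndex(ref, ":")+1:]
	return "v" + strings.TrimPrefix(tag, "v")
}
```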

**Flow:** identical to restart, with a new `image_ref` injected before the start step.
`operation_log` entry has `op_kind=patch`.

### Cleanup

**Triggers:**

- Periodic worker: every container with `runtime_records.status=stopped` and
  `last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- Admin Service: `DELETE /api/v1/internal/runtimes/{game_id}/container`.

**Pre-conditions:**

- The container is not in `running` state. RTM refuses to remove a running container through
  this path; stop first.

**Flow:**

1. Acquire lease.
2. `docker rm` the container.
3. Update `runtime_records` (`status=removed`, `removed_at`, `current_container_id=NULL`,
   `last_op_at`).
4. Append `operation_log` entry (`op_kind=cleanup_container`,
   `op_source ∈ {auto_ttl, admin_rest}`).

The host state directory is left untouched.

## Health Monitoring

Three independent sources feed `runtime:health_events` and `health_snapshots`:

1. **Docker events listener.** Subscribes to the Docker events stream and filters
   container-scoped events by the `com.galaxy.owner=rtmanager` label written into every
   container by the start service. Emits:
   - `container_exited` (action=`die` with non-zero exit code; exit `0` is the normal
     graceful stop and is suppressed).
   - `container_oom` (action=`oom`).
   - `container_disappeared` (action=`destroy` observed for a `runtime_records.status=running`
     row whose `current_container_id` still matches the destroyed container, i.e. a destroy
     RTM did not initiate).

   `container_started` is emitted by the start service when it runs the container (see
   `internal/service/startruntime`), not by this listener.
2. **Periodic Docker inspect** every `RTMANAGER_INSPECT_INTERVAL` (default `30s`). Emits
   `inspect_unhealthy` when:
   - `RestartCount` increases between observations;
   - `State.Status != "running"` for a record marked running;
   - `State.Health.Status == "unhealthy"` if the image declares a Docker `HEALTHCHECK`.
3. **Active HTTP probe** every `RTMANAGER_PROBE_INTERVAL` (default `15s`). Calls
   `GET {engine_endpoint}/healthz` with `RTMANAGER_PROBE_TIMEOUT` (default `2s`). Emits:
   - `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures
     (default `3`);
   - `probe_recovered` on the first success after a `probe_failed` was published.

Every emission updates `health_snapshots.{game_id}` (latest event becomes the snapshot) and
appends to `runtime:health_events`.
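
A sketch of the probe hysteresis described in source 3 above, kept per running runtime; the
type and method names are illustrative:

```go
package probesketch

type eventType string

const (
	probeFailed    eventType = "probe_failed"
	probeRecovered eventType = "probe_recovered"
)

// probeState tracks consecutive failures for one runtime and decides when an
// event is due, so flapping probes do not spam the stream.
type probeState struct {
	consecutiveFailures int
	failedPublished     bool
}

// observe feeds one probe result and returns the event to publish, if any.
func (s *probeState) observe(success bool, threshold int) (eventType, bool) {
	if success {
		wasFailed := s.failedPublished
		s.consecutiveFailures = 0
		s.failedPublished = false
		if wasFailed {
			return probeRecovered, true // first success after a published failure
		}
		return "", false
	}
	s.consecutiveFailures++
	if !s.failedPublished && s.consecutiveFailures >= threshold {
		s.failedPublished = true
		return probeFailed, true
	}
	return "", false
}
```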

In v1, RTM publishes admin-only notification intents only for first-touch failures of the
start flow. All ongoing health changes (probe failures, OOMs, exits) flow through
`runtime:health_events` only. `Game Master` is the consumer that decides whether to escalate
runtime-level events into notifications.

The three workers that implement the sources above live in
`internal/worker/{dockerevents,dockerinspect,healthprobe}`. Their design rationale —
`container_started` ownership, `container_disappeared` emission rules, `die` exit-code
suppression, probe hysteresis state model, parallel-probe cap, and the events-listener
reconnect policy — is captured in [`docs/workers.md`](docs/workers.md).

## Reconciliation

RTM never assumes Docker and PostgreSQL are in sync.

At startup (blocking, before workers start) and every `RTMANAGER_RECONCILE_INTERVAL`
(default `5m`):

1. List Docker containers with label `com.galaxy.owner=rtmanager`.
2. For each running container without a matching record:
   - Insert a `runtime_records` row with `status=running`, the discovered
     `current_image_ref`, `engine_endpoint`, and `started_at` taken from
     `com.galaxy.started_at_ms` if present (otherwise from `State.StartedAt`).
   - Append `operation_log` entry with `op_kind=reconcile_adopt`,
     `op_source=auto_reconcile`.
   - **Never stop or remove an unrecorded container.** Operators may have started one
     manually for diagnostics; RTM stays out of their way.
3. For each `runtime_records` row with `status=running` whose container is missing:
   - Update `status=removed`, `removed_at=now`, `current_container_id=NULL`.
   - Publish `runtime:health_events` `container_disappeared`.
   - Append `operation_log` entry with `op_kind=reconcile_dispose`.
4. For each `runtime_records` row with `status=running` whose container exists but is in
   `exited`:
   - Update `status=stopped`, `stopped_at=now` (reconciler observation time).
   - Publish `runtime:health_events` `container_exited` with the observed exit code.

The reconciler implementation lives at `internal/worker/reconcile/` and the periodic
TTL-cleanup worker at `internal/worker/containercleanup/`; the cleanup worker delegates
removal to `internal/service/cleanupcontainer/`. The design rationale — the per-game
lease around every drift mutation, the third `observed_exited` path beyond the two
named cases, the synchronous `ReconcileNow` plus periodic `Component` split, and why
the cleanup worker is a thin TTL filter on top of the existing service — is captured in
[`docs/workers.md`](docs/workers.md).
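
A sketch of the drift classification behind steps 2-4, matching the `adopt` / `dispose` /
`observed_exited` kinds that also label the `rtmanager.reconcile_drift` metric. Types and
names are illustrative; the real reconciler additionally takes the per-game lease before
mutating anything:

```go
package reconcilesketch

type driftKind string

const (
	driftAdopt          driftKind = "adopt"           // running container without a record
	driftDispose        driftKind = "dispose"         // running record without a container
	driftObservedExited driftKind = "observed_exited" // running record, container exited
)

type observedContainer struct {
	ID      string
	Running bool
}

// classify compares records marked running against containers carrying the
// com.galaxy.owner=rtmanager label and returns the drift per game_id.
func classify(runningRecords map[string]string, containers map[string]observedContainer) map[string]driftKind {
	drift := map[string]driftKind{}
	for gameID, c := range containers {
		if _, known := runningRecords[gameID]; !known && c.Running {
			drift[gameID] = driftAdopt
		}
	}
	for gameID := range runningRecords {
		c, exists := containers[gameID]
		switch {
		case !exists:
			drift[gameID] = driftDispose
		case !c.Running:
			drift[gameID] = driftObservedExited
		}
	}
	return drift
}
```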

## Trusted Surfaces

### Internal REST

The internal REST surface is consumed by `Game Master` (sync interactions for inspect,
restart, patch, stop, cleanup) and `Admin Service` (operational tooling, force-cleanup).
The listener is unauthenticated; downstream services rely on network segmentation.

| Method | Path | Operation ID | Caller |
| --- | --- | --- | --- |
| `GET` | `/healthz` | `internalHealthz` | platform probes |
| `GET` | `/readyz` | `internalReadyz` | platform probes |
| `GET` | `/api/v1/internal/runtimes` | `internalListRuntimes` | GM, Admin |
| `GET` | `/api/v1/internal/runtimes/{game_id}` | `internalGetRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/start` | `internalStartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/stop` | `internalStopRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/restart` | `internalRestartRuntime` | GM, Admin |
| `POST` | `/api/v1/internal/runtimes/{game_id}/patch` | `internalPatchRuntime` | GM, Admin |
| `DELETE` | `/api/v1/internal/runtimes/{game_id}/container` | `internalCleanupRuntimeContainer` | Admin |

Request and response shapes are defined in [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
Unknown JSON fields are rejected with `invalid_request`.

Callers identify themselves through the optional `X-Galaxy-Caller` request header (`gm` for
`Game Master`, `admin` for `Admin Service`). The header is recorded as `op_source` in
`operation_log` (`gm_rest` or `admin_rest`); when missing or carrying any other value Runtime
Manager defaults to `op_source = admin_rest`. The header is documented on every runtime
endpoint of [`./api/internal-openapi.yaml`](./api/internal-openapi.yaml).
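
The header-to-`op_source` mapping is small enough to show in full; a sketch with an
illustrative function name:

```go
package callersketch

import "net/http"

// opSourceFromHeader maps the optional X-Galaxy-Caller header onto the
// operation_log op_source values; missing or unknown values default to admin_rest.
func opSourceFromHeader(r *http.Request) string {
	switch r.Header.Get("X-Galaxy-Caller") {
	case "gm":
		return "gm_rest"
	case "admin":
		return "admin_rest"
	default:
		return "admin_rest"
	}
}
```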

## Async Stream Contracts

### `runtime:start_jobs` (in)

Producer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference. Lobby resolves it from `target_engine_version` using `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `requested_at_ms` | int64 | UTC milliseconds. Used for diagnostics, not authoritative. |

### `runtime:stop_jobs` (in)

Producer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `reason` | enum | `orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`. Recorded in `operation_log.error_code` when the reason matters; otherwise opaque. |
| `requested_at_ms` | int64 | |

### `runtime:job_results` (out)

Producer: `Runtime Manager`. Consumer: `Game Lobby`.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `outcome` | enum | `success`, `failure`. |
| `container_id` | string | Required for `success`. Empty on `failure`. |
| `engine_endpoint` | string | Required for `success`. Empty on `failure`. |
| `error_code` | string | Stable code. `replay_no_op` for idempotent re-runs. |
| `error_message` | string | Operator-readable detail. |
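
A sketch of publishing one such entry, assuming `go-redis` v9; the field names match the table
above and the helper name is illustrative:

```go
package resultsketch

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// publishJobResult appends one runtime:job_results entry. The stream name is
// configurable (RTMANAGER_REDIS_JOB_RESULTS_STREAM, default "runtime:job_results").
func publishJobResult(ctx context.Context, rdb *redis.Client, stream,
	gameID, outcome, containerID, endpoint, errCode, errMsg string) error {
	return rdb.XAdd(ctx, &redis.XAddArgs{
		Stream: stream,
		Values: map[string]interface{}{
			"game_id":         gameID,
			"outcome":         outcome, // "success" or "failure"
			"container_id":    containerID,
			"engine_endpoint": endpoint,
			"error_code":      errCode,
			"error_message":   errMsg,
		},
	}).Err()
}
```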

### `runtime:health_events` (out, new)

Producer: `Runtime Manager`. Consumers: `Game Master`; `Game Lobby` and `Admin Service`
are reserved as future consumers.

| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `container_id` | string | The container observed (may differ from current after a restart race). |
| `event_type` | enum | See below. |
| `occurred_at_ms` | int64 | UTC milliseconds. |
| `details` | json | Type-specific payload. |

`event_type` values and their `details` schemas:

| `event_type` | `details` payload |
| --- | --- |
| `container_started` | `{image_ref}` |
| `container_exited` | `{exit_code, oom: bool}` |
| `container_oom` | `{exit_code}` |
| `container_disappeared` | `{}` |
| `inspect_unhealthy` | `{restart_count, state, health}` |
| `probe_failed` | `{consecutive_failures, last_status, last_error}` |
| `probe_recovered` | `{prior_failure_count}` |

The full schema is enforced by [`./api/runtime-health-asyncapi.yaml`](./api/runtime-health-asyncapi.yaml).

## Notification Contracts

`Runtime Manager` publishes admin-only notification intents only for failures invisible to
any other service:

| Trigger | `notification_type` | Audience | Channels |
| --- | --- | --- | --- |
| Image pull error during start | `runtime.image_pull_failed` | admin | email |
| `docker create` / `docker start` error | `runtime.container_start_failed` | admin | email |
| Configuration validation error at start (bad image_ref, missing network) | `runtime.start_config_invalid` | admin | email |

Constructors live in `galaxy/pkg/notificationintent`. Catalog entries live in
[`../notification/README.md`](../notification/README.md) and
[`../notification/api/intents-asyncapi.yaml`](../notification/api/intents-asyncapi.yaml).
All three intents share the frozen field set
`{game_id, image_ref, error_code, error_message, attempted_at_ms}`; the `_ms` suffix on
`attempted_at_ms` follows the repo-wide convention for millisecond integer fields.

The Redis Streams publisher wrapper used to emit these intents from RTM ships in
`internal/adapters/notificationpublisher/`; the rationale for the signature shim that drops
the upstream entry id lives in [`docs/domain-and-ports.md` §7](docs/domain-and-ports.md) and
the production wiring is documented in [`docs/adapters.md`](docs/adapters.md).

Runtime-level changes after a successful start (probe failures, OOM, container exited)
**do not** produce notifications from RTM. Game Master decides whether to escalate.

## Persistence Layout

### PostgreSQL durable state (schema `rtmanager`)

| Table | Purpose | Key |
| --- | --- | --- |
| `runtime_records` | One row per game, latest known runtime status. | `game_id` |
| `operation_log` | Append-only audit of every operation RTM performed. | `id` (auto) |
| `health_snapshots` | Latest health observation per game. | `game_id` |

`runtime_records` columns:

- `game_id` — primary key, references Lobby's identifier.
- `status` — `running | stopped | removed`.
- `current_container_id` — nullable when `status=removed`.
- `current_image_ref` — non-null when status is `running` or `stopped`.
- `engine_endpoint` — `http://galaxy-game-{game_id}:8080`.
- `state_path` — absolute host path of the bind-mounted directory.
- `docker_network` — network name observed at create time.
- `started_at`, `stopped_at`, `removed_at` — last transition timestamps.
- `last_op_at` — drives retention TTL.
- `created_at` — first time RTM saw the game.

`operation_log` columns:

- `id`, `game_id`, `op_kind` (`start | stop | restart | patch | cleanup_container |
  reconcile_adopt | reconcile_dispose`), `op_source` (`lobby_stream | gm_rest | admin_rest |
  auto_ttl | auto_reconcile`), `source_ref` (stream entry id, REST request id, or admin
  user), `image_ref`, `container_id`, `outcome` (`success | failure`), `error_code`,
  `error_message`, `started_at`, `finished_at`.

`health_snapshots` columns:

- `game_id`, `container_id`, `status`
  (`healthy | probe_failed | exited | oom | inspect_unhealthy | container_disappeared`),
  `source` (`docker_event | inspect | probe`), `details` (jsonb), `observed_at`.

Indexes:

- `runtime_records (status, last_op_at)` — drives cleanup worker.
- `operation_log (game_id, started_at DESC)` — drives audit reads.

Migrations are embedded `00001_init.sql` (single-init pre-launch policy from
`ARCHITECTURE.md §Persistence Backends`).

### Redis runtime-coordination state

| Key shape | Purpose |
| --- | --- |
| `rtmanager:stream_offsets:{label}` | Last processed entry id per consumer (`startjobs`, `stopjobs`). Same shape as Lobby. |
| `rtmanager:game_lease:{game_id}` | Per-game lease string (`SET ... NX PX <ttl>`). TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default 60s); not renewed mid-operation in v1. The trade-off is documented in [`docs/services.md` §1](docs/services.md). |

Stream key shapes themselves are configurable:

- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).

## Error Model

Error envelope: `{ "error": { "code": "...", "message": "..." } }`, identical to Lobby's.
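
A sketch of the envelope and a handler helper that writes it; names are illustrative and the
status-per-code mapping (400, 404, 409, 500, 503) lives with the handlers:

```go
package httperr

import (
	"encoding/json"
	"net/http"
)

// errorEnvelope mirrors the { "error": { "code", "message" } } shape above.
type errorEnvelope struct {
	Error errorBody `json:"error"`
}

type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// writeError renders one error response with the stable code and a
// human-readable message.
func writeError(w http.ResponseWriter, status int, code, message string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(errorEnvelope{Error: errorBody{Code: code, Message: message}})
}
```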

Stable error codes:

| Code | Meaning |
| --- | --- |
| `invalid_request` | Malformed JSON, unknown fields, missing required parameter. |
| `not_found` | Runtime record does not exist. |
| `conflict` | Operation incompatible with current `status`. |
| `service_unavailable` | Dependency unavailable (Docker daemon, PG, Redis). |
| `internal_error` | Unspecified failure. |
| `image_pull_failed` | Image pull attempt failed. |
| `image_ref_not_semver` | Patch attempted with a tag that is not parseable semver. |
| `semver_patch_only` | Patch attempted across major/minor boundary. |
| `container_start_failed` | `docker create` / `docker start` failed. |
| `start_config_invalid` | Network missing, bind path inaccessible, or other config error. |
| `docker_unavailable` | Docker daemon ping failed. |
| `replay_no_op` | Idempotent replay; outcome is success but no work was done. |

## Configuration

All variables use the `RTMANAGER_` prefix. A missing required variable is a fail-fast error on
startup.

### Required

- `RTMANAGER_INTERNAL_HTTP_ADDR`
- `RTMANAGER_POSTGRES_PRIMARY_DSN`
- `RTMANAGER_REDIS_MASTER_ADDR`
- `RTMANAGER_REDIS_PASSWORD`
- `RTMANAGER_DOCKER_HOST`
- `RTMANAGER_DOCKER_NETWORK`
- `RTMANAGER_GAME_STATE_ROOT`

### Configuration groups

**Listener:**

- `RTMANAGER_INTERNAL_HTTP_ADDR` (e.g. `:8096`).
- `RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT` (default `5s`).
- `RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT` (default `15s`).
- `RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT` (default `60s`).

**Docker:**

- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`).
- `RTMANAGER_DOCKER_API_VERSION` (default empty — let SDK negotiate).
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`).
- `RTMANAGER_DOCKER_LOG_DRIVER` (default `json-file`).
- `RTMANAGER_DOCKER_LOG_OPTS` (default empty).
- `RTMANAGER_IMAGE_PULL_POLICY` (default `if_missing`,
  values `if_missing | always | never`).

**Container defaults:**

- `RTMANAGER_DEFAULT_CPU_QUOTA` (default `1.0`).
- `RTMANAGER_DEFAULT_MEMORY` (default `512m`).
- `RTMANAGER_DEFAULT_PIDS_LIMIT` (default `512`).
- `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS` (default `30`).
- `RTMANAGER_CONTAINER_RETENTION_DAYS` (default `30`).
- `RTMANAGER_ENGINE_STATE_MOUNT_PATH` (default `/var/lib/galaxy-game`).
- `RTMANAGER_ENGINE_STATE_ENV_NAME` (default `GAME_STATE_PATH`).
- `RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`).
- `RTMANAGER_GAME_STATE_OWNER_UID` (default `0`).
- `RTMANAGER_GAME_STATE_OWNER_GID` (default `0`).
- `RTMANAGER_GAME_STATE_ROOT` (host path).

**Postgres:**

- `RTMANAGER_POSTGRES_PRIMARY_DSN` (`postgres://rtmanager:<pwd>@<host>:5432/galaxy?search_path=rtmanager&sslmode=disable`).
- `RTMANAGER_POSTGRES_REPLICA_DSNS` (optional, comma-separated; not used in v1).
- `RTMANAGER_POSTGRES_OPERATION_TIMEOUT` (default `2s`).
- `RTMANAGER_POSTGRES_MAX_OPEN_CONNS` (default `10`).
- `RTMANAGER_POSTGRES_MAX_IDLE_CONNS` (default `2`).
- `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME` (default `30m`).

**Redis:**

- `RTMANAGER_REDIS_MASTER_ADDR`.
- `RTMANAGER_REDIS_REPLICA_ADDRS` (optional, comma-separated).
- `RTMANAGER_REDIS_PASSWORD`.
- `RTMANAGER_REDIS_DB` (default `0`).
- `RTMANAGER_REDIS_OPERATION_TIMEOUT` (default `2s`).

**Streams:**

- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`).
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`).
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`).
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`).
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`).
- `RTMANAGER_STREAM_BLOCK_TIMEOUT` (default `5s`).

**Health monitoring:**

- `RTMANAGER_INSPECT_INTERVAL` (default `30s`).
- `RTMANAGER_PROBE_INTERVAL` (default `15s`).
- `RTMANAGER_PROBE_TIMEOUT` (default `2s`).
- `RTMANAGER_PROBE_FAILURES_THRESHOLD` (default `3`).

**Reconciler / cleanup:**

- `RTMANAGER_RECONCILE_INTERVAL` (default `5m`).
- `RTMANAGER_CLEANUP_INTERVAL` (default `1h`).

**Coordination:**

- `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60`).

**Lobby internal client:**

- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` (e.g. `http://lobby:8095`).
- `RTMANAGER_LOBBY_INTERNAL_TIMEOUT` (default `2s`).

**Logging:**

- `RTMANAGER_LOG_LEVEL` (default `info`).

**Lifecycle:**

- `RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`).

**Telemetry:** uses the standard OTLP env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
`OTEL_EXPORTER_OTLP_PROTOCOL`, etc.) shared with other Galaxy services.

## Observability

### Metrics (OpenTelemetry, low cardinality)

- `rtmanager.start_outcomes` — counter, labels `outcome`, `error_code`, `op_source`.
- `rtmanager.stop_outcomes` — counter, labels `outcome`, `reason`, `op_source`.
- `rtmanager.restart_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.patch_outcomes` — counter, labels `outcome`, `error_code`.
- `rtmanager.cleanup_outcomes` — counter, labels `outcome`, `op_source`.
- `rtmanager.docker_op_latency` — histogram, label `op` (`pull | create | start | stop | rm | inspect | events`).
- `rtmanager.health_events` — counter, label `event_type`.
- `rtmanager.reconcile_drift` — counter, label `kind` (`adopt | dispose | observed_exited`).
- `rtmanager.runtime_records_by_status` — gauge, label `status`.
- `rtmanager.lease_acquire_latency` — histogram.
- `rtmanager.notification_intents` — counter, label `notification_type`.

### Structured logs (slog JSON to stdout)

Common fields on every entry: `service=rtmanager`, `request_id`, `trace_id`, `span_id`,
`game_id` (when known), `container_id` (when known), `op_kind`, `op_source`, `outcome`,
`error_code`.

Worker-specific fields: `stream_entry_id` (consumers), `event_type` (health), `image_ref`
(start/patch).

## Verification

Service-level (TESTING.md §7):

- Unit tests for every service-layer operation against mocked Docker.
- Adapter tests (PG, Redis, Docker) using `testcontainers-go` for PG/Redis and the Docker
  daemon socket for the real Docker adapter.
- Contract tests for `internal-openapi.yaml`, `runtime-jobs-asyncapi.yaml`,
  `runtime-health-asyncapi.yaml`.

Service-local integration suite under `rtmanager/integration/`:

- Lifecycle end-to-end (start, inspect, stop, restart, patch, cleanup) against the real
  `galaxy/game` test image.
- Replay safety (duplicate stream entries are no-ops).
- Health observability (kill the engine externally, observe `container_disappeared`; relaunch
  manually, observe reconcile adopt).
- Notification on first-touch failures (publish a start with an unresolvable image, observe
  `runtime.image_pull_failed` intent and a `failure` job result).

Inter-service suite under `integration/lobbyrtm/`:

- Real Lobby + real RTM + real `galaxy/game` test image. Covers happy path, cancel, and
  start-failed flows.

Manual smoke (development):

```sh
docker network create galaxy-net # once
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games \
RTMANAGER_DOCKER_NETWORK=galaxy-net \
RTMANAGER_INTERNAL_HTTP_ADDR=:8096 \
... go run ./rtmanager/cmd/rtmanager
```

After start, `curl http://localhost:8096/readyz` returns `200`. Driving Lobby through its
public flow brings up `galaxy-game-{game_id}` containers; RTM logs each lifecycle transition
and publishes the corresponding stream entries.
@@ -0,0 +1,534 @@
|
||||
openapi: 3.0.3
|
||||
info:
|
||||
title: Galaxy Runtime Manager Internal REST API
|
||||
version: v1
|
||||
description: |
|
||||
This specification documents the internal trusted REST contract of
|
||||
`galaxy/rtmanager` served on `RTMANAGER_INTERNAL_HTTP_ADDR`
|
||||
(default `:8096`).
|
||||
|
||||
The listener is not reachable from the public internet. Two caller
|
||||
classes use it: `Game Master` (inspect / restart / patch / stop /
|
||||
cleanup) and `Admin Service` (operational tooling, including
|
||||
force-cleanup). Runtime Manager treats every caller on this port as
|
||||
trusted and performs no user-level authorization; downstream services
|
||||
rely on network segmentation. There is no `X-User-ID` header
|
||||
contract.
|
||||
|
||||
Transport rules:
|
||||
- request bodies are strict JSON only; unknown fields are rejected
|
||||
with `invalid_request`;
|
||||
- error responses use `{ "error": { "code", "message" } }`, identical
|
||||
to the Lobby contract;
|
||||
- stable error codes are: `invalid_request`, `not_found`, `conflict`,
|
||||
`service_unavailable`, `internal_error`, `image_pull_failed`,
|
||||
`image_ref_not_semver`, `semver_patch_only`,
|
||||
`container_start_failed`, `start_config_invalid`,
|
||||
`docker_unavailable`, `replay_no_op`.
|
||||
|
||||
Caller identification:
|
||||
- the optional `X-Galaxy-Caller` request header carries the calling
|
||||
service identity (`gm` for `Game Master`, `admin` for `Admin
|
||||
Service`). Runtime Manager records the value as `op_source` in
|
||||
the `operation_log` (`gm_rest` or `admin_rest`). When the header
|
||||
is missing or carries an unknown value, Runtime Manager defaults
|
||||
to `op_source = admin_rest`.
|
||||
servers:
|
||||
- url: http://localhost:8096
|
||||
description: Default local internal listener for Runtime Manager.
|
||||
tags:
|
||||
- name: Runtimes
|
||||
description: Runtime lifecycle endpoints called by Game Master and Admin Service.
|
||||
- name: Probes
|
||||
description: Health and readiness probes.
|
||||
paths:
|
||||
/healthz:
|
||||
get:
|
||||
tags:
|
||||
- Probes
|
||||
operationId: internalHealthz
|
||||
summary: Internal listener health probe
|
||||
responses:
|
||||
"200":
|
||||
description: Service is alive.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ProbeResponse"
|
||||
examples:
|
||||
ok:
|
||||
value:
|
||||
status: ok
|
||||
/readyz:
|
||||
get:
|
||||
tags:
|
||||
- Probes
|
||||
operationId: internalReadyz
|
||||
summary: Internal listener readiness probe
|
||||
description: |
|
||||
Returns `200` only when the PostgreSQL primary, Redis master, and
|
||||
Docker daemon are reachable and the configured Docker network
|
||||
exists. Returns `503` with the standard error envelope otherwise.
|
||||
responses:
|
||||
"200":
|
||||
description: Service is ready to serve traffic.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ProbeResponse"
|
||||
examples:
|
||||
ready:
|
||||
value:
|
||||
status: ready
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes:
|
||||
get:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalListRuntimes
|
||||
summary: List all known runtime records
|
||||
description: |
|
||||
Returns the full list of runtime records known to Runtime Manager.
|
||||
Pagination is not supported in v1 — the working set is bounded by
|
||||
the number of games tracked by Lobby and is small enough to return
|
||||
in one response.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
responses:
|
||||
"200":
|
||||
description: All runtime records.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimesList"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}:
|
||||
get:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalGetRuntime
|
||||
summary: Get one runtime record by game id
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record for the game.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"404":
|
||||
$ref: "#/components/responses/NotFoundError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}/start:
|
||||
post:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalStartRuntime
|
||||
summary: Start a game engine container
|
||||
description: |
|
||||
Pulls the supplied `image_ref` per the configured pull policy and
|
||||
creates the engine container. Idempotent: a re-start with the same
|
||||
`image_ref` for an already-running record returns `200` with the
|
||||
current record and `error_code=replay_no_op` recorded in the
|
||||
operation log.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
requestBody:
|
||||
required: true
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/StartRequest"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record after the start operation.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"400":
|
||||
$ref: "#/components/responses/InvalidRequestError"
|
||||
"409":
|
||||
$ref: "#/components/responses/ConflictError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}/stop:
|
||||
post:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalStopRuntime
|
||||
summary: Stop a running game engine container
|
||||
description: |
|
||||
Issues `docker stop` with the configured timeout. Idempotent: stop
|
||||
on a record that is already `stopped` or `removed` returns
|
||||
success with `error_code=replay_no_op` recorded in the operation
|
||||
log.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
requestBody:
|
||||
required: true
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/StopRequest"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record after the stop operation.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"400":
|
||||
$ref: "#/components/responses/InvalidRequestError"
|
||||
"404":
|
||||
$ref: "#/components/responses/NotFoundError"
|
||||
"409":
|
||||
$ref: "#/components/responses/ConflictError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}/restart:
|
||||
post:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalRestartRuntime
|
||||
summary: Recreate a game engine container with the same image
|
||||
description: |
|
||||
Stops, removes, and re-runs the container with the current
|
||||
`image_ref`. The container id changes; the engine endpoint stays
|
||||
stable.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record after the restart operation.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"404":
|
||||
$ref: "#/components/responses/NotFoundError"
|
||||
"409":
|
||||
$ref: "#/components/responses/ConflictError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}/patch:
|
||||
post:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalPatchRuntime
|
||||
summary: Recreate a game engine container with a new image
|
||||
description: |
|
||||
Restart with a new `image_ref`. Allowed only as a semver patch
|
||||
within the same major and minor line. Cross-major or cross-minor
|
||||
attempts return `409 conflict` with `error_code=semver_patch_only`.
|
||||
A non-semver `image_ref` returns `400 invalid_request` with
|
||||
`error_code=image_ref_not_semver`.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
requestBody:
|
||||
required: true
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/PatchRequest"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record after the patch operation.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"400":
|
||||
$ref: "#/components/responses/InvalidRequestError"
|
||||
"404":
|
||||
$ref: "#/components/responses/NotFoundError"
|
||||
"409":
|
||||
$ref: "#/components/responses/ConflictError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
/api/v1/internal/runtimes/{game_id}/container:
|
||||
delete:
|
||||
tags:
|
||||
- Runtimes
|
||||
operationId: internalCleanupRuntimeContainer
|
||||
summary: Remove an exited container
|
||||
description: |
|
||||
Calls `docker rm` for an already-stopped container and updates the
|
||||
runtime record to `removed`. Refuses with `409 conflict` if the
|
||||
record is still `running`. The host state directory is not
|
||||
deleted.
|
||||
parameters:
|
||||
- $ref: "#/components/parameters/GameIDPath"
|
||||
- $ref: "#/components/parameters/XGalaxyCallerHeader"
|
||||
responses:
|
||||
"200":
|
||||
description: Runtime record after the cleanup operation.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
"404":
|
||||
$ref: "#/components/responses/NotFoundError"
|
||||
"409":
|
||||
$ref: "#/components/responses/ConflictError"
|
||||
"500":
|
||||
$ref: "#/components/responses/InternalError"
|
||||
"503":
|
||||
$ref: "#/components/responses/ServiceUnavailableError"
|
||||
components:
|
||||
parameters:
|
||||
GameIDPath:
|
||||
name: game_id
|
||||
in: path
|
||||
required: true
|
||||
description: Opaque stable game identifier owned by Lobby.
|
||||
schema:
|
||||
type: string
|
||||
XGalaxyCallerHeader:
|
||||
name: X-Galaxy-Caller
|
||||
in: header
|
||||
required: false
|
||||
description: |
|
||||
Identifies the calling service so Runtime Manager can record the
|
||||
right `op_source` in `operation_log` (`gm_rest` for `gm`,
|
||||
`admin_rest` for `admin`). Missing or unknown values default to
|
||||
`admin_rest`.
|
||||
schema:
|
||||
type: string
|
||||
enum:
|
||||
- gm
|
||||
- admin
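# Illustrative request (values are examples, not normative):
#   POST /api/v1/internal/runtimes/game-123/stop
#   X-Galaxy-Caller: gm
# is recorded in operation_log with op_source=gm_rest; omitting the header or
# sending an unknown value records op_source=admin_rest.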
|
||||
schemas:
|
||||
RuntimeRecord:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- game_id
|
||||
- status
|
||||
- state_path
|
||||
- docker_network
|
||||
- last_op_at
|
||||
- created_at
|
||||
properties:
|
||||
game_id:
|
||||
type: string
|
||||
description: Opaque stable game identifier owned by Lobby.
|
||||
status:
|
||||
type: string
|
||||
enum:
|
||||
- running
|
||||
- stopped
|
||||
- removed
|
||||
description: Current runtime status maintained by Runtime Manager.
|
||||
current_container_id:
|
||||
type: string
|
||||
nullable: true
|
||||
description: Docker container id; null when status is removed.
|
||||
current_image_ref:
|
||||
type: string
|
||||
nullable: true
|
||||
description: Image reference of the current container; null when status is removed.
|
||||
engine_endpoint:
|
||||
type: string
|
||||
nullable: true
|
||||
description: Stable engine URL `http://galaxy-game-{game_id}:8080`; null when status is removed.
|
||||
state_path:
|
||||
type: string
|
||||
description: Absolute host path of the per-game bind-mounted state directory.
|
||||
docker_network:
|
||||
type: string
|
||||
description: Docker network name observed when the container was created.
|
||||
started_at:
|
||||
type: string
|
||||
format: date-time
|
||||
nullable: true
|
||||
description: UTC timestamp of the most recent successful start.
|
||||
stopped_at:
|
||||
type: string
|
||||
format: date-time
|
||||
nullable: true
|
||||
description: UTC timestamp of the most recent stop.
|
||||
removed_at:
|
||||
type: string
|
||||
format: date-time
|
||||
nullable: true
|
||||
description: UTC timestamp of the most recent container removal.
|
||||
last_op_at:
|
||||
type: string
|
||||
format: date-time
|
||||
description: UTC timestamp of the most recent operation; drives retention TTL.
|
||||
created_at:
|
||||
type: string
|
||||
format: date-time
|
||||
description: UTC timestamp of the first observation of this game.
|
||||
RuntimesList:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- items
|
||||
properties:
|
||||
items:
|
||||
type: array
|
||||
items:
|
||||
$ref: "#/components/schemas/RuntimeRecord"
|
||||
StartRequest:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- image_ref
|
||||
properties:
|
||||
image_ref:
|
||||
type: string
|
||||
description: Docker reference resolved by the producer (Game Master or Admin Service).
|
||||
StopRequest:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- reason
|
||||
properties:
|
||||
reason:
|
||||
$ref: "#/components/schemas/StopReason"
|
||||
PatchRequest:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- image_ref
|
||||
properties:
|
||||
image_ref:
|
||||
type: string
|
||||
description: New Docker reference within the same semver major and minor line.
|
||||
StopReason:
|
||||
type: string
|
||||
enum:
|
||||
- orphan_cleanup
|
||||
- cancelled
|
||||
- finished
|
||||
- admin_request
|
||||
- timeout
|
||||
description: Reason carried in the stop envelope and recorded in the operation log.
|
||||
ErrorCode:
|
||||
type: string
|
||||
enum:
|
||||
- invalid_request
|
||||
- not_found
|
||||
- conflict
|
||||
- service_unavailable
|
||||
- internal_error
|
||||
- image_pull_failed
|
||||
- image_ref_not_semver
|
||||
- semver_patch_only
|
||||
- container_start_failed
|
||||
- start_config_invalid
|
||||
- docker_unavailable
|
||||
- replay_no_op
|
||||
description: Stable internal API error code.
|
||||
ProbeResponse:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- status
|
||||
properties:
|
||||
status:
|
||||
type: string
|
||||
ErrorResponse:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- error
|
||||
properties:
|
||||
error:
|
||||
$ref: "#/components/schemas/ErrorBody"
|
||||
ErrorBody:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- code
|
||||
- message
|
||||
properties:
|
||||
code:
|
||||
$ref: "#/components/schemas/ErrorCode"
|
||||
message:
|
||||
type: string
|
||||
description: Human-readable, service-generated error message (trusted content).
|
||||
responses:
|
||||
InvalidRequestError:
|
||||
description: Request validation failed.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ErrorResponse"
|
||||
examples:
|
||||
invalidRequest:
|
||||
value:
|
||||
error:
|
||||
code: invalid_request
|
||||
message: request is invalid
|
||||
NotFoundError:
|
||||
description: The requested runtime record does not exist.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ErrorResponse"
|
||||
examples:
|
||||
notFound:
|
||||
value:
|
||||
error:
|
||||
code: not_found
|
||||
message: runtime record not found
|
||||
ConflictError:
|
||||
description: The requested operation is not allowed in the current runtime state.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ErrorResponse"
|
||||
examples:
|
||||
conflict:
|
||||
value:
|
||||
error:
|
||||
code: conflict
|
||||
message: operation not allowed in current status
|
||||
InternalError:
|
||||
description: Unexpected internal service error.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ErrorResponse"
|
||||
examples:
|
||||
internal:
|
||||
value:
|
||||
error:
|
||||
code: internal_error
|
||||
message: internal server error
|
||||
ServiceUnavailableError:
|
||||
description: An upstream dependency is unavailable.
|
||||
content:
|
||||
application/json:
|
||||
schema:
|
||||
$ref: "#/components/schemas/ErrorResponse"
|
||||
examples:
|
||||
unavailable:
|
||||
value:
|
||||
error:
|
||||
code: service_unavailable
|
||||
message: service is unavailable
|
||||
@@ -0,0 +1,195 @@
|
||||
asyncapi: 3.1.0
|
||||
info:
|
||||
title: Galaxy Runtime Health Events Contract
|
||||
version: 1.0.0
|
||||
description: |
|
||||
Stable Redis Streams contract for technical container health events
|
||||
published by `Runtime Manager`. Consumers include `Game Master`;
|
||||
`Game Lobby` and `Admin Service` are reserved as future consumers.
|
||||
|
||||
Three independent sources feed this stream: the Docker events
|
||||
listener, the periodic Docker inspect worker, and the active HTTP
|
||||
`/healthz` probe. Every emission also upserts the latest snapshot
|
||||
into `health_snapshots` in PostgreSQL.
|
||||
|
||||
Polymorphism: the `details` field carries an `event_type`-specific
|
||||
payload selected via `oneOf` per type. Each variant is a closed object
|
||||
(no unknown fields).
|
||||
|
||||
The `event_type` enum is fixed in this contract; adding a new value
|
||||
requires a contract bump and a coordinated consumer change.
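# Illustrative probe_recovered payload (values are hypothetical; the shape is
# defined by RuntimeHealthEventPayload and ProbeRecoveredDetails below):
#   game_id: game-123
#   container_id: 7c2b5d1a4f6e
#   event_type: probe_recovered
#   occurred_at_ms: 1775121900000
#   details:
#     prior_failure_count: 3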
|
||||
channels:
|
||||
healthEvents:
|
||||
address: runtime:health_events
|
||||
messages:
|
||||
runtimeHealthEvent:
|
||||
$ref: '#/components/messages/RuntimeHealthEvent'
|
||||
operations:
|
||||
publishHealthEvent:
|
||||
action: send
|
||||
summary: Publish one technical health event for downstream consumers.
|
||||
channel:
|
||||
$ref: '#/channels/healthEvents'
|
||||
messages:
|
||||
- $ref: '#/channels/healthEvents/messages/runtimeHealthEvent'
|
||||
components:
|
||||
messages:
|
||||
RuntimeHealthEvent:
|
||||
name: RuntimeHealthEvent
|
||||
title: Runtime health event
|
||||
summary: One technical health observation about a game engine container.
|
||||
payload:
|
||||
$ref: '#/components/schemas/RuntimeHealthEventPayload'
|
||||
examples:
|
||||
- name: containerStarted
|
||||
summary: Engine container has been created and started.
|
||||
payload:
|
||||
game_id: game-123
|
||||
container_id: 7c2b5d1a4f6e
|
||||
event_type: container_started
|
||||
occurred_at_ms: 1775121700000
|
||||
details:
|
||||
image_ref: registry.example.com/galaxy/game:1.4.7
|
||||
- name: containerExited
|
||||
summary: Engine container terminated with a non-zero exit code.
|
||||
payload:
|
||||
game_id: game-123
|
||||
container_id: 7c2b5d1a4f6e
|
||||
event_type: container_exited
|
||||
occurred_at_ms: 1775121800000
|
||||
details:
|
||||
exit_code: 137
|
||||
oom: false
|
||||
- name: probeFailed
|
||||
summary: Active probe observed three consecutive failures.
|
||||
payload:
|
||||
game_id: game-123
|
||||
container_id: 7c2b5d1a4f6e
|
||||
event_type: probe_failed
|
||||
occurred_at_ms: 1775121810000
|
||||
details:
|
||||
consecutive_failures: 3
|
||||
last_status: 0
|
||||
last_error: "context deadline exceeded"
|
||||
schemas:
|
||||
RuntimeHealthEventPayload:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- game_id
|
||||
- container_id
|
||||
- event_type
|
||||
- occurred_at_ms
|
||||
- details
|
||||
properties:
|
||||
game_id:
|
||||
type: string
|
||||
description: Opaque stable game identifier owned by Lobby.
|
||||
container_id:
|
||||
type: string
|
||||
description: Docker container id observed by Runtime Manager. May differ from the current container id after a restart race.
|
||||
event_type:
|
||||
$ref: '#/components/schemas/EventType'
|
||||
occurred_at_ms:
|
||||
type: integer
|
||||
format: int64
|
||||
description: UTC milliseconds when Runtime Manager observed the event.
|
||||
details:
|
||||
oneOf:
|
||||
- $ref: '#/components/schemas/ContainerStartedDetails'
|
||||
- $ref: '#/components/schemas/ContainerExitedDetails'
|
||||
- $ref: '#/components/schemas/ContainerOomDetails'
|
||||
- $ref: '#/components/schemas/ContainerDisappearedDetails'
|
||||
- $ref: '#/components/schemas/InspectUnhealthyDetails'
|
||||
- $ref: '#/components/schemas/ProbeFailedDetails'
|
||||
- $ref: '#/components/schemas/ProbeRecoveredDetails'
|
||||
description: Polymorphic payload selected by event_type.
|
||||
EventType:
|
||||
type: string
|
||||
enum:
|
||||
- container_started
|
||||
- container_exited
|
||||
- container_oom
|
||||
- container_disappeared
|
||||
- inspect_unhealthy
|
||||
- probe_failed
|
||||
- probe_recovered
|
||||
description: Discriminator selecting the details variant.
|
||||
ContainerStartedDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- image_ref
|
||||
properties:
|
||||
image_ref:
|
||||
type: string
|
||||
description: Image reference of the started container.
|
||||
ContainerExitedDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- exit_code
|
||||
- oom
|
||||
properties:
|
||||
exit_code:
|
||||
type: integer
|
||||
description: Exit code reported by Docker.
|
||||
oom:
|
||||
type: boolean
|
||||
description: True when the container was killed by the OOM killer.
|
||||
ContainerOomDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- exit_code
|
||||
properties:
|
||||
exit_code:
|
||||
type: integer
|
||||
description: Exit code reported by Docker for the OOM event.
|
||||
ContainerDisappearedDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
description: Empty payload; emitted when a Docker destroy event is observed for a tracked container whose removal Runtime Manager itself did not initiate.
|
||||
InspectUnhealthyDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- restart_count
|
||||
- state
|
||||
- health
|
||||
properties:
|
||||
restart_count:
|
||||
type: integer
|
||||
description: Docker RestartCount observed at this inspection.
|
||||
state:
|
||||
type: string
|
||||
description: Docker State.Status observed at this inspection.
|
||||
health:
|
||||
type: string
|
||||
description: Docker State.Health.Status observed at this inspection; empty when the image declares no HEALTHCHECK.
|
||||
ProbeFailedDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- consecutive_failures
|
||||
- last_status
|
||||
- last_error
|
||||
properties:
|
||||
consecutive_failures:
|
||||
type: integer
|
||||
description: Number of consecutive probe failures that crossed the threshold.
|
||||
last_status:
|
||||
type: integer
|
||||
description: HTTP status of the last probe attempt; 0 when the probe failed before receiving a response.
|
||||
last_error:
|
||||
type: string
|
||||
description: Operator-readable error of the last probe attempt; empty when not applicable.
|
||||
ProbeRecoveredDetails:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- prior_failure_count
|
||||
properties:
|
||||
prior_failure_count:
|
||||
type: integer
|
||||
description: Number of consecutive failures observed immediately before the recovery.
|
||||
@@ -0,0 +1,226 @@
|
||||
asyncapi: 3.1.0
|
||||
info:
|
||||
title: Galaxy Runtime Jobs Stream Contract
|
||||
version: 1.0.0
|
||||
description: |
|
||||
Stable Redis Streams contract carrying runtime jobs between
|
||||
`Game Lobby` and `Runtime Manager`.
|
||||
|
||||
`Game Lobby` is the sole producer for `runtime:start_jobs` and
|
||||
`runtime:stop_jobs`. `Runtime Manager` consumes both, executes the
|
||||
Docker work, and publishes one outcome per job to `runtime:job_results`,
|
||||
which is consumed by `Game Lobby`'s runtime-job-result worker.
|
||||
|
||||
Replay safety:
|
||||
- duplicate start jobs for an already-running game with the same
|
||||
`image_ref` produce a `success` job result with
|
||||
`error_code=replay_no_op`;
|
||||
- duplicate stop jobs for an already-stopped or already-removed game
|
||||
produce a `success` job result with `error_code=replay_no_op`.
|
||||
|
||||
The `reason` enum on `runtime:stop_jobs` is fixed in this contract.
|
||||
Adding a new value requires a contract bump and a coordinated
|
||||
Lobby/Runtime Manager change.
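# Topology sketch (informational only):
#   Game Lobby      -->  runtime:start_jobs, runtime:stop_jobs  -->  Runtime Manager
#   Runtime Manager -->  runtime:job_results                    -->  Game Lobby
# Replay example: a duplicate start job for an already-running game with the
# same image_ref is answered with outcome=success and error_code=replay_no_op
# (see the replayNoOp message example below).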
|
||||
channels:
|
||||
startJobs:
|
||||
address: runtime:start_jobs
|
||||
messages:
|
||||
runtimeStartJob:
|
||||
$ref: '#/components/messages/RuntimeStartJob'
|
||||
stopJobs:
|
||||
address: runtime:stop_jobs
|
||||
messages:
|
||||
runtimeStopJob:
|
||||
$ref: '#/components/messages/RuntimeStopJob'
|
||||
jobResults:
|
||||
address: runtime:job_results
|
||||
messages:
|
||||
runtimeJobResult:
|
||||
$ref: '#/components/messages/RuntimeJobResult'
|
||||
operations:
|
||||
consumeStartJob:
|
||||
action: receive
|
||||
summary: Receive one start job from Game Lobby and run a container.
|
||||
channel:
|
||||
$ref: '#/channels/startJobs'
|
||||
messages:
|
||||
- $ref: '#/channels/startJobs/messages/runtimeStartJob'
|
||||
consumeStopJob:
|
||||
action: receive
|
||||
summary: Receive one stop job from Game Lobby and stop a container.
|
||||
channel:
|
||||
$ref: '#/channels/stopJobs'
|
||||
messages:
|
||||
- $ref: '#/channels/stopJobs/messages/runtimeStopJob'
|
||||
publishJobResult:
|
||||
action: send
|
||||
summary: Publish one runtime job outcome for Game Lobby.
|
||||
channel:
|
||||
$ref: '#/channels/jobResults'
|
||||
messages:
|
||||
- $ref: '#/channels/jobResults/messages/runtimeJobResult'
|
||||
components:
|
||||
messages:
|
||||
RuntimeStartJob:
|
||||
name: RuntimeStartJob
|
||||
title: Runtime start job
|
||||
summary: Lobby request to start one game engine container.
|
||||
payload:
|
||||
$ref: '#/components/schemas/RuntimeStartJobPayload'
|
||||
examples:
|
||||
- name: startJob
|
||||
summary: Start a game engine container with a producer-resolved image_ref.
|
||||
payload:
|
||||
game_id: game-123
|
||||
image_ref: registry.example.com/galaxy/game:1.4.7
|
||||
requested_at_ms: 1775121700000
|
||||
RuntimeStopJob:
|
||||
name: RuntimeStopJob
|
||||
title: Runtime stop job
|
||||
summary: Lobby request to stop one game engine container.
|
||||
payload:
|
||||
$ref: '#/components/schemas/RuntimeStopJobPayload'
|
||||
examples:
|
||||
- name: cancelled
|
||||
summary: Stop the engine because the game was cancelled.
|
||||
payload:
|
||||
game_id: game-123
|
||||
reason: cancelled
|
||||
requested_at_ms: 1775121800000
|
||||
- name: orphanCleanup
|
||||
summary: Stop an engine whose Lobby metadata persistence failed.
|
||||
payload:
|
||||
game_id: game-456
|
||||
reason: orphan_cleanup
|
||||
requested_at_ms: 1775121810000
|
||||
RuntimeJobResult:
|
||||
name: RuntimeJobResult
|
||||
title: Runtime job result
|
||||
summary: Outcome of one start or stop job.
|
||||
payload:
|
||||
$ref: '#/components/schemas/RuntimeJobResultPayload'
|
||||
examples:
|
||||
- name: startSuccess
|
||||
summary: Successful start, container_id and engine_endpoint are populated.
|
||||
payload:
|
||||
game_id: game-123
|
||||
outcome: success
|
||||
container_id: 7c2b5d1a4f6e
|
||||
engine_endpoint: http://galaxy-game-game-123:8080
|
||||
error_code: ""
|
||||
error_message: ""
|
||||
- name: imagePullFailed
|
||||
summary: Failed start due to an image pull error.
|
||||
payload:
|
||||
game_id: game-789
|
||||
outcome: failure
|
||||
container_id: ""
|
||||
engine_endpoint: ""
|
||||
error_code: image_pull_failed
|
||||
error_message: "manifest unknown"
|
||||
- name: replayNoOp
|
||||
summary: Idempotent replay; the job was a no-op.
|
||||
payload:
|
||||
game_id: game-123
|
||||
outcome: success
|
||||
container_id: 7c2b5d1a4f6e
|
||||
engine_endpoint: http://galaxy-game-game-123:8080
|
||||
error_code: replay_no_op
|
||||
error_message: ""
|
||||
schemas:
|
||||
RuntimeStartJobPayload:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- game_id
|
||||
- image_ref
|
||||
- requested_at_ms
|
||||
properties:
|
||||
game_id:
|
||||
type: string
|
||||
description: Opaque stable game identifier owned by Lobby.
|
||||
image_ref:
|
||||
type: string
|
||||
description: Docker reference resolved by Lobby from LOBBY_ENGINE_IMAGE_TEMPLATE.
|
||||
requested_at_ms:
|
||||
type: integer
|
||||
format: int64
|
||||
description: UTC milliseconds; used for diagnostics, not authoritative.
|
||||
RuntimeStopJobPayload:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- game_id
|
||||
- reason
|
||||
- requested_at_ms
|
||||
properties:
|
||||
game_id:
|
||||
type: string
|
||||
description: Opaque stable game identifier owned by Lobby.
|
||||
reason:
|
||||
$ref: '#/components/schemas/StopReason'
|
||||
requested_at_ms:
|
||||
type: integer
|
||||
format: int64
|
||||
description: UTC milliseconds; used for diagnostics, not authoritative.
|
||||
RuntimeJobResultPayload:
|
||||
type: object
|
||||
additionalProperties: false
|
||||
required:
|
||||
- game_id
|
||||
- outcome
|
||||
- container_id
|
||||
- engine_endpoint
|
||||
- error_code
|
||||
- error_message
|
||||
properties:
|
||||
game_id:
|
||||
type: string
|
||||
description: Opaque stable game identifier matching the originating job.
|
||||
outcome:
|
||||
type: string
|
||||
enum:
|
||||
- success
|
||||
- failure
|
||||
description: High-level outcome of the runtime job.
|
||||
container_id:
|
||||
type: string
|
||||
description: Docker container id of the engine; populated on success, empty on failure.
|
||||
engine_endpoint:
|
||||
type: string
|
||||
description: Stable engine URL `http://galaxy-game-{game_id}:8080`; populated on success, empty on failure.
|
||||
error_code:
|
||||
$ref: '#/components/schemas/ErrorCode'
|
||||
error_message:
|
||||
type: string
|
||||
description: Operator-readable detail; empty when not applicable.
|
||||
StopReason:
|
||||
type: string
|
||||
enum:
|
||||
- orphan_cleanup
|
||||
- cancelled
|
||||
- finished
|
||||
- admin_request
|
||||
- timeout
|
||||
description: Reason value carried by every runtime:stop_jobs envelope.
|
||||
ErrorCode:
|
||||
type: string
|
||||
enum:
|
||||
- ""
|
||||
- invalid_request
|
||||
- not_found
|
||||
- conflict
|
||||
- service_unavailable
|
||||
- internal_error
|
||||
- image_pull_failed
|
||||
- image_ref_not_semver
|
||||
- semver_patch_only
|
||||
- container_start_failed
|
||||
- start_config_invalid
|
||||
- docker_unavailable
|
||||
- replay_no_op
|
||||
description: |
|
||||
Stable error code identical to the internal REST contract. The empty
|
||||
string is a valid value for successful job results that did not
|
||||
produce a code (the field is required to be present so consumers
|
||||
can rely on the schema).
|
||||
@@ -0,0 +1,236 @@
|
||||
// Command jetgen regenerates the go-jet/v2 query-builder code under
|
||||
// galaxy/rtmanager/internal/adapters/postgres/jet/ against a transient
|
||||
// PostgreSQL instance.
|
||||
//
|
||||
// The program is intended to be invoked as `go run ./cmd/jetgen` (or via
|
||||
// the `make jet` Makefile target) from within `galaxy/rtmanager`. It is
|
||||
// not part of the runtime binary.
|
||||
//
|
||||
// Steps:
|
||||
//
|
||||
// 1. start a postgres:16-alpine container via testcontainers-go
|
||||
// 2. open it through pkg/postgres as the superuser
|
||||
// 3. CREATE ROLE rtmanagerservice and CREATE SCHEMA "rtmanager"
|
||||
// AUTHORIZATION rtmanagerservice
|
||||
// 4. open a second pool as rtmanagerservice with search_path=rtmanager
|
||||
// and apply the embedded goose migrations
|
||||
// 5. run jet's PostgreSQL generator against schema=rtmanager, writing
|
||||
// into ../internal/adapters/postgres/jet
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"log"
|
||||
"net/url"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/migrations"
|
||||
|
||||
jetpostgres "github.com/go-jet/jet/v2/generator/postgres"
|
||||
testcontainers "github.com/testcontainers/testcontainers-go"
|
||||
tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
|
||||
"github.com/testcontainers/testcontainers-go/wait"
|
||||
)
|
||||
|
||||
const (
|
||||
postgresImage = "postgres:16-alpine"
|
||||
superuserName = "galaxy"
|
||||
superuserPassword = "galaxy"
|
||||
superuserDatabase = "galaxy_rtmanager"
|
||||
serviceRole = "rtmanagerservice"
|
||||
servicePassword = "rtmanagerservice"
|
||||
serviceSchema = "rtmanager"
|
||||
containerStartup = 90 * time.Second
|
||||
defaultOpTimeout = 10 * time.Second
|
||||
jetOutputDirSuffix = "internal/adapters/postgres/jet"
|
||||
)
|
||||
|
||||
func main() {
|
||||
if err := run(context.Background()); err != nil {
|
||||
log.Fatalf("jetgen: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
func run(ctx context.Context) error {
|
||||
outputDir, err := jetOutputDir()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
container, err := tcpostgres.Run(ctx, postgresImage,
|
||||
tcpostgres.WithDatabase(superuserDatabase),
|
||||
tcpostgres.WithUsername(superuserName),
|
||||
tcpostgres.WithPassword(superuserPassword),
|
||||
testcontainers.WithWaitStrategy(
|
||||
wait.ForLog("database system is ready to accept connections").
|
||||
WithOccurrence(2).
|
||||
WithStartupTimeout(containerStartup),
|
||||
),
|
||||
)
|
||||
if err != nil {
|
||||
return fmt.Errorf("start postgres container: %w", err)
|
||||
}
|
||||
defer func() {
|
||||
if termErr := testcontainers.TerminateContainer(container); termErr != nil {
|
||||
log.Printf("jetgen: terminate container: %v", termErr)
|
||||
}
|
||||
}()
|
||||
|
||||
baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
|
||||
if err != nil {
|
||||
return fmt.Errorf("resolve container dsn: %w", err)
|
||||
}
|
||||
|
||||
if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
scopedDSN, err := dsnForServiceRole(baseDSN)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if err := applyMigrations(ctx, scopedDSN); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
if err := os.RemoveAll(outputDir); err != nil {
|
||||
return fmt.Errorf("remove existing jet output %q: %w", outputDir, err)
|
||||
}
|
||||
if err := os.MkdirAll(filepath.Dir(outputDir), 0o755); err != nil {
|
||||
return fmt.Errorf("ensure jet output parent: %w", err)
|
||||
}
|
||||
|
||||
jetCfg := postgres.DefaultConfig()
|
||||
jetCfg.PrimaryDSN = scopedDSN
|
||||
jetCfg.OperationTimeout = defaultOpTimeout
|
||||
jetDB, err := postgres.OpenPrimary(ctx, jetCfg)
|
||||
if err != nil {
|
||||
return fmt.Errorf("open scoped pool for jet generation: %w", err)
|
||||
}
|
||||
defer func() { _ = jetDB.Close() }()
|
||||
|
||||
if err := jetpostgres.GenerateDB(jetDB, serviceSchema, outputDir); err != nil {
|
||||
return fmt.Errorf("jet generate: %w", err)
|
||||
}
|
||||
|
||||
log.Printf("jetgen: generated jet code into %s (schema=%s)", outputDir, serviceSchema)
|
||||
return nil
|
||||
}
|
||||
|
||||
func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
|
||||
cfg := postgres.DefaultConfig()
|
||||
cfg.PrimaryDSN = baseDSN
|
||||
cfg.OperationTimeout = defaultOpTimeout
|
||||
db, err := postgres.OpenPrimary(ctx, cfg)
|
||||
if err != nil {
|
||||
return fmt.Errorf("open admin pool: %w", err)
|
||||
}
|
||||
defer func() { _ = db.Close() }()
|
||||
|
||||
statements := []string{
|
||||
fmt.Sprintf(`DO $$ BEGIN
|
||||
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = %s) THEN
|
||||
CREATE ROLE %s LOGIN PASSWORD %s;
|
||||
END IF;
|
||||
END $$;`, sqlLiteral(serviceRole), sqlIdentifier(serviceRole), sqlLiteral(servicePassword)),
|
||||
fmt.Sprintf(`CREATE SCHEMA IF NOT EXISTS %s AUTHORIZATION %s;`,
|
||||
sqlIdentifier(serviceSchema), sqlIdentifier(serviceRole)),
|
||||
fmt.Sprintf(`GRANT USAGE ON SCHEMA %s TO %s;`,
|
||||
sqlIdentifier(serviceSchema), sqlIdentifier(serviceRole)),
|
||||
}
|
||||
for _, statement := range statements {
|
||||
if _, err := db.ExecContext(ctx, statement); err != nil {
|
||||
return fmt.Errorf("provision %q/%q: %w", serviceSchema, serviceRole, err)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func dsnForServiceRole(baseDSN string) (string, error) {
|
||||
parsed, err := url.Parse(baseDSN)
|
||||
if err != nil {
|
||||
return "", fmt.Errorf("parse base dsn: %w", err)
|
||||
}
|
||||
values := url.Values{}
|
||||
values.Set("search_path", serviceSchema)
|
||||
values.Set("sslmode", "disable")
|
||||
scoped := url.URL{
|
||||
Scheme: parsed.Scheme,
|
||||
User: url.UserPassword(serviceRole, servicePassword),
|
||||
Host: parsed.Host,
|
||||
Path: parsed.Path,
|
||||
RawQuery: values.Encode(),
|
||||
}
|
||||
return scoped.String(), nil
|
||||
}
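
// For illustration only (the host and port are whatever testcontainers
// assigned): a base DSN such as
//
//	postgres://galaxy:galaxy@127.0.0.1:49153/galaxy_rtmanager?sslmode=disable
//
// is rewritten by dsnForServiceRole into
//
//	postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:49153/galaxy_rtmanager?search_path=rtmanager&sslmode=disable
//
// so migrations and jet generation both run as the service role with the
// rtmanager schema on the search path.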
|
||||
|
||||
func applyMigrations(ctx context.Context, dsn string) error {
|
||||
cfg := postgres.DefaultConfig()
|
||||
cfg.PrimaryDSN = dsn
|
||||
cfg.OperationTimeout = defaultOpTimeout
|
||||
db, err := postgres.OpenPrimary(ctx, cfg)
|
||||
if err != nil {
|
||||
return fmt.Errorf("open scoped pool: %w", err)
|
||||
}
|
||||
defer func() { _ = db.Close() }()
|
||||
|
||||
if err := postgres.Ping(ctx, db, defaultOpTimeout); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := postgres.RunMigrations(ctx, db, migrations.FS(), "."); err != nil {
|
||||
return fmt.Errorf("run migrations: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// jetOutputDir returns the absolute path that jet should write into. We
|
||||
// rely on the runtime caller info to anchor it to galaxy/rtmanager
|
||||
// regardless of the invoking working directory.
|
||||
func jetOutputDir() (string, error) {
|
||||
_, file, _, ok := runtime.Caller(0)
|
||||
if !ok {
|
||||
return "", errors.New("resolve runtime caller for jet output path")
|
||||
}
|
||||
dir := filepath.Dir(file)
|
||||
// dir = .../galaxy/rtmanager/cmd/jetgen
|
||||
moduleRoot := filepath.Clean(filepath.Join(dir, "..", ".."))
|
||||
return filepath.Join(moduleRoot, jetOutputDirSuffix), nil
|
||||
}
|
||||
|
||||
func sqlIdentifier(name string) string {
|
||||
return `"` + escapeDoubleQuotes(name) + `"`
|
||||
}
|
||||
|
||||
func sqlLiteral(value string) string {
|
||||
return "'" + escapeSingleQuotes(value) + "'"
|
||||
}
|
||||
|
||||
func escapeDoubleQuotes(value string) string {
|
||||
out := make([]byte, 0, len(value))
|
||||
for index := 0; index < len(value); index++ {
|
||||
if value[index] == '"' {
|
||||
out = append(out, '"', '"')
|
||||
continue
|
||||
}
|
||||
out = append(out, value[index])
|
||||
}
|
||||
return string(out)
|
||||
}
|
||||
|
||||
func escapeSingleQuotes(value string) string {
|
||||
out := make([]byte, 0, len(value))
|
||||
for index := 0; index < len(value); index++ {
|
||||
if value[index] == '\'' {
|
||||
out = append(out, '\'', '\'')
|
||||
continue
|
||||
}
|
||||
out = append(out, value[index])
|
||||
}
|
||||
return string(out)
|
||||
}
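
// Worked example (illustrative, not called by the generator): quoting follows
// the PostgreSQL rule of doubling the delimiter, so
//
//	sqlIdentifier(`weird"role`) returns "weird""role"
//	sqlLiteral("it's")          returns 'it''s'
//
// which keeps the provisioning statements above safe for names containing quotes.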
|
||||
@@ -0,0 +1,47 @@
|
||||
// Binary rtmanager is the runnable Runtime Manager Service process
// entrypoint.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"

	"galaxy/rtmanager/internal/app"
	"galaxy/rtmanager/internal/config"
	"galaxy/rtmanager/internal/logging"
)

func main() {
	if err := run(); err != nil {
		_, _ = fmt.Fprintf(os.Stderr, "rtmanager: %v\n", err)
		os.Exit(1)
	}
}

func run() error {
	cfg, err := config.LoadFromEnv()
	if err != nil {
		return err
	}

	logger, err := logging.New(cfg.Logging.Level)
	if err != nil {
		return err
	}

	rootCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	runtime, err := app.NewRuntime(rootCtx, cfg, logger)
	if err != nil {
		return err
	}
	defer func() {
		_ = runtime.Close()
	}()

	return runtime.Run(rootCtx)
}
|
||||
@@ -0,0 +1,392 @@
|
||||
package rtmanager
|
||||
|
||||
import (
|
||||
"os"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"testing"
|
||||
|
||||
"github.com/stretchr/testify/require"
|
||||
"gopkg.in/yaml.v3"
|
||||
)
|
||||
|
||||
var expectedStopReasonEnum = []string{
|
||||
"orphan_cleanup",
|
||||
"cancelled",
|
||||
"finished",
|
||||
"admin_request",
|
||||
"timeout",
|
||||
}
|
||||
|
||||
var expectedJobResultErrorCodeEnum = []string{
|
||||
"",
|
||||
"invalid_request",
|
||||
"not_found",
|
||||
"conflict",
|
||||
"service_unavailable",
|
||||
"internal_error",
|
||||
"image_pull_failed",
|
||||
"image_ref_not_semver",
|
||||
"semver_patch_only",
|
||||
"container_start_failed",
|
||||
"start_config_invalid",
|
||||
"docker_unavailable",
|
||||
"replay_no_op",
|
||||
}
|
||||
|
||||
var expectedHealthEventTypeEnum = []string{
|
||||
"container_started",
|
||||
"container_exited",
|
||||
"container_oom",
|
||||
"container_disappeared",
|
||||
"inspect_unhealthy",
|
||||
"probe_failed",
|
||||
"probe_recovered",
|
||||
}
|
||||
|
||||
var expectedHealthDetailsBranches = []struct {
|
||||
schema string
|
||||
required []string
|
||||
}{
|
||||
{schema: "ContainerStartedDetails", required: []string{"image_ref"}},
|
||||
{schema: "ContainerExitedDetails", required: []string{"exit_code", "oom"}},
|
||||
{schema: "ContainerOomDetails", required: []string{"exit_code"}},
|
||||
{schema: "ContainerDisappearedDetails", required: nil},
|
||||
{schema: "InspectUnhealthyDetails", required: []string{"restart_count", "state", "health"}},
|
||||
{schema: "ProbeFailedDetails", required: []string{"consecutive_failures", "last_status", "last_error"}},
|
||||
{schema: "ProbeRecoveredDetails", required: []string{"prior_failure_count"}},
|
||||
}
|
||||
|
||||
func TestRuntimeJobsAsyncAPISpecLoads(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
require.Equal(t, "3.1.0", getStringValue(t, doc, "asyncapi"))
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesChannelAddresses(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
channels := getMapValue(t, doc, "channels")
|
||||
|
||||
require.Equal(t, "runtime:start_jobs",
|
||||
getStringValue(t, getMapValue(t, channels, "startJobs"), "address"))
|
||||
require.Equal(t, "runtime:stop_jobs",
|
||||
getStringValue(t, getMapValue(t, channels, "stopJobs"), "address"))
|
||||
require.Equal(t, "runtime:job_results",
|
||||
getStringValue(t, getMapValue(t, channels, "jobResults"), "address"))
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesOperationActions(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
operations := getMapValue(t, doc, "operations")
|
||||
|
||||
cases := []struct {
|
||||
operation string
|
||||
action string
|
||||
channel string
|
||||
}{
|
||||
{operation: "consumeStartJob", action: "receive", channel: "#/channels/startJobs"},
|
||||
{operation: "consumeStopJob", action: "receive", channel: "#/channels/stopJobs"},
|
||||
{operation: "publishJobResult", action: "send", channel: "#/channels/jobResults"},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.operation, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
op := getMapValue(t, operations, tc.operation)
|
||||
require.Equal(t, tc.action, getStringValue(t, op, "action"))
|
||||
require.Equal(t, tc.channel,
|
||||
getStringValue(t, getMapValue(t, op, "channel"), "$ref"))
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesMessageNames(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
messages := getMapValue(t, doc, "components", "messages")
|
||||
|
||||
for _, name := range []string{"RuntimeStartJob", "RuntimeStopJob", "RuntimeJobResult"} {
|
||||
t.Run(name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
message := getMapValue(t, messages, name)
|
||||
require.Equal(t, name, getStringValue(t, message, "name"))
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesStartJobPayload(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
payload := getMapValue(t, doc, "components", "schemas", "RuntimeStartJobPayload")
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{"game_id", "image_ref", "requested_at_ms"},
|
||||
getStringSlice(t, payload, "required"))
|
||||
require.False(t, getBoolValue(t, payload, "additionalProperties"),
|
||||
"RuntimeStartJobPayload must reject unknown fields")
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesStopJobPayload(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
payload := getMapValue(t, doc, "components", "schemas", "RuntimeStopJobPayload")
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{"game_id", "reason", "requested_at_ms"},
|
||||
getStringSlice(t, payload, "required"))
|
||||
require.False(t, getBoolValue(t, payload, "additionalProperties"),
|
||||
"RuntimeStopJobPayload must reject unknown fields")
|
||||
|
||||
reason := getMapValue(t, payload, "properties", "reason")
|
||||
require.Equal(t, "#/components/schemas/StopReason",
|
||||
getStringValue(t, reason, "$ref"),
|
||||
"RuntimeStopJobPayload.reason must reference StopReason")
|
||||
|
||||
stopReason := getMapValue(t, doc, "components", "schemas", "StopReason")
|
||||
require.ElementsMatch(t, expectedStopReasonEnum,
|
||||
getStringSlice(t, stopReason, "enum"))
|
||||
}
|
||||
|
||||
func TestRuntimeJobsSpecFreezesJobResultPayload(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-jobs-asyncapi.yaml"))
|
||||
payload := getMapValue(t, doc, "components", "schemas", "RuntimeJobResultPayload")
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{"game_id", "outcome", "container_id", "engine_endpoint", "error_code", "error_message"},
|
||||
getStringSlice(t, payload, "required"))
|
||||
require.False(t, getBoolValue(t, payload, "additionalProperties"),
|
||||
"RuntimeJobResultPayload must reject unknown fields")
|
||||
|
||||
outcome := getMapValue(t, payload, "properties", "outcome")
|
||||
require.ElementsMatch(t, []string{"success", "failure"},
|
||||
getStringSlice(t, outcome, "enum"))
|
||||
|
||||
errorCode := getMapValue(t, payload, "properties", "error_code")
|
||||
require.Equal(t, "#/components/schemas/ErrorCode",
|
||||
getStringValue(t, errorCode, "$ref"),
|
||||
"RuntimeJobResultPayload.error_code must reference ErrorCode")
|
||||
|
||||
errorCodeSchema := getMapValue(t, doc, "components", "schemas", "ErrorCode")
|
||||
require.ElementsMatch(t, expectedJobResultErrorCodeEnum,
|
||||
getStringSlice(t, errorCodeSchema, "enum"))
|
||||
}
|
||||
|
||||
func TestRuntimeHealthAsyncAPISpecLoads(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
|
||||
require.Equal(t, "3.1.0", getStringValue(t, doc, "asyncapi"))
|
||||
}
|
||||
|
||||
func TestRuntimeHealthSpecFreezesChannelAndOperation(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
|
||||
|
||||
channel := getMapValue(t, doc, "channels", "healthEvents")
|
||||
require.Equal(t, "runtime:health_events", getStringValue(t, channel, "address"))
|
||||
|
||||
operation := getMapValue(t, doc, "operations", "publishHealthEvent")
|
||||
require.Equal(t, "send", getStringValue(t, operation, "action"))
|
||||
require.Equal(t, "#/channels/healthEvents",
|
||||
getStringValue(t, getMapValue(t, operation, "channel"), "$ref"))
|
||||
|
||||
message := getMapValue(t, doc, "components", "messages", "RuntimeHealthEvent")
|
||||
require.Equal(t, "RuntimeHealthEvent", getStringValue(t, message, "name"))
|
||||
}
|
||||
|
||||
func TestRuntimeHealthSpecFreezesEnvelope(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
|
||||
payload := getMapValue(t, doc, "components", "schemas", "RuntimeHealthEventPayload")
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{"game_id", "container_id", "event_type", "occurred_at_ms", "details"},
|
||||
getStringSlice(t, payload, "required"))
|
||||
require.False(t, getBoolValue(t, payload, "additionalProperties"),
|
||||
"RuntimeHealthEventPayload must reject unknown fields")
|
||||
|
||||
eventType := getMapValue(t, payload, "properties", "event_type")
|
||||
require.Equal(t, "#/components/schemas/EventType",
|
||||
getStringValue(t, eventType, "$ref"),
|
||||
"RuntimeHealthEventPayload.event_type must reference EventType")
|
||||
}
|
||||
|
||||
func TestRuntimeHealthSpecFreezesEventTypeEnum(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
|
||||
schema := getMapValue(t, doc, "components", "schemas", "EventType")
|
||||
|
||||
require.ElementsMatch(t, expectedHealthEventTypeEnum,
|
||||
getStringSlice(t, schema, "enum"))
|
||||
}
|
||||
|
||||
func TestRuntimeHealthSpecFreezesDetailsOneOfBranches(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadAsyncAPISpec(t, filepath.Join("api", "runtime-health-asyncapi.yaml"))
|
||||
details := getMapValue(t, doc, "components", "schemas", "RuntimeHealthEventPayload",
|
||||
"properties", "details")
|
||||
|
||||
branches := getSliceValue(t, details, "oneOf")
|
||||
require.Lenf(t, branches, len(expectedHealthDetailsBranches),
|
||||
"details.oneOf must have %d branches", len(expectedHealthDetailsBranches))
|
||||
|
||||
gotRefs := make([]string, 0, len(branches))
|
||||
for _, raw := range branches {
|
||||
branch, ok := raw.(map[string]any)
|
||||
require.True(t, ok, "details.oneOf entry must be a mapping")
|
||||
gotRefs = append(gotRefs, getStringValue(t, branch, "$ref"))
|
||||
}
|
||||
|
||||
wantRefs := make([]string, 0, len(expectedHealthDetailsBranches))
|
||||
for _, branch := range expectedHealthDetailsBranches {
|
||||
wantRefs = append(wantRefs, "#/components/schemas/"+branch.schema)
|
||||
}
|
||||
require.ElementsMatch(t, wantRefs, gotRefs)
|
||||
|
||||
for _, branch := range expectedHealthDetailsBranches {
|
||||
t.Run(branch.schema, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
schema := getMapValue(t, doc, "components", "schemas", branch.schema)
|
||||
require.False(t, getBoolValue(t, schema, "additionalProperties"),
|
||||
"%s must reject unknown fields", branch.schema)
|
||||
if branch.required == nil {
|
||||
_, hasRequired := schema["required"]
|
||||
require.False(t, hasRequired,
|
||||
"%s must not declare required fields", branch.schema)
|
||||
return
|
||||
}
|
||||
require.ElementsMatch(t, branch.required,
|
||||
getStringSlice(t, schema, "required"))
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func loadAsyncAPISpec(t *testing.T, relativePath string) map[string]any {
|
||||
t.Helper()
|
||||
|
||||
payload := loadTextFile(t, relativePath)
|
||||
|
||||
var doc map[string]any
|
||||
if err := yaml.Unmarshal([]byte(payload), &doc); err != nil {
|
||||
require.Failf(t, "test failed", "decode spec: %v", err)
|
||||
}
|
||||
|
||||
return doc
|
||||
}
|
||||
|
||||
func loadTextFile(t *testing.T, relativePath string) string {
|
||||
t.Helper()
|
||||
|
||||
path := filepath.Join(moduleRoot(t), relativePath)
|
||||
payload, err := os.ReadFile(path)
|
||||
if err != nil {
|
||||
require.Failf(t, "test failed", "read file %s: %v", path, err)
|
||||
}
|
||||
|
||||
return string(payload)
|
||||
}
|
||||
|
||||
func moduleRoot(t *testing.T) string {
|
||||
t.Helper()
|
||||
|
||||
_, thisFile, _, ok := runtime.Caller(0)
|
||||
if !ok {
|
||||
require.FailNow(t, "runtime.Caller failed")
|
||||
}
|
||||
|
||||
return filepath.Dir(thisFile)
|
||||
}
|
||||
|
||||
func getMapValue(t *testing.T, value map[string]any, path ...string) map[string]any {
|
||||
t.Helper()
|
||||
|
||||
current := value
|
||||
for _, segment := range path {
|
||||
raw, ok := current[segment]
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "missing map key %s", segment)
|
||||
}
|
||||
next, ok := raw.(map[string]any)
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "value at %s is not a map", segment)
|
||||
}
|
||||
current = next
|
||||
}
|
||||
|
||||
return current
|
||||
}
|
||||
|
||||
func getStringValue(t *testing.T, value map[string]any, key string) string {
|
||||
t.Helper()
|
||||
|
||||
raw, ok := value[key]
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "missing key %s", key)
|
||||
}
|
||||
result, ok := raw.(string)
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "value at %s is not a string", key)
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
|
||||
func getBoolValue(t *testing.T, value map[string]any, key string) bool {
|
||||
t.Helper()
|
||||
|
||||
raw, ok := value[key]
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "missing key %s", key)
|
||||
}
|
||||
result, ok := raw.(bool)
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "value at %s is not a bool", key)
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
|
||||
func getStringSlice(t *testing.T, value map[string]any, key string) []string {
|
||||
t.Helper()
|
||||
|
||||
raw := getSliceValue(t, value, key)
|
||||
result := make([]string, 0, len(raw))
|
||||
for _, item := range raw {
|
||||
text, ok := item.(string)
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "value at %s is not a string slice", key)
|
||||
}
|
||||
result = append(result, text)
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
|
||||
func getSliceValue(t *testing.T, value map[string]any, key string) []any {
|
||||
t.Helper()
|
||||
|
||||
raw, ok := value[key]
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "missing key %s", key)
|
||||
}
|
||||
result, ok := raw.([]any)
|
||||
if !ok {
|
||||
require.Failf(t, "test failed", "value at %s is not a slice", key)
|
||||
}
|
||||
|
||||
return result
|
||||
}
|
||||
@@ -0,0 +1,384 @@
|
||||
package rtmanager
|
||||
|
||||
import (
|
||||
"context"
|
||||
"net/http"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"testing"
|
||||
|
||||
"github.com/getkin/kin-openapi/openapi3"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// TestInternalOpenAPISpecValidates loads internal-openapi.yaml and verifies
|
||||
// it is a syntactically valid OpenAPI 3.0 document.
|
||||
func TestInternalOpenAPISpecValidates(t *testing.T) {
|
||||
t.Parallel()
|
||||
loadInternalOpenAPISpec(t)
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesOperationIDs verifies that every documented
|
||||
// endpoint declares the exact operationId required by the Runtime Manager
|
||||
// internal contract. Missing or renamed operationIds break the contract
|
||||
// for Game Master and Admin Service.
|
||||
func TestInternalSpecFreezesOperationIDs(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
|
||||
cases := []struct {
|
||||
method string
|
||||
path string
|
||||
operationID string
|
||||
}{
|
||||
{http.MethodGet, "/healthz", "internalHealthz"},
|
||||
{http.MethodGet, "/readyz", "internalReadyz"},
|
||||
{http.MethodGet, "/api/v1/internal/runtimes", "internalListRuntimes"},
|
||||
{http.MethodGet, "/api/v1/internal/runtimes/{game_id}", "internalGetRuntime"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/start", "internalStartRuntime"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/stop", "internalStopRuntime"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/restart", "internalRestartRuntime"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/patch", "internalPatchRuntime"},
|
||||
{http.MethodDelete, "/api/v1/internal/runtimes/{game_id}/container", "internalCleanupRuntimeContainer"},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.operationID, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
op := getOperation(t, doc, tc.path, tc.method)
|
||||
require.Equal(t, tc.operationID, op.OperationID)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesRuntimeRecordSchema verifies that RuntimeRecord
|
||||
// declares the required field set documented in
|
||||
// rtmanager/README.md §Persistence Layout, with the status enum frozen.
|
||||
func TestInternalSpecFreezesRuntimeRecordSchema(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "RuntimeRecord")
|
||||
|
||||
assertRequiredFields(t, schema,
|
||||
"game_id", "status", "state_path", "docker_network",
|
||||
"last_op_at", "created_at",
|
||||
)
|
||||
|
||||
for _, optional := range []string{
|
||||
"current_container_id", "current_image_ref", "engine_endpoint",
|
||||
"started_at", "stopped_at", "removed_at",
|
||||
} {
|
||||
require.Contains(t, schema.Value.Properties, optional,
|
||||
"RuntimeRecord.%s must be present in properties", optional)
|
||||
}
|
||||
|
||||
assertStringEnum(t, schema, "status", "running", "stopped", "removed")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesStartRequest verifies that StartRequest requires
|
||||
// only image_ref and rejects unknown fields.
|
||||
func TestInternalSpecFreezesStartRequest(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "StartRequest")
|
||||
|
||||
assertRequiredFields(t, schema, "image_ref")
|
||||
require.NotNil(t, schema.Value.AdditionalProperties.Has)
|
||||
require.False(t, *schema.Value.AdditionalProperties.Has,
|
||||
"StartRequest must reject unknown fields")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesStopRequest verifies that StopRequest requires
|
||||
// only reason, that reason references the StopReason schema, and that
|
||||
// unknown fields are rejected.
|
||||
func TestInternalSpecFreezesStopRequest(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "StopRequest")
|
||||
|
||||
assertRequiredFields(t, schema, "reason")
|
||||
require.NotNil(t, schema.Value.AdditionalProperties.Has)
|
||||
require.False(t, *schema.Value.AdditionalProperties.Has,
|
||||
"StopRequest must reject unknown fields")
|
||||
|
||||
reason := schema.Value.Properties["reason"]
|
||||
require.NotNil(t, reason, "StopRequest.reason must be present")
|
||||
require.Equal(t, "#/components/schemas/StopReason", reason.Ref,
|
||||
"StopRequest.reason must reference StopReason")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesPatchRequest verifies that PatchRequest requires
|
||||
// only image_ref and rejects unknown fields.
|
||||
func TestInternalSpecFreezesPatchRequest(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "PatchRequest")
|
||||
|
||||
assertRequiredFields(t, schema, "image_ref")
|
||||
require.NotNil(t, schema.Value.AdditionalProperties.Has)
|
||||
require.False(t, *schema.Value.AdditionalProperties.Has,
|
||||
"PatchRequest must reject unknown fields")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesStopReasonEnum verifies that the stop reason enum
|
||||
// matches the contract recorded in
|
||||
// rtmanager/README.md §Async Stream Contracts.
|
||||
func TestInternalSpecFreezesStopReasonEnum(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "StopReason")
|
||||
|
||||
got := make([]string, 0, len(schema.Value.Enum))
|
||||
for _, value := range schema.Value.Enum {
|
||||
got = append(got, value.(string))
|
||||
}
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{"orphan_cleanup", "cancelled", "finished", "admin_request", "timeout"},
|
||||
got)
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesErrorCodeCatalog verifies that ErrorCode contains
|
||||
// every stable code declared in rtmanager/README.md §Error Model.
|
||||
func TestInternalSpecFreezesErrorCodeCatalog(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "ErrorCode")
|
||||
|
||||
got := make([]string, 0, len(schema.Value.Enum))
|
||||
for _, value := range schema.Value.Enum {
|
||||
got = append(got, value.(string))
|
||||
}
|
||||
|
||||
require.ElementsMatch(t,
|
||||
[]string{
|
||||
"invalid_request",
|
||||
"not_found",
|
||||
"conflict",
|
||||
"service_unavailable",
|
||||
"internal_error",
|
||||
"image_pull_failed",
|
||||
"image_ref_not_semver",
|
||||
"semver_patch_only",
|
||||
"container_start_failed",
|
||||
"start_config_invalid",
|
||||
"docker_unavailable",
|
||||
"replay_no_op",
|
||||
},
|
||||
got)
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesErrorEnvelope verifies that ErrorResponse uses the
|
||||
// `{ "error": { "code", "message" } }` shape and that error.code references
|
||||
// the ErrorCode enum.
|
||||
func TestInternalSpecFreezesErrorEnvelope(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
|
||||
envelope := componentSchemaRef(t, doc, "ErrorResponse")
|
||||
assertRequiredFields(t, envelope, "error")
|
||||
require.Equal(t, "#/components/schemas/ErrorBody",
|
||||
envelope.Value.Properties["error"].Ref,
|
||||
"ErrorResponse.error must reference ErrorBody")
|
||||
|
||||
body := componentSchemaRef(t, doc, "ErrorBody")
|
||||
assertRequiredFields(t, body, "code", "message")
|
||||
require.Equal(t, "#/components/schemas/ErrorCode",
|
||||
body.Value.Properties["code"].Ref,
|
||||
"ErrorBody.code must reference ErrorCode")
|
||||
require.Equal(t, "string",
|
||||
body.Value.Properties["message"].Value.Type.Slice()[0],
|
||||
"ErrorBody.message must be a string")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesProbeResponses verifies that /healthz returns 200
|
||||
// with the probe payload and /readyz declares both 200 and 503.
|
||||
func TestInternalSpecFreezesProbeResponses(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
|
||||
healthz := getOperation(t, doc, "/healthz", http.MethodGet)
|
||||
assertSchemaRef(t, responseSchemaRef(t, healthz, http.StatusOK),
|
||||
"#/components/schemas/ProbeResponse", "internalHealthz 200")
|
||||
|
||||
readyz := getOperation(t, doc, "/readyz", http.MethodGet)
|
||||
assertSchemaRef(t, responseSchemaRef(t, readyz, http.StatusOK),
|
||||
"#/components/schemas/ProbeResponse", "internalReadyz 200")
|
||||
require.NotNil(t, readyz.Responses.Status(http.StatusServiceUnavailable),
|
||||
"internalReadyz must declare a 503 response")
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesXGalaxyCallerHeader verifies that the optional
|
||||
// X-Galaxy-Caller header parameter is declared and referenced from every
|
||||
// runtime operation. Removing the parameter or detaching it from any of
|
||||
// the seven runtime endpoints would silently drop the only signal RTM
|
||||
// uses to distinguish gm_rest from admin_rest in operation_log.
|
||||
func TestInternalSpecFreezesXGalaxyCallerHeader(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
|
||||
param := doc.Components.Parameters["XGalaxyCallerHeader"]
|
||||
require.NotNil(t, param, "XGalaxyCallerHeader parameter must be declared")
|
||||
require.NotNil(t, param.Value, "XGalaxyCallerHeader parameter must have a value")
|
||||
require.Equal(t, "header", param.Value.In)
|
||||
require.Equal(t, "X-Galaxy-Caller", param.Value.Name)
|
||||
require.False(t, param.Value.Required, "X-Galaxy-Caller must be optional")
|
||||
|
||||
enum := param.Value.Schema.Value.Enum
|
||||
got := make([]string, 0, len(enum))
|
||||
for _, value := range enum {
|
||||
got = append(got, value.(string))
|
||||
}
|
||||
require.ElementsMatch(t, []string{"gm", "admin"}, got)
|
||||
|
||||
runtimeOps := []struct {
|
||||
method string
|
||||
path string
|
||||
}{
|
||||
{http.MethodGet, "/api/v1/internal/runtimes"},
|
||||
{http.MethodGet, "/api/v1/internal/runtimes/{game_id}"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/start"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/stop"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/restart"},
|
||||
{http.MethodPost, "/api/v1/internal/runtimes/{game_id}/patch"},
|
||||
{http.MethodDelete, "/api/v1/internal/runtimes/{game_id}/container"},
|
||||
}
|
||||
for _, rop := range runtimeOps {
|
||||
t.Run(rop.method+" "+rop.path, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
op := getOperation(t, doc, rop.path, rop.method)
|
||||
found := false
|
||||
for _, ref := range op.Parameters {
|
||||
if ref.Ref == "#/components/parameters/XGalaxyCallerHeader" {
|
||||
found = true
|
||||
break
|
||||
}
|
||||
}
|
||||
require.Truef(t, found,
|
||||
"%s %s must reference XGalaxyCallerHeader", rop.method, rop.path)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// TestInternalSpecFreezesRuntimesListShape verifies that the list endpoint
|
||||
// returns the items envelope expected by callers.
|
||||
func TestInternalSpecFreezesRuntimesListShape(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadInternalOpenAPISpec(t)
|
||||
schema := componentSchemaRef(t, doc, "RuntimesList")
|
||||
|
||||
assertRequiredFields(t, schema, "items")
|
||||
items := schema.Value.Properties["items"]
|
||||
require.NotNil(t, items, "RuntimesList.items must be declared")
|
||||
require.Equal(t, "#/components/schemas/RuntimeRecord", items.Value.Items.Ref,
|
||||
"RuntimesList.items[] must reference RuntimeRecord")
|
||||
}
|
||||
|
||||
func loadInternalOpenAPISpec(t *testing.T) *openapi3.T {
|
||||
t.Helper()
|
||||
|
||||
_, thisFile, _, ok := runtime.Caller(0)
|
||||
if !ok {
|
||||
require.FailNow(t, "runtime.Caller failed")
|
||||
}
|
||||
|
||||
specPath := filepath.Join(filepath.Dir(thisFile), "api", "internal-openapi.yaml")
|
||||
loader := openapi3.NewLoader()
|
||||
doc, err := loader.LoadFromFile(specPath)
|
||||
if err != nil {
|
||||
require.Failf(t, "test failed", "load spec %s: %v", specPath, err)
|
||||
}
|
||||
if doc == nil {
|
||||
require.Failf(t, "test failed", "load spec %s: returned nil document", specPath)
|
||||
}
|
||||
if err := doc.Validate(context.Background()); err != nil {
|
||||
require.Failf(t, "test failed", "validate spec %s: %v", specPath, err)
|
||||
}
|
||||
|
||||
return doc
|
||||
}
|
||||
|
||||
func getOperation(t *testing.T, doc *openapi3.T, path, method string) *openapi3.Operation {
|
||||
t.Helper()
|
||||
|
||||
if doc.Paths == nil {
|
||||
require.FailNow(t, "spec is missing paths")
|
||||
}
|
||||
pathItem := doc.Paths.Value(path)
|
||||
if pathItem == nil {
|
||||
require.Failf(t, "test failed", "spec is missing path %s", path)
|
||||
}
|
||||
op := pathItem.GetOperation(method)
|
||||
if op == nil {
|
||||
require.Failf(t, "test failed", "spec is missing %s operation for path %s", method, path)
|
||||
}
|
||||
|
||||
return op
|
||||
}
|
||||
|
||||
func responseSchemaRef(t *testing.T, op *openapi3.Operation, status int) *openapi3.SchemaRef {
|
||||
t.Helper()
|
||||
|
||||
ref := op.Responses.Status(status)
|
||||
if ref == nil || ref.Value == nil {
|
||||
require.Failf(t, "test failed", "operation is missing %d response", status)
|
||||
}
|
||||
mt := ref.Value.Content.Get("application/json")
|
||||
if mt == nil || mt.Schema == nil {
|
||||
require.Failf(t, "test failed", "operation is missing application/json schema for %d response", status)
|
||||
}
|
||||
|
||||
return mt.Schema
|
||||
}
|
||||
|
||||
func componentSchemaRef(t *testing.T, doc *openapi3.T, name string) *openapi3.SchemaRef {
|
||||
t.Helper()
|
||||
|
||||
if doc.Components.Schemas == nil {
|
||||
require.FailNow(t, "spec is missing component schemas")
|
||||
}
|
||||
ref := doc.Components.Schemas[name]
|
||||
if ref == nil {
|
||||
require.Failf(t, "test failed", "spec is missing component schema %s", name)
|
||||
}
|
||||
|
||||
return ref
|
||||
}
|
||||
|
||||
func assertSchemaRef(t *testing.T, schemaRef *openapi3.SchemaRef, want, name string) {
|
||||
t.Helper()
|
||||
require.NotNil(t, schemaRef, "%s schema ref", name)
|
||||
require.Equal(t, want, schemaRef.Ref, "%s schema ref", name)
|
||||
}
|
||||
|
||||
func assertRequiredFields(t *testing.T, schemaRef *openapi3.SchemaRef, fields ...string) {
|
||||
t.Helper()
|
||||
require.NotNil(t, schemaRef)
|
||||
require.ElementsMatch(t, fields, schemaRef.Value.Required)
|
||||
}
|
||||
|
||||
func assertStringEnum(t *testing.T, schemaRef *openapi3.SchemaRef, property string, values ...string) {
|
||||
t.Helper()
|
||||
require.NotNil(t, schemaRef)
|
||||
|
||||
propRef := schemaRef.Value.Properties[property]
|
||||
require.NotNil(t, propRef, "schema property %s", property)
|
||||
|
||||
got := make([]string, 0, len(propRef.Value.Enum))
|
||||
for _, v := range propRef.Value.Enum {
|
||||
got = append(got, v.(string))
|
||||
}
|
||||
|
||||
require.ElementsMatch(t, values, got)
|
||||
}
|
||||
@@ -0,0 +1,44 @@
|
||||
# Runtime Manager — Service-Local Documentation
|
||||
|
||||
This directory hosts the service-local documentation for `Runtime
|
||||
Manager`. The top-level [`../README.md`](../README.md) describes the
|
||||
current-state contract (purpose, scope, lifecycles, surfaces,
|
||||
configuration, observability); the documents below complement it with
|
||||
focused content docs and design-rationale records.
|
||||
|
||||
## Content docs
|
||||
|
||||
- [Runtime and components](runtime.md) — process diagram, listeners,
|
||||
workers, lifecycle services, stream offsets, configuration groups,
|
||||
runtime invariants.
|
||||
- [Flows](flows.md) — mermaid sequence diagrams for the lifecycle and
|
||||
observability flows.
|
||||
- [Operator runbook](runbook.md) — startup, readiness, shutdown, and
|
||||
recovery scenarios.
|
||||
- [Configuration and contract examples](examples.md) — `.env`,
|
||||
REST request bodies, stream payloads, storage inspection snippets.
|
||||
|
||||
## Design rationale
|
||||
|
||||
- [PostgreSQL schema decisions](postgres-migration.md) — the schema
|
||||
decision record consolidating the persistence-layer agreements
|
||||
(tables, indexes, CAS shape, `created_at` preservation, jsonb
|
||||
round-trip, schema/role provisioning split).
|
||||
- [Domain and ports](domain-and-ports.md) — string-typed enums, the
|
||||
four allowed runtime transitions, why `Inspect` splits into
|
||||
`InspectImage` / `InspectContainer`, why `LobbyGameRecord` is
|
||||
minimal, and other domain-layer choices.
|
||||
- [Adapters](adapters.md) — Docker SDK adapter, Lobby internal HTTP
|
||||
client, the three Redis publishers, the `mockgen` convention for
|
||||
wide ports, and the unit-test strategy for HTTP-backed adapters.
|
||||
- [Lifecycle services](services.md) — per-game lease semantics, the
|
||||
`Result`-shaped contract, failure-mode tables, the lease-bypass
|
||||
`Run` method on inner services, the `X-Galaxy-Caller` header
|
||||
convention, and the canonical error code → HTTP status mapping.
|
||||
- [Background workers](workers.md) — single-ownership table per
|
||||
`event_type`, `container_disappeared` suppression rules, probe
|
||||
hysteresis, the events listener reconnect policy, the reconciler's
|
||||
per-game lease and three drift kinds.
|
||||
- [Service-local integration suite](integration-tests.md) — the
|
||||
`integration` build tag, the in-process `app.NewRuntime` choice,
|
||||
the Lobby HTTP stub, and the test isolation strategy.
|
||||
@@ -0,0 +1,192 @@
|
||||
# Adapters
|
||||
|
||||
This document explains why the production adapters under
|
||||
[`../internal/adapters/`](../internal/adapters) — Docker SDK,
|
||||
Lobby internal HTTP client, notification-intent publisher, health-event
|
||||
publisher, job-result publisher — are shaped the way they are. The
|
||||
PostgreSQL stores and the Redis-coordination adapters live in
|
||||
[`postgres-migration.md`](postgres-migration.md).
|
||||
|
||||
## 1. `mockgen` is the repo-wide convention for wide ports
|
||||
|
||||
The Docker port has nine methods plus eight value types in the
|
||||
signatures, and between them the lifecycle flows (start, stop, restart,
patch, cleanup, reconcile, events, probe) exercise nearly every method.
|
||||
A hand-rolled fake would either miss methods or balloon to a per-test
|
||||
fixture.
|
||||
|
||||
`internal/adapters/docker/` therefore uses `go.uber.org/mock` mocks:
|
||||
|
||||
- `//go:generate` directives live next to the interface declaration in
|
||||
`internal/ports/dockerclient.go`;
|
||||
- generated code is committed under `internal/adapters/docker/mocks/`
|
||||
(matching the `internal/adapters/postgres/jet/` discipline);
|
||||
- `make -C rtmanager mocks` is the single command operators run after
|
||||
a port-signature change.
|
||||
|
||||
The maintained `go.uber.org/mock` fork is preferred over the archived
|
||||
`github.com/golang/mock`. This convention applies to wide / recorder
|
||||
ports across the repository — Lobby uses the same pipeline for its
|
||||
narrow recorder ports (`RuntimeManager`, `IntentPublisher`,
|
||||
`GMClient`, `UserService`); see
|
||||
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for the cross-service
|
||||
rule.
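For illustration, a directive of the following shape is what `make -C rtmanager mocks`
picks up; the exact flags, file names, and output path here are assumptions, and the
committed `//go:generate` line in `internal/ports/dockerclient.go` is authoritative:

```go
// Hypothetical sketch of the directive placed next to the Docker port
// declaration; go.uber.org/mock provides the mockgen command.
package ports

//go:generate go run go.uber.org/mock/mockgen -source=dockerclient.go -destination=../adapters/docker/mocks/dockerclient_mock.go -package=mocks
```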
|
||||
The other two RTM ports (`LobbyInternalClient`,
|
||||
`NotificationIntentPublisher`) keep inline `_test.go` fakes: small
|
||||
surfaces, easy to fake by hand inside a single test file when needed.
|
||||
|
||||
## 2. `EngineEndpoint` is built inside the Docker adapter
|
||||
|
||||
The engine port is fixed at `8080`. Pushing it into `RunSpec` would
|
||||
force the start service to know an engine implementation detail;
|
||||
pushing it into config would give operators a knob that the engine
|
||||
image already does not honour. The Docker adapter exposes
|
||||
`EnginePort = 8080` as a package constant and constructs
|
||||
`RunResult.EngineEndpoint = "http://" + spec.Hostname + ":8080"`
|
||||
itself.
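A minimal sketch of that shape (identifier names other than `EnginePort` are
illustrative; the committed adapter is authoritative):

```go
package docker

import "fmt"

// EnginePort is fixed by the galaxy/game container contract, so it is a
// package constant here rather than a RunSpec field or a config knob.
const EnginePort = 8080

// engineEndpoint sketches how RunResult.EngineEndpoint is derived from the
// container hostname inside the adapter.
func engineEndpoint(hostname string) string {
	return fmt.Sprintf("http://%s:%d", hostname, EnginePort)
}
```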
|
||||
The adapter also leaves `container.Config.ExposedPorts` empty: RTM
|
||||
never publishes ports to the host. The user-defined Docker bridge
|
||||
network gives every container in the network DNS access to the engine
|
||||
via `galaxy-game-{game_id}:8080`.
|
||||
|
||||
## 3. `Run` removes the container on `ContainerStart` failure
|
||||
|
||||
`README.md §Lifecycles → Start` requires no orphan to remain after a
|
||||
failed start path. If `ContainerCreate` succeeds but `ContainerStart`
|
||||
fails, the adapter calls `ContainerRemove(force=true)` inside a fresh
|
||||
`context.Background()` (with a 10s timeout) so the cleanup runs even
|
||||
when the original ctx is already cancelled. The cleanup is best-effort:
|
||||
a remove failure is silently discarded because the original start
|
||||
failure is the actionable error returned to the caller.
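A sketch of the rollback, with the remove call abstracted as a function value
because the real SDK call and receiver live in the committed adapter:

```go
package docker

import (
	"context"
	"time"
)

// removeBestEffort sketches the rule above: run the cleanup under a fresh
// background context so it still executes when the caller's ctx is already
// cancelled, and drop the remove error because the original start failure
// is the actionable one.
func removeBestEffort(containerID string, remove func(ctx context.Context, id string, force bool) error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_ = remove(ctx, containerID, true)
}
```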
|
||||
The alternative — leaving rollback to the start service — would either
|
||||
duplicate the same code in every caller or invite a service that forgets
|
||||
to do it. Centralising the rule in the adapter keeps the port contract
|
||||
simple. The start service adds an additional rollback layer for the
|
||||
post-`Run` `Upsert` failure path; see [`services.md`](services.md) §5.
|
||||
|
||||
## 4. `RunSpec.Cmd` is optional
|
||||
|
||||
`ports.RunSpec` exposes an optional `Cmd []string`. Production callers
|
||||
leave it `nil` so the engine image's own `CMD` runs;
|
||||
`internal/adapters/docker/smoke_test.go` uses it to drive
|
||||
`["/bin/sh","-c","sleep 60"]` against `alpine:3.21`.
|
||||
|
||||
The alternative — building a dedicated test image with a pre-baked
|
||||
`sleep` command — would require an extra `Dockerfile` under testdata
|
||||
and a build step inside the smoke test. The single new field is
|
||||
documented as optional and ignored when empty; production behaviour is
|
||||
unchanged.
|
||||
|
||||
## 5. `EventsListen` filters at the adapter boundary
|
||||
|
||||
The Docker `/events` API accepts a `filters` query parameter, but the
|
||||
daemon treats it as a hint, not a guarantee. The adapter therefore
|
||||
double-checks at the boundary: only `Type == events.ContainerEventType`
|
||||
messages are passed through to the typed `<-chan ports.DockerEvent`.
|
||||
Doing the filter at the SDK level would still require a defensive
|
||||
recheck on the consumer side; consolidating the check in the adapter
|
||||
keeps the contract crisp and the consumer free of Docker-internal type
|
||||
discriminants.
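A sketch of the boundary filter, using the Docker SDK's `events` types directly;
the real adapter decodes each message into `ports.DockerEvent` before handing it
to consumers:

```go
package docker

import "github.com/docker/docker/api/types/events"

// forwardContainerEvents passes only container events downstream; the
// daemon-side filters parameter is treated as a hint, so the check is
// repeated here at the adapter boundary.
func forwardContainerEvents(in <-chan events.Message, out chan<- events.Message) {
	for msg := range in {
		if msg.Type != events.ContainerEventType {
			continue
		}
		out <- msg
	}
}
```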
|
||||
The decoded event copies the actor's full `Attributes` map into
|
||||
`DockerEvent.Labels`. Docker mixes container labels and runtime
|
||||
attributes (`exitCode`, `image`, `name`, etc.) flat in the same map;
|
||||
RTM consumers filter by the `com.galaxy.` prefix when they care about
|
||||
labels, and the adapter extracts `exitCode` separately for `die`
|
||||
events.
|
||||
|
||||
## 6. Lobby HTTP client error mapping
|
||||
|
||||
`ports.LobbyInternalClient.GetGame` fixes:
|
||||
|
||||
- `200` → `LobbyGameRecord` decoded tolerantly (unknown fields
|
||||
ignored);
|
||||
- `404` → `ports.ErrLobbyGameNotFound`;
|
||||
- transport, timeout, or any other non-2xx → `ports.ErrLobbyUnavailable`
|
||||
wrapped with the original error so callers can `errors.Is` and still
|
||||
log the cause.
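A sketch of the response-side half of this mapping; the sentinel errors live in
`internal/ports` in the real code and are re-declared locally here only to keep
the snippet self-contained:

```go
package lobbyclient

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	errLobbyGameNotFound = errors.New("lobby game not found")
	errLobbyUnavailable  = errors.New("lobby unavailable")
)

// mapResponse sketches the status-code mapping; transport and timeout
// failures are wrapped into the unavailable sentinel before this point.
func mapResponse(resp *http.Response) error {
	switch resp.StatusCode {
	case http.StatusOK:
		return nil // decode LobbyGameRecord tolerantly
	case http.StatusNotFound:
		return errLobbyGameNotFound
	default:
		return fmt.Errorf("%w: unexpected status %d", errLobbyUnavailable, resp.StatusCode)
	}
}
```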
|
||||
The start service treats `ErrLobbyUnavailable` as recoverable: it
|
||||
continues without the diagnostic data because the start envelope
|
||||
already carries the only required field (`image_ref`). The client
|
||||
mirrors `notification/internal/adapters/userservice/client.go`: cloned
|
||||
`*http.Transport`, `otelhttp.NewTransport` wrap, per-request
|
||||
`context.WithTimeout`, idempotent `Close()` releasing idle connections.
|
||||
|
||||
JSON decoding is tolerant: unknown fields in the success body do not
|
||||
break the call, so additive changes to Lobby's `GameRecord` schema do
|
||||
not require an RTM release.
|
||||
|
||||
## 7. Notification publisher wrapper signature
|
||||
|
||||
The wrapper drops the entry id returned by
|
||||
`notificationintent.Publisher.Publish` (rationale in
|
||||
[`domain-and-ports.md`](domain-and-ports.md) §7). The adapter is a
|
||||
thin shim:
|
||||
|
||||
- `NewPublisher(cfg)` constructs the inner publisher and forwards
|
||||
validation;
|
||||
- `Publish(ctx, intent)` calls the inner publisher and discards the
|
||||
entry id.
|
||||
|
||||
The compile-time assertion `var _ ports.NotificationIntentPublisher =
|
||||
(*Publisher)(nil)` lives in `publisher.go`.
|
||||
|
||||
## 8. Health-events publisher: snapshot upsert before stream XADD
|
||||
|
||||
Every emission goes through
|
||||
`ports.HealthEventPublisher.Publish`, which both XADDs to
|
||||
`runtime:health_events` and upserts `health_snapshots`. The snapshot
|
||||
upsert runs **before** the XADD: a successful Publish always leaves
|
||||
the snapshot store at least as fresh as the stream, and a partial
|
||||
failure leaves the snapshot a best-effort lower bound. Reversing the
|
||||
order would let consumers observe a stream entry whose
|
||||
`health_snapshots` row reflects the prior observation — a misleading
|
||||
inversion.
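A sketch of the ordering rule, with the two side effects abstracted as function
values (the committed publisher is authoritative):

```go
package healthevents

import "context"

// publish upserts the snapshot first and appends to the stream second, so
// a partial failure can only leave the snapshot ahead of the stream, never
// behind it.
func publish(ctx context.Context, upsertSnapshot, xadd func(context.Context) error) error {
	if err := upsertSnapshot(ctx); err != nil {
		return err // no stream entry is written for an unrecorded snapshot
	}
	return xadd(ctx)
}
```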
|
||||
The `event_type → SnapshotStatus / SnapshotSource` mapping mirrors the
|
||||
table in [`../README.md` §Health Monitoring](../README.md). In
|
||||
particular, `container_started` collapses to `SnapshotStatusHealthy`
|
||||
and `probe_recovered` does the same (rationale in
|
||||
[`domain-and-ports.md`](domain-and-ports.md) §4).
|
||||
|
||||
## 9. Unit-test strategy
|
||||
|
||||
Both HTTP-backed adapters (Docker SDK, Lobby client) use
|
||||
`httptest.Server` fixtures. The Docker SDK speaks HTTP under the hood
|
||||
for both unix sockets and TCP, so adapter unit tests construct a
|
||||
Docker client with `client.WithHost(server.URL)` and
|
||||
`client.WithHTTPClient(server.Client())`, which lets table-driven
|
||||
handlers fake every Docker API endpoint without touching the real
|
||||
daemon. The Docker API version is pinned to `1.45`
|
||||
(`client.WithVersion("1.45")`) so the URL prefix is stable across CI
|
||||
machines whose daemon advertises a different default. Production
|
||||
wiring (in `internal/app/bootstrap.go`) keeps API negotiation enabled.
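A sketch of the fixture shape (the helper name is illustrative):

```go
package docker_test

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/docker/docker/client"
)

// newFakeDaemonClient points an SDK client at a table-driven httptest
// handler, with the API version pinned so request paths stay stable.
func newFakeDaemonClient(t *testing.T, handler http.Handler) *client.Client {
	t.Helper()
	server := httptest.NewServer(handler)
	t.Cleanup(server.Close)

	cli, err := client.NewClientWithOpts(
		client.WithHost(server.URL),
		client.WithHTTPClient(server.Client()),
		client.WithVersion("1.45"),
	)
	if err != nil {
		t.Fatalf("construct docker client: %v", err)
	}
	t.Cleanup(func() { _ = cli.Close() })
	return cli
}
```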
|
||||
The notification publisher uses `miniredis` directly because the
|
||||
adapter's only side effect is an `XADD`, which `miniredis` reproduces
|
||||
faithfully and matches every other Galaxy intent test.
|
||||
|
||||
## 10. Docker smoke test
|
||||
|
||||
`internal/adapters/docker/smoke_test.go` runs on the default
|
||||
`go test ./...` invocation and calls `t.Skip` unless the local daemon
|
||||
is reachable (`/var/run/docker.sock` exists or `DOCKER_HOST` is set).
|
||||
The covered sequence:
|
||||
|
||||
1. provision a temporary user-defined bridge network;
|
||||
2. assert `EnsureNetwork` for present and missing names;
|
||||
3. pull `alpine:3.21` (`PullPolicyIfMissing`);
|
||||
4. subscribe to events;
|
||||
5. run a sleep container with the full `RunSpec` field set;
|
||||
6. observe a `start` event for the new container id;
|
||||
7. inspect, stop, remove, and verify `ErrContainerNotFound` is
|
||||
reported afterwards.
|
||||
|
||||
This is the production adapter's only end-to-end check that runs from
|
||||
the default `go test` pass; the broader service-local integration
|
||||
suite ([`integration-tests.md`](integration-tests.md)) is gated
|
||||
behind `-tags=integration`.
|
||||
@@ -0,0 +1,167 @@
|
||||
# Domain and Ports
|
||||
|
||||
This document explains why the `rtmanager` domain layer
|
||||
([`../internal/domain/`](../internal/domain)) and the port interfaces
|
||||
([`../internal/ports/`](../internal/ports)) are shaped the way they are.
|
||||
The current-state types and method signatures are the source of truth in
|
||||
the code; this file records the rationale so future readers do not
|
||||
re-litigate the same trade-offs.
|
||||
|
||||
For the surrounding behaviour see
|
||||
[`../README.md`](../README.md), the SQL CHECK constraints in
|
||||
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql),
|
||||
the wire contracts under [`../api/`](../api), and
|
||||
[`postgres-migration.md`](postgres-migration.md) for the persistence
|
||||
layer.
|
||||
|
||||
## 1. String-typed status enums
|
||||
|
||||
`runtime.Status`, `operation.OpKind`, `operation.OpSource`,
|
||||
`operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and
|
||||
`health.SnapshotSource` are all `type X string`.
|
||||
|
||||
The string approach wins on three counts:
|
||||
|
||||
- the SQL CHECK constraints already store the values as `text`, so a
|
||||
string domain type maps one-to-one with no codec layer;
|
||||
- it matches Lobby (`game.Status`, `membership.Status`,
|
||||
`application.Status`), so reviewers do not switch encoding mental
|
||||
models when crossing service boundaries;
|
||||
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8`
|
||||
with stringer-generated names would pay a constant lookup and make raw
|
||||
SQL columns harder to read in diagnostics.
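A sketch of the convention (the three status values are taken from the transition
table in §8; the committed domain package is authoritative):

```go
package runtime

// Status is the string-typed enum convention described above; the value is
// stored verbatim in the text column guarded by the SQL CHECK constraint.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown keeps the invariant cheap: a single switch, no lookup table.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	default:
		return false
	}
}
```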
|
||||
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`
|
||||
|
||||
The PostgreSQL columns are nullable. The domain model uses plain
|
||||
`string` with empty == NULL and bridges the SQL nullability inside the
|
||||
adapter. Pointer fields would force every consumer to dereference
|
||||
defensively even though business logic rarely cares about the
|
||||
NULL/empty distinction (removed records may legitimately carry either
|
||||
form depending on whether the record passed through `stopped` first).
|
||||
|
||||
The adapter's job is to translate `sql.NullString` ⇄ `string`; the rest
|
||||
of the codebase reads the field as a regular value.
|
||||
|
||||
## 3. `*time.Time` for nullable timestamps
|
||||
|
||||
`StartedAt`, `StoppedAt`, `RemovedAt` retain pointer types. `time.Time{}`
|
||||
is a real, comparable value in Go (`IsZero` only reports the canonical
|
||||
zero time); mixing "missing" and "set to UTC zero" through plain
|
||||
`time.Time` would invite bugs. The jet-generated `model.RuntimeRecords`
|
||||
already declares the same fields as `*time.Time`, so the domain type
|
||||
aligns with the persistence type and the adapter does not re-shape
|
||||
pointers.
|
||||
|
||||
## 4. `EventType` and `SnapshotStatus` are deliberately distinct
|
||||
|
||||
`runtime-health-asyncapi.yaml.EventType` enumerates seven values; the
|
||||
SQL CHECK on `health_snapshots.status` enumerates six. The two sets
|
||||
overlap but are not identical:
|
||||
|
||||
- `container_started` is an *event*; the snapshot collapses it to
|
||||
`healthy` (a successful start is observed as the container being
|
||||
live, not as an ongoing event);
|
||||
- `probe_recovered` is an *event*; it does not become a snapshot row of
|
||||
its own — the next inspect/probe overwrites the prior `probe_failed`
|
||||
with `healthy`.
|
||||
|
||||
Modelling them as one shared enum would require a separate "event vs
|
||||
snapshot" boolean and invite accidental mismatches. Two distinct types
|
||||
with explicit `IsKnown` matrices keep each surface honest at compile
|
||||
time.
|
||||
|
||||
## 5. `Inspect` split into `InspectImage` + `InspectContainer`
|
||||
|
||||
Two narrow methods replace a single polymorphic `Inspect`. The surface
|
||||
RTM exercises has two shapes:
|
||||
|
||||
- the start service inspects the *image* by reference to read resource
|
||||
limits from labels;
|
||||
- the periodic inspect worker, the reconciler, and the events listener
|
||||
inspect *containers* by id to read state, health, restart count, and
|
||||
exit code.
|
||||
|
||||
The inputs differ (ref vs id), and the result types differ
|
||||
(`ImageInspect.Labels` is the only field used at start time, while
|
||||
`ContainerInspect` carries a dozen state fields). One polymorphic
|
||||
method would either split internally on input type or return a tagged
|
||||
union; either is messier than two narrow methods.
|
||||
|
||||
## 6. `LobbyGameRecord` is intentionally minimal
|
||||
|
||||
`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and
|
||||
`TargetEngineVersion`. The fetch is classified as ancillary diagnostics
|
||||
because the start envelope already carries the only required field
|
||||
(`image_ref`).
|
||||
|
||||
Anything more would invite RTM consumers to depend on Lobby's schema in
|
||||
ways that violate the "RTM never resolves engine versions" rule.
|
||||
Future fields are additive: each new field is opt-in to the consumer
|
||||
and does not break existing call sites. The minimalism is also a hedge
|
||||
against schema drift — Lobby's `GameRecord` is large and changes more
|
||||
often than RTM needs to track.
|
||||
|
||||
## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`
|
||||
|
||||
Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so
|
||||
business workflows that key on it (idempotency keys, audit
|
||||
correlation) can capture it. RTM publishes admin-only failure intents
|
||||
where the entry id has no consumer — failing starts do not loop back
|
||||
to RTM, and notification routing keys on the producer-supplied
|
||||
`idempotency_key` rather than the stream id. The adapter wraps
|
||||
`pkg/notificationintent.Publisher` and discards the entry id at the
|
||||
wrapper boundary.
|
||||
|
||||
## 8. Exactly four allowed runtime transitions
|
||||
|
||||
`runtime.AllowedTransitions` covers:
|
||||
|
||||
- `running → stopped` — graceful stop, observed exit, reconcile
|
||||
observed exited;
|
||||
- `running → removed` — `reconcile_dispose` when the container
|
||||
vanished;
|
||||
- `stopped → running` — restart and patch inner start;
|
||||
- `stopped → removed` — cleanup TTL or admin DELETE.
|
||||
|
||||
Other pairs are intentionally rejected:
|
||||
|
||||
- `running → running` and `stopped → stopped` would mean Upsert
|
||||
overwrote state without a CAS guard. Idempotent re-start / re-stop
|
||||
never transitions; the service layer returns `replay_no_op` and the
|
||||
record is left untouched.
|
||||
- `removed → *` is forbidden because `removed` is terminal. The
|
||||
reconciler creates fresh records with `reconcile_adopt` rather than
|
||||
resurrecting old ones.
|
||||
|
||||
Encoding the table this way means a future bug where a service tries
|
||||
to revive a removed record is rejected at the domain layer rather than
|
||||
the adapter, which keeps the failure mode close to the offending code.
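A sketch of the table, with the status values as in the §1 sketch (the committed
`runtime.AllowedTransitions` is authoritative):

```go
package runtime

type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// allowedTransitions encodes the four pairs listed above; removed has no
// outgoing entries because it is terminal.
var allowedTransitions = map[Status]map[Status]bool{
	StatusRunning: {StatusStopped: true, StatusRemoved: true},
	StatusStopped: {StatusRunning: true, StatusRemoved: true},
}

// CanTransition reports whether from → to is one of the allowed pairs;
// same-status "transitions" and anything out of removed return false.
func CanTransition(from, to Status) bool {
	return allowedTransitions[from][to]
}
```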
|
||||
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`
|
||||
|
||||
The same enum exists as `config.ImagePullPolicy`. Importing
|
||||
`internal/config` from the ports package would couple two unrelated
|
||||
layers and create a cyclic risk once the wiring layer pulls both in.
|
||||
The runtime/wiring layer (in `internal/app`) is the single point that
|
||||
translates between the two type aliases — both are `string`-typed, the
|
||||
value sets are identical, and the validation lives on each side
|
||||
independently.
|
||||
|
||||
## 10. Compile-time interface assertions live with adapters
|
||||
|
||||
Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the
|
||||
assertion lives in the adapter package (e.g.
|
||||
`var _ ports.RuntimeRecordStore = (*Store)(nil)` inside
|
||||
`internal/adapters/postgres/runtimerecordstore`). Putting the
|
||||
assertions in the port package would force the port package to import
|
||||
its own implementations and create an obvious import cycle.
|
||||
|
||||
## 11. `RunSpec.Validate` lives on the request type
|
||||
|
||||
The Docker port carries a non-trivial request type (`RunSpec`) with
|
||||
eight required fields and per-mount invariants. Putting `Validate` on
|
||||
the request struct keeps the rule next to the type definition, mirrors
|
||||
the pattern used by `lobby/internal/ports/gmclient.go`
|
||||
(`RegisterGameRequest.Validate`), and lets the adapter call it as the
|
||||
first defensive check before invoking the Docker SDK.
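A sketch of the pattern with only a few illustrative fields; the real `RunSpec`
carries eight required fields plus per-mount invariants:

```go
package ports

import "errors"

// RunSpec sketch: fields beyond GameID, ImageRef, and Hostname are omitted.
type RunSpec struct {
	GameID   string
	ImageRef string
	Hostname string
}

// Validate keeps the invariants next to the request type; the adapter
// calls it as the first defensive check before touching the Docker SDK.
func (s RunSpec) Validate() error {
	switch {
	case s.GameID == "":
		return errors.New("runspec: game_id is required")
	case s.ImageRef == "":
		return errors.New("runspec: image_ref is required")
	case s.Hostname == "":
		return errors.New("runspec: hostname is required")
	}
	return nil
}
```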
|
||||
@@ -0,0 +1,429 @@
|
||||
# Configuration And Contract Examples
|
||||
|
||||
The examples below are illustrative. Replace `localhost`, port
|
||||
numbers, IDs, and timestamps with values that match the deployment
|
||||
under inspection.
|
||||
|
||||
## Example `.env`
|
||||
|
||||
A minimum-viable `RTMANAGER_*` set for a local run against a single
|
||||
Redis container plus a PostgreSQL container with the `rtmanager`
|
||||
schema and the `rtmanagerservice` role provisioned. The full list
|
||||
with defaults lives in [`../README.md` §Configuration](../README.md).
|
||||
|
||||
```bash
|
||||
# Required
|
||||
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
|
||||
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
|
||||
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
|
||||
RTMANAGER_REDIS_PASSWORD=local
|
||||
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
|
||||
RTMANAGER_DOCKER_NETWORK=galaxy-net
|
||||
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
|
||||
|
||||
# Lobby internal client (diagnostic GET only in v1)
|
||||
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
|
||||
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
|
||||
|
||||
# Container defaults (image labels override these per container)
|
||||
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
|
||||
RTMANAGER_DEFAULT_MEMORY=512m
|
||||
RTMANAGER_DEFAULT_PIDS_LIMIT=512
|
||||
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
|
||||
RTMANAGER_CONTAINER_RETENTION_DAYS=30
|
||||
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
|
||||
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
|
||||
RTMANAGER_GAME_STATE_DIR_MODE=0750
|
||||
RTMANAGER_GAME_STATE_OWNER_UID=0
|
||||
RTMANAGER_GAME_STATE_OWNER_GID=0
|
||||
|
||||
# Workers
|
||||
RTMANAGER_INSPECT_INTERVAL=30s
|
||||
RTMANAGER_PROBE_INTERVAL=15s
|
||||
RTMANAGER_PROBE_TIMEOUT=2s
|
||||
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
|
||||
RTMANAGER_RECONCILE_INTERVAL=5m
|
||||
RTMANAGER_CLEANUP_INTERVAL=1h
|
||||
|
||||
# Coordination
|
||||
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
|
||||
|
||||
# Process and logging
|
||||
RTMANAGER_LOG_LEVEL=info
|
||||
RTMANAGER_SHUTDOWN_TIMEOUT=30s
|
||||
|
||||
# Telemetry (disabled for local dev — enable to ship traces / metrics)
|
||||
OTEL_SERVICE_NAME=galaxy-rtmanager
|
||||
OTEL_TRACES_EXPORTER=none
|
||||
OTEL_METRICS_EXPORTER=none
|
||||
```
|
||||
|
||||
For a production-shaped deployment, set
|
||||
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
|
||||
a tag mutation is immediately visible to the next runtime),
|
||||
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
|
||||
container's user, and configure `OTEL_*` against the cluster's OTLP
|
||||
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
|
||||
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
|
||||
sink the operator runs (fluentd, journald, etc.).
|
||||
|
||||
For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
|
||||
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
|
||||
service-local harness brings up
|
||||
([`integration-tests.md` §7](integration-tests.md)).
|
||||
|
||||
## Internal HTTP Examples
|
||||
|
||||
Every endpoint admits the optional `X-Galaxy-Caller` header which the
|
||||
handler records as `op_source` in `operation_log` (`gm` → `gm_rest`,
|
||||
`admin` → `admin_rest`; missing or unknown values default to
|
||||
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
|
||||
|
||||
### Probe a runtime record
|
||||
|
||||
```bash
|
||||
curl -s -H 'X-Galaxy-Caller: gm' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
|
||||
```
|
||||
|
||||
Response (`200 OK`):
|
||||
|
||||
```json
|
||||
{
|
||||
"game_id": "game-01HZ...",
|
||||
"status": "running",
|
||||
"current_container_id": "1f2a...",
|
||||
"current_image_ref": "galaxy/game:1.4.0",
|
||||
"engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
|
||||
"state_path": "/var/lib/galaxy/games/game-01HZ...",
|
||||
"docker_network": "galaxy-net",
|
||||
"started_at": "2026-04-28T07:18:54Z",
|
||||
"stopped_at": null,
|
||||
"removed_at": null,
|
||||
"last_op_at": "2026-04-28T07:18:54Z",
|
||||
"created_at": "2026-04-28T07:18:54Z"
|
||||
}
|
||||
```
|
||||
|
||||
### List all runtimes
|
||||
|
||||
```bash
|
||||
curl -s -H 'X-Galaxy-Caller: admin' \
|
||||
http://localhost:8096/api/v1/internal/runtimes
|
||||
```
|
||||
|
||||
The response shape is `{"items":[<RuntimeRecord>...]}`.
|
||||
|
||||
### Start a runtime
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: gm' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
|
||||
-d '{"image_ref": "galaxy/game:1.4.0"}'
|
||||
```
|
||||
|
||||
A `200` returns the `RuntimeRecord` for the running runtime. Failure
|
||||
shapes use the canonical envelope; e.g. an invalid `image_ref`:
|
||||
|
||||
```json
|
||||
{
|
||||
"error": {
|
||||
"code": "start_config_invalid",
|
||||
"message": "image_ref shape rejected by docker reference parser"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Stop a runtime
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
|
||||
-d '{"reason": "admin_request"}'
|
||||
```
|
||||
|
||||
Valid `reason` values:
|
||||
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
|
||||
|
||||
### Restart a runtime
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
|
||||
```
|
||||
|
||||
The body is empty; restart re-uses the current `image_ref`.
|
||||
|
||||
### Patch a runtime
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
|
||||
-d '{"image_ref": "galaxy/game:1.4.2"}'
|
||||
```
|
||||
|
||||
Patch enforces the semver-only rule: a non-semver tag returns
|
||||
`image_ref_not_semver`; a cross-major or cross-minor change returns
|
||||
`semver_patch_only`.
|
||||
|
||||
### Cleanup a stopped runtime container
|
||||
|
||||
```bash
|
||||
curl -s -X DELETE \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
|
||||
```
|
||||
|
||||
Cleanup refuses a `running` runtime with `409 conflict`; stop first.
|
||||
|
||||
## Stream Payload Examples
|
||||
|
||||
Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
|
||||
the defaults are used below. Field types and required/optional
|
||||
semantics are frozen by
|
||||
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
|
||||
and
|
||||
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).
|
||||
|
||||
### `runtime:start_jobs` (Lobby → RTM)
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:start_jobs '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
image_ref 'galaxy/game:1.4.0' \
|
||||
requested_at_ms 1714081234567
|
||||
```
|
||||
|
||||
### `runtime:stop_jobs` (Lobby → RTM)
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:stop_jobs '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
reason 'cancelled' \
|
||||
requested_at_ms 1714081234567
|
||||
```
|
||||
|
||||
### `runtime:job_results` (RTM → Lobby)
|
||||
|
||||
Success envelope:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:job_results '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
outcome 'success' \
|
||||
container_id '1f2a...' \
|
||||
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
|
||||
error_code '' \
|
||||
error_message ''
|
||||
```
|
||||
|
||||
Failure envelope:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:job_results '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
outcome 'failure' \
|
||||
container_id '' \
|
||||
engine_endpoint '' \
|
||||
error_code 'image_pull_failed' \
|
||||
error_message 'pull failed: manifest unknown'
|
||||
```
|
||||
|
||||
Idempotent replay envelope (success outcome with explicit
|
||||
`replay_no_op`):
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:job_results '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
outcome 'success' \
|
||||
container_id '1f2a...' \
|
||||
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
|
||||
error_code 'replay_no_op' \
|
||||
error_message ''
|
||||
```
|
||||
|
||||
The contract permits empty `container_id` and `engine_endpoint`
|
||||
strings on every value of `outcome` so the consumer can decode the
|
||||
envelope uniformly ([`workers.md` §11](workers.md)).
|
||||
|
||||
### `runtime:health_events` (RTM out)
|
||||
|
||||
The wire shape is the same for every event type — only the
|
||||
`details` payload differs.
|
||||
|
||||
`container_started`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'container_started' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"image_ref":"galaxy/game:1.4.0"}'
|
||||
```
|
||||
|
||||
`container_exited`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'container_exited' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"exit_code":137,"oom":false}'
|
||||
```
|
||||
|
||||
`container_oom`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'container_oom' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"exit_code":137}'
|
||||
```
|
||||
|
||||
`container_disappeared`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'container_disappeared' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{}'
|
||||
```
|
||||
|
||||
`inspect_unhealthy`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'inspect_unhealthy' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
|
||||
```
|
||||
|
||||
`probe_failed` (after the threshold is crossed):
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'probe_failed' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
|
||||
```
|
||||
|
||||
`probe_recovered`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD runtime:health_events '*' \
|
||||
game_id 'game-01HZ...' \
|
||||
container_id '1f2a...' \
|
||||
event_type 'probe_recovered' \
|
||||
occurred_at_ms 1714081234567 \
|
||||
details '{"prior_failure_count":3}'
|
||||
```
|
||||
|
||||
### `notification:intents` (RTM admin notifications)
|
||||
|
||||
RTM publishes admin-only notification intents only for the three
|
||||
first-touch start failures. Every payload shares the frozen field
|
||||
set `{game_id, image_ref, error_code, error_message,
|
||||
attempted_at_ms}`
|
||||
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
|
||||
|
||||
`runtime.image_pull_failed`:
|
||||
|
||||
```bash
|
||||
redis-cli XADD notification:intents '*' \
|
||||
envelope '{
|
||||
"type": "runtime.image_pull_failed",
|
||||
"producer": "rtmanager",
|
||||
"idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
|
||||
"audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
|
||||
"payload": {
|
||||
"game_id": "game-01HZ...",
|
||||
"image_ref": "galaxy/game:1.4.0",
|
||||
"error_code": "image_pull_failed",
|
||||
"error_message": "pull failed: manifest unknown",
|
||||
"attempted_at_ms": 1714081234567
|
||||
}
|
||||
}'
|
||||
```
|
||||
|
||||
`runtime.container_start_failed` and `runtime.start_config_invalid`
|
||||
share the same envelope with their respective `type` and
|
||||
`error_code` values.
|
||||
|
||||
## Storage Inspection
|
||||
|
||||
### Inspect a runtime record (PostgreSQL)
|
||||
|
||||
```bash
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
|
||||
```
|
||||
|
||||
Columns mirror the fields documented in
|
||||
[`../README.md` §Persistence Layout](../README.md#persistence-layout).
|
||||
|
||||
### Inspect runtime status counts
|
||||
|
||||
```bash
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
|
||||
```
|
||||
|
||||
### Inspect the operation log for a game
|
||||
|
||||
```bash
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT id, op_kind, op_source, outcome, error_code,
|
||||
started_at, finished_at
|
||||
FROM rtmanager.operation_log
|
||||
WHERE game_id = 'game-01HZ...'
|
||||
ORDER BY started_at DESC, id DESC
|
||||
LIMIT 50"
|
||||
```
|
||||
|
||||
### Inspect the latest health snapshot
|
||||
|
||||
```bash
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT game_id, container_id, status, source, observed_at, details
|
||||
FROM rtmanager.health_snapshots
|
||||
WHERE game_id = 'game-01HZ...'"
|
||||
```
|
||||
|
||||
### Inspect Redis runtime-coordination keys
|
||||
|
||||
```bash
|
||||
# Stream offsets
|
||||
redis-cli GET rtmanager:stream_offsets:startjobs
|
||||
redis-cli GET rtmanager:stream_offsets:stopjobs
|
||||
|
||||
# Per-game lease (only present while an operation is in flight)
|
||||
redis-cli GET rtmanager:game_lease:game-01HZ...
|
||||
redis-cli TTL rtmanager:game_lease:game-01HZ...
|
||||
|
||||
# Recent stream entries
|
||||
redis-cli XRANGE runtime:start_jobs - + COUNT 20
|
||||
redis-cli XRANGE runtime:job_results - + COUNT 20
|
||||
redis-cli XRANGE runtime:health_events - + COUNT 50
|
||||
|
||||
# Stream metadata
|
||||
redis-cli XINFO STREAM runtime:start_jobs
|
||||
redis-cli XINFO STREAM runtime:stop_jobs
|
||||
redis-cli XINFO STREAM runtime:health_events
|
||||
```
|
||||
@@ -0,0 +1,305 @@
|
||||
# Flows
|
||||
|
||||
This document collects the lifecycle and observability flows that
|
||||
span Runtime Manager and its synchronous and asynchronous neighbours.
|
||||
Narrative descriptions of the rules these flows enforce live in
|
||||
[`../README.md`](../README.md); the diagrams here focus on the message
|
||||
order across the boundary. Design-rationale records linked from each
|
||||
section explain the *why*.
|
||||
|
||||
## Start (happy path)
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Lobby as Lobby publisher
|
||||
participant Stream as runtime:start_jobs
|
||||
participant Consumer as startjobsconsumer
|
||||
participant Service as startruntime
|
||||
participant Lease as Redis lease
|
||||
participant Docker
|
||||
participant PG as Postgres
|
||||
participant Health as runtime:health_events
|
||||
participant Results as runtime:job_results
|
||||
|
||||
Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
|
||||
Consumer->>Stream: XREAD
|
||||
Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
|
||||
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
|
||||
Service->>PG: SELECT runtime_records WHERE game_id
|
||||
Service->>Docker: PullImage(image_ref) per pull policy
|
||||
Service->>Docker: InspectImage → resource limits
|
||||
Service->>Service: prepareStateDir(<root>/{game_id})
|
||||
Service->>Docker: ContainerCreate + ContainerStart
|
||||
Service->>PG: Upsert runtime_records (status=running)
|
||||
Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
|
||||
Service->>Health: XADD container_started
|
||||
Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
|
||||
Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
|
||||
Service->>Lease: DEL rtmanager:game_lease:{game_id}
|
||||
```
|
||||
|
||||
REST callers (Game Master, Admin Service) drive the same service
|
||||
through `POST /api/v1/internal/runtimes/{game_id}/start`; the
|
||||
diagram's last two arrows collapse to an HTTP `200` response carrying
|
||||
the runtime record. Sources:
|
||||
[`../README.md` §Lifecycles → Start](../README.md#start),
|
||||
[`services.md` §3](services.md).
|
||||
|
||||
## Start failure (image pull)
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Service as startruntime
|
||||
participant Docker
|
||||
participant PG as Postgres
|
||||
participant Intents as notification:intents
|
||||
participant Results as runtime:job_results
|
||||
|
||||
Service->>Docker: PullImage(image_ref)
|
||||
Docker-->>Service: error
|
||||
Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
|
||||
Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
|
||||
Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
|
||||
Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
|
||||
```
|
||||
|
||||
The same shape applies to the configuration-validation failures
|
||||
(`start_config_invalid` from `EnsureNetwork(ErrNetworkMissing)`,
|
||||
`prepareStateDir`, or invalid `image_ref` shape) and the Docker
|
||||
create/start failure (`container_start_failed`); only the error code
|
||||
and the matching `runtime.*` notification type differ. Three failure
|
||||
codes do **not** raise an admin notification: `conflict`,
|
||||
`service_unavailable`, `internal_error`
|
||||
([`services.md` §4](services.md)).
|
||||
|
||||
## Start failure (orphan / Upsert-after-Run rollback)
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Service as startruntime
|
||||
participant Docker
|
||||
participant PG as Postgres
|
||||
participant Intents as notification:intents
|
||||
|
||||
Service->>Docker: ContainerCreate + ContainerStart
|
||||
Docker-->>Service: container running
|
||||
Service->>PG: Upsert runtime_records
|
||||
PG-->>Service: error (transport / constraint)
|
||||
Note over Service: container is now an orphan<br/>(running, no PG record)
|
||||
Service->>Docker: Remove(container_id) [fresh background context]
|
||||
Docker-->>Service: ok or logged failure
|
||||
Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
|
||||
Service->>Intents: XADD runtime.container_start_failed
|
||||
Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
|
||||
```
|
||||
|
||||
The Docker adapter already removes the container when `Run` itself
|
||||
fails after a successful `ContainerCreate`
|
||||
([`adapters.md` §3](adapters.md)); the start service adds the
|
||||
post-`Run` rollback for the `Upsert` path. A `Remove` failure is
|
||||
logged but not propagated; the reconciler adopts surviving orphans on
|
||||
its periodic pass ([`services.md` §5](services.md)).
|
||||
|
||||
## Stop
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Caller as Lobby / GM / Admin
|
||||
participant Service as stopruntime
|
||||
participant Lease as Redis lease
|
||||
participant PG as Postgres
|
||||
participant Docker
|
||||
participant Results as runtime:job_results
|
||||
|
||||
Caller->>Service: stop(game_id, reason)
|
||||
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
|
||||
Service->>PG: SELECT runtime_records WHERE game_id
|
||||
alt status in {stopped, removed}
|
||||
Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
|
||||
Service-->>Caller: success / replay_no_op
|
||||
else status = running
|
||||
Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
|
||||
Docker-->>Service: ok
|
||||
Service->>PG: UpdateStatus running→stopped (CAS by container_id)
|
||||
Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
|
||||
Service-->>Caller: success
|
||||
end
|
||||
Service->>Lease: DEL rtmanager:game_lease:{game_id}
|
||||
```
|
||||
|
||||
Lobby callers receive the outcome through `runtime:job_results`; REST
|
||||
callers receive an HTTP `200`. The `reason` enum
|
||||
(`orphan_cleanup | cancelled | finished | admin_request | timeout`)
|
||||
is recorded in `operation_log` and is otherwise opaque to the stop
|
||||
service — RTM does not branch on the reason in v1
|
||||
([`services.md` §15, §17](services.md)).
|
||||
|
||||
## Restart
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Admin as GM / Admin
|
||||
participant Service as restartruntime
|
||||
participant Stop as stopruntime.Run
|
||||
participant Start as startruntime.Run
|
||||
participant Docker
|
||||
participant PG as Postgres
|
||||
|
||||
Admin->>Service: POST /restart
|
||||
Service->>PG: SELECT runtime_records WHERE game_id
|
||||
Note over Service: capture current image_ref
|
||||
Service->>Service: acquire per-game lease (held across both inner ops)
|
||||
Service->>Stop: Run(game_id) [lease bypass]
|
||||
Stop->>Docker: ContainerStop
|
||||
Stop->>PG: UpdateStatus running→stopped
|
||||
Service->>Docker: ContainerRemove
|
||||
Service->>Start: Run(game_id, image_ref) [lease bypass]
|
||||
Start->>Docker: PullImage / Run
|
||||
Start->>PG: Upsert runtime_records (status=running)
|
||||
Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
|
||||
Service-->>Admin: 200 {runtime_record}
|
||||
Service->>Service: release lease
|
||||
```
|
||||
|
||||
The lease is acquired by `restartruntime` and held across both inner
|
||||
operations; `stopruntime.Run` and `startruntime.Run` are
|
||||
lease-bypass entry points that skip the inner lease acquisition
|
||||
([`services.md` §12](services.md)). The single `operation_log` row
|
||||
uses `Input.SourceRef` as a correlation id linking the implicit stop
|
||||
and start entries ([`services.md` §13](services.md)).
|
||||
|
||||
## Patch
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Admin as GM / Admin
|
||||
participant Service as patchruntime
|
||||
participant Restart as restartruntime.Run
|
||||
|
||||
Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
|
||||
Service->>Service: parse new image_ref + current image_ref
|
||||
alt either ref not semver
|
||||
Service-->>Admin: 422 image_ref_not_semver
|
||||
else major or minor differ
|
||||
Service-->>Admin: 422 semver_patch_only
|
||||
else major.minor match, patch differs (or equal)
|
||||
Service->>Restart: Run(game_id, new_image_ref)
|
||||
Restart-->>Service: Result
|
||||
Service-->>Admin: 200 {runtime_record}
|
||||
end
|
||||
```
|
||||
|
||||
The semver gate uses the tag fragment of the Docker reference; the
|
||||
extraction strategy is recorded in [`services.md` §14](services.md).
|
||||
The restart delegate already owns the lease, the inner stop/start,
|
||||
the operation log, and the `runtime:health_events container_started`
|
||||
emission ([`workers.md` §1](workers.md)).
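A sketch of the gate's comparison step, assuming the tag has already been
extracted from the Docker reference; the committed service and its error-code
mapping are authoritative:

```go
package patch

import (
	"errors"
	"strconv"
	"strings"
)

var (
	errImageRefNotSemver = errors.New("image_ref_not_semver")
	errSemverPatchOnly   = errors.New("semver_patch_only")
)

// checkPatchOnly allows the change only when both tags parse as
// MAJOR.MINOR.PATCH and the MAJOR.MINOR fragments are identical.
func checkPatchOnly(currentTag, newTag string) error {
	curMaj, curMin, ok1 := majorMinor(currentTag)
	newMaj, newMin, ok2 := majorMinor(newTag)
	if !ok1 || !ok2 {
		return errImageRefNotSemver
	}
	if curMaj != newMaj || curMin != newMin {
		return errSemverPatchOnly
	}
	return nil
}

// majorMinor parses the leading MAJOR.MINOR of a MAJOR.MINOR.PATCH[-suffix] tag.
func majorMinor(tag string) (major, minor int, ok bool) {
	parts := strings.SplitN(tag, ".", 3)
	if len(parts) != 3 {
		return 0, 0, false
	}
	major, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	return major, minor, err1 == nil && err2 == nil
}
```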
|
||||
## Cleanup TTL
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Worker as containercleanup worker
|
||||
participant PG as Postgres
|
||||
participant Service as cleanupcontainer
|
||||
participant Lease as Redis lease
|
||||
participant Docker
|
||||
|
||||
loop every RTMANAGER_CLEANUP_INTERVAL
|
||||
Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
|
||||
loop per game
|
||||
Worker->>Service: cleanup(game_id, op_source=auto_ttl)
|
||||
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
|
||||
Service->>PG: re-read runtime_records WHERE game_id
|
||||
alt status = running
|
||||
Service-->>Worker: refused / conflict
|
||||
else status in {stopped, removed}
|
||||
Service->>Docker: ContainerRemove(container_id)
|
||||
Service->>PG: UpdateStatus stopped→removed (CAS)
|
||||
Service->>PG: INSERT operation_log (op_kind=cleanup_container)
|
||||
Service-->>Worker: success
|
||||
end
|
||||
Service->>Lease: DEL rtmanager:game_lease:{game_id}
|
||||
end
|
||||
end
|
||||
```
|
||||
|
||||
Admin-driven cleanup follows the same path through
|
||||
`DELETE /api/v1/internal/runtimes/{game_id}/container` with
|
||||
`op_source=admin_rest` instead of `auto_ttl`. The host state directory
|
||||
is **never** removed by this flow
|
||||
([`../README.md` §Cleanup](../README.md#cleanup),
|
||||
[`services.md` §17](services.md),
|
||||
[`workers.md` §19](workers.md)).
|
||||
|
||||
## Reconcile drift adopt
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Reconciler as reconcile worker
|
||||
participant Docker
|
||||
participant PG as Postgres
|
||||
participant Lease as Redis lease
|
||||
|
||||
Note over Reconciler: read pass (lockless)
|
||||
Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
|
||||
Reconciler->>PG: ListByStatus(running)
|
||||
Note over Reconciler: write pass (per-game lease)
|
||||
loop per Docker container without matching record
|
||||
Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
|
||||
Reconciler->>PG: re-read runtime_records WHERE game_id
|
||||
alt record now exists
|
||||
Reconciler-->>Reconciler: skip (state changed since read pass)
|
||||
else record still missing
|
||||
Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
|
||||
Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
|
||||
end
|
||||
Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
|
||||
end
|
||||
```
|
||||
|
||||
The reconciler **never** stops or removes an unrecorded container —
|
||||
operators may have started one manually for diagnostics. The
|
||||
`reconcile_dispose` and `observed_exited` paths follow the same
|
||||
read-pass / write-pass split, with `dispose` updating the orphaned
|
||||
record to `removed` and emitting `container_disappeared`, and
|
||||
`observed_exited` updating to `stopped` and emitting `container_exited`
|
||||
([`../README.md` §Reconciliation](../README.md#reconciliation),
|
||||
[`workers.md` §14–§16](workers.md)).
|
||||
|
||||
## Health probe hysteresis
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Worker as healthprobe worker
|
||||
participant State as in-memory probe state
|
||||
participant Engine as galaxy-game-{id}:8080
|
||||
participant Health as runtime:health_events
|
||||
|
||||
loop every RTMANAGER_PROBE_INTERVAL
|
||||
Worker->>Worker: ListByStatus(running)
|
||||
Worker->>State: prune entries for games no longer running
|
||||
loop per game (semaphore cap = 16)
|
||||
Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
|
||||
alt success
|
||||
State->>State: consecutiveFailures = 0
|
||||
opt failurePublished was true
|
||||
Worker->>Health: XADD probe_recovered {prior_failure_count}
|
||||
State->>State: failurePublished = false
|
||||
end
|
||||
else failure
|
||||
State->>State: consecutiveFailures++
|
||||
opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
|
||||
Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
|
||||
State->>State: failurePublished = true
|
||||
end
|
||||
end
|
||||
end
|
||||
end
|
||||
```
|
||||
|
||||
Hysteresis prevents a single transient failure from emitting a
|
||||
`probe_failed` event, and prevents repeated emission while the failure
|
||||
persists. State is non-persistent: a process restart re-establishes
|
||||
the counters from scratch; a game's state is pruned when it transitions
|
||||
out of the running list ([`workers.md` §5–§6](workers.md)).
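A sketch of the per-game state the worker keeps in memory (names are
illustrative; the committed worker is authoritative):

```go
package healthprobe

// probeState tracks the hysteresis for one running game.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe applies a single probe outcome and returns the event type to
// emit, if any ("probe_failed", "probe_recovered", or ""). The real worker
// also captures the prior failure count for the event payload before the
// reset below.
func (s *probeState) observe(success bool, threshold int) string {
	if success {
		s.consecutiveFailures = 0
		if s.failurePublished {
			s.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	s.consecutiveFailures++
	if s.consecutiveFailures == threshold && !s.failurePublished {
		s.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```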
|
||||
@@ -0,0 +1,163 @@
|
||||
# Service-Local Integration Suite
|
||||
|
||||
This document explains the design of the service-local integration
|
||||
suite under [`../integration/`](../integration). The current-state
|
||||
behaviour (harness layout, env knobs, scenario coverage) lives next
|
||||
to the files themselves; this document records the rationale.
|
||||
|
||||
The cross-service Lobby↔RTM suite at
|
||||
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
|
||||
different rules (it lives in the top-level `galaxy/integration`
|
||||
module) and is documented inside that package.
|
||||
|
||||
## 1. Build tag `integration`
|
||||
|
||||
The scenarios under [`../integration/*_test.go`](../integration) are
|
||||
guarded by `//go:build integration`. The default `go test ./...`
|
||||
invocation skips them, while `go test -tags=integration
|
||||
./integration/...` (and the `make integration` target) runs the full
|
||||
set:
|
||||
|
||||
```sh
|
||||
make -C rtmanager integration
|
||||
```
|
||||
|
||||
The harness package itself ([`../integration/harness`](../integration/harness))
|
||||
has no build tag. It compiles on every run because each helper guards
|
||||
its Docker-dependent paths with `t.Skip` when the daemon is
|
||||
unavailable. This keeps the harness loadable from a tagless `go vet`
|
||||
or IDE workflow without dragging Docker into the default `go test`
|
||||
critical path.
|
||||
|
||||
## 2. Smoke test runs in the default `go test` pass
|
||||
|
||||
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
|
||||
runs in the regular `go test ./...` pass and falls back on
|
||||
`skipUnlessDockerAvailable` when no Docker socket is present. The
|
||||
smoke test is intentionally kept separate from the new `integration/`
|
||||
suite because it exercises the production adapter shape (one
|
||||
container at a time against `alpine:3.21`), not the full runtime;
|
||||
both surfaces are useful.
|
||||
|
||||
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
|
||||
|
||||
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
|
||||
logger)` directly rather than spawning the binary from
|
||||
`cmd/rtmanager/main.go`:
|
||||
|
||||
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
|
||||
the runtime context and call `runtime.Close()`; the goroutine
|
||||
driving `runtime.Run` returns with `context.Canceled` and the
|
||||
helper waits on it via the `runDone` channel. With a subprocess the
|
||||
equivalent dance requires SIGTERM, output capture, and graceful
|
||||
shutdown timing tied to the child's signal handler.
|
||||
- **Goroutine and store visibility.** Tests read the durable PG state
|
||||
directly through the harness-owned pool and read every Redis stream
|
||||
through the harness-owned client. Both observe the exact wire shape
|
||||
Lobby will see in the cross-service suite.
|
||||
- **Logger isolation.** The harness defaults to `slog.Discard` so the
|
||||
default test output stays focused on assertions; flipping
|
||||
`EnvOptions.LogToStderr` lights up the runtime's structured logs
|
||||
for local debugging without requiring any subprocess plumbing.
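The first bullet's cancel-and-wait dance, sketched with the runtime abstracted
behind function values (the real harness calls `app.NewRuntime`, `runtime.Run`,
and `runtime.Close`):

```go
package harness

import (
	"context"
	"testing"
)

// startRuntime runs the service in-process and registers deterministic
// cleanup: cancel the context, close the runtime, then wait on runDone so
// fixtures are not torn down underneath a still-running goroutine.
func startRuntime(t *testing.T, run func(ctx context.Context) error, closeFn func() error) {
	t.Helper()
	ctx, cancel := context.WithCancel(context.Background())
	runDone := make(chan error, 1)
	go func() { runDone <- run(ctx) }()

	t.Cleanup(func() {
		cancel()
		_ = closeFn()
		<-runDone
	})
}
```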
|
||||
The cross-service inter-process suite at `integration/lobbyrtm/`
|
||||
re-uses the existing `integration/internal/harness` binary-spawn
|
||||
helpers; the in-process choice here is specific to the service-local
|
||||
scope.
|
||||
|
||||
## 4. `httptest.Server` stub for the Lobby internal client
|
||||
|
||||
Runtime Manager configuration requires a non-empty
|
||||
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
|
||||
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
|
||||
as a no-op (the start envelope already carries the only required
|
||||
field, `image_ref`; rationale in [`services.md`](services.md) §7).
|
||||
The harness therefore stands up a tiny `httptest.Server` per test
|
||||
that returns a stable `200 OK` response. The stub is intentionally
|
||||
unconfigurable: every integration scenario produces the same
|
||||
ancillary fetch, and adding routing/error injection would invite
|
||||
test code to depend on a contract the start service deliberately
|
||||
ignores.
|
||||
|
||||
## 5. One built engine image, two semver-compatible tags
|
||||
|
||||
The patch lifecycle expects the new and current image refs to share
|
||||
the same major / minor version (`semver_patch_only` failure
|
||||
otherwise). Building two distinct images would multiply the per-run
|
||||
build cost without changing what the test verifies — the patch path
|
||||
exercises `image_ref_not_semver` and `semver_patch_only` validation
|
||||
plus the recreate-with-new-tag flow, none of which depend on
|
||||
distinct image *content*. The harness builds the engine once and
|
||||
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
|
||||
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
|
||||
|
||||
The integration tags use the `*-rtm-it` suffix (rather than plain
|
||||
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
|
||||
accidentally consume a hand-built dev image, and so a `docker image
|
||||
rm` of integration leftovers does not nuke a production-shaped tag.
|
||||
|
||||
## 6. Per-test Docker network and per-test state root
|
||||
|
||||
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
|
||||
and registers cleanup; `t.ArtifactDir()` provides the per-game state
|
||||
root. Both ensure that two scenarios running back-to-back cannot
|
||||
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
|
||||
filesystem state. Game ids are themselves unique per test
|
||||
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
|
||||
the per-test network and state root, the suite is safe to run with
|
||||
`-count` greater than one.
|
||||
|
||||
`t.ArtifactDir()` keeps the engine state directory around when a
|
||||
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
|
||||
failure and inspect what the engine wrote. On success the directory
|
||||
is automatically cleaned up.
|
||||
|
||||
## 7. PostgreSQL and Redis containers shared per-package
|
||||
|
||||
Both fixtures use `sync.Once` to start one testcontainer per test
|
||||
package, mirroring the
|
||||
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
|
||||
pattern. `TruncatePostgres` and `FlushRedis` reset state between
|
||||
tests so each scenario starts on an empty stack. The trade-off versus
|
||||
per-test containers is the standard one: container startup dominates
|
||||
the per-package latency, so amortising it across the suite keeps the
|
||||
loop tight while the truncate/flush ensures isolation. The ~1–2 s
|
||||
difference matters in CI.
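
The sharing pattern is roughly the following sketch, with the `start`
parameter standing in for the real testcontainers bring-up:

```go
package harness

import (
	"sync"
	"testing"
)

var (
	pgOnce sync.Once
	pgDSN  string
	pgErr  error
)

// PostgresDSN starts the package-wide PostgreSQL container on first use
// and hands every later test the same DSN; TruncatePostgres restores
// isolation between scenarios.
func PostgresDSN(t *testing.T, start func() (string, error)) string {
	t.Helper()
	pgOnce.Do(func() { pgDSN, pgErr = start() })
	if pgErr != nil {
		t.Fatalf("start postgres container: %v", pgErr)
	}
	return pgDSN
}
```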
|
||||
|
||||
## 8. Engine image cache is intentionally retained between runs
|
||||
|
||||
`buildAndTagEngineImage` runs once per package via `sync.Once` and
|
||||
leaves both image tags in the local Docker cache after the suite
|
||||
exits. The cache is a substantial speed-up on a developer laptop
|
||||
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
|
||||
hot), and a stale image is unlikely because the tags carry the
|
||||
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
|
||||
with multiple test runs. Operators who suspect a stale image can
|
||||
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
|
||||
the next run rebuilds.
|
||||
|
||||
## 9. Scenario coverage
|
||||
|
||||
The suite covers the four end-to-end flows operators care about:
|
||||
|
||||
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
|
||||
restart → patch → stop → cleanup. The intermediate `stop` between
|
||||
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
|
||||
to remove a running container per
|
||||
[`../README.md` §Cleanup](../README.md#cleanup).
|
||||
- **replay** (`replay_test.go`) — duplicate start / stop entries
|
||||
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
|
||||
- **health** (`health_test.go`) — external `docker rm` produces
|
||||
`container_disappeared`; manual `docker run` is adopted by the
|
||||
reconciler.
|
||||
- **notification** (`notification_test.go`) — unresolvable `image_ref`
|
||||
produces `runtime.image_pull_failed` plus a `failure` job_result.
|
||||
|
||||
## 10. Service-local scope only
|
||||
|
||||
This suite runs Runtime Manager against a real Docker daemon plus
|
||||
testcontainers PG / Redis but **does not** include any other Galaxy
|
||||
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
|
||||
in the top-level `galaxy/integration/` module, where the harness
|
||||
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.
|
||||
@@ -0,0 +1,531 @@
|
||||
# PostgreSQL Schema Decisions
|
||||
|
||||
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
|
||||
no Redis-only predecessor and no migration window. This document
|
||||
records the schema decisions and the non-obvious agreements behind
|
||||
them, mirroring the shape of
|
||||
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
|
||||
and serving the same role: a single coherent reference for "why does
|
||||
the persistence layer look this way".
|
||||
|
||||
Use this document together with the migration script
|
||||
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
and the runtime wiring
|
||||
[`../internal/app/runtime.go`](../internal/app/runtime.go).
|
||||
|
||||
## Outcomes
|
||||
|
||||
- Schema `rtmanager` (provisioned externally) holds the durable
|
||||
service state across three tables: `runtime_records`,
|
||||
`operation_log`, `health_snapshots`. The three tables map onto the
|
||||
three runtime concerns documented in
|
||||
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
|
||||
current state per game, audit trail per operation, and latest
|
||||
technical health observation per game.
|
||||
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
|
||||
applies embedded goose migrations strictly before any HTTP listener
|
||||
becomes ready, and exits non-zero when migration or ping fails.
|
||||
  A boot against an already-migrated schema exits zero — the
|
||||
`pkg/postgres`-supplied migrator treats "no work to do" as success.
|
||||
- The runtime opens one shared `*redis.Client` via
|
||||
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
|
||||
store, the per-game lease store, the consumer pipelines, and every
|
||||
publisher (`runtime:job_results`, `runtime:health_events`,
|
||||
`notification:intents`).
|
||||
- The Redis adapter package
|
||||
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
|
||||
owns one shared `Keyspace` struct with the
|
||||
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
|
||||
for stream offsets and the per-game lease.
|
||||
- Generated jet code under
|
||||
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
is committed; `make -C rtmanager jet` regenerates it via the
|
||||
testcontainers-driven `cmd/jetgen` pipeline.
|
||||
- Configuration uses the `RTMANAGER_` prefix for every variable.
|
||||
The schema-per-service rule from
|
||||
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
|
||||
applies: each service's role is grant-restricted to its own
|
||||
schema; RTM never touches Lobby's `lobby` schema or vice versa.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. One schema, externally-provisioned `rtmanagerservice` role
|
||||
|
||||
**Decision.** The `rtmanager` schema and the matching
|
||||
`rtmanagerservice` role are created outside the migration sequence
|
||||
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
|
||||
and by the integration harness; in production, by an ops init script
|
||||
not in scope for any service stage). The embedded migration
|
||||
`00001_init.sql` only contains DDL for the service-owned tables and
|
||||
indexes and assumes it runs as the schema owner with
|
||||
`search_path=rtmanager`.
|
||||
|
||||
**Why.** Mixing role creation, schema creation, and table DDL into
|
||||
one script forces every consumer of the migration to run as a
|
||||
superuser. The schema-per-service architectural rule
|
||||
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
|
||||
operational split: ops provisions roles and schemas, the service
|
||||
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
|
||||
from its runtime role would relax the
|
||||
"each service's role grants are restricted to its own schema"
|
||||
defense-in-depth rule.
|
||||
|
||||
### 2. `runtime_records.game_id` is the natural primary key
|
||||
|
||||
**Decision.** `runtime_records` uses
|
||||
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
|
||||
column carries a CHECK constraint enforcing the
|
||||
`running | stopped | removed` enum.
|
||||
|
||||
```sql
|
||||
CREATE TABLE runtime_records (
|
||||
game_id text PRIMARY KEY,
|
||||
status text NOT NULL,
|
||||
-- ...
|
||||
CONSTRAINT runtime_records_status_chk
|
||||
CHECK (status IN ('running', 'stopped', 'removed'))
|
||||
);
|
||||
```
|
||||
|
||||
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
|
||||
RTM stores at most one record per game ever. A surrogate
|
||||
`bigserial` would force every cross-service join to translate
|
||||
through a lookup table; the natural key keeps RTM's persistence
|
||||
layer pin-compatible with the streams contract (every
|
||||
`runtime:start_jobs` envelope already names the `game_id`). The
|
||||
status CHECK reproduces the Go-level enum from
|
||||
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
|
||||
as a defense-in-depth gate at the storage boundary. Decision context:
|
||||
[`domain-and-ports.md`](domain-and-ports.md).
|
||||
|
||||
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
|
||||
|
||||
**Decision.** `runtime_records_status_last_op_idx` is a composite
|
||||
index on `(status, last_op_at)`. The container cleanup worker scans
|
||||
`status='stopped' AND last_op_at < cutoff`; the
|
||||
`runtimerecordstore.ListByStatus` adapter method orders rows
|
||||
`last_op_at DESC, game_id ASC`.
|
||||
|
||||
```sql
|
||||
CREATE INDEX runtime_records_status_last_op_idx
|
||||
ON runtime_records (status, last_op_at);
|
||||
```
|
||||
|
||||
**Why.** Both read shapes share the same composite. The cleanup
|
||||
worker drives the index from one direction (range scan on
|
||||
`last_op_at` filtered by status); `ListByStatus` drives it from the
|
||||
other (equality on status, sorted by `last_op_at`). PostgreSQL
|
||||
satisfies both shapes through one index scan once the planner picks
|
||||
the index for the WHERE clause. The secondary `game_id ASC` tiebreak
|
||||
in the adapter ORDER BY is satisfied by primary-key ordering after
|
||||
the index returns the rows.
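
For reference, the two read shapes expressed as plain SQL embedded in
Go constants (illustrative; the committed adapter builds the
equivalent statements through jet per decision §13):

```go
package runtimerecordstore

// Illustrative SQL for the two read shapes that share the composite index.
const (
	// Cleanup-worker shape: range scan on last_op_at, filtered by status.
	cleanupScanSQL = `
SELECT game_id
  FROM rtmanager.runtime_records
 WHERE status = 'stopped' AND last_op_at < $1`

	// ListByStatus shape: equality on status, newest-first with a
	// game_id tiebreak.
	listByStatusSQL = `
SELECT game_id, status, last_op_at
  FROM rtmanager.runtime_records
 WHERE status = $1
 ORDER BY last_op_at DESC, game_id ASC`
)
```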
|
||||
|
||||
A second supporting index for the cleanup worker was considered and
|
||||
rejected: the workload is so small (single-instance v1, bounded
|
||||
running game count) that a single composite index is cheaper overall
than maintaining two narrow ones.
|
||||
|
||||
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
|
||||
|
||||
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
|
||||
and is written exclusively through INSERT — there is no UPDATE
|
||||
pathway, no soft-delete column, and no foreign key to
|
||||
`runtime_records`. The audit index
|
||||
`operation_log_game_started_idx (game_id, started_at DESC)` drives
|
||||
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
|
||||
results `started_at DESC, id DESC` and applies `LIMIT $2`.
|
||||
|
||||
```sql
|
||||
CREATE INDEX operation_log_game_started_idx
|
||||
ON operation_log (game_id, started_at DESC);
|
||||
```
|
||||
|
||||
**Why.** The audit's correctness invariant is "every operation RTM
|
||||
performed gets exactly one row"; CASCADE deletes from
|
||||
`runtime_records` would silently lose history when an admin removes
|
||||
a runtime and would break the
|
||||
[`../README.md` §Persistence Layout](../README.md) commitment. The
|
||||
secondary `id DESC` tiebreak inside the adapter is necessary because
|
||||
the audit log can write multiple rows in the same millisecond when
|
||||
`reconcile_adopt` and a real operation interleave on a single tick;
|
||||
without the tiebreak the test that asserts insertion-order-stable
|
||||
reads becomes flaky. A non-positive `limit` is rejected before the
|
||||
SQL is issued; an empty result set returns as `nil` (matching the
|
||||
lobby pattern, so service-layer callers can do `len(entries) == 0`
|
||||
without an extra allocation).
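
The audit read the index serves, again as an illustrative constant —
the committed adapter builds it with jet and rejects a non-positive
limit before issuing it:

```go
package operationlogstore

// Illustrative SQL for the GM/Admin audit read.
const listByGameSQL = `
SELECT id, game_id, op_kind, op_source, outcome, error_code,
       started_at, finished_at
  FROM rtmanager.operation_log
 WHERE game_id = $1
 ORDER BY started_at DESC, id DESC
 LIMIT $2`
```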
|
||||
|
||||
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
|
||||
|
||||
**Decision.** `operation_log` reproduces the three Go-level enums
|
||||
as CHECK constraints:
|
||||
|
||||
```sql
|
||||
CONSTRAINT operation_log_op_kind_chk
|
||||
CHECK (op_kind IN (
|
||||
'start', 'stop', 'restart', 'patch',
|
||||
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
|
||||
)),
|
||||
CONSTRAINT operation_log_op_source_chk
|
||||
CHECK (op_source IN (
|
||||
'lobby_stream', 'gm_rest', 'admin_rest',
|
||||
'auto_ttl', 'auto_reconcile'
|
||||
)),
|
||||
CONSTRAINT operation_log_outcome_chk
|
||||
CHECK (outcome IN ('success', 'failure'))
|
||||
```
|
||||
|
||||
The Go-level enums in
|
||||
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
|
||||
remain the source of truth.
|
||||
|
||||
**Why.** A defence-in-depth gate at the storage boundary catches any
|
||||
adapter regression that would otherwise persist an unexpected
|
||||
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
|
||||
benefit from the enum being verifiable directly in psql without
|
||||
consulting the Go source. Adding a new value requires editing two
|
||||
places (the Go enum and the migration), which is the right friction
|
||||
level: every new value is a wire-protocol change and deserves an
|
||||
explicit migration. The alternative of using PostgreSQL's `CREATE
|
||||
TYPE … AS ENUM` was rejected because adding a value to a PG enum
|
||||
type requires `ALTER TYPE` outside a transaction and complicates the
|
||||
single-init pre-launch policy (decision §12).
|
||||
|
||||
### 6. `health_snapshots` is one row per game; status enum collapses event types
|
||||
|
||||
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
|
||||
and stores the latest technical health observation per game. The
|
||||
`status` column enumerates the **observed engine state**, not the
|
||||
**triggering event type**:
|
||||
|
||||
```sql
|
||||
CONSTRAINT health_snapshots_status_chk
|
||||
CHECK (status IN (
|
||||
'healthy', 'probe_failed', 'exited',
|
||||
'oom', 'inspect_unhealthy', 'container_disappeared'
|
||||
))
|
||||
```
|
||||
|
||||
The `runtime:health_events` `event_type` enum has seven values
|
||||
(`container_started`, `container_exited`, `container_oom`,
|
||||
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
|
||||
`probe_recovered`). The snapshot status has six — the two probe
|
||||
events fold into `healthy` (after `probe_recovered`) and
|
||||
`probe_failed`, and `container_started` collapses into `healthy`.
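
A sketch of the fold, assuming plain string enums (the committed
mapping lives in the publisher referenced below and may use typed
constants):

```go
package healtheventspublisher

// snapshotStatus folds the seven stream event types into the six
// stored status values.
func snapshotStatus(eventType string) string {
	switch eventType {
	case "container_started", "probe_recovered":
		return "healthy"
	case "probe_failed":
		return "probe_failed"
	case "container_exited":
		return "exited"
	case "container_oom":
		return "oom"
	case "inspect_unhealthy":
		return "inspect_unhealthy"
	case "container_disappeared":
		return "container_disappeared"
	default:
		return "" // unknown event types never reach the snapshot upsert
	}
}
```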
|
||||
|
||||
**Why.** Health snapshots answer "what state is the engine in
|
||||
**right now**", not "what event was just emitted". A consumer who
|
||||
wants the event firehose reads `runtime:health_events`; a consumer
|
||||
who wants the latest verdict reads `health_snapshots`. The two
|
||||
surfaces have different lifetimes (stream entries are bounded only
|
||||
by Redis trim; snapshot rows are overwritten on every new
|
||||
observation), so collapsing the seven event types into six status
|
||||
states aligns the column with the consumer's mental model. The
|
||||
adapter that implements this collapse lives in
|
||||
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
|
||||
every emission to the stream also upserts the snapshot.
|
||||
|
||||
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
|
||||
|
||||
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
|
||||
guard into a single `WHERE … AND …` clause. Status must equal the
|
||||
caller's `ExpectedFrom`; when the caller supplies a non-empty
|
||||
`ExpectedContainerID`, `current_container_id` must equal it as
|
||||
well:
|
||||
|
||||
```sql
|
||||
UPDATE rtmanager.runtime_records
|
||||
SET status = $1, last_op_at = $2, ...
|
||||
WHERE game_id = $3
|
||||
AND status = $4
|
||||
[AND current_container_id = $5]
|
||||
```
|
||||
|
||||
A `RowsAffected() == 0` result is ambiguous — the row may be absent
|
||||
or the predicate may have failed. The adapter resolves the ambiguity
|
||||
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
|
||||
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
|
||||
The probe runs only on the slow path; happy-path UPDATEs cost a
|
||||
single round trip.
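
A sketch of the slow-path probe, with the sentinels inlined so the
example is self-contained (the real ones are `runtime.ErrNotFound` and
`runtime.ErrConflict`):

```go
package runtimerecordstore

import (
	"context"
	"database/sql"
	"errors"
)

// Sentinels shown inline for the sketch only.
var (
	errNotFound = errors.New("runtime record not found")
	errConflict = errors.New("runtime record conflict")
)

// resolveZeroRows runs only after the guarded UPDATE reported zero
// affected rows and decides which sentinel the caller receives.
func resolveZeroRows(ctx context.Context, db *sql.DB, gameID string) error {
	var current string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		gameID).Scan(&current)
	switch {
	case errors.Is(err, sql.ErrNoRows):
		return errNotFound // row absent
	case err != nil:
		return err
	default:
		return errConflict // row present, CAS predicate failed
	}
}
```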
|
||||
|
||||
**Why.** The two-axis CAS is what services need: a stop driven by an
|
||||
old container_id (from a stale REST request) must not clobber a
|
||||
fresh `running` record installed by a concurrent restart. Status-only
|
||||
CAS would collapse those two cases. The optional shape on
|
||||
`ExpectedContainerID` lets reconciliation flows that legitimately
|
||||
target "this game in `running` state without caring which container"
|
||||
omit the second predicate. The follow-up probe matches the
|
||||
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
|
||||
and produces clean per-error sentinels at the service layer.
|
||||
|
||||
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
|
||||
eight goroutines racing the same transition: exactly one returns
|
||||
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
|
||||
because PostgreSQL serialises row-level UPDATEs through the row's
|
||||
MVCC tuple.
|
||||
|
||||
### 8. Destination-driven `SET` clause on `UpdateStatus`
|
||||
|
||||
**Decision.** `UpdateStatus` updates a different column subset
|
||||
depending on the destination status:
|
||||
|
||||
| Destination | Columns set |
|
||||
| --- | --- |
|
||||
| `stopped` | `status`, `last_op_at`, `stopped_at` |
|
||||
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
|
||||
| `running` | `status`, `last_op_at` |
|
||||
|
||||
The implementation switches on `input.To` and writes the UPDATE
|
||||
chain inline per branch — three short branches read better than one
|
||||
parametric helper.
|
||||
|
||||
**Why.** Each destination has a different invariant. `stopped`
|
||||
records the wall-clock at which the engine ceased serving; `removed`
|
||||
nulls the container_id because the row no longer points at any
|
||||
Docker resource; `running` only updates the status and the
|
||||
last-op timestamp because the running invariants
|
||||
(`current_container_id`, fresh `started_at`, `current_image_ref`,
|
||||
`engine_endpoint`) are installed through `Upsert` on the `start`
|
||||
path.
|
||||
|
||||
A previous draft built the SET list via `[]pg.Column` / `[]any`
|
||||
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
|
||||
variadic refuses a `[]postgres.Column` slice spread because the
|
||||
element type does not match `jet.Column` after the type-alias
|
||||
resolution. The final code switches inline per branch.
|
||||
|
||||
The `running` destination is implemented even though the start
|
||||
service uses `Upsert` for the inner start of restart and patch.
|
||||
Keeping the `running` path live preserves a one-to-one match between
|
||||
`runtime.AllowedTransitions()` and the adapter's capability matrix —
|
||||
otherwise a future caller exercising the `stopped → running`
|
||||
transition through `UpdateStatus` would hit a runtime error inside
|
||||
the adapter rather than a domain rejection. The path only updates
|
||||
`status` and `last_op_at`; callers responsible for the running
|
||||
invariants install them through `Upsert` first.
|
||||
|
||||
### 9. `created_at` preservation on `Upsert`
|
||||
|
||||
**Decision.** `runtimerecordstore.Upsert` is implemented as
|
||||
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
|
||||
column from EXCLUDED>` — `created_at` is deliberately omitted from
|
||||
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
|
||||
value never overwrites the stored timestamp.
|
||||
|
||||
```sql
|
||||
INSERT INTO rtmanager.runtime_records (...)
|
||||
VALUES (...)
|
||||
ON CONFLICT (game_id) DO UPDATE
|
||||
SET status = EXCLUDED.status,
|
||||
current_container_id = EXCLUDED.current_container_id,
|
||||
current_image_ref = EXCLUDED.current_image_ref,
|
||||
engine_endpoint = EXCLUDED.engine_endpoint,
|
||||
state_path = EXCLUDED.state_path,
|
||||
docker_network = EXCLUDED.docker_network,
|
||||
started_at = EXCLUDED.started_at,
|
||||
stopped_at = EXCLUDED.stopped_at,
|
||||
removed_at = EXCLUDED.removed_at,
|
||||
last_op_at = EXCLUDED.last_op_at
|
||||
-- created_at intentionally NOT updated
|
||||
```
|
||||
|
||||
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
|
||||
invariant.
|
||||
|
||||
**Why.** `runtime_records.created_at` records "first time RTM saw
|
||||
the game". Every restart and every reconcile_adopt re-Upserts the
|
||||
row with the current wall-clock as `CreatedAt` from the adapter
|
||||
boundary; without the omission rule the timestamp would drift
|
||||
forward. Preserving the original creation time keeps a stable
|
||||
horizon for retention reasoning and matches
|
||||
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
|
||||
same approach for the `games.created_at` column.
|
||||
|
||||
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
|
||||
|
||||
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
|
||||
'{}'::jsonb`. The jet-generated model declares
|
||||
`Details string` (jet maps `jsonb` to `string`). The adapter:
|
||||
|
||||
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
|
||||
`snapshot.Details` is empty, so the column never holds a non-JSON
|
||||
empty string;
|
||||
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
|
||||
`json.RawMessage` so the caller receives verbatim bytes without
|
||||
an extra round of parsing.
|
||||
|
||||
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
|
||||
`TestUpsertAndGetRoundTrip` cover the two cases.
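
Both behaviours sketched with hypothetical helper names:

```go
package healthsnapshotstore

import "encoding/json"

// detailsParam is the Upsert-side guard: an empty payload becomes the
// same empty object the column DEFAULT would supply, never ''.
func detailsParam(details json.RawMessage) string {
	if len(details) == 0 {
		return "{}"
	}
	return string(details)
}

// detailsFromRow is the Get-side wrap: hand the caller the verbatim
// bytes without re-parsing them.
func detailsFromRow(raw []byte) json.RawMessage {
	return json.RawMessage(raw)
}
```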
|
||||
|
||||
**Why.** The detail payload is type-specific (the keys differ
|
||||
between `probe_failed` and `inspect_unhealthy`) and is opaque to
|
||||
queries — the column is never element-filtered. JSONB matches the
|
||||
"everything outside primary fields is JSON" pattern that the
|
||||
Notification Service already established and allows a future
|
||||
GIN index (e.g. for an admin search-by-key feature) without a
|
||||
schema rewrite. Substituting the SQL DEFAULT for an empty
|
||||
parameter avoids the trap where the database accepts `''` for
|
||||
`text` but rejects it for `jsonb`.
|
||||
|
||||
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
|
||||
|
||||
**Decision.** Every time-valued column on every RTM table uses
|
||||
PostgreSQL's `timestamptz`. The domain model continues to use
|
||||
`time.Time`; the adapter normalises every `time.Time` parameter to
|
||||
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
|
||||
helper that wraps a possibly-zero `time.Time`), and re-wraps every
|
||||
scanned `time.Time` with `.UTC()` (directly or via
|
||||
`timeFromNullable` for nullable columns) before the value leaves
|
||||
the adapter.
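
The two helpers named above have roughly this shape (a sketch; the
exact signatures in the committed adapter may differ):

```go
package runtimerecordstore

import (
	"database/sql"
	"time"
)

// nullableTime binds a possibly-zero time.Time as NULL, otherwise as UTC.
func nullableTime(t time.Time) any {
	if t.IsZero() {
		return nil
	}
	return t.UTC()
}

// timeFromNullable re-wraps a scanned nullable timestamp in UTC before
// it leaves the adapter.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC()
}
```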
|
||||
|
||||
The architecture-wide form of this rule lives in
|
||||
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
|
||||
|
||||
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model needs
|
||||
a `time.Time` API the service layer can compare and do arithmetic on.
|
||||
Without explicit `.UTC()` on the bind site, the pgx driver returns
|
||||
scanned values in `time.Local`, which silently breaks equality
|
||||
tests, JSON formatting, and comparison against pointer fields
|
||||
elsewhere in the codebase. The defensive `.UTC()` rule on both
|
||||
sides eliminates the class of bug where a timezone difference
|
||||
between the adapter and the test harness flips assertions
|
||||
intermittently.
|
||||
|
||||
The same shape is used in User Service, Mail Service, and
|
||||
Notification Service — RTM matches the existing convention rather
|
||||
than introducing a fourth encoding path.
|
||||
|
||||
### 12. Single-init pre-launch policy
|
||||
|
||||
**Decision.** `00001_init.sql` evolves in place until first
|
||||
production deploy. Adding a column, an index, or a new table during
|
||||
the pre-launch development window edits this file directly rather
|
||||
than producing `00002_*.sql`. The runtime applies the migration on
|
||||
every boot; if the schema is already at head, `pkg/postgres`'s
|
||||
goose adapter exits zero.
|
||||
|
||||
**Why.** The schema-per-service architectural rule
|
||||
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
|
||||
endorses a single-init policy for pre-launch services. The
|
||||
pre-launch window allows non-additive changes (column rename, type
|
||||
narrowing, CHECK tightening) that a multi-step migration sequence
|
||||
would force into awkward two-step rewrites. Once the service ships
|
||||
to production, the next schema change becomes `00002_*.sql` and
|
||||
the policy lifts; from that point onward edits to `00001_init.sql`
|
||||
are rejected by code review.
|
||||
|
||||
This applies to RTM exactly the same way it applies to every other
|
||||
PG-backed service in the workspace; the README explicitly carries
|
||||
the reminder. The exit-zero behaviour for already-applied
|
||||
migrations is what makes the policy operationally cheap: a
|
||||
freshly-spawned replica re-applies the same `00001_init.sql` with
|
||||
no work to do, no logged error, and proceeds to open its
|
||||
listeners.
|
||||
|
||||
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
|
||||
|
||||
**Decision.** All three RTM PG-store packages
|
||||
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
|
||||
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
|
||||
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
|
||||
build SQL through the jet builder API
|
||||
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
|
||||
`pg.AND/OR/SET/COALESCE/...` DSL).
|
||||
|
||||
Generated table models live under
|
||||
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
and are regenerated by `make -C rtmanager jet`. The target invokes
|
||||
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
|
||||
transient PostgreSQL container via testcontainers, provisions the
|
||||
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
|
||||
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
|
||||
against the provisioned schema. Generated code is committed to the
|
||||
repo, so build consumers do not need Docker.
|
||||
|
||||
Statements are run through the `database/sql` API
|
||||
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
|
||||
helpers preserve the codecs.go boundary translations and
|
||||
domain-type mapping (status enum decoding, `time.Time` UTC
|
||||
normalisation, JSONB `[]byte` ↔ `json.RawMessage`).
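
An illustrative end-to-end read in that style; the generated table and
column identifiers are assumptions based on the schema, and the import
path is a placeholder:

```go
package runtimerecordstore

import (
	"context"
	"database/sql"

	pg "github.com/go-jet/jet/v2/postgres"

	// Generated models; the alias and path are illustrative.
	pgtable "example.invalid/rtmanager/internal/adapters/postgres/jet/table"
)

// getStatus sketches the builder → database/sql flow: build the
// statement with jet, render it with Sql(), run it through
// QueryRowContext, and do the domain mapping by hand.
func getStatus(ctx context.Context, db *sql.DB, gameID string) (string, error) {
	stmt := pg.SELECT(pgtable.RuntimeRecords.Status).
		FROM(pgtable.RuntimeRecords).
		WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))

	query, args := stmt.Sql()

	var status string
	if err := db.QueryRowContext(ctx, query, args...).Scan(&status); err != nil {
		return "", err
	}
	return status, nil
}
```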
|
||||
|
||||
PostgreSQL constructs that the jet builder does not cover natively
|
||||
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
|
||||
through the per-DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct
|
||||
`[]byte`/string params for JSONB columns).
|
||||
|
||||
**Why.** Aligns with the workspace-wide convention from
|
||||
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
|
||||
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
|
||||
service. Hand-rolled SQL would multiply boundary-translation paths
|
||||
and require per-store query-builder helpers for what jet already
|
||||
covers. Committing generated code keeps `go build ./...` working
|
||||
without Docker.
|
||||
|
||||
### 14. `redisstate` keyspace ownership and per-store subpackages
|
||||
|
||||
**Decision.** The
|
||||
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
|
||||
package owns one shared `Keyspace` struct with a
|
||||
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
|
||||
lives in its own subpackage:
|
||||
|
||||
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
|
||||
for the stream offset store consumed by the start-jobs and
|
||||
stop-jobs consumers;
|
||||
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
|
||||
for the per-game lease store consumed by every lifecycle service
|
||||
and the reconciler.
|
||||
|
||||
Both subpackages take a `redisstate.Keyspace{}` value and use it to
|
||||
build their key shapes (`rtmanager:stream_offsets:{label}`,
|
||||
`rtmanager:game_lease:{game_id}`).
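
A sketch of the ownership split — the `Keyspace` name and prefix come
from this document, the method names are invented for illustration:

```go
package redisstate

const defaultPrefix = "rtmanager:"

// Keyspace centralises every rtmanager:* key shape consumed by the
// per-store subpackages.
type Keyspace struct {
	prefix string
}

func NewKeyspace() Keyspace { return Keyspace{prefix: defaultPrefix} }

// StreamOffset yields rtmanager:stream_offsets:{label}.
func (k Keyspace) StreamOffset(label string) string {
	return k.prefix + "stream_offsets:" + label
}

// GameLease yields rtmanager:game_lease:{game_id}.
func (k Keyspace) GameLease(gameID string) string {
	return k.prefix + "game_lease:" + gameID
}
```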
|
||||
|
||||
**Why.** Keeping the parent package as the single owner of the prefix
|
||||
and the key-shape builder mirrors the way Lobby's `redisstate`
|
||||
namespace centralises every key shape and supports multiple Redis-
|
||||
backed adapters (stream offsets, the per-game lease) without a
|
||||
restructure as the surface grows.
|
||||
|
||||
The per-store subpackage choice (rather than Lobby's flat
|
||||
single-package shape) is driven by three considerations:
|
||||
|
||||
- It keeps mock generation scoped to one package, since
|
||||
`mockgen` regenerates per-directory.
|
||||
- It allows finer-grained dependency selection: `miniredis` is a
|
||||
dev-only dep, and keeping the `streamoffsets` package
|
||||
self-contained leaves room for `gamelease` to depend only on the
|
||||
production `redis` client.
|
||||
- Each subpackage carries its own tests, which keeps the test
|
||||
surface focused on one Redis primitive rather than mixing offset
|
||||
semantics with lease semantics in shared fixtures.
|
||||
|
||||
## Cross-References
|
||||
|
||||
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
— the embedded schema migration.
|
||||
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
|
||||
— `//go:embed *.sql` and `FS()` exporter consumed by the runtime.
|
||||
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
|
||||
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
|
||||
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
|
||||
— the three jet-backed PG adapters and their testcontainers-driven
|
||||
unit suites.
|
||||
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
— committed generated jet models.
|
||||
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
|
||||
[`../Makefile`](../Makefile) `jet` target — the regeneration
|
||||
pipeline.
|
||||
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
|
||||
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
|
||||
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
|
||||
— Redis adapter package layout.
|
||||
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
|
||||
— runtime wiring: PG pool open + migration apply + Redis client
|
||||
open + adapter assembly.
|
||||
- [`../internal/config/`](../internal/config) — the config groups
|
||||
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
|
||||
`Coordination`).
|
||||
- Companion design rationales:
|
||||
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
|
||||
domain shape, [`adapters.md`](adapters.md) for the redisstate
|
||||
publishers and clients.
|
||||
@@ -0,0 +1,368 @@
|
||||
# Operator Runbook
|
||||
|
||||
This runbook covers the checks that matter most during startup,
|
||||
steady-state readiness, shutdown, and the handful of recovery paths
|
||||
specific to Runtime Manager.
|
||||
|
||||
## Startup Checks
|
||||
|
||||
Before starting the process, confirm:
|
||||
|
||||
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
|
||||
reaches a Docker daemon the operator controls. RTM is the only
|
||||
Galaxy service permitted to interact with the Docker socket;
|
||||
scoping the daemon to RTM-only callers is operator domain.
|
||||
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
|
||||
user-defined bridge network that has already been created (e.g.
|
||||
via `docker network create galaxy-net` in the environment's
|
||||
bootstrap script). RTM **validates** the network at startup but
|
||||
never creates it. A missing network is fail-fast and the process
|
||||
exits non-zero before opening any listener.
|
||||
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
|
||||
can read and write. Per-game subdirectories are created with
|
||||
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
|
||||
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
|
||||
uid/gid to match the engine container's user when running with a
|
||||
non-root engine.
|
||||
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
|
||||
that hosts the `rtmanager` schema. The DSN must include
|
||||
`search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
|
||||
for production). Embedded goose migrations apply at startup before
|
||||
any HTTP listener opens; a migration or ping failure terminates the
|
||||
process with a non-zero exit. The `rtmanager` schema and the
|
||||
matching `rtmanagerservice` role are provisioned externally
|
||||
([`postgres-migration.md` §1](postgres-migration.md)).
|
||||
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
|
||||
the Redis deployment used for the runtime-coordination state:
|
||||
stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
|
||||
publishers (`runtime:job_results`, `runtime:health_events`,
|
||||
`notification:intents`), persisted offsets, and the per-game
|
||||
lease. RTM does not maintain durable business state on Redis.
|
||||
- Stream names match the producers and consumers RTM integrates with:
|
||||
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
|
||||
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
|
||||
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
|
||||
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
|
||||
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
|
||||
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
|
||||
HTTP listener. RTM's start service issues a diagnostic
|
||||
`GET /api/v1/internal/games/{game_id}` per start; failure is logged
|
||||
at debug and does not abort the start
|
||||
([`services.md` §7](services.md)).
|
||||
|
||||
The startup sequence runs in the order recorded in
|
||||
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):
|
||||
|
||||
1. PostgreSQL primary opens; goose migrations apply synchronously.
|
||||
2. Redis master client opens and pings.
|
||||
3. Docker daemon ping; configured network presence check.
|
||||
4. Telemetry exporter (OTLP grpc/http or stdout).
|
||||
5. Internal HTTP listener.
|
||||
6. Reconciler runs **once synchronously** and blocks until done.
|
||||
7. Background workers start.
|
||||
|
||||
A failure at any step is fatal. The synchronous reconciler pass is
|
||||
the reason orphaned containers from a prior process never reach the
|
||||
periodic workers in an inconsistent state
|
||||
([`workers.md` §17](workers.md)).
|
||||
|
||||
Expected log lines on a healthy boot:
|
||||
|
||||
- `migrations applied`,
|
||||
- `postgres ping ok`,
|
||||
- `redis ping ok`,
|
||||
- `docker ping ok` and `docker network found`,
|
||||
- `telemetry exporter started`,
|
||||
- `internal http listening`,
|
||||
- `reconciler initial pass completed`,
|
||||
- one `worker started` entry per background worker (seven expected).
|
||||
|
||||
## Readiness
|
||||
|
||||
Use the probes according to what they actually verify:
|
||||
|
||||
- `GET /healthz` confirms the listener is alive — no dependency
|
||||
check.
|
||||
- `GET /readyz` live-pings PostgreSQL primary, Redis master, and the
|
||||
Docker daemon, then asserts the configured Docker network exists.
|
||||
Returns `{"status":"ready"}` when every check passes; otherwise
|
||||
returns `503` with the canonical
|
||||
`{"error":{"code":"service_unavailable","message":"…"}}` envelope
|
||||
identifying the first failing dependency.
|
||||
|
||||
`/readyz` is the strongest readiness signal RTM exposes; unlike
|
||||
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
|
||||
Each request hits the daemon and the database fresh.
|
||||
|
||||
For a practical readiness check in production:
|
||||
|
||||
1. confirm the process emitted the listener and worker startup logs;
|
||||
2. check `GET /healthz` and `GET /readyz`;
|
||||
3. verify `rtmanager.runtime_records_by_status{status="running"}`
|
||||
gauge tracks the expected live game count after the first start
|
||||
completes;
|
||||
4. verify `rtmanager.docker_op_latency` histograms have at least one
|
||||
sample after the first lifecycle operation.
|
||||
|
||||
## Shutdown
|
||||
|
||||
The process handles `SIGINT` and `SIGTERM`.
|
||||
|
||||
Shutdown behaviour:
|
||||
|
||||
- the per-component shutdown budget is controlled by
|
||||
`RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
|
||||
- the internal HTTP listener drains in-flight requests before closing;
|
||||
- stream consumers stop their `XREAD` loops and persist the latest
|
||||
offset before returning; the offset survives the restart
|
||||
([`workers.md` §9](workers.md));
|
||||
- the Docker events listener cancels its subscription;
|
||||
- the in-flight services release their per-game lease through the
|
||||
surrounding context cancellation;
|
||||
- the reconciler completes its current pass or aborts mid-write at
|
||||
the next lease re-acquisition.
|
||||
|
||||
During planned restarts:
|
||||
|
||||
1. send `SIGTERM`;
|
||||
2. wait for the listener and component-stop logs;
|
||||
3. expect any consumer that was mid-cycle to retry from the persisted
|
||||
offset on the next process start;
|
||||
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
|
||||
|
||||
## Engine Container Died
|
||||
|
||||
A running engine container that exits unexpectedly surfaces through
|
||||
three observation channels:
|
||||
|
||||
- The Docker events listener emits `container_exited` (non-zero exit
|
||||
code) or `container_oom` (Docker action `oom`).
|
||||
- The active probe worker eventually emits `probe_failed` once the
|
||||
threshold is crossed.
|
||||
- The Docker inspect worker may emit `inspect_unhealthy` if the
|
||||
engine restarts under Docker's healthcheck or if Docker reports an
|
||||
unexpected status.
|
||||
|
||||
Triage:
|
||||
|
||||
1. Inspect the `runtime:health_events` stream for the affected
|
||||
`game_id` and `event_type`:
|
||||
```bash
|
||||
redis-cli XRANGE runtime:health_events - + COUNT 200 \
|
||||
| grep -A4 'game_id\s*<game_id>'
|
||||
```
|
||||
2. Read the runtime record and the operation log:
|
||||
```bash
|
||||
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT id, op_kind, op_source, outcome, error_code, started_at
|
||||
FROM rtmanager.operation_log
|
||||
WHERE game_id = '<game_id>'
|
||||
ORDER BY started_at DESC LIMIT 20"
|
||||
```
|
||||
3. If Lobby has not reacted (the game's status remains `running` in
|
||||
`lobby.games`), check `runtime:job_results` lag and Lobby's
|
||||
`runtimejobresult` worker. RTM publishes the result; Lobby is the
|
||||
consumer.
|
||||
4. If the container is already gone (`docker ps -a` shows no row for
|
||||
`galaxy-game-<game_id>`), the reconciler will move the record to
|
||||
   `removed` on its next pass. Triggering the reconcile manually (for
   example by sending `SIGHUP`) is **not** supported — wait
|
||||
`RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the
|
||||
process; the synchronous boot pass will handle the drift.
|
||||
5. The `notification:intents` stream is **not** the place to look
|
||||
for ongoing health changes. Only the three first-touch start
|
||||
failures (`runtime.image_pull_failed`,
|
||||
`runtime.container_start_failed`,
|
||||
`runtime.start_config_invalid`) produce a notification intent;
|
||||
probe failures, OOMs, and exits flow through health events only
|
||||
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
|
||||
|
||||
## Patch Upgrade
|
||||
|
||||
A patch upgrade replaces the container with a new `image_ref` while
|
||||
preserving the bind-mounted state directory.
|
||||
|
||||
Pre-conditions:
|
||||
|
||||
- The new and current `image_ref` tags both parse as semver. RTM
|
||||
rejects non-semver tags with `image_ref_not_semver`.
|
||||
- The new and current major / minor versions match. A cross-major or
|
||||
cross-minor patch returns `semver_patch_only`.
|
||||
|
||||
Driving the upgrade:
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
|
||||
-d '{"image_ref": "galaxy/game:1.4.2"}'
|
||||
```
|
||||
|
||||
Behaviour:
|
||||
|
||||
- The container is stopped, removed, and recreated. The
|
||||
`current_container_id` changes; the `engine_endpoint`
|
||||
(`http://galaxy-game-<game_id>:8080`) is stable.
|
||||
- The engine reads its state from the bind mount on startup, so any
|
||||
data written before the patch survives.
|
||||
- A single `operation_log` row is appended with `op_kind=patch` and
|
||||
the old / new image refs.
|
||||
- A `runtime:health_events container_started` is emitted by the
|
||||
inner start ([`workers.md` §1](workers.md)).
|
||||
|
||||
Post-patch verification:
|
||||
|
||||
```bash
|
||||
curl -s http://galaxy-game-<game_id>:8080/healthz
|
||||
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
|
||||
```
|
||||
|
||||
The `current_image_ref` field on the runtime record reflects the new
|
||||
tag.
|
||||
|
||||
## Manual Cleanup
|
||||
|
||||
The cleanup endpoint removes the container and updates the record to
|
||||
`removed`. It refuses to remove a `running` container — stop first.
|
||||
|
||||
```bash
|
||||
# Stop, then clean up
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
|
||||
-d '{"reason":"admin_request"}'
|
||||
|
||||
curl -s -X DELETE \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
|
||||
```
|
||||
|
||||
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
|
||||
is **never** deleted by RTM. Removing the directory is operator
|
||||
domain (backup tooling, future Admin Service workflow). The
|
||||
operation_log records `op_kind=cleanup_container` with
|
||||
`op_source=admin_rest`.
|
||||
|
||||
## Reconcile Drift After Docker Daemon Restart
|
||||
|
||||
A Docker daemon restart drops every running engine container; PG
|
||||
records remain. On RTM's next boot (or its next periodic reconcile):
|
||||
|
||||
1. The reconciler observes `running` records whose containers are
|
||||
missing from `docker ps`. It updates each record to `removed`,
|
||||
appends `operation_log` with `op_kind=reconcile_dispose`, and
|
||||
publishes `runtime:health_events container_disappeared`
|
||||
([`workers.md` §14–§15](workers.md)).
|
||||
2. Lobby's `runtimejobresult` worker does not consume the dispose
|
||||
event in v1, so the cascade does not auto-restart the engine.
|
||||
Operators trigger restarts through Lobby's user-facing flow or
|
||||
directly via the GM/Admin REST `restart` endpoint.
|
||||
3. If the operator brings up an engine container manually for
|
||||
diagnostics (`docker run` with the
|
||||
`com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
|
||||
the reconciler **adopts** it on the next pass: a new
|
||||
`runtime_records` row appears with `op_kind=reconcile_adopt`.
|
||||
The reconciler **never stops or removes** an unrecorded
|
||||
container — operators stay in control of manual containers
|
||||
([`../README.md` §Reconciliation](../README.md#reconciliation)).
|
||||
|
||||
Three drift kinds run through the same lease-guarded write pass:
|
||||
`adopt`, `dispose`, and the README-level path
|
||||
`observed_exited` (a record marked `running` whose container exists
|
||||
but is in `exited`). Telemetry counter
|
||||
`rtmanager.reconcile_drift{kind}` exposes the three independently
|
||||
([`workers.md` §15](workers.md)).
|
||||
|
||||
## Testing Locally
|
||||
|
||||
```sh
|
||||
# One-time bootstrap
|
||||
docker network create galaxy-net
|
||||
|
||||
# Minimal env (see docs/examples.md for a complete .env)
|
||||
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
|
||||
export RTMANAGER_DOCKER_NETWORK=galaxy-net
|
||||
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
|
||||
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
|
||||
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
|
||||
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
|
||||
export RTMANAGER_REDIS_PASSWORD=local
|
||||
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
|
||||
|
||||
go run ./rtmanager/cmd/rtmanager
|
||||
```
|
||||
|
||||
After start:
|
||||
|
||||
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
|
||||
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
|
||||
once PG, Redis, and Docker pings pass and the configured network
|
||||
exists;
|
||||
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
|
||||
brings up `galaxy-game-<game_id>` containers; RTM logs each
|
||||
lifecycle transition.
|
||||
|
||||
The integration suite under `rtmanager/integration/` exercises the
|
||||
end-to-end flows against the real Docker daemon. The default
|
||||
`go test ./...` skips it via the `integration` build tag; run
|
||||
explicitly with:
|
||||
|
||||
```sh
|
||||
make -C rtmanager integration
|
||||
```
|
||||
|
||||
The suite requires a reachable Docker daemon. Without one, the
|
||||
harness helpers call `t.Skip` and the package becomes a no-op
|
||||
([`integration-tests.md` §1](integration-tests.md)).
|
||||
|
||||
## Diagnostic Queries
|
||||
|
||||
Durable runtime state lives in PostgreSQL; runtime-coordination state
|
||||
stays in Redis. CLI snippets that help during incidents:
|
||||
|
||||
```bash
|
||||
# Live runtime count by status (PostgreSQL)
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
|
||||
|
||||
# Inspect a specific runtime record
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"
|
||||
|
||||
# Last 20 operations for a game (newest first)
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT id, op_kind, op_source, outcome, error_code,
|
||||
started_at, finished_at
|
||||
FROM rtmanager.operation_log
|
||||
WHERE game_id = '<game_id>'
|
||||
ORDER BY started_at DESC, id DESC
|
||||
LIMIT 20"
|
||||
|
||||
# Latest health snapshot
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"
|
||||
|
||||
# Containers RTM owns (Docker)
|
||||
docker ps --filter label=com.galaxy.owner=rtmanager \
|
||||
--format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'
|
||||
|
||||
# Stream lag (Redis)
|
||||
redis-cli XINFO STREAM runtime:start_jobs
|
||||
redis-cli XINFO STREAM runtime:stop_jobs
|
||||
redis-cli GET rtmanager:stream_offsets:startjobs
|
||||
redis-cli GET rtmanager:stream_offsets:stopjobs
|
||||
|
||||
# Recent health events (oldest first)
|
||||
redis-cli XRANGE runtime:health_events - + COUNT 100
|
||||
|
||||
# Per-game lease (only present while an operation runs)
|
||||
redis-cli GET rtmanager:game_lease:<game_id>
|
||||
redis-cli TTL rtmanager:game_lease:<game_id>
|
||||
```
|
||||
|
||||
The gauges and counters surfaced through OpenTelemetry are the
primary observability surface; raw PostgreSQL
|
||||
and Redis access is for last-resort triage.
|
||||
@@ -0,0 +1,309 @@
|
||||
# Runtime and Components
|
||||
|
||||
The diagram below focuses on the deployed `galaxy/rtmanager` process
|
||||
and its runtime dependencies. The current-state contract for every
|
||||
listener, worker, and adapter lives in [`../README.md`](../README.md);
|
||||
this document is the navigation aid that points at the right code path
|
||||
and the right design-rationale record.
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph Clients
|
||||
GM["Game Master"]
|
||||
Admin["Admin Service"]
|
||||
Lobby["Game Lobby"]
|
||||
end
|
||||
|
||||
subgraph RTM["Runtime Manager process"]
|
||||
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
|
||||
StartJobs["startjobsconsumer"]
|
||||
StopJobs["stopjobsconsumer"]
|
||||
DockerEvents["dockerevents listener"]
|
||||
HealthProbe["healthprobe worker"]
|
||||
DockerInspect["dockerinspect worker"]
|
||||
Reconcile["reconcile worker"]
|
||||
Cleanup["containercleanup worker"]
|
||||
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
|
||||
IntentPublisher["notification:intents publisher"]
|
||||
ResultsPublisher["runtime:job_results publisher"]
|
||||
HealthPublisher["runtime:health_events publisher"]
|
||||
Telemetry["Logs, traces, metrics"]
|
||||
end
|
||||
|
||||
Docker["Docker Daemon"]
|
||||
Engine["galaxy-game-{game_id} container"]
|
||||
Postgres["PostgreSQL\nschema rtmanager"]
|
||||
Redis["Redis\nstreams + leases + offsets"]
|
||||
LobbyHTTP["Lobby internal HTTP"]
|
||||
|
||||
Lobby -. runtime:start_jobs .-> StartJobs
|
||||
Lobby -. runtime:stop_jobs .-> StopJobs
|
||||
GM --> InternalHTTP
|
||||
Admin --> InternalHTTP
|
||||
|
||||
StartJobs --> Services
|
||||
StopJobs --> Services
|
||||
InternalHTTP --> Services
|
||||
|
||||
Services --> Docker
|
||||
Services --> Postgres
|
||||
Services --> Redis
|
||||
Services --> ResultsPublisher
|
||||
Services --> HealthPublisher
|
||||
Services --> IntentPublisher
|
||||
Services -. GET diagnostic .-> LobbyHTTP
|
||||
|
||||
DockerEvents --> Docker
|
||||
DockerInspect --> Docker
|
||||
HealthProbe --> Engine
|
||||
Reconcile --> Docker
|
||||
Reconcile --> Postgres
|
||||
Cleanup --> Postgres
|
||||
Cleanup --> Services
|
||||
|
||||
DockerEvents --> HealthPublisher
|
||||
DockerInspect --> HealthPublisher
|
||||
HealthProbe --> HealthPublisher
|
||||
|
||||
HealthPublisher --> Redis
|
||||
ResultsPublisher --> Redis
|
||||
IntentPublisher --> Redis
|
||||
|
||||
StartJobs --> Redis
|
||||
StopJobs --> Redis
|
||||
InternalHTTP --> Postgres
|
||||
|
||||
Docker -->|create / start / stop / rm| Engine
|
||||
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
|
||||
|
||||
InternalHTTP --> Telemetry
|
||||
Services --> Telemetry
|
||||
StartJobs --> Telemetry
|
||||
StopJobs --> Telemetry
|
||||
DockerEvents --> Telemetry
|
||||
HealthProbe --> Telemetry
|
||||
DockerInspect --> Telemetry
|
||||
Reconcile --> Telemetry
|
||||
Cleanup --> Telemetry
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
|
||||
goose migrations fail, when Redis ping fails, when the Docker daemon
|
||||
ping fails, or when the configured Docker network is missing. Lobby
|
||||
reachability is **not** verified at boot — the start service's
|
||||
diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
|
||||
outside of debug logging
|
||||
([`services.md` §7](services.md)).
|
||||
- The reconciler runs **synchronously** once on startup before
|
||||
`app.App.Run` registers any other component, then re-runs
|
||||
periodically as a regular `Component`. The synchronous pass is the
|
||||
  reason orphaned containers from a prior process can never be
|
||||
observed by the events listener with no PG record
|
||||
([`workers.md` §17](workers.md)).
|
||||
- A single internal HTTP listener exposes both probes
|
||||
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
|
||||
and Admin Service. There is no public listener — RTM does not face
|
||||
end users.
|
||||
|
||||
## Listeners
|
||||
|
||||
| Listener | Default addr | Purpose |
|
||||
| --- | --- | --- |
|
||||
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
|
||||
|
||||
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
|
||||
|
||||
- read timeout: `5s`
|
||||
- write timeout: `15s`
|
||||
- idle timeout: `60s`
|
||||
|
||||
The listener is unauthenticated and assumes a trusted network segment.
|
||||
The `X-Galaxy-Caller` request header carries an optional caller
|
||||
identity (`gm` or `admin`) that the handler records as
|
||||
`operation_log.op_source`
|
||||
([`services.md` §18](services.md)).
|
||||
|
||||
Probe routes:
|
||||
|
||||
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
|
||||
the listener is up.
|
||||
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
|
||||
Docker daemon, then asserts the configured Docker network exists.
|
||||
Returns `{"status":"ready"}` only when every check passes; otherwise
|
||||
returns `503` with the canonical error envelope.
|
||||
|
||||
## Background Workers
|
||||
|
||||
Every worker runs as an `app.Component` and is registered in the
|
||||
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
|
||||
|
||||
| Worker | Source | Trigger | Function |
|
||||
| --- | --- | --- | --- |
|
||||
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
|
||||
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
|
||||
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
|
||||
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
|
||||
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
|
||||
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
|
||||
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
|
||||
|
||||
The events listener and the inspect worker do **not** emit
|
||||
`container_started` — that event is owned by the start service
|
||||
([`workers.md` §1](workers.md)). The events listener and the inspect
|
||||
worker also do not emit `container_disappeared` autonomously when a
|
||||
record is missing or stale; the conditional emission rules live in
|
||||
[`workers.md` §2](workers.md) and [`§4`](workers.md).
|
||||
|
||||

## Lifecycle Services

The five lifecycle services are pure orchestrators called from both the stream
consumers and the REST handlers. Each service owns the per-game lease for the
duration of its operation.

| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |

All services share three behaviours captured in [`services.md`](services.md):

- the per-game Redis lease (`rtmanager:game_lease:{game_id}`, TTL
  `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service, not by the
  caller — which keeps consumer and REST callers symmetric
  ([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
  `ContainerID`, `EngineEndpoint`) is what consumers and REST handlers
  translate into job_results / HTTP responses
  ([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning, and three
  of the failure codes (`start_config_invalid`, `image_pull_failed`,
  `container_start_failed`) also publish a `runtime.*` admin notification
  intent ([`services.md` §4](services.md)).

## Synchronous Upstream Client

| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |

Lobby's outbound transport is the only synchronous client RTM holds. Every
other interaction (Notification Service, Game Master, Admin Service) crosses
an asynchronous boundary or is initiated by the peer.

## Stream Offsets

Each consumer persists its position under a fixed label so process restart
preserves stream progress.

| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |

The labels `startjobs` and `stopjobs` are stable identifiers — they are
decoupled from the underlying stream key. An operator who renames a stream via
`RTMANAGER_REDIS_START_JOBS_STREAM` / `RTMANAGER_REDIS_STOP_JOBS_STREAM` does
not lose the persisted offset. Decision: [`workers.md` §9](workers.md).

The `runtime:job_results`, `runtime:health_events`, and `notification:intents`
streams are outbound; RTM does not consume them itself.

## Configuration Groups

The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below summarise the
structure:

- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
  `RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
  `RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
  `RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
  `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
  `RTMANAGER_CONTAINER_RETENTION_DAYS`,
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH`, `RTMANAGER_ENGINE_STATE_ENV_NAME`,
  `RTMANAGER_GAME_STATE_DIR_MODE`, `RTMANAGER_GAME_STATE_OWNER_UID`,
  `RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
  `RTMANAGER_POSTGRES_REPLICA_DSNS`, `RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
  `RTMANAGER_POSTGRES_MAX_OPEN_CONNS`, `RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
  `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
  `RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
  `RTMANAGER_REDIS_STOP_JOBS_STREAM`, `RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
  `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
  `RTMANAGER_NOTIFICATION_INTENTS_STREAM`, `RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
  `RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
  `RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
  `RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
  `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`, `RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.

## Runtime Notes

- **Single-instance v1.** Multi-instance Runtime Manager with Redis Streams
  consumer groups is explicitly out of scope for the current iteration. The
  per-game lease serialises operations on one game across the consumer + REST
  entry points; cross-instance coordination is deferred until a real workload
  demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
  `SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default
  `60s`). The lease is **not renewed mid-operation** in v1; long pulls of
  multi-GB images can therefore expire the lease before the operation finishes
  — the trade-off is documented in [`services.md` §1](services.md). The
  reconciler honours the same lease around every drift mutation
  ([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and reconcile
  mutation appends one row to `rtmanager.operation_log`. The
  `runtime:health_events` stream and the `notification:intents` emissions are
  best-effort — a publish failure logs at `Error` and proceeds, never rolling
  back the recorded operation ([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps per-game
  `consecutiveFailures` and `failurePublished` counters in a mutex-guarded
  map. State is non-persistent: a process restart that loses the counters
  re-establishes hysteresis from scratch, and state for a game that
  transitions through `stopped → running` is pruned at the start of every
  probe tick ([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
  `if_missing` (default), `always`, and `never`. Image labels
  (`com.galaxy.cpu_quota`, `com.galaxy.memory`, `com.galaxy.pids_limit`) drive
  resource limits when present; the matching `RTMANAGER_DEFAULT_*` env vars
  supply the fallback when a label is absent or unparseable. Producers never
  pass limits.
- **State directory ownership.** RTM creates per-game state directories under
  `RTMANAGER_GAME_STATE_ROOT` with the configured mode and uid/gid, but
  **never deletes them**. Removing the directory is operator domain (backup
  tooling, a future Admin Service workflow). A cleanup that removes the
  container leaves the directory intact.

@@ -0,0 +1,443 @@

# Lifecycle Services

This document explains the design of the five lifecycle services
(`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) under [`../internal/service/`](../internal/service) plus
the per-handler REST glue under
[`../internal/api/internalhttp/`](../internal/api/internalhttp).

The current-state behaviour (lifecycle steps, failure tables, the per-game
lease semantics, the wire contracts) lives in [`../README.md`](../README.md),
the OpenAPI spec at
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml), and the
AsyncAPI spec at
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml).
This file records the *why*.

## 1. Per-game lease lives at the service layer

Every lifecycle service acquires `rtmanager:game_lease:{game_id}` via
[`ports.GameLeaseStore`](../internal/ports/gamelease.go) before doing any
work, and releases it on the way out:

- the lease primitive serialises operations on a single game across every
  entry point (stream consumers and REST handlers);
- holding the lease at the service layer keeps the consumer / REST callers
  symmetric — neither acquires the lease itself, both call the service the
  same way;
- the Redis-backed adapter
  ([`../internal/adapters/redisstate/gamelease/store.go`](../internal/adapters/redisstate/gamelease/store.go))
  uses `SET NX PX` on acquire, Lua compare-and-delete on release; a release
  whose caller-supplied token no longer matches is a silent no-op.

The lease key shape is `rtmanager:game_lease:{base64url(game_id)}` so opaque
game ids may contain any characters without leaking through the key syntax.
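
A minimal sketch of the acquire/release behaviour just described, assuming `github.com/redis/go-redis/v9` (already in the module's require list). The type and method names are illustrative; the real adapter lives at the path above.

```go
package sketch

import (
	"context"
	"encoding/base64"
	"time"

	"github.com/redis/go-redis/v9"
)

// Compare-and-delete: only the holder of the matching token deletes the key.
var releaseScript = redis.NewScript(
	`if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) end return 0`)

type leaseStore struct {
	client *redis.Client
	ttl    time.Duration // RTMANAGER_GAME_LEASE_TTL_SECONDS
}

func leaseKey(gameID string) string {
	// base64url keeps opaque game ids from leaking into the key syntax.
	return "rtmanager:game_lease:" + base64.RawURLEncoding.EncodeToString([]byte(gameID))
}

// Acquire returns acquired=false when another operation already holds the lease.
func (s *leaseStore) Acquire(ctx context.Context, gameID, token string) (bool, error) {
	return s.client.SetNX(ctx, leaseKey(gameID), token, s.ttl).Result()
}

// Release is a silent no-op when the stored token no longer matches.
func (s *leaseStore) Release(ctx context.Context, gameID, token string) error {
	return releaseScript.Run(ctx, s.client, []string{leaseKey(gameID)}, token).Err()
}
```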

The lease TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60s`) and is
**not renewed mid-operation** in v1. A multi-GB image pull can theoretically
expire the lease before the start service finishes; operators see this as a
`reconcile_adopt` event later because the container is created with the
standard owner labels. A renewal helper is deliberately deferred until a
workload makes it necessary.

The reconciler ([`workers.md`](workers.md) §14) honours the same lease around
every drift mutation, which closes the restart-vs-`reconcile_dispose` race
documented in §6 below.

## 2. Health-events publisher lands with the start service

The start service publishes `container_started` after `docker run` returns;
the events listener intentionally does **not** duplicate the event
([`workers.md`](workers.md) §1). Centralising the publisher on the start
service avoids a "who emits what" ambiguity and lets the publisher be a thin
port wrapper rather than a worker-specific helper.

The publisher port lives next to the snapshot-upsert rule
([`adapters.md`](adapters.md) §8): one Publish call updates both surfaces.

## 3. `Result`-shaped contract

`Service.Handle` returns `(Result, error)`. The Go-level `error` is reserved
for system-level / programmer faults (nil context, nil service). All business
outcomes flow through `Result`:

- `Outcome=success`, `ErrorCode=""` — fresh start succeeded;
- `Outcome=success`, `ErrorCode="replay_no_op"` — idempotent replay;
- `Outcome=failure`, `ErrorCode` set — business failure
  (`start_config_invalid` / `image_pull_failed` / `container_start_failed` /
  `conflict` / `service_unavailable` / `internal_error`).

The stream consumer uses `Outcome` and `ErrorCode` to populate
`runtime:job_results` directly; the REST handler maps `Outcome=failure` plus
`ErrorCode` to the matching HTTP status. Both callers are simpler with this
contract than with an `errors.Is`-driven sentinel taxonomy.

`ports.JobResult` and the two `JobOutcome*` string constants live in the ports
package next to `JobResultPublisher` so the wire shape is defined exactly
once. The constants are intentionally not aliases of `operation.Outcome` — the
audit-log enum is allowed to grow without breaking the wire format.
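
A hedged sketch of the `Result` shape and the consumer-side copy into the job-result payload. Only `Outcome`, `ErrorCode`, `Record`, `ContainerID`, and `EngineEndpoint` are named by this document; every other field and JSON tag below is an assumption for illustration.

```go
package sketch

type Result struct {
	Outcome        string // "success" or "failure"
	ErrorCode      string // "" | "replay_no_op" | one of the failure codes
	ContainerID    string
	EngineEndpoint string
}

// jobResult mirrors the runtime:job_results payload (illustrative fields).
type jobResult struct {
	GameID         string `json:"game_id"`
	Outcome        string `json:"outcome"`
	ErrorCode      string `json:"error_code"`
	ContainerID    string `json:"container_id"`
	EngineEndpoint string `json:"engine_endpoint"`
}

// toJobResult copies the service result verbatim; replay detection stays in
// the service layer (workers.md §11), so the consumer adds no logic here.
func toJobResult(gameID string, r Result) jobResult {
	return jobResult{
		GameID:         gameID,
		Outcome:        r.Outcome,
		ErrorCode:      r.ErrorCode,
		ContainerID:    r.ContainerID,
		EngineEndpoint: r.EngineEndpoint,
	}
}
```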

## 4. Start service failure-mode mapping

| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | `start_config_invalid` | `runtime.start_config_invalid` |
| Lease busy | `conflict` | — |
| Existing record running with a different image_ref | `conflict` | — |
| Get returns a non-NotFound transport error | `internal_error` | — |
| `image_ref` shape rejected by `distribution/reference` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns `ErrNetworkMissing` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns any other error | `service_unavailable` | — |
| `PullImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `InspectImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `prepareStateDir` failure | `start_config_invalid` | `runtime.start_config_invalid` |
| `Run` failure | `container_start_failed` | `runtime.container_start_failed` |
| `Upsert` failure after successful Run | `container_start_failed` | `runtime.container_start_failed` |

Three error codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, and `internal_error` are operational classes (another
caller is in flight, a dependency is down, an unclassified fault) where the
corrective action is not a configuration change. The operator already sees
them through telemetry and structured logs; an email per occurrence would be
noise.

## 5. Upsert-after-Run rollback

A `Run` that succeeded but whose `Upsert` failed leaves a running container
with no PG record. The service issues a best-effort
`docker.Remove(containerID)` in a fresh `context.Background()` (the request
context may already be cancelled) before recording the failure. A Remove
failure is logged but not propagated; the reconciler adopts surviving orphans
on its periodic pass.

The Docker adapter already removes the container when `Run` itself returns an
error after a successful `ContainerCreate`
([`adapters.md`](adapters.md) §3). The service-layer rollback covers the
additional post-`Run` Upsert failure path.
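
A sketch of that best-effort rollback, assuming a narrowed `DockerClient` port and an injected logger; the surrounding service fields and the 30-second timeout are illustrative.

```go
package sketch

import (
	"context"
	"log/slog"
	"time"
)

type DockerClient interface {
	Remove(ctx context.Context, containerID string) error
}

// rollbackContainer runs after Upsert fails following a successful Run. It
// deliberately uses a fresh context: the request context may already be
// cancelled, and the orphaned container should still be removed if possible.
func rollbackContainer(docker DockerClient, logger *slog.Logger, containerID string) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := docker.Remove(ctx, containerID); err != nil {
		// Logged but not propagated: the reconciler adopts surviving orphans
		// on its next periodic pass.
		logger.Error("rollback remove failed", "container_id", containerID, "error", err)
	}
}
```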

## 6. Pre-existing record handling

Only `status=running` + same `image_ref` is a `replay_no_op`. `running` + a
different `image_ref` returns `failure / conflict` (use `patch` to change the
image of a running container).

Anything else (`stopped`, `removed`, missing record) proceeds with a fresh
start that ends in `Upsert`. `Upsert` overwrites verbatim and is not bound by
the transitions table, so installing a `running` record over a `removed` row
is permitted — the `removed` terminus rule lives in
`runtime.AllowedTransitions` (which guards `UpdateStatus`), not in `Upsert`.

`created_at` is preserved across re-starts: the start service reuses
`existing.CreatedAt` when the record was found, so the "first time RTM saw the
game" semantics from [`postgres-migration.md`](postgres-migration.md) §9 hold
even when the start path goes through `Upsert` rather than through the runtime
adapter's `INSERT ... ON CONFLICT DO UPDATE` EXCLUDED list.

A residual `galaxy-game-{game_id}` container left over from a previous start
that was stopped but never cleaned up will fail at `docker run` with a name
conflict. The service surfaces that as `container_start_failed`; cleanup plus
the reconciler is the standard remedy. A pre-emptive Remove inside the start
service was rejected because it would silently undo manual operator inspection
on stopped containers.

## 7. `LobbyInternalClient.GetGame` is best-effort

The fetch happens after the lease is acquired and before the Docker work, with
the configured `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`. `ErrLobbyUnavailable` and
`ErrLobbyGameNotFound` are logged at `debug`; the start operation continues
either way. The fetched `Status` and `TargetEngineVersion` enrich logs only —
the start envelope already carries the only required field (`image_ref`), and
the port docstring fixes the recoverable-failure contract.

## 8. `image_ref` validation

Validation uses `github.com/distribution/reference.ParseNormalizedNamed`
before any Docker round-trip. Rejected shapes surface as
`start_config_invalid` plus a `runtime.start_config_invalid` intent.
Daemon-side rejections after a valid parse (manifest unknown, authentication
required) surface as `image_pull_failed` plus a `runtime.image_pull_failed`
intent. The split keeps operator-actionable configuration mistakes distinct
from registry-side failures.

## 9. State-directory preparer is overrideable

`Dependencies.PrepareStateDir` is a `func(gameID string) (string, error)`
injection point that defaults to `os.MkdirAll` + `os.Chmod` + `os.Chown`
against `RTMANAGER_GAME_STATE_ROOT`. Tests override it to point at a
`t.TempDir()`-style fake without exercising the real filesystem permissions
(which require either matching uid/gid or root). This is a deliberate non-port
abstraction: the start service does no other filesystem work and the cost of a
new port for one helper is not worth the indirection.

## 10. Container env: both `GAME_STATE_PATH` and `STORAGE_PATH`

Both names are accepted by the v1 engine. The start service always sets both;
the configured `RTMANAGER_ENGINE_STATE_ENV_NAME` controls the primary. When
the operator overrides the primary to `STORAGE_PATH`, the deduplicating map
collapses the two entries into one.

## 11. Wiring layer construction

`internal/app/wiring.go` is the single point that builds every production
store, adapter, and service from `config.Config`. The struct exposes typed
fields so handlers and workers can grab the singletons without re-wiring; an
`addCloser` slice releases adapter resources (currently the Lobby HTTP
client's idle-connection pool) at runtime shutdown. The `runtimeRecordsProbe`
adapter installed during construction registers the
`rtmanager.runtime_records_by_status` gauge documented in
[`../README.md` §Observability](../README.md).

The persistence-only `CountByStatus` method on the `runtimerecordstore`
adapter is **not** part of `ports.RuntimeRecordStore` because it is only used
by the gauge probe; widening the port for one caller would force every adapter
and test fake to grow with no benefit. The adapter exposes it directly and the
wiring composes a concrete-typed wrapper.

## 12. Shared lease across composed operations (restart, patch)

Restart and patch must hold the lease across the inner
`stop → docker rm → start` sequence, otherwise a concurrent stop or restart
could observe a half-recreated runtime.

`startruntime.Service` and `stopruntime.Service` therefore expose a second
public method:

```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```

`Handle` acquires the lease, defers its release, and calls `Run`. Restart and
patch acquire the outer lease themselves and call `Run` on the inner services.
The inner services record their own `operation_log` entries, telemetry
counters, health events, and admin notification intents identically to a
top-level `Handle`.
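
A hedged sketch of that `Handle`-wraps-`Run` pattern. The lease-store shape, the token helper, and the `Service` fields are assumptions made so the example stands on its own; the real types live in `internal/service/startruntime` and `internal/ports`.

```go
package sketch

import (
	"context"
	"crypto/rand"
	"encoding/base64"
)

type GameLeaseStore interface {
	Acquire(ctx context.Context, gameID, token string) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type Input struct{ GameID string }

type Result struct {
	Outcome   string
	ErrorCode string
}

type Service struct {
	leases GameLeaseStore
	run    func(ctx context.Context, input Input) (Result, error) // stands in for the real Run
}

func newToken() string {
	b := make([]byte, 32)
	_, _ = rand.Read(b)
	return base64.RawURLEncoding.EncodeToString(b)
}

// Handle acquires the per-game lease, defers its release, and delegates to
// Run; restart and patch skip Handle and call Run under their own outer lease.
func (s *Service) Handle(ctx context.Context, input Input) (Result, error) {
	token := newToken()
	acquired, err := s.leases.Acquire(ctx, input.GameID, token)
	if err != nil {
		return Result{Outcome: "failure", ErrorCode: "service_unavailable"}, nil
	}
	if !acquired {
		return Result{Outcome: "failure", ErrorCode: "conflict"}, nil
	}
	// A mismatched token on release is a silent no-op (services.md §1).
	defer func() { _ = s.leases.Release(ctx, input.GameID, token) }()
	return s.run(ctx, input)
}
```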

A typed `LeaseTicket` parameter (a small internal-package zero-size struct
that only the lease store can construct) was considered and rejected for v1:
only sister services in `internal/service/` ever call `Run`, the docstring is
loud about the precondition, and the pattern can be tightened later without
breaking the public surface that consumers and handlers consume.

## 13. Correlation id on `source_ref`

The outer restart and patch services reuse the existing `Input.SourceRef` as a
correlation key:

- when `Input.SourceRef` is non-empty (REST request id, stream entry id), all
  three entries — outer restart / patch + inner stop + inner start — share
  that value;
- when empty, the outer service generates a 32-byte base64url string via the
  same `NewToken` generator that produces lease tokens, and uses it as the
  correlation key for all three entries.

The outer entry's `source_ref` keeps its dual semantics: actor ref when the
caller supplied one, generated correlation id otherwise. Pure top-level
operations (caller invokes start, stop, or cleanup directly) keep the original
meaning. Composed operations (restart, patch) use the same value in three
places to make audit queries trivial.

This is not the cleanest end-state — a dedicated `correlation_id` column would
carry the link without ambiguity — but it is the smallest change that does not
touch the schema. A future stage that adds the column can rename the field and
clear up the dual role in one move.

## 14. Semver validation for patch

`internal/service/patchruntime/semver.go` enforces the patch-precondition
(current and new `image_ref` parse as semver, share major and minor):

- `extractSemverTag(imageRef)` parses with
  `github.com/distribution/reference.ParseNormalizedNamed`, casts to
  `reference.NamedTagged`, then validates the tag with
  `golang.org/x/mod/semver.IsValid` (after prepending `v` when the tag omits
  it). Failures map to `image_ref_not_semver`;
- `samePatchSeries(currentSemver, newSemver)` compares `semver.MajorMinor` of
  the two canonical strings; mismatch maps to `semver_patch_only`.

`golang.org/x/mod` is a direct require to avoid a transitive-version surprise.
`github.com/Masterminds/semver/v3` (also in the module graph) was rejected to
avoid two semver libraries on disk for the same job; `x/mod/semver` already
covers Lobby. A hand-rolled `vMajor.Minor.Patch` parser was rejected as
premature.

Pre-checks run before any inner stop or `docker rm`: a rejected patch never
disturbs the running runtime. Patch with `new_image_ref == current_image_ref`
proceeds through the recreate flow unchanged (not `replay_no_op`: the inner
start still runs); the outer `op_kind=patch` entry records the no-op patch for
audit.
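
A minimal sketch of the two checks, using the libraries named above; the helper names follow the text, while the simplified error value is an assumption.

```go
package sketch

import (
	"errors"
	"strings"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var errNotSemver = errors.New("image_ref_not_semver")

// extractSemverTag returns the canonical "vX.Y.Z" form of the image tag.
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", errNotSemver
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemver // no tag at all (bare name or digest-only)
	}
	tag := tagged.Tag()
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag // x/mod/semver requires the leading "v"
	}
	if !semver.IsValid(tag) {
		return "", errNotSemver
	}
	return tag, nil
}

// samePatchSeries reports whether two canonical semver strings share
// major.minor; a mismatch maps to semver_patch_only.
func samePatchSeries(currentSemver, newSemver string) bool {
	return semver.MajorMinor(currentSemver) == semver.MajorMinor(newSemver)
}
```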

## 15. `StopReason` placement

The reason enum mirrors `lobby/internal/ports/runtimemanager.go` verbatim and
lives at `internal/service/stopruntime/stopreason.go`. The stream consumer and
the REST handler import `stopruntime` for the same enum the service requires.

Inner stop calls from restart and patch always pass `StopReasonAdminRequest`.
Restart and patch are platform-internal recreate flows; `admin_request` is the
closest semantic match in the five-value vocabulary. The actor that originated
the recreate (REST request id, admin user id) flows through the `op_source` /
`source_ref` pair, not through the stop reason.

## 16. Error code centralisation

`internal/service/startruntime/errors.go` is the canonical home for the stable
error codes returned in `Result.ErrorCode`. The other four services
(`stopruntime`, `restartruntime`, `patchruntime`, `cleanupcontainer`) import
the constants from `startruntime` rather than redeclaring them. The package
comment of `errors.go` flags the shared usage so future readers do not chase
per-service declarations.

`start_config_invalid` is reserved for start because every start validation
failure also raises an admin notification intent. The other services use the
more general `invalid_request` for input validation failures.

## 17. Stop / restart / patch / cleanup failure tables

### `stopruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | No notification intent. |
| Lease busy | `conflict` | Lease release skipped because acquire returned false. |
| Lease error | `service_unavailable` | Redis unreachable. |
| Record missing | `not_found` | |
| Status `stopped` / `removed` | success / `replay_no_op` | Idempotent re-stop. |
| `docker.Stop` returns `ErrContainerNotFound` | success | Record transitions `running → removed`, `container_disappeared` health event published. |
| `docker.Stop` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` (CAS race) | success / `replay_no_op` | The desired state was reached by another path (reconciler / restart). |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | Record vanished mid-stop. |
| `UpdateStatus` other error | `internal_error` | |

### `restartruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | Same as stop. |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | Image_ref may be empty; restart cannot proceed. |
| Inner stop fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner stop failed: ". |
| `docker.Remove` fails | `service_unavailable` | Inner stop already moved record to `stopped`; runtime stays in `stopped`. Admin must call `cleanup_container` before retrying restart. |
| Inner start fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner start failed: ". |

The post-stop `docker rm` failure is the only path that leaves the runtime in
a state from which the same operation cannot recover by itself: a residual
`galaxy-game-{game_id}` container blocks a fresh inner start (the start
service surfaces this as `container_start_failed`). The runbook entry — "call
cleanup, then restart again" — is the standard remedy.

### `patchruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | |
| Current `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| New `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| Major / minor mismatch | `semver_patch_only` | Pre-check; no inner ops fired. |
| Inner stop / `docker rm` / inner start fails | inherits inner code | Same propagation as restart. |

### `cleanupcontainer`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | success / `replay_no_op` | |
| Status `running` | `conflict` | Error message: "stop the runtime first". |
| Status `stopped` | proceed | |
| `docker.Remove` returns `ErrContainerNotFound` | success | Adapter swallows not-found into nil. |
| `docker.Remove` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` | success / `replay_no_op` | Race with reconciler dispose. |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | |
| `UpdateStatus` other error | `internal_error` | |

## 18. REST handler conventions

The internal HTTP handlers under
[`../internal/api/internalhttp/handlers/`](../internal/api/internalhttp/handlers)
follow these rules:

- **`X-Galaxy-Caller` header.** The optional header carries the calling
  service identity (`gm` / `admin`); the handler records the value as
  `op_source` in `operation_log` (`gm_rest` / `admin_rest`). Missing or
  unknown values default to `admin_rest` because every audit-log query already
  filters on the cleanup endpoint (`op_source ∈ {auto_ttl, admin_rest}`);
  making the default match the most-restricted surface keeps existing
  dashboards correct when an unconfigured client hits the listener. The header
  is declared as a reusable parameter
  (`components.parameters.XGalaxyCallerHeader`) in the OpenAPI spec and is
  referenced from each runtime operation but not from `/healthz` and
  `/readyz`.
- **Error code → HTTP status mapping.** One canonical table in
  `handlers/common.go` (a sketch of the mapping follows this list):

  | ErrorCode | HTTP status |
  | --- | ---: |
  | (success, including `replay_no_op`) | 200 |
  | `invalid_request`, `start_config_invalid`, `image_ref_not_semver` | 400 |
  | `not_found` | 404 |
  | `conflict`, `semver_patch_only` | 409 |
  | `service_unavailable`, `docker_unavailable` | 503 |
  | `internal_error`, `image_pull_failed`, `container_start_failed` | 500 |

  `image_pull_failed` and `container_start_failed` are operational failures
  that originate inside RTM (registry / daemon problems), not client-side
  validation issues; they map to `500` so callers retry through their normal
  resilience paths instead of treating the call as a 4xx that must be fixed at
  the source. `docker_unavailable` is reserved for future producers; today the
  start service emits `service_unavailable` for Docker-daemon failures.
  Unknown error codes default to `500`.
- **List and Get bypass the service layer.** `internalListRuntimes` and
  `internalGetRuntime` read directly from `ports.RuntimeRecordStore`. Reads do
  not produce `operation_log` rows, do not change Docker state, do not need
  the per-game lease, and do not have a stream-side counterpart — none of the
  lifecycle service machinery is justified.
- **`RuntimeRecordStore.List(ctx)` returns every record regardless of
  status.** A single SELECT ordered by `(last_op_at DESC, game_id ASC)` — the
  same direction the `runtime_records_status_last_op_idx` index supports, so
  freshly active games surface first. Pagination is intentionally not modelled
  in v1; the working set is bounded by the games tracked by Lobby.
- **Per-handler service ports use `mockgen`.** The handler layer depends on
  five narrow interfaces — one per lifecycle service — declared in
  `handlers/services.go`. Production wiring passes the concrete
  `*<lifecycle>.Service` pointers (each satisfies the matching interface
  implicitly); tests pass the mockgen-generated mocks under `handlers/mocks/`.
- **Conformance test scope.** `internalhttp/conformance_test.go` drives every
  documented runtime operation against a real `internalhttp.Server` whose
  service deps are deterministic stubs. The test uses
  `kin-openapi/routers/legacy.NewRouter`, calls
  `openapi3filter.ValidateRequest` and `openapi3filter.ValidateResponse` so
  both directions match the contract. The scope is happy-path only; the
  failure-path response shapes are validated by the per-handler tests.
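
A sketch of the canonical ErrorCode → HTTP status table above, in the shape one would expect in `handlers/common.go`; the function name and signature are illustrative.

```go
package sketch

import "net/http"

// httpStatus folds the service Result's Outcome and ErrorCode into an HTTP
// status code, following the mapping table in §18.
func httpStatus(outcome, errorCode string) int {
	if outcome == "success" { // includes ErrorCode == "replay_no_op"
		return http.StatusOK
	}
	switch errorCode {
	case "invalid_request", "start_config_invalid", "image_ref_not_semver":
		return http.StatusBadRequest
	case "not_found":
		return http.StatusNotFound
	case "conflict", "semver_patch_only":
		return http.StatusConflict
	case "service_unavailable", "docker_unavailable":
		return http.StatusServiceUnavailable
	default: // internal_error, image_pull_failed, container_start_failed, unknown codes
		return http.StatusInternalServerError
	}
}
```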

@@ -0,0 +1,412 @@

# Background Workers

This document explains the design of the seven background workers under
[`../internal/worker/`](../internal/worker):

- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
  [`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async consumers
  driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
  subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
  `InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP `/healthz`
  probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic drift
  reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) — periodic TTL
  cleanup.

The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in [`runtime.md`](runtime.md),
[`flows.md`](flows.md), and [`runbook.md`](runbook.md). This file records the
rationale.

## 1. Single ownership per `event_type`

The `runtime:health_events` vocabulary is shared across five sources; each
event type is owned by exactly one of them.

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events listener,
even though Docker emits a `start` action whenever the start service runs the
container. The start service already publishes the event with the same wire
shape; observing the action in the listener would produce two entries per real
start.

## 2. `container_disappeared` is conditional on PG state

The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules apply:

- record missing → suppress (the destroyed container was never owned by RTM
  as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop or
  cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM swapped
  to a new container through restart or patch; the destroy is the expected
  removal of the prior container id).

Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as unexpected.
This is the wire-side analogue of the reconciler's PG-drift check: the
reconciler observes "PG=running, no Docker container" while the events
listener observes "Docker says destroy, PG still says running pointing at this
container". Together they cover both directions of drift.

A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external or
RTM-initiated, and over-emitting `container_disappeared` would lead to a real
consumer (`Game Master`) escalating a false positive.
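
A hedged sketch of the suppression decision above; the record shape, the store interface, and the sentinel error are stand-ins for the real ports.

```go
package sketch

import (
	"context"
	"errors"
)

var errNotFound = errors.New("not found")

type runtimeRecord struct {
	Status             string
	CurrentContainerID string
}

type recordStore interface {
	Get(ctx context.Context, gameID string) (runtimeRecord, error)
}

// shouldEmitDisappeared applies the three suppression rules plus the
// conservative "suppress on read failure" rule for a destroy event.
func shouldEmitDisappeared(ctx context.Context, store recordStore, gameID, containerID string) bool {
	record, err := store.Get(ctx, gameID)
	switch {
	case errors.Is(err, errNotFound):
		return false // never tracked by RTM as a runtime
	case err != nil:
		return false // read failure: suppress rather than risk a false positive
	case record.Status != "running":
		return false // expected tail of a stop or cleanup
	case record.CurrentContainerID != containerID:
		return false // prior container removed by restart or patch
	default:
		return true // unexpected external destroy
	}
}
```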

## 3. `die` with exit code `0` is suppressed

`docker stop` (and graceful shutdowns via SIGTERM) produces a `die` event with
exit code `0`. The `container_exited` contract guarantees a non-zero exit;
emitting on exit `0` would shower consumers with normal-stop noise. The
listener silently drops the event; the operation log already records the stop
on the caller side.

## 4. Inspect worker leaves `container_disappeared` to the reconciler

When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:

- the reconciler is the single authority for PG-drift reconciliation. Adding a
  third source for `container_disappeared` would risk double emission and
  complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5 minutes. The
  latency window for "Docker drops the container, RTM notices" is therefore at
  most 5 minutes in v1, which is acceptable for the kinds of drift the
  reconciler covers (manual `docker rm` outside RTM, daemon restart with stale
  records). If a future requirement tightens the window, promoting the
  inspect-side observation to a real `container_disappeared` is a one-line
  change.

## 5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a `map[string]*probeState`
guarded by a mutex. Two counters live there:

- `consecutiveFailures` — incremented on every failed probe, reset on every
  success;
- `failurePublished` — prevents repeated `probe_failed` emission while the
  failure persists, and triggers a single `probe_recovered` on the first
  success after the threshold was crossed.

The state is non-persistent. RTM is single-instance in v1, and a process
restart that loses the counters merely re-establishes the hysteresis from
scratch — the only consequence is that a probe failure already in progress at
the moment of restart needs another full threshold of failures to surface.
Making the state durable would add a Redis round-trip to every probe attempt
without buying anything that operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the current
running list and removes any state entry whose `game_id` is not in the list.
A game that transitions through `stopped → running` again starts fresh;
previously-accumulated counters do not bleed into the new lifecycle.
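
A hedged sketch of that bookkeeping. `probeState` carries the two fields named above; the wrapper type, the event strings returned by `observe`, and the `prune` signature are illustrative.

```go
package sketch

import "sync"

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

type hysteresis struct {
	mu        sync.Mutex
	threshold int // RTMANAGER_PROBE_FAILURES_THRESHOLD
	states    map[string]*probeState
}

// observe records one probe result and returns the event to emit, if any:
// "probe_failed" once the threshold is crossed, "probe_recovered" on the
// first success after a published failure, "" otherwise.
func (h *hysteresis) observe(gameID string, healthy bool) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	state, ok := h.states[gameID]
	if !ok {
		state = &probeState{}
		h.states[gameID] = state
	}
	if healthy {
		state.consecutiveFailures = 0
		if state.failurePublished {
			state.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	state.consecutiveFailures++
	if state.consecutiveFailures >= h.threshold && !state.failurePublished {
		state.failurePublished = true
		return "probe_failed"
	}
	return ""
}

// prune runs at the start of every tick and drops state for games no longer
// in the running list, so a stopped → running game starts fresh.
func (h *hysteresis) prune(running map[string]bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for gameID := range h.states {
		if !running[gameID] {
			delete(h.states, gameID)
		}
	}
}
```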

## 6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel semaphore
(`defaultMaxConcurrency = 16`). Three reasons:

- A single slow engine cannot delay the entire cohort. Sequential per-game
  probing would multiply the worst case by `len(records)`, which is the wrong
  shape for what is fundamentally a fan-out observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a cap) was
  rejected to avoid pathological CPU and connection bursts if the running list
  ever grows beyond what RTM was sized for. 16 in-flight probes at the default
  2s timeout fit a single RTM instance well within typical OS file-descriptor
  and TCP ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
  single-instance and the active-game count is bounded by Lobby; a
  configurable cap is something we promote to env if a real workload demands
  it.

The same reasoning argues against parallelism in the inspect worker: inspect
calls are cheap (sub-ms in the local Docker socket case) and serial execution
avoids unnecessary concurrency on the daemon socket.
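
A minimal sketch of the buffered-channel semaphore fan-out; the `probeOne` callback stands in for the real per-runtime HTTP probe.

```go
package sketch

import "sync"

const defaultMaxConcurrency = 16

// probeAll probes every running runtime with at most defaultMaxConcurrency
// probes in flight, and blocks until the whole cohort has been handled.
func probeAll(gameIDs []string, probeOne func(gameID string)) {
	semaphore := make(chan struct{}, defaultMaxConcurrency)
	var wg sync.WaitGroup
	for _, gameID := range gameIDs {
		wg.Add(1)
		semaphore <- struct{}{} // blocks while the cap is already in flight
		go func(id string) {
			defer wg.Done()
			defer func() { <-semaphore }()
			probeOne(id)
		}(gameID)
	}
	wg.Wait()
}
```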

## 7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the SDK
channel terminates on any transport error (daemon restart, socket hiccup,
connection reset). The listener's outer loop handles this by re-subscribing
after a fixed `defaultReconnectBackoff = 5s` wait, indefinitely while ctx is
alive.

Crashing the process on a transport error was rejected because losing a few
seconds of health observations is a much smaller blast radius than losing the
entire RTM process while the start/stop pipelines are running. The save-offset
case is different: a lost offset replays the entire backlog and breaks
correctness, while a missed health event is observation-only.

A subscription error is logged at `Warn` so operators can see the reconnect
activity without it dominating the log volume.
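
A hedged sketch of that outer loop. The `subscribe` function abstracts the Docker SDK `Events` call so the sketch does not pin a specific SDK signature; the event struct and `handle` callback are illustrative.

```go
package sketch

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

type dockerEvent struct {
	Action      string
	ContainerID string
}

// listen re-subscribes after every transport error until ctx is cancelled.
func listen(
	ctx context.Context,
	logger *slog.Logger,
	subscribe func(ctx context.Context) (<-chan dockerEvent, <-chan error),
	handle func(dockerEvent),
) {
	for {
		eventCh, errCh := subscribe(ctx)
	inner:
		for {
			select {
			case <-ctx.Done():
				return
			case event := <-eventCh:
				handle(event)
			case err := <-errCh:
				// Warn only: reconnect activity should be visible without
				// dominating the log volume.
				logger.Warn("docker events subscription lost", "error", err)
				break inner
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(defaultReconnectBackoff):
		}
	}
}
```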

## 8. Health publisher remains best-effort

Every emission goes through `ports.HealthEventPublisher.Publish`, the same
surface the start service already uses ([`adapters.md`](adapters.md) §8). A
publish failure logs at `Error` and proceeds; the worker does not retry, does
not adjust its in-memory hysteresis, and does not surface the failure to the
caller. The operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.

## 9. Stream offset labels are stable identifiers

Both consumers persist their progress through `ports.StreamOffsetStore` under
fixed labels — `startjobs` for the start-jobs consumer and `stopjobs` for the
stop-jobs consumer. The labels match `rtmanager:stream_offsets:{label}` and
stay stable when the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` / `RTMANAGER_REDIS_STOP_JOBS_STREAM`, so
an operator who points the consumer at a different stream key does not lose
the persisted offset.

## 10. `OpSource` and `SourceRef` originate at the consumer boundary

Every consumed envelope is translated into a `Service.Handle` call with
`OpSource = operation.OpSourceLobbyStream`. The opaque per-source `SourceRef`
is the Redis Stream entry id (`message.ID`); the `operation_log` rows
therefore record the originating envelope id, and restart / patch correlation
logic ([`services.md`](services.md) §13) keeps working when those services are
invoked indirectly.

## 11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. `startruntime.Service` returns
`Outcome=success, ErrorCode=replay_no_op` when the existing record is already
`running` with the same `image_ref`; `stopruntime.Service` does the same for
an already-stopped or already-removed record. The consumer copies the result
fields into the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.

The wire-shape consequences:

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For start, the
  existing record carries `container_id` and `engine_endpoint`; for stop on
  `status=removed`, both fields are empty strings (the record was nulled by an
  earlier cleanup) — the AsyncAPI contract permits empty strings on these
  required fields;
- `failure` + non-empty `error_code` → the start / stop service returned a
  zero `Record`; the consumer publishes empty `container_id` and
  `engine_endpoint`.

## 12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any Go-level service
error, and any publish failure; `streamOffsetStore.Save` runs unconditionally
after each handled message. Pinning the offset on a single transient publish
failure was rejected because the durable side effect (operation_log row,
runtime_records mutation, Docker state) has already happened on the first
pass; pinning the offset to retry the publish would duplicate audit rows for
hours until the operator intervened.

The exception is `streamOffsetStore.Save` itself: a save failure returns a
wrapped error from `Run`. The component supervisor in `internal/app/app.go`
then exits the process and lets the operator escalate, because losing the
offset would cause every subsequent restart to re-process every prior
envelope.
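
A hedged sketch of that per-message policy. The message, decode, handle/publish, and offset-store shapes are illustrative; only the policy itself (absorb per-message errors, fail `Run` on a `Save` error) mirrors the text.

```go
package sketch

import (
	"context"
	"fmt"
	"log/slog"
)

type message struct {
	ID   string
	Body map[string]string
}

type offsetStore interface {
	Save(ctx context.Context, label, offset string) error
}

func handleOne(
	ctx context.Context,
	logger *slog.Logger,
	offsets offsetStore,
	msg message,
	decode func(message) (any, error),
	handleAndPublish func(context.Context, any) error,
) error {
	if input, err := decode(msg); err != nil {
		logger.Error("decode failed", "entry_id", msg.ID, "error", err) // absorbed
	} else if err := handleAndPublish(ctx, input); err != nil {
		logger.Error("handle/publish failed", "entry_id", msg.ID, "error", err) // absorbed
	}
	// The offset always advances, even after an absorbed error: the durable
	// side effects already happened and a replay would duplicate audit rows.
	if err := offsets.Save(ctx, "startjobs", msg.ID); err != nil {
		// The one fatal path: losing the offset would replay the whole backlog.
		return fmt.Errorf("save stream offset: %w", err)
	}
	return nil
}
```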

## 13. `requested_at_ms` is logged-only

The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The consumer parses
it (rejecting unparseable values) but only includes the value in structured
logs — the field is "used for diagnostics, not authoritative" per the
contract. The service layer ignores it; the operation_log uses
`service.clock()` for `started_at` / `finished_at` so Lobby's wall-clock skew
never bleeds into RTM persistence.

## 14. Reconciler: per-game lease around every write

A `running → removed` mutation that races a restart's inner stop would clobber
the restart's freshly-installed `running` record without any other guard. The
reconciler honours the same per-game lease that the lifecycle services hold
([`services.md`](services.md) §1).

The reconciler splits its work into two phases:

- **Read pass — lockless.** `docker.List({com.galaxy.owner=rtmanager})`
  followed by `RuntimeRecords.ListByStatus(running)`. No lease is taken; both
  reads are point-in-time observations of independent systems and a stale view
  here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
  (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the per-game
  lease, re-reads the record under the lease, and then either applies the
  mutation or returns when state has changed. A lease conflict
  (`acquired=false`) is logged at `info` and the game is silently skipped —
  the next tick will retry. A lease-store error is logged at `warn`; the rest
  of the pass continues.

The re-read after lease acquisition is intentional: the read pass is lockless,
so by the time the lease is held the runtime record may have moved.
`UpdateStatus` already provides CAS via `ExpectedFrom + ExpectedContainerID`,
but `Upsert` (used for adopt) does not, so the explicit re-read keeps the three
paths uniform and makes the skip condition obvious in code review.

## 15. Three drift kinds covered by the reconciler

- `adopt` — Docker reports a container labelled `com.galaxy.owner=rtmanager`
  for which RTM has no record; insert a fresh `runtime_records` row with
  `op_kind=reconcile_adopt` and never stop or remove the container (operators
  may have started it manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing in Docker;
  mark `status=removed`, publish `container_disappeared`, append
  `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container exists but is
  in `exited`; mark `status=stopped`, publish `container_exited` with the
  observed exit code. This third path exists because the events listener sees
  only live events; a container that died while RTM was offline would
  otherwise stay `running` indefinitely. The drift is exposed through
  `rtmanager.reconcile_drift{kind=observed_exited}` and through the
  `container_exited` health event; no `operation_log` entry is written because
  the audit log records explicit RTM operations, not passive observations of
  Docker state.

## 16. `stopped_at = now` (reconciler observation time)

The `observed_exited` path writes `stopped_at = now`, where `now` is the
reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no port-level knob
for an explicit timestamp, and the reconciler does not read `State.FinishedAt`
from Docker.

The trade-off: `stopped_at` diverges from the daemon's `State.FinishedAt` by
at most one tick interval (default 5 minutes). If a downstream consumer ever
needs the daemon-observed exit timestamp, the upgrade path is a one-call
extension of `UpdateStatusInput` with an optional `StoppedAt *time.Time`
field; that change is deferred until a consumer materialises.

## 17. Synchronous initial pass + periodic Component

`README §Startup dependencies` step 6 demands "Reconciler runs once and blocks
until done" before background workers start, but `app.App.Run` starts every
registered `Component` concurrently — component ordering does not translate
into start ordering.

The reconciler exposes a public `ReconcileNow(ctx)` method that the runtime
calls synchronously between `newWiring` and `app.New`. The same `*Reconciler`
is then registered as a `Component`; its `Run` only ticks (no immediate pass)
so the startup work is not duplicated. The cost is one public method on the
worker; the benefit is that the README invariant holds verbatim and the
periodic loop is a textbook `Component`.

## 18. Adopt through `Upsert`, race with start is benign

The adopt path constructs a fresh `runtime.RuntimeRecord` (status running,
container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network from
configuration, engine endpoint from the `http://galaxy-game-{game_id}:8080`
rule) and calls `RuntimeRecords.Upsert`.

Race scenario: the start service has called `docker.Run` but has not yet
finished its own `Upsert` when the reconciler observes the container without a
record. Both writers eventually arrive at PG with the same key data — the
start service knows the canonical `image_ref`, but the reconciler reads it
from the `com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the `ON CONFLICT DO UPDATE`
  clause, so the "first time RTM saw this game" timestamp stays stable
  regardless of which writer lands last;
- all other fields in this race carry identical values (same container, same
  image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only issues
`Upsert` while holding the lease, and only after re-reading the record finds
it absent. Concurrent start would block on the same lease; concurrent stop /
restart would have moved the record out of "absent" by the time the reconciler
re-reads.

## 19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each candidate.
The service already owns:

- the per-game lease around the Docker `Remove` call;
- the `stopped → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`, `op_source=auto_ttl`);
- the telemetry counter and structured log fields.

In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus retention
period. The dedicated `(status, last_op_at)` index drives the underlying
`ListByStatus(stopped)` query so the database does the heavy lifting; the
Go-side filter is microseconds-per-row.

The worker uses a small `Cleaner` interface in its own package rather than
depending on `*cleanupcontainer.Service` directly. This keeps the worker's
tests light — no need to construct Docker, lease, operation-log, and telemetry
doubles just to verify TTL math — while the production wiring still binds the
real service via a compile-time interface assertion in
`internal/app/wiring.go`.
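
A hedged sketch of the worker's TTL pass. The `Cleaner` seam mirrors the role described above, while the record shape, the store interface, and the method signatures are illustrative.

```go
package sketch

import (
	"context"
	"log/slog"
	"time"
)

type stoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

type recordLister interface {
	ListByStatus(ctx context.Context, status string) ([]stoppedRecord, error)
}

// Cleaner is the narrow seam the worker depends on; production wiring binds
// the real cleanupcontainer.Service behind it.
type Cleaner interface {
	Cleanup(ctx context.Context, gameID, opSource string) error
}

func cleanupPass(ctx context.Context, logger *slog.Logger, store recordLister, cleaner Cleaner, retention time.Duration) error {
	records, err := store.ListByStatus(ctx, "stopped")
	if err != nil {
		return err
	}
	cutoff := time.Now().Add(-retention)
	for _, record := range records {
		if !record.LastOpAt.Before(cutoff) {
			continue // still inside the retention window
		}
		// The service owns the lease, the status CAS, the operation_log entry,
		// and the telemetry; the worker only selects candidates.
		if err := cleaner.Cleanup(ctx, record.GameID, "auto_ttl"); err != nil {
			logger.Error("cleanup failed", "game_id", record.GameID, "error", err)
		}
	}
	return nil
}
```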

## 20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The reconciler's
mutations are dominated by `Get` + `Upsert` / `UpdateStatus` round-trips
against PG plus an occasional Docker `InspectContainer`; the cleanup worker's
mutations are dominated by the cleanup service's `docker.Remove` call.
Parallelising either would multiply the load on the Docker daemon socket and
the PG pool without buying anything that v1 cardinality demands.

## 21. Cross-module test boundary for the consumer integration test

[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing `lobby/internal/...`:

- it XADDs a start envelope in the AsyncAPI wire shape (the same shape
  Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for the
  persistence stores, the lease, and the notification / health publishers,
  plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
  `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
  `error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher and the
real Lobby consumer alongside RTM lives at `integration/lobbyrtm/`, which is
the home for inter-service fixtures. Keeping the in-package test free of
`lobby/...` imports avoids module-internal coupling and keeps `rtmanager`'s
test suite buildable on its own.
@@ -1,3 +1,132 @@
module galaxy/rtmanager

go 1.26.2

require (
	galaxy/notificationintent v0.0.0-00010101000000-000000000000
	galaxy/postgres v0.0.0-00010101000000-000000000000
	galaxy/redisconn v0.0.0-00010101000000-000000000000
	github.com/alicebob/miniredis/v2 v2.37.0
	github.com/containerd/errdefs v1.0.0
	github.com/distribution/reference v0.6.0
	github.com/docker/docker v28.5.2+incompatible
	github.com/docker/go-units v0.5.0
	github.com/getkin/kin-openapi v0.135.0
	github.com/go-jet/jet/v2 v2.14.1
	github.com/jackc/pgx/v5 v5.9.2
	github.com/redis/go-redis/v9 v9.18.0
	github.com/stretchr/testify v1.11.1
	github.com/testcontainers/testcontainers-go v0.42.0
	github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0
	github.com/testcontainers/testcontainers-go/modules/redis v0.42.0
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0
	go.opentelemetry.io/otel v1.43.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0
	go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0
	go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0
	go.opentelemetry.io/otel/metric v1.43.0
	go.opentelemetry.io/otel/sdk v1.43.0
	go.opentelemetry.io/otel/sdk/metric v1.43.0
	go.opentelemetry.io/otel/trace v1.43.0
	go.uber.org/mock v0.6.0
	golang.org/x/mod v0.35.0
	gopkg.in/yaml.v3 v3.0.1
)

require (
	dario.cat/mergo v1.0.2 // indirect
	github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c // indirect
	github.com/Microsoft/go-winio v0.6.2 // indirect
	github.com/XSAM/otelsql v0.42.0 // indirect
	github.com/cenkalti/backoff/v4 v4.3.0 // indirect
	github.com/cenkalti/backoff/v5 v5.0.3 // indirect
	github.com/cespare/xxhash/v2 v2.3.0 // indirect
	github.com/containerd/errdefs/pkg v0.3.0 // indirect
	github.com/containerd/log v0.1.0 // indirect
	github.com/containerd/platforms v0.2.1 // indirect
	github.com/cpuguy83/dockercfg v0.3.2 // indirect
	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
	github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect
	github.com/docker/go-connections v0.7.0 // indirect
	github.com/ebitengine/purego v0.10.0 // indirect
	github.com/felixge/httpsnoop v1.0.4 // indirect
	github.com/go-logr/logr v1.4.3 // indirect
	github.com/go-logr/stdr v1.2.2 // indirect
	github.com/go-ole/go-ole v1.2.6 // indirect
	github.com/go-openapi/jsonpointer v0.21.0 // indirect
	github.com/go-openapi/swag v0.23.0 // indirect
	github.com/google/uuid v1.6.0 // indirect
	github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 // indirect
	github.com/jackc/chunkreader/v2 v2.0.1 // indirect
	github.com/jackc/pgconn v1.14.3 // indirect
	github.com/jackc/pgio v1.0.0 // indirect
	github.com/jackc/pgpassfile v1.0.0 // indirect
	github.com/jackc/pgproto3/v2 v2.3.3 // indirect
	github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect
	github.com/jackc/pgtype v1.14.4 // indirect
	github.com/jackc/puddle/v2 v2.2.2 // indirect
	github.com/josharian/intern v1.0.0 // indirect
	github.com/klauspost/compress v1.18.5 // indirect
	github.com/lib/pq v1.10.9 // indirect
	github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
	github.com/magiconair/properties v1.8.10 // indirect
	github.com/mailru/easyjson v0.7.7 // indirect
	github.com/mdelapenya/tlscert v0.2.0 // indirect
	github.com/mfridman/interpolate v0.0.2 // indirect
	github.com/moby/docker-image-spec v1.3.1 // indirect
	github.com/moby/go-archive v0.2.0 // indirect
	github.com/moby/moby/api v1.54.2 // indirect
	github.com/moby/moby/client v0.4.1 // indirect
	github.com/moby/patternmatcher v0.6.1 // indirect
	github.com/moby/sys/atomicwriter v0.1.0 // indirect
	github.com/moby/sys/sequential v0.6.0 // indirect
	github.com/moby/sys/user v0.4.0 // indirect
	github.com/moby/sys/userns v0.1.0 // indirect
	github.com/moby/term v0.5.2 // indirect
	github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826 // indirect
	github.com/morikuni/aec v1.1.0 // indirect
	github.com/oasdiff/yaml v0.0.9 // indirect
	github.com/oasdiff/yaml3 v0.0.9 // indirect
	github.com/opencontainers/go-digest v1.0.0 // indirect
	github.com/opencontainers/image-spec v1.1.1 // indirect
	github.com/perimeterx/marshmallow v1.1.5 // indirect
	github.com/pkg/errors v0.9.1 // indirect
	github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
	github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 // indirect
	github.com/pressly/goose/v3 v3.27.1 // indirect
	github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0 // indirect
	github.com/redis/go-redis/extra/redisotel/v9 v9.18.0 // indirect
	github.com/sethvargo/go-retry v0.3.0 // indirect
	github.com/shirou/gopsutil/v4 v4.26.3 // indirect
	github.com/sirupsen/logrus v1.9.4 // indirect
	github.com/tklauser/go-sysconf v0.3.16 // indirect
	github.com/tklauser/numcpus v0.11.0 // indirect
	github.com/ugorji/go/codec v1.3.1 // indirect
	github.com/woodsbury/decimal128 v1.3.0 // indirect
	github.com/yuin/gopher-lua v1.1.1 // indirect
	github.com/yusufpapurcu/wmi v1.2.4 // indirect
	go.opentelemetry.io/auto/sdk v1.2.1 // indirect
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 // indirect
	go.opentelemetry.io/proto/otlp v1.10.0 // indirect
	go.uber.org/atomic v1.11.0 // indirect
	go.uber.org/multierr v1.11.0 // indirect
	golang.org/x/crypto v0.50.0 // indirect
	golang.org/x/net v0.53.0 // indirect
	golang.org/x/sync v0.20.0 // indirect
	golang.org/x/sys v0.43.0 // indirect
	golang.org/x/text v0.36.0 // indirect
	golang.org/x/time v0.15.0 // indirect
	google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9 // indirect
	google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529 // indirect
	google.golang.org/grpc v1.80.0 // indirect
	google.golang.org/protobuf v1.36.11 // indirect
)

replace galaxy/postgres => ../pkg/postgres

replace galaxy/redisconn => ../pkg/redisconn

replace galaxy/notificationintent => ../pkg/notificationintent

@@ -0,0 +1,475 @@
|
||||
dario.cat/mergo v1.0.2 h1:85+piFYR1tMbRrLcDwR18y4UKJ3aH1Tbzi24VRW1TK8=
|
||||
dario.cat/mergo v1.0.2/go.mod h1:E/hbnu0NxMFBjpMIE34DRGLWqDy0g5FuKDhCb31ngxA=
|
||||
github.com/AdaLogics/go-fuzz-headers v0.0.0-20240806141605-e8a1dd7889d6 h1:He8afgbRMd7mFxO99hRNu+6tazq8nFF9lIwo9JFroBk=
|
||||
github.com/AdaLogics/go-fuzz-headers v0.0.0-20240806141605-e8a1dd7889d6/go.mod h1:8o94RPi1/7XTJvwPpRSzSUedZrtlirdB3r9Z20bi2f8=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c h1:udKWzYgxTojEKWjV8V+WSxDXJ4NFATAsZjh8iIbsQIg=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20250102033503-faa5f7b0171c/go.mod h1:xomTg63KZ2rFqZQzSB4Vz2SUXa1BpHTVz9L5PTmPC4E=
|
||||
github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU=
|
||||
github.com/Masterminds/semver/v3 v3.1.1/go.mod h1:VPu/7SZ7ePZ3QOrcuXROw5FAcLl4a0cBrbBpGY/8hQs=
|
||||
github.com/Microsoft/go-winio v0.6.2 h1:F2VQgta7ecxGYO8k3ZZz3RS8fVIXVxONVUPlNERoyfY=
|
||||
github.com/Microsoft/go-winio v0.6.2/go.mod h1:yd8OoFMLzJbo9gZq8j5qaps8bJ9aShtEA8Ipt1oGCvU=
|
||||
github.com/XSAM/otelsql v0.42.0 h1:Li0xF4eJUxG2e0x3D4rvRlys1f27yJKvjTh7ljkUP5o=
|
||||
github.com/XSAM/otelsql v0.42.0/go.mod h1:4mOrEv+cS1KmKzrvTktvJnstr5GtKSAK+QHvFR9OcpI=
|
||||
github.com/alicebob/miniredis/v2 v2.37.0 h1:RheObYW32G1aiJIj81XVt78ZHJpHonHLHW7OLIshq68=
|
||||
github.com/alicebob/miniredis/v2 v2.37.0/go.mod h1:TcL7YfarKPGDAthEtl5NBeHZfeUQj6OXMm/+iu5cLMM=
|
||||
github.com/bsm/ginkgo/v2 v2.12.0 h1:Ny8MWAHyOepLGlLKYmXG4IEkioBysk6GpaRTLC8zwWs=
|
||||
github.com/bsm/ginkgo/v2 v2.12.0/go.mod h1:SwYbGRRDovPVboqFv0tPTcG1sN61LM1Z4ARdbAV9g4c=
|
||||
github.com/bsm/gomega v1.27.10 h1:yeMWxP2pV2fG3FgAODIY8EiRE3dy0aeFYt4l7wh6yKA=
|
||||
github.com/bsm/gomega v1.27.10/go.mod h1:JyEr/xRbxbtgWNi8tIEVPUYZ5Dzef52k01W3YH0H+O0=
|
||||
github.com/cenkalti/backoff/v4 v4.3.0 h1:MyRJ/UdXutAwSAT+s3wNd7MfTIcy71VQueUuFK343L8=
|
||||
github.com/cenkalti/backoff/v4 v4.3.0/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=
|
||||
github.com/cenkalti/backoff/v5 v5.0.3 h1:ZN+IMa753KfX5hd8vVaMixjnqRZ3y8CuJKRKj1xcsSM=
|
||||
github.com/cenkalti/backoff/v5 v5.0.3/go.mod h1:rkhZdG3JZukswDf7f0cwqPNk4K0sa+F97BxZthm/crw=
|
||||
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
||||
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
|
||||
github.com/cockroachdb/apd v1.1.0/go.mod h1:8Sl8LxpKi29FqWXR16WEFZRNSz3SoPzUzeMeY4+DwBQ=
|
||||
github.com/containerd/errdefs v1.0.0 h1:tg5yIfIlQIrxYtu9ajqY42W3lpS19XqdxRQeEwYG8PI=
|
||||
github.com/containerd/errdefs v1.0.0/go.mod h1:+YBYIdtsnF4Iw6nWZhJcqGSg/dwvV7tyJ/kCkyJ2k+M=
|
||||
github.com/containerd/errdefs/pkg v0.3.0 h1:9IKJ06FvyNlexW690DXuQNx2KA2cUJXx151Xdx3ZPPE=
|
||||
github.com/containerd/errdefs/pkg v0.3.0/go.mod h1:NJw6s9HwNuRhnjJhM7pylWwMyAkmCQvQ4GpJHEqRLVk=
|
||||
github.com/containerd/log v0.1.0 h1:TCJt7ioM2cr/tfR8GPbGf9/VRAX8D2B4PjzCpfX540I=
|
||||
github.com/containerd/log v0.1.0/go.mod h1:VRRf09a7mHDIRezVKTRCrOq78v577GXq3bSa3EhrzVo=
|
||||
github.com/containerd/platforms v0.2.1 h1:zvwtM3rz2YHPQsF2CHYM8+KtB5dvhISiXh5ZpSBQv6A=
|
||||
github.com/containerd/platforms v0.2.1/go.mod h1:XHCb+2/hzowdiut9rkudds9bE5yJ7npe7dG/wG+uFPw=
|
||||
github.com/coreos/go-systemd v0.0.0-20190321100706-95778dfbb74e/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
|
||||
github.com/coreos/go-systemd v0.0.0-20190719114852-fd7a80b32e1f/go.mod h1:F5haX7vjVVG0kc13fIWeqUViNPyEJxv/OmvnBo0Yme4=
|
||||
github.com/cpuguy83/dockercfg v0.3.2 h1:DlJTyZGBDlXqUZ2Dk2Q3xHs/FtnooJJVaad2S9GKorA=
|
||||
github.com/cpuguy83/dockercfg v0.3.2/go.mod h1:sugsbF4//dDlL/i+S+rtpIWp+5h0BHJHfjj5/jFyUJc=
|
||||
github.com/creack/pty v1.1.7/go.mod h1:lj5s0c3V2DBrqTV7llrYr5NG6My20zk30Fl46Y7DoTY=
|
||||
github.com/creack/pty v1.1.24 h1:bJrF4RRfyJnbTJqzRLHzcGaZK1NeM5kTC9jGgovnR1s=
|
||||
github.com/creack/pty v1.1.24/go.mod h1:08sCNb52WyoAwi2QDyzUCTgcvVFhUzewun7wtTfvcwE=
|
||||
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM=
|
||||
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f h1:lO4WD4F/rVNCu3HqELle0jiPLLBs70cWOduZpkS1E78=
|
||||
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f/go.mod h1:cuUVRXasLTGF7a8hSLbxyZXjz+1KgoB3wDUb6vlszIc=
|
||||
github.com/distribution/reference v0.6.0 h1:0IXCQ5g4/QMHHkarYzh5l+u8T3t73zM5QvfrDyIgxBk=
|
||||
github.com/distribution/reference v0.6.0/go.mod h1:BbU0aIcezP1/5jX/8MP0YiH4SdvB5Y4f/wlDRiLyi3E=
|
||||
github.com/docker/docker v28.5.2+incompatible h1:DBX0Y0zAjZbSrm1uzOkdr1onVghKaftjlSWt4AFexzM=
|
||||
github.com/docker/docker v28.5.2+incompatible/go.mod h1:eEKB0N0r5NX/I1kEveEz05bcu8tLC/8azJZsviup8Sk=
|
||||
github.com/docker/go-connections v0.7.0 h1:6SsRfJddP22WMrCkj19x9WKjEDTB+ahsdiGYf0mN39c=
|
||||
github.com/docker/go-connections v0.7.0/go.mod h1:no1qkHdjq7kLMGUXYAduOhYPSJxxvgWBh7ogVvptn3Q=
|
||||
github.com/docker/go-units v0.5.0 h1:69rxXcBk27SvSaaxTtLh/8llcHD8vYHT7WSdRZ/jvr4=
|
||||
github.com/docker/go-units v0.5.0/go.mod h1:fgPhTUdO+D/Jk86RDLlptpiXQzgHJF7gydDDbaIK4Dk=
|
||||
github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY=
|
||||
github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto=
|
||||
github.com/ebitengine/purego v0.10.0 h1:QIw4xfpWT6GWTzaW5XEKy3HXoqrJGx1ijYHzTF0/ISU=
|
||||
github.com/ebitengine/purego v0.10.0/go.mod h1:iIjxzd6CiRiOG0UyXP+V1+jWqUXVjPKLAI0mRfJZTmQ=
|
||||
github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
|
||||
github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
|
||||
github.com/getkin/kin-openapi v0.135.0 h1:751SjYfbiwqukYuVjwYEIKNfrSwS5YpA7DZnKSwQgtg=
|
||||
github.com/getkin/kin-openapi v0.135.0/go.mod h1:6dd5FJl6RdX4usBtFBaQhk9q62Yb2J0Mk5IhUO/QqFI=
|
||||
github.com/go-jet/jet/v2 v2.14.1 h1:wsfD9e7CGP9h46+IFNlftfncBcmVnKddikbTtapQM3M=
|
||||
github.com/go-jet/jet/v2 v2.14.1/go.mod h1:dqTAECV2Mo3S2NFjbm4vJ1aDruZjhaJ1RAAR8rGUkkc=
|
||||
github.com/go-kit/log v0.1.0/go.mod h1:zbhenjAZHb184qTLMA9ZjW7ThYL0H2mk7Q6pNt4vbaY=
|
||||
github.com/go-logfmt/logfmt v0.5.0/go.mod h1:wCYkCAKZfumFQihp8CzCvQ3paCTfi41vtzG1KdI/P7A=
|
||||
github.com/go-logr/logr v1.2.2/go.mod h1:jdQByPbusPIv2/zmleS9BjJVeZ6kBagPoEUsqbVz/1A=
|
||||
github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
|
||||
github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
|
||||
github.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag=
|
||||
github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE=
|
||||
github.com/go-ole/go-ole v1.2.6 h1:/Fpf6oFPoeFik9ty7siob0G6Ke8QvQEuVcuChpwXzpY=
|
||||
github.com/go-ole/go-ole v1.2.6/go.mod h1:pprOEPIfldk/42T2oK7lQ4v4JSDwmV0As9GaiUsvbm0=
|
||||
github.com/go-openapi/jsonpointer v0.21.0 h1:YgdVicSA9vH5RiHs9TZW5oyafXZFc6+2Vc1rr/O9oNQ=
|
||||
github.com/go-openapi/jsonpointer v0.21.0/go.mod h1:IUyH9l/+uyhIYQ/PXVA41Rexl+kOkAPDdXEYns6fzUY=
|
||||
github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE=
|
||||
github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ=
|
||||
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
|
||||
github.com/go-test/deep v1.0.8 h1:TDsG77qcSprGbC6vTN8OuXp5g+J+b5Pcguhf7Zt61VM=
|
||||
github.com/go-test/deep v1.0.8/go.mod h1:5C2ZWiW0ErCdrYzpqxLbTX7MG14M9iiw8DgHncVwcsE=
|
||||
github.com/gofrs/uuid v4.0.0+incompatible/go.mod h1:b2aQJv3Z4Fp6yNu3cdSllBxTCLRxnplIgP/c0N/04lM=
|
||||
github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
|
||||
github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
|
||||
github.com/google/go-cmp v0.5.6/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
|
||||
github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
|
||||
github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
|
||||
github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI=
|
||||
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0=
|
||||
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
|
||||
github.com/gorilla/mux v1.8.0 h1:i40aqfkR1h2SlN9hojwV5ZA91wcXFOvkdNIeFDP5koI=
|
||||
github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0 h1:HWRh5R2+9EifMyIHV7ZV+MIZqgz+PMpZ14Jynv3O2Zs=
|
||||
github.com/grpc-ecosystem/grpc-gateway/v2 v2.28.0/go.mod h1:JfhWUomR1baixubs02l85lZYYOm7LV6om4ceouMv45c=
|
||||
github.com/jackc/chunkreader v1.0.0/go.mod h1:RT6O25fNZIuasFJRyZ4R/Y2BbhasbmZXF9QQ7T3kePo=
|
||||
github.com/jackc/chunkreader/v2 v2.0.0/go.mod h1:odVSm741yZoC3dpHEUXIqA9tQRhFrgOHwnPIn9lDKlk=
|
||||
github.com/jackc/chunkreader/v2 v2.0.1 h1:i+RDz65UE+mmpjTfyz0MoVTnzeYxroil2G82ki7MGG8=
|
||||
github.com/jackc/chunkreader/v2 v2.0.1/go.mod h1:odVSm741yZoC3dpHEUXIqA9tQRhFrgOHwnPIn9lDKlk=
|
||||
github.com/jackc/pgconn v0.0.0-20190420214824-7e0022ef6ba3/go.mod h1:jkELnwuX+w9qN5YIfX0fl88Ehu4XC3keFuOJJk9pcnA=
|
||||
github.com/jackc/pgconn v0.0.0-20190824142844-760dd75542eb/go.mod h1:lLjNuW/+OfW9/pnVKPazfWOgNfH2aPem8YQ7ilXGvJE=
|
||||
github.com/jackc/pgconn v0.0.0-20190831204454-2fabfa3c18b7/go.mod h1:ZJKsE/KZfsUgOEh9hBm+xYTstcNHg7UPMVJqRfQxq4s=
|
||||
github.com/jackc/pgconn v1.8.0/go.mod h1:1C2Pb36bGIP9QHGBYCjnyhqu7Rv3sGshaQUvmfGIB/o=
|
||||
github.com/jackc/pgconn v1.9.0/go.mod h1:YctiPyvzfU11JFxoXokUOOKQXQmDMoJL9vJzHH8/2JY=
|
||||
github.com/jackc/pgconn v1.9.1-0.20210724152538-d89c8390a530/go.mod h1:4z2w8XhRbP1hYxkpTuBjTS3ne3J48K83+u0zoyvg2pI=
|
||||
github.com/jackc/pgconn v1.14.3 h1:bVoTr12EGANZz66nZPkMInAV/KHD2TxH9npjXXgiB3w=
|
||||
github.com/jackc/pgconn v1.14.3/go.mod h1:RZbme4uasqzybK2RK5c65VsHxoyaml09lx3tXOcO/VM=
|
||||
github.com/jackc/pgio v1.0.0 h1:g12B9UwVnzGhueNavwioyEEpAmqMe1E/BN9ES+8ovkE=
|
||||
github.com/jackc/pgio v1.0.0/go.mod h1:oP+2QK2wFfUWgr+gxjoBH9KGBb31Eio69xUb0w5bYf8=
|
||||
github.com/jackc/pgmock v0.0.0-20190831213851-13a1b77aafa2/go.mod h1:fGZlG77KXmcq05nJLRkk0+p82V8B8Dw8KN2/V9c/OAE=
|
||||
github.com/jackc/pgmock v0.0.0-20201204152224-4fe30f7445fd/go.mod h1:hrBW0Enj2AZTNpt/7Y5rr2xe/9Mn757Wtb2xeBzPv2c=
|
||||
github.com/jackc/pgmock v0.0.0-20210724152146-4ad1a8207f65 h1:DadwsjnMwFjfWc9y5Wi/+Zz7xoE5ALHsRQlOctkOiHc=
|
||||
github.com/jackc/pgmock v0.0.0-20210724152146-4ad1a8207f65/go.mod h1:5R2h2EEX+qri8jOWMbJCtaPWkrrNc7OHwsp2TCqp7ak=
|
||||
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
|
||||
github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg=
|
||||
github.com/jackc/pgproto3 v1.1.0/go.mod h1:eR5FA3leWg7p9aeAqi37XOTgTIbkABlvcPB3E5rlc78=
|
||||
github.com/jackc/pgproto3/v2 v2.0.0-alpha1.0.20190420180111-c116219b62db/go.mod h1:bhq50y+xrl9n5mRYyCBFKkpRVTLYJVWeCc+mEAI3yXA=
|
||||
github.com/jackc/pgproto3/v2 v2.0.0-alpha1.0.20190609003834-432c2951c711/go.mod h1:uH0AWtUmuShn0bcesswc4aBTWGvw0cAxIJp+6OB//Wg=
|
||||
github.com/jackc/pgproto3/v2 v2.0.0-rc3/go.mod h1:ryONWYqW6dqSg1Lw6vXNMXoBJhpzvWKnT95C46ckYeM=
|
||||
github.com/jackc/pgproto3/v2 v2.0.0-rc3.0.20190831210041-4c03ce451f29/go.mod h1:ryONWYqW6dqSg1Lw6vXNMXoBJhpzvWKnT95C46ckYeM=
|
||||
github.com/jackc/pgproto3/v2 v2.0.6/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
|
||||
github.com/jackc/pgproto3/v2 v2.1.1/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
|
||||
github.com/jackc/pgproto3/v2 v2.3.3 h1:1HLSx5H+tXR9pW3in3zaztoEwQYRC9SQaYUHjTSUOag=
|
||||
github.com/jackc/pgproto3/v2 v2.3.3/go.mod h1:WfJCnwN3HIg9Ish/j3sgWXnAfK8A9Y0bwXYU5xKaEdA=
|
||||
github.com/jackc/pgservicefile v0.0.0-20200714003250-2b9c44734f2b/go.mod h1:vsD4gTJCa9TptPL8sPkXrLZ+hDuNrZCnj29CQpr4X1E=
|
||||
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
|
||||
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 h1:iCEnooe7UlwOQYpKFhBabPMi4aNAfoODPEFNiAnClxo=
|
||||
github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
|
||||
github.com/jackc/pgtype v0.0.0-20190421001408-4ed0de4755e0/go.mod h1:hdSHsc1V01CGwFsrv11mJRHWJ6aifDLfdV3aVjFF0zg=
|
||||
github.com/jackc/pgtype v0.0.0-20190824184912-ab885b375b90/go.mod h1:KcahbBH1nCMSo2DXpzsoWOAfFkdEtEJpPbVLq8eE+mc=
|
||||
github.com/jackc/pgtype v0.0.0-20190828014616-a8802b16cc59/go.mod h1:MWlu30kVJrUS8lot6TQqcg7mtthZ9T0EoIBFiJcmcyw=
|
||||
github.com/jackc/pgtype v1.8.1-0.20210724151600-32e20a603178/go.mod h1:C516IlIV9NKqfsMCXTdChteoXmwgUceqaLfjg2e3NlM=
|
||||
github.com/jackc/pgtype v1.14.0/go.mod h1:LUMuVrfsFfdKGLw+AFFVv6KtHOFMwRgDDzBt76IqCA4=
|
||||
github.com/jackc/pgtype v1.14.4 h1:fKuNiCumbKTAIxQwXfB/nsrnkEI6bPJrrSiMKgbJ2j8=
|
||||
github.com/jackc/pgtype v1.14.4/go.mod h1:aKeozOde08iifGosdJpz9MBZonJOUJxqNpPBcMJTlVA=
|
||||
github.com/jackc/pgx/v4 v4.0.0-20190420224344-cc3461e65d96/go.mod h1:mdxmSJJuR08CZQyj1PVQBHy9XOp5p8/SHH6a0psbY9Y=
|
||||
github.com/jackc/pgx/v4 v4.0.0-20190421002000-1b8f0016e912/go.mod h1:no/Y67Jkk/9WuGR0JG/JseM9irFbnEPbuWV2EELPNuM=
|
||||
github.com/jackc/pgx/v4 v4.0.0-pre1.0.20190824185557-6972a5742186/go.mod h1:X+GQnOEnf1dqHGpw7JmHqHc1NxDoalibchSk9/RWuDc=
|
||||
github.com/jackc/pgx/v4 v4.12.1-0.20210724153913-640aa07df17c/go.mod h1:1QD0+tgSXP7iUjYm9C1NxKhny7lq6ee99u/z+IHFcgs=
|
||||
github.com/jackc/pgx/v4 v4.18.2/go.mod h1:Ey4Oru5tH5sB6tV7hDmfWFahwF15Eb7DNXlRKx2CkVw=
|
||||
github.com/jackc/pgx/v4 v4.18.3 h1:dE2/TrEsGX3RBprb3qryqSV9Y60iZN1C6i8IrmW9/BA=
|
||||
github.com/jackc/pgx/v4 v4.18.3/go.mod h1:Ey4Oru5tH5sB6tV7hDmfWFahwF15Eb7DNXlRKx2CkVw=
|
||||
github.com/jackc/pgx/v5 v5.9.2 h1:3ZhOzMWnR4yJ+RW1XImIPsD1aNSz4T4fyP7zlQb56hw=
|
||||
github.com/jackc/pgx/v5 v5.9.2/go.mod h1:mal1tBGAFfLHvZzaYh77YS/eC6IX9OWbRV1QIIM0Jn4=
|
||||
github.com/jackc/puddle v0.0.0-20190413234325-e4ced69a3a2b/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
|
||||
github.com/jackc/puddle v0.0.0-20190608224051-11cab39313c9/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
|
||||
github.com/jackc/puddle v1.1.3/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
|
||||
github.com/jackc/puddle v1.3.0/go.mod h1:m4B5Dj62Y0fbyuIc15OsIqK0+JU8nkqQjsgx7dvjSWk=
|
||||
github.com/jackc/puddle/v2 v2.2.2 h1:PR8nw+E/1w0GLuRFSmiioY6UooMp6KJv0/61nB7icHo=
|
||||
github.com/jackc/puddle/v2 v2.2.2/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4=
|
||||
github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY=
|
||||
github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y=
|
||||
github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck=
|
||||
github.com/klauspost/compress v1.18.5 h1:/h1gH5Ce+VWNLSWqPzOVn6XBO+vJbCNGvjoaGBFW2IE=
|
||||
github.com/klauspost/compress v1.18.5/go.mod h1:cwPg85FWrGar70rWktvGQj8/hthj3wpl0PGDogxkrSQ=
|
||||
github.com/klauspost/cpuid/v2 v2.3.0 h1:S4CRMLnYUhGeDFDqkGriYKdfoFlDnMtqTiI/sFzhA9Y=
|
||||
github.com/klauspost/cpuid/v2 v2.3.0/go.mod h1:hqwkgyIinND0mEev00jJYCxPNVRVXFQeu1XKlok6oO0=
|
||||
github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
|
||||
github.com/konsorten/go-windows-terminal-sequences v1.0.2/go.mod h1:T0+1ngSBFLxvqU3pZ+m/2kptfBszLMUkC4ZK/EgS/cQ=
|
||||
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
|
||||
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
|
||||
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
|
||||
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
|
||||
github.com/kr/pty v1.1.8/go.mod h1:O1sed60cT9XZ5uDucP5qwvh+TE3NnUj51EiZO/lmSfw=
|
||||
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
|
||||
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
|
||||
github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
|
||||
github.com/lib/pq v1.0.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
|
||||
github.com/lib/pq v1.1.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
|
||||
github.com/lib/pq v1.2.0/go.mod h1:5WUZQaWbwv1U+lTReE5YruASi9Al49XbQIvNi/34Woo=
|
||||
github.com/lib/pq v1.10.2/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o=
|
||||
github.com/lib/pq v1.10.9 h1:YXG7RB+JIjhP29X+OtkiDnYaXQwpS4JEWq7dtCCRUEw=
|
||||
github.com/lib/pq v1.10.9/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o=
|
||||
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 h1:6E+4a0GO5zZEnZ81pIr0yLvtUWk2if982qA3F3QD6H4=
|
||||
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0/go.mod h1:zJYVVT2jmtg6P3p1VtQj7WsuWi/y4VnjVBn7F8KPB3I=
|
||||
github.com/magiconair/properties v1.8.10 h1:s31yESBquKXCV9a/ScB3ESkOjUYYv+X0rg8SYxI99mE=
|
||||
github.com/magiconair/properties v1.8.10/go.mod h1:Dhd985XPs7jluiymwWYZ0G4Z61jb3vdS329zhj2hYo0=
|
||||
github.com/mailru/easyjson v0.7.7 h1:UGYAvKxe3sBsEDzO8ZeWOSlIQfWFlxbzLZe7hwFURr0=
|
||||
github.com/mailru/easyjson v0.7.7/go.mod h1:xzfreul335JAWq5oZzymOObrkdz5UnU4kGfJJLY9Nlc=
|
||||
github.com/mattn/go-colorable v0.1.1/go.mod h1:FuOcm+DKB9mbwrcAfNl7/TZVBZ6rcnceauSikq3lYCQ=
|
||||
github.com/mattn/go-colorable v0.1.6/go.mod h1:u6P/XSegPjTcexA+o6vUJrdnUu04hMope9wVRipJSqc=
|
||||
github.com/mattn/go-isatty v0.0.5/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
|
||||
github.com/mattn/go-isatty v0.0.7/go.mod h1:Iq45c/XA43vh69/j3iqttzPXn0bhXyGjM0Hdxcsrc5s=
|
||||
github.com/mattn/go-isatty v0.0.12/go.mod h1:cbi8OIDigv2wuxKPP5vlRcQ1OAZbq2CE4Kysco4FUpU=
|
||||
github.com/mattn/go-isatty v0.0.21 h1:xYae+lCNBP7QuW4PUnNG61ffM4hVIfm+zUzDuSzYLGs=
|
||||
github.com/mattn/go-isatty v0.0.21/go.mod h1:ZXfXG4SQHsB/w3ZeOYbR0PrPwLy+n6xiMrJlRFqopa4=
|
||||
github.com/mdelapenya/tlscert v0.2.0 h1:7H81W6Z/4weDvZBNOfQte5GpIMo0lGYEeWbkGp5LJHI=
|
||||
github.com/mdelapenya/tlscert v0.2.0/go.mod h1:O4njj3ELLnJjGdkN7M/vIVCpZ+Cf0L6muqOG4tLSl8o=
|
||||
github.com/mfridman/interpolate v0.0.2 h1:pnuTK7MQIxxFz1Gr+rjSIx9u7qVjf5VOoM/u6BbAxPY=
|
||||
github.com/mfridman/interpolate v0.0.2/go.mod h1:p+7uk6oE07mpE/Ik1b8EckO0O4ZXiGAfshKBWLUM9Xg=
|
||||
github.com/moby/docker-image-spec v1.3.1 h1:jMKff3w6PgbfSa69GfNg+zN/XLhfXJGnEx3Nl2EsFP0=
|
||||
github.com/moby/docker-image-spec v1.3.1/go.mod h1:eKmb5VW8vQEh/BAr2yvVNvuiJuY6UIocYsFu/DxxRpo=
|
||||
github.com/moby/go-archive v0.2.0 h1:zg5QDUM2mi0JIM9fdQZWC7U8+2ZfixfTYoHL7rWUcP8=
|
||||
github.com/moby/go-archive v0.2.0/go.mod h1:mNeivT14o8xU+5q1YnNrkQVpK+dnNe/K6fHqnTg4qPU=
|
||||
github.com/moby/moby/api v1.54.2 h1:wiat9QAhnDQjA7wk1kh/TqHz2I1uUA7M7t9SAl/JNXg=
|
||||
github.com/moby/moby/api v1.54.2/go.mod h1:+RQ6wluLwtYaTd1WnPLykIDPekkuyD/ROWQClE83pzs=
|
||||
github.com/moby/moby/client v0.4.1 h1:DMQgisVoMkmMs7fp3ROSdiBnoAu8+vo3GggFl06M/wY=
|
||||
github.com/moby/moby/client v0.4.1/go.mod h1:z52C9O2POPOsnxZAy//WtKcQ32P+jT/NGeXu/7nfjGQ=
|
||||
github.com/moby/patternmatcher v0.6.1 h1:qlhtafmr6kgMIJjKJMDmMWq7WLkKIo23hsrpR3x084U=
|
||||
github.com/moby/patternmatcher v0.6.1/go.mod h1:hDPoyOpDY7OrrMDLaYoY3hf52gNCR/YOUYxkhApJIxc=
|
||||
github.com/moby/sys/atomicwriter v0.1.0 h1:kw5D/EqkBwsBFi0ss9v1VG3wIkVhzGvLklJ+w3A14Sw=
|
||||
github.com/moby/sys/atomicwriter v0.1.0/go.mod h1:Ul8oqv2ZMNHOceF643P6FKPXeCmYtlQMvpizfsSoaWs=
|
||||
github.com/moby/sys/sequential v0.6.0 h1:qrx7XFUd/5DxtqcoH1h438hF5TmOvzC/lspjy7zgvCU=
|
||||
github.com/moby/sys/sequential v0.6.0/go.mod h1:uyv8EUTrca5PnDsdMGXhZe6CCe8U/UiTWd+lL+7b/Ko=
|
||||
github.com/moby/sys/user v0.4.0 h1:jhcMKit7SA80hivmFJcbB1vqmw//wU61Zdui2eQXuMs=
|
||||
github.com/moby/sys/user v0.4.0/go.mod h1:bG+tYYYJgaMtRKgEmuueC0hJEAZWwtIbZTB+85uoHjs=
|
||||
github.com/moby/sys/userns v0.1.0 h1:tVLXkFOxVu9A64/yh59slHVv9ahO9UIev4JZusOLG/g=
|
||||
github.com/moby/sys/userns v0.1.0/go.mod h1:IHUYgu/kao6N8YZlp9Cf444ySSvCmDlmzUcYfDHOl28=
|
||||
github.com/moby/term v0.5.2 h1:6qk3FJAFDs6i/q3W/pQ97SX192qKfZgGjCQqfCJkgzQ=
|
||||
github.com/moby/term v0.5.2/go.mod h1:d3djjFCrjnB+fl8NJux+EJzu0msscUP+f8it8hPkFLc=
|
||||
github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826 h1:RWengNIwukTxcDr9M+97sNutRR1RKhG96O6jWumTTnw=
|
||||
github.com/mohae/deepcopy v0.0.0-20170929034955-c48cc78d4826/go.mod h1:TaXosZuwdSHYgviHp1DAtfrULt5eUgsSMsZf+YrPgl8=
|
||||
github.com/morikuni/aec v1.1.0 h1:vBBl0pUnvi/Je71dsRrhMBtreIqNMYErSAbEeb8jrXQ=
|
||||
github.com/morikuni/aec v1.1.0/go.mod h1:xDRgiq/iw5l+zkao76YTKzKttOp2cwPEne25HDkJnBw=
|
||||
github.com/ncruces/go-strftime v1.0.0 h1:HMFp8mLCTPp341M/ZnA4qaf7ZlsbTc+miZjCLOFAw7w=
|
||||
github.com/ncruces/go-strftime v1.0.0/go.mod h1:Fwc5htZGVVkseilnfgOVb9mKy6w1naJmn9CehxcKcls=
|
||||
github.com/oasdiff/yaml v0.0.9 h1:zQOvd2UKoozsSsAknnWoDJlSK4lC0mpmjfDsfqNwX48=
|
||||
github.com/oasdiff/yaml v0.0.9/go.mod h1:8lvhgJG4xiKPj3HN5lDow4jZHPlx1i7dIwzkdAo6oAM=
|
||||
github.com/oasdiff/yaml3 v0.0.9 h1:rWPrKccrdUm8J0F3sGuU+fuh9+1K/RdJlWF7O/9yw2g=
|
||||
github.com/oasdiff/yaml3 v0.0.9/go.mod h1:y5+oSEHCPT/DGrS++Wc/479ERge0zTFxaF8PbGKcg2o=
|
||||
github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
|
||||
github.com/opencontainers/go-digest v1.0.0/go.mod h1:0JzlMkj0TRzQZfJkVvzbP0HBR3IKzErnv2BNG4W4MAM=
|
||||
github.com/opencontainers/image-spec v1.1.1 h1:y0fUlFfIZhPF1W537XOLg0/fcx6zcHCJwooC2xJA040=
|
||||
github.com/opencontainers/image-spec v1.1.1/go.mod h1:qpqAh3Dmcf36wStyyWU+kCeDgrGnAve2nCC8+7h8Q0M=
|
||||
github.com/perimeterx/marshmallow v1.1.5 h1:a2LALqQ1BlHM8PZblsDdidgv1mWi1DgC2UmX50IvK2s=
|
||||
github.com/perimeterx/marshmallow v1.1.5/go.mod h1:dsXbUu8CRzfYP5a87xpp0xq9S3u0Vchtcl8we9tYaXw=
|
||||
github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
|
||||
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
|
||||
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U=
|
||||
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55 h1:o4JXh1EVt9k/+g42oCprj/FisM4qX9L3sZB3upGN2ZU=
|
||||
github.com/power-devops/perfstat v0.0.0-20240221224432-82ca36839d55/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE=
|
||||
github.com/pressly/goose/v3 v3.27.1 h1:6uEvcprBybDmW4hcz3gYujhARhye+GoWKhEWyzD5sh4=
|
||||
github.com/pressly/goose/v3 v3.27.1/go.mod h1:maruOxsPnIG2yHHyo8UqKWXYKFcH7Q76csUV7+7KYoM=
|
||||
github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0 h1:QY4nmPHLFAJjtT5O4OMUEOxP8WVaRNOFpcbmxT2NLZU=
|
||||
github.com/redis/go-redis/extra/rediscmd/v9 v9.18.0/go.mod h1:WH8cY/0fT41Bsf341qzo8v4nx0GCE8FykAA23IVbVmo=
|
||||
github.com/redis/go-redis/extra/redisotel/v9 v9.18.0 h1:2dKdoEYBJ0CZCLPiCdvvc7luz3DPwY6hKdzjL6m1eHE=
|
||||
github.com/redis/go-redis/extra/redisotel/v9 v9.18.0/go.mod h1:WzkrVG9ro9BwCQD0eJOWn6AGL4Z1CleGflM45w1hu10=
|
||||
github.com/redis/go-redis/v9 v9.18.0 h1:pMkxYPkEbMPwRdenAzUNyFNrDgHx9U+DrBabWNfSRQs=
|
||||
github.com/redis/go-redis/v9 v9.18.0/go.mod h1:k3ufPphLU5YXwNTUcCRXGxUoF1fqxnhFQmscfkCoDA0=
|
||||
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec h1:W09IVJc94icq4NjY3clb7Lk8O1qJ8BdBEF8z0ibU0rE=
|
||||
github.com/remyoudompheng/bigfft v0.0.0-20230129092748-24d4a6f8daec/go.mod h1:qqbHyh8v60DhA7CoWK5oRCqLrMHRGoxYCSS9EjAz6Eo=
|
||||
github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
|
||||
github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ=
|
||||
github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc=
|
||||
github.com/rs/xid v1.2.1/go.mod h1:+uKXf+4Djp6Md1KODXJxgGQPKngRmWyn10oCKFzNHOQ=
|
||||
github.com/rs/zerolog v1.13.0/go.mod h1:YbFCdg8HfsridGWAh22vktObvhZbQsZXe4/zB0OKkWU=
|
||||
github.com/rs/zerolog v1.15.0/go.mod h1:xYTKnLHcpfU2225ny5qZjxnj9NvkumZYjJHlAThCjNc=
|
||||
github.com/satori/go.uuid v1.2.0/go.mod h1:dA0hQrYB0VpLJoorglMZABFdXlWrHn1NEOzdhQKdks0=
|
||||
github.com/sethvargo/go-retry v0.3.0 h1:EEt31A35QhrcRZtrYFDTBg91cqZVnFL2navjDrah2SE=
|
||||
github.com/sethvargo/go-retry v0.3.0/go.mod h1:mNX17F0C/HguQMyMyJxcnU471gOZGxCLyYaFyAZraas=
|
||||
github.com/shirou/gopsutil/v4 v4.26.3 h1:2ESdQt90yU3oXF/CdOlRCJxrP+Am1aBYubTMTfxJ1qc=
|
||||
github.com/shirou/gopsutil/v4 v4.26.3/go.mod h1:LZ6ewCSkBqUpvSOf+LsTGnRinC6iaNUNMGBtDkJBaLQ=
|
||||
github.com/shopspring/decimal v0.0.0-20180709203117-cd690d0c9e24/go.mod h1:M+9NzErvs504Cn4c5DxATwIqPbtswREoFCre64PpcG4=
|
||||
github.com/shopspring/decimal v1.2.0/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFRcu2hWCYk4o=
|
||||
github.com/sirupsen/logrus v1.4.1/go.mod h1:ni0Sbl8bgC9z8RoU9G6nDWqqs/fq4eDPysMBDgk/93Q=
|
||||
github.com/sirupsen/logrus v1.4.2/go.mod h1:tLMulIdttU9McNUspp0xgXVQah82FyeX6MwdIuYE2rE=
|
||||
github.com/sirupsen/logrus v1.9.4 h1:TsZE7l11zFCLZnZ+teH4Umoq5BhEIfIzfRDZ1Uzql2w=
|
||||
github.com/sirupsen/logrus v1.9.4/go.mod h1:ftWc9WdOfJ0a92nsE2jF5u5ZwH8Bv2zdeOC42RjbV2g=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.1.1/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.2.0/go.mod h1:qt09Ya8vawLte6SNmTgCsAVtYtaKzEcn8ATUoHMkEqE=
|
||||
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
|
||||
github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo=
|
||||
github.com/stretchr/objx v0.5.3 h1:jmXUvGomnU1o3W/V5h2VEradbpJDwGrzugQQvL0POH4=
|
||||
github.com/stretchr/objx v0.5.3/go.mod h1:rDQraq+vQZU7Fde9LOZLr8Tax6zZvy4kuNKF+QYS+U0=
|
||||
github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs=
|
||||
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
|
||||
github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=
|
||||
github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA=
|
||||
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
|
||||
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
|
||||
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
|
||||
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
|
||||
github.com/testcontainers/testcontainers-go v0.42.0 h1:He3IhTzTZOygSXLJPMX7n44XtK+qhjat1nI9cneBbUY=
|
||||
github.com/testcontainers/testcontainers-go v0.42.0/go.mod h1:vZjdY1YmUA1qEForxOIOazfsrdyORJAbhi0bp8plN30=
|
||||
github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0 h1:GCbb1ndrF7OTDiIvxXyItaDab4qkzTFJ48LKFdM7EIo=
|
||||
github.com/testcontainers/testcontainers-go/modules/postgres v0.42.0/go.mod h1:IRPBaI8jXdrNfD0e4Zm7Fbcgaz5shKxOQv4axiL09xs=
|
||||
github.com/testcontainers/testcontainers-go/modules/redis v0.42.0 h1:id/6LH8ZeDrtAUVSuNvZUAJ1kVpb82y1pr9yweAWsRg=
|
||||
github.com/tklauser/go-sysconf v0.3.16 h1:frioLaCQSsF5Cy1jgRBrzr6t502KIIwQ0MArYICU0nA=
|
||||
github.com/tklauser/go-sysconf v0.3.16/go.mod h1:/qNL9xxDhc7tx3HSRsLWNnuzbVfh3e7gh/BmM179nYI=
|
||||
github.com/tklauser/numcpus v0.11.0 h1:nSTwhKH5e1dMNsCdVBukSZrURJRoHbSEQjdEbY+9RXw=
|
||||
github.com/tklauser/numcpus v0.11.0/go.mod h1:z+LwcLq54uWZTX0u/bGobaV34u6V7KNlTZejzM6/3MQ=
|
||||
github.com/ugorji/go/codec v1.3.1 h1:waO7eEiFDwidsBN6agj1vJQ4AG7lh2yqXyOXqhgQuyY=
|
||||
github.com/ugorji/go/codec v1.3.1/go.mod h1:pRBVtBSKl77K30Bv8R2P+cLSGaTtex6fsA2Wjqmfxj4=
|
||||
github.com/woodsbury/decimal128 v1.3.0 h1:8pffMNWIlC0O5vbyHWFZAt5yWvWcrHA+3ovIIjVWss0=
|
||||
github.com/woodsbury/decimal128 v1.3.0/go.mod h1:C5UTmyTjW3JftjUFzOVhC20BEQa2a4ZKOB5I6Zjb+ds=
|
||||
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
|
||||
github.com/yuin/gopher-lua v1.1.1 h1:kYKnWBjvbNP4XLT3+bPEwAXJx262OhaHDWDVOPjL46M=
|
||||
github.com/yuin/gopher-lua v1.1.1/go.mod h1:GBR0iDaNXjAgGg9zfCvksxSRnQx76gclCIb7kdAd1Pw=
|
||||
github.com/yusufpapurcu/wmi v1.2.4 h1:zFUKzehAFReQwLys1b/iSMl+JQGSCSjtVqQn9bBrPo0=
|
||||
github.com/yusufpapurcu/wmi v1.2.4/go.mod h1:SBZ9tNy3G9/m5Oi98Zks0QjeHVDvuK0qfxQmPyzfmi0=
|
||||
github.com/zeebo/xxh3 v1.0.2 h1:xZmwmqxHZA8AI603jOQ0tMqmBr9lPeFwGg6d+xy9DC0=
|
||||
github.com/zeebo/xxh3 v1.0.2/go.mod h1:5NWz9Sef7zIDm2JHfFlcQvNekmcEl9ekUZQQKCYaDcA=
|
||||
github.com/zenazn/goji v0.9.0/go.mod h1:7S9M489iMyHBNxwZnk9/EHS098H4/F6TATF2mIxtB1Q=
|
||||
go.opentelemetry.io/auto/sdk v1.2.1 h1:jXsnJ4Lmnqd11kwkBV2LgLoFMZKizbCi5fNZ/ipaZ64=
|
||||
go.opentelemetry.io/auto/sdk v1.2.1/go.mod h1:KRTj+aOaElaLi+wW1kO/DZRXwkF4C5xPbEe3ZiIhN7Y=
|
||||
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0 h1:CqXxU8VOmDefoh0+ztfGaymYbhdB/tT3zs79QaZTNGY=
|
||||
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.68.0/go.mod h1:BuhAPThV8PBHBvg8ZzZ/Ok3idOdhWIodywz2xEcRbJo=
|
||||
go.opentelemetry.io/otel v1.43.0 h1:mYIM03dnh5zfN7HautFE4ieIig9amkNANT+xcVxAj9I=
|
||||
go.opentelemetry.io/otel v1.43.0/go.mod h1:JuG+u74mvjvcm8vj8pI5XiHy1zDeoCS2LB1spIq7Ay0=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0 h1:8UQVDcZxOJLtX6gxtDt3vY2WTgvZqMQRzjsqiIHQdkc=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.43.0/go.mod h1:2lmweYCiHYpEjQ/lSJBYhj9jP1zvCvQW4BqL9dnT7FQ=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0 h1:w1K+pCJoPpQifuVpsKamUdn9U0zM3xUziVOqsGksUrY=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp v1.43.0/go.mod h1:HBy4BjzgVE8139ieRI75oXm3EcDN+6GhD88JT1Kjvxg=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0 h1:88Y4s2C8oTui1LGM6bTWkw0ICGcOLCAI5l6zsD1j20k=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.43.0/go.mod h1:Vl1/iaggsuRlrHf/hfPJPvVag77kKyvrLeD10kpMl+A=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0 h1:RAE+JPfvEmvy+0LzyUA25/SGawPwIUbZ6u0Wug54sLc=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.43.0/go.mod h1:AGmbycVGEsRx9mXMZ75CsOyhSP6MFIcj/6dnG+vhVjk=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0 h1:3iZJKlCZufyRzPzlQhUIWVmfltrXuGyfjREgGP3UUjc=
|
||||
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp v1.43.0/go.mod h1:/G+nUPfhq2e+qiXMGxMwumDrP5jtzU+mWN7/sjT2rak=
|
||||
go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0 h1:TC+BewnDpeiAmcscXbGMfxkO+mwYUwE/VySwvw88PfA=
|
||||
go.opentelemetry.io/otel/exporters/stdout/stdoutmetric v1.43.0/go.mod h1:J/ZyF4vfPwsSr9xJSPyQ4LqtcTPULFR64KwTikGLe+A=
|
||||
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0 h1:mS47AX77OtFfKG4vtp+84kuGSFZHTyxtXIN269vChY0=
|
||||
go.opentelemetry.io/otel/exporters/stdout/stdouttrace v1.43.0/go.mod h1:PJnsC41lAGncJlPUniSwM81gc80GkgWJWr3cu2nKEtU=
|
||||
go.opentelemetry.io/otel/metric v1.43.0 h1:d7638QeInOnuwOONPp4JAOGfbCEpYb+K6DVWvdxGzgM=
|
||||
go.opentelemetry.io/otel/metric v1.43.0/go.mod h1:RDnPtIxvqlgO8GRW18W6Z/4P462ldprJtfxHxyKd2PY=
|
||||
go.opentelemetry.io/otel/sdk v1.43.0 h1:pi5mE86i5rTeLXqoF/hhiBtUNcrAGHLKQdhg4h4V9Dg=
|
||||
go.opentelemetry.io/otel/sdk v1.43.0/go.mod h1:P+IkVU3iWukmiit/Yf9AWvpyRDlUeBaRg6Y+C58QHzg=
|
||||
go.opentelemetry.io/otel/sdk/metric v1.43.0 h1:S88dyqXjJkuBNLeMcVPRFXpRw2fuwdvfCGLEo89fDkw=
|
||||
go.opentelemetry.io/otel/sdk/metric v1.43.0/go.mod h1:C/RJtwSEJ5hzTiUz5pXF1kILHStzb9zFlIEe85bhj6A=
|
||||
go.opentelemetry.io/otel/trace v1.43.0 h1:BkNrHpup+4k4w+ZZ86CZoHHEkohws8AY+WTX09nk+3A=
|
||||
go.opentelemetry.io/otel/trace v1.43.0/go.mod h1:/QJhyVBUUswCphDVxq+8mld+AvhXZLhe+8WVFxiFff0=
|
||||
go.opentelemetry.io/proto/otlp v1.10.0 h1:IQRWgT5srOCYfiWnpqUYz9CVmbO8bFmKcwYxpuCSL2g=
|
||||
go.opentelemetry.io/proto/otlp v1.10.0/go.mod h1:/CV4QoCR/S9yaPj8utp3lvQPoqMtxXdzn7ozvvozVqk=
|
||||
go.uber.org/atomic v1.3.2/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
|
||||
go.uber.org/atomic v1.4.0/go.mod h1:gD2HeocX3+yG+ygLZcrzQJaqmWj9AIm7n08wl/qW/PE=
|
||||
go.uber.org/atomic v1.5.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
|
||||
go.uber.org/atomic v1.6.0/go.mod h1:sABNBOSYdrvTF6hTgEIbc7YasKWGhgEQZyfxyTvoXHQ=
|
||||
go.uber.org/atomic v1.11.0 h1:ZvwS0R+56ePWxUNi+Atn9dWONBPp/AUETXlHW0DxSjE=
|
||||
go.uber.org/atomic v1.11.0/go.mod h1:LUxbIzbOniOlMKjJjyPfpl4v+PKK2cNJn91OQbhoJI0=
|
||||
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
|
||||
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
|
||||
go.uber.org/mock v0.6.0 h1:hyF9dfmbgIX5EfOdasqLsWD6xqpNZlXblLB/Dbnwv3Y=
|
||||
go.uber.org/mock v0.6.0/go.mod h1:KiVJ4BqZJaMj4svdfmHM0AUx4NJYO8ZNpPnZn1Z+BBU=
|
||||
go.uber.org/multierr v1.1.0/go.mod h1:wR5kodmAFQ0UK8QlbwjlSNy0Z68gJhDJUG5sjR94q/0=
|
||||
go.uber.org/multierr v1.3.0/go.mod h1:VgVr7evmIr6uPjLBxg28wmKNXyqE9akIJ5XnfpiKl+4=
|
||||
go.uber.org/multierr v1.5.0/go.mod h1:FeouvMocqHpRaaGuG9EjoKcStLC43Zu/fmqdUMPcKYU=
|
||||
go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0=
|
||||
go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y=
|
||||
go.uber.org/tools v0.0.0-20190618225709-2cfd321de3ee/go.mod h1:vJERXedbb3MVM5f9Ejo0C68/HhF8uaILCdgjnY+goOA=
|
||||
go.uber.org/zap v1.9.1/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
|
||||
go.uber.org/zap v1.10.0/go.mod h1:vwi/ZaCAaUcBkycHslxD9B2zi4UTXhF60s6SWpuDF0Q=
|
||||
go.uber.org/zap v1.13.0/go.mod h1:zwrFLgMcdUuIBviXEYEH1YKNaOBnKXsx2IPda5bBwHM=
|
||||
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
|
||||
golang.org/x/crypto v0.0.0-20190411191339-88737f569e3a/go.mod h1:WFFai1msRO1wXaEeE5yQxYXgSfI8pQAWXbQop6sCtWE=
|
||||
golang.org/x/crypto v0.0.0-20190510104115-cbcb75029529/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
|
||||
golang.org/x/crypto v0.0.0-20190820162420-60c769a6c586/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
|
||||
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
|
||||
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
|
||||
golang.org/x/crypto v0.0.0-20201203163018-be400aefbc4c/go.mod h1:jdWPYTVW3xRLrWPugEBEK3UY2ZEsg3UU495nc5E+M+I=
|
||||
golang.org/x/crypto v0.0.0-20210616213533-5ff15b29337e/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
|
||||
golang.org/x/crypto v0.0.0-20210711020723-a769d52b0f97/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
|
||||
golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
|
||||
golang.org/x/crypto v0.19.0/go.mod h1:Iy9bg/ha4yyC70EfRS8jz+B6ybOBKMaSxLj6P6oBDfU=
|
||||
golang.org/x/crypto v0.20.0/go.mod h1:Xwo95rrVNIoSMx9wa1JroENMToLWn3RNVrTBpLHgZPQ=
|
||||
golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI=
|
||||
golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q=
|
||||
golang.org/x/lint v0.0.0-20190930215403-16217165b5de/go.mod h1:6SW0HCj/g11FgYtHlgUYUwCkIfeOF89ocIRzGO/8vkc=
|
||||
golang.org/x/mod v0.0.0-20190513183733-4bf6d317e70e/go.mod h1:mXi4GBBbnImb6dmsKGUJ2LatrhH/nqhxcFungHvyanc=
|
||||
golang.org/x/mod v0.1.1-0.20191105210325-c90efee705ee/go.mod h1:QqPTAvyqsEbceGzBzNggFXnrqF1CaUcvgkdR5Ot7KZg=
|
||||
golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=
|
||||
golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=
|
||||
golang.org/x/mod v0.35.0 h1:Ww1D637e6Pg+Zb2KrWfHQUnH2dQRLBQyAtpr/haaJeM=
|
||||
golang.org/x/mod v0.35.0/go.mod h1:+GwiRhIInF8wPm+4AoT6L0FA1QWAad3OMdTRx4tFYlU=
|
||||
golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
|
||||
golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
|
||||
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
|
||||
golang.org/x/net v0.0.0-20190813141303-74dc4d7220e7/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
|
||||
golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=
|
||||
golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c=
|
||||
golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=
|
||||
golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=
|
||||
golang.org/x/net v0.21.0/go.mod h1:bIjVDfnllIU7BJ2DNgfnXvpSvtn8VRwhlsaeUTyUS44=
|
||||
golang.org/x/net v0.53.0 h1:d+qAbo5L0orcWAr0a9JweQpjXF19LMXJE8Ey7hwOdUA=
|
||||
golang.org/x/net v0.53.0/go.mod h1:JvMuJH7rrdiCfbeHoo3fCQU24Lf5JJwT9W3sJFulfgs=
|
||||
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
|
||||
golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
|
||||
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
|
||||
golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4=
|
||||
golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0=
|
||||
golang.org/x/sys v0.0.0-20180905080454-ebe1bf3edb33/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
|
||||
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
|
||||
golang.org/x/sys v0.0.0-20190222072716-a9d3bda3a223/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
|
||||
golang.org/x/sys v0.0.0-20190403152447-81d4e9dc473e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20190422165155-953cdadca894/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20190813064441-fde4db37ae7a/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20190916202348-b4ddaad3f8a3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20191026070338-33540a1f6037/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20200116001909-b77594299b42/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20201204225414-ed752295db88/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
|
||||
golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.0.0-20210616094352-59db8d763f22/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
|
||||
golang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=
|
||||
golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI=
|
||||
golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw=
|
||||
golang.org/x/term v0.0.0-20201117132131-f5c789dd3221/go.mod h1:Nr5EML6q2oocZ2LXRh80K7BxOlk5/8JxuGnuhpl+muw=
|
||||
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
|
||||
golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
|
||||
golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=
|
||||
golang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo=
|
||||
golang.org/x/term v0.17.0/go.mod h1:lLRBjIVuehSbZlaOtGMbcMncT+aqLLLmKrsjNrUguwk=
|
||||
golang.org/x/term v0.42.0 h1:UiKe+zDFmJobeJ5ggPwOshJIVt6/Ft0rcfrXZDLWAWY=
|
||||
golang.org/x/term v0.42.0/go.mod h1:Dq/D+snpsbazcBG5+F9Q1n2rXV8Ma+71xEjTRufARgY=
|
||||
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
|
||||
golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
|
||||
golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
|
||||
golang.org/x/text v0.3.4/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
|
||||
golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=
|
||||
golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=
|
||||
golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=
|
||||
golang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8=
|
||||
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
|
||||
golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg=
|
||||
golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164=
|
||||
golang.org/x/time v0.15.0 h1:bbrp8t3bGUeFOx08pvsMYRTCVSMk89u4tKbNOZbp88U=
|
||||
golang.org/x/time v0.15.0/go.mod h1:Y4YMaQmXwGQZoFaVFk4YpCt4FLQMYKZe9oeV/f4MSno=
|
||||
golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=
|
||||
golang.org/x/tools v0.0.0-20190311212946-11955173bddd/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=
|
||||
golang.org/x/tools v0.0.0-20190425163242-31fd60d6bfdc/go.mod h1:RgjU9mgBXZiqYHBnxXauZ1Gv1EHHAz9KjViQ78xBX0Q=
|
||||
golang.org/x/tools v0.0.0-20190621195816-6e04913cbbac/go.mod h1:/rFqwRUd4F7ZHNgwSSTFct+R/Kf4OFW1sUzUTQQTgfc=
|
||||
golang.org/x/tools v0.0.0-20190823170909-c4a336ef6a2f/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
|
||||
golang.org/x/tools v0.0.0-20191029041327-9cc4af7d6b2c/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
|
||||
golang.org/x/tools v0.0.0-20191029190741-b9c20aec41a5/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
|
||||
golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
|
||||
golang.org/x/tools v0.0.0-20200103221440-774c71fcf114/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28=
|
||||
golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc=
|
||||
golang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU=
|
||||
golang.org/x/xerrors v0.0.0-20190410155217-1f06c39b4373/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
golang.org/x/xerrors v0.0.0-20190513163551-3ee3066db522/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
|
||||
gonum.org/v1/gonum v0.17.0 h1:VbpOemQlsSMrYmn7T2OUvQ4dqxQXU+ouZFQsZOx50z4=
|
||||
gonum.org/v1/gonum v0.17.0/go.mod h1:El3tOrEuMpv2UdMrbNlKEh9vd86bmQ6vqIcDwxEOc1E=
|
||||
google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9 h1:VPWxll4HlMw1Vs/qXtN7BvhZqsS9cdAittCNvVENElA=
|
||||
google.golang.org/genproto/googleapis/api v0.0.0-20260401024825-9d38bb4040a9/go.mod h1:7QBABkRtR8z+TEnmXTqIqwJLlzrZKVfAUm7tY3yGv0M=
|
||||
google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529 h1:XF8+t6QQiS0o9ArVan/HW8Q7cycNPGsJf6GA2nXxYAg=
|
||||
google.golang.org/genproto/googleapis/rpc v0.0.0-20260420184626-e10c466a9529/go.mod h1:4Hqkh8ycfw05ld/3BWL7rJOSfebL2Q+DVDeRgYgxUU8=
|
||||
google.golang.org/grpc v1.80.0 h1:Xr6m2WmWZLETvUNvIUmeD5OAagMw3FiKmMlTdViWsHM=
|
||||
google.golang.org/grpc v1.80.0/go.mod h1:ho/dLnxwi3EDJA4Zghp7k2Ec1+c2jqup0bFkw07bwF4=
|
||||
google.golang.org/protobuf v1.36.11 h1:fV6ZwhNocDyBLK0dj+fg8ektcVegBBuEolpbTQyBNVE=
|
||||
google.golang.org/protobuf v1.36.11/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
|
||||
gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
|
||||
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
|
||||
gopkg.in/inconshreveable/log15.v2 v2.0.0-20180818164646-67afb5ed74ec/go.mod h1:aPpfJ7XW+gOuirDoZ8gHhLh3kZ1B08FtV2bbmy7Jv3s=
|
||||
gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
|
||||
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gotest.tools/v3 v3.5.2 h1:7koQfIKdy+I8UTetycgUqXWSDwpgv193Ka+qRsmBY8Q=
|
||||
gotest.tools/v3 v3.5.2/go.mod h1:LtdLGcnqToBH83WByAAi/wiwSFCArdFIUV/xxN4pcjA=
|
||||
honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt0JzvZhAg=
|
||||
modernc.org/libc v1.72.1 h1:db1xwJ6u1kE3KHTFTTbe2GCrczHPKzlURP0aDC4NGD0=
|
||||
modernc.org/libc v1.72.1/go.mod h1:HRMiC/PhPGLIPM7GzAFCbI+oSgE3dhZ8FWftmRrHVlY=
|
||||
modernc.org/mathutil v1.7.1 h1:GCZVGXdaN8gTqB1Mf/usp1Y/hSqgI2vAGGP4jZMCxOU=
|
||||
modernc.org/mathutil v1.7.1/go.mod h1:4p5IwJITfppl0G4sUEDtCr4DthTaT47/N3aT6MhfgJg=
|
||||
modernc.org/memory v1.11.0 h1:o4QC8aMQzmcwCK3t3Ux/ZHmwFPzE6hf2Y5LbkRs+hbI=
|
||||
modernc.org/memory v1.11.0/go.mod h1:/JP4VbVC+K5sU2wZi9bHoq2MAkCnrt2r98UGeSK7Mjw=
|
||||
modernc.org/sqlite v1.49.1 h1:dYGHTKcX1sJ+EQDnUzvz4TJ5GbuvhNJa8Fg6ElGx73U=
|
||||
modernc.org/sqlite v1.49.1/go.mod h1:m0w8xhwYUVY3H6pSDwc3gkJ/irZT/0YEXwBlhaxQEew=
|
||||
pgregory.net/rapid v1.2.0 h1:keKAYRcjm+e1F0oAuU5F5+YPAWcyxNNRK2wud503Gnk=
|
||||
pgregory.net/rapid v1.2.0/go.mod h1:PY5XlDGj0+V1FCq0o192FdRhpKHGTRIWBgqjDBTrq04=
|
||||
@@ -0,0 +1,236 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"context"
|
||||
"crypto/rand"
|
||||
"encoding/hex"
|
||||
"errors"
|
||||
"fmt"
|
||||
"os"
|
||||
"os/exec"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"strings"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
cerrdefs "github.com/containerd/errdefs"
|
||||
"github.com/docker/docker/api/types/network"
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
)
|
||||
|
||||
// Engine image tags used by the integration suite. `EngineImageRef` is
|
||||
// the image we actually build from `galaxy/game/Dockerfile`;
|
||||
// `PatchedEngineImageRef` is the same image content tagged at a higher
|
||||
// semver patch so the patch lifecycle test exercises the
|
||||
// `semver_patch_only` validation against a real image. Keeping both at
|
||||
// the same digest avoids a redundant build.
|
||||
const (
|
||||
EngineImageRef = "galaxy/game:1.0.0-rtm-it"
|
||||
PatchedEngineImageRef = "galaxy/game:1.0.1-rtm-it"
|
||||
|
||||
dockerNetworkPrefix = "rtmanager-it-"
|
||||
|
||||
dockerPingTimeout = 5 * time.Second
|
||||
dockerNetworkTimeout = 30 * time.Second
|
||||
imageBuildTimeout = 10 * time.Minute
|
||||
)
|
||||
|
||||
// DockerEnv carries the per-package Docker client plus the workspace
|
||||
// root used by image builds. The client is opened lazily on the first
|
||||
// EnsureDocker call and closed by ShutdownDocker at TestMain exit.
|
||||
type DockerEnv struct {
|
||||
client *dockerclient.Client
|
||||
workspaceRoot string
|
||||
}
|
||||
|
||||
// Client returns the harness-owned Docker SDK client. Tests use it
|
||||
// directly for "external actions" the harness does not wrap (e.g.,
|
||||
// removing a running container behind RTM's back in `health_test`).
|
||||
func (env *DockerEnv) Client() *dockerclient.Client { return env.client }
|
||||
|
||||
// WorkspaceRoot returns the absolute path of the galaxy/ workspace
|
||||
// root. It is exported so the runtime helper can resolve the host
|
||||
// game-state root relative to it if a test needs a deterministic
|
||||
// location, though the default places state under `t.ArtifactDir()`.
|
||||
func (env *DockerEnv) WorkspaceRoot() string { return env.workspaceRoot }
|
||||
|
||||
var (
|
||||
dockerOnce sync.Once
|
||||
dockerEnv *DockerEnv
|
||||
dockerErr error
|
||||
|
||||
imageOnce sync.Once
|
||||
imageErr error
|
||||
)
|
||||
|
||||
// EnsureDocker opens the shared Docker SDK client and verifies the
|
||||
// daemon is reachable. When the daemon is unavailable the helper calls
|
||||
// `t.Skip` so suites stay green on hosts without `/var/run/docker.sock`
|
||||
// or `DOCKER_HOST`.
|
||||
func EnsureDocker(t testing.TB) *DockerEnv {
|
||||
t.Helper()
|
||||
dockerOnce.Do(func() {
|
||||
dockerEnv, dockerErr = openDocker()
|
||||
})
|
||||
if dockerErr != nil {
|
||||
t.Skipf("rtmanager integration: docker daemon unavailable: %v", dockerErr)
|
||||
}
|
||||
return dockerEnv
|
||||
}
|
||||
|
||||
// EnsureEngineImage builds the `galaxy/game` engine image from the
|
||||
// workspace root once per package run via `sync.Once`, then tags the
|
||||
// resulting image at both `EngineImageRef` and `PatchedEngineImageRef`
|
||||
// so the patch lifecycle has a second semver-valid tag to point at.
|
||||
// Subsequent calls re-use the cached image. Any test that asks for the
|
||||
// engine image must invoke this helper first; it is intentionally
|
||||
// separate from `EnsureDocker` so suites that only need the daemon
|
||||
// (e.g., a future "Docker network missing" negative test) do not pay
|
||||
// the build cost.
|
||||
func EnsureEngineImage(t testing.TB) string {
|
||||
t.Helper()
|
||||
env := EnsureDocker(t)
|
||||
imageOnce.Do(func() {
|
||||
imageErr = buildAndTagEngineImage(env)
|
||||
})
|
||||
if imageErr != nil {
|
||||
t.Skipf("rtmanager integration: build galaxy/game image: %v", imageErr)
|
||||
}
|
||||
return EngineImageRef
|
||||
}
|
||||
|
||||
// EnsureNetwork creates a uniquely-named Docker bridge network for the
|
||||
// caller's test and registers cleanup. Each test gets its own network
|
||||
// so concurrent scenarios cannot collide on the per-game DNS hostname.
|
||||
func EnsureNetwork(t testing.TB) string {
|
||||
t.Helper()
|
||||
env := EnsureDocker(t)
|
||||
name := dockerNetworkPrefix + uniqueSuffix(t)
|
||||
|
||||
createCtx, cancel := context.WithTimeout(context.Background(), dockerNetworkTimeout)
|
||||
defer cancel()
|
||||
if _, err := env.client.NetworkCreate(createCtx, name, network.CreateOptions{Driver: "bridge"}); err != nil {
|
||||
t.Fatalf("rtmanager integration: create docker network %q: %v", name, err)
|
||||
}
|
||||
t.Cleanup(func() {
|
||||
removeCtx, removeCancel := context.WithTimeout(context.Background(), dockerNetworkTimeout)
|
||||
defer removeCancel()
|
||||
if err := env.client.NetworkRemove(removeCtx, name); err != nil && !cerrdefs.IsNotFound(err) {
|
||||
t.Logf("rtmanager integration: remove docker network %q: %v", name, err)
|
||||
}
|
||||
})
|
||||
return name
|
||||
}
|
||||
|
||||
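// Illustrative usage sketch (not an exported helper): a scenario that
// needs the daemon, the engine image, and an isolated network opens
// with the three Ensure helpers; NewEnv performs the same wiring for
// full Runtime Manager scenarios.
//
//	env := EnsureDocker(t)
//	imageRef := EnsureEngineImage(t)
//	networkName := EnsureNetwork(t)
//	_ = env.Client() // direct SDK access for "external actions"
//	_, _ = imageRef, networkName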
// ShutdownDocker closes the shared Docker SDK client. `TestMain`
|
||||
// invokes it after `m.Run`. The harness deliberately leaves the engine
|
||||
// image in the local Docker cache so the next package run benefits
|
||||
// from the layer cache; operators can `docker image rm` the
|
||||
// `*-rtm-it` tags by hand if a stale image gets in the way.
|
||||
func ShutdownDocker() {
|
||||
if dockerEnv == nil {
|
||||
return
|
||||
}
|
||||
if dockerEnv.client != nil {
|
||||
_ = dockerEnv.client.Close()
|
||||
}
|
||||
dockerEnv = nil
|
||||
}
|
||||
|
||||
// uniqueSuffix returns 8 hex characters of randomness suitable for a
|
||||
// per-test resource name. The same helper is used in
|
||||
// `internal/adapters/docker/smoke_test.go`; we duplicate it instead of
|
||||
// importing because `_test.go`-only helpers cannot be exported.
|
||||
func uniqueSuffix(t testing.TB) string {
|
||||
t.Helper()
|
||||
buf := make([]byte, 4)
|
||||
if _, err := rand.Read(buf); err != nil {
|
||||
t.Fatalf("rtmanager integration: read random suffix: %v", err)
|
||||
}
|
||||
return hex.EncodeToString(buf)
|
||||
}
|
||||
|
||||
func openDocker() (*DockerEnv, error) {
|
||||
if os.Getenv("DOCKER_HOST") == "" {
|
||||
if _, err := os.Stat("/var/run/docker.sock"); err != nil {
|
||||
return nil, fmt.Errorf("set DOCKER_HOST or expose /var/run/docker.sock: %w", err)
|
||||
}
|
||||
}
|
||||
|
||||
client, err := dockerclient.NewClientWithOpts(
|
||||
dockerclient.FromEnv,
|
||||
dockerclient.WithAPIVersionNegotiation(),
|
||||
)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("new docker client: %w", err)
|
||||
}
|
||||
|
||||
pingCtx, cancel := context.WithTimeout(context.Background(), dockerPingTimeout)
|
||||
defer cancel()
|
||||
if _, err := client.Ping(pingCtx); err != nil {
|
||||
_ = client.Close()
|
||||
return nil, fmt.Errorf("ping docker daemon: %w", err)
|
||||
}
|
||||
|
||||
root, err := workspaceRoot()
|
||||
if err != nil {
|
||||
_ = client.Close()
|
||||
return nil, fmt.Errorf("resolve workspace root: %w", err)
|
||||
}
|
||||
|
||||
return &DockerEnv{
|
||||
client: client,
|
||||
workspaceRoot: root,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// buildAndTagEngineImage invokes `docker build` against the workspace
|
||||
// root context to materialise the `galaxy/game` image, then tags the
|
||||
// resulting image at the patch tag. Shelling out to the CLI keeps the
|
||||
// implementation tiny — using the SDK would require streaming a tar
|
||||
// of the workspace root, which is heavy and duplicates what the CLI
|
||||
// already optimises. The workspace-root build context is required by
|
||||
// `galaxy/game` (see `galaxy/game/README.md` §Build).
|
||||
func buildAndTagEngineImage(env *DockerEnv) error {
|
||||
if env == nil {
|
||||
return errors.New("nil docker env")
|
||||
}
|
||||
ctx, cancel := context.WithTimeout(context.Background(), imageBuildTimeout)
|
||||
defer cancel()
|
||||
|
||||
dockerfilePath := filepath.Join("game", "Dockerfile")
|
||||
cmd := exec.CommandContext(ctx, "docker", "build",
|
||||
"-f", dockerfilePath,
|
||||
"-t", EngineImageRef,
|
||||
".",
|
||||
)
|
||||
cmd.Dir = env.workspaceRoot
|
||||
cmd.Env = append(os.Environ(), "DOCKER_BUILDKIT=1")
|
||||
output, err := cmd.CombinedOutput()
|
||||
if err != nil {
|
||||
return fmt.Errorf("docker build (-f %s) in %s: %w; output:\n%s",
|
||||
dockerfilePath, env.workspaceRoot, err, strings.TrimSpace(string(output)))
|
||||
}
|
||||
|
||||
if err := env.client.ImageTag(ctx, EngineImageRef, PatchedEngineImageRef); err != nil {
|
||||
return fmt.Errorf("tag %s as %s: %w", EngineImageRef, PatchedEngineImageRef, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
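// For reference, the exec above amounts to running the following from
// the workspace root with BuildKit enabled, followed by
// `docker tag galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`:
//
//	docker build -f game/Dockerfile -t galaxy/game:1.0.0-rtm-it .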
// workspaceRoot resolves the absolute path of the galaxy/ workspace
|
||||
// root by anchoring on this file's location. The harness lives at
|
||||
// `galaxy/rtmanager/integration/harness/docker.go`, so the workspace
|
||||
// root is three directories up. Mirrors the `cmd/jetgen` strategy.
|
||||
func workspaceRoot() (string, error) {
|
||||
_, file, _, ok := runtime.Caller(0)
|
||||
if !ok {
|
||||
return "", errors.New("resolve runtime caller for workspace root")
|
||||
}
|
||||
dir := filepath.Dir(file)
|
||||
// dir = .../galaxy/rtmanager/integration/harness
|
||||
root := filepath.Clean(filepath.Join(dir, "..", "..", ".."))
|
||||
return root, nil
|
||||
}
|
||||
@@ -0,0 +1,59 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"strings"
|
||||
"testing"
|
||||
)
|
||||
|
||||
// LobbyStub answers the single Lobby internal request the start
|
||||
// service performs ([`internal/adapters/lobbyclient`]). The start
|
||||
// service treats this response as ancillary diagnostics — the start
|
||||
// envelope already carries `image_ref` — so the stub returns a
|
||||
// deterministic 200 OK and lets the runtime ignore the payload.
|
||||
//
|
||||
// The stub exists mainly so the runtime configuration keeps treating
// the Lobby URL as required (the ancillary fetch cannot quietly
// regress into a no-op); the response body itself is unused by the
// integration assertions.
type LobbyStub struct {
|
||||
Server *httptest.Server
|
||||
}
|
||||
|
||||
// NewLobbyStub returns a started httptest.Server. The caller registers
|
||||
// `t.Cleanup(stub.Close)` themselves through the runtime helper so the
|
||||
// stub follows the same lifecycle as the rest of the per-test wiring.
|
||||
func NewLobbyStub(t testing.TB) *LobbyStub {
|
||||
t.Helper()
|
||||
mux := http.NewServeMux()
|
||||
mux.HandleFunc("GET /api/v1/internal/games/{game_id}", func(w http.ResponseWriter, r *http.Request) {
|
||||
gameID := strings.TrimSpace(r.PathValue("game_id"))
|
||||
if gameID == "" {
|
||||
writeStubError(w, http.StatusBadRequest, "invalid_request", "game_id is required")
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_ = json.NewEncoder(w).Encode(map[string]string{
|
||||
"game_id": gameID,
|
||||
"status": "running",
|
||||
"target_engine_version": "1.0.0",
|
||||
})
|
||||
})
|
||||
server := httptest.NewServer(mux)
|
||||
t.Cleanup(server.Close)
|
||||
return &LobbyStub{Server: server}
|
||||
}
|
||||
|
||||
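// Illustrative wiring sketch (the runtime helper in this package does
// this for you; `cfg` here stands in for the resolved config value):
// the stub's base URL is what ends up in the Lobby client config.
//
//	stub := NewLobbyStub(t)
//	cfg.Lobby = config.LobbyConfig{BaseURL: stub.URL(), Timeout: 2 * time.Second}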
// URL returns the base URL of the running stub.
|
||||
func (stub *LobbyStub) URL() string { return stub.Server.URL }
|
||||
|
||||
func writeStubError(w http.ResponseWriter, status int, code, message string) {
|
||||
w.Header().Set("Content-Type", "application/json; charset=utf-8")
|
||||
w.WriteHeader(status)
|
||||
_ = json.NewEncoder(w).Encode(map[string]any{
|
||||
"error": map[string]string{"code": code, "message": message},
|
||||
})
|
||||
}
|
||||
@@ -0,0 +1,224 @@
// Package harness exposes the testcontainers / Docker / image-build
// scaffolding shared by the Runtime Manager service-local integration
// suite under [`galaxy/rtmanager/integration`](..).
//
// Only `_test.go` files (and the harness itself) reference this
// package; production code paths in `cmd/rtmanager` never import it.
// The package therefore stays out of the production binary's import
// graph, identical to the in-package `pgtest` and `integration/internal/harness`
// patterns it mirrors.
package harness

import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"net/url"
|
||||
"os"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/migrations"
|
||||
|
||||
testcontainers "github.com/testcontainers/testcontainers-go"
|
||||
tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
|
||||
"github.com/testcontainers/testcontainers-go/wait"
|
||||
)
|
||||
|
||||
const (
|
||||
pgImage = "postgres:16-alpine"
|
||||
pgSuperUser = "galaxy"
|
||||
pgSuperPassword = "galaxy"
|
||||
pgSuperDatabase = "galaxy_rtmanager_it"
|
||||
pgServiceRole = "rtmanagerservice"
|
||||
pgServicePassword = "rtmanagerservice"
|
||||
pgServiceSchema = "rtmanager"
|
||||
pgStartupTimeout = 90 * time.Second
|
||||
|
||||
// pgOperationTimeout bounds the per-statement deadline used by every
|
||||
// pool the harness opens. Short enough to surface a runaway
|
||||
// integration test promptly, long enough to absorb laptop-grade I/O.
|
||||
pgOperationTimeout = 10 * time.Second
|
||||
)
|
||||
|
||||
// PostgresEnv carries the per-package PostgreSQL fixture. The container
|
||||
// is started lazily on the first EnsurePostgres call and torn down by
|
||||
// ShutdownPostgres at TestMain exit.
|
||||
type PostgresEnv struct {
|
||||
container *tcpostgres.PostgresContainer
|
||||
pool *sql.DB
|
||||
scopedDSN string
|
||||
}
|
||||
|
||||
// Pool returns the harness-owned `*sql.DB` scoped to the rtmanager
|
||||
// schema. Tests use it to read durable state directly through the
|
||||
// existing store adapters.
|
||||
func (env *PostgresEnv) Pool() *sql.DB { return env.pool }
|
||||
|
||||
// DSN returns the rtmanager-role-scoped DSN suitable for
|
||||
// `RTMANAGER_POSTGRES_PRIMARY_DSN`. Both this DSN and Pool address the
|
||||
// same database; the pool is reused across tests, while the runtime
|
||||
// under test opens its own pool through this DSN.
|
||||
func (env *PostgresEnv) DSN() string { return env.scopedDSN }
|
||||
|
||||
var (
|
||||
pgOnce sync.Once
|
||||
pgEnv *PostgresEnv
|
||||
pgErr error
|
||||
)
|
||||
|
||||
// EnsurePostgres starts the per-package PostgreSQL container on first
|
||||
// invocation and applies the embedded goose migrations. Subsequent
|
||||
// invocations reuse the same container. When Docker is unavailable the
|
||||
// helper calls `t.Skip` so the suite stays green on hosts without a
|
||||
// daemon (mirrors the contract from `internal/adapters/postgres/internal/pgtest`).
|
||||
func EnsurePostgres(t testing.TB) *PostgresEnv {
|
||||
t.Helper()
|
||||
pgOnce.Do(func() {
|
||||
pgEnv, pgErr = startPostgres()
|
||||
})
|
||||
if pgErr != nil {
|
||||
t.Skipf("rtmanager integration: postgres container start failed (Docker unavailable?): %v", pgErr)
|
||||
}
|
||||
return pgEnv
|
||||
}
|
||||
|
||||
// TruncatePostgres wipes every Runtime Manager table inside the shared
|
||||
// pool, leaving the schema and indexes intact. Tests call this from
|
||||
// their setup so each scenario starts on an empty state.
|
||||
func TruncatePostgres(t testing.TB) {
|
||||
t.Helper()
|
||||
env := EnsurePostgres(t)
|
||||
const stmt = `TRUNCATE TABLE runtime_records, operation_log, health_snapshots RESTART IDENTITY CASCADE`
|
||||
if _, err := env.pool.ExecContext(context.Background(), stmt); err != nil {
|
||||
t.Fatalf("truncate rtmanager tables: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
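// Illustrative per-test setup sketch: scenarios typically reset durable
// state before driving the runtime (NewEnv in this package does the
// same internally).
//
//	pg := EnsurePostgres(t)
//	TruncatePostgres(t)
//	_ = pg.Pool() // direct store access for assertions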
// ShutdownPostgres terminates the shared container and closes the pool.
|
||||
// `TestMain` invokes it after `m.Run` so the container is released even
|
||||
// if individual tests panic.
|
||||
func ShutdownPostgres() {
|
||||
if pgEnv == nil {
|
||||
return
|
||||
}
|
||||
if pgEnv.pool != nil {
|
||||
_ = pgEnv.pool.Close()
|
||||
}
|
||||
if pgEnv.container != nil {
|
||||
_ = testcontainers.TerminateContainer(pgEnv.container)
|
||||
}
|
||||
pgEnv = nil
|
||||
}
|
||||
|
||||
// RunMain is a convenience helper for the integration package
|
||||
// `TestMain`: it runs the suite, captures the exit code, tears every
|
||||
// shared container down, and exits. Wiring it through one helper keeps
|
||||
// `TestMain` to two lines and centralises ordering.
|
||||
func RunMain(m *testing.M) {
|
||||
code := m.Run()
|
||||
ShutdownRedis()
|
||||
ShutdownPostgres()
|
||||
ShutdownDocker()
|
||||
os.Exit(code)
|
||||
}
|
||||
|
||||
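// Illustrative sketch of the TestMain wiring RunMain is written for
// (the actual integration package may phrase it differently):
//
//	func TestMain(m *testing.M) {
//		RunMain(m)
//	}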
func startPostgres() (*PostgresEnv, error) {
|
||||
ctx := context.Background()
|
||||
container, err := tcpostgres.Run(ctx, pgImage,
|
||||
tcpostgres.WithDatabase(pgSuperDatabase),
|
||||
tcpostgres.WithUsername(pgSuperUser),
|
||||
tcpostgres.WithPassword(pgSuperPassword),
|
||||
testcontainers.WithWaitStrategy(
|
||||
wait.ForLog("database system is ready to accept connections").
|
||||
WithOccurrence(2).
|
||||
WithStartupTimeout(pgStartupTimeout),
|
||||
),
|
||||
)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
|
||||
if err != nil {
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
scopedDSN, err := scopedDSNForRole(baseDSN)
|
||||
if err != nil {
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
cfg := postgres.DefaultConfig()
|
||||
cfg.PrimaryDSN = scopedDSN
|
||||
cfg.OperationTimeout = pgOperationTimeout
|
||||
pool, err := postgres.OpenPrimary(ctx, cfg)
|
||||
if err != nil {
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
if err := postgres.Ping(ctx, pool, pgOperationTimeout); err != nil {
|
||||
_ = pool.Close()
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
if err := postgres.RunMigrations(ctx, pool, migrations.FS(), "."); err != nil {
|
||||
_ = pool.Close()
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
return &PostgresEnv{
|
||||
container: container,
|
||||
pool: pool,
|
||||
scopedDSN: scopedDSN,
|
||||
}, nil
|
||||
}
|
||||
|
||||
func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
|
||||
cfg := postgres.DefaultConfig()
|
||||
cfg.PrimaryDSN = baseDSN
|
||||
cfg.OperationTimeout = pgOperationTimeout
|
||||
db, err := postgres.OpenPrimary(ctx, cfg)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer func() { _ = db.Close() }()
|
||||
|
||||
statements := []string{
|
||||
`DO $$ BEGIN
|
||||
IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'rtmanagerservice') THEN
|
||||
CREATE ROLE rtmanagerservice LOGIN PASSWORD 'rtmanagerservice';
|
||||
END IF;
|
||||
END $$;`,
|
||||
`CREATE SCHEMA IF NOT EXISTS rtmanager AUTHORIZATION rtmanagerservice;`,
|
||||
`GRANT USAGE ON SCHEMA rtmanager TO rtmanagerservice;`,
|
||||
}
|
||||
for _, statement := range statements {
|
||||
if _, err := db.ExecContext(ctx, statement); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func scopedDSNForRole(baseDSN string) (string, error) {
|
||||
parsed, err := url.Parse(baseDSN)
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
values := url.Values{}
|
||||
values.Set("search_path", pgServiceSchema)
|
||||
values.Set("sslmode", "disable")
|
||||
scoped := url.URL{
|
||||
Scheme: parsed.Scheme,
|
||||
User: url.UserPassword(pgServiceRole, pgServicePassword),
|
||||
Host: parsed.Host,
|
||||
Path: parsed.Path,
|
||||
RawQuery: values.Encode(),
|
||||
}
|
||||
return scoped.String(), nil
|
||||
}
|
||||
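// For illustration, the scoped DSN produced above has this shape; the
// host and port come from the container at runtime, so the values here
// are placeholders:
//
//	postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:54321/galaxy_rtmanager_it?search_path=rtmanager&sslmode=disable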
@@ -0,0 +1,102 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"context"
|
||||
"sync"
|
||||
"testing"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
testcontainers "github.com/testcontainers/testcontainers-go"
|
||||
rediscontainer "github.com/testcontainers/testcontainers-go/modules/redis"
|
||||
)
|
||||
|
||||
const redisImage = "redis:7"
|
||||
|
||||
// RedisEnv carries the per-package Redis fixture. The container is
|
||||
// started lazily on the first EnsureRedis call and torn down by
|
||||
// ShutdownRedis at TestMain exit. Both stream consumers and the
|
||||
// per-game lease store hit this real Redis (miniredis would suffice
|
||||
// for streams alone, but the lease semantics and eviction-by-TTL we
|
||||
// rely on in `health_test` are easier to verify against a real
|
||||
// daemon).
|
||||
type RedisEnv struct {
|
||||
container *rediscontainer.RedisContainer
|
||||
addr string
|
||||
}
|
||||
|
||||
// Addr returns the externally reachable host:port of the Redis
|
||||
// container. Both the runtime under test and the harness-owned client
|
||||
// connect through the same endpoint.
|
||||
func (env *RedisEnv) Addr() string { return env.addr }
|
||||
|
||||
// NewClient opens a fresh `*redis.Client` against the harness Redis.
|
||||
// Tests close their client through `t.Cleanup`; the harness keeps no
|
||||
// shared client to avoid cross-test connection-pool surprises.
|
||||
func (env *RedisEnv) NewClient(t testing.TB) *redis.Client {
|
||||
t.Helper()
|
||||
client := redis.NewClient(&redis.Options{Addr: env.addr})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
return client
|
||||
}
|
||||
|
||||
var (
|
||||
redisOnce sync.Once
|
||||
redisEnv *RedisEnv
|
||||
redisErr error
|
||||
)
|
||||
|
||||
// EnsureRedis starts the per-package Redis container on first
|
||||
// invocation and returns it. When Docker is unavailable the helper
|
||||
// calls `t.Skip` so the suite stays green on hosts without a daemon.
|
||||
func EnsureRedis(t testing.TB) *RedisEnv {
|
||||
t.Helper()
|
||||
redisOnce.Do(func() {
|
||||
redisEnv, redisErr = startRedis()
|
||||
})
|
||||
if redisErr != nil {
|
||||
t.Skipf("rtmanager integration: redis container start failed (Docker unavailable?): %v", redisErr)
|
||||
}
|
||||
return redisEnv
|
||||
}
|
||||
|
||||
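// Illustrative usage sketch: a test that wants to inspect streams
// directly grabs its own client; cleanup is registered by NewClient.
//
//	env := EnsureRedis(t)
//	client := env.NewClient(t)
//	_ = client // e.g. XRange assertions against the runtime:* streams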
// FlushRedis drops every key on the harness Redis. Tests call it from
|
||||
// their setup so streams, offset records, and leases from previous
|
||||
// scenarios do not leak.
|
||||
func FlushRedis(t testing.TB) {
|
||||
t.Helper()
|
||||
env := EnsureRedis(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: env.addr})
|
||||
defer func() { _ = client.Close() }()
|
||||
if _, err := client.FlushAll(context.Background()).Result(); err != nil {
|
||||
t.Fatalf("flush rtmanager redis: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// ShutdownRedis terminates the shared container. `TestMain` invokes it
|
||||
// after `m.Run`.
|
||||
func ShutdownRedis() {
|
||||
if redisEnv == nil {
|
||||
return
|
||||
}
|
||||
if redisEnv.container != nil {
|
||||
_ = testcontainers.TerminateContainer(redisEnv.container)
|
||||
}
|
||||
redisEnv = nil
|
||||
}
|
||||
|
||||
func startRedis() (*RedisEnv, error) {
|
||||
ctx := context.Background()
|
||||
container, err := rediscontainer.Run(ctx, redisImage)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
addr, err := container.Endpoint(ctx, "")
|
||||
if err != nil {
|
||||
_ = testcontainers.TerminateContainer(container)
|
||||
return nil, err
|
||||
}
|
||||
return &RedisEnv{
|
||||
container: container,
|
||||
addr: addr,
|
||||
}, nil
|
||||
}
|
||||
@@ -0,0 +1,195 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/url"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
// defaultHTTPClient backs the runtime-readiness poll and the REST
|
||||
// helpers below. A short timeout is enough — every internal endpoint
|
||||
// runs against an in-process listener.
|
||||
var defaultHTTPClient = &http.Client{Timeout: 5 * time.Second}
|
||||
|
||||
// newRequest is a thin shim over `http.NewRequestWithContext` so the
|
||||
// readiness poll and the REST client share one constructor.
|
||||
func newRequest(ctx context.Context, method, fullURL string, body io.Reader) (*http.Request, error) {
|
||||
req, err := http.NewRequestWithContext(ctx, method, fullURL, body)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
if body != nil {
|
||||
req.Header.Set("Content-Type", "application/json; charset=utf-8")
|
||||
}
|
||||
req.Header.Set("Accept", "application/json")
|
||||
req.Header.Set("X-Galaxy-Caller", "admin")
|
||||
return req, nil
|
||||
}
|
||||
|
||||
// REST is a tiny client for the trusted internal HTTP surface RTM
|
||||
// exposes to Game Master and Admin Service. It always identifies the
|
||||
// caller as `admin` (the operation_log records `admin_rest`); tests
|
||||
// that need GM semantics should add an option later. v1 keeps the
|
||||
// helper minimal because the integration scenarios only need
|
||||
// admin-driven flows.
|
||||
type REST struct {
|
||||
baseURL string
|
||||
httpc *http.Client
|
||||
}
|
||||
|
||||
// NewREST builds a REST client targeting env.InternalAddr.
|
||||
func NewREST(env *Env) *REST {
|
||||
return &REST{
|
||||
baseURL: "http://" + env.InternalAddr,
|
||||
httpc: defaultHTTPClient,
|
||||
}
|
||||
}
|
||||
|
||||
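// Illustrative usage sketch (gameID is whatever the test chose): drive
// an admin-style start and read the record back.
//
//	rest := NewREST(env)
//	record, status := rest.StartRuntime(t, gameID, env.EngineImageRef)
//	if status != http.StatusOK {
//		// assert the error payload instead
//	}
//	_ = record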
// Get issues GET path and returns the response body and status code.
|
||||
func (r *REST) Get(t testing.TB, path string) ([]byte, int) {
|
||||
t.Helper()
|
||||
return r.do(t, http.MethodGet, path, nil)
|
||||
}
|
||||
|
||||
// Post issues POST path with body (a Go value JSON-marshaled).
|
||||
func (r *REST) Post(t testing.TB, path string, body any) ([]byte, int) {
|
||||
t.Helper()
|
||||
return r.do(t, http.MethodPost, path, body)
|
||||
}
|
||||
|
||||
// Delete issues DELETE path with no body.
|
||||
func (r *REST) Delete(t testing.TB, path string) ([]byte, int) {
|
||||
t.Helper()
|
||||
return r.do(t, http.MethodDelete, path, nil)
|
||||
}
|
||||
|
||||
// GetRuntime fetches a runtime record by game id and returns the
// decoded payload and the status code.
func (r *REST) GetRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Get(t, fmt.Sprintf("/api/v1/internal/runtimes/%s", url.PathEscape(gameID)))
|
||||
var resp RuntimeRecordResponse
|
||||
if status == http.StatusOK {
|
||||
if err := json.Unmarshal(body, &resp); err != nil {
|
||||
t.Fatalf("decode get-runtime response: %v; body=%s", err, string(body))
|
||||
}
|
||||
}
|
||||
return resp, status
|
||||
}
|
||||
|
||||
// StartRuntime invokes the start endpoint with imageRef.
|
||||
func (r *REST) StartRuntime(t testing.TB, gameID, imageRef string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Post(t,
|
||||
fmt.Sprintf("/api/v1/internal/runtimes/%s/start", url.PathEscape(gameID)),
|
||||
map[string]string{"image_ref": imageRef},
|
||||
)
|
||||
return decodeRecord(t, body, status, "start")
|
||||
}
|
||||
|
||||
// StopRuntime invokes the stop endpoint with reason.
|
||||
func (r *REST) StopRuntime(t testing.TB, gameID, reason string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Post(t,
|
||||
fmt.Sprintf("/api/v1/internal/runtimes/%s/stop", url.PathEscape(gameID)),
|
||||
map[string]string{"reason": reason},
|
||||
)
|
||||
return decodeRecord(t, body, status, "stop")
|
||||
}
|
||||
|
||||
// RestartRuntime invokes the restart endpoint.
|
||||
func (r *REST) RestartRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Post(t,
|
||||
fmt.Sprintf("/api/v1/internal/runtimes/%s/restart", url.PathEscape(gameID)),
|
||||
struct{}{},
|
||||
)
|
||||
return decodeRecord(t, body, status, "restart")
|
||||
}
|
||||
|
||||
// PatchRuntime invokes the patch endpoint with imageRef.
|
||||
func (r *REST) PatchRuntime(t testing.TB, gameID, imageRef string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Post(t,
|
||||
fmt.Sprintf("/api/v1/internal/runtimes/%s/patch", url.PathEscape(gameID)),
|
||||
map[string]string{"image_ref": imageRef},
|
||||
)
|
||||
return decodeRecord(t, body, status, "patch")
|
||||
}
|
||||
|
||||
// CleanupRuntime invokes the DELETE container endpoint.
|
||||
func (r *REST) CleanupRuntime(t testing.TB, gameID string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
body, status := r.Delete(t,
|
||||
fmt.Sprintf("/api/v1/internal/runtimes/%s/container", url.PathEscape(gameID)),
|
||||
)
|
||||
return decodeRecord(t, body, status, "cleanup")
|
||||
}
|
||||
|
||||
// RuntimeRecordResponse mirrors the OpenAPI RuntimeRecord schema. Only
|
||||
// the fields integration scenarios assert against live here; the
|
||||
// listener encodes everything else.
|
||||
type RuntimeRecordResponse struct {
|
||||
GameID string `json:"game_id"`
|
||||
Status string `json:"status"`
|
||||
CurrentContainerID *string `json:"current_container_id"`
|
||||
CurrentImageRef *string `json:"current_image_ref"`
|
||||
EngineEndpoint *string `json:"engine_endpoint"`
|
||||
StatePath string `json:"state_path"`
|
||||
DockerNetwork string `json:"docker_network"`
|
||||
StartedAt *string `json:"started_at"`
|
||||
StoppedAt *string `json:"stopped_at"`
|
||||
RemovedAt *string `json:"removed_at"`
|
||||
LastOpAt string `json:"last_op_at"`
|
||||
CreatedAt string `json:"created_at"`
|
||||
}
|
||||
|
||||
func (r *REST) do(t testing.TB, method, path string, body any) ([]byte, int) {
|
||||
t.Helper()
|
||||
var reader io.Reader
|
||||
if body != nil {
|
||||
raw, err := json.Marshal(body)
|
||||
if err != nil {
|
||||
t.Fatalf("marshal request body: %v", err)
|
||||
}
|
||||
reader = bytes.NewReader(raw)
|
||||
}
|
||||
req, err := newRequest(context.Background(), method, r.baseURL+path, reader)
|
||||
if err != nil {
|
||||
t.Fatalf("build %s %s request: %v", method, path, err)
|
||||
}
|
||||
resp, err := r.httpc.Do(req)
|
||||
if err != nil {
|
||||
t.Fatalf("execute %s %s: %v", method, path, err)
|
||||
}
|
||||
defer resp.Body.Close()
|
||||
raw, err := io.ReadAll(resp.Body)
|
||||
if err != nil {
|
||||
t.Fatalf("read %s %s response: %v", method, path, err)
|
||||
}
|
||||
return raw, resp.StatusCode
|
||||
}
|
||||
|
||||
func decodeRecord(t testing.TB, body []byte, status int, op string) (RuntimeRecordResponse, int) {
|
||||
t.Helper()
|
||||
if status != http.StatusOK {
|
||||
return RuntimeRecordResponse{}, status
|
||||
}
|
||||
var resp RuntimeRecordResponse
|
||||
if err := json.Unmarshal(body, &resp); err != nil {
|
||||
t.Fatalf("decode %s response: %v; body=%s", op, err, string(body))
|
||||
}
|
||||
return resp, status
|
||||
}
|
||||
|
||||
// PathEscape is a re-export so test files can call it without
|
||||
// importing `net/url` directly. Keeps the test source focused on
|
||||
// scenarios.
|
||||
func PathEscape(value string) string { return url.PathEscape(strings.TrimSpace(value)) }
|
||||
@@ -0,0 +1,398 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"io"
|
||||
"log/slog"
|
||||
"net/url"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/redisconn"
|
||||
"galaxy/rtmanager/internal/app"
|
||||
"galaxy/rtmanager/internal/config"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// Default stream key shapes used by the integration suite. They match
|
||||
// the production defaults so the wire shapes asserted in `streams.go`
|
||||
// are identical to what Game Lobby sees in `integration/lobbyrtm`.
|
||||
const (
|
||||
StartJobsStream = "runtime:start_jobs"
|
||||
StopJobsStream = "runtime:stop_jobs"
|
||||
JobResultsStream = "runtime:job_results"
|
||||
HealthEventsStream = "runtime:health_events"
|
||||
NotificationIntentsKey = "notification:intents"
|
||||
gameStateRootSubdir = "game-state"
|
||||
listenAddr = "127.0.0.1:0"
|
||||
listenerWaitTimeout = 10 * time.Second
|
||||
readyzPollInterval = 25 * time.Millisecond
|
||||
cleanupShutdownTimeout = 30 * time.Second
|
||||
)
|
||||
|
||||
// Env carries everything one integration scenario needs to drive the
|
||||
// Runtime Manager process. The struct is value-typed so tests reach
|
||||
// fields without intermediate getters.
|
||||
type Env struct {
|
||||
// Cfg is the resolved Runtime Manager configuration handed to
|
||||
// `app.NewRuntime`. Tests inspect it for stream key shapes,
|
||||
// container defaults, and timeout knobs.
|
||||
Cfg config.Config
|
||||
|
||||
// Runtime is the in-process Runtime Manager exposed for tests that
|
||||
// need to peek at internal state (`runtime.InternalServer().Addr()`).
|
||||
Runtime *app.Runtime
|
||||
|
||||
// Postgres holds the per-package PostgreSQL fixture.
|
||||
Postgres *PostgresEnv
|
||||
|
||||
// Redis holds the per-package Redis fixture plus a fresh client the
|
||||
// test owns.
|
||||
Redis *RedisEnv
|
||||
RedisClient *redis.Client
|
||||
|
||||
// Docker holds the per-package Docker daemon handle.
|
||||
Docker *DockerEnv
|
||||
|
||||
// Lobby is the per-test stub HTTP server.
|
||||
Lobby *LobbyStub
|
||||
|
||||
// Network is the unique Docker network name created for this test.
|
||||
Network string
|
||||
|
||||
// EngineImageRef and PatchedImageRef are the two semver-compatible
|
||||
// engine image tags the harness builds once per package. Patch
|
||||
// scenarios point at the second tag.
|
||||
EngineImageRef string
|
||||
PatchedImageRef string
|
||||
|
||||
// GameStateRoot is the host filesystem path RTM writes per-game
|
||||
// state directories under. It lives inside `t.ArtifactDir()` so
|
||||
// failed scenarios leave the engine state behind for inspection.
|
||||
GameStateRoot string
|
||||
|
||||
// InternalAddr is the bound address of RTM's internal HTTP listener
|
||||
// (resolved after Run binds the port).
|
||||
InternalAddr string
|
||||
}
|
||||
|
||||
// EnvOptions carry per-test overrides to the harness defaults. Empty
|
||||
// fields fall back to the defaults declared at the top of this file.
|
||||
type EnvOptions struct {
|
||||
// ReconcileInterval overrides the periodic reconciler interval.
|
||||
// Default 500ms (so reconcile drift is observable inside a single
|
||||
// scenario timeout).
|
||||
ReconcileInterval time.Duration
|
||||
|
||||
// CleanupInterval overrides the container-cleanup interval.
|
||||
CleanupInterval time.Duration
|
||||
|
||||
// InspectInterval overrides the Docker inspect worker interval.
|
||||
InspectInterval time.Duration
|
||||
|
||||
// ProbeInterval / ProbeTimeout / ProbeFailuresThreshold override
|
||||
// the active engine probe knobs.
|
||||
ProbeInterval time.Duration
|
||||
ProbeTimeout time.Duration
|
||||
ProbeFailuresThreshold int
|
||||
|
||||
// GameLeaseTTL overrides the per-game Redis lease TTL.
|
||||
GameLeaseTTL time.Duration
|
||||
|
||||
// StreamBlockTimeout overrides the consumer XREAD block window.
|
||||
StreamBlockTimeout time.Duration
|
||||
|
||||
// LogToStderr makes the harness write the runtime's structured
|
||||
// logs to stderr; the default discards them so test output stays
|
||||
// focused on assertions.
|
||||
LogToStderr bool
|
||||
}
|
||||
|
||||
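// Illustrative override sketch: a health-focused scenario can tighten
// the probe knobs while leaving everything else at the defaults.
//
//	env := NewEnv(t, EnvOptions{
//		ProbeInterval:          100 * time.Millisecond,
//		ProbeFailuresThreshold: 1,
//	})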
// NewEnv stands up a fresh Runtime Manager process for the calling
|
||||
// test. It blocks until the internal HTTP listener is bound; tests can
|
||||
// issue REST and stream requests immediately after the call returns.
|
||||
//
|
||||
// `t.Cleanup` runs in reverse order: stop the runtime, close the
|
||||
// runtime, close the per-test redis client, remove the docker network,
|
||||
// terminate the lobby stub. Containers RTM created during the test are
|
||||
// removed by the test's own cleanup paths or by the integration
|
||||
// `health_test` external-action helpers.
|
||||
func NewEnv(t *testing.T, opts EnvOptions) *Env {
|
||||
t.Helper()
|
||||
|
||||
pg := EnsurePostgres(t)
|
||||
rd := EnsureRedis(t)
|
||||
dk := EnsureDocker(t)
|
||||
imageRef := EnsureEngineImage(t)
|
||||
TruncatePostgres(t)
|
||||
FlushRedis(t)
|
||||
network := EnsureNetwork(t)
|
||||
lobby := NewLobbyStub(t)
|
||||
stateRoot := stateRoot(t)
|
||||
|
||||
cfg := buildConfig(buildConfigInput{
|
||||
PostgresDSN: pg.DSN(),
|
||||
RedisAddr: rd.Addr(),
|
||||
DockerHost: resolveDockerHost(),
|
||||
Network: network,
|
||||
LobbyURL: lobby.URL(),
|
||||
GameStateRoot: stateRoot,
|
||||
ReconcileInterval: pickDuration(opts.ReconcileInterval, 500*time.Millisecond),
|
||||
CleanupInterval: pickDuration(opts.CleanupInterval, 500*time.Millisecond),
|
||||
InspectInterval: pickDuration(opts.InspectInterval, 500*time.Millisecond),
|
||||
ProbeInterval: pickDuration(opts.ProbeInterval, 500*time.Millisecond),
|
||||
ProbeTimeout: pickDuration(opts.ProbeTimeout, time.Second),
|
||||
ProbeFailures: pickInt(opts.ProbeFailuresThreshold, 2),
|
||||
GameLeaseTTL: pickDuration(opts.GameLeaseTTL, 5*time.Second),
|
||||
StreamBlockTimeout: pickDuration(opts.StreamBlockTimeout, 200*time.Millisecond),
|
||||
})
|
||||
|
||||
logger := newLogger(opts.LogToStderr)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
|
||||
runtime, err := app.NewRuntime(ctx, cfg, logger)
|
||||
if err != nil {
|
||||
cancel()
|
||||
t.Fatalf("rtmanager integration: new runtime: %v", err)
|
||||
}
|
||||
|
||||
runDone := make(chan error, 1)
|
||||
go func() {
|
||||
runDone <- runtime.Run(ctx)
|
||||
}()
|
||||
|
||||
internalAddr := waitForListener(t, runtime)
|
||||
waitForReady(t, runtime, listenerWaitTimeout)
|
||||
|
||||
var cleanupOnce sync.Once
|
||||
t.Cleanup(func() {
|
||||
cleanupOnce.Do(func() {
|
||||
cancel()
|
||||
waitCtx, waitCancel := context.WithTimeout(context.Background(), cleanupShutdownTimeout)
|
||||
defer waitCancel()
|
||||
select {
|
||||
case err := <-runDone:
|
||||
if err != nil && !isCleanShutdownErr(err) {
|
||||
t.Logf("rtmanager integration: runtime.Run returned: %v", err)
|
||||
}
|
||||
case <-waitCtx.Done():
|
||||
t.Logf("rtmanager integration: runtime did not stop within %s", cleanupShutdownTimeout)
|
||||
}
|
||||
if err := runtime.Close(); err != nil {
|
||||
t.Logf("rtmanager integration: runtime.Close: %v", err)
|
||||
}
|
||||
})
|
||||
})
|
||||
|
||||
return &Env{
|
||||
Cfg: cfg,
|
||||
Runtime: runtime,
|
||||
Postgres: pg,
|
||||
Redis: rd,
|
||||
RedisClient: rd.NewClient(t),
|
||||
Docker: dk,
|
||||
Lobby: lobby,
|
||||
Network: network,
|
||||
EngineImageRef: imageRef,
|
||||
PatchedImageRef: PatchedEngineImageRef,
|
||||
GameStateRoot: stateRoot,
|
||||
InternalAddr: internalAddr,
|
||||
}
|
||||
}
|
||||
|
||||
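// Illustrative end-to-end sketch of a scenario built on NewEnv (the
// test name is hypothetical; the predicate below only keys on the game
// id, since the concrete outcome values live in ports.JobResult):
//
//	func TestStartViaStream(t *testing.T) {
//		env := NewEnv(t, EnvOptions{})
//		gameID := IDFromTestName(t)
//		XAddStartJob(t, env, gameID, env.EngineImageRef)
//		result := WaitForJobResult(t, env, func(e JobResultEntry) bool {
//			return e.GameID == gameID
//		}, 0)
//		_ = result
//	}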
type buildConfigInput struct {
|
||||
PostgresDSN string
|
||||
RedisAddr string
|
||||
DockerHost string
|
||||
Network string
|
||||
LobbyURL string
|
||||
GameStateRoot string
|
||||
ReconcileInterval time.Duration
|
||||
CleanupInterval time.Duration
|
||||
InspectInterval time.Duration
|
||||
ProbeInterval time.Duration
|
||||
ProbeTimeout time.Duration
|
||||
ProbeFailures int
|
||||
GameLeaseTTL time.Duration
|
||||
StreamBlockTimeout time.Duration
|
||||
}
|
||||
|
||||
func buildConfig(in buildConfigInput) config.Config {
|
||||
cfg := config.DefaultConfig()
|
||||
cfg.InternalHTTP.Addr = listenAddr
|
||||
|
||||
cfg.Docker.Host = in.DockerHost
|
||||
cfg.Docker.Network = in.Network
|
||||
cfg.Docker.PullPolicy = config.ImagePullPolicyIfMissing
|
||||
|
||||
cfg.Postgres = config.PostgresConfig{
|
||||
Conn: postgres.Config{
|
||||
PrimaryDSN: in.PostgresDSN,
|
||||
OperationTimeout: pgOperationTimeout,
|
||||
MaxOpenConns: 5,
|
||||
MaxIdleConns: 2,
|
||||
ConnMaxLifetime: 30 * time.Minute,
|
||||
},
|
||||
}
|
||||
|
||||
cfg.Redis = config.RedisConfig{
|
||||
Conn: redisconn.Config{
|
||||
MasterAddr: in.RedisAddr,
|
||||
Password: "integration",
|
||||
OperationTimeout: 2 * time.Second,
|
||||
},
|
||||
}
|
||||
|
||||
cfg.Streams.StartJobs = StartJobsStream
|
||||
cfg.Streams.StopJobs = StopJobsStream
|
||||
cfg.Streams.JobResults = JobResultsStream
|
||||
cfg.Streams.HealthEvents = HealthEventsStream
|
||||
cfg.Streams.NotificationIntents = NotificationIntentsKey
|
||||
cfg.Streams.BlockTimeout = in.StreamBlockTimeout
|
||||
|
||||
cfg.Container.GameStateRoot = in.GameStateRoot
|
||||
// Pin chown target to the current process uid/gid; the dev sandbox
|
||||
// (and unprivileged dev machines) cannot chown to root.
|
||||
cfg.Container.GameStateOwnerUID = os.Getuid()
|
||||
cfg.Container.GameStateOwnerGID = os.Getgid()
|
||||
|
||||
cfg.Health.InspectInterval = in.InspectInterval
|
||||
cfg.Health.ProbeInterval = in.ProbeInterval
|
||||
cfg.Health.ProbeTimeout = in.ProbeTimeout
|
||||
cfg.Health.ProbeFailuresThreshold = in.ProbeFailures
|
||||
|
||||
cfg.Cleanup.ReconcileInterval = in.ReconcileInterval
|
||||
cfg.Cleanup.CleanupInterval = in.CleanupInterval
|
||||
|
||||
cfg.Coordination.GameLeaseTTL = in.GameLeaseTTL
|
||||
|
||||
cfg.Lobby = config.LobbyConfig{
|
||||
BaseURL: in.LobbyURL,
|
||||
Timeout: 2 * time.Second,
|
||||
}
|
||||
|
||||
cfg.Telemetry.TracesExporter = "none"
|
||||
cfg.Telemetry.MetricsExporter = "none"
|
||||
|
||||
return cfg
|
||||
}
|
||||
|
||||
func newLogger(toStderr bool) *slog.Logger {
|
||||
if toStderr {
|
||||
return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
|
||||
}
|
||||
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
|
||||
}
|
||||
|
||||
func stateRoot(t *testing.T) string {
|
||||
t.Helper()
|
||||
dir := t.ArtifactDir()
|
||||
root := dir + string(os.PathSeparator) + gameStateRootSubdir
|
||||
if err := os.MkdirAll(root, 0o755); err != nil {
|
||||
t.Fatalf("rtmanager integration: create game-state root %q: %v", root, err)
|
||||
}
|
||||
return root
|
||||
}
|
||||
|
||||
func resolveDockerHost() string {
|
||||
if host := strings.TrimSpace(os.Getenv("DOCKER_HOST")); host != "" {
|
||||
return host
|
||||
}
|
||||
return "unix:///var/run/docker.sock"
|
||||
}
|
||||
|
||||
func pickDuration(value, fallback time.Duration) time.Duration {
|
||||
if value > 0 {
|
||||
return value
|
||||
}
|
||||
return fallback
|
||||
}
|
||||
|
||||
func pickInt(value, fallback int) int {
|
||||
if value > 0 {
|
||||
return value
|
||||
}
|
||||
return fallback
|
||||
}
|
||||
|
||||
// waitForListener spins until `runtime.InternalServer().Addr()` returns
|
||||
// a non-empty value or the deadline fires. The internal listener binds
|
||||
// during `runtime.Run`, which runs in its own goroutine; this helper
|
||||
// is the bridge between "Run started" and "tests can use REST".
|
||||
func waitForListener(t *testing.T, runtime *app.Runtime) string {
|
||||
t.Helper()
|
||||
deadline := time.Now().Add(listenerWaitTimeout)
|
||||
for {
|
||||
if runtime != nil && runtime.InternalServer() != nil {
|
||||
if addr := runtime.InternalServer().Addr(); addr != "" {
|
||||
return addr
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: internal HTTP listener did not bind within %s", listenerWaitTimeout)
|
||||
}
|
||||
time.Sleep(readyzPollInterval)
|
||||
}
|
||||
}
|
||||
|
||||
// waitForReady polls `/readyz` until it returns 200 or the deadline
|
||||
// fires. RTM's readyz pings PG, Redis, and Docker; a successful
|
||||
// response means every dependency is reachable through the runtime
|
||||
// process.
|
||||
func waitForReady(t *testing.T, runtime *app.Runtime, timeout time.Duration) {
|
||||
t.Helper()
|
||||
deadline := time.Now().Add(timeout)
|
||||
addr := runtime.InternalServer().Addr()
|
||||
probeURL := (&url.URL{Scheme: "http", Host: addr, Path: "/readyz"}).String()
|
||||
for {
|
||||
req, err := newRequest(context.Background(), "GET", probeURL, nil)
|
||||
if err == nil {
|
||||
resp, err := defaultHTTPClient.Do(req)
|
||||
if err == nil {
|
||||
_, _ = io.Copy(io.Discard, resp.Body)
|
||||
_ = resp.Body.Close()
|
||||
if resp.StatusCode == 200 {
|
||||
return
|
||||
}
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: /readyz did not return 200 within %s", timeout)
|
||||
}
|
||||
time.Sleep(readyzPollInterval)
|
||||
}
|
||||
}
|
||||
|
||||
func isCleanShutdownErr(err error) bool {
|
||||
return err == nil || errors.Is(err, context.Canceled)
|
||||
}
|
||||
|
||||
// IDFromTestName builds a deterministic-but-unique game id from the
|
||||
// caller's test name. Two tests with the same name running back-to-back
|
||||
// would otherwise collide on PG state through the per-test
|
||||
// `TruncatePostgres` window; pinning the suffix to `Now().UnixNano()`
|
||||
// rules that out.
|
||||
func IDFromTestName(t *testing.T) string {
|
||||
t.Helper()
|
||||
// The container hostname is `galaxy-game-{game_id}` and must fit
// HOST_NAME_MAX=64 chars; runc rejects longer values with
// "sethostname: invalid argument". Cap the lowercased test-name
// component at 35 chars and append a base36 UnixNano suffix (about
// a dozen characters) so the total stays under the limit
// (12 + 35 + 1 + ~12 = ~60).
const maxNameLen = 35
|
||||
suffix := strconv.FormatInt(time.Now().UnixNano(), 36)
|
||||
prefix := strings.ToLower(strings.NewReplacer("/", "-", " ", "-").Replace(t.Name()))
|
||||
if len(prefix) > maxNameLen {
|
||||
prefix = prefix[:maxNameLen]
|
||||
}
|
||||
return prefix + "-" + suffix
|
||||
}
|
||||
@@ -0,0 +1,128 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
|
||||
"galaxy/rtmanager/internal/domain/health"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// RuntimeRecord returns the persisted runtime record for gameID. The
|
||||
// helper opens the store on every call (cheap; the harness `*sql.DB`
|
||||
// is shared) so individual scenarios stay isolated even if a previous
|
||||
// test mutated store state.
|
||||
func RuntimeRecord(t testing.TB, env *Env, gameID string) (runtime.RuntimeRecord, error) {
|
||||
t.Helper()
|
||||
store, err := runtimerecordstore.New(runtimerecordstore.Config{
|
||||
DB: env.Postgres.Pool(),
|
||||
OperationTimeout: pgOperationTimeout,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return store.Get(context.Background(), gameID)
|
||||
}
|
||||
|
||||
// MustRuntimeRecord asserts that the record exists and returns it.
|
||||
func MustRuntimeRecord(t testing.TB, env *Env, gameID string) runtime.RuntimeRecord {
|
||||
t.Helper()
|
||||
record, err := RuntimeRecord(t, env, gameID)
|
||||
require.NoErrorf(t, err, "load runtime record for %s", gameID)
|
||||
return record
|
||||
}
|
||||
|
||||
// EventuallyRuntimeRecord polls until predicate matches the runtime
|
||||
// record for gameID, or the deadline fires. Returns the matching
|
||||
// record. Used by lifecycle assertions that depend on async state
|
||||
// transitions (start consumer → record).
|
||||
func EventuallyRuntimeRecord(t testing.TB, env *Env, gameID string, predicate func(runtime.RuntimeRecord) bool, timeout time.Duration) runtime.RuntimeRecord {
|
||||
t.Helper()
|
||||
if timeout <= 0 {
|
||||
timeout = defaultStreamTimeout
|
||||
}
|
||||
deadline := time.Now().Add(timeout)
|
||||
for {
|
||||
record, err := RuntimeRecord(t, env, gameID)
|
||||
if err == nil && predicate(record) {
|
||||
return record
|
||||
}
|
||||
if err != nil && !errors.Is(err, runtime.ErrNotFound) {
|
||||
t.Fatalf("rtmanager integration: load runtime record: %v", err)
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
if err != nil {
|
||||
t.Fatalf("rtmanager integration: runtime record predicate not met within %s; last err=%v",
|
||||
timeout, err)
|
||||
}
|
||||
t.Fatalf("rtmanager integration: runtime record predicate not met within %s; last record=%+v",
|
||||
timeout, record)
|
||||
}
|
||||
time.Sleep(defaultStreamPoll)
|
||||
}
|
||||
}
|
||||
|
||||
// OperationEntries returns up to `limit` most-recent operation_log
|
||||
// entries for gameID, ordered descending by started_at.
|
||||
func OperationEntries(t testing.TB, env *Env, gameID string, limit int) []operation.OperationEntry {
|
||||
t.Helper()
|
||||
store, err := operationlogstore.New(operationlogstore.Config{
|
||||
DB: env.Postgres.Pool(),
|
||||
OperationTimeout: pgOperationTimeout,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
entries, err := store.ListByGame(context.Background(), gameID, limit)
|
||||
require.NoErrorf(t, err, "list operation log entries for %s", gameID)
|
||||
return entries
|
||||
}
|
||||
|
||||
// EventuallyOperationKind polls operation_log until at least one entry
|
||||
// for gameID has the requested kind, or the deadline fires. Returns
|
||||
// the matching entry.
|
||||
func EventuallyOperationKind(t testing.TB, env *Env, gameID string, kind operation.OpKind, timeout time.Duration) operation.OperationEntry {
|
||||
t.Helper()
|
||||
if timeout <= 0 {
|
||||
timeout = defaultStreamTimeout
|
||||
}
|
||||
deadline := time.Now().Add(timeout)
|
||||
for {
|
||||
entries := OperationEntries(t, env, gameID, 50)
|
||||
for _, entry := range entries {
|
||||
if entry.OpKind == kind {
|
||||
return entry
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: operation_log entry with op_kind=%s not seen within %s; observed=%v",
|
||||
kind, timeout, opKindSummary(entries))
|
||||
}
|
||||
time.Sleep(defaultStreamPoll)
|
||||
}
|
||||
}
|
||||
|
||||
// HealthSnapshot returns the latest persisted health snapshot for
|
||||
// gameID, or the underlying not-found sentinel when nothing has been
|
||||
// recorded yet.
|
||||
func HealthSnapshot(t testing.TB, env *Env, gameID string) (health.HealthSnapshot, error) {
|
||||
t.Helper()
|
||||
store, err := healthsnapshotstore.New(healthsnapshotstore.Config{
|
||||
DB: env.Postgres.Pool(),
|
||||
OperationTimeout: pgOperationTimeout,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return store.Get(context.Background(), gameID)
|
||||
}
|
||||
|
||||
func opKindSummary(entries []operation.OperationEntry) []string {
|
||||
out := make([]string, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
out = append(out, string(entry.OpKind)+"/"+string(entry.Outcome))
|
||||
}
|
||||
return out
|
||||
}
|
||||
@@ -0,0 +1,334 @@
|
||||
package harness
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"strconv"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// Default scenario timeouts. Stream-driven assertions sit on top of
|
||||
// the runtime's worker tickers (defaults of 200-500ms in
|
||||
// `EnvOptions`); 30s gives every reconcile / probe / event tick more
|
||||
// than enough headroom even on a slow CI runner.
|
||||
const (
|
||||
defaultStreamTimeout = 30 * time.Second
|
||||
defaultStreamPoll = 25 * time.Millisecond
|
||||
)
|
||||
|
||||
// XAddStartJob appends one start-job entry in the
|
||||
// `runtime:start_jobs` AsyncAPI shape and returns the assigned entry
|
||||
// id. Mirrors the wire shape produced by Lobby's
|
||||
// `runtimemanager.Publisher` so the consumer treats the entry exactly
|
||||
// like a real Lobby-published job.
|
||||
func XAddStartJob(t testing.TB, env *Env, gameID, imageRef string) string {
|
||||
t.Helper()
|
||||
id, err := env.RedisClient.XAdd(context.Background(), &redis.XAddArgs{
|
||||
Stream: env.Cfg.Streams.StartJobs,
|
||||
Values: map[string]any{
|
||||
"game_id": gameID,
|
||||
"image_ref": imageRef,
|
||||
"requested_at_ms": time.Now().UTC().UnixMilli(),
|
||||
},
|
||||
}).Result()
|
||||
require.NoErrorf(t, err, "xadd start_jobs for game %s", gameID)
|
||||
return id
|
||||
}
|
||||
|
||||
// XAddStopJob appends one stop-job entry classified by reason. The
|
||||
// reason enum is documented at `ports.StopReason`.
|
||||
func XAddStopJob(t testing.TB, env *Env, gameID, reason string) string {
|
||||
t.Helper()
|
||||
id, err := env.RedisClient.XAdd(context.Background(), &redis.XAddArgs{
|
||||
Stream: env.Cfg.Streams.StopJobs,
|
||||
Values: map[string]any{
|
||||
"game_id": gameID,
|
||||
"reason": reason,
|
||||
"requested_at_ms": time.Now().UTC().UnixMilli(),
|
||||
},
|
||||
}).Result()
|
||||
require.NoErrorf(t, err, "xadd stop_jobs for game %s", gameID)
|
||||
return id
|
||||
}
|
||||
|
||||
// JobResultEntry is the decoded shape of one `runtime:job_results`
|
||||
// stream entry. Mirrors `ports.JobResult` plus the entry id surfaced
|
||||
// by Redis so tests can correlate XADD ids with results.
|
||||
type JobResultEntry struct {
|
||||
StreamID string
|
||||
GameID string
|
||||
Outcome string
|
||||
ContainerID string
|
||||
EngineEndpoint string
|
||||
ErrorCode string
|
||||
ErrorMessage string
|
||||
}
|
||||
|
||||
// HealthEventEntry mirrors the `runtime:health_events` AsyncAPI shape
|
||||
// in decoded form.
|
||||
type HealthEventEntry struct {
|
||||
StreamID string
|
||||
GameID string
|
||||
ContainerID string
|
||||
EventType string
|
||||
OccurredAtMs int64
|
||||
Details map[string]any
|
||||
}
|
||||
|
||||
// NotificationIntentEntry decodes one `notification:intents` entry
|
||||
// that RTM publishes for first-touch start failures.
|
||||
type NotificationIntentEntry struct {
|
||||
StreamID string
|
||||
NotificationType string
|
||||
IdempotencyKey string
|
||||
Payload map[string]any
|
||||
}
|
||||
|
||||
// WaitForJobResult polls `runtime:job_results` until predicate
|
||||
// matches, or the timeout fires. Returns the matching entry. The
|
||||
// helper does not consume the stream — every call rescans from `0-0`
|
||||
// — because RTM's writes are append-only and the cardinality per test
|
||||
// is small.
|
||||
func WaitForJobResult(t testing.TB, env *Env, predicate func(JobResultEntry) bool, timeout time.Duration) JobResultEntry {
|
||||
t.Helper()
|
||||
if timeout <= 0 {
|
||||
timeout = defaultStreamTimeout
|
||||
}
|
||||
deadline := time.Now().Add(timeout)
|
||||
for {
|
||||
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.JobResults, "-", "+").Result()
|
||||
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.JobResults)
|
||||
for _, entry := range entries {
|
||||
decoded := decodeJobResult(entry)
|
||||
if predicate(decoded) {
|
||||
return decoded
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: no job_result matched within %s; observed=%v",
|
||||
timeout, jobResultStreamSummary(entries))
|
||||
}
|
||||
time.Sleep(defaultStreamPoll)
|
||||
}
|
||||
}
|
||||
|
||||
// AllJobResults returns every entry on `runtime:job_results` in stream
|
||||
// order. Useful for assertions that depend on cardinality (replay
|
||||
// tests).
|
||||
func AllJobResults(t testing.TB, env *Env) []JobResultEntry {
|
||||
t.Helper()
|
||||
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.JobResults, "-", "+").Result()
|
||||
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.JobResults)
|
||||
out := make([]JobResultEntry, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
out = append(out, decodeJobResult(entry))
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
// WaitForHealthEvent polls `runtime:health_events` until predicate
|
||||
// matches, or the timeout fires.
|
||||
func WaitForHealthEvent(t testing.TB, env *Env, predicate func(HealthEventEntry) bool, timeout time.Duration) HealthEventEntry {
|
||||
t.Helper()
|
||||
if timeout <= 0 {
|
||||
timeout = defaultStreamTimeout
|
||||
}
|
||||
deadline := time.Now().Add(timeout)
|
||||
for {
|
||||
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.HealthEvents, "-", "+").Result()
|
||||
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.HealthEvents)
|
||||
for _, entry := range entries {
|
||||
decoded := decodeHealthEvent(t, entry)
|
||||
if predicate(decoded) {
|
||||
return decoded
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: no health_event matched within %s; observed=%v",
|
||||
timeout, healthEventStreamSummary(entries))
|
||||
}
|
||||
time.Sleep(defaultStreamPoll)
|
||||
}
|
||||
}
|
||||
|
||||
// WaitForNotificationIntent polls `notification:intents` until
|
||||
// predicate matches.
|
||||
func WaitForNotificationIntent(t testing.TB, env *Env, predicate func(NotificationIntentEntry) bool, timeout time.Duration) NotificationIntentEntry {
|
||||
t.Helper()
|
||||
if timeout <= 0 {
|
||||
timeout = defaultStreamTimeout
|
||||
}
|
||||
deadline := time.Now().Add(timeout)
|
||||
for {
|
||||
entries, err := env.RedisClient.XRange(context.Background(), env.Cfg.Streams.NotificationIntents, "-", "+").Result()
|
||||
require.NoErrorf(t, err, "xrange %s", env.Cfg.Streams.NotificationIntents)
|
||||
for _, entry := range entries {
|
||||
decoded := decodeNotificationIntent(t, entry)
|
||||
if predicate(decoded) {
|
||||
return decoded
|
||||
}
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("rtmanager integration: no notification_intent matched within %s; observed=%v",
|
||||
timeout, notificationStreamSummary(entries))
|
||||
}
|
||||
time.Sleep(defaultStreamPoll)
|
||||
}
|
||||
}
|
||||
|
||||
// JobOutcomeIs returns a predicate matching a job result whose game id
|
||||
// and outcome equal the inputs.
|
||||
func JobOutcomeIs(gameID, outcome string) func(JobResultEntry) bool {
|
||||
return func(entry JobResultEntry) bool {
|
||||
return entry.GameID == gameID && entry.Outcome == outcome
|
||||
}
|
||||
}
|
||||
|
||||
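// Illustrative usage sketch ("started" is a placeholder literal; the
// real outcome values come from ports.JobResult):
//
//	result := WaitForJobResult(t, env, JobOutcomeIs(gameID, "started"), 0)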
// JobOutcomeWithErrorCode matches a job result whose game id, outcome,
|
||||
// and error_code all equal the inputs. Used by replay-no-op
|
||||
// assertions.
|
||||
func JobOutcomeWithErrorCode(gameID, outcome, errorCode string) func(JobResultEntry) bool {
|
||||
return func(entry JobResultEntry) bool {
|
||||
return entry.GameID == gameID && entry.Outcome == outcome && entry.ErrorCode == errorCode
|
||||
}
|
||||
}
|
||||
|
||||
// HealthEventTypeIs returns a predicate matching a health event whose
|
||||
// game id and event_type equal the inputs.
|
||||
func HealthEventTypeIs(gameID, eventType string) func(HealthEventEntry) bool {
|
||||
return func(entry HealthEventEntry) bool {
|
||||
return entry.GameID == gameID && entry.EventType == eventType
|
||||
}
|
||||
}
|
||||
|
||||
func decodeJobResult(message redis.XMessage) JobResultEntry {
|
||||
return JobResultEntry{
|
||||
StreamID: message.ID,
|
||||
GameID: streamString(message.Values, "game_id"),
|
||||
Outcome: streamString(message.Values, "outcome"),
|
||||
ContainerID: streamString(message.Values, "container_id"),
|
||||
EngineEndpoint: streamString(message.Values, "engine_endpoint"),
|
||||
ErrorCode: streamString(message.Values, "error_code"),
|
||||
ErrorMessage: streamString(message.Values, "error_message"),
|
||||
}
|
||||
}
|
||||
|
||||
func decodeHealthEvent(t testing.TB, message redis.XMessage) HealthEventEntry {
|
||||
t.Helper()
|
||||
occurredAt, _ := strconv.ParseInt(streamString(message.Values, "occurred_at_ms"), 10, 64)
|
||||
entry := HealthEventEntry{
|
||||
StreamID: message.ID,
|
||||
GameID: streamString(message.Values, "game_id"),
|
||||
ContainerID: streamString(message.Values, "container_id"),
|
||||
EventType: streamString(message.Values, "event_type"),
|
||||
OccurredAtMs: occurredAt,
|
||||
}
|
||||
rawDetails := streamString(message.Values, "details")
|
||||
if rawDetails != "" {
|
||||
var parsed map[string]any
|
||||
if err := json.Unmarshal([]byte(rawDetails), &parsed); err == nil {
|
||||
entry.Details = parsed
|
||||
}
|
||||
}
|
||||
return entry
|
||||
}
|
||||
|
||||
func decodeNotificationIntent(t testing.TB, message redis.XMessage) NotificationIntentEntry {
|
||||
t.Helper()
|
||||
entry := NotificationIntentEntry{
|
||||
StreamID: message.ID,
|
||||
NotificationType: streamString(message.Values, "notification_type"),
|
||||
IdempotencyKey: streamString(message.Values, "idempotency_key"),
|
||||
}
|
||||
rawPayload := streamString(message.Values, "payload_json")
|
||||
if rawPayload == "" {
|
||||
rawPayload = streamString(message.Values, "payload")
|
||||
}
|
||||
if rawPayload != "" {
|
||||
var parsed map[string]any
|
||||
if err := json.Unmarshal([]byte(rawPayload), &parsed); err == nil {
|
||||
entry.Payload = parsed
|
||||
}
|
||||
}
|
||||
return entry
|
||||
}
|
||||
|
||||
func streamString(values map[string]any, key string) string {
|
||||
raw, ok := values[key]
|
||||
if !ok {
|
||||
return ""
|
||||
}
|
||||
switch typed := raw.(type) {
|
||||
case string:
|
||||
return typed
|
||||
case []byte:
|
||||
return string(typed)
|
||||
default:
|
||||
return fmt.Sprintf("%v", typed)
|
||||
}
|
||||
}
|
||||
|
||||
func jobResultStreamSummary(entries []redis.XMessage) []string {
|
||||
out := make([]string, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
decoded := decodeJobResult(entry)
|
||||
out = append(out, fmt.Sprintf("%s game=%s outcome=%s err=%s",
|
||||
decoded.StreamID, decoded.GameID, decoded.Outcome, decoded.ErrorCode))
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func healthEventStreamSummary(entries []redis.XMessage) []string {
|
||||
out := make([]string, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
out = append(out, fmt.Sprintf("%s %s %s",
|
||||
entry.ID, streamString(entry.Values, "game_id"), streamString(entry.Values, "event_type")))
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func notificationStreamSummary(entries []redis.XMessage) []string {
|
||||
out := make([]string, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
out = append(out, fmt.Sprintf("%s %s",
|
||||
entry.ID, streamString(entry.Values, "notification_type")))
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
// EnsureJobOutcomeConstants pins the constants from `ports` so suite
|
||||
// authors can build predicates without importing `ports` themselves.
|
||||
// Re-exported here to keep test source focused.
|
||||
var (
|
||||
JobOutcomeSuccess = ports.JobOutcomeSuccess
|
||||
JobOutcomeFailure = ports.JobOutcomeFailure
|
||||
)
|
||||
|
||||
// AssertNoJobResultBeyond fails the test if the count of entries on
|
||||
// `runtime:job_results` exceeds `expectedCount`. Used by the replay
|
||||
// tests to prove the second envelope was no-op.
|
||||
func AssertNoJobResultBeyond(t testing.TB, env *Env, expectedCount int) {
|
||||
t.Helper()
|
||||
entries, err := env.RedisClient.XLen(context.Background(), env.Cfg.Streams.JobResults).Result()
|
||||
require.NoError(t, err)
|
||||
require.LessOrEqualf(t, entries, int64(expectedCount),
|
||||
"job_results stream has more entries than expected; got=%d expected<=%d", entries, expectedCount)
|
||||
}
|
||||
|
||||
// SanitizeContainerSummaryFor returns a stable diagnostic string for a
|
||||
// container summary keyed by game id. Used in test failures.
|
||||
func SanitizeContainerSummaryFor(values map[string]string, gameID string) string {
|
||||
parts := make([]string, 0, len(values))
|
||||
for key, value := range values {
|
||||
parts = append(parts, key+"="+value)
|
||||
}
|
||||
return fmt.Sprintf("game=%s {%s}", gameID, strings.Join(parts, ", "))
|
||||
}
|
||||
@@ -0,0 +1,303 @@
|
||||
//go:build integration
|
||||
|
||||
// Package integration_test owns the service-local end-to-end scenarios
|
||||
// for Runtime Manager. The build tag keeps the suite out of the
|
||||
// default `go test ./...` run; CI invokes the suite explicitly with
|
||||
// `go test -tags=integration ./rtmanager/integration/...`.
|
||||
//
|
||||
// Design rationale for the suite — build tag, in-process harness,
|
||||
// per-test isolation, two-tag engine image — lives in
|
||||
// `rtmanager/docs/integration-tests.md`. Each test stands up its own
|
||||
// Runtime Manager process via `harness.NewEnv`, drives the same
|
||||
// streams Game Lobby uses in `integration/lobbyrtm`, and asserts the
|
||||
// resulting PostgreSQL, Redis-stream, and Docker side-effects.
|
||||
package integration_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"net/http"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/integration/harness"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/docker/docker/api/types/container"
|
||||
"github.com/docker/docker/api/types/filters"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// TestMain centralises shared-container teardown so individual
|
||||
// failing tests do not leak the testcontainers postgres / redis pair.
|
||||
func TestMain(m *testing.M) {
|
||||
harness.RunMain(m)
|
||||
}
|
||||
|
||||
// TestLifecycle_StartInspectStopRestartPatchCleanup drives one game
|
||||
// through every supported lifecycle operation against the real engine
|
||||
// image and asserts each step's PG, Redis-stream, and Docker
|
||||
// side-effects.
|
||||
func TestLifecycle_StartInspectStopRestartPatchCleanup(t *testing.T) {
|
||||
env := harness.NewEnv(t, harness.EnvOptions{LogToStderr: true})
|
||||
rest := harness.NewREST(env)
|
||||
gameID := harness.IDFromTestName(t)
|
||||
|
||||
// Step 1 — start through the Lobby async stream contract.
|
||||
startEntryID := harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
|
||||
t.Logf("start_jobs xadd id=%s", startEntryID)
|
||||
|
||||
startResult := harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
|
||||
30*time.Second,
|
||||
)
|
||||
require.Equal(t, "", startResult.ErrorCode, "fresh start must publish empty error_code")
|
||||
require.NotEmpty(t, startResult.ContainerID, "fresh start job result must carry container_id")
|
||||
require.NotEmpty(t, startResult.EngineEndpoint, "fresh start job result must carry engine_endpoint")
|
||||
|
||||
// PG record reflects the start.
|
||||
startedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
|
||||
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusRunning },
|
||||
15*time.Second,
|
||||
)
|
||||
assert.Equal(t, env.EngineImageRef, startedRecord.CurrentImageRef)
|
||||
assert.Equal(t, env.Network, startedRecord.DockerNetwork)
|
||||
assert.Equal(t, startResult.ContainerID, startedRecord.CurrentContainerID)
|
||||
assert.Equal(t, startResult.EngineEndpoint, startedRecord.EngineEndpoint)
|
||||
|
||||
// operation_log captures the start.
|
||||
startEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindStart, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeSuccess, startEntry.Outcome)
|
||||
assert.Equal(t, operation.OpSourceLobbyStream, startEntry.OpSource)
|
||||
|
||||
// Step 2 — inspect via the GM/Admin REST surface.
|
||||
getResp, status := rest.GetRuntime(t, gameID)
|
||||
require.Equal(t, http.StatusOK, status)
|
||||
require.Equal(t, "running", getResp.Status)
|
||||
require.NotNil(t, getResp.CurrentContainerID)
|
||||
require.Equal(t, startResult.ContainerID, *getResp.CurrentContainerID)
|
||||
require.NotNil(t, getResp.CurrentImageRef)
|
||||
require.Equal(t, env.EngineImageRef, *getResp.CurrentImageRef)
|
||||
require.NotNil(t, getResp.EngineEndpoint)
|
||||
require.Equal(t, startResult.EngineEndpoint, *getResp.EngineEndpoint)
|
||||
|
||||
// Step 3 — stop through the Lobby async stream contract.
|
||||
harness.XAddStopJob(t, env, gameID, "cancelled")
|
||||
stopResult := waitForLatestStopOrStartResult(t, env, gameID)
|
||||
require.Equal(t, ports.JobOutcomeSuccess, stopResult.Outcome)
|
||||
require.Equal(t, "", stopResult.ErrorCode, "fresh stop must publish empty error_code")
|
||||
|
||||
stoppedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
|
||||
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusStopped },
|
||||
15*time.Second,
|
||||
)
|
||||
assert.Equal(t, startResult.ContainerID, stoppedRecord.CurrentContainerID,
|
||||
"stop preserves the current container id until cleanup")
|
||||
|
||||
// Step 4 — restart via REST. Container id changes; engine endpoint
|
||||
// stays stable.
|
||||
restartResp, status := rest.RestartRuntime(t, gameID)
|
||||
require.Equal(t, http.StatusOK, status)
|
||||
require.Equal(t, "running", restartResp.Status)
|
||||
require.NotNil(t, restartResp.CurrentContainerID)
|
||||
require.NotEqual(t, startResult.ContainerID, *restartResp.CurrentContainerID,
|
||||
"restart must produce a new container id")
|
||||
require.NotNil(t, restartResp.EngineEndpoint)
|
||||
require.Equal(t, startResult.EngineEndpoint, *restartResp.EngineEndpoint,
|
||||
"restart must keep the engine endpoint stable")
|
||||
|
||||
restartContainerID := *restartResp.CurrentContainerID
|
||||
restartEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindRestart, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeSuccess, restartEntry.Outcome)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, restartEntry.OpSource)
|
||||
|
||||
// Step 5 — patch to the second semver-compatible tag. Same image
|
||||
// content, but the runtime should still record the new tag and
|
||||
// recreate the container.
|
||||
patchResp, status := rest.PatchRuntime(t, gameID, env.PatchedImageRef)
|
||||
require.Equal(t, http.StatusOK, status)
|
||||
require.Equal(t, "running", patchResp.Status)
|
||||
require.NotNil(t, patchResp.CurrentImageRef)
|
||||
assert.Equal(t, env.PatchedImageRef, *patchResp.CurrentImageRef)
|
||||
require.NotNil(t, patchResp.CurrentContainerID)
|
||||
assert.NotEqual(t, restartContainerID, *patchResp.CurrentContainerID,
|
||||
"patch must recreate the container")
|
||||
|
||||
patchEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindPatch, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeSuccess, patchEntry.Outcome)
|
||||
|
||||
// Step 6 — quiesce via REST stop so cleanup is allowed (cleanup
|
||||
// refuses to remove a running container per
|
||||
// `rtmanager/README.md §Lifecycles → Cleanup`).
|
||||
stopResp, status := rest.StopRuntime(t, gameID, "admin_request")
|
||||
require.Equal(t, http.StatusOK, status)
|
||||
require.Equal(t, "stopped", stopResp.Status)
|
||||
|
||||
// Step 7 — cleanup the container. PG record flips to removed and
|
||||
// current_container_id becomes nil.
|
||||
cleanupResp, status := rest.CleanupRuntime(t, gameID)
|
||||
require.Equal(t, http.StatusOK, status)
|
||||
require.Equal(t, "removed", cleanupResp.Status)
|
||||
require.Nil(t, cleanupResp.CurrentContainerID)
|
||||
|
||||
cleanupEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindCleanupContainer, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeSuccess, cleanupEntry.Outcome)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, cleanupEntry.OpSource)
|
||||
}
|
||||
|
||||
// TestReplay_StartJobIsNoop publishes the same start envelope twice
|
||||
// and asserts that Runtime Manager produces a fresh job_result for
|
||||
// the first XADD and a `replay_no_op` outcome for the second, without
|
||||
// recreating the engine container.
|
||||
func TestReplay_StartJobIsNoop(t *testing.T) {
|
||||
env := harness.NewEnv(t, harness.EnvOptions{})
|
||||
gameID := harness.IDFromTestName(t)
|
||||
|
||||
// First XADD: fresh start.
|
||||
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
|
||||
first := harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
|
||||
30*time.Second,
|
||||
)
|
||||
require.Equal(t, "", first.ErrorCode)
|
||||
|
||||
// Second XADD: same envelope; the start service must short-circuit
|
||||
// at the `runtime_records.status=running && image_ref` check.
|
||||
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
|
||||
replay := harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeWithErrorCode(gameID, ports.JobOutcomeSuccess, "replay_no_op"),
|
||||
15*time.Second,
|
||||
)
|
||||
assert.Equal(t, first.ContainerID, replay.ContainerID,
|
||||
"replay must surface the same container id as the original start")
|
||||
assert.Equal(t, first.EngineEndpoint, replay.EngineEndpoint)
|
||||
|
||||
// Docker view: exactly one engine container exists for this game.
|
||||
assertSingleEngineContainer(t, env, gameID)
|
||||
|
||||
// Lifecycle stream produced exactly two entries: fresh + replay.
|
||||
entries := harness.AllJobResults(t, env)
|
||||
require.Len(t, entries, 2)
|
||||
assert.Equal(t, "", entries[0].ErrorCode)
|
||||
assert.Equal(t, "replay_no_op", entries[1].ErrorCode)
|
||||
}
|
||||
|
||||
// TestReplay_StopJobIsNoop publishes a stop envelope twice after a
|
||||
// successful start and asserts the second stop surfaces as
|
||||
// `replay_no_op` without altering the runtime record's `stopped_at`.
|
||||
func TestReplay_StopJobIsNoop(t *testing.T) {
|
||||
env := harness.NewEnv(t, harness.EnvOptions{})
|
||||
gameID := harness.IDFromTestName(t)
|
||||
|
||||
// Bring the game to `running`. The start path publishes one entry
|
||||
// to `runtime:job_results`; the stops below publish two more, so
|
||||
// per-game stream order is [start, first-stop, replay-stop].
|
||||
harness.XAddStartJob(t, env, gameID, env.EngineImageRef)
|
||||
harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeIs(gameID, ports.JobOutcomeSuccess),
|
||||
30*time.Second,
|
||||
)
|
||||
|
||||
// First stop: fresh. The expectedCount accounts for the start
|
||||
// entry that is already on the stream.
|
||||
harness.XAddStopJob(t, env, gameID, "cancelled")
|
||||
first := waitForJobResultByIndex(t, env, gameID, 2)
|
||||
require.Equal(t, ports.JobOutcomeSuccess, first.Outcome)
|
||||
require.Equal(t, "", first.ErrorCode)
|
||||
|
||||
stoppedRecord := harness.EventuallyRuntimeRecord(t, env, gameID,
|
||||
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusStopped },
|
||||
15*time.Second,
|
||||
)
|
||||
require.NotNil(t, stoppedRecord.StoppedAt, "stopped record must carry stopped_at")
|
||||
originalStoppedAt := *stoppedRecord.StoppedAt
|
||||
|
||||
// Second stop: replay (third entry on the per-game stream).
|
||||
harness.XAddStopJob(t, env, gameID, "cancelled")
|
||||
replay := waitForJobResultByIndex(t, env, gameID, 3)
|
||||
require.Equal(t, ports.JobOutcomeSuccess, replay.Outcome)
|
||||
assert.Equal(t, "replay_no_op", replay.ErrorCode)
|
||||
|
||||
// stopped_at stays anchored to the first stop.
|
||||
postReplay := harness.MustRuntimeRecord(t, env, gameID)
|
||||
require.Equal(t, runtime.StatusStopped, postReplay.Status)
|
||||
require.NotNil(t, postReplay.StoppedAt)
|
||||
assert.True(t, originalStoppedAt.Equal(*postReplay.StoppedAt),
|
||||
"stopped_at must not move on a replay stop; was %s, now %s",
|
||||
originalStoppedAt, *postReplay.StoppedAt)
|
||||
}
|
||||
|
||||
// waitForLatestStopOrStartResult finds the most recent `outcome=success`
|
||||
// entry on `runtime:job_results` for gameID. The lifecycle scenario
|
||||
// emits two consecutive successes (start then stop); the helper picks
|
||||
// the second one without re-scanning the stream every iteration.
|
||||
func waitForLatestStopOrStartResult(t *testing.T, env *harness.Env, gameID string) harness.JobResultEntry {
|
||||
t.Helper()
|
||||
deadline := time.Now().Add(30 * time.Second)
|
||||
for {
|
||||
entries := harness.AllJobResults(t, env)
|
||||
// Two entries means we've observed both the start and stop
|
||||
// outcomes for this game.
|
||||
matched := 0
|
||||
var last harness.JobResultEntry
|
||||
for _, entry := range entries {
|
||||
if entry.GameID == gameID && entry.Outcome == ports.JobOutcomeSuccess {
|
||||
matched++
|
||||
last = entry
|
||||
}
|
||||
}
|
||||
if matched >= 2 {
|
||||
return last
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("expected two job_results for %s, got %d", gameID, matched)
|
||||
}
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
}
|
||||
}
|
||||
|
||||
// waitForJobResultByIndex polls the job_results stream until it has
|
||||
// at least `expectedCount` entries for gameID and returns the
|
||||
// expectedCount-th. Used by the replay tests to deterministically
|
||||
// pick the second / nth result.
|
||||
func waitForJobResultByIndex(t *testing.T, env *harness.Env, gameID string, expectedCount int) harness.JobResultEntry {
|
||||
t.Helper()
|
||||
deadline := time.Now().Add(30 * time.Second)
|
||||
for {
|
||||
entries := harness.AllJobResults(t, env)
|
||||
matches := make([]harness.JobResultEntry, 0, len(entries))
|
||||
for _, entry := range entries {
|
||||
if entry.GameID == gameID {
|
||||
matches = append(matches, entry)
|
||||
}
|
||||
}
|
||||
if len(matches) >= expectedCount {
|
||||
return matches[expectedCount-1]
|
||||
}
|
||||
if time.Now().After(deadline) {
|
||||
t.Fatalf("expected at least %d job_results for %s, got %d",
|
||||
expectedCount, gameID, len(matches))
|
||||
}
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
}
|
||||
}
|
||||
|
||||
// assertSingleEngineContainer queries Docker by the per-game label and
|
||||
// asserts exactly one matching container exists. Catches replay
|
||||
// regressions that would let RTM start two containers for the same
|
||||
// game id.
|
||||
func assertSingleEngineContainer(t *testing.T, env *harness.Env, gameID string) {
|
||||
t.Helper()
|
||||
args := filters.NewArgs(
|
||||
filters.Arg("label", "com.galaxy.owner=rtmanager"),
|
||||
filters.Arg("label", "com.galaxy.game_id="+gameID),
|
||||
)
|
||||
containers, err := env.Docker.Client().ContainerList(
|
||||
context.Background(),
|
||||
container.ListOptions{All: true, Filters: args},
|
||||
)
|
||||
require.NoError(t, err)
|
||||
require.Lenf(t, containers, 1, "expected one engine container for game %s, got %d", gameID, len(containers))
|
||||
}
|
||||
@@ -0,0 +1,200 @@
|
||||
//go:build integration
|
||||
|
||||
package integration_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"strconv"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/notificationintent"
|
||||
"galaxy/rtmanager/integration/harness"
|
||||
"galaxy/rtmanager/internal/domain/health"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
|
||||
dockercontainer "github.com/docker/docker/api/types/container"
|
||||
"github.com/docker/docker/api/types/network"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// TestHealth_ContainerDisappearedAndAdopt verifies the two
|
||||
// drift-detection paths. The Docker events listener emits
|
||||
// `container_disappeared` when a tracked container is destroyed
|
||||
// outside RTM, and the reconciler adopts a fresh container labelled
|
||||
// `com.galaxy.owner=rtmanager` that has no PG row.
|
||||
//
|
||||
// `runtime_records.status=removed` is terminal per
|
||||
// `runtime.AllowedTransitions`; the adoption path therefore uses a
|
||||
// **fresh** game_id rather than re-adopting the disposed one. That
|
||||
// matches the documented contract: reconciler adopts containers
|
||||
// labelled `com.galaxy.owner=rtmanager` for which no PG row exists.
|
||||
func TestHealth_ContainerDisappearedAndAdopt(t *testing.T) {
|
||||
env := harness.NewEnv(t, harness.EnvOptions{
|
||||
ReconcileInterval: 500 * time.Millisecond,
|
||||
})
|
||||
|
||||
// Step 1 — bring a game to running through the start consumer.
|
||||
disposalGameID := harness.IDFromTestName(t) + "-d"
|
||||
harness.XAddStartJob(t, env, disposalGameID, env.EngineImageRef)
|
||||
startResult := harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeIs(disposalGameID, ports.JobOutcomeSuccess),
|
||||
30*time.Second,
|
||||
)
|
||||
originalContainerID := startResult.ContainerID
|
||||
require.NotEmpty(t, originalContainerID)
|
||||
|
||||
// Step 2 — externally remove the container; the events listener
|
||||
// should observe the destroy and publish `container_disappeared`.
|
||||
removeContainer(t, env, originalContainerID)
|
||||
disappeared := harness.WaitForHealthEvent(t, env,
|
||||
harness.HealthEventTypeIs(disposalGameID, string(health.EventTypeContainerDisappeared)),
|
||||
20*time.Second,
|
||||
)
|
||||
assert.Equal(t, originalContainerID, disappeared.ContainerID)
|
||||
|
||||
// The reconciler also marks the runtime record as removed within
|
||||
// one or two ticks (`reconcile_dispose`).
|
||||
harness.EventuallyRuntimeRecord(t, env, disposalGameID,
|
||||
func(r runtime.RuntimeRecord) bool { return r.Status == runtime.StatusRemoved },
|
||||
15*time.Second,
|
||||
)
|
||||
harness.EventuallyOperationKind(t, env, disposalGameID, operation.OpKindReconcileDispose, 5*time.Second)
|
||||
|
||||
// Step 3 — bring up an adoption candidate for an unseen game id
|
||||
// by hand. The reconciler must label-match it, find no record,
|
||||
// and insert one with status=running.
|
||||
adoptionGameID := harness.IDFromTestName(t) + "-a"
|
||||
manualContainerID := runManualEngineContainer(t, env, adoptionGameID)
|
||||
t.Logf("manual container id=%s", manualContainerID)
|
||||
|
||||
adopted := harness.EventuallyRuntimeRecord(t, env, adoptionGameID,
|
||||
func(r runtime.RuntimeRecord) bool {
|
||||
return r.Status == runtime.StatusRunning && r.CurrentContainerID == manualContainerID
|
||||
},
|
||||
20*time.Second,
|
||||
)
|
||||
assert.Equal(t, env.EngineImageRef, adopted.CurrentImageRef)
|
||||
|
||||
adoptEntry := harness.EventuallyOperationKind(t, env, adoptionGameID, operation.OpKindReconcileAdopt, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeSuccess, adoptEntry.Outcome)
|
||||
assert.Equal(t, operation.OpSourceAutoReconcile, adoptEntry.OpSource)
|
||||
assert.Equal(t, manualContainerID, adoptEntry.ContainerID)
|
||||
}
|
||||
|
||||
// TestNotification_ImagePullFailed drives Runtime Manager with a
|
||||
// start envelope pointing at an unresolvable image reference. The
|
||||
// start service must surface the failure on `runtime:job_results` and
|
||||
// publish a `runtime.image_pull_failed` admin notification on
|
||||
// `notification:intents`.
|
||||
func TestNotification_ImagePullFailed(t *testing.T) {
|
||||
env := harness.NewEnv(t, harness.EnvOptions{})
|
||||
gameID := harness.IDFromTestName(t)
|
||||
|
||||
const missingImage = "galaxy/integration-missing:0.0.0"
|
||||
harness.XAddStartJob(t, env, gameID, missingImage)
|
||||
|
||||
// Job result publishes a failure with the stable image_pull_failed
|
||||
// code.
|
||||
jobResult := harness.WaitForJobResult(t, env,
|
||||
harness.JobOutcomeIs(gameID, ports.JobOutcomeFailure),
|
||||
60*time.Second,
|
||||
)
|
||||
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, jobResult.ErrorCode)
|
||||
assert.Empty(t, jobResult.ContainerID, "failure must not surface a container id")
|
||||
assert.Empty(t, jobResult.EngineEndpoint, "failure must not surface an engine endpoint")
|
||||
assert.NotEmpty(t, jobResult.ErrorMessage, "failure must carry an operator-readable message")
|
||||
|
||||
// Notification stream carries the matching admin-only intent.
|
||||
intent := harness.WaitForNotificationIntent(t, env,
|
||||
func(entry harness.NotificationIntentEntry) bool {
|
||||
if entry.NotificationType != string(notificationintent.NotificationTypeRuntimeImagePullFailed) {
|
||||
return false
|
||||
}
|
||||
payloadGameID, _ := entry.Payload["game_id"].(string)
|
||||
return payloadGameID == gameID
|
||||
},
|
||||
30*time.Second,
|
||||
)
|
||||
require.NotNil(t, intent.Payload, "notification intent must carry a payload")
|
||||
assert.Equal(t, gameID, intent.Payload["game_id"])
|
||||
assert.Equal(t, missingImage, intent.Payload["image_ref"])
|
||||
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, intent.Payload["error_code"])
|
||||
|
||||
// PG state: no running record was installed; operation_log
|
||||
// captures one failed start with the stable error code.
|
||||
_, err := harness.RuntimeRecord(t, env, gameID)
|
||||
if err == nil {
|
||||
// If an entry was upserted (rollback gap), it must not be
|
||||
// running.
|
||||
record := harness.MustRuntimeRecord(t, env, gameID)
|
||||
assert.NotEqual(t, runtime.StatusRunning, record.Status,
|
||||
"failed image pull must not leave a running record behind")
|
||||
}
|
||||
|
||||
failureEntry := harness.EventuallyOperationKind(t, env, gameID, operation.OpKindStart, 5*time.Second)
|
||||
assert.Equal(t, operation.OutcomeFailure, failureEntry.Outcome)
|
||||
assert.Equal(t, startruntime.ErrorCodeImagePullFailed, failureEntry.ErrorCode)
|
||||
}
|
||||
|
||||
// removeContainer terminates and removes the container behind RTM's
|
||||
// back. Force=true is required because the engine has not received a
|
||||
// SIGTERM and stop signal handling is engine-internal.
|
||||
func removeContainer(t *testing.T, env *harness.Env, containerID string) {
|
||||
t.Helper()
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer cancel()
|
||||
require.NoError(t, env.Docker.Client().ContainerRemove(ctx, containerID, dockercontainer.RemoveOptions{Force: true}))
|
||||
}
|
||||
|
||||
// runManualEngineContainer bypasses RTM and starts an engine container
|
||||
// directly through the Docker SDK. The container carries every label
|
||||
// the reconciler reads at adopt time (`com.galaxy.owner`,
|
||||
// `com.galaxy.kind`, `com.galaxy.game_id`, `com.galaxy.engine_image_ref`,
|
||||
// `com.galaxy.started_at_ms`) plus the per-game hostname so the
|
||||
// computed `engine_endpoint` matches what `rtmanager` would have
|
||||
// written.
|
||||
func runManualEngineContainer(t *testing.T, env *harness.Env, gameID string) string {
|
||||
t.Helper()
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
|
||||
defer cancel()
|
||||
|
||||
hostname := "galaxy-game-" + gameID
|
||||
cfg := &dockercontainer.Config{
|
||||
Image: env.EngineImageRef,
|
||||
Hostname: hostname,
|
||||
Labels: map[string]string{
|
||||
"com.galaxy.owner": "rtmanager",
|
||||
"com.galaxy.kind": "game-engine",
|
||||
"com.galaxy.game_id": gameID,
|
||||
"com.galaxy.engine_image_ref": env.EngineImageRef,
|
||||
"com.galaxy.started_at_ms": strconv.FormatInt(time.Now().UnixMilli(), 10),
|
||||
},
|
||||
Env: []string{
|
||||
"GAME_STATE_PATH=/var/lib/galaxy-game",
|
||||
"STORAGE_PATH=/var/lib/galaxy-game",
|
||||
},
|
||||
}
|
||||
hostCfg := &dockercontainer.HostConfig{}
|
||||
netCfg := &network.NetworkingConfig{
|
||||
EndpointsConfig: map[string]*network.EndpointSettings{
|
||||
env.Network: {Aliases: []string{hostname}},
|
||||
},
|
||||
}
|
||||
containerName := fmt.Sprintf("galaxy-game-%s-manual", gameID)
|
||||
created, err := env.Docker.Client().ContainerCreate(ctx, cfg, hostCfg, netCfg, nil, containerName)
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() {
|
||||
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer removeCancel()
|
||||
_ = env.Docker.Client().ContainerRemove(removeCtx, created.ID, dockercontainer.RemoveOptions{Force: true})
|
||||
})
|
||||
|
||||
require.NoError(t, env.Docker.Client().ContainerStart(ctx, created.ID, dockercontainer.StartOptions{}))
|
||||
return created.ID
|
||||
}
|
||||
@@ -0,0 +1,493 @@
|
||||
// Package docker provides the production Docker SDK adapter that
|
||||
// implements `galaxy/rtmanager/internal/ports.DockerClient`. The
|
||||
// adapter is the single component allowed to talk to the local Docker
|
||||
// daemon; every Runtime Manager service path that needs container
|
||||
// lifecycle operations goes through this surface.
|
||||
//
|
||||
// The adapter is intentionally narrow — it does not orchestrate, log,
|
||||
// or retry. Cross-cutting concerns (lease coordination, durable state,
|
||||
// notification side-effects) live in the service layer.
|
||||
package docker
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"io"
|
||||
"maps"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
cerrdefs "github.com/containerd/errdefs"
|
||||
"github.com/docker/docker/api/types/container"
|
||||
"github.com/docker/docker/api/types/events"
|
||||
"github.com/docker/docker/api/types/filters"
|
||||
"github.com/docker/docker/api/types/image"
|
||||
"github.com/docker/docker/api/types/network"
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
"github.com/docker/go-units"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
)
|
||||
|
||||
// EnginePort is the in-container HTTP port the engine listens on. The
|
||||
// value is fixed by `rtmanager/README.md §Container Model` and by the
|
||||
// engine's Dockerfile (`game/Dockerfile`); RTM never publishes the port
|
||||
// to the host. Keeping the constant here lets the adapter own the URL
|
||||
// shape so the start service does not have to know it.
|
||||
const EnginePort = 8080
|
||||
|
||||
// Config groups the dependencies and per-process defaults required to
|
||||
// construct a Client. The struct is value-typed so wiring code can
|
||||
// build it inline without intermediate variables.
|
||||
type Config struct {
|
||||
// Docker stores the SDK client this adapter wraps. It must be
|
||||
// non-nil; callers typically construct it via `client.NewClientWithOpts`.
|
||||
Docker *dockerclient.Client
|
||||
|
||||
// LogDriver stores the Docker logging driver applied to every
|
||||
// container the adapter creates (e.g. `json-file`).
|
||||
LogDriver string
|
||||
|
||||
// LogOpts stores the comma-separated `key=value` driver options
|
||||
// forwarded to Docker. Empty disables driver-specific options.
|
||||
LogOpts string
|
||||
|
||||
// Clock supplies the wall-clock used for `RunResult.StartedAt`.
|
||||
// Defaults to `time.Now` when nil.
|
||||
Clock func() time.Time
|
||||
}
|
||||
|
||||
// Client is the production adapter implementing `ports.DockerClient`.
|
||||
// Construct it via NewClient; do not zero-initialise.
|
||||
type Client struct {
|
||||
docker *dockerclient.Client
|
||||
logDriver string
|
||||
logOpts string
|
||||
clock func() time.Time
|
||||
}
|
||||
|
||||
// NewClient constructs a Client from cfg. It returns an error if cfg
|
||||
// does not carry the minimum collaborator set the adapter needs to
|
||||
// function.
|
||||
func NewClient(cfg Config) (*Client, error) {
|
||||
if cfg.Docker == nil {
|
||||
return nil, errors.New("new docker adapter: nil docker client")
|
||||
}
|
||||
if strings.TrimSpace(cfg.LogDriver) == "" {
|
||||
return nil, errors.New("new docker adapter: log driver must not be empty")
|
||||
}
|
||||
clock := cfg.Clock
|
||||
if clock == nil {
|
||||
clock = time.Now
|
||||
}
|
||||
return &Client{
|
||||
docker: cfg.Docker,
|
||||
logDriver: cfg.LogDriver,
|
||||
logOpts: cfg.LogOpts,
|
||||
clock: clock,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// EnsureNetwork verifies the user-defined Docker network is present.
|
||||
// The adapter never creates networks; provisioning is the operator's
|
||||
// job per `rtmanager/README.md §Container Model`.
|
||||
func (client *Client) EnsureNetwork(ctx context.Context, name string) error {
|
||||
if _, err := client.docker.NetworkInspect(ctx, name, network.InspectOptions{}); err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return ports.ErrNetworkMissing
|
||||
}
|
||||
return fmt.Errorf("ensure network %q: %w", name, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// PullImage pulls ref according to policy. The pull stream is drained
|
||||
// to completion because the Docker SDK only finishes the underlying
|
||||
// pull when the body is consumed.
|
||||
func (client *Client) PullImage(ctx context.Context, ref string, policy ports.PullPolicy) error {
|
||||
if !policy.IsKnown() {
|
||||
return fmt.Errorf("pull image %q: unknown pull policy %q", ref, policy)
|
||||
}
|
||||
switch policy {
|
||||
case ports.PullPolicyAlways:
|
||||
return client.runPull(ctx, ref)
|
||||
case ports.PullPolicyIfMissing:
|
||||
if present, err := client.imagePresent(ctx, ref); err != nil {
|
||||
return err
|
||||
} else if present {
|
||||
return nil
|
||||
}
|
||||
return client.runPull(ctx, ref)
|
||||
case ports.PullPolicyNever:
|
||||
present, err := client.imagePresent(ctx, ref)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if !present {
|
||||
return ports.ErrImageNotFound
|
||||
}
|
||||
return nil
|
||||
default:
|
||||
return fmt.Errorf("pull image %q: unsupported pull policy %q", ref, policy)
|
||||
}
|
||||
}
|
||||
|
||||
// InspectImage returns image metadata for ref. RTM only reads labels
|
||||
// at start time; the broader inspect struct stays accessible for
|
||||
// diagnostics.
|
||||
func (client *Client) InspectImage(ctx context.Context, ref string) (ports.ImageInspect, error) {
|
||||
inspect, err := client.docker.ImageInspect(ctx, ref)
|
||||
if err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return ports.ImageInspect{}, ports.ErrImageNotFound
|
||||
}
|
||||
return ports.ImageInspect{}, fmt.Errorf("inspect image %q: %w", ref, err)
|
||||
}
|
||||
var labels map[string]string
|
||||
if inspect.Config != nil {
|
||||
labels = copyStringMap(inspect.Config.Labels)
|
||||
}
|
||||
return ports.ImageInspect{Ref: ref, Labels: labels}, nil
|
||||
}
|
||||
|
||||
// InspectContainer returns container metadata for containerID. The
|
||||
// adapter best-effort decodes Docker timestamps; malformed values map
|
||||
// to the zero time so callers do not have to defend against nil
|
||||
// pointers in the SDK response.
|
||||
func (client *Client) InspectContainer(ctx context.Context, containerID string) (ports.ContainerInspect, error) {
|
||||
inspect, err := client.docker.ContainerInspect(ctx, containerID)
|
||||
if err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return ports.ContainerInspect{}, ports.ErrContainerNotFound
|
||||
}
|
||||
return ports.ContainerInspect{}, fmt.Errorf("inspect container %q: %w", containerID, err)
|
||||
}
|
||||
|
||||
result := ports.ContainerInspect{ID: inspect.ID}
|
||||
if inspect.ContainerJSONBase != nil {
|
||||
result.RestartCount = inspect.RestartCount
|
||||
if inspect.State != nil {
|
||||
result.Status = string(inspect.State.Status)
|
||||
result.OOMKilled = inspect.State.OOMKilled
|
||||
result.ExitCode = inspect.State.ExitCode
|
||||
result.StartedAt = parseDockerTime(inspect.State.StartedAt)
|
||||
result.FinishedAt = parseDockerTime(inspect.State.FinishedAt)
|
||||
if inspect.State.Health != nil {
|
||||
result.Health = string(inspect.State.Health.Status)
|
||||
}
|
||||
}
|
||||
}
|
||||
if inspect.Config != nil {
|
||||
result.ImageRef = inspect.Config.Image
|
||||
result.Hostname = inspect.Config.Hostname
|
||||
result.Labels = copyStringMap(inspect.Config.Labels)
|
||||
}
|
||||
return result, nil
|
||||
}
|
||||
|
||||
// Run creates and starts one container according to spec. On
|
||||
// `ContainerStart` failure the adapter best-effort removes the partial
|
||||
// container so the start service never has to clean up after a failed
|
||||
// start path.
|
||||
func (client *Client) Run(ctx context.Context, spec ports.RunSpec) (ports.RunResult, error) {
|
||||
if err := spec.Validate(); err != nil {
|
||||
return ports.RunResult{}, fmt.Errorf("run container: %w", err)
|
||||
}
|
||||
memoryBytes, err := units.RAMInBytes(spec.Memory)
|
||||
if err != nil {
|
||||
return ports.RunResult{}, fmt.Errorf("run container %q: parse memory %q: %w", spec.Name, spec.Memory, err)
|
||||
}
|
||||
pidsLimit := int64(spec.PIDsLimit)
|
||||
|
||||
containerCfg := &container.Config{
|
||||
Image: spec.Image,
|
||||
Hostname: spec.Hostname,
|
||||
Env: envMapToSlice(spec.Env),
|
||||
Labels: copyStringMap(spec.Labels),
|
||||
Cmd: append([]string(nil), spec.Cmd...),
|
||||
}
|
||||
hostCfg := &container.HostConfig{
|
||||
Binds: bindMountsToBinds(spec.BindMounts),
|
||||
LogConfig: container.LogConfig{
|
||||
Type: client.logDriver,
|
||||
Config: parseLogOpts(client.logOpts),
|
||||
},
|
||||
Resources: container.Resources{
|
||||
NanoCPUs: int64(spec.CPUQuota * 1e9),
|
||||
Memory: memoryBytes,
|
||||
PidsLimit: &pidsLimit,
|
||||
},
|
||||
}
|
||||
netCfg := &network.NetworkingConfig{
|
||||
EndpointsConfig: map[string]*network.EndpointSettings{
|
||||
spec.Network: {
|
||||
Aliases: []string{spec.Hostname},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
created, err := client.docker.ContainerCreate(ctx, containerCfg, hostCfg, netCfg, nil, spec.Name)
|
||||
if err != nil {
|
||||
return ports.RunResult{}, fmt.Errorf("create container %q: %w", spec.Name, err)
|
||||
}
|
||||
|
||||
if err := client.docker.ContainerStart(ctx, created.ID, container.StartOptions{}); err != nil {
|
||||
client.cleanupAfterFailedStart(created.ID)
|
||||
return ports.RunResult{}, fmt.Errorf("start container %q: %w", spec.Name, err)
|
||||
}
|
||||
|
||||
return ports.RunResult{
|
||||
ContainerID: created.ID,
|
||||
EngineEndpoint: fmt.Sprintf("http://%s:%d", spec.Hostname, EnginePort),
|
||||
StartedAt: client.clock(),
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Stop bounds graceful shutdown by timeout. A missing container is
|
||||
// surfaced as ErrContainerNotFound so the service layer can treat it
|
||||
// as already-stopped per `rtmanager/README.md §Lifecycles → Stop`.
|
||||
func (client *Client) Stop(ctx context.Context, containerID string, timeout time.Duration) error {
|
||||
seconds := max(int(timeout.Round(time.Second).Seconds()), 0)
|
||||
if err := client.docker.ContainerStop(ctx, containerID, container.StopOptions{Timeout: &seconds}); err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return ports.ErrContainerNotFound
|
||||
}
|
||||
return fmt.Errorf("stop container %q: %w", containerID, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Remove removes the container without forcing kill. A missing
|
||||
// container is reported as success so callers can treat the operation
|
||||
// as idempotent.
|
||||
func (client *Client) Remove(ctx context.Context, containerID string) error {
|
||||
if err := client.docker.ContainerRemove(ctx, containerID, container.RemoveOptions{}); err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return nil
|
||||
}
|
||||
return fmt.Errorf("remove container %q: %w", containerID, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// List returns container summaries that match filter. Empty Labels
|
||||
// match every container; the reconciler always passes
|
||||
// `com.galaxy.owner=rtmanager`.
|
||||
func (client *Client) List(ctx context.Context, filter ports.ListFilter) ([]ports.ContainerSummary, error) {
|
||||
args := filters.NewArgs()
|
||||
for key, value := range filter.Labels {
|
||||
args.Add("label", key+"="+value)
|
||||
}
|
||||
summaries, err := client.docker.ContainerList(ctx, container.ListOptions{All: true, Filters: args})
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list containers: %w", err)
|
||||
}
|
||||
out := make([]ports.ContainerSummary, 0, len(summaries))
|
||||
for _, summary := range summaries {
|
||||
hostname := ""
|
||||
if len(summary.Names) > 0 {
|
||||
hostname = strings.TrimPrefix(summary.Names[0], "/")
|
||||
}
|
||||
out = append(out, ports.ContainerSummary{
|
||||
ID: summary.ID,
|
||||
ImageRef: summary.Image,
|
||||
Hostname: hostname,
|
||||
Labels: copyStringMap(summary.Labels),
|
||||
Status: string(summary.State),
|
||||
StartedAt: time.Unix(summary.Created, 0).UTC(),
|
||||
})
|
||||
}
|
||||
return out, nil
|
||||
}
|
||||
|
||||
// EventsListen subscribes to the Docker events stream and returns a
|
||||
// typed channel of decoded container events plus an asynchronous
|
||||
// error channel. The caller cancels ctx to terminate the subscription;
|
||||
// the goroutine closes both channels on termination.
|
||||
func (client *Client) EventsListen(ctx context.Context) (<-chan ports.DockerEvent, <-chan error, error) {
|
||||
msgs, sdkErrs := client.docker.Events(ctx, events.ListOptions{})
|
||||
out := make(chan ports.DockerEvent)
|
||||
outErrs := make(chan error, 1)
|
||||
|
||||
var closeOnce sync.Once
|
||||
closeAll := func() {
|
||||
closeOnce.Do(func() {
|
||||
close(out)
|
||||
close(outErrs)
|
||||
})
|
||||
}
|
||||
|
||||
go func() {
|
||||
defer closeAll()
|
||||
for {
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case msg, ok := <-msgs:
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
if msg.Type != events.ContainerEventType {
|
||||
continue
|
||||
}
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
return
|
||||
case out <- decodeEvent(msg):
|
||||
}
|
||||
case err, ok := <-sdkErrs:
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
if err == nil {
|
||||
continue
|
||||
}
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
case outErrs <- err:
|
||||
}
|
||||
return
|
||||
}
|
||||
}
|
||||
}()
|
||||
|
||||
return out, outErrs, nil
|
||||
}
|
||||
|
||||
func (client *Client) cleanupAfterFailedStart(containerID string) {
|
||||
cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
|
||||
defer cancel()
|
||||
_ = client.docker.ContainerRemove(cleanupCtx, containerID, container.RemoveOptions{Force: true})
|
||||
}
|
||||
|
||||
func (client *Client) imagePresent(ctx context.Context, ref string) (bool, error) {
|
||||
if _, err := client.docker.ImageInspect(ctx, ref); err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return false, nil
|
||||
}
|
||||
return false, fmt.Errorf("inspect image %q: %w", ref, err)
|
||||
}
|
||||
return true, nil
|
||||
}
|
||||
|
||||
func (client *Client) runPull(ctx context.Context, ref string) error {
|
||||
body, err := client.docker.ImagePull(ctx, ref, image.PullOptions{})
|
||||
if err != nil {
|
||||
if cerrdefs.IsNotFound(err) {
|
||||
return ports.ErrImageNotFound
|
||||
}
|
||||
return fmt.Errorf("pull image %q: %w", ref, err)
|
||||
}
|
||||
defer body.Close()
|
||||
if _, err := io.Copy(io.Discard, body); err != nil {
|
||||
return fmt.Errorf("drain pull stream for %q: %w", ref, err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
func envMapToSlice(envMap map[string]string) []string {
|
||||
if len(envMap) == 0 {
|
||||
return nil
|
||||
}
|
||||
out := make([]string, 0, len(envMap))
|
||||
for key, value := range envMap {
|
||||
out = append(out, key+"="+value)
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func bindMountsToBinds(mounts []ports.BindMount) []string {
|
||||
if len(mounts) == 0 {
|
||||
return nil
|
||||
}
|
||||
binds := make([]string, 0, len(mounts))
|
||||
for _, mount := range mounts {
|
||||
bind := mount.HostPath + ":" + mount.MountPath
|
||||
if mount.ReadOnly {
|
||||
bind += ":ro"
|
||||
}
|
||||
binds = append(binds, bind)
|
||||
}
|
||||
return binds
|
||||
}
|
||||
|
||||
func parseLogOpts(raw string) map[string]string {
|
||||
if strings.TrimSpace(raw) == "" {
|
||||
return nil
|
||||
}
|
||||
out := make(map[string]string)
|
||||
for part := range strings.SplitSeq(raw, ",") {
|
||||
entry := strings.TrimSpace(part)
|
||||
if entry == "" {
|
||||
continue
|
||||
}
|
||||
index := strings.IndexByte(entry, '=')
|
||||
if index <= 0 {
|
||||
continue
|
||||
}
|
||||
out[entry[:index]] = entry[index+1:]
|
||||
}
|
||||
if len(out) == 0 {
|
||||
return nil
|
||||
}
|
||||
return out
|
||||
}
|
||||
|
||||
func parseDockerTime(raw string) time.Time {
|
||||
if raw == "" {
|
||||
return time.Time{}
|
||||
}
|
||||
parsed, err := time.Parse(time.RFC3339Nano, raw)
|
||||
if err != nil {
|
||||
return time.Time{}
|
||||
}
|
||||
return parsed.UTC()
|
||||
}
|
||||
|
||||
func copyStringMap(in map[string]string) map[string]string {
|
||||
if in == nil {
|
||||
return nil
|
||||
}
|
||||
out := make(map[string]string, len(in))
|
||||
maps.Copy(out, in)
|
||||
return out
|
||||
}
|
||||
|
||||
func decodeEvent(msg events.Message) ports.DockerEvent {
|
||||
occurredAt := time.Time{}
|
||||
switch {
|
||||
case msg.TimeNano != 0:
|
||||
occurredAt = time.Unix(0, msg.TimeNano).UTC()
|
||||
case msg.Time != 0:
|
||||
occurredAt = time.Unix(msg.Time, 0).UTC()
|
||||
}
|
||||
exitCode := 0
|
||||
if raw, ok := msg.Actor.Attributes["exitCode"]; ok {
|
||||
if value, err := parseExitCode(raw); err == nil {
|
||||
exitCode = value
|
||||
}
|
||||
}
|
||||
return ports.DockerEvent{
|
||||
Action: string(msg.Action),
|
||||
ContainerID: msg.Actor.ID,
|
||||
Labels: copyStringMap(msg.Actor.Attributes),
|
||||
ExitCode: exitCode,
|
||||
OccurredAt: occurredAt,
|
||||
}
|
||||
}
|
||||
|
||||
func parseExitCode(raw string) (int, error) {
|
||||
value := 0
|
||||
for _, r := range raw {
|
||||
if r < '0' || r > '9' {
|
||||
return 0, fmt.Errorf("non-numeric exit code %q", raw)
|
||||
}
|
||||
value = value*10 + int(r-'0')
|
||||
}
|
||||
return value, nil
|
||||
}
|
||||
|
||||
// Compile-time assertion: Client implements ports.DockerClient.
|
||||
var _ ports.DockerClient = (*Client)(nil)
|
||||
@@ -0,0 +1,561 @@
|
||||
package docker
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"net/url"
|
||||
"strings"
|
||||
"sync/atomic"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
)
|
||||
|
||||
// newTestClient wires an httptest.Server backed Docker SDK client to our
|
||||
// adapter. The handler is invoked for every Docker API request issued
|
||||
// during the test; tests assert on path and method to route the
|
||||
// response.
|
||||
func newTestClient(t *testing.T, handler http.HandlerFunc) *Client {
|
||||
t.Helper()
|
||||
server := httptest.NewServer(handler)
|
||||
t.Cleanup(server.Close)
|
||||
|
||||
docker, err := dockerclient.NewClientWithOpts(
|
||||
dockerclient.WithHost(server.URL),
|
||||
dockerclient.WithHTTPClient(server.Client()),
|
||||
dockerclient.WithVersion("1.45"),
|
||||
)
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() { _ = docker.Close() })
|
||||
|
||||
client, err := NewClient(Config{
|
||||
Docker: docker,
|
||||
LogDriver: "json-file",
|
||||
LogOpts: "max-size=1m,max-file=3",
|
||||
Clock: func() time.Time { return time.Date(2026, time.April, 27, 12, 0, 0, 0, time.UTC) },
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return client
|
||||
}
|
||||
|
||||
func writeJSON(t *testing.T, w http.ResponseWriter, status int, body any) {
|
||||
t.Helper()
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(status)
|
||||
require.NoError(t, json.NewEncoder(w).Encode(body))
|
||||
}
|
||||
|
||||
func writeNotFound(t *testing.T, w http.ResponseWriter, msg string) {
|
||||
t.Helper()
|
||||
writeJSON(t, w, http.StatusNotFound, map[string]string{"message": msg})
|
||||
}
|
||||
|
||||
// Docker SDK uses /v1.45 prefix when client is pinned to API 1.45.
|
||||
func dockerPath(suffix string) string {
|
||||
return "/v1.45" + suffix
|
||||
}
|
||||
|
||||
func TestNewClientValidatesConfig(t *testing.T) {
|
||||
t.Run("nil docker client", func(t *testing.T) {
|
||||
_, err := NewClient(Config{LogDriver: "json-file"})
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "nil docker client")
|
||||
})
|
||||
t.Run("empty log driver", func(t *testing.T) {
|
||||
docker, err := dockerclient.NewClientWithOpts(dockerclient.WithHost("tcp://127.0.0.1:65535"))
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() { _ = docker.Close() })
|
||||
_, err = NewClient(Config{Docker: docker, LogDriver: " "})
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "log driver")
|
||||
})
|
||||
}
|
||||
|
||||
func TestEnsureNetwork(t *testing.T) {
|
||||
t.Run("present", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
require.Equal(t, dockerPath("/networks/galaxy-net"), r.URL.Path)
|
||||
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "net-1", "Name": "galaxy-net"})
|
||||
})
|
||||
require.NoError(t, client.EnsureNetwork(context.Background(), "galaxy-net"))
|
||||
})
|
||||
t.Run("missing", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
writeNotFound(t, w, "no such network")
|
||||
})
|
||||
err := client.EnsureNetwork(context.Background(), "missing")
|
||||
require.Error(t, err)
|
||||
assert.ErrorIs(t, err, ports.ErrNetworkMissing)
|
||||
})
|
||||
t.Run("transport error", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
http.Error(w, "boom", http.StatusInternalServerError)
|
||||
})
|
||||
err := client.EnsureNetwork(context.Background(), "x")
|
||||
require.Error(t, err)
|
||||
assert.NotErrorIs(t, err, ports.ErrNetworkMissing)
|
||||
})
|
||||
}
|
||||
|
||||
func TestInspectImage(t *testing.T) {
|
||||
t.Run("present", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
require.Equal(t, dockerPath("/images/galaxy/game:test/json"), r.URL.Path)
|
||||
writeJSON(t, w, http.StatusOK, map[string]any{
|
||||
"Id": "sha256:abc",
|
||||
"Config": map[string]any{
|
||||
"Labels": map[string]string{
|
||||
"com.galaxy.cpu_quota": "1.0",
|
||||
"com.galaxy.memory": "512m",
|
||||
"com.galaxy.pids_limit": "512",
|
||||
},
|
||||
},
|
||||
})
|
||||
})
|
||||
got, err := client.InspectImage(context.Background(), "galaxy/game:test")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "galaxy/game:test", got.Ref)
|
||||
assert.Equal(t, "1.0", got.Labels["com.galaxy.cpu_quota"])
|
||||
assert.Equal(t, "512m", got.Labels["com.galaxy.memory"])
|
||||
})
|
||||
t.Run("not found", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
writeNotFound(t, w, "no such image")
|
||||
})
|
||||
_, err := client.InspectImage(context.Background(), "galaxy/missing:tag")
|
||||
require.Error(t, err)
|
||||
assert.ErrorIs(t, err, ports.ErrImageNotFound)
|
||||
})
|
||||
}
|
||||
|
||||
func TestInspectContainer(t *testing.T) {
|
||||
t.Run("present", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
require.Equal(t, dockerPath("/containers/cont-1/json"), r.URL.Path)
|
||||
writeJSON(t, w, http.StatusOK, map[string]any{
|
||||
"Id": "cont-1",
|
||||
"RestartCount": 2,
|
||||
"State": map[string]any{
|
||||
"Status": "running",
|
||||
"OOMKilled": false,
|
||||
"ExitCode": 0,
|
||||
"StartedAt": "2026-04-27T11:00:00.5Z",
|
||||
"FinishedAt": "0001-01-01T00:00:00Z",
|
||||
"Health": map[string]any{"Status": "healthy"},
|
||||
},
|
||||
"Config": map[string]any{
|
||||
"Image": "galaxy/game:test",
|
||||
"Hostname": "galaxy-game-game-1",
|
||||
"Labels": map[string]string{
|
||||
"com.galaxy.owner": "rtmanager",
|
||||
"com.galaxy.game_id": "game-1",
|
||||
},
|
||||
},
|
||||
})
|
||||
})
|
||||
got, err := client.InspectContainer(context.Background(), "cont-1")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "cont-1", got.ID)
|
||||
assert.Equal(t, 2, got.RestartCount)
|
||||
assert.Equal(t, "running", got.Status)
|
||||
assert.Equal(t, "healthy", got.Health)
|
||||
assert.Equal(t, "galaxy/game:test", got.ImageRef)
|
||||
assert.Equal(t, "galaxy-game-game-1", got.Hostname)
|
||||
assert.Equal(t, "rtmanager", got.Labels["com.galaxy.owner"])
|
||||
assert.False(t, got.StartedAt.IsZero())
|
||||
})
|
||||
t.Run("not found", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
writeNotFound(t, w, "no such container")
|
||||
})
|
||||
_, err := client.InspectContainer(context.Background(), "missing")
|
||||
require.Error(t, err)
|
||||
assert.ErrorIs(t, err, ports.ErrContainerNotFound)
|
||||
})
|
||||
}
|
||||
|
||||
func TestPullImagePolicies(t *testing.T) {
|
||||
t.Run("if_missing/found skips pull", func(t *testing.T) {
|
||||
hits := struct {
|
||||
inspect atomic.Int32
|
||||
pull atomic.Int32
|
||||
}{}
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
switch {
|
||||
case strings.HasSuffix(r.URL.Path, "/json") && r.Method == http.MethodGet:
|
||||
hits.inspect.Add(1)
|
||||
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "sha256:x"})
|
||||
case strings.Contains(r.URL.Path, "/images/create"):
|
||||
hits.pull.Add(1)
|
||||
w.WriteHeader(http.StatusOK)
|
||||
default:
|
||||
t.Fatalf("unexpected request %s %s", r.Method, r.URL.Path)
|
||||
}
|
||||
})
|
||||
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyIfMissing))
|
||||
assert.Equal(t, int32(1), hits.inspect.Load())
|
||||
assert.Equal(t, int32(0), hits.pull.Load())
|
||||
})
|
||||
t.Run("if_missing/absent triggers pull", func(t *testing.T) {
|
||||
hits := struct {
|
||||
inspect atomic.Int32
|
||||
pull atomic.Int32
|
||||
}{}
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
switch {
|
||||
case strings.HasSuffix(r.URL.Path, "/json") && r.Method == http.MethodGet:
|
||||
hits.inspect.Add(1)
|
||||
writeNotFound(t, w, "no such image")
|
||||
case strings.Contains(r.URL.Path, "/images/create"):
|
||||
hits.pull.Add(1)
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_, _ = io.WriteString(w, `{"status":"Pulling..."}`+"\n"+`{"status":"Done"}`+"\n")
|
||||
default:
|
||||
t.Fatalf("unexpected request %s %s", r.Method, r.URL.Path)
|
||||
}
|
||||
})
|
||||
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyIfMissing))
|
||||
assert.Equal(t, int32(1), hits.inspect.Load())
|
||||
assert.Equal(t, int32(1), hits.pull.Load())
|
||||
})
|
||||
t.Run("always pulls regardless of cache", func(t *testing.T) {
|
||||
var pullCount atomic.Int32
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Contains(t, r.URL.Path, "/images/create")
|
||||
pullCount.Add(1)
|
||||
w.WriteHeader(http.StatusOK)
|
||||
})
|
||||
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyAlways))
|
||||
assert.Equal(t, int32(1), pullCount.Load())
|
||||
})
|
||||
t.Run("never with absent image", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
writeNotFound(t, w, "no such image")
|
||||
})
|
||||
err := client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyNever)
|
||||
require.Error(t, err)
|
||||
assert.ErrorIs(t, err, ports.ErrImageNotFound)
|
||||
})
|
||||
t.Run("never with present image", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
writeJSON(t, w, http.StatusOK, map[string]any{"Id": "x"})
|
||||
})
|
||||
require.NoError(t, client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicyNever))
|
||||
})
|
||||
t.Run("unknown policy", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
t.Fatal("must not call docker on unknown policy")
|
||||
})
|
||||
err := client.PullImage(context.Background(), "alpine:3.21", ports.PullPolicy("invalid"))
|
||||
require.Error(t, err)
|
||||
})
|
||||
}
|
||||
|
||||
func TestRunHappyPath(t *testing.T) {
|
||||
calls := struct {
|
||||
create atomic.Int32
|
||||
start atomic.Int32
|
||||
remove atomic.Int32
|
||||
}{}
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
switch {
|
||||
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/containers/create"):
|
||||
calls.create.Add(1)
|
||||
require.Equal(t, "galaxy-game-game-1", r.URL.Query().Get("name"))
|
||||
writeJSON(t, w, http.StatusCreated, map[string]any{"Id": "cont-new", "Warnings": []string{}})
|
||||
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/start"):
|
||||
calls.start.Add(1)
|
||||
require.Equal(t, dockerPath("/containers/cont-new/start"), r.URL.Path)
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
case r.Method == http.MethodDelete && strings.HasPrefix(r.URL.Path, dockerPath("/containers/")):
|
||||
calls.remove.Add(1)
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
default:
|
||||
t.Fatalf("unexpected %s %s", r.Method, r.URL.Path)
|
||||
}
|
||||
})
|
||||
|
||||
result, err := client.Run(context.Background(), ports.RunSpec{
|
||||
Name: "galaxy-game-game-1",
|
||||
Image: "galaxy/game:test",
|
||||
Hostname: "galaxy-game-game-1",
|
||||
Network: "galaxy-net",
|
||||
Env: map[string]string{
|
||||
"GAME_STATE_PATH": "/var/lib/galaxy-game",
|
||||
"STORAGE_PATH": "/var/lib/galaxy-game",
|
||||
},
|
||||
Labels: map[string]string{"com.galaxy.owner": "rtmanager"},
|
||||
LogDriver: "json-file",
|
||||
BindMounts: []ports.BindMount{
|
||||
{HostPath: "/var/lib/galaxy/games/game-1", MountPath: "/var/lib/galaxy-game"},
|
||||
},
|
||||
CPUQuota: 1.0,
|
||||
Memory: "512m",
|
||||
PIDsLimit: 512,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "cont-new", result.ContainerID)
|
||||
assert.Equal(t, "http://galaxy-game-game-1:8080", result.EngineEndpoint)
|
||||
assert.False(t, result.StartedAt.IsZero())
|
||||
assert.Equal(t, int32(1), calls.create.Load())
|
||||
assert.Equal(t, int32(1), calls.start.Load())
|
||||
assert.Equal(t, int32(0), calls.remove.Load())
|
||||
}
|
||||
|
||||
func TestRunStartFailureRemovesContainer(t *testing.T) {
|
||||
calls := struct {
|
||||
create atomic.Int32
|
||||
start atomic.Int32
|
||||
remove atomic.Int32
|
||||
}{}
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
switch {
|
||||
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/containers/create"):
|
||||
calls.create.Add(1)
|
||||
writeJSON(t, w, http.StatusCreated, map[string]any{"Id": "cont-x"})
|
||||
case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/start"):
|
||||
calls.start.Add(1)
|
||||
http.Error(w, `{"message":"insufficient host resources"}`, http.StatusInternalServerError)
|
||||
case r.Method == http.MethodDelete && strings.HasPrefix(r.URL.Path, dockerPath("/containers/cont-x")):
|
||||
calls.remove.Add(1)
|
||||
require.Equal(t, "1", r.URL.Query().Get("force"))
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
default:
|
||||
t.Fatalf("unexpected %s %s", r.Method, r.URL.Path)
|
||||
}
|
||||
})
|
||||
|
||||
_, err := client.Run(context.Background(), ports.RunSpec{
|
||||
Name: "x",
|
||||
Image: "img",
|
||||
Hostname: "x",
|
||||
Network: "n",
|
||||
LogDriver: "json-file",
|
||||
CPUQuota: 1.0,
|
||||
Memory: "64m",
|
||||
PIDsLimit: 64,
|
||||
})
|
||||
require.Error(t, err)
|
||||
assert.Equal(t, int32(1), calls.create.Load())
|
||||
assert.Equal(t, int32(1), calls.start.Load())
|
||||
assert.Equal(t, int32(1), calls.remove.Load(), "adapter must roll back the partial container")
|
||||
}
|
||||
|
||||
func TestRunRejectsInvalidSpec(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
t.Fatal("must not contact docker on invalid spec")
|
||||
})
|
||||
_, err := client.Run(context.Background(), ports.RunSpec{Name: "x"})
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "image must not be empty")
|
||||
}
|
||||
|
||||
func TestStop(t *testing.T) {
|
||||
t.Run("graceful stop", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodPost, r.Method)
|
||||
require.Equal(t, dockerPath("/containers/cont-1/stop"), r.URL.Path)
|
||||
require.Equal(t, "30", r.URL.Query().Get("t"))
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
})
|
||||
require.NoError(t, client.Stop(context.Background(), "cont-1", 30*time.Second))
|
||||
})
|
||||
t.Run("missing container", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
writeNotFound(t, w, "no such container")
|
||||
})
|
||||
err := client.Stop(context.Background(), "missing", 30*time.Second)
|
||||
assert.ErrorIs(t, err, ports.ErrContainerNotFound)
|
||||
})
|
||||
t.Run("negative timeout normalised to zero", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, "0", r.URL.Query().Get("t"))
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
})
|
||||
require.NoError(t, client.Stop(context.Background(), "x", -5*time.Second))
|
||||
})
|
||||
}
|
||||
|
||||
func TestRemoveIsIdempotent(t *testing.T) {
|
||||
t.Run("present", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodDelete, r.Method)
|
||||
w.WriteHeader(http.StatusNoContent)
|
||||
})
|
||||
require.NoError(t, client.Remove(context.Background(), "cont-1"))
|
||||
})
|
||||
t.Run("missing", func(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
writeNotFound(t, w, "no such container")
|
||||
})
|
||||
require.NoError(t, client.Remove(context.Background(), "missing"))
|
||||
})
|
||||
}
|
||||
|
||||
func TestListAppliesLabelFilter(t *testing.T) {
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
require.Equal(t, dockerPath("/containers/json"), r.URL.Path)
|
||||
require.Equal(t, "1", r.URL.Query().Get("all"))
|
||||
|
||||
filtersRaw := r.URL.Query().Get("filters")
|
||||
require.NotEmpty(t, filtersRaw)
|
||||
var args map[string]map[string]bool
|
||||
require.NoError(t, json.Unmarshal([]byte(filtersRaw), &args))
|
||||
require.True(t, args["label"]["com.galaxy.owner=rtmanager"])
|
||||
|
||||
writeJSON(t, w, http.StatusOK, []map[string]any{
|
||||
{
|
||||
"Id": "cont-a",
|
||||
"Image": "galaxy/game:1.2.3",
|
||||
"Names": []string{"/galaxy-game-game-1"},
|
||||
"Labels": map[string]string{"com.galaxy.owner": "rtmanager"},
|
||||
"State": "running",
|
||||
"Created": int64(1700000000),
|
||||
},
|
||||
})
|
||||
})
|
||||
|
||||
got, err := client.List(context.Background(), ports.ListFilter{
|
||||
Labels: map[string]string{"com.galaxy.owner": "rtmanager"},
|
||||
})
|
||||
require.NoError(t, err)
|
||||
require.Len(t, got, 1)
|
||||
assert.Equal(t, "cont-a", got[0].ID)
|
||||
assert.Equal(t, "galaxy/game:1.2.3", got[0].ImageRef)
|
||||
assert.Equal(t, "galaxy-game-game-1", got[0].Hostname)
|
||||
assert.Equal(t, "running", got[0].Status)
|
||||
assert.False(t, got[0].StartedAt.IsZero())
|
||||
assert.Equal(t, "rtmanager", got[0].Labels["com.galaxy.owner"])
|
||||
}
|
||||
|
||||
func TestEventsListenDecodesContainerEvents(t *testing.T) {
|
||||
mu := make(chan struct{})
|
||||
client := newTestClient(t, func(w http.ResponseWriter, r *http.Request) {
|
||||
require.Equal(t, http.MethodGet, r.Method)
|
||||
require.Equal(t, dockerPath("/events"), r.URL.Path)
|
||||
|
||||
flusher, ok := w.(http.Flusher)
|
||||
require.True(t, ok)
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
flusher.Flush()
|
||||
|
||||
// Container start event
|
||||
writeEvent(t, w, "container", "start", "cont-1", map[string]string{
|
||||
"image": "galaxy/game:1.2.3",
|
||||
"name": "galaxy-game-game-1",
|
||||
"com.galaxy.game_id": "game-1",
|
||||
}, time.Now())
|
||||
flusher.Flush()
|
||||
|
||||
// Container die event with exit code 137
|
||||
writeEvent(t, w, "container", "die", "cont-1", map[string]string{
|
||||
"exitCode": "137",
|
||||
}, time.Now())
|
||||
flusher.Flush()
|
||||
|
||||
// Image event must be filtered out by adapter
|
||||
writeEvent(t, w, "image", "pull", "img", nil, time.Now())
|
||||
flusher.Flush()
|
||||
|
||||
<-mu
|
||||
})
|
||||
defer close(mu)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
defer cancel()
|
||||
|
||||
events, _, err := client.EventsListen(ctx)
|
||||
require.NoError(t, err)
|
||||
|
||||
got := []ports.DockerEvent{}
|
||||
deadline := time.After(2 * time.Second)
|
||||
for len(got) < 2 {
|
||||
select {
|
||||
case ev, ok := <-events:
|
||||
if !ok {
|
||||
t.Fatalf("events channel closed; got %d events", len(got))
|
||||
}
|
||||
got = append(got, ev)
|
||||
case <-deadline:
|
||||
t.Fatalf("did not receive expected events; have %d", len(got))
|
||||
}
|
||||
}
|
||||
require.Len(t, got, 2)
|
||||
assert.Equal(t, "start", got[0].Action)
|
||||
assert.Equal(t, "cont-1", got[0].ContainerID)
|
||||
assert.Equal(t, "game-1", got[0].Labels["com.galaxy.game_id"])
|
||||
assert.Equal(t, "die", got[1].Action)
|
||||
assert.Equal(t, 137, got[1].ExitCode)
|
||||
}
|
||||
|
||||
func writeEvent(t *testing.T, w io.Writer, eventType, action, id string, attributes map[string]string, when time.Time) {
|
||||
t.Helper()
|
||||
payload := map[string]any{
|
||||
"Type": eventType,
|
||||
"Action": action,
|
||||
"Actor": map[string]any{"ID": id, "Attributes": attributes},
|
||||
"time": when.Unix(),
|
||||
"timeNano": when.UnixNano(),
|
||||
}
|
||||
data, err := json.Marshal(payload)
|
||||
require.NoError(t, err)
|
||||
_, err = fmt.Fprintln(w, string(data))
|
||||
require.NoError(t, err)
|
||||
}
|
||||
|
||||
// Sanity: parsing helpers.
|
||||
func TestParseLogOpts(t *testing.T) {
|
||||
got := parseLogOpts("max-size=1m,max-file=3, ,empty=,=novalue")
|
||||
assert.Equal(t, "1m", got["max-size"])
|
||||
assert.Equal(t, "3", got["max-file"])
|
||||
assert.Equal(t, "", got["empty"])
|
||||
_, hasNovalue := got["=novalue"]
|
||||
assert.False(t, hasNovalue)
|
||||
}
|
||||
|
||||
func TestParseDockerTime(t *testing.T) {
|
||||
assert.True(t, parseDockerTime("").IsZero())
|
||||
assert.True(t, parseDockerTime("not-a-date").IsZero())
|
||||
parsed := parseDockerTime("2026-04-27T11:00:00.5Z")
|
||||
assert.False(t, parsed.IsZero())
|
||||
assert.Equal(t, time.UTC, parsed.Location())
|
||||
}
|
||||
|
||||
func TestEnvMapToSliceDeterministicLength(t *testing.T) {
|
||||
got := envMapToSlice(map[string]string{"A": "1", "B": "2"})
|
||||
assert.Len(t, got, 2)
|
||||
for _, kv := range got {
|
||||
assert.Contains(t, []string{"A=1", "B=2"}, kv)
|
||||
}
|
||||
assert.Nil(t, envMapToSlice(nil))
|
||||
}
|
||||
|
||||
// Runtime sanity: make sure the errors.Is wiring for the sentinel errors stays intact.
|
||||
func TestSentinelErrorsAreDistinct(t *testing.T) {
|
||||
require.True(t, errors.Is(ports.ErrNetworkMissing, ports.ErrNetworkMissing))
|
||||
require.False(t, errors.Is(ports.ErrNetworkMissing, ports.ErrImageNotFound))
|
||||
}
|
||||
|
||||
func TestURLPathEscapingForCharacters(t *testing.T) {
	// Ensure the SDK URL path encodes special characters; the adapter
	// passes raw inputs through and lets the SDK escape.
	assert.Equal(t, "game-1", url.PathEscape("game-1"))
	assert.Equal(t, "game%2F1", url.PathEscape("game/1"))
}
|
||||
@@ -0,0 +1,175 @@
|
||||
// Code generated by MockGen. DO NOT EDIT.
|
||||
// Source: galaxy/rtmanager/internal/ports (interfaces: DockerClient)
|
||||
//
|
||||
// Generated by this command:
|
||||
//
|
||||
// mockgen -destination=../adapters/docker/mocks/mock_dockerclient.go -package=mocks galaxy/rtmanager/internal/ports DockerClient
|
||||
//
|
||||
|
||||
// Package mocks is a generated GoMock package.
|
||||
package mocks
|
||||
|
||||
import (
|
||||
context "context"
|
||||
ports "galaxy/rtmanager/internal/ports"
|
||||
reflect "reflect"
|
||||
time "time"
|
||||
|
||||
gomock "go.uber.org/mock/gomock"
|
||||
)
|
||||
|
||||
// MockDockerClient is a mock of DockerClient interface.
|
||||
type MockDockerClient struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockDockerClientMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockDockerClientMockRecorder is the mock recorder for MockDockerClient.
|
||||
type MockDockerClientMockRecorder struct {
|
||||
mock *MockDockerClient
|
||||
}
|
||||
|
||||
// NewMockDockerClient creates a new mock instance.
|
||||
func NewMockDockerClient(ctrl *gomock.Controller) *MockDockerClient {
|
||||
mock := &MockDockerClient{ctrl: ctrl}
|
||||
mock.recorder = &MockDockerClientMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockDockerClient) EXPECT() *MockDockerClientMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// EnsureNetwork mocks base method.
|
||||
func (m *MockDockerClient) EnsureNetwork(ctx context.Context, name string) error {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "EnsureNetwork", ctx, name)
|
||||
ret0, _ := ret[0].(error)
|
||||
return ret0
|
||||
}
|
||||
|
||||
// EnsureNetwork indicates an expected call of EnsureNetwork.
|
||||
func (mr *MockDockerClientMockRecorder) EnsureNetwork(ctx, name any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "EnsureNetwork", reflect.TypeOf((*MockDockerClient)(nil).EnsureNetwork), ctx, name)
|
||||
}
|
||||
|
||||
// EventsListen mocks base method.
|
||||
func (m *MockDockerClient) EventsListen(ctx context.Context) (<-chan ports.DockerEvent, <-chan error, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "EventsListen", ctx)
|
||||
ret0, _ := ret[0].(<-chan ports.DockerEvent)
|
||||
ret1, _ := ret[1].(<-chan error)
|
||||
ret2, _ := ret[2].(error)
|
||||
return ret0, ret1, ret2
|
||||
}
|
||||
|
||||
// EventsListen indicates an expected call of EventsListen.
|
||||
func (mr *MockDockerClientMockRecorder) EventsListen(ctx any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "EventsListen", reflect.TypeOf((*MockDockerClient)(nil).EventsListen), ctx)
|
||||
}
|
||||
|
||||
// InspectContainer mocks base method.
|
||||
func (m *MockDockerClient) InspectContainer(ctx context.Context, containerID string) (ports.ContainerInspect, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "InspectContainer", ctx, containerID)
|
||||
ret0, _ := ret[0].(ports.ContainerInspect)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// InspectContainer indicates an expected call of InspectContainer.
|
||||
func (mr *MockDockerClientMockRecorder) InspectContainer(ctx, containerID any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "InspectContainer", reflect.TypeOf((*MockDockerClient)(nil).InspectContainer), ctx, containerID)
|
||||
}
|
||||
|
||||
// InspectImage mocks base method.
|
||||
func (m *MockDockerClient) InspectImage(ctx context.Context, ref string) (ports.ImageInspect, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "InspectImage", ctx, ref)
|
||||
ret0, _ := ret[0].(ports.ImageInspect)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// InspectImage indicates an expected call of InspectImage.
|
||||
func (mr *MockDockerClientMockRecorder) InspectImage(ctx, ref any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "InspectImage", reflect.TypeOf((*MockDockerClient)(nil).InspectImage), ctx, ref)
|
||||
}
|
||||
|
||||
// List mocks base method.
|
||||
func (m *MockDockerClient) List(ctx context.Context, filter ports.ListFilter) ([]ports.ContainerSummary, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "List", ctx, filter)
|
||||
ret0, _ := ret[0].([]ports.ContainerSummary)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// List indicates an expected call of List.
|
||||
func (mr *MockDockerClientMockRecorder) List(ctx, filter any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "List", reflect.TypeOf((*MockDockerClient)(nil).List), ctx, filter)
|
||||
}
|
||||
|
||||
// PullImage mocks base method.
|
||||
func (m *MockDockerClient) PullImage(ctx context.Context, ref string, policy ports.PullPolicy) error {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "PullImage", ctx, ref, policy)
|
||||
ret0, _ := ret[0].(error)
|
||||
return ret0
|
||||
}
|
||||
|
||||
// PullImage indicates an expected call of PullImage.
|
||||
func (mr *MockDockerClientMockRecorder) PullImage(ctx, ref, policy any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "PullImage", reflect.TypeOf((*MockDockerClient)(nil).PullImage), ctx, ref, policy)
|
||||
}
|
||||
|
||||
// Remove mocks base method.
|
||||
func (m *MockDockerClient) Remove(ctx context.Context, containerID string) error {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Remove", ctx, containerID)
|
||||
ret0, _ := ret[0].(error)
|
||||
return ret0
|
||||
}
|
||||
|
||||
// Remove indicates an expected call of Remove.
|
||||
func (mr *MockDockerClientMockRecorder) Remove(ctx, containerID any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Remove", reflect.TypeOf((*MockDockerClient)(nil).Remove), ctx, containerID)
|
||||
}
|
||||
|
||||
// Run mocks base method.
|
||||
func (m *MockDockerClient) Run(ctx context.Context, spec ports.RunSpec) (ports.RunResult, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Run", ctx, spec)
|
||||
ret0, _ := ret[0].(ports.RunResult)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Run indicates an expected call of Run.
|
||||
func (mr *MockDockerClientMockRecorder) Run(ctx, spec any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Run", reflect.TypeOf((*MockDockerClient)(nil).Run), ctx, spec)
|
||||
}
|
||||
|
||||
// Stop mocks base method.
|
||||
func (m *MockDockerClient) Stop(ctx context.Context, containerID string, timeout time.Duration) error {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Stop", ctx, containerID, timeout)
|
||||
ret0, _ := ret[0].(error)
|
||||
return ret0
|
||||
}
|
||||
|
||||
// Stop indicates an expected call of Stop.
|
||||
func (mr *MockDockerClientMockRecorder) Stop(ctx, containerID, timeout any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Stop", reflect.TypeOf((*MockDockerClient)(nil).Stop), ctx, containerID, timeout)
|
||||
}
|
||||
@@ -0,0 +1,11 @@
|
||||
package mocks

import (
	"galaxy/rtmanager/internal/ports"
)

// Compile-time assertion that the generated mock satisfies the port
// interface. Future signature drift between the port and the generated
// file fails the build at this line, which is more actionable than a
// runtime check from a service test.
var _ ports.DockerClient = (*MockDockerClient)(nil)
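
// A hedged usage sketch of the generated mock inside a service test; the
// expectation shown is illustrative only and not prescribed by this
// package:
//
//	ctrl := gomock.NewController(t)
//	dockerMock := mocks.NewMockDockerClient(ctrl)
//	dockerMock.EXPECT().
//		Remove(gomock.Any(), "cont-1").
//		Return(nil)
//	// pass dockerMock wherever a ports.DockerClient is expected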
|
||||
@@ -0,0 +1,202 @@
|
||||
// Package docker smoke tests exercise the production adapter against a
// real Docker daemon. The tests skip when no Docker socket is reachable
// (`skipUnlessDockerAvailable`), so they run in the default
// `go test ./...` pass without a build tag.
package docker
|
||||
|
||||
import (
|
||||
"context"
|
||||
"crypto/rand"
|
||||
"encoding/hex"
|
||||
"errors"
|
||||
"os"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/docker/docker/api/types/network"
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
)
|
||||
|
||||
const (
|
||||
smokeImage = "alpine:3.21"
|
||||
smokeNetPrefix = "rtmanager-smoke-"
|
||||
)
|
||||
|
||||
func skipUnlessDockerAvailable(t *testing.T) {
|
||||
t.Helper()
|
||||
if os.Getenv("DOCKER_HOST") == "" {
|
||||
if _, err := os.Stat("/var/run/docker.sock"); err != nil {
|
||||
t.Skip("docker daemon not available; set DOCKER_HOST or expose /var/run/docker.sock")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func newSmokeAdapter(t *testing.T) (*Client, *dockerclient.Client) {
|
||||
t.Helper()
|
||||
|
||||
docker, err := dockerclient.NewClientWithOpts(dockerclient.FromEnv, dockerclient.WithAPIVersionNegotiation())
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() { _ = docker.Close() })
|
||||
|
||||
pingCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
|
||||
defer cancel()
|
||||
if _, err := docker.Ping(pingCtx); err != nil {
|
||||
// A reachable socket path may still be unusable in sandboxed
|
||||
// environments (e.g., macOS sandbox blocking the colima socket).
|
||||
// The smoke test can only run when the daemon answers ping, so a
|
||||
// permission-denied / connection-refused error is a runtime
|
||||
// "Docker unavailable" signal and skips the test.
|
||||
t.Skipf("docker daemon unavailable: %v", err)
|
||||
}
|
||||
|
||||
adapter, err := NewClient(Config{
|
||||
Docker: docker,
|
||||
LogDriver: "json-file",
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return adapter, docker
|
||||
}
|
||||
|
||||
func uniqueSuffix(t *testing.T) string {
|
||||
t.Helper()
|
||||
buf := make([]byte, 4)
|
||||
_, err := rand.Read(buf)
|
||||
require.NoError(t, err)
|
||||
return hex.EncodeToString(buf)
|
||||
}
|
||||
|
||||
// TestSmokeFullLifecycle runs the adapter through every method against
|
||||
// the real Docker daemon: ensure-network → pull → run → events →
|
||||
// stop → remove.
|
||||
func TestSmokeFullLifecycle(t *testing.T) {
|
||||
skipUnlessDockerAvailable(t)
|
||||
|
||||
adapter, docker := newSmokeAdapter(t)
|
||||
|
||||
suffix := uniqueSuffix(t)
|
||||
netName := smokeNetPrefix + suffix
|
||||
containerName := "rtmanager-smoke-cont-" + suffix
|
||||
|
||||
// Step 1 — provision a temporary user-defined bridge network.
|
||||
createCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer cancel()
|
||||
_, err := docker.NetworkCreate(createCtx, netName, network.CreateOptions{Driver: "bridge"})
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() {
|
||||
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer removeCancel()
|
||||
_ = docker.NetworkRemove(removeCtx, netName)
|
||||
})
|
||||
|
||||
// Step 2 — EnsureNetwork present and missing paths.
|
||||
require.NoError(t, adapter.EnsureNetwork(createCtx, netName))
|
||||
missingErr := adapter.EnsureNetwork(createCtx, "rtmanager-smoke-missing-"+suffix)
|
||||
require.Error(t, missingErr)
|
||||
assert.ErrorIs(t, missingErr, ports.ErrNetworkMissing)
|
||||
|
||||
// Step 3 — pull alpine via the configured policy.
|
||||
pullCtx, pullCancel := context.WithTimeout(context.Background(), 5*time.Minute)
|
||||
defer pullCancel()
|
||||
require.NoError(t, adapter.PullImage(pullCtx, smokeImage, ports.PullPolicyIfMissing))
|
||||
|
||||
// Step 4 — subscribe to events before running the container so we
|
||||
// observe the start event.
|
||||
listenCtx, listenCancel := context.WithCancel(context.Background())
|
||||
defer listenCancel()
|
||||
events, listenErrs, err := adapter.EventsListen(listenCtx)
|
||||
require.NoError(t, err)
|
||||
|
||||
// Step 5 — run a tiny container that sleeps so we can observe it.
|
||||
stateDir := t.TempDir()
|
||||
runCtx, runCancel := context.WithTimeout(context.Background(), 60*time.Second)
|
||||
defer runCancel()
|
||||
result, err := adapter.Run(runCtx, ports.RunSpec{
|
||||
Name: containerName,
|
||||
Image: smokeImage,
|
||||
Hostname: "smoke-" + suffix,
|
||||
Network: netName,
|
||||
Env: map[string]string{
|
||||
"GAME_STATE_PATH": "/tmp/state",
|
||||
"STORAGE_PATH": "/tmp/state",
|
||||
},
|
||||
Labels: map[string]string{
|
||||
"com.galaxy.owner": "rtmanager",
|
||||
"com.galaxy.kind": "smoke",
|
||||
},
|
||||
BindMounts: []ports.BindMount{
|
||||
{HostPath: stateDir, MountPath: "/tmp/state"},
|
||||
},
|
||||
LogDriver: "json-file",
|
||||
CPUQuota: 0.5,
|
||||
Memory: "64m",
|
||||
PIDsLimit: 32,
|
||||
Cmd: []string{"/bin/sh", "-c", "sleep 60"},
|
||||
})
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() {
|
||||
removeCtx, removeCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer removeCancel()
|
||||
_ = adapter.Remove(removeCtx, result.ContainerID)
|
||||
})
|
||||
|
||||
require.NotEmpty(t, result.ContainerID)
|
||||
require.Equal(t, "http://smoke-"+suffix+":8080", result.EngineEndpoint)
|
||||
|
||||
// Step 6 — wait for a `start` event for the new container id.
|
||||
startObserved := waitForEvent(t, events, listenErrs, "start", result.ContainerID, 15*time.Second)
|
||||
require.True(t, startObserved, "did not observe start event for container %s", result.ContainerID)
|
||||
|
||||
// Step 7 — InspectContainer returns running state.
|
||||
inspectCtx, inspectCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer inspectCancel()
|
||||
inspect, err := adapter.InspectContainer(inspectCtx, result.ContainerID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "running", inspect.Status)
|
||||
|
||||
// Step 8 — Stop, then Remove, then InspectContainer must report
|
||||
// not found.
|
||||
stopCtx, stopCancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer stopCancel()
|
||||
require.NoError(t, adapter.Stop(stopCtx, result.ContainerID, 5*time.Second))
|
||||
|
||||
require.NoError(t, adapter.Remove(stopCtx, result.ContainerID))
|
||||
|
||||
if _, err := adapter.InspectContainer(stopCtx, result.ContainerID); !errors.Is(err, ports.ErrContainerNotFound) {
|
||||
t.Fatalf("expected ErrContainerNotFound, got %v", err)
|
||||
}
|
||||
|
||||
// Step 9 — terminate the events subscription cleanly.
|
||||
listenCancel()
|
||||
select {
|
||||
case _, ok := <-events:
|
||||
_ = ok
|
||||
case <-time.After(5 * time.Second):
|
||||
t.Log("events channel did not close within timeout (best-effort)")
|
||||
}
|
||||
}
|
||||
|
||||
func waitForEvent(t *testing.T, events <-chan ports.DockerEvent, errs <-chan error, action, containerID string, timeout time.Duration) bool {
|
||||
t.Helper()
|
||||
deadline := time.After(timeout)
|
||||
for {
|
||||
select {
|
||||
case ev, ok := <-events:
|
||||
if !ok {
|
||||
return false
|
||||
}
|
||||
if ev.Action == action && ev.ContainerID == containerID {
|
||||
return true
|
||||
}
|
||||
case err := <-errs:
|
||||
if err != nil {
|
||||
t.Fatalf("events stream error: %v", err)
|
||||
}
|
||||
case <-deadline:
|
||||
return false
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,165 @@
|
||||
// Package healtheventspublisher provides the Redis-Streams-backed
// publisher for `runtime:health_events`. Every Publish call upserts the
// latest `health_snapshots` row before XADDing the event so consumers
// observing the snapshot store can never lag the event stream by more
// than the duration of one network call.
//
// The publisher is shared across `ports.HealthEventPublisher` callers:
// the start service emits `container_started`; the probe, inspect, and
// events-listener workers emit the rest. The publisher's surface is
// stable across all of them.
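//
// A minimal wiring sketch; the redis client, snapshot store, ctx, and
// error handling shown here are illustrative assumptions, not part of
// this package:
//
//	publisher, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{
//		Client:    redisClient,   // shared *redis.Client
//		Snapshots: snapshotStore, // a ports.HealthSnapshotStore implementation
//		Stream:    "runtime:health_events",
//	})
//	if err != nil {
//		// fail fast at startup
//	}
//	err = publisher.Publish(ctx, ports.HealthEventEnvelope{
//		GameID:      "game-1",
//		ContainerID: "cont-1",
//		EventType:   health.EventTypeContainerStarted,
//		OccurredAt:  time.Now().UTC(),
//		Details:     json.RawMessage(`{"image_ref":"galaxy/game:1.2.3"}`),
//	})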
package healtheventspublisher
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strconv"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/health"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// emptyDetails is the canonical JSON payload installed when the caller
|
||||
// supplies an empty Details slice. Matches the SQL DEFAULT for
|
||||
// `health_snapshots.details`.
|
||||
const emptyDetails = "{}"
|
||||
|
||||
// Wire field names used by the Redis Streams payload. Frozen by
|
||||
// `rtmanager/api/runtime-health-asyncapi.yaml`; renaming any of them
|
||||
// breaks consumers.
|
||||
const (
|
||||
fieldGameID = "game_id"
|
||||
fieldContainerID = "container_id"
|
||||
fieldEventType = "event_type"
|
||||
fieldOccurredAtMS = "occurred_at_ms"
|
||||
fieldDetails = "details"
|
||||
)
|
||||
|
||||
// Config groups the dependencies and stream name required to construct
|
||||
// a Publisher.
|
||||
type Config struct {
|
||||
// Client appends entries to the Redis Stream. Must be non-nil.
|
||||
Client *redis.Client
|
||||
|
||||
// Snapshots upserts the latest health snapshot. Must be non-nil.
|
||||
Snapshots ports.HealthSnapshotStore
|
||||
|
||||
// Stream stores the Redis Stream key events are published to (e.g.
|
||||
// `runtime:health_events`). Must not be empty.
|
||||
Stream string
|
||||
}
|
||||
|
||||
// Publisher implements `ports.HealthEventPublisher` on top of a shared
|
||||
// Redis client and the production `health_snapshots` store.
|
||||
type Publisher struct {
|
||||
client *redis.Client
|
||||
snapshots ports.HealthSnapshotStore
|
||||
stream string
|
||||
}
|
||||
|
||||
// NewPublisher constructs one Publisher from cfg. Validation errors
// name the missing collaborator.
|
||||
func NewPublisher(cfg Config) (*Publisher, error) {
|
||||
if cfg.Client == nil {
|
||||
return nil, errors.New("new rtmanager health events publisher: nil redis client")
|
||||
}
|
||||
if cfg.Snapshots == nil {
|
||||
return nil, errors.New("new rtmanager health events publisher: nil snapshot store")
|
||||
}
|
||||
if cfg.Stream == "" {
|
||||
return nil, errors.New("new rtmanager health events publisher: stream must not be empty")
|
||||
}
|
||||
return &Publisher{
|
||||
client: cfg.Client,
|
||||
snapshots: cfg.Snapshots,
|
||||
stream: cfg.Stream,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Publish upserts the matching health_snapshots row and then XADDs the
|
||||
// envelope to the configured Redis Stream. Both side effects are
|
||||
// required; the snapshot upsert runs first so a successful Publish
|
||||
// always leaves the snapshot store at least as fresh as the stream.
|
||||
func (publisher *Publisher) Publish(ctx context.Context, envelope ports.HealthEventEnvelope) error {
|
||||
if publisher == nil || publisher.client == nil || publisher.snapshots == nil {
|
||||
return errors.New("publish health event: nil publisher")
|
||||
}
|
||||
if ctx == nil {
|
||||
return errors.New("publish health event: nil context")
|
||||
}
|
||||
if err := envelope.Validate(); err != nil {
|
||||
return fmt.Errorf("publish health event: %w", err)
|
||||
}
|
||||
|
||||
details := envelope.Details
|
||||
if len(details) == 0 {
|
||||
details = json.RawMessage(emptyDetails)
|
||||
}
|
||||
|
||||
status, source := snapshotMappingFor(envelope.EventType)
|
||||
snapshot := health.HealthSnapshot{
|
||||
GameID: envelope.GameID,
|
||||
ContainerID: envelope.ContainerID,
|
||||
Status: status,
|
||||
Source: source,
|
||||
Details: details,
|
||||
ObservedAt: envelope.OccurredAt.UTC(),
|
||||
}
|
||||
if err := publisher.snapshots.Upsert(ctx, snapshot); err != nil {
|
||||
return fmt.Errorf("publish health event: upsert snapshot: %w", err)
|
||||
}
|
||||
|
||||
occurredAtMS := envelope.OccurredAt.UTC().UnixMilli()
|
||||
values := map[string]any{
|
||||
fieldGameID: envelope.GameID,
|
||||
fieldContainerID: envelope.ContainerID,
|
||||
fieldEventType: string(envelope.EventType),
|
||||
fieldOccurredAtMS: strconv.FormatInt(occurredAtMS, 10),
|
||||
fieldDetails: string(details),
|
||||
}
|
||||
if err := publisher.client.XAdd(ctx, &redis.XAddArgs{
|
||||
Stream: publisher.stream,
|
||||
Values: values,
|
||||
}).Err(); err != nil {
|
||||
return fmt.Errorf("publish health event: xadd: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// snapshotMappingFor returns the SnapshotStatus and SnapshotSource that
|
||||
// match eventType per `rtmanager/README.md §Health Monitoring`.
|
||||
//
|
||||
// `container_started` is observed when the start service successfully
|
||||
// runs the container; the snapshot collapses it to `healthy`.
|
||||
// `probe_recovered` collapses to `healthy` per
|
||||
// `rtmanager/docs/domain-and-ports.md` §4: it does not have its own
|
||||
// snapshot status; the next observation overwrites the prior
|
||||
// `probe_failed` with `healthy`.
|
||||
func snapshotMappingFor(eventType health.EventType) (health.SnapshotStatus, health.SnapshotSource) {
|
||||
switch eventType {
|
||||
case health.EventTypeContainerStarted:
|
||||
return health.SnapshotStatusHealthy, health.SnapshotSourceDockerEvent
|
||||
case health.EventTypeContainerExited:
|
||||
return health.SnapshotStatusExited, health.SnapshotSourceDockerEvent
|
||||
case health.EventTypeContainerOOM:
|
||||
return health.SnapshotStatusOOM, health.SnapshotSourceDockerEvent
|
||||
case health.EventTypeContainerDisappeared:
|
||||
return health.SnapshotStatusContainerDisappeared, health.SnapshotSourceDockerEvent
|
||||
case health.EventTypeInspectUnhealthy:
|
||||
return health.SnapshotStatusInspectUnhealthy, health.SnapshotSourceInspect
|
||||
case health.EventTypeProbeFailed:
|
||||
return health.SnapshotStatusProbeFailed, health.SnapshotSourceProbe
|
||||
case health.EventTypeProbeRecovered:
|
||||
return health.SnapshotStatusHealthy, health.SnapshotSourceProbe
|
||||
default:
|
||||
return "", ""
|
||||
}
|
||||
}
|
||||
|
||||
// Compile-time assertion: Publisher implements
|
||||
// ports.HealthEventPublisher.
|
||||
var _ ports.HealthEventPublisher = (*Publisher)(nil)
|
||||
@@ -0,0 +1,197 @@
|
||||
package healtheventspublisher_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"strconv"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/healtheventspublisher"
|
||||
"galaxy/rtmanager/internal/domain/health"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/alicebob/miniredis/v2"
|
||||
"github.com/redis/go-redis/v9"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// fakeSnapshots captures Upsert invocations for assertions.
|
||||
type fakeSnapshots struct {
|
||||
mu sync.Mutex
|
||||
upserts []health.HealthSnapshot
|
||||
upsertErr error
|
||||
}
|
||||
|
||||
func (s *fakeSnapshots) Upsert(_ context.Context, snapshot health.HealthSnapshot) error {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
if s.upsertErr != nil {
|
||||
return s.upsertErr
|
||||
}
|
||||
s.upserts = append(s.upserts, snapshot)
|
||||
return nil
|
||||
}
|
||||
|
||||
func (s *fakeSnapshots) Get(_ context.Context, _ string) (health.HealthSnapshot, error) {
|
||||
return health.HealthSnapshot{}, nil
|
||||
}
|
||||
|
||||
func newPublisher(t *testing.T, snapshots ports.HealthSnapshotStore) (*healtheventspublisher.Publisher, *miniredis.Miniredis, *redis.Client) {
|
||||
t.Helper()
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
publisher, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{
|
||||
Client: client,
|
||||
Snapshots: snapshots,
|
||||
Stream: "runtime:health_events",
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return publisher, server, client
|
||||
}
|
||||
|
||||
func TestNewPublisherRejectsMissingCollaborators(t *testing.T) {
|
||||
_, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{})
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = healtheventspublisher.NewPublisher(healtheventspublisher.Config{
|
||||
Client: redis.NewClient(&redis.Options{Addr: "127.0.0.1:0"}),
|
||||
})
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = healtheventspublisher.NewPublisher(healtheventspublisher.Config{
|
||||
Client: redis.NewClient(&redis.Options{Addr: "127.0.0.1:0"}),
|
||||
Snapshots: &fakeSnapshots{},
|
||||
})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestPublishContainerStartedUpsertsHealthyAndXAdds(t *testing.T) {
|
||||
snapshots := &fakeSnapshots{}
|
||||
publisher, _, client := newPublisher(t, snapshots)
|
||||
|
||||
occurredAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
envelope := ports.HealthEventEnvelope{
|
||||
GameID: "game-1",
|
||||
ContainerID: "c-1",
|
||||
EventType: health.EventTypeContainerStarted,
|
||||
OccurredAt: occurredAt,
|
||||
Details: json.RawMessage(`{"image_ref":"galaxy/game:1.2.3"}`),
|
||||
}
|
||||
require.NoError(t, publisher.Publish(context.Background(), envelope))
|
||||
|
||||
require.Len(t, snapshots.upserts, 1)
|
||||
snapshot := snapshots.upserts[0]
|
||||
assert.Equal(t, "game-1", snapshot.GameID)
|
||||
assert.Equal(t, "c-1", snapshot.ContainerID)
|
||||
assert.Equal(t, health.SnapshotStatusHealthy, snapshot.Status)
|
||||
assert.Equal(t, health.SnapshotSourceDockerEvent, snapshot.Source)
|
||||
assert.JSONEq(t, `{"image_ref":"galaxy/game:1.2.3"}`, string(snapshot.Details))
|
||||
assert.Equal(t, occurredAt, snapshot.ObservedAt)
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
values := entries[0].Values
|
||||
assert.Equal(t, "game-1", values["game_id"])
|
||||
assert.Equal(t, "c-1", values["container_id"])
|
||||
assert.Equal(t, "container_started", values["event_type"])
|
||||
assert.Equal(t, strconv.FormatInt(occurredAt.UnixMilli(), 10), values["occurred_at_ms"])
|
||||
assert.JSONEq(t, `{"image_ref":"galaxy/game:1.2.3"}`, values["details"].(string))
|
||||
}
|
||||
|
||||
func TestPublishMapsEveryEventTypeToASnapshot(t *testing.T) {
|
||||
t.Parallel()
|
||||
cases := []struct {
|
||||
eventType health.EventType
|
||||
expectStatus health.SnapshotStatus
|
||||
expectSource health.SnapshotSource
|
||||
}{
|
||||
{health.EventTypeContainerStarted, health.SnapshotStatusHealthy, health.SnapshotSourceDockerEvent},
|
||||
{health.EventTypeContainerExited, health.SnapshotStatusExited, health.SnapshotSourceDockerEvent},
|
||||
{health.EventTypeContainerOOM, health.SnapshotStatusOOM, health.SnapshotSourceDockerEvent},
|
||||
{health.EventTypeContainerDisappeared, health.SnapshotStatusContainerDisappeared, health.SnapshotSourceDockerEvent},
|
||||
{health.EventTypeInspectUnhealthy, health.SnapshotStatusInspectUnhealthy, health.SnapshotSourceInspect},
|
||||
{health.EventTypeProbeFailed, health.SnapshotStatusProbeFailed, health.SnapshotSourceProbe},
|
||||
{health.EventTypeProbeRecovered, health.SnapshotStatusHealthy, health.SnapshotSourceProbe},
|
||||
}
|
||||
for _, tc := range cases {
|
||||
t.Run(string(tc.eventType), func(t *testing.T) {
|
||||
t.Parallel()
|
||||
snapshots := &fakeSnapshots{}
|
||||
publisher, _, _ := newPublisher(t, snapshots)
|
||||
require.NoError(t, publisher.Publish(context.Background(), ports.HealthEventEnvelope{
|
||||
GameID: "g",
|
||||
ContainerID: "c",
|
||||
EventType: tc.eventType,
|
||||
OccurredAt: time.Now().UTC(),
|
||||
Details: json.RawMessage(`{}`),
|
||||
}))
|
||||
require.Len(t, snapshots.upserts, 1)
|
||||
assert.Equal(t, tc.expectStatus, snapshots.upserts[0].Status)
|
||||
assert.Equal(t, tc.expectSource, snapshots.upserts[0].Source)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestPublishEmptyDetailsBecomesEmptyObject(t *testing.T) {
|
||||
snapshots := &fakeSnapshots{}
|
||||
publisher, _, client := newPublisher(t, snapshots)
|
||||
|
||||
envelope := ports.HealthEventEnvelope{
|
||||
GameID: "g",
|
||||
ContainerID: "c",
|
||||
EventType: health.EventTypeContainerDisappeared,
|
||||
OccurredAt: time.Now().UTC(),
|
||||
}
|
||||
require.NoError(t, publisher.Publish(context.Background(), envelope))
|
||||
|
||||
require.Len(t, snapshots.upserts, 1)
|
||||
assert.JSONEq(t, "{}", string(snapshots.upserts[0].Details))
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
assert.JSONEq(t, "{}", entries[0].Values["details"].(string))
|
||||
}
|
||||
|
||||
func TestPublishRejectsInvalidEnvelope(t *testing.T) {
|
||||
snapshots := &fakeSnapshots{}
|
||||
publisher, _, client := newPublisher(t, snapshots)
|
||||
|
||||
require.Error(t, publisher.Publish(context.Background(), ports.HealthEventEnvelope{}))
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
assert.Empty(t, entries)
|
||||
assert.Empty(t, snapshots.upserts)
|
||||
}
|
||||
|
||||
func TestPublishSurfacesSnapshotErrorWithoutXAdd(t *testing.T) {
|
||||
snapshots := &fakeSnapshots{upsertErr: assertSentinelErr}
|
||||
publisher, _, client := newPublisher(t, snapshots)
|
||||
|
||||
err := publisher.Publish(context.Background(), ports.HealthEventEnvelope{
|
||||
GameID: "g",
|
||||
ContainerID: "c",
|
||||
EventType: health.EventTypeContainerStarted,
|
||||
OccurredAt: time.Now().UTC(),
|
||||
Details: json.RawMessage(`{"image_ref":"x"}`),
|
||||
})
|
||||
require.Error(t, err)
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:health_events", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
assert.Empty(t, entries, "xadd must not run when snapshot upsert fails")
|
||||
}
|
||||
|
||||
// assertSentinelErr is a sentinel for snapshot-failure assertions.
|
||||
var assertSentinelErr = sentinelError("snapshot upsert failure")
|
||||
|
||||
type sentinelError string
|
||||
|
||||
func (s sentinelError) Error() string { return string(s) }
|
||||
@@ -0,0 +1,100 @@
|
||||
// Package jobresultspublisher provides the Redis-Streams-backed
// publisher for `runtime:job_results`. The start-jobs and stop-jobs
// consumers call this adapter so every consumed envelope produces
// exactly one outcome entry on the result stream.
//
// The wire fields mirror the AsyncAPI schema frozen in
// `rtmanager/api/runtime-jobs-asyncapi.yaml`. Every field is XADDed
// even when empty so consumers can rely on the schema's required-field
// set.
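//
// A minimal usage sketch; the redis client wiring, ctx, and the
// surrounding consumer loop are illustrative assumptions, not part of
// this package:
//
//	publisher, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
//		Client: redisClient, // shared *redis.Client
//		Stream: "runtime:job_results",
//	})
//	if err != nil {
//		// fail fast at startup
//	}
//	err = publisher.Publish(ctx, ports.JobResult{
//		GameID:         "game-1",
//		Outcome:        ports.JobOutcomeSuccess,
//		ContainerID:    "cont-1",
//		EngineEndpoint: "http://galaxy-game-game-1:8080",
//	})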
package jobresultspublisher
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// Wire field names used by the Redis Streams payload. Frozen by
|
||||
// `rtmanager/api/runtime-jobs-asyncapi.yaml`; renaming any of them
|
||||
// breaks consumers.
|
||||
const (
|
||||
fieldGameID = "game_id"
|
||||
fieldOutcome = "outcome"
|
||||
fieldContainerID = "container_id"
|
||||
fieldEngineEndpoint = "engine_endpoint"
|
||||
fieldErrorCode = "error_code"
|
||||
fieldErrorMessage = "error_message"
|
||||
)
|
||||
|
||||
// Config groups the dependencies and stream name required to construct
|
||||
// a Publisher.
|
||||
type Config struct {
|
||||
// Client appends entries to the Redis Stream. Must be non-nil.
|
||||
Client *redis.Client
|
||||
|
||||
// Stream stores the Redis Stream key job results are published to
|
||||
// (e.g. `runtime:job_results`). Must not be empty.
|
||||
Stream string
|
||||
}
|
||||
|
||||
// Publisher implements `ports.JobResultPublisher` on top of a shared
|
||||
// Redis client.
|
||||
type Publisher struct {
|
||||
client *redis.Client
|
||||
stream string
|
||||
}
|
||||
|
||||
// NewPublisher constructs one Publisher from cfg. Validation errors
// name the missing collaborator.
|
||||
func NewPublisher(cfg Config) (*Publisher, error) {
|
||||
if cfg.Client == nil {
|
||||
return nil, errors.New("new rtmanager job results publisher: nil redis client")
|
||||
}
|
||||
if strings.TrimSpace(cfg.Stream) == "" {
|
||||
return nil, errors.New("new rtmanager job results publisher: stream must not be empty")
|
||||
}
|
||||
return &Publisher{
|
||||
client: cfg.Client,
|
||||
stream: cfg.Stream,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Publish XADDs result to the configured Redis Stream. The wire payload
|
||||
// includes every field declared as required by the AsyncAPI schema —
|
||||
// empty strings are kept so consumers always see the documented keys.
|
||||
func (publisher *Publisher) Publish(ctx context.Context, result ports.JobResult) error {
|
||||
if publisher == nil || publisher.client == nil {
|
||||
return errors.New("publish job result: nil publisher")
|
||||
}
|
||||
if ctx == nil {
|
||||
return errors.New("publish job result: nil context")
|
||||
}
|
||||
if err := result.Validate(); err != nil {
|
||||
return fmt.Errorf("publish job result: %w", err)
|
||||
}
|
||||
|
||||
values := map[string]any{
|
||||
fieldGameID: result.GameID,
|
||||
fieldOutcome: result.Outcome,
|
||||
fieldContainerID: result.ContainerID,
|
||||
fieldEngineEndpoint: result.EngineEndpoint,
|
||||
fieldErrorCode: result.ErrorCode,
|
||||
fieldErrorMessage: result.ErrorMessage,
|
||||
}
|
||||
if err := publisher.client.XAdd(ctx, &redis.XAddArgs{
|
||||
Stream: publisher.stream,
|
||||
Values: values,
|
||||
}).Err(); err != nil {
|
||||
return fmt.Errorf("publish job result: xadd: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Compile-time assertion: Publisher implements ports.JobResultPublisher.
|
||||
var _ ports.JobResultPublisher = (*Publisher)(nil)
|
||||
@@ -0,0 +1,142 @@
|
||||
package jobresultspublisher_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/jobresultspublisher"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/alicebob/miniredis/v2"
|
||||
"github.com/redis/go-redis/v9"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func newPublisher(t *testing.T) (*jobresultspublisher.Publisher, *redis.Client) {
|
||||
t.Helper()
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
publisher, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
|
||||
Client: client,
|
||||
Stream: "runtime:job_results",
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return publisher, client
|
||||
}
|
||||
|
||||
func TestNewPublisherRejectsMissingCollaborators(t *testing.T) {
|
||||
_, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{})
|
||||
require.Error(t, err)
|
||||
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
_, err = jobresultspublisher.NewPublisher(jobresultspublisher.Config{Client: client})
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = jobresultspublisher.NewPublisher(jobresultspublisher.Config{Client: client, Stream: " "})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestPublishRejectsInvalidResult(t *testing.T) {
|
||||
publisher, _ := newPublisher(t)
|
||||
|
||||
require.Error(t, publisher.Publish(context.Background(), ports.JobResult{}))
|
||||
require.Error(t, publisher.Publish(context.Background(), ports.JobResult{
|
||||
GameID: "game-1",
|
||||
Outcome: "weird",
|
||||
}))
|
||||
}
|
||||
|
||||
func TestPublishStartSuccessXAddsAllRequiredFields(t *testing.T) {
|
||||
publisher, client := newPublisher(t)
|
||||
|
||||
result := ports.JobResult{
|
||||
GameID: "game-1",
|
||||
Outcome: ports.JobOutcomeSuccess,
|
||||
ContainerID: "c-1",
|
||||
EngineEndpoint: "http://galaxy-game-game-1:8080",
|
||||
ErrorCode: "",
|
||||
ErrorMessage: "",
|
||||
}
|
||||
require.NoError(t, publisher.Publish(context.Background(), result))
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
values := entries[0].Values
|
||||
assert.Equal(t, "game-1", values["game_id"])
|
||||
assert.Equal(t, "success", values["outcome"])
|
||||
assert.Equal(t, "c-1", values["container_id"])
|
||||
assert.Equal(t, "http://galaxy-game-game-1:8080", values["engine_endpoint"])
|
||||
assert.Equal(t, "", values["error_code"])
|
||||
assert.Equal(t, "", values["error_message"])
|
||||
}
|
||||
|
||||
func TestPublishFailureXAddsEmptyContainerAndEndpoint(t *testing.T) {
|
||||
publisher, client := newPublisher(t)
|
||||
|
||||
result := ports.JobResult{
|
||||
GameID: "game-2",
|
||||
Outcome: ports.JobOutcomeFailure,
|
||||
ErrorCode: "image_pull_failed",
|
||||
ErrorMessage: "manifest unknown",
|
||||
}
|
||||
require.NoError(t, publisher.Publish(context.Background(), result))
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
values := entries[0].Values
|
||||
assert.Equal(t, "game-2", values["game_id"])
|
||||
assert.Equal(t, "failure", values["outcome"])
|
||||
assert.Equal(t, "", values["container_id"], "failure must publish empty container id")
|
||||
assert.Equal(t, "", values["engine_endpoint"], "failure must publish empty engine endpoint")
|
||||
assert.Equal(t, "image_pull_failed", values["error_code"])
|
||||
assert.Equal(t, "manifest unknown", values["error_message"])
|
||||
}
|
||||
|
||||
func TestPublishReplayNoOpKeepsContainerAndEndpoint(t *testing.T) {
|
||||
publisher, client := newPublisher(t)
|
||||
|
||||
result := ports.JobResult{
|
||||
GameID: "game-3",
|
||||
Outcome: ports.JobOutcomeSuccess,
|
||||
ContainerID: "c-3",
|
||||
EngineEndpoint: "http://galaxy-game-game-3:8080",
|
||||
ErrorCode: "replay_no_op",
|
||||
}
|
||||
require.NoError(t, publisher.Publish(context.Background(), result))
|
||||
|
||||
entries, err := client.XRange(context.Background(), "runtime:job_results", "-", "+").Result()
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
values := entries[0].Values
|
||||
assert.Equal(t, "game-3", values["game_id"])
|
||||
assert.Equal(t, "success", values["outcome"])
|
||||
assert.Equal(t, "c-3", values["container_id"])
|
||||
assert.Equal(t, "http://galaxy-game-game-3:8080", values["engine_endpoint"])
|
||||
assert.Equal(t, "replay_no_op", values["error_code"])
|
||||
assert.Equal(t, "", values["error_message"])
|
||||
}
|
||||
|
||||
func TestPublishFailsOnClosedClient(t *testing.T) {
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
publisher, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
|
||||
Client: client,
|
||||
Stream: "runtime:job_results",
|
||||
})
|
||||
require.NoError(t, err)
|
||||
require.NoError(t, client.Close())
|
||||
|
||||
err = publisher.Publish(context.Background(), ports.JobResult{
|
||||
GameID: "game-4",
|
||||
Outcome: ports.JobOutcomeSuccess,
|
||||
})
|
||||
require.Error(t, err)
|
||||
}
|
||||
@@ -0,0 +1,219 @@
|
||||
// Package lobbyclient provides the trusted-internal Lobby REST client
// Runtime Manager uses to fetch ancillary game metadata for diagnostics.
//
// The client is intentionally minimal: the GetGame fetch serves
// diagnostics only, because the start envelope already carries the one
// required field (`image_ref`). A missing game maps to
// `ports.ErrLobbyGameNotFound` and every other failure to
// `ports.ErrLobbyUnavailable`, so callers can distinguish "not found"
// from transport faults and continue without aborting the start
// operation.
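//
// A minimal usage sketch; the base URL, timeout value, ctx, and the way
// the caller reacts to each error are illustrative assumptions, not part
// of this package:
//
//	client, err := lobbyclient.NewClient(lobbyclient.Config{
//		BaseURL:        "http://lobby:8095",
//		RequestTimeout: 2 * time.Second,
//	})
//	if err != nil {
//		// fail fast at startup
//	}
//	defer client.Close()
//
//	record, err := client.GetGame(ctx, "game-1")
//	switch {
//	case errors.Is(err, ports.ErrLobbyGameNotFound),
//		errors.Is(err, ports.ErrLobbyUnavailable):
//		// proceed without the ancillary metadata; the start envelope
//		// already carries image_ref
//	case err == nil:
//		_ = record.TargetEngineVersion // extra diagnostics only
//	}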
package lobbyclient
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/url"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
)
|
||||
|
||||
const (
|
||||
getGamePathSuffix = "/api/v1/internal/games/%s"
|
||||
)
|
||||
|
||||
// Config configures one HTTP-backed Lobby internal client.
|
||||
type Config struct {
|
||||
	// BaseURL stores the absolute base URL of the Lobby internal HTTP
	// listener (e.g. `http://lobby:8095`).
	BaseURL string

	// RequestTimeout bounds one outbound lookup request.
	RequestTimeout time.Duration
}

// Client resolves Lobby game records through the trusted internal HTTP
// API.
type Client struct {
	baseURL              string
	requestTimeout       time.Duration
	httpClient           *http.Client
	closeIdleConnections func()
}

type gameRecordEnvelope struct {
	GameID              string `json:"game_id"`
	Status              string `json:"status"`
	TargetEngineVersion string `json:"target_engine_version"`
}

type errorEnvelope struct {
	Error *errorBody `json:"error"`
}

type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// NewClient constructs a Lobby internal client that uses
// repository-standard HTTP transport instrumentation through otelhttp.
// The cloned default transport keeps the production wiring isolated
// from caller-provided transports.
func NewClient(cfg Config) (*Client, error) {
	transport, ok := http.DefaultTransport.(*http.Transport)
	if !ok {
		return nil, errors.New("new lobby internal client: default transport is not *http.Transport")
	}
	cloned := transport.Clone()
	return newClient(cfg, &http.Client{Transport: otelhttp.NewTransport(cloned)}, cloned.CloseIdleConnections)
}

func newClient(cfg Config, httpClient *http.Client, closeIdleConnections func()) (*Client, error) {
	switch {
	case strings.TrimSpace(cfg.BaseURL) == "":
		return nil, errors.New("new lobby internal client: base URL must not be empty")
	case cfg.RequestTimeout <= 0:
		return nil, errors.New("new lobby internal client: request timeout must be positive")
	case httpClient == nil:
		return nil, errors.New("new lobby internal client: http client must not be nil")
	}

	parsed, err := url.Parse(strings.TrimRight(strings.TrimSpace(cfg.BaseURL), "/"))
	if err != nil {
		return nil, fmt.Errorf("new lobby internal client: parse base URL: %w", err)
	}
	if parsed.Scheme == "" || parsed.Host == "" {
		return nil, errors.New("new lobby internal client: base URL must be absolute")
	}

	return &Client{
		baseURL:              parsed.String(),
		requestTimeout:       cfg.RequestTimeout,
		httpClient:           httpClient,
		closeIdleConnections: closeIdleConnections,
	}, nil
}

// Close releases idle HTTP connections owned by the client transport.
// Call once on shutdown.
func (client *Client) Close() error {
	if client == nil || client.closeIdleConnections == nil {
		return nil
	}
	client.closeIdleConnections()
	return nil
}

// GetGame returns the Lobby game record for gameID. It maps Lobby's
// `404 not_found` to `ports.ErrLobbyGameNotFound`; every other failure
// (transport, timeout, non-2xx response) maps to
// `ports.ErrLobbyUnavailable` wrapped with the original error so callers
// keep the diagnostic detail.
func (client *Client) GetGame(ctx context.Context, gameID string) (ports.LobbyGameRecord, error) {
	if client == nil || client.httpClient == nil {
		return ports.LobbyGameRecord{}, errors.New("lobby get game: nil client")
	}
	if ctx == nil {
		return ports.LobbyGameRecord{}, errors.New("lobby get game: nil context")
	}
	if err := ctx.Err(); err != nil {
		return ports.LobbyGameRecord{}, err
	}
	if strings.TrimSpace(gameID) == "" {
		return ports.LobbyGameRecord{}, errors.New("lobby get game: game id must not be empty")
	}

	payload, statusCode, err := client.doRequest(ctx, http.MethodGet, fmt.Sprintf(getGamePathSuffix, url.PathEscape(gameID)))
	if err != nil {
		return ports.LobbyGameRecord{}, fmt.Errorf("%w: %w", ports.ErrLobbyUnavailable, err)
	}

	switch statusCode {
	case http.StatusOK:
		var envelope gameRecordEnvelope
		if err := decodeJSONPayload(payload, &envelope); err != nil {
			return ports.LobbyGameRecord{}, fmt.Errorf("%w: decode success response: %w", ports.ErrLobbyUnavailable, err)
		}
		if strings.TrimSpace(envelope.GameID) == "" {
			return ports.LobbyGameRecord{}, fmt.Errorf("%w: success response missing game_id", ports.ErrLobbyUnavailable)
		}
		return ports.LobbyGameRecord{
			GameID:              envelope.GameID,
			Status:              envelope.Status,
			TargetEngineVersion: envelope.TargetEngineVersion,
		}, nil
	case http.StatusNotFound:
		return ports.LobbyGameRecord{}, ports.ErrLobbyGameNotFound
	default:
		errorCode := decodeErrorCode(payload)
		if errorCode != "" {
			return ports.LobbyGameRecord{}, fmt.Errorf("%w: unexpected status %d (error_code=%s)", ports.ErrLobbyUnavailable, statusCode, errorCode)
		}
		return ports.LobbyGameRecord{}, fmt.Errorf("%w: unexpected status %d", ports.ErrLobbyUnavailable, statusCode)
	}
}

func (client *Client) doRequest(ctx context.Context, method, requestPath string) ([]byte, int, error) {
	attemptCtx, cancel := context.WithTimeout(ctx, client.requestTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(attemptCtx, method, client.baseURL+requestPath, nil)
	if err != nil {
		return nil, 0, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Accept", "application/json")

	resp, err := client.httpClient.Do(req)
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, 0, fmt.Errorf("read response body: %w", err)
	}
	return body, resp.StatusCode, nil
}

// decodeJSONPayload tolerantly decodes a JSON object; unknown fields
// are ignored so additive Lobby schema changes do not break us.
func decodeJSONPayload(payload []byte, target any) error {
	decoder := json.NewDecoder(bytes.NewReader(payload))
	if err := decoder.Decode(target); err != nil {
		return err
	}
	if err := decoder.Decode(&struct{}{}); err != io.EOF {
		if err == nil {
			return errors.New("unexpected trailing JSON input")
		}
		return err
	}
	return nil
}

func decodeErrorCode(payload []byte) string {
	if len(payload) == 0 {
		return ""
	}
	var envelope errorEnvelope
	if err := json.Unmarshal(payload, &envelope); err != nil {
		return ""
	}
	if envelope.Error == nil {
		return ""
	}
	return envelope.Error.Code
}

// Compile-time assertion: Client implements ports.LobbyInternalClient.
var _ ports.LobbyInternalClient = (*Client)(nil)
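
A minimal caller-side sketch of how the error mapping documented on `GetGame` is meant to be consumed. This is not part of the commit: the `lobbyclient` import path, Lobby address, and timeout are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"galaxy/rtmanager/internal/adapters/lobbyclient" // hypothetical import path for the adapter above
	"galaxy/rtmanager/internal/ports"
)

func main() {
	client, err := lobbyclient.NewClient(lobbyclient.Config{
		BaseURL:        "http://lobby:8095",  // illustrative
		RequestTimeout: 2 * time.Second,      // illustrative
	})
	if err != nil {
		log.Fatal(err)
	}
	defer func() { _ = client.Close() }()

	record, err := client.GetGame(context.Background(), "game-1")
	switch {
	case errors.Is(err, ports.ErrLobbyGameNotFound):
		log.Print("unknown game: treat the job as permanently failed")
	case errors.Is(err, ports.ErrLobbyUnavailable):
		log.Print("transport error, timeout, or non-2xx: safe to retry later")
	case err != nil:
		log.Fatal(err) // caller-side misuse (empty game id, nil context)
	default:
		log.Printf("target engine version: %s", record.TargetEngineVersion)
	}
}
```
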
@@ -0,0 +1,153 @@
package lobbyclient

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"galaxy/rtmanager/internal/ports"
)

func newTestClient(t *testing.T, baseURL string, timeout time.Duration) *Client {
	t.Helper()
	client, err := NewClient(Config{BaseURL: baseURL, RequestTimeout: timeout})
	require.NoError(t, err)
	t.Cleanup(func() { _ = client.Close() })
	return client
}

func TestNewClientValidatesConfig(t *testing.T) {
	cases := map[string]Config{
		"empty base url":        {BaseURL: "", RequestTimeout: time.Second},
		"non-absolute base url": {BaseURL: "lobby:8095", RequestTimeout: time.Second},
		"non-positive timeout":  {BaseURL: "http://lobby:8095", RequestTimeout: 0},
	}
	for name, cfg := range cases {
		t.Run(name, func(t *testing.T) {
			_, err := NewClient(cfg)
			require.Error(t, err)
		})
	}
}

func TestGetGameSuccess(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		require.Equal(t, http.MethodGet, r.Method)
		require.Equal(t, "/api/v1/internal/games/game-1", r.URL.Path)
		require.Equal(t, "application/json", r.Header.Get("Accept"))
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write([]byte(`{
			"game_id": "game-1",
			"game_name": "Sample",
			"status": "running",
			"target_engine_version": "1.4.2",
			"current_turn": 0,
			"runtime_status": "running"
		}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	got, err := client.GetGame(context.Background(), "game-1")
	require.NoError(t, err)
	assert.Equal(t, "game-1", got.GameID)
	assert.Equal(t, "running", got.Status)
	assert.Equal(t, "1.4.2", got.TargetEngineVersion)
}

func TestGetGameNotFound(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusNotFound)
		_, _ = w.Write([]byte(`{"error":{"code":"not_found","message":"no such game"}}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	_, err := client.GetGame(context.Background(), "missing")
	require.Error(t, err)
	assert.True(t, errors.Is(err, ports.ErrLobbyGameNotFound))
	assert.False(t, errors.Is(err, ports.ErrLobbyUnavailable))
}

func TestGetGameInternalErrorMapsToUnavailable(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		_, _ = w.Write([]byte(`{"error":{"code":"internal_error","message":"boom"}}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	_, err := client.GetGame(context.Background(), "x")
	require.Error(t, err)
	assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
	assert.Contains(t, err.Error(), "500")
	assert.Contains(t, err.Error(), "internal_error")
}

func TestGetGameTimeoutMapsToUnavailable(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(150 * time.Millisecond)
		_, _ = w.Write([]byte(`{}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, 50*time.Millisecond)
	_, err := client.GetGame(context.Background(), "x")
	require.Error(t, err)
	assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
}

func TestGetGameSuccessMissingGameIDIsUnavailable(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte(`{"status":"running"}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	_, err := client.GetGame(context.Background(), "x")
	require.Error(t, err)
	assert.True(t, errors.Is(err, ports.ErrLobbyUnavailable))
	assert.Contains(t, err.Error(), "missing game_id")
}

func TestGetGameRejectsBadInput(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.Fatal("must not contact lobby on bad input")
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	t.Run("empty game id", func(t *testing.T) {
		_, err := client.GetGame(context.Background(), " ")
		require.Error(t, err)
		assert.Contains(t, err.Error(), "game id")
	})
	t.Run("canceled context", func(t *testing.T) {
		ctx, cancel := context.WithCancel(context.Background())
		cancel()
		_, err := client.GetGame(ctx, "x")
		require.Error(t, err)
		assert.True(t, errors.Is(err, context.Canceled))
	})
}

func TestCloseReleasesConnections(t *testing.T) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte(`{"game_id":"x","status":"running","target_engine_version":"1.0.0"}`))
	}))
	defer server.Close()

	client := newTestClient(t, server.URL, time.Second)
	_, err := client.GetGame(context.Background(), "x")
	require.NoError(t, err)
	assert.NoError(t, client.Close())
	assert.NoError(t, client.Close()) // idempotent
}
@@ -0,0 +1,70 @@
// Package notificationpublisher provides the Redis-Streams-backed
// notification-intent publisher Runtime Manager uses to emit admin-only
// failure notifications. The adapter is a thin shim over
// `galaxy/notificationintent.Publisher` that drops the entry id at the
// wrapper boundary; rationale lives in
// `rtmanager/docs/domain-and-ports.md §7`.
package notificationpublisher

import (
	"context"
	"errors"
	"fmt"

	"github.com/redis/go-redis/v9"

	"galaxy/notificationintent"
	"galaxy/rtmanager/internal/ports"
)

// Config groups the dependencies and stream name required to
// construct a Publisher.
type Config struct {
	// Client appends entries to Redis Streams. Must be non-nil.
	Client *redis.Client

	// Stream stores the Redis Stream key intents are published to.
	// When empty, `notificationintent.DefaultIntentsStream` is used.
	Stream string
}

// Publisher implements `ports.NotificationIntentPublisher` on top of
// the shared `notificationintent.Publisher`. The wrapper is the single
// point that drops the entry id returned by the underlying publisher.
type Publisher struct {
	inner *notificationintent.Publisher
}

// NewPublisher constructs a Publisher from cfg. It wraps the shared
// publisher and delegates validation; transport errors and validation
// errors propagate verbatim.
func NewPublisher(cfg Config) (*Publisher, error) {
	if cfg.Client == nil {
		return nil, errors.New("new rtmanager notification publisher: nil redis client")
	}
	inner, err := notificationintent.NewPublisher(notificationintent.PublisherConfig{
		Client: cfg.Client,
		Stream: cfg.Stream,
	})
	if err != nil {
		return nil, fmt.Errorf("new rtmanager notification publisher: %w", err)
	}
	return &Publisher{inner: inner}, nil
}

// Publish forwards intent to the underlying notificationintent
// publisher and discards the resulting Redis Stream entry id. A failed
// publish surfaces as the underlying error.
func (publisher *Publisher) Publish(ctx context.Context, intent notificationintent.Intent) error {
	if publisher == nil || publisher.inner == nil {
		return errors.New("publish notification intent: nil publisher")
	}
	if _, err := publisher.inner.Publish(ctx, intent); err != nil {
		return err
	}
	return nil
}

// Compile-time assertion: Publisher implements
// ports.NotificationIntentPublisher.
var _ ports.NotificationIntentPublisher = (*Publisher)(nil)
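
For orientation, a hypothetical wiring sketch showing how a caller builds a shared-library intent and hands it to this wrapper; the import path and Redis address are assumptions, and only the error is observed because the entry id is dropped inside `Publish`.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"

	"galaxy/notificationintent"
	"galaxy/rtmanager/internal/adapters/notificationpublisher" // hypothetical import path
)

func main() {
	client := redis.NewClient(&redis.Options{Addr: "redis:6379"}) // illustrative address
	publisher, err := notificationpublisher.NewPublisher(notificationpublisher.Config{Client: client})
	if err != nil {
		log.Fatal(err)
	}

	intent, err := notificationintent.NewRuntimeImagePullFailedIntent(
		notificationintent.Metadata{
			IdempotencyKey: "rtmanager:start:game-1:attempt-1", // illustrative key
			OccurredAt:     time.Now().UTC(),
		},
		notificationintent.RuntimeImagePullFailedPayload{
			GameID:        "game-1",
			ImageRef:      "galaxy/game:1.4.2",
			ErrorCode:     "image_pull_failed",
			ErrorMessage:  "registry timeout",
			AttemptedAtMs: time.Now().UnixMilli(),
		},
	)
	if err != nil {
		log.Fatal(err)
	}

	// The XADD entry id is discarded by the wrapper; callers see only success or failure.
	if err := publisher.Publish(context.Background(), intent); err != nil {
		log.Fatal(err)
	}
}
```
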
@@ -0,0 +1,123 @@
package notificationpublisher

import (
	"context"
	"encoding/json"
	"testing"
	"time"

	"github.com/alicebob/miniredis/v2"
	"github.com/redis/go-redis/v9"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"galaxy/notificationintent"
)

func newRedis(t *testing.T) (*redis.Client, *miniredis.Miniredis) {
	t.Helper()
	server := miniredis.RunT(t)
	client := redis.NewClient(&redis.Options{Addr: server.Addr()})
	t.Cleanup(func() { _ = client.Close() })
	return client, server
}

func readStream(t *testing.T, client *redis.Client, stream string) []redis.XMessage {
	t.Helper()
	messages, err := client.XRange(context.Background(), stream, "-", "+").Result()
	require.NoError(t, err)
	return messages
}

func TestNewPublisherValidation(t *testing.T) {
	t.Run("nil client", func(t *testing.T) {
		_, err := NewPublisher(Config{})
		require.Error(t, err)
		assert.Contains(t, err.Error(), "nil redis client")
	})
}

func TestPublisherWritesIntent(t *testing.T) {
	client, _ := newRedis(t)

	publisher, err := NewPublisher(Config{Client: client, Stream: "notification:intents"})
	require.NoError(t, err)

	intent, err := notificationintent.NewRuntimeImagePullFailedIntent(
		notificationintent.Metadata{
			IdempotencyKey: "rtmanager:start:game-1:abc",
			OccurredAt:     time.UnixMilli(1714200000000).UTC(),
		},
		notificationintent.RuntimeImagePullFailedPayload{
			GameID:        "game-1",
			ImageRef:      "galaxy/game:1.4.2",
			ErrorCode:     "image_pull_failed",
			ErrorMessage:  "registry timeout",
			AttemptedAtMs: 1714200000000,
		},
	)
	require.NoError(t, err)

	require.NoError(t, publisher.Publish(context.Background(), intent))

	messages := readStream(t, client, "notification:intents")
	require.Len(t, messages, 1)

	values := messages[0].Values
	assert.Equal(t, "runtime.image_pull_failed", values["notification_type"])
	assert.Equal(t, "runtime_manager", values["producer"])
	assert.Equal(t, "admin_email", values["audience_kind"])
	assert.Equal(t, "rtmanager:start:game-1:abc", values["idempotency_key"])

	// recipient_user_ids_json must be absent for admin_email audience.
	_, hasRecipients := values["recipient_user_ids_json"]
	assert.False(t, hasRecipients)

	payloadRaw, ok := values["payload_json"].(string)
	require.True(t, ok)
	var payload map[string]any
	require.NoError(t, json.Unmarshal([]byte(payloadRaw), &payload))
	assert.Equal(t, "game-1", payload["game_id"])
	assert.Equal(t, "galaxy/game:1.4.2", payload["image_ref"])
}

func TestPublisherForwardsValidationError(t *testing.T) {
	client, _ := newRedis(t)
	publisher, err := NewPublisher(Config{Client: client})
	require.NoError(t, err)

	// Intent with a zero OccurredAt fails the shared validator.
	bad := notificationintent.Intent{
		NotificationType: notificationintent.NotificationTypeRuntimeImagePullFailed,
		Producer:         notificationintent.ProducerRuntimeManager,
		AudienceKind:     notificationintent.AudienceKindAdminEmail,
		IdempotencyKey:   "k",
		PayloadJSON:      `{"game_id":"g","image_ref":"r","error_code":"c","error_message":"m","attempted_at_ms":1}`,
	}
	require.Error(t, publisher.Publish(context.Background(), bad))
}

func TestPublisherDefaultsStreamName(t *testing.T) {
	client, _ := newRedis(t)
	publisher, err := NewPublisher(Config{Client: client, Stream: ""})
	require.NoError(t, err)

	intent, err := notificationintent.NewRuntimeContainerStartFailedIntent(
		notificationintent.Metadata{
			IdempotencyKey: "k",
			OccurredAt:     time.UnixMilli(1714200000000).UTC(),
		},
		notificationintent.RuntimeContainerStartFailedPayload{
			GameID:        "g",
			ImageRef:      "r",
			ErrorCode:     "container_start_failed",
			ErrorMessage:  "boom",
			AttemptedAtMs: 1714200000000,
		},
	)
	require.NoError(t, err)
	require.NoError(t, publisher.Publish(context.Background(), intent))

	messages := readStream(t, client, notificationintent.DefaultIntentsStream)
	require.Len(t, messages, 1)
}
@@ -0,0 +1,203 @@
// Package healthsnapshotstore implements the PostgreSQL-backed adapter
// for `ports.HealthSnapshotStore`.
//
// The package owns the on-disk shape of the `health_snapshots` table
// defined in
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
// and translates the schema-agnostic `ports.HealthSnapshotStore` interface
// declared in `internal/ports/healthsnapshotstore.go` into concrete
// go-jet/v2 statements driven by the pgx driver.
//
// The `details` jsonb column round-trips as a `json.RawMessage`. Empty
// payloads are substituted with the SQL default `{}` on Upsert so the
// CHECK constraints and downstream readers never observe a non-JSON
// empty string.
package healthsnapshotstore

import (
	"context"
	"database/sql"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"

	"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
	pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
	"galaxy/rtmanager/internal/domain/health"
	"galaxy/rtmanager/internal/domain/runtime"
	"galaxy/rtmanager/internal/ports"

	pg "github.com/go-jet/jet/v2/postgres"
)

// emptyDetails is the canonical jsonb payload installed when the caller
// supplies an empty Details slice. It matches the SQL DEFAULT for the
// column.
const emptyDetails = "{}"

// Config configures one PostgreSQL-backed health-snapshot store instance.
type Config struct {
	// DB stores the connection pool the store uses for every query.
	DB *sql.DB

	// OperationTimeout bounds one round trip.
	OperationTimeout time.Duration
}

// Store persists Runtime Manager health snapshots in PostgreSQL.
type Store struct {
	db               *sql.DB
	operationTimeout time.Duration
}

// New constructs one PostgreSQL-backed health-snapshot store from cfg.
func New(cfg Config) (*Store, error) {
	if cfg.DB == nil {
		return nil, errors.New("new postgres health snapshot store: db must not be nil")
	}
	if cfg.OperationTimeout <= 0 {
		return nil, errors.New("new postgres health snapshot store: operation timeout must be positive")
	}
	return &Store{
		db:               cfg.DB,
		operationTimeout: cfg.OperationTimeout,
	}, nil
}

// healthSnapshotSelectColumns is the canonical SELECT list for the
// health_snapshots table, matching scanSnapshot's column order.
var healthSnapshotSelectColumns = pg.ColumnList{
	pgtable.HealthSnapshots.GameID,
	pgtable.HealthSnapshots.ContainerID,
	pgtable.HealthSnapshots.Status,
	pgtable.HealthSnapshots.Source,
	pgtable.HealthSnapshots.Details,
	pgtable.HealthSnapshots.ObservedAt,
}

// Upsert installs snapshot as the latest observation for snapshot.GameID.
// snapshot is validated through health.HealthSnapshot.Validate before the
// SQL is issued.
func (store *Store) Upsert(ctx context.Context, snapshot health.HealthSnapshot) error {
	if store == nil || store.db == nil {
		return errors.New("upsert health snapshot: nil store")
	}
	if err := snapshot.Validate(); err != nil {
		return fmt.Errorf("upsert health snapshot: %w", err)
	}

	operationCtx, cancel, err := sqlx.WithTimeout(ctx, "upsert health snapshot", store.operationTimeout)
	if err != nil {
		return err
	}
	defer cancel()

	details := emptyDetails
	if len(snapshot.Details) > 0 {
		details = string(snapshot.Details)
	}

	stmt := pgtable.HealthSnapshots.INSERT(
		pgtable.HealthSnapshots.GameID,
		pgtable.HealthSnapshots.ContainerID,
		pgtable.HealthSnapshots.Status,
		pgtable.HealthSnapshots.Source,
		pgtable.HealthSnapshots.Details,
		pgtable.HealthSnapshots.ObservedAt,
	).VALUES(
		snapshot.GameID,
		snapshot.ContainerID,
		string(snapshot.Status),
		string(snapshot.Source),
		details,
		snapshot.ObservedAt.UTC(),
	).ON_CONFLICT(pgtable.HealthSnapshots.GameID).DO_UPDATE(
		pg.SET(
			pgtable.HealthSnapshots.ContainerID.SET(pgtable.HealthSnapshots.EXCLUDED.ContainerID),
			pgtable.HealthSnapshots.Status.SET(pgtable.HealthSnapshots.EXCLUDED.Status),
			pgtable.HealthSnapshots.Source.SET(pgtable.HealthSnapshots.EXCLUDED.Source),
			pgtable.HealthSnapshots.Details.SET(pgtable.HealthSnapshots.EXCLUDED.Details),
			pgtable.HealthSnapshots.ObservedAt.SET(pgtable.HealthSnapshots.EXCLUDED.ObservedAt),
		),
	)

	query, args := stmt.Sql()
	if _, err := store.db.ExecContext(operationCtx, query, args...); err != nil {
		return fmt.Errorf("upsert health snapshot: %w", err)
	}
	return nil
}

// Get returns the latest snapshot for gameID. It returns
// runtime.ErrNotFound when no snapshot has been recorded yet.
func (store *Store) Get(ctx context.Context, gameID string) (health.HealthSnapshot, error) {
	if store == nil || store.db == nil {
		return health.HealthSnapshot{}, errors.New("get health snapshot: nil store")
	}
	if strings.TrimSpace(gameID) == "" {
		return health.HealthSnapshot{}, fmt.Errorf("get health snapshot: game id must not be empty")
	}

	operationCtx, cancel, err := sqlx.WithTimeout(ctx, "get health snapshot", store.operationTimeout)
	if err != nil {
		return health.HealthSnapshot{}, err
	}
	defer cancel()

	stmt := pg.SELECT(healthSnapshotSelectColumns).
		FROM(pgtable.HealthSnapshots).
		WHERE(pgtable.HealthSnapshots.GameID.EQ(pg.String(gameID)))

	query, args := stmt.Sql()
	row := store.db.QueryRowContext(operationCtx, query, args...)
	snapshot, err := scanSnapshot(row)
	if sqlx.IsNoRows(err) {
		return health.HealthSnapshot{}, runtime.ErrNotFound
	}
	if err != nil {
		return health.HealthSnapshot{}, fmt.Errorf("get health snapshot: %w", err)
	}
	return snapshot, nil
}

// rowScanner abstracts *sql.Row and *sql.Rows so scanSnapshot can be
// shared across both single-row reads and iterated reads.
type rowScanner interface {
	Scan(dest ...any) error
}

// scanSnapshot scans one health_snapshots row from rs.
func scanSnapshot(rs rowScanner) (health.HealthSnapshot, error) {
	var (
		gameID      string
		containerID string
		status      string
		source      string
		details     []byte
		observedAt  time.Time
	)
	if err := rs.Scan(
		&gameID,
		&containerID,
		&status,
		&source,
		&details,
		&observedAt,
	); err != nil {
		return health.HealthSnapshot{}, err
	}
	return health.HealthSnapshot{
		GameID:      gameID,
		ContainerID: containerID,
		Status:      health.SnapshotStatus(status),
		Source:      health.SnapshotSource(source),
		Details:     json.RawMessage(details),
		ObservedAt:  observedAt.UTC(),
	}, nil
}

// Ensure Store satisfies the ports.HealthSnapshotStore interface at
// compile time.
var _ ports.HealthSnapshotStore = (*Store)(nil)
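
A short usage sketch under the assumption of an already-opened `*sql.DB` (pool wiring and timeout value are illustrative): the upsert keeps exactly one row per game, and a missing row surfaces as `runtime.ErrNotFound`.

```go
package example

import (
	"context"
	"database/sql"
	"encoding/json"
	"errors"
	"log"
	"time"

	"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
	"galaxy/rtmanager/internal/domain/health"
	"galaxy/rtmanager/internal/domain/runtime"
)

// RecordProbeFailure is a hypothetical caller; it writes the latest
// observation for one game and reads it back.
func RecordProbeFailure(ctx context.Context, db *sql.DB) error {
	store, err := healthsnapshotstore.New(healthsnapshotstore.Config{
		DB:               db,
		OperationTimeout: 5 * time.Second, // illustrative
	})
	if err != nil {
		return err
	}

	snapshot := health.HealthSnapshot{
		GameID:      "game-001",
		ContainerID: "container-1",
		Status:      health.SnapshotStatusProbeFailed,
		Source:      health.SnapshotSourceProbe,
		Details:     json.RawMessage(`{"consecutive_failures":3}`),
		ObservedAt:  time.Now().UTC(),
	}
	if err := store.Upsert(ctx, snapshot); err != nil {
		return err
	}

	got, err := store.Get(ctx, "game-001")
	if errors.Is(err, runtime.ErrNotFound) {
		log.Print("no snapshot recorded yet")
		return nil
	}
	if err != nil {
		return err
	}
	log.Printf("latest status %s observed at %s", got.Status, got.ObservedAt)
	return nil
}
```
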
@@ -0,0 +1,157 @@
package healthsnapshotstore_test

import (
	"context"
	"encoding/json"
	"testing"
	"time"

	"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
	"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
	"galaxy/rtmanager/internal/domain/health"
	"galaxy/rtmanager/internal/domain/runtime"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestMain(m *testing.M) { pgtest.RunMain(m) }

func newStore(t *testing.T) *healthsnapshotstore.Store {
	t.Helper()
	pgtest.TruncateAll(t)
	store, err := healthsnapshotstore.New(healthsnapshotstore.Config{
		DB:               pgtest.Ensure(t).Pool(),
		OperationTimeout: pgtest.OperationTimeout,
	})
	require.NoError(t, err)
	return store
}

func probeFailedSnapshot(gameID string, observedAt time.Time) health.HealthSnapshot {
	return health.HealthSnapshot{
		GameID:      gameID,
		ContainerID: "container-1",
		Status:      health.SnapshotStatusProbeFailed,
		Source:      health.SnapshotSourceProbe,
		Details:     json.RawMessage(`{"consecutive_failures":3,"last_status":503,"last_error":"timeout"}`),
		ObservedAt:  observedAt,
	}
}

func TestUpsertAndGetRoundTrip(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	snapshot := probeFailedSnapshot("game-001",
		time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
	require.NoError(t, store.Upsert(ctx, snapshot))

	got, err := store.Get(ctx, "game-001")
	require.NoError(t, err)
	assert.Equal(t, snapshot.GameID, got.GameID)
	assert.Equal(t, snapshot.ContainerID, got.ContainerID)
	assert.Equal(t, snapshot.Status, got.Status)
	assert.Equal(t, snapshot.Source, got.Source)
	assert.JSONEq(t, string(snapshot.Details), string(got.Details))
	assert.True(t, snapshot.ObservedAt.Equal(got.ObservedAt))
	assert.Equal(t, time.UTC, got.ObservedAt.Location())
}

func TestUpsertOverwritesPriorSnapshot(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	first := probeFailedSnapshot("game-001",
		time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
	require.NoError(t, store.Upsert(ctx, first))

	second := health.HealthSnapshot{
		GameID:      "game-001",
		ContainerID: "container-2",
		Status:      health.SnapshotStatusHealthy,
		Source:      health.SnapshotSourceInspect,
		Details:     json.RawMessage(`{"restart_count":0,"state":"running"}`),
		ObservedAt:  first.ObservedAt.Add(time.Minute),
	}
	require.NoError(t, store.Upsert(ctx, second))

	got, err := store.Get(ctx, "game-001")
	require.NoError(t, err)
	assert.Equal(t, "container-2", got.ContainerID)
	assert.Equal(t, health.SnapshotStatusHealthy, got.Status)
	assert.Equal(t, health.SnapshotSourceInspect, got.Source)
	assert.JSONEq(t, string(second.Details), string(got.Details))
	assert.True(t, second.ObservedAt.Equal(got.ObservedAt))
}

func TestGetReturnsNotFound(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	_, err := store.Get(ctx, "game-missing")
	require.ErrorIs(t, err, runtime.ErrNotFound)
}

func TestUpsertEmptyDetailsRoundTripsAsEmptyObject(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	snapshot := probeFailedSnapshot("game-001",
		time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
	snapshot.Details = nil
	require.NoError(t, store.Upsert(ctx, snapshot))

	got, err := store.Get(ctx, "game-001")
	require.NoError(t, err)
	assert.JSONEq(t, "{}", string(got.Details),
		"empty json.RawMessage must round-trip as the SQL default {}, got %q",
		string(got.Details))
}

func TestUpsertValidatesSnapshot(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	tests := []struct {
		name   string
		mutate func(*health.HealthSnapshot)
	}{
		{"empty game id", func(s *health.HealthSnapshot) { s.GameID = "" }},
		{"unknown status", func(s *health.HealthSnapshot) { s.Status = "exotic" }},
		{"unknown source", func(s *health.HealthSnapshot) { s.Source = "exotic" }},
		{"zero observed at", func(s *health.HealthSnapshot) { s.ObservedAt = time.Time{} }},
		{"invalid json details", func(s *health.HealthSnapshot) {
			s.Details = json.RawMessage("not json")
		}},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			snapshot := probeFailedSnapshot("game-001",
				time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC))
			tt.mutate(&snapshot)
			err := store.Upsert(ctx, snapshot)
			require.Error(t, err)
		})
	}
}

func TestGetRejectsEmptyGameID(t *testing.T) {
	ctx := context.Background()
	store := newStore(t)

	_, err := store.Get(ctx, "")
	require.Error(t, err)
}

func TestNewRejectsNilDB(t *testing.T) {
	_, err := healthsnapshotstore.New(healthsnapshotstore.Config{OperationTimeout: time.Second})
	require.Error(t, err)
}

func TestNewRejectsNonPositiveTimeout(t *testing.T) {
	_, err := healthsnapshotstore.New(healthsnapshotstore.Config{
		DB: pgtest.Ensure(t).Pool(),
	})
	require.Error(t, err)
}
@@ -0,0 +1,209 @@
// Package pgtest exposes the testcontainers-backed PostgreSQL bootstrap
// shared by every Runtime Manager PG adapter test. The package is regular
// Go code — not a `_test.go` file — so it can be imported by the
// `_test.go` files in the three sibling store packages
// (`runtimerecordstore`, `operationlogstore`, `healthsnapshotstore`).
//
// No production code in `cmd/rtmanager` or in the runtime imports this
// package. The testcontainers-go dependency therefore stays out of the
// production binary's import graph.
package pgtest

import (
	"context"
	"database/sql"
	"net/url"
	"os"
	"sync"
	"testing"
	"time"

	"galaxy/postgres"
	"galaxy/rtmanager/internal/adapters/postgres/migrations"

	testcontainers "github.com/testcontainers/testcontainers-go"
	tcpostgres "github.com/testcontainers/testcontainers-go/modules/postgres"
	"github.com/testcontainers/testcontainers-go/wait"
)

const (
	postgresImage    = "postgres:16-alpine"
	superUser        = "galaxy"
	superPassword    = "galaxy"
	superDatabase    = "galaxy_rtmanager"
	serviceRole      = "rtmanagerservice"
	servicePassword  = "rtmanagerservice"
	serviceSchema    = "rtmanager"
	containerStartup = 90 * time.Second

	// OperationTimeout is the per-statement timeout used by every store
	// constructed via the per-package newStore helpers. Tests may pass a
	// smaller value if they need to assert deadline behaviour explicitly.
	OperationTimeout = 10 * time.Second
)

// Env holds the per-process container plus the *sql.DB pool already
// provisioned with the rtmanager schema, role, and migrations applied.
type Env struct {
	container *tcpostgres.PostgresContainer
	pool      *sql.DB
}

// Pool returns the shared pool. Tests truncate per-table state before
// each run via TruncateAll.
func (env *Env) Pool() *sql.DB { return env.pool }

var (
	once  sync.Once
	cur   *Env
	curEr error
)

// Ensure starts the PostgreSQL container on first invocation and applies
// the embedded goose migrations. Subsequent invocations reuse the same
// container/pool. When Docker is unavailable Ensure calls t.Skip with the
// underlying error so the test suite still passes on machines without
// Docker.
func Ensure(t testing.TB) *Env {
	t.Helper()
	once.Do(func() {
		cur, curEr = start()
	})
	if curEr != nil {
		t.Skipf("postgres container start failed (Docker unavailable?): %v", curEr)
	}
	return cur
}

// TruncateAll wipes every Runtime Manager table inside the shared pool,
// leaving the schema and indexes intact. Use it from each test that needs
// a clean slate.
func TruncateAll(t testing.TB) {
	t.Helper()
	env := Ensure(t)
	const stmt = `TRUNCATE TABLE runtime_records, operation_log, health_snapshots RESTART IDENTITY CASCADE`
	if _, err := env.pool.ExecContext(context.Background(), stmt); err != nil {
		t.Fatalf("truncate rtmanager tables: %v", err)
	}
}

// Shutdown terminates the shared container and closes the pool. It is
// invoked from each test package's TestMain after `m.Run` returns so the
// container is released even if individual tests panic.
func Shutdown() {
	if cur == nil {
		return
	}
	if cur.pool != nil {
		_ = cur.pool.Close()
	}
	if cur.container != nil {
		_ = testcontainers.TerminateContainer(cur.container)
	}
	cur = nil
}

// RunMain is a convenience helper for each store package's TestMain: it
// runs the test main, captures the exit code, shuts the container down,
// and exits. Wiring it through one helper keeps every TestMain to two
// lines.
func RunMain(m *testing.M) {
	code := m.Run()
	Shutdown()
	os.Exit(code)
}

func start() (*Env, error) {
	ctx := context.Background()
	container, err := tcpostgres.Run(ctx, postgresImage,
		tcpostgres.WithDatabase(superDatabase),
		tcpostgres.WithUsername(superUser),
		tcpostgres.WithPassword(superPassword),
		testcontainers.WithWaitStrategy(
			wait.ForLog("database system is ready to accept connections").
				WithOccurrence(2).
				WithStartupTimeout(containerStartup),
		),
	)
	if err != nil {
		return nil, err
	}
	baseDSN, err := container.ConnectionString(ctx, "sslmode=disable")
	if err != nil {
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	if err := provisionRoleAndSchema(ctx, baseDSN); err != nil {
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	scopedDSN, err := dsnForServiceRole(baseDSN)
	if err != nil {
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	cfg := postgres.DefaultConfig()
	cfg.PrimaryDSN = scopedDSN
	cfg.OperationTimeout = OperationTimeout
	pool, err := postgres.OpenPrimary(ctx, cfg)
	if err != nil {
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	if err := postgres.Ping(ctx, pool, OperationTimeout); err != nil {
		_ = pool.Close()
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	if err := postgres.RunMigrations(ctx, pool, migrations.FS(), "."); err != nil {
		_ = pool.Close()
		_ = testcontainers.TerminateContainer(container)
		return nil, err
	}
	return &Env{container: container, pool: pool}, nil
}

func provisionRoleAndSchema(ctx context.Context, baseDSN string) error {
	cfg := postgres.DefaultConfig()
	cfg.PrimaryDSN = baseDSN
	cfg.OperationTimeout = OperationTimeout
	db, err := postgres.OpenPrimary(ctx, cfg)
	if err != nil {
		return err
	}
	defer func() { _ = db.Close() }()

	statements := []string{
		`DO $$ BEGIN
			IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'rtmanagerservice') THEN
				CREATE ROLE rtmanagerservice LOGIN PASSWORD 'rtmanagerservice';
			END IF;
		END $$;`,
		`CREATE SCHEMA IF NOT EXISTS rtmanager AUTHORIZATION rtmanagerservice;`,
		`GRANT USAGE ON SCHEMA rtmanager TO rtmanagerservice;`,
	}
	for _, statement := range statements {
		if _, err := db.ExecContext(ctx, statement); err != nil {
			return err
		}
	}
	return nil
}

func dsnForServiceRole(baseDSN string) (string, error) {
	parsed, err := url.Parse(baseDSN)
	if err != nil {
		return "", err
	}
	values := url.Values{}
	values.Set("search_path", serviceSchema)
	values.Set("sslmode", "disable")
	scoped := url.URL{
		Scheme:   parsed.Scheme,
		User:     url.UserPassword(serviceRole, servicePassword),
		Host:     parsed.Host,
		Path:     parsed.Path,
		RawQuery: values.Encode(),
	}
	return scoped.String(), nil
}
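
The wiring the RunMain comment describes, spelled out for a hypothetical sibling store package (`operationlogstore` is named in the package doc; the test body itself is illustrative and assumes nothing beyond the pgtest API shown above).

```go
package operationlogstore_test

import (
	"context"
	"testing"

	"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
)

// The two-line TestMain RunMain is meant to enable: run the package's
// tests against the shared container, then tear it down.
func TestMain(m *testing.M) { pgtest.RunMain(m) }

func TestUsesSharedContainer(t *testing.T) {
	pgtest.TruncateAll(t)         // clean slate; skips via Ensure when Docker is absent
	db := pgtest.Ensure(t).Pool() // shared *sql.DB with role, schema, and migrations applied
	if err := db.PingContext(context.Background()); err != nil {
		t.Fatal(err)
	}
}
```
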
@@ -0,0 +1,112 @@
// Package sqlx contains the small set of helpers shared by every Runtime
// Manager PostgreSQL adapter (runtimerecordstore, operationlogstore,
// healthsnapshotstore). The helpers centralise the boundary translations
// for nullable timestamps and the pgx SQLSTATE codes the adapters
// interpret as domain conflicts.
package sqlx

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

// PgUniqueViolationCode identifies the SQLSTATE returned by PostgreSQL
// when a UNIQUE constraint is violated by INSERT or UPDATE.
const PgUniqueViolationCode = "23505"

// IsUniqueViolation reports whether err is a PostgreSQL unique-violation,
// regardless of constraint name.
func IsUniqueViolation(err error) bool {
	var pgErr *pgconn.PgError
	if !errors.As(err, &pgErr) {
		return false
	}
	return pgErr.Code == PgUniqueViolationCode
}

// IsNoRows reports whether err is sql.ErrNoRows.
func IsNoRows(err error) bool {
	return errors.Is(err, sql.ErrNoRows)
}

// NullableTime returns t.UTC() when non-zero, otherwise nil so the column
// is bound as SQL NULL.
func NullableTime(t time.Time) any {
	if t.IsZero() {
		return nil
	}
	return t.UTC()
}

// NullableTimePtr returns t.UTC() when t is non-nil and non-zero, otherwise
// nil. Companion of NullableTime for domain types that use *time.Time to
// express absent timestamps.
func NullableTimePtr(t *time.Time) any {
	if t == nil {
		return nil
	}
	return NullableTime(*t)
}

// NullableString returns value when non-empty, otherwise nil so the column
// is bound as SQL NULL. Used for Runtime Manager columns that map empty
// domain strings to NULL (current_container_id, current_image_ref).
func NullableString(value string) any {
	if value == "" {
		return nil
	}
	return value
}

// StringFromNullable copies an optional sql.NullString into a domain
// string. NULL becomes the empty string, matching the Runtime Manager
// domain convention that empty == NULL for nullable text columns.
func StringFromNullable(value sql.NullString) string {
	if !value.Valid {
		return ""
	}
	return value.String
}

// TimeFromNullable copies an optional sql.NullTime into a domain
// time.Time, applying the global UTC normalisation rule. NULL values
// become the zero time.Time.
func TimeFromNullable(value sql.NullTime) time.Time {
	if !value.Valid {
		return time.Time{}
	}
	return value.Time.UTC()
}

// TimePtrFromNullable copies an optional sql.NullTime into a domain
// *time.Time. NULL becomes nil; non-NULL values are wrapped after UTC
// normalisation.
func TimePtrFromNullable(value sql.NullTime) *time.Time {
	if !value.Valid {
		return nil
	}
	t := value.Time.UTC()
	return &t
}

// WithTimeout derives a child context bounded by timeout and prefixes
// context errors with operation. Callers must always invoke the returned
// cancel.
func WithTimeout(ctx context.Context, operation string, timeout time.Duration) (context.Context, context.CancelFunc, error) {
	if ctx == nil {
		return nil, nil, fmt.Errorf("%s: nil context", operation)
	}
	if err := ctx.Err(); err != nil {
		return nil, nil, fmt.Errorf("%s: %w", operation, err)
	}
	if timeout <= 0 {
		return nil, nil, fmt.Errorf("%s: operation timeout must be positive", operation)
	}
	bounded, cancel := context.WithTimeout(ctx, timeout)
	return bounded, cancel, nil
}
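
An illustrative adapter-side sketch (hypothetical function, SQL, and conflict error; the real stores build their statements with go-jet) showing how `WithTimeout`, `NullableTimePtr`, and `IsUniqueViolation` are intended to compose.

```go
package example

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
)

// errAlreadyExists stands in for whatever domain conflict error a real
// store maps a unique violation to.
var errAlreadyExists = errors.New("runtime record already exists")

func insertRecord(ctx context.Context, db *sql.DB, gameID string, startedAt *time.Time) error {
	// Bound the round trip; errors here are already prefixed with the operation name.
	opCtx, cancel, err := sqlx.WithTimeout(ctx, "insert runtime record", 5*time.Second)
	if err != nil {
		return err
	}
	defer cancel()

	// Illustrative, abbreviated SQL: a nil or zero startedAt binds as SQL NULL.
	_, err = db.ExecContext(opCtx,
		`INSERT INTO runtime_records (game_id, started_at) VALUES ($1, $2)`,
		gameID, sqlx.NullableTimePtr(startedAt),
	)
	if sqlx.IsUniqueViolation(err) {
		return errAlreadyExists
	}
	return err
}
```
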
@@ -0,0 +1,19 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package model

import (
	"time"
)

type GooseDbVersion struct {
	ID        int32 `sql:"primary_key"`
	VersionID int64
	IsApplied bool
	Tstamp    time.Time
}
@@ -0,0 +1,21 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package model

import (
	"time"
)

type HealthSnapshots struct {
	GameID      string `sql:"primary_key"`
	ContainerID string
	Status      string
	Source      string
	Details     string
	ObservedAt  time.Time
}
@@ -0,0 +1,27 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package model

import (
	"time"
)

type OperationLog struct {
	ID           int64 `sql:"primary_key"`
	GameID       string
	OpKind       string
	OpSource     string
	SourceRef    string
	ImageRef     string
	ContainerID  string
	Outcome      string
	ErrorCode    string
	ErrorMessage string
	StartedAt    time.Time
	FinishedAt   *time.Time
}
@@ -0,0 +1,27 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package model

import (
	"time"
)

type RuntimeRecords struct {
	GameID             string `sql:"primary_key"`
	Status             string
	CurrentContainerID *string
	CurrentImageRef    *string
	EngineEndpoint     string
	StatePath          string
	DockerNetwork      string
	StartedAt          *time.Time
	StoppedAt          *time.Time
	RemovedAt          *time.Time
	LastOpAt           time.Time
	CreatedAt          time.Time
}
@@ -0,0 +1,87 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package table

import (
	"github.com/go-jet/jet/v2/postgres"
)

var GooseDbVersion = newGooseDbVersionTable("rtmanager", "goose_db_version", "")

type gooseDbVersionTable struct {
	postgres.Table

	// Columns
	ID        postgres.ColumnInteger
	VersionID postgres.ColumnInteger
	IsApplied postgres.ColumnBool
	Tstamp    postgres.ColumnTimestamp

	AllColumns     postgres.ColumnList
	MutableColumns postgres.ColumnList
	DefaultColumns postgres.ColumnList
}

type GooseDbVersionTable struct {
	gooseDbVersionTable

	EXCLUDED gooseDbVersionTable
}

// AS creates new GooseDbVersionTable with assigned alias
func (a GooseDbVersionTable) AS(alias string) *GooseDbVersionTable {
	return newGooseDbVersionTable(a.SchemaName(), a.TableName(), alias)
}

// Schema creates new GooseDbVersionTable with assigned schema name
func (a GooseDbVersionTable) FromSchema(schemaName string) *GooseDbVersionTable {
	return newGooseDbVersionTable(schemaName, a.TableName(), a.Alias())
}

// WithPrefix creates new GooseDbVersionTable with assigned table prefix
func (a GooseDbVersionTable) WithPrefix(prefix string) *GooseDbVersionTable {
	return newGooseDbVersionTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}

// WithSuffix creates new GooseDbVersionTable with assigned table suffix
func (a GooseDbVersionTable) WithSuffix(suffix string) *GooseDbVersionTable {
	return newGooseDbVersionTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}

func newGooseDbVersionTable(schemaName, tableName, alias string) *GooseDbVersionTable {
	return &GooseDbVersionTable{
		gooseDbVersionTable: newGooseDbVersionTableImpl(schemaName, tableName, alias),
		EXCLUDED:            newGooseDbVersionTableImpl("", "excluded", ""),
	}
}

func newGooseDbVersionTableImpl(schemaName, tableName, alias string) gooseDbVersionTable {
	var (
		IDColumn        = postgres.IntegerColumn("id")
		VersionIDColumn = postgres.IntegerColumn("version_id")
		IsAppliedColumn = postgres.BoolColumn("is_applied")
		TstampColumn    = postgres.TimestampColumn("tstamp")
		allColumns      = postgres.ColumnList{IDColumn, VersionIDColumn, IsAppliedColumn, TstampColumn}
		mutableColumns  = postgres.ColumnList{VersionIDColumn, IsAppliedColumn, TstampColumn}
		defaultColumns  = postgres.ColumnList{TstampColumn}
	)

	return gooseDbVersionTable{
		Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),

		//Columns
		ID:        IDColumn,
		VersionID: VersionIDColumn,
		IsApplied: IsAppliedColumn,
		Tstamp:    TstampColumn,

		AllColumns:     allColumns,
		MutableColumns: mutableColumns,
		DefaultColumns: defaultColumns,
	}
}
@@ -0,0 +1,93 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package table

import (
	"github.com/go-jet/jet/v2/postgres"
)

var HealthSnapshots = newHealthSnapshotsTable("rtmanager", "health_snapshots", "")

type healthSnapshotsTable struct {
	postgres.Table

	// Columns
	GameID      postgres.ColumnString
	ContainerID postgres.ColumnString
	Status      postgres.ColumnString
	Source      postgres.ColumnString
	Details     postgres.ColumnString
	ObservedAt  postgres.ColumnTimestampz

	AllColumns     postgres.ColumnList
	MutableColumns postgres.ColumnList
	DefaultColumns postgres.ColumnList
}

type HealthSnapshotsTable struct {
	healthSnapshotsTable

	EXCLUDED healthSnapshotsTable
}

// AS creates new HealthSnapshotsTable with assigned alias
func (a HealthSnapshotsTable) AS(alias string) *HealthSnapshotsTable {
	return newHealthSnapshotsTable(a.SchemaName(), a.TableName(), alias)
}

// Schema creates new HealthSnapshotsTable with assigned schema name
func (a HealthSnapshotsTable) FromSchema(schemaName string) *HealthSnapshotsTable {
	return newHealthSnapshotsTable(schemaName, a.TableName(), a.Alias())
}

// WithPrefix creates new HealthSnapshotsTable with assigned table prefix
func (a HealthSnapshotsTable) WithPrefix(prefix string) *HealthSnapshotsTable {
	return newHealthSnapshotsTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}

// WithSuffix creates new HealthSnapshotsTable with assigned table suffix
func (a HealthSnapshotsTable) WithSuffix(suffix string) *HealthSnapshotsTable {
	return newHealthSnapshotsTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}

func newHealthSnapshotsTable(schemaName, tableName, alias string) *HealthSnapshotsTable {
	return &HealthSnapshotsTable{
		healthSnapshotsTable: newHealthSnapshotsTableImpl(schemaName, tableName, alias),
		EXCLUDED:             newHealthSnapshotsTableImpl("", "excluded", ""),
	}
}

func newHealthSnapshotsTableImpl(schemaName, tableName, alias string) healthSnapshotsTable {
	var (
		GameIDColumn      = postgres.StringColumn("game_id")
		ContainerIDColumn = postgres.StringColumn("container_id")
		StatusColumn      = postgres.StringColumn("status")
		SourceColumn      = postgres.StringColumn("source")
		DetailsColumn     = postgres.StringColumn("details")
		ObservedAtColumn  = postgres.TimestampzColumn("observed_at")
		allColumns        = postgres.ColumnList{GameIDColumn, ContainerIDColumn, StatusColumn, SourceColumn, DetailsColumn, ObservedAtColumn}
		mutableColumns    = postgres.ColumnList{ContainerIDColumn, StatusColumn, SourceColumn, DetailsColumn, ObservedAtColumn}
		defaultColumns    = postgres.ColumnList{ContainerIDColumn, DetailsColumn}
	)

	return healthSnapshotsTable{
		Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),

		//Columns
		GameID:      GameIDColumn,
		ContainerID: ContainerIDColumn,
		Status:      StatusColumn,
		Source:      SourceColumn,
		Details:     DetailsColumn,
		ObservedAt:  ObservedAtColumn,

		AllColumns:     allColumns,
		MutableColumns: mutableColumns,
		DefaultColumns: defaultColumns,
	}
}
@@ -0,0 +1,111 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package table

import (
	"github.com/go-jet/jet/v2/postgres"
)

var OperationLog = newOperationLogTable("rtmanager", "operation_log", "")

type operationLogTable struct {
	postgres.Table

	// Columns
	ID           postgres.ColumnInteger
	GameID       postgres.ColumnString
	OpKind       postgres.ColumnString
	OpSource     postgres.ColumnString
	SourceRef    postgres.ColumnString
	ImageRef     postgres.ColumnString
	ContainerID  postgres.ColumnString
	Outcome      postgres.ColumnString
	ErrorCode    postgres.ColumnString
	ErrorMessage postgres.ColumnString
	StartedAt    postgres.ColumnTimestampz
	FinishedAt   postgres.ColumnTimestampz

	AllColumns     postgres.ColumnList
	MutableColumns postgres.ColumnList
	DefaultColumns postgres.ColumnList
}

type OperationLogTable struct {
	operationLogTable

	EXCLUDED operationLogTable
}

// AS creates new OperationLogTable with assigned alias
func (a OperationLogTable) AS(alias string) *OperationLogTable {
	return newOperationLogTable(a.SchemaName(), a.TableName(), alias)
}

// Schema creates new OperationLogTable with assigned schema name
func (a OperationLogTable) FromSchema(schemaName string) *OperationLogTable {
	return newOperationLogTable(schemaName, a.TableName(), a.Alias())
}

// WithPrefix creates new OperationLogTable with assigned table prefix
func (a OperationLogTable) WithPrefix(prefix string) *OperationLogTable {
	return newOperationLogTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}

// WithSuffix creates new OperationLogTable with assigned table suffix
func (a OperationLogTable) WithSuffix(suffix string) *OperationLogTable {
	return newOperationLogTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}

func newOperationLogTable(schemaName, tableName, alias string) *OperationLogTable {
	return &OperationLogTable{
		operationLogTable: newOperationLogTableImpl(schemaName, tableName, alias),
		EXCLUDED:          newOperationLogTableImpl("", "excluded", ""),
	}
}

func newOperationLogTableImpl(schemaName, tableName, alias string) operationLogTable {
	var (
		IDColumn           = postgres.IntegerColumn("id")
		GameIDColumn       = postgres.StringColumn("game_id")
		OpKindColumn       = postgres.StringColumn("op_kind")
		OpSourceColumn     = postgres.StringColumn("op_source")
		SourceRefColumn    = postgres.StringColumn("source_ref")
		ImageRefColumn     = postgres.StringColumn("image_ref")
		ContainerIDColumn  = postgres.StringColumn("container_id")
		OutcomeColumn      = postgres.StringColumn("outcome")
		ErrorCodeColumn    = postgres.StringColumn("error_code")
		ErrorMessageColumn = postgres.StringColumn("error_message")
		StartedAtColumn    = postgres.TimestampzColumn("started_at")
		FinishedAtColumn   = postgres.TimestampzColumn("finished_at")
		allColumns         = postgres.ColumnList{IDColumn, GameIDColumn, OpKindColumn, OpSourceColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, OutcomeColumn, ErrorCodeColumn, ErrorMessageColumn, StartedAtColumn, FinishedAtColumn}
		mutableColumns     = postgres.ColumnList{GameIDColumn, OpKindColumn, OpSourceColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, OutcomeColumn, ErrorCodeColumn, ErrorMessageColumn, StartedAtColumn, FinishedAtColumn}
		defaultColumns     = postgres.ColumnList{IDColumn, SourceRefColumn, ImageRefColumn, ContainerIDColumn, ErrorCodeColumn, ErrorMessageColumn}
	)

	return operationLogTable{
		Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),

		//Columns
		ID:           IDColumn,
		GameID:       GameIDColumn,
		OpKind:       OpKindColumn,
		OpSource:     OpSourceColumn,
		SourceRef:    SourceRefColumn,
		ImageRef:     ImageRefColumn,
		ContainerID:  ContainerIDColumn,
		Outcome:      OutcomeColumn,
		ErrorCode:    ErrorCodeColumn,
		ErrorMessage: ErrorMessageColumn,
		StartedAt:    StartedAtColumn,
		FinishedAt:   FinishedAtColumn,

		AllColumns:     allColumns,
		MutableColumns: mutableColumns,
		DefaultColumns: defaultColumns,
	}
}
@@ -0,0 +1,111 @@
//
// Code generated by go-jet DO NOT EDIT.
//
// WARNING: Changes to this file may cause incorrect behavior
// and will be lost if the code is regenerated
//

package table

import (
	"github.com/go-jet/jet/v2/postgres"
)

var RuntimeRecords = newRuntimeRecordsTable("rtmanager", "runtime_records", "")

type runtimeRecordsTable struct {
	postgres.Table

	// Columns
	GameID             postgres.ColumnString
	Status             postgres.ColumnString
	CurrentContainerID postgres.ColumnString
	CurrentImageRef    postgres.ColumnString
	EngineEndpoint     postgres.ColumnString
	StatePath          postgres.ColumnString
	DockerNetwork      postgres.ColumnString
	StartedAt          postgres.ColumnTimestampz
	StoppedAt          postgres.ColumnTimestampz
	RemovedAt          postgres.ColumnTimestampz
	LastOpAt           postgres.ColumnTimestampz
	CreatedAt          postgres.ColumnTimestampz

	AllColumns     postgres.ColumnList
	MutableColumns postgres.ColumnList
	DefaultColumns postgres.ColumnList
}

type RuntimeRecordsTable struct {
	runtimeRecordsTable

	EXCLUDED runtimeRecordsTable
}

// AS creates new RuntimeRecordsTable with assigned alias
func (a RuntimeRecordsTable) AS(alias string) *RuntimeRecordsTable {
	return newRuntimeRecordsTable(a.SchemaName(), a.TableName(), alias)
}

// Schema creates new RuntimeRecordsTable with assigned schema name
func (a RuntimeRecordsTable) FromSchema(schemaName string) *RuntimeRecordsTable {
	return newRuntimeRecordsTable(schemaName, a.TableName(), a.Alias())
}

// WithPrefix creates new RuntimeRecordsTable with assigned table prefix
func (a RuntimeRecordsTable) WithPrefix(prefix string) *RuntimeRecordsTable {
	return newRuntimeRecordsTable(a.SchemaName(), prefix+a.TableName(), a.TableName())
}

// WithSuffix creates new RuntimeRecordsTable with assigned table suffix
func (a RuntimeRecordsTable) WithSuffix(suffix string) *RuntimeRecordsTable {
	return newRuntimeRecordsTable(a.SchemaName(), a.TableName()+suffix, a.TableName())
}

func newRuntimeRecordsTable(schemaName, tableName, alias string) *RuntimeRecordsTable {
	return &RuntimeRecordsTable{
		runtimeRecordsTable: newRuntimeRecordsTableImpl(schemaName, tableName, alias),
		EXCLUDED:            newRuntimeRecordsTableImpl("", "excluded", ""),
	}
}

func newRuntimeRecordsTableImpl(schemaName, tableName, alias string) runtimeRecordsTable {
	var (
		GameIDColumn             = postgres.StringColumn("game_id")
		StatusColumn             = postgres.StringColumn("status")
		CurrentContainerIDColumn = postgres.StringColumn("current_container_id")
		CurrentImageRefColumn    = postgres.StringColumn("current_image_ref")
		EngineEndpointColumn     = postgres.StringColumn("engine_endpoint")
		StatePathColumn          = postgres.StringColumn("state_path")
		DockerNetworkColumn      = postgres.StringColumn("docker_network")
		StartedAtColumn          = postgres.TimestampzColumn("started_at")
		StoppedAtColumn          = postgres.TimestampzColumn("stopped_at")
		RemovedAtColumn          = postgres.TimestampzColumn("removed_at")
		LastOpAtColumn           = postgres.TimestampzColumn("last_op_at")
		CreatedAtColumn          = postgres.TimestampzColumn("created_at")
		allColumns               = postgres.ColumnList{GameIDColumn, StatusColumn, CurrentContainerIDColumn, CurrentImageRefColumn, EngineEndpointColumn, StatePathColumn, DockerNetworkColumn, StartedAtColumn, StoppedAtColumn, RemovedAtColumn, LastOpAtColumn, CreatedAtColumn}
		mutableColumns           = postgres.ColumnList{StatusColumn, CurrentContainerIDColumn, CurrentImageRefColumn, EngineEndpointColumn, StatePathColumn, DockerNetworkColumn, StartedAtColumn, StoppedAtColumn, RemovedAtColumn, LastOpAtColumn, CreatedAtColumn}
		defaultColumns           = postgres.ColumnList{}
	)

	return runtimeRecordsTable{
		Table: postgres.NewTable(schemaName, tableName, alias, allColumns...),

		//Columns
		GameID:             GameIDColumn,
		Status:             StatusColumn,
		CurrentContainerID: CurrentContainerIDColumn,
		CurrentImageRef:    CurrentImageRefColumn,
		EngineEndpoint:     EngineEndpointColumn,
		StatePath:          StatePathColumn,
		DockerNetwork:      DockerNetworkColumn,
		StartedAt:          StartedAtColumn,
		StoppedAt:          StoppedAtColumn,
		RemovedAt:          RemovedAtColumn,
|
||||
LastOpAt: LastOpAtColumn,
|
||||
CreatedAt: CreatedAtColumn,
|
||||
|
||||
AllColumns: allColumns,
|
||||
MutableColumns: mutableColumns,
|
||||
DefaultColumns: defaultColumns,
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,17 @@
|
||||
//
|
||||
// Code generated by go-jet DO NOT EDIT.
|
||||
//
|
||||
// WARNING: Changes to this file may cause incorrect behavior
|
||||
// and will be lost if the code is regenerated
|
||||
//
|
||||
|
||||
package table
|
||||
|
||||
// UseSchema sets a new schema name for all generated table SQL builder types. It is recommended to invoke
|
||||
// this method only once at the beginning of the program.
|
||||
func UseSchema(schema string) {
|
||||
GooseDbVersion = GooseDbVersion.FromSchema(schema)
|
||||
HealthSnapshots = HealthSnapshots.FromSchema(schema)
|
||||
OperationLog = OperationLog.FromSchema(schema)
|
||||
RuntimeRecords = RuntimeRecords.FromSchema(schema)
|
||||
}
|
||||
@@ -0,0 +1,106 @@
|
||||
-- +goose Up
|
||||
-- Initial Runtime Manager PostgreSQL schema.
|
||||
--
|
||||
-- Three tables cover the durable surface of the service:
|
||||
-- * runtime_records — one row per game with the latest known runtime
|
||||
-- status and Docker container binding;
|
||||
-- * operation_log — append-only audit of every start/stop/restart/
|
||||
-- patch/cleanup/reconcile_* operation RTM performed;
|
||||
-- * health_snapshots — latest technical health observation per game.
|
||||
--
|
||||
-- Schema and the matching `rtmanagerservice` role are provisioned
|
||||
-- outside this script (in tests via cmd/jetgen/main.go::provisionRoleAndSchema;
|
||||
-- in production via an ops init script). This migration runs as the
|
||||
-- schema owner with `search_path=rtmanager` and only contains DDL for the
|
||||
-- service-owned tables and indexes. ARCHITECTURE.md §Database topology
|
||||
-- mandates that the per-service role's grants stay restricted to its own
|
||||
-- schema; consequently this file deliberately deviates from PLAN.md
|
||||
-- Stage 09's literal `CREATE SCHEMA IF NOT EXISTS rtmanager;` instruction.
|
||||
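-- For reference, the out-of-band provisioning amounts to something along
-- these lines (illustrative sketch only; the real statements live in the
-- ops init script and in cmd/jetgen for tests):
--
--   CREATE ROLE rtmanagerservice LOGIN;
--   CREATE SCHEMA IF NOT EXISTS rtmanager AUTHORIZATION rtmanagerservice;
--   ALTER ROLE rtmanagerservice SET search_path = rtmanager;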
|
||||
-- runtime_records holds one durable record per game with the latest
|
||||
-- known runtime status and Docker container binding. The status enum
|
||||
-- (running | stopped | removed) is enforced by a CHECK so domain code
|
||||
-- can rely on it without reading every callsite. The (status, last_op_at)
|
||||
-- index drives the periodic container-cleanup worker that scans
|
||||
-- `status='stopped' AND last_op_at < now() - retention`.
|
||||
CREATE TABLE runtime_records (
|
||||
game_id text PRIMARY KEY,
|
||||
status text NOT NULL,
|
||||
current_container_id text,
|
||||
current_image_ref text,
|
||||
engine_endpoint text NOT NULL,
|
||||
state_path text NOT NULL,
|
||||
docker_network text NOT NULL,
|
||||
started_at timestamptz,
|
||||
stopped_at timestamptz,
|
||||
removed_at timestamptz,
|
||||
last_op_at timestamptz NOT NULL,
|
||||
created_at timestamptz NOT NULL,
|
||||
CONSTRAINT runtime_records_status_chk
|
||||
CHECK (status IN ('running', 'stopped', 'removed'))
|
||||
);
|
||||
|
||||
CREATE INDEX runtime_records_status_last_op_idx
|
||||
ON runtime_records (status, last_op_at);
|
||||
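-- Illustrative shape of the scan that index serves (the cleanup worker builds
-- its own statement; only the predicate above is prescribed):
--
--   SELECT game_id, current_container_id FROM runtime_records
--    WHERE status = 'stopped' AND last_op_at < now() - <retention>;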
|
||||
-- operation_log is an append-only audit of every operation Runtime
|
||||
-- Manager performed against a game's runtime. The (game_id, started_at
|
||||
-- DESC) index drives audit reads from the GM/Admin REST surface;
|
||||
-- finished_at is nullable for in-flight rows even though Stage 13+
|
||||
-- always finalises the row in the same transaction. The op_kind /
|
||||
-- op_source / outcome enums are enforced by CHECK constraints to keep
|
||||
-- the audit schema honest without a separate Go validator.
|
||||
CREATE TABLE operation_log (
|
||||
id bigserial PRIMARY KEY,
|
||||
game_id text NOT NULL,
|
||||
op_kind text NOT NULL,
|
||||
op_source text NOT NULL,
|
||||
source_ref text NOT NULL DEFAULT '',
|
||||
image_ref text NOT NULL DEFAULT '',
|
||||
container_id text NOT NULL DEFAULT '',
|
||||
outcome text NOT NULL,
|
||||
error_code text NOT NULL DEFAULT '',
|
||||
error_message text NOT NULL DEFAULT '',
|
||||
started_at timestamptz NOT NULL,
|
||||
finished_at timestamptz,
|
||||
CONSTRAINT operation_log_op_kind_chk
|
||||
CHECK (op_kind IN (
|
||||
'start', 'stop', 'restart', 'patch',
|
||||
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
|
||||
)),
|
||||
CONSTRAINT operation_log_op_source_chk
|
||||
CHECK (op_source IN (
|
||||
'lobby_stream', 'gm_rest', 'admin_rest',
|
||||
'auto_ttl', 'auto_reconcile'
|
||||
)),
|
||||
CONSTRAINT operation_log_outcome_chk
|
||||
CHECK (outcome IN ('success', 'failure'))
|
||||
);
|
||||
|
||||
CREATE INDEX operation_log_game_started_idx
|
||||
ON operation_log (game_id, started_at DESC);
|
||||
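-- Illustrative audit read served by the index above (the Go adapter renders
-- the real statement through go-jet):
--
--   SELECT * FROM operation_log
--    WHERE game_id = $1
--    ORDER BY started_at DESC, id DESC
--    LIMIT $2;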
|
||||
-- health_snapshots stores the latest technical health observation per
|
||||
-- game. One row per game; later observations overwrite. The status enum
|
||||
-- mirrors the `event_type` vocabulary on `runtime:health_events`
|
||||
-- (collapsed to a flat status column for the latest-observation view).
|
||||
CREATE TABLE health_snapshots (
|
||||
game_id text PRIMARY KEY,
|
||||
container_id text NOT NULL DEFAULT '',
|
||||
status text NOT NULL,
|
||||
source text NOT NULL,
|
||||
details jsonb NOT NULL DEFAULT '{}'::jsonb,
|
||||
observed_at timestamptz NOT NULL,
|
||||
CONSTRAINT health_snapshots_status_chk
|
||||
CHECK (status IN (
|
||||
'healthy', 'probe_failed', 'exited',
|
||||
'oom', 'inspect_unhealthy', 'container_disappeared'
|
||||
)),
|
||||
CONSTRAINT health_snapshots_source_chk
|
||||
CHECK (source IN ('docker_event', 'inspect', 'probe'))
|
||||
);
|
||||
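-- "Later observations overwrite" is expected to land as an upsert keyed on
-- game_id, along these lines (illustrative only; the health-store adapter
-- owns the real statement):
--
--   INSERT INTO health_snapshots (game_id, container_id, status, source, details, observed_at)
--   VALUES ($1, $2, $3, $4, $5, $6)
--   ON CONFLICT (game_id) DO UPDATE
--     SET container_id = EXCLUDED.container_id,
--         status       = EXCLUDED.status,
--         source       = EXCLUDED.source,
--         details      = EXCLUDED.details,
--         observed_at  = EXCLUDED.observed_at;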
|
||||
-- +goose Down
|
||||
DROP TABLE IF EXISTS health_snapshots;
|
||||
DROP TABLE IF EXISTS operation_log;
|
||||
DROP TABLE IF EXISTS runtime_records;
|
||||
@@ -0,0 +1,19 @@
|
||||
// Package migrations exposes the embedded goose migration files used by
|
||||
// Runtime Manager to provision its `rtmanager` schema in PostgreSQL.
|
||||
//
|
||||
// The embedded filesystem is consumed by `pkg/postgres.RunMigrations`
|
||||
// during rtmanager-service startup and by `cmd/jetgen` when regenerating
|
||||
// the `internal/adapters/postgres/jet/` code against a transient
|
||||
// PostgreSQL instance.
|
||||
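//
// A consumer wires the embedded files in roughly like this (the
// RunMigrations signature is assumed here for illustration and is not
// defined in this package):
//
//	if err := postgres.RunMigrations(ctx, db, migrations.FS()); err != nil {
//		// handle startup failure
//	}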
package migrations
|
||||
|
||||
import "embed"
|
||||
|
||||
//go:embed *.sql
|
||||
var fs embed.FS
|
||||
|
||||
// FS returns the embedded filesystem containing every numbered goose
|
||||
// migration shipped with Runtime Manager.
|
||||
func FS() embed.FS {
|
||||
return fs
|
||||
}
|
||||
@@ -0,0 +1,235 @@
|
||||
// Package operationlogstore implements the PostgreSQL-backed adapter for
|
||||
// `ports.OperationLogStore`.
|
||||
//
|
||||
// The package owns the on-disk shape of the `operation_log` table defined
|
||||
// in
|
||||
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
|
||||
// and translates the schema-agnostic `ports.OperationLogStore` interface
|
||||
// declared in `internal/ports/operationlogstore.go` into concrete
|
||||
// go-jet/v2 statements driven by the pgx driver.
|
||||
//
|
||||
// Append uses `INSERT ... RETURNING id` to surface the bigserial id back
|
||||
// to callers; ListByGame is index-driven by `operation_log_game_started_idx`.
|
||||
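//
// Illustratively, Append lands as an INSERT of the explicit column list with
// RETURNING id (go-jet renders the exact SQL at runtime):
//
//	INSERT INTO operation_log (game_id, op_kind, op_source, ...)
//	VALUES ($1, $2, $3, ...) RETURNING id;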
package operationlogstore
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
|
||||
pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
pg "github.com/go-jet/jet/v2/postgres"
|
||||
)
|
||||
|
||||
// Config configures one PostgreSQL-backed operation-log store instance.
|
||||
type Config struct {
|
||||
// DB stores the connection pool the store uses for every query.
|
||||
DB *sql.DB
|
||||
|
||||
// OperationTimeout bounds one round trip.
|
||||
OperationTimeout time.Duration
|
||||
}
|
||||
|
||||
// Store persists Runtime Manager operation-log entries in PostgreSQL.
|
||||
type Store struct {
|
||||
db *sql.DB
|
||||
operationTimeout time.Duration
|
||||
}
|
||||
|
||||
// New constructs one PostgreSQL-backed operation-log store from cfg.
|
||||
func New(cfg Config) (*Store, error) {
|
||||
if cfg.DB == nil {
|
||||
return nil, errors.New("new postgres operation log store: db must not be nil")
|
||||
}
|
||||
if cfg.OperationTimeout <= 0 {
|
||||
return nil, errors.New("new postgres operation log store: operation timeout must be positive")
|
||||
}
|
||||
return &Store{
|
||||
db: cfg.DB,
|
||||
operationTimeout: cfg.OperationTimeout,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// operationLogSelectColumns is the canonical SELECT list for the
|
||||
// operation_log table, matching scanEntry's column order.
|
||||
var operationLogSelectColumns = pg.ColumnList{
|
||||
pgtable.OperationLog.ID,
|
||||
pgtable.OperationLog.GameID,
|
||||
pgtable.OperationLog.OpKind,
|
||||
pgtable.OperationLog.OpSource,
|
||||
pgtable.OperationLog.SourceRef,
|
||||
pgtable.OperationLog.ImageRef,
|
||||
pgtable.OperationLog.ContainerID,
|
||||
pgtable.OperationLog.Outcome,
|
||||
pgtable.OperationLog.ErrorCode,
|
||||
pgtable.OperationLog.ErrorMessage,
|
||||
pgtable.OperationLog.StartedAt,
|
||||
pgtable.OperationLog.FinishedAt,
|
||||
}
|
||||
|
||||
// Append inserts entry into the operation log and returns the generated
|
||||
// bigserial id. entry is validated through operation.OperationEntry.Validate
|
||||
// before the SQL is issued.
|
||||
func (store *Store) Append(ctx context.Context, entry operation.OperationEntry) (int64, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return 0, errors.New("append operation log entry: nil store")
|
||||
}
|
||||
if err := entry.Validate(); err != nil {
|
||||
return 0, fmt.Errorf("append operation log entry: %w", err)
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "append operation log entry", store.operationTimeout)
|
||||
if err != nil {
|
||||
return 0, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pgtable.OperationLog.INSERT(
|
||||
pgtable.OperationLog.GameID,
|
||||
pgtable.OperationLog.OpKind,
|
||||
pgtable.OperationLog.OpSource,
|
||||
pgtable.OperationLog.SourceRef,
|
||||
pgtable.OperationLog.ImageRef,
|
||||
pgtable.OperationLog.ContainerID,
|
||||
pgtable.OperationLog.Outcome,
|
||||
pgtable.OperationLog.ErrorCode,
|
||||
pgtable.OperationLog.ErrorMessage,
|
||||
pgtable.OperationLog.StartedAt,
|
||||
pgtable.OperationLog.FinishedAt,
|
||||
).VALUES(
|
||||
entry.GameID,
|
||||
string(entry.OpKind),
|
||||
string(entry.OpSource),
|
||||
entry.SourceRef,
|
||||
entry.ImageRef,
|
||||
entry.ContainerID,
|
||||
string(entry.Outcome),
|
||||
entry.ErrorCode,
|
||||
entry.ErrorMessage,
|
||||
entry.StartedAt.UTC(),
|
||||
sqlx.NullableTimePtr(entry.FinishedAt),
|
||||
).RETURNING(pgtable.OperationLog.ID)
|
||||
|
||||
query, args := stmt.Sql()
|
||||
row := store.db.QueryRowContext(operationCtx, query, args...)
|
||||
var id int64
|
||||
if err := row.Scan(&id); err != nil {
|
||||
return 0, fmt.Errorf("append operation log entry: %w", err)
|
||||
}
|
||||
return id, nil
|
||||
}
|
||||
|
||||
// ListByGame returns the most recent entries for gameID, ordered by
|
||||
// started_at descending and capped by limit. The (game_id,
|
||||
// started_at DESC) index drives the read.
|
||||
func (store *Store) ListByGame(ctx context.Context, gameID string, limit int) ([]operation.OperationEntry, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return nil, errors.New("list operation log entries by game: nil store")
|
||||
}
|
||||
if strings.TrimSpace(gameID) == "" {
|
||||
return nil, fmt.Errorf("list operation log entries by game: game id must not be empty")
|
||||
}
|
||||
if limit <= 0 {
|
||||
return nil, fmt.Errorf("list operation log entries by game: limit must be positive, got %d", limit)
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list operation log entries by game", store.operationTimeout)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pg.SELECT(operationLogSelectColumns).
|
||||
FROM(pgtable.OperationLog).
|
||||
WHERE(pgtable.OperationLog.GameID.EQ(pg.String(gameID))).
|
||||
ORDER_BY(pgtable.OperationLog.StartedAt.DESC(), pgtable.OperationLog.ID.DESC()).
|
||||
LIMIT(int64(limit))
|
||||
|
||||
query, args := stmt.Sql()
|
||||
rows, err := store.db.QueryContext(operationCtx, query, args...)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list operation log entries by game: %w", err)
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
entries := make([]operation.OperationEntry, 0)
|
||||
for rows.Next() {
|
||||
entry, err := scanEntry(rows)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list operation log entries by game: scan: %w", err)
|
||||
}
|
||||
entries = append(entries, entry)
|
||||
}
|
||||
if err := rows.Err(); err != nil {
|
||||
return nil, fmt.Errorf("list operation log entries by game: %w", err)
|
||||
}
|
||||
if len(entries) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
return entries, nil
|
||||
}
|
||||
|
||||
// rowScanner abstracts *sql.Row and *sql.Rows so scanEntry can be shared
|
||||
// across both single-row reads and iterated reads.
|
||||
type rowScanner interface {
|
||||
Scan(dest ...any) error
|
||||
}
|
||||
|
||||
// scanEntry scans one operation_log row from rs.
|
||||
func scanEntry(rs rowScanner) (operation.OperationEntry, error) {
|
||||
var (
|
||||
id int64
|
||||
gameID string
|
||||
opKind string
|
||||
opSource string
|
||||
sourceRef string
|
||||
imageRef string
|
||||
containerID string
|
||||
outcome string
|
||||
errorCode string
|
||||
errorMessage string
|
||||
startedAt time.Time
|
||||
finishedAt sql.NullTime
|
||||
)
|
||||
if err := rs.Scan(
|
||||
&id,
|
||||
&gameID,
|
||||
&opKind,
|
||||
&opSource,
|
||||
&sourceRef,
|
||||
&imageRef,
|
||||
&containerID,
|
||||
&outcome,
|
||||
&errorCode,
|
||||
&errorMessage,
|
||||
&startedAt,
|
||||
&finishedAt,
|
||||
); err != nil {
|
||||
return operation.OperationEntry{}, err
|
||||
}
|
||||
return operation.OperationEntry{
|
||||
ID: id,
|
||||
GameID: gameID,
|
||||
OpKind: operation.OpKind(opKind),
|
||||
OpSource: operation.OpSource(opSource),
|
||||
SourceRef: sourceRef,
|
||||
ImageRef: imageRef,
|
||||
ContainerID: containerID,
|
||||
Outcome: operation.Outcome(outcome),
|
||||
ErrorCode: errorCode,
|
||||
ErrorMessage: errorMessage,
|
||||
StartedAt: startedAt.UTC(),
|
||||
FinishedAt: sqlx.TimePtrFromNullable(finishedAt),
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Ensure Store satisfies the ports.OperationLogStore interface at compile
|
||||
// time.
|
||||
var _ ports.OperationLogStore = (*Store)(nil)
|
||||
@@ -0,0 +1,207 @@
|
||||
package operationlogstore_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func TestMain(m *testing.M) { pgtest.RunMain(m) }
|
||||
|
||||
func newStore(t *testing.T) *operationlogstore.Store {
|
||||
t.Helper()
|
||||
pgtest.TruncateAll(t)
|
||||
store, err := operationlogstore.New(operationlogstore.Config{
|
||||
DB: pgtest.Ensure(t).Pool(),
|
||||
OperationTimeout: pgtest.OperationTimeout,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return store
|
||||
}
|
||||
|
||||
func successStartEntry(gameID string, startedAt time.Time, sourceRef string) operation.OperationEntry {
|
||||
finishedAt := startedAt.Add(time.Second)
|
||||
return operation.OperationEntry{
|
||||
GameID: gameID,
|
||||
OpKind: operation.OpKindStart,
|
||||
OpSource: operation.OpSourceLobbyStream,
|
||||
SourceRef: sourceRef,
|
||||
ImageRef: "galaxy/game:v1.2.3",
|
||||
ContainerID: "container-1",
|
||||
Outcome: operation.OutcomeSuccess,
|
||||
StartedAt: startedAt,
|
||||
FinishedAt: &finishedAt,
|
||||
}
|
||||
}
|
||||
|
||||
func TestAppendReturnsPositiveIDs(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
startedAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
id1, err := store.Append(ctx, successStartEntry("game-001", startedAt, "1700000000000-0"))
|
||||
require.NoError(t, err)
|
||||
assert.Greater(t, id1, int64(0))
|
||||
|
||||
id2, err := store.Append(ctx, successStartEntry("game-001", startedAt.Add(time.Minute), "1700000000001-0"))
|
||||
require.NoError(t, err)
|
||||
assert.Greater(t, id2, id1)
|
||||
}
|
||||
|
||||
func TestAppendValidatesEntry(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
mutate func(*operation.OperationEntry)
|
||||
}{
|
||||
{"empty game id", func(e *operation.OperationEntry) { e.GameID = "" }},
|
||||
{"unknown op kind", func(e *operation.OperationEntry) { e.OpKind = "exotic" }},
|
||||
{"unknown op source", func(e *operation.OperationEntry) { e.OpSource = "exotic" }},
|
||||
{"unknown outcome", func(e *operation.OperationEntry) { e.Outcome = "exotic" }},
|
||||
{"zero started at", func(e *operation.OperationEntry) { e.StartedAt = time.Time{} }},
|
||||
{"failure without error code", func(e *operation.OperationEntry) {
|
||||
e.Outcome = operation.OutcomeFailure
|
||||
e.ErrorCode = ""
|
||||
}},
|
||||
}
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
entry := successStartEntry("game-001",
|
||||
time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC), "ref")
|
||||
tt.mutate(&entry)
|
||||
_, err := store.Append(ctx, entry)
|
||||
require.Error(t, err)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestListByGameReturnsEntriesNewestFirst(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
base := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
for index := range 3 {
|
||||
_, err := store.Append(ctx, successStartEntry("game-001",
|
||||
base.Add(time.Duration(index)*time.Minute),
|
||||
"ref-game-001-"))
|
||||
require.NoError(t, err)
|
||||
}
|
||||
// Foreign-game entry must not appear in the list.
|
||||
_, err := store.Append(ctx, successStartEntry("game-other", base, "ref-other"))
|
||||
require.NoError(t, err)
|
||||
|
||||
entries, err := store.ListByGame(ctx, "game-001", 10)
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 3)
|
||||
for index := range 2 {
|
||||
assert.True(t,
|
||||
!entries[index].StartedAt.Before(entries[index+1].StartedAt),
|
||||
"entries must be ordered started_at DESC; got %s before %s",
|
||||
entries[index].StartedAt, entries[index+1].StartedAt,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
func TestListByGameRespectsLimit(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
base := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
for index := range 5 {
|
||||
_, err := store.Append(ctx, successStartEntry("game-001",
|
||||
base.Add(time.Duration(index)*time.Minute), "ref"))
|
||||
require.NoError(t, err)
|
||||
}
|
||||
|
||||
entries, err := store.ListByGame(ctx, "game-001", 2)
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 2)
|
||||
}
|
||||
|
||||
func TestListByGameReturnsEmptyForUnknownGame(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
entries, err := store.ListByGame(ctx, "game-missing", 10)
|
||||
require.NoError(t, err)
|
||||
assert.Empty(t, entries)
|
||||
}
|
||||
|
||||
func TestListByGameRejectsInvalidArgs(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
_, err := store.ListByGame(ctx, "", 10)
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = store.ListByGame(ctx, "game-001", 0)
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = store.ListByGame(ctx, "game-001", -3)
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestAppendRoundTripsAllFields(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
startedAt := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
finishedAt := startedAt.Add(2 * time.Second)
|
||||
original := operation.OperationEntry{
|
||||
GameID: "game-001",
|
||||
OpKind: operation.OpKindStop,
|
||||
OpSource: operation.OpSourceGMRest,
|
||||
SourceRef: "request-7",
|
||||
ImageRef: "galaxy/game:v2.0.0",
|
||||
ContainerID: "container-X",
|
||||
Outcome: operation.OutcomeFailure,
|
||||
ErrorCode: "container_start_failed",
|
||||
ErrorMessage: "stop deadline exceeded",
|
||||
StartedAt: startedAt,
|
||||
FinishedAt: &finishedAt,
|
||||
}
|
||||
id, err := store.Append(ctx, original)
|
||||
require.NoError(t, err)
|
||||
|
||||
entries, err := store.ListByGame(ctx, "game-001", 10)
|
||||
require.NoError(t, err)
|
||||
require.Len(t, entries, 1)
|
||||
|
||||
got := entries[0]
|
||||
assert.Equal(t, id, got.ID)
|
||||
assert.Equal(t, original.GameID, got.GameID)
|
||||
assert.Equal(t, original.OpKind, got.OpKind)
|
||||
assert.Equal(t, original.OpSource, got.OpSource)
|
||||
assert.Equal(t, original.SourceRef, got.SourceRef)
|
||||
assert.Equal(t, original.ImageRef, got.ImageRef)
|
||||
assert.Equal(t, original.ContainerID, got.ContainerID)
|
||||
assert.Equal(t, original.Outcome, got.Outcome)
|
||||
assert.Equal(t, original.ErrorCode, got.ErrorCode)
|
||||
assert.Equal(t, original.ErrorMessage, got.ErrorMessage)
|
||||
assert.True(t, original.StartedAt.Equal(got.StartedAt))
|
||||
require.NotNil(t, got.FinishedAt)
|
||||
assert.True(t, original.FinishedAt.Equal(*got.FinishedAt))
|
||||
assert.Equal(t, time.UTC, got.StartedAt.Location())
|
||||
assert.Equal(t, time.UTC, got.FinishedAt.Location())
|
||||
}
|
||||
|
||||
func TestNewRejectsNilDB(t *testing.T) {
|
||||
_, err := operationlogstore.New(operationlogstore.Config{OperationTimeout: time.Second})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestNewRejectsNonPositiveTimeout(t *testing.T) {
|
||||
_, err := operationlogstore.New(operationlogstore.Config{
|
||||
DB: pgtest.Ensure(t).Pool(),
|
||||
})
|
||||
require.Error(t, err)
|
||||
}
|
||||
@@ -0,0 +1,500 @@
|
||||
// Package runtimerecordstore implements the PostgreSQL-backed adapter for
|
||||
// `ports.RuntimeRecordStore`.
|
||||
//
|
||||
// The package owns the on-disk shape of the `runtime_records` table
|
||||
// defined in
|
||||
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`
|
||||
// and translates the schema-agnostic `ports.RuntimeRecordStore` interface
|
||||
// declared in `internal/ports/runtimerecordstore.go` into concrete
|
||||
// go-jet/v2 statements driven by the pgx driver.
|
||||
//
|
||||
// Lifecycle transitions (UpdateStatus) use compare-and-swap on
|
||||
// `(status, current_container_id)` rather than holding a SELECT ... FOR
|
||||
// UPDATE lock across the caller's logic, mirroring the pattern used by
|
||||
// `lobby/internal/adapters/postgres/gamestore`.
|
||||
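//
// Illustratively, a running → stopped transition renders as an UPDATE whose
// WHERE clause carries the CAS predicate (go-jet owns the exact SQL):
//
//	UPDATE runtime_records
//	   SET status = 'stopped', last_op_at = $1, stopped_at = $1
//	 WHERE game_id = $2 AND status = 'running' AND current_container_id = $3;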
package runtimerecordstore
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/postgres/internal/sqlx"
|
||||
pgtable "galaxy/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
pg "github.com/go-jet/jet/v2/postgres"
|
||||
)
|
||||
|
||||
// Config configures one PostgreSQL-backed runtime-record store instance.
|
||||
// The store does not own the underlying *sql.DB lifecycle: the caller
|
||||
// (typically the service runtime) opens, instruments, migrates, and
|
||||
// closes the pool.
|
||||
type Config struct {
|
||||
// DB stores the connection pool the store uses for every query.
|
||||
DB *sql.DB
|
||||
|
||||
// OperationTimeout bounds one round trip. The store creates a
|
||||
// derived context for each operation so callers cannot starve the
|
||||
// pool with an unbounded ctx.
|
||||
OperationTimeout time.Duration
|
||||
}
|
||||
|
||||
// Store persists Runtime Manager runtime records in PostgreSQL.
|
||||
type Store struct {
|
||||
db *sql.DB
|
||||
operationTimeout time.Duration
|
||||
}
|
||||
|
||||
// New constructs one PostgreSQL-backed runtime-record store from cfg.
|
||||
func New(cfg Config) (*Store, error) {
|
||||
if cfg.DB == nil {
|
||||
return nil, errors.New("new postgres runtime record store: db must not be nil")
|
||||
}
|
||||
if cfg.OperationTimeout <= 0 {
|
||||
return nil, errors.New("new postgres runtime record store: operation timeout must be positive")
|
||||
}
|
||||
return &Store{
|
||||
db: cfg.DB,
|
||||
operationTimeout: cfg.OperationTimeout,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// runtimeSelectColumns is the canonical SELECT list for the runtime_records
|
||||
// table, matching scanRecord's column order.
|
||||
var runtimeSelectColumns = pg.ColumnList{
|
||||
pgtable.RuntimeRecords.GameID,
|
||||
pgtable.RuntimeRecords.Status,
|
||||
pgtable.RuntimeRecords.CurrentContainerID,
|
||||
pgtable.RuntimeRecords.CurrentImageRef,
|
||||
pgtable.RuntimeRecords.EngineEndpoint,
|
||||
pgtable.RuntimeRecords.StatePath,
|
||||
pgtable.RuntimeRecords.DockerNetwork,
|
||||
pgtable.RuntimeRecords.StartedAt,
|
||||
pgtable.RuntimeRecords.StoppedAt,
|
||||
pgtable.RuntimeRecords.RemovedAt,
|
||||
pgtable.RuntimeRecords.LastOpAt,
|
||||
pgtable.RuntimeRecords.CreatedAt,
|
||||
}
|
||||
|
||||
// Get returns the record identified by gameID. It returns
|
||||
// runtime.ErrNotFound when no record exists.
|
||||
func (store *Store) Get(ctx context.Context, gameID string) (runtime.RuntimeRecord, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return runtime.RuntimeRecord{}, errors.New("get runtime record: nil store")
|
||||
}
|
||||
if strings.TrimSpace(gameID) == "" {
|
||||
return runtime.RuntimeRecord{}, fmt.Errorf("get runtime record: game id must not be empty")
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "get runtime record", store.operationTimeout)
|
||||
if err != nil {
|
||||
return runtime.RuntimeRecord{}, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pg.SELECT(runtimeSelectColumns).
|
||||
FROM(pgtable.RuntimeRecords).
|
||||
WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))
|
||||
|
||||
query, args := stmt.Sql()
|
||||
row := store.db.QueryRowContext(operationCtx, query, args...)
|
||||
record, err := scanRecord(row)
|
||||
if sqlx.IsNoRows(err) {
|
||||
return runtime.RuntimeRecord{}, runtime.ErrNotFound
|
||||
}
|
||||
if err != nil {
|
||||
return runtime.RuntimeRecord{}, fmt.Errorf("get runtime record: %w", err)
|
||||
}
|
||||
return record, nil
|
||||
}
|
||||
|
||||
// Upsert inserts record when no row exists for record.GameID and
|
||||
// otherwise overwrites every mutable column verbatim. created_at is
|
||||
// preserved across upserts so the "first time RTM saw the game"
|
||||
// timestamp stays stable.
|
||||
func (store *Store) Upsert(ctx context.Context, record runtime.RuntimeRecord) error {
|
||||
if store == nil || store.db == nil {
|
||||
return errors.New("upsert runtime record: nil store")
|
||||
}
|
||||
if err := record.Validate(); err != nil {
|
||||
return fmt.Errorf("upsert runtime record: %w", err)
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "upsert runtime record", store.operationTimeout)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pgtable.RuntimeRecords.INSERT(
|
||||
pgtable.RuntimeRecords.GameID,
|
||||
pgtable.RuntimeRecords.Status,
|
||||
pgtable.RuntimeRecords.CurrentContainerID,
|
||||
pgtable.RuntimeRecords.CurrentImageRef,
|
||||
pgtable.RuntimeRecords.EngineEndpoint,
|
||||
pgtable.RuntimeRecords.StatePath,
|
||||
pgtable.RuntimeRecords.DockerNetwork,
|
||||
pgtable.RuntimeRecords.StartedAt,
|
||||
pgtable.RuntimeRecords.StoppedAt,
|
||||
pgtable.RuntimeRecords.RemovedAt,
|
||||
pgtable.RuntimeRecords.LastOpAt,
|
||||
pgtable.RuntimeRecords.CreatedAt,
|
||||
).VALUES(
|
||||
record.GameID,
|
||||
string(record.Status),
|
||||
sqlx.NullableString(record.CurrentContainerID),
|
||||
sqlx.NullableString(record.CurrentImageRef),
|
||||
record.EngineEndpoint,
|
||||
record.StatePath,
|
||||
record.DockerNetwork,
|
||||
sqlx.NullableTimePtr(record.StartedAt),
|
||||
sqlx.NullableTimePtr(record.StoppedAt),
|
||||
sqlx.NullableTimePtr(record.RemovedAt),
|
||||
record.LastOpAt.UTC(),
|
||||
record.CreatedAt.UTC(),
|
||||
).ON_CONFLICT(pgtable.RuntimeRecords.GameID).DO_UPDATE(
|
||||
pg.SET(
|
||||
pgtable.RuntimeRecords.Status.SET(pgtable.RuntimeRecords.EXCLUDED.Status),
|
||||
pgtable.RuntimeRecords.CurrentContainerID.SET(pgtable.RuntimeRecords.EXCLUDED.CurrentContainerID),
|
||||
pgtable.RuntimeRecords.CurrentImageRef.SET(pgtable.RuntimeRecords.EXCLUDED.CurrentImageRef),
|
||||
pgtable.RuntimeRecords.EngineEndpoint.SET(pgtable.RuntimeRecords.EXCLUDED.EngineEndpoint),
|
||||
pgtable.RuntimeRecords.StatePath.SET(pgtable.RuntimeRecords.EXCLUDED.StatePath),
|
||||
pgtable.RuntimeRecords.DockerNetwork.SET(pgtable.RuntimeRecords.EXCLUDED.DockerNetwork),
|
||||
pgtable.RuntimeRecords.StartedAt.SET(pgtable.RuntimeRecords.EXCLUDED.StartedAt),
|
||||
pgtable.RuntimeRecords.StoppedAt.SET(pgtable.RuntimeRecords.EXCLUDED.StoppedAt),
|
||||
pgtable.RuntimeRecords.RemovedAt.SET(pgtable.RuntimeRecords.EXCLUDED.RemovedAt),
|
||||
pgtable.RuntimeRecords.LastOpAt.SET(pgtable.RuntimeRecords.EXCLUDED.LastOpAt),
|
||||
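// created_at is deliberately left out of this SET list so the first-seen
// timestamp survives upserts (see the method comment above).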
),
|
||||
)
|
||||
|
||||
query, args := stmt.Sql()
|
||||
if _, err := store.db.ExecContext(operationCtx, query, args...); err != nil {
|
||||
return fmt.Errorf("upsert runtime record: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// UpdateStatus applies one status transition with a compare-and-swap
|
||||
// guard on (status, current_container_id). Validate is invoked before
|
||||
// any SQL is issued.
|
||||
func (store *Store) UpdateStatus(ctx context.Context, input ports.UpdateStatusInput) error {
|
||||
if store == nil || store.db == nil {
|
||||
return errors.New("update runtime status: nil store")
|
||||
}
|
||||
if err := input.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "update runtime status", store.operationTimeout)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
now := input.Now.UTC()
|
||||
stmt, err := buildUpdateStatusStatement(input, now)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
query, args := stmt.Sql()
|
||||
result, err := store.db.ExecContext(operationCtx, query, args...)
|
||||
if err != nil {
|
||||
return fmt.Errorf("update runtime status: %w", err)
|
||||
}
|
||||
affected, err := result.RowsAffected()
|
||||
if err != nil {
|
||||
return fmt.Errorf("update runtime status: rows affected: %w", err)
|
||||
}
|
||||
if affected == 0 {
|
||||
return store.classifyMissingUpdate(operationCtx, input.GameID)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// classifyMissingUpdate distinguishes ErrNotFound from ErrConflict after
|
||||
// an UPDATE that affected zero rows. A row that is absent yields
|
||||
// ErrNotFound; a row whose status or container_id does not match the
|
||||
// CAS predicate yields ErrConflict.
|
||||
func (store *Store) classifyMissingUpdate(ctx context.Context, gameID string) error {
|
||||
probe := pg.SELECT(pgtable.RuntimeRecords.Status).
|
||||
FROM(pgtable.RuntimeRecords).
|
||||
WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))
|
||||
probeQuery, probeArgs := probe.Sql()
|
||||
|
||||
var current string
|
||||
row := store.db.QueryRowContext(ctx, probeQuery, probeArgs...)
|
||||
if err := row.Scan(¤t); err != nil {
|
||||
if sqlx.IsNoRows(err) {
|
||||
return runtime.ErrNotFound
|
||||
}
|
||||
return fmt.Errorf("update runtime status: probe: %w", err)
|
||||
}
|
||||
return runtime.ErrConflict
|
||||
}
|
||||
|
||||
// buildUpdateStatusStatement assembles the UPDATE statement applied for
|
||||
// one runtime-status transition.
|
||||
//
|
||||
// status and last_op_at are always updated. The remaining columns are
|
||||
// driven by the destination:
|
||||
//
|
||||
// - StatusStopped: stopped_at is captured at Now.
|
||||
// - StatusRemoved: removed_at is captured at Now and current_container_id
|
||||
// is NULLed (the container is gone; the prior id remains observable
|
||||
// through operation_log).
|
||||
// - StatusRunning: only status + last_op_at change. Fresh started_at
|
||||
// and current_container_id are installed via Upsert before any
|
||||
// stopped → running transition reaches this path; the path exists
|
||||
// so runtime.AllowedTransitions stays one-to-one with the adapter
|
||||
// capability matrix even though v1 services use Upsert for this
|
||||
// case.
|
||||
func buildUpdateStatusStatement(input ports.UpdateStatusInput, now time.Time) (pg.UpdateStatement, error) {
|
||||
statusValue := pg.String(string(input.To))
|
||||
nowValue := pg.TimestampzT(now)
|
||||
|
||||
var stmt pg.UpdateStatement
|
||||
switch input.To {
|
||||
case runtime.StatusStopped:
|
||||
stmt = pgtable.RuntimeRecords.UPDATE(
|
||||
pgtable.RuntimeRecords.Status,
|
||||
pgtable.RuntimeRecords.LastOpAt,
|
||||
pgtable.RuntimeRecords.StoppedAt,
|
||||
).SET(
|
||||
statusValue,
|
||||
nowValue,
|
||||
nowValue,
|
||||
)
|
||||
case runtime.StatusRemoved:
|
||||
stmt = pgtable.RuntimeRecords.UPDATE(
|
||||
pgtable.RuntimeRecords.Status,
|
||||
pgtable.RuntimeRecords.LastOpAt,
|
||||
pgtable.RuntimeRecords.RemovedAt,
|
||||
pgtable.RuntimeRecords.CurrentContainerID,
|
||||
).SET(
|
||||
statusValue,
|
||||
nowValue,
|
||||
nowValue,
|
||||
pg.NULL,
|
||||
)
|
||||
case runtime.StatusRunning:
|
||||
stmt = pgtable.RuntimeRecords.UPDATE(
|
||||
pgtable.RuntimeRecords.Status,
|
||||
pgtable.RuntimeRecords.LastOpAt,
|
||||
).SET(
|
||||
statusValue,
|
||||
nowValue,
|
||||
)
|
||||
default:
|
||||
return nil, fmt.Errorf("update runtime status: destination status %q is unsupported", input.To)
|
||||
}
|
||||
|
||||
whereExpr := pg.AND(
|
||||
pgtable.RuntimeRecords.GameID.EQ(pg.String(input.GameID)),
|
||||
pgtable.RuntimeRecords.Status.EQ(pg.String(string(input.ExpectedFrom))),
|
||||
)
|
||||
if input.ExpectedContainerID != "" {
|
||||
whereExpr = pg.AND(
|
||||
whereExpr,
|
||||
pgtable.RuntimeRecords.CurrentContainerID.EQ(pg.String(input.ExpectedContainerID)),
|
||||
)
|
||||
}
|
||||
return stmt.WHERE(whereExpr), nil
|
||||
}
|
||||
|
||||
// ListByStatus returns every record currently indexed under status.
|
||||
// Ordering is last_op_at DESC, game_id ASC. The status filter and
|
||||
// last_op_at ordering are backed by the `runtime_records_status_last_op_idx` index; game_id is a deterministic tie-break.
|
||||
func (store *Store) ListByStatus(ctx context.Context, status runtime.Status) ([]runtime.RuntimeRecord, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return nil, errors.New("list runtime records by status: nil store")
|
||||
}
|
||||
if !status.IsKnown() {
|
||||
return nil, fmt.Errorf("list runtime records by status: status %q is unsupported", status)
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list runtime records by status", store.operationTimeout)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pg.SELECT(runtimeSelectColumns).
|
||||
FROM(pgtable.RuntimeRecords).
|
||||
WHERE(pgtable.RuntimeRecords.Status.EQ(pg.String(string(status)))).
|
||||
ORDER_BY(pgtable.RuntimeRecords.LastOpAt.DESC(), pgtable.RuntimeRecords.GameID.ASC())
|
||||
|
||||
query, args := stmt.Sql()
|
||||
rows, err := store.db.QueryContext(operationCtx, query, args...)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list runtime records by status: %w", err)
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
records := make([]runtime.RuntimeRecord, 0)
|
||||
for rows.Next() {
|
||||
record, err := scanRecord(rows)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list runtime records by status: scan: %w", err)
|
||||
}
|
||||
records = append(records, record)
|
||||
}
|
||||
if err := rows.Err(); err != nil {
|
||||
return nil, fmt.Errorf("list runtime records by status: %w", err)
|
||||
}
|
||||
if len(records) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
return records, nil
|
||||
}
|
||||
|
||||
// List returns every runtime record currently stored. Ordering matches
|
||||
// ListByStatus — last_op_at DESC, game_id ASC — so the REST list
|
||||
// endpoint sees the freshest activity first.
|
||||
func (store *Store) List(ctx context.Context) ([]runtime.RuntimeRecord, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return nil, errors.New("list runtime records: nil store")
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "list runtime records", store.operationTimeout)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
stmt := pg.SELECT(runtimeSelectColumns).
|
||||
FROM(pgtable.RuntimeRecords).
|
||||
ORDER_BY(pgtable.RuntimeRecords.LastOpAt.DESC(), pgtable.RuntimeRecords.GameID.ASC())
|
||||
|
||||
query, args := stmt.Sql()
|
||||
rows, err := store.db.QueryContext(operationCtx, query, args...)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list runtime records: %w", err)
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
records := make([]runtime.RuntimeRecord, 0)
|
||||
for rows.Next() {
|
||||
record, err := scanRecord(rows)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("list runtime records: scan: %w", err)
|
||||
}
|
||||
records = append(records, record)
|
||||
}
|
||||
if err := rows.Err(); err != nil {
|
||||
return nil, fmt.Errorf("list runtime records: %w", err)
|
||||
}
|
||||
if len(records) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
return records, nil
|
||||
}
|
||||
|
||||
// CountByStatus returns the number of records indexed under each status.
|
||||
// Statuses with zero records are present in the result with a zero
|
||||
// count so callers (e.g. the telemetry gauge) can publish a stable
|
||||
// label set on every reading.
|
||||
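//
// The grouped read is, roughly:
//
//	SELECT status, COUNT(*) AS count FROM runtime_records GROUP BY status;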
func (store *Store) CountByStatus(ctx context.Context) (map[runtime.Status]int, error) {
|
||||
if store == nil || store.db == nil {
|
||||
return nil, errors.New("count runtime records by status: nil store")
|
||||
}
|
||||
|
||||
operationCtx, cancel, err := sqlx.WithTimeout(ctx, "count runtime records by status", store.operationTimeout)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer cancel()
|
||||
|
||||
countAlias := pg.COUNT(pg.STAR).AS("count")
|
||||
stmt := pg.SELECT(pgtable.RuntimeRecords.Status, countAlias).
|
||||
FROM(pgtable.RuntimeRecords).
|
||||
GROUP_BY(pgtable.RuntimeRecords.Status)
|
||||
|
||||
query, args := stmt.Sql()
|
||||
rows, err := store.db.QueryContext(operationCtx, query, args...)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("count runtime records by status: %w", err)
|
||||
}
|
||||
defer rows.Close()
|
||||
|
||||
counts := make(map[runtime.Status]int, len(runtime.AllStatuses()))
|
||||
for _, status := range runtime.AllStatuses() {
|
||||
counts[status] = 0
|
||||
}
|
||||
for rows.Next() {
|
||||
var status string
|
||||
var count int
|
||||
if err := rows.Scan(&status, &count); err != nil {
|
||||
return nil, fmt.Errorf("count runtime records by status: scan: %w", err)
|
||||
}
|
||||
counts[runtime.Status(status)] = count
|
||||
}
|
||||
if err := rows.Err(); err != nil {
|
||||
return nil, fmt.Errorf("count runtime records by status: %w", err)
|
||||
}
|
||||
return counts, nil
|
||||
}
|
||||
|
||||
// rowScanner abstracts *sql.Row and *sql.Rows so scanRecord can be shared
|
||||
// across both single-row reads and iterated reads.
|
||||
type rowScanner interface {
|
||||
Scan(dest ...any) error
|
||||
}
|
||||
|
||||
// scanRecord scans one runtime_records row from rs. Returns sql.ErrNoRows
|
||||
// verbatim so callers can distinguish "no row" from a hard error.
|
||||
func scanRecord(rs rowScanner) (runtime.RuntimeRecord, error) {
|
||||
var (
|
||||
gameID string
|
||||
status string
|
||||
currentContainerID sql.NullString
|
||||
currentImageRef sql.NullString
|
||||
engineEndpoint string
|
||||
statePath string
|
||||
dockerNetwork string
|
||||
startedAt sql.NullTime
|
||||
stoppedAt sql.NullTime
|
||||
removedAt sql.NullTime
|
||||
lastOpAt time.Time
|
||||
createdAt time.Time
|
||||
)
|
||||
if err := rs.Scan(
|
||||
&gameID,
|
||||
&status,
|
||||
¤tContainerID,
|
||||
¤tImageRef,
|
||||
&engineEndpoint,
|
||||
&statePath,
|
||||
&dockerNetwork,
|
||||
&startedAt,
|
||||
&stoppedAt,
|
||||
&removedAt,
|
||||
&lastOpAt,
|
||||
&createdAt,
|
||||
); err != nil {
|
||||
return runtime.RuntimeRecord{}, err
|
||||
}
|
||||
return runtime.RuntimeRecord{
|
||||
GameID: gameID,
|
||||
Status: runtime.Status(status),
|
||||
CurrentContainerID: sqlx.StringFromNullable(currentContainerID),
|
||||
CurrentImageRef: sqlx.StringFromNullable(currentImageRef),
|
||||
EngineEndpoint: engineEndpoint,
|
||||
StatePath: statePath,
|
||||
DockerNetwork: dockerNetwork,
|
||||
StartedAt: sqlx.TimePtrFromNullable(startedAt),
|
||||
StoppedAt: sqlx.TimePtrFromNullable(stoppedAt),
|
||||
RemovedAt: sqlx.TimePtrFromNullable(removedAt),
|
||||
LastOpAt: lastOpAt.UTC(),
|
||||
CreatedAt: createdAt.UTC(),
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Ensure Store satisfies the ports.RuntimeRecordStore interface at
|
||||
// compile time.
|
||||
var _ ports.RuntimeRecordStore = (*Store)(nil)
|
||||
@@ -0,0 +1,420 @@
|
||||
package runtimerecordstore_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/postgres/internal/pgtest"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func TestMain(m *testing.M) { pgtest.RunMain(m) }
|
||||
|
||||
func newStore(t *testing.T) *runtimerecordstore.Store {
|
||||
t.Helper()
|
||||
pgtest.TruncateAll(t)
|
||||
store, err := runtimerecordstore.New(runtimerecordstore.Config{
|
||||
DB: pgtest.Ensure(t).Pool(),
|
||||
OperationTimeout: pgtest.OperationTimeout,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
return store
|
||||
}
|
||||
|
||||
func runningRecord(t *testing.T, gameID, containerID, imageRef string) runtime.RuntimeRecord {
|
||||
t.Helper()
|
||||
now := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC)
|
||||
started := now
|
||||
return runtime.RuntimeRecord{
|
||||
GameID: gameID,
|
||||
Status: runtime.StatusRunning,
|
||||
CurrentContainerID: containerID,
|
||||
CurrentImageRef: imageRef,
|
||||
EngineEndpoint: "http://galaxy-game-" + gameID + ":8080",
|
||||
StatePath: "/var/lib/galaxy/games/" + gameID,
|
||||
DockerNetwork: "galaxy-net",
|
||||
StartedAt: &started,
|
||||
LastOpAt: now,
|
||||
CreatedAt: now,
|
||||
}
|
||||
}
|
||||
|
||||
func TestUpsertAndGetRoundTrip(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
got, err := store.Get(ctx, record.GameID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, record.GameID, got.GameID)
|
||||
assert.Equal(t, record.Status, got.Status)
|
||||
assert.Equal(t, record.CurrentContainerID, got.CurrentContainerID)
|
||||
assert.Equal(t, record.CurrentImageRef, got.CurrentImageRef)
|
||||
assert.Equal(t, record.EngineEndpoint, got.EngineEndpoint)
|
||||
assert.Equal(t, record.StatePath, got.StatePath)
|
||||
assert.Equal(t, record.DockerNetwork, got.DockerNetwork)
|
||||
require.NotNil(t, got.StartedAt)
|
||||
assert.True(t, record.StartedAt.Equal(*got.StartedAt))
|
||||
assert.Equal(t, time.UTC, got.StartedAt.Location())
|
||||
assert.Equal(t, time.UTC, got.LastOpAt.Location())
|
||||
assert.Equal(t, time.UTC, got.CreatedAt.Location())
|
||||
assert.Nil(t, got.StoppedAt)
|
||||
assert.Nil(t, got.RemovedAt)
|
||||
}
|
||||
|
||||
func TestGetReturnsNotFound(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
_, err := store.Get(ctx, "game-missing")
|
||||
require.ErrorIs(t, err, runtime.ErrNotFound)
|
||||
}
|
||||
|
||||
func TestUpsertOverwritesMutableColumnsPreservesCreatedAt(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
original := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, original))
|
||||
|
||||
updated := original
|
||||
updated.CurrentContainerID = "container-2"
|
||||
updated.CurrentImageRef = "galaxy/game:v1.2.4"
|
||||
newStarted := original.LastOpAt.Add(time.Minute)
|
||||
updated.StartedAt = &newStarted
|
||||
updated.LastOpAt = newStarted
|
||||
// Fresh CreatedAt simulates a caller passing "now"; the store must
|
||||
// preserve the original CreatedAt value on conflict.
|
||||
updated.CreatedAt = newStarted
|
||||
|
||||
require.NoError(t, store.Upsert(ctx, updated))
|
||||
|
||||
got, err := store.Get(ctx, original.GameID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "container-2", got.CurrentContainerID)
|
||||
assert.Equal(t, "galaxy/game:v1.2.4", got.CurrentImageRef)
|
||||
assert.True(t, got.LastOpAt.Equal(newStarted))
|
||||
assert.True(t, got.CreatedAt.Equal(original.CreatedAt),
|
||||
"created_at must be preserved across upserts: got %s, want %s",
|
||||
got.CreatedAt, original.CreatedAt)
|
||||
}
|
||||
|
||||
func TestUpdateStatusRunningToStopped(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
now := record.LastOpAt.Add(2 * time.Minute)
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
ExpectedContainerID: record.CurrentContainerID,
|
||||
To: runtime.StatusStopped,
|
||||
Now: now,
|
||||
}))
|
||||
|
||||
got, err := store.Get(ctx, record.GameID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, runtime.StatusStopped, got.Status)
|
||||
require.NotNil(t, got.StoppedAt)
|
||||
assert.True(t, now.Equal(*got.StoppedAt))
|
||||
assert.True(t, now.Equal(got.LastOpAt))
|
||||
// container id is preserved on stop; cleanup later NULLs it.
|
||||
assert.Equal(t, record.CurrentContainerID, got.CurrentContainerID)
|
||||
}
|
||||
|
||||
func TestUpdateStatusRunningToRemovedClearsContainerID(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
now := record.LastOpAt.Add(time.Minute)
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusRemoved,
|
||||
Now: now,
|
||||
}))
|
||||
|
||||
got, err := store.Get(ctx, record.GameID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, runtime.StatusRemoved, got.Status)
|
||||
require.NotNil(t, got.RemovedAt)
|
||||
assert.True(t, now.Equal(*got.RemovedAt))
|
||||
assert.True(t, now.Equal(got.LastOpAt))
|
||||
assert.Empty(t, got.CurrentContainerID, "current_container_id must be NULL after removal")
|
||||
}
|
||||
|
||||
func TestUpdateStatusStoppedToRemoved(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
stopAt := record.LastOpAt.Add(time.Minute)
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
Now: stopAt,
|
||||
}))
|
||||
|
||||
removeAt := stopAt.Add(time.Hour)
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusStopped,
|
||||
To: runtime.StatusRemoved,
|
||||
Now: removeAt,
|
||||
}))
|
||||
|
||||
got, err := store.Get(ctx, record.GameID)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, runtime.StatusRemoved, got.Status)
|
||||
require.NotNil(t, got.RemovedAt)
|
||||
assert.True(t, removeAt.Equal(*got.RemovedAt))
|
||||
assert.True(t, removeAt.Equal(got.LastOpAt))
|
||||
require.NotNil(t, got.StoppedAt, "stopped_at must remain populated through removal")
|
||||
assert.True(t, stopAt.Equal(*got.StoppedAt))
|
||||
assert.Empty(t, got.CurrentContainerID)
|
||||
}
|
||||
|
||||
func TestUpdateStatusReturnsConflictOnFromMismatch(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusStopped, // wrong
|
||||
To: runtime.StatusRemoved,
|
||||
Now: record.LastOpAt.Add(time.Minute),
|
||||
})
|
||||
require.ErrorIs(t, err, runtime.ErrConflict)
|
||||
}
|
||||
|
||||
func TestUpdateStatusReturnsConflictOnContainerIDMismatch(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
ExpectedContainerID: "container-other",
|
||||
To: runtime.StatusStopped,
|
||||
Now: record.LastOpAt.Add(time.Minute),
|
||||
})
|
||||
require.ErrorIs(t, err, runtime.ErrConflict)
|
||||
}
|
||||
|
||||
func TestUpdateStatusReturnsNotFoundForMissing(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: "game-missing",
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
Now: time.Now().UTC(),
|
||||
})
|
||||
require.ErrorIs(t, err, runtime.ErrNotFound)
|
||||
}
|
||||
|
||||
func TestUpdateStatusValidatesInputBeforeStore(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
err := store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: "game-001",
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
// Now intentionally zero — validation must reject.
|
||||
})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
// TestUpdateStatusConcurrentCAS asserts the CAS guard: when two callers
|
||||
// race to apply the running → stopped transition on the same row,
|
||||
// exactly one wins (returns nil) and the other observes
|
||||
// runtime.ErrConflict.
|
||||
func TestUpdateStatusConcurrentCAS(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
record := runningRecord(t, "game-001", "container-1", "galaxy/game:v1.2.3")
|
||||
require.NoError(t, store.Upsert(ctx, record))
|
||||
|
||||
const concurrency = 8
|
||||
results := make([]error, concurrency)
|
||||
var wg sync.WaitGroup
|
||||
wg.Add(concurrency)
|
||||
for index := range concurrency {
|
||||
go func() {
|
||||
defer wg.Done()
|
||||
results[index] = store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: record.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
ExpectedContainerID: record.CurrentContainerID,
|
||||
To: runtime.StatusStopped,
|
||||
Now: record.LastOpAt.Add(time.Duration(index+1) * time.Second),
|
||||
})
|
||||
}()
|
||||
}
|
||||
wg.Wait()
|
||||
|
||||
wins, conflicts := 0, 0
|
||||
for _, err := range results {
|
||||
switch {
|
||||
case err == nil:
|
||||
wins++
|
||||
case errors.Is(err, runtime.ErrConflict):
|
||||
conflicts++
|
||||
default:
|
||||
t.Errorf("unexpected error from concurrent UpdateStatus: %v", err)
|
||||
}
|
||||
}
|
||||
assert.Equal(t, 1, wins, "exactly one caller must win the CAS race")
|
||||
assert.Equal(t, concurrency-1, conflicts, "the rest must observe runtime.ErrConflict")
|
||||
}
|
||||
|
||||
func TestListByStatusReturnsExpectedRecords(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
a := runningRecord(t, "game-aaa", "container-a", "galaxy/game:v1.2.3")
|
||||
b := runningRecord(t, "game-bbb", "container-b", "galaxy/game:v1.2.3")
|
||||
c := runningRecord(t, "game-ccc", "container-c", "galaxy/game:v1.2.3")
|
||||
for _, r := range []runtime.RuntimeRecord{a, b, c} {
|
||||
require.NoError(t, store.Upsert(ctx, r))
|
||||
}
|
||||
|
||||
stopAt := a.LastOpAt.Add(time.Minute)
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: b.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
Now: stopAt,
|
||||
}))
|
||||
|
||||
running, err := store.ListByStatus(ctx, runtime.StatusRunning)
|
||||
require.NoError(t, err)
|
||||
gotIDs := map[string]struct{}{}
|
||||
for _, r := range running {
|
||||
gotIDs[r.GameID] = struct{}{}
|
||||
}
|
||||
assert.Contains(t, gotIDs, a.GameID)
|
||||
assert.Contains(t, gotIDs, c.GameID)
|
||||
assert.NotContains(t, gotIDs, b.GameID)
|
||||
|
||||
stopped, err := store.ListByStatus(ctx, runtime.StatusStopped)
|
||||
require.NoError(t, err)
|
||||
require.Len(t, stopped, 1)
|
||||
assert.Equal(t, b.GameID, stopped[0].GameID)
|
||||
}
|
||||
|
||||
func TestListByStatusRejectsUnknown(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
_, err := store.ListByStatus(ctx, runtime.Status("exotic"))
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestListReturnsEveryStatus(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
a := runningRecord(t, "game-aaa", "container-a", "galaxy/game:v1.2.3")
|
||||
b := runningRecord(t, "game-bbb", "container-b", "galaxy/game:v1.2.3")
|
||||
c := runningRecord(t, "game-ccc", "container-c", "galaxy/game:v1.2.3")
|
||||
for _, r := range []runtime.RuntimeRecord{a, b, c} {
|
||||
require.NoError(t, store.Upsert(ctx, r))
|
||||
}
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: b.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
Now: b.LastOpAt.Add(time.Minute),
|
||||
}))
|
||||
|
||||
all, err := store.List(ctx)
|
||||
require.NoError(t, err)
|
||||
require.Len(t, all, 3)
|
||||
|
||||
gotIDs := map[string]runtime.Status{}
|
||||
for _, r := range all {
|
||||
gotIDs[r.GameID] = r.Status
|
||||
}
|
||||
assert.Equal(t, runtime.StatusRunning, gotIDs[a.GameID])
|
||||
assert.Equal(t, runtime.StatusStopped, gotIDs[b.GameID])
|
||||
assert.Equal(t, runtime.StatusRunning, gotIDs[c.GameID])
|
||||
}
|
||||
|
||||
func TestListReturnsNilWhenEmpty(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
all, err := store.List(ctx)
|
||||
require.NoError(t, err)
|
||||
assert.Nil(t, all)
|
||||
}
|
||||
|
||||
func TestCountByStatusReturnsAllBuckets(t *testing.T) {
|
||||
ctx := context.Background()
|
||||
store := newStore(t)
|
||||
|
||||
a := runningRecord(t, "game-1", "container-1", "galaxy/game:v1.2.3")
|
||||
b := runningRecord(t, "game-2", "container-2", "galaxy/game:v1.2.3")
|
||||
c := runningRecord(t, "game-3", "container-3", "galaxy/game:v1.2.3")
|
||||
for _, r := range []runtime.RuntimeRecord{a, b, c} {
|
||||
require.NoError(t, store.Upsert(ctx, r))
|
||||
}
|
||||
require.NoError(t, store.UpdateStatus(ctx, ports.UpdateStatusInput{
|
||||
GameID: b.GameID,
|
||||
ExpectedFrom: runtime.StatusRunning,
|
||||
To: runtime.StatusStopped,
|
||||
Now: b.LastOpAt.Add(time.Minute),
|
||||
}))
|
||||
|
||||
counts, err := store.CountByStatus(ctx)
|
||||
require.NoError(t, err)
|
||||
|
||||
for _, status := range runtime.AllStatuses() {
|
||||
_, ok := counts[status]
|
||||
assert.True(t, ok, "status %q must appear in counts even when zero", status)
|
||||
}
|
||||
assert.Equal(t, 2, counts[runtime.StatusRunning])
|
||||
assert.Equal(t, 1, counts[runtime.StatusStopped])
|
||||
assert.Equal(t, 0, counts[runtime.StatusRemoved])
|
||||
}
|
||||
|
||||
func TestNewRejectsNilDB(t *testing.T) {
|
||||
_, err := runtimerecordstore.New(runtimerecordstore.Config{OperationTimeout: time.Second})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestNewRejectsNonPositiveTimeout(t *testing.T) {
|
||||
_, err := runtimerecordstore.New(runtimerecordstore.Config{
|
||||
DB: pgtest.Ensure(t).Pool(),
|
||||
})
|
||||
require.Error(t, err)
|
||||
}
|
||||
@@ -0,0 +1,117 @@
|
||||
// Package gamelease implements the Redis-backed adapter for
|
||||
// `ports.GameLeaseStore`.
|
||||
//
|
||||
// The lease guards every lifecycle operation Runtime Manager runs
|
||||
// against one game (start, stop, restart, patch, cleanup, plus the
|
||||
// reconciler's drift mutations). Acquisition uses `SET NX PX <ttl>`
|
||||
// with a random caller token; release runs a Lua compare-and-delete
|
||||
// so a holder that lost the lease through TTL expiry cannot wipe
|
||||
// another caller's claim.
|
||||
package gamelease
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/redisstate"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// releaseScript removes the per-game lease only when the supplied token
|
||||
// still owns it. Compare-and-delete prevents a TTL-expired holder from
|
||||
// clearing another caller's claim.
|
||||
var releaseScript = redis.NewScript(`
|
||||
if redis.call("GET", KEYS[1]) == ARGV[1] then
|
||||
return redis.call("DEL", KEYS[1])
|
||||
end
|
||||
return 0
|
||||
`)
|
||||
|
||||
// Config configures one Redis-backed game lease store instance. The
|
||||
// store does not own the redis client lifecycle; the caller (typically
|
||||
// the service runtime) opens and closes it.
|
||||
type Config struct {
|
||||
// Client stores the Redis client the store uses for every command.
|
||||
Client *redis.Client
|
||||
}
|
||||
|
||||
// Store persists the per-game lifecycle lease in Redis.
|
||||
type Store struct {
|
||||
client *redis.Client
|
||||
keys redisstate.Keyspace
|
||||
}
|
||||
|
||||
// New constructs one Redis-backed game lease store from cfg.
|
||||
func New(cfg Config) (*Store, error) {
|
||||
if cfg.Client == nil {
|
||||
return nil, errors.New("new rtmanager game lease store: nil redis client")
|
||||
}
|
||||
return &Store{
|
||||
client: cfg.Client,
|
||||
keys: redisstate.Keyspace{},
|
||||
}, nil
|
||||
}
|
||||
|
||||
// TryAcquire attempts to acquire the per-game lease for gameID owned by
|
||||
// token for ttl. The acquired return is true on a successful claim and
|
||||
// false when another caller still owns the lease. A non-nil error
|
||||
// reports a transport failure and must not be confused with a missed
|
||||
// lease.
|
||||
func (store *Store) TryAcquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error) {
|
||||
if store == nil || store.client == nil {
|
||||
return false, errors.New("try acquire game lease: nil store")
|
||||
}
|
||||
if ctx == nil {
|
||||
return false, errors.New("try acquire game lease: nil context")
|
||||
}
|
||||
if strings.TrimSpace(gameID) == "" {
|
||||
return false, errors.New("try acquire game lease: game id must not be empty")
|
||||
}
|
||||
if strings.TrimSpace(token) == "" {
|
||||
return false, errors.New("try acquire game lease: token must not be empty")
|
||||
}
|
||||
if ttl <= 0 {
|
||||
return false, errors.New("try acquire game lease: ttl must be positive")
|
||||
}
|
||||
|
||||
acquired, err := store.client.SetNX(ctx, store.keys.GameLease(gameID), token, ttl).Result()
|
||||
if err != nil {
|
||||
return false, fmt.Errorf("try acquire game lease: %w", err)
|
||||
}
|
||||
return acquired, nil
|
||||
}
|
||||
|
||||
// Release removes the per-game lease for gameID only when token still
|
||||
// matches the stored owner value. A token mismatch is a silent no-op.
|
||||
func (store *Store) Release(ctx context.Context, gameID, token string) error {
|
||||
if store == nil || store.client == nil {
|
||||
return errors.New("release game lease: nil store")
|
||||
}
|
||||
if ctx == nil {
|
||||
return errors.New("release game lease: nil context")
|
||||
}
|
||||
if strings.TrimSpace(gameID) == "" {
|
||||
return errors.New("release game lease: game id must not be empty")
|
||||
}
|
||||
if strings.TrimSpace(token) == "" {
|
||||
return errors.New("release game lease: token must not be empty")
|
||||
}
|
||||
|
||||
if err := releaseScript.Run(
|
||||
ctx,
|
||||
store.client,
|
||||
[]string{store.keys.GameLease(gameID)},
|
||||
token,
|
||||
).Err(); err != nil {
|
||||
return fmt.Errorf("release game lease: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Compile-time assertion: Store implements ports.GameLeaseStore.
|
||||
var _ ports.GameLeaseStore = (*Store)(nil)
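// Illustrative usage sketch (not part of this package): a lifecycle
// operation is expected to bracket its work with TryAcquire / Release,
// using a fresh random token per attempt. The names leases, newToken,
// doStop and the 30s TTL below are hypothetical, not defined here.
//
//	token := newToken()
//	acquired, err := leases.TryAcquire(ctx, gameID, token, 30*time.Second)
//	if err != nil {
//		return fmt.Errorf("acquire game lease: %w", err)
//	}
//	if !acquired {
//		return errors.New("game is busy: another lifecycle operation holds the lease")
//	}
//	defer func() { _ = leases.Release(ctx, gameID, token) }()
//	return doStop(ctx, gameID)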
|
||||
@@ -0,0 +1,133 @@
|
||||
package gamelease_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/redisstate"
|
||||
"galaxy/rtmanager/internal/adapters/redisstate/gamelease"
|
||||
|
||||
"github.com/alicebob/miniredis/v2"
|
||||
"github.com/redis/go-redis/v9"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func newLeaseStore(t *testing.T) (*gamelease.Store, *miniredis.Miniredis) {
|
||||
t.Helper()
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
store, err := gamelease.New(gamelease.Config{Client: client})
|
||||
require.NoError(t, err)
|
||||
return store, server
|
||||
}
|
||||
|
||||
func TestNewRejectsNilClient(t *testing.T) {
|
||||
_, err := gamelease.New(gamelease.Config{})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestTryAcquireSetsKeyAndTTL(t *testing.T) {
|
||||
store, server := newLeaseStore(t)
|
||||
|
||||
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
|
||||
require.NoError(t, err)
|
||||
assert.True(t, acquired)
|
||||
|
||||
key := redisstate.Keyspace{}.GameLease("game-1")
|
||||
assert.True(t, server.Exists(key), "key %q must exist after TryAcquire", key)
|
||||
|
||||
stored, err := server.Get(key)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "token-A", stored)
|
||||
|
||||
// TTL must be positive (miniredis returns the remaining duration).
|
||||
ttl := server.TTL(key)
|
||||
assert.Greater(t, ttl, time.Duration(0))
|
||||
}
|
||||
|
||||
func TestTryAcquireReturnsFalseWhenAlreadyHeld(t *testing.T) {
|
||||
store, _ := newLeaseStore(t)
|
||||
|
||||
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
|
||||
require.NoError(t, err)
|
||||
require.True(t, acquired)
|
||||
|
||||
acquired, err = store.TryAcquire(context.Background(), "game-1", "token-B", time.Minute)
|
||||
require.NoError(t, err)
|
||||
assert.False(t, acquired)
|
||||
}
|
||||
|
||||
func TestReleaseRemovesKeyForOwnerToken(t *testing.T) {
|
||||
store, server := newLeaseStore(t)
|
||||
|
||||
_, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
|
||||
require.NoError(t, err)
|
||||
|
||||
require.NoError(t, store.Release(context.Background(), "game-1", "token-A"))
|
||||
|
||||
key := redisstate.Keyspace{}.GameLease("game-1")
|
||||
assert.False(t, server.Exists(key), "key %q must be deleted after Release", key)
|
||||
}
|
||||
|
||||
func TestReleaseIsNoOpForForeignToken(t *testing.T) {
|
||||
store, server := newLeaseStore(t)
|
||||
|
||||
_, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
|
||||
require.NoError(t, err)
|
||||
|
||||
require.NoError(t, store.Release(context.Background(), "game-1", "token-B"))
|
||||
|
||||
key := redisstate.Keyspace{}.GameLease("game-1")
|
||||
assert.True(t, server.Exists(key), "key %q must still exist when foreign token is released", key)
|
||||
|
||||
stored, err := server.Get(key)
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "token-A", stored)
|
||||
}
|
||||
|
||||
func TestTryAcquireSucceedsAfterTTLExpiry(t *testing.T) {
|
||||
store, server := newLeaseStore(t)
|
||||
|
||||
acquired, err := store.TryAcquire(context.Background(), "game-1", "token-A", time.Minute)
|
||||
require.NoError(t, err)
|
||||
require.True(t, acquired)
|
||||
|
||||
server.FastForward(2 * time.Minute)
|
||||
|
||||
acquired, err = store.TryAcquire(context.Background(), "game-1", "token-B", time.Minute)
|
||||
require.NoError(t, err)
|
||||
assert.True(t, acquired)
|
||||
}
|
||||
|
||||
func TestTryAcquireRejectsInvalidArguments(t *testing.T) {
|
||||
store, _ := newLeaseStore(t)
|
||||
|
||||
_, err := store.TryAcquire(context.Background(), "", "token", time.Minute)
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = store.TryAcquire(context.Background(), "game-1", "", time.Minute)
|
||||
require.Error(t, err)
|
||||
|
||||
_, err = store.TryAcquire(context.Background(), "game-1", "token", 0)
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestReleaseRejectsInvalidArguments(t *testing.T) {
|
||||
store, _ := newLeaseStore(t)
|
||||
|
||||
require.Error(t, store.Release(context.Background(), "", "token"))
|
||||
require.Error(t, store.Release(context.Background(), "game-1", ""))
|
||||
}
|
||||
|
||||
func TestKeyspaceGameLeaseIsPrefixedAndEncoded(t *testing.T) {
|
||||
key := redisstate.Keyspace{}.GameLease("game with spaces")
|
||||
assert.NotEmpty(t, key)
|
||||
assert.Contains(t, key, "rtmanager:game_lease:")
|
||||
suffix := key[len("rtmanager:game_lease:"):]
|
||||
// base64url-encoded suffix must not contain the original spaces.
|
||||
assert.NotContains(t, suffix, " ")
|
||||
}
|
||||
@@ -0,0 +1,44 @@
// Package redisstate hosts the Runtime Manager Redis adapters that share
// a single keyspace. Each sibling subpackage (e.g. `streamoffsets`)
// implements one port and uses Keyspace to compose its keys, so the
// Redis namespace stays documented in one place and under one prefix.
//
// The package itself only declares the keyspace; concrete stores live in
// nested packages so dependencies (testcontainers, miniredis) stay out
// of consumer build graphs that do not need them.
package redisstate

import "encoding/base64"

// defaultPrefix is the mandatory `rtmanager:` namespace prefix shared by
// every Runtime Manager Redis key.
const defaultPrefix = "rtmanager:"

// Keyspace builds the Runtime Manager Redis keys. The namespace covers
// the stream consumer offsets and the per-game lifecycle lease in v1.
//
// Dynamic key segments are encoded with base64url so raw key structure
// does not depend on caller-provided characters; this matches the
// encoding chosen by `lobby/internal/adapters/redisstate.Keyspace`.
type Keyspace struct{}

// StreamOffset returns the Redis key that stores the last successfully
// processed entry id for one Redis Stream consumer. The streamLabel is
// the short logical identifier of the consumer (e.g. `start_jobs`,
// `stop_jobs`), not the full stream name; it stays stable when the
// underlying stream key is renamed.
func (Keyspace) StreamOffset(streamLabel string) string {
	return defaultPrefix + "stream_offsets:" + encodeKeyComponent(streamLabel)
}

// GameLease returns the Redis key that stores the per-game lifecycle
// lease guarding start / stop / restart / patch / cleanup operations
// against the same game. The gameID is base64url-encoded so callers can
// pass any opaque identifier without escaping raw key characters.
func (Keyspace) GameLease(gameID string) string {
	return defaultPrefix + "game_lease:" + encodeKeyComponent(gameID)
}

func encodeKeyComponent(value string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(value))
}
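// Worked example (illustrative values): base64url without padding encodes
// "game-001" as "Z2FtZS0wMDE" and "start_jobs" as "c3RhcnRfam9icw", so
// Keyspace{}.GameLease("game-001") yields "rtmanager:game_lease:Z2FtZS0wMDE"
// and Keyspace{}.StreamOffset("start_jobs") yields
// "rtmanager:stream_offsets:c3RhcnRfam9icw".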
|
||||
@@ -0,0 +1,94 @@
|
||||
// Package streamoffsets implements the Redis-backed adapter for
|
||||
// `ports.StreamOffsetStore`.
|
||||
//
|
||||
// The start-jobs and stop-jobs consumers call Load on startup to
|
||||
// resume from the persisted offset and Save after every successful
|
||||
// message handling. Keys are produced by
|
||||
// `redisstate.Keyspace.StreamOffset`, mirroring the lobby pattern.
|
||||
package streamoffsets
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"strings"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/redisstate"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// Config configures one Redis-backed stream-offset store instance. The
|
||||
// store does not own the redis client lifecycle; the caller (typically
|
||||
// the service runtime) opens and closes it.
|
||||
type Config struct {
|
||||
// Client stores the Redis client the store uses for every command.
|
||||
Client *redis.Client
|
||||
}
|
||||
|
||||
// Store persists Runtime Manager stream consumer offsets in Redis.
|
||||
type Store struct {
|
||||
client *redis.Client
|
||||
keys redisstate.Keyspace
|
||||
}
|
||||
|
||||
// New constructs one Redis-backed stream-offset store from cfg.
|
||||
func New(cfg Config) (*Store, error) {
|
||||
if cfg.Client == nil {
|
||||
return nil, errors.New("new rtmanager stream offset store: nil redis client")
|
||||
}
|
||||
return &Store{
|
||||
client: cfg.Client,
|
||||
keys: redisstate.Keyspace{},
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Load returns the last processed entry id for streamLabel when one is
|
||||
// stored. A missing key returns ("", false, nil).
|
||||
func (store *Store) Load(ctx context.Context, streamLabel string) (string, bool, error) {
|
||||
if store == nil || store.client == nil {
|
||||
return "", false, errors.New("load rtmanager stream offset: nil store")
|
||||
}
|
||||
if ctx == nil {
|
||||
return "", false, errors.New("load rtmanager stream offset: nil context")
|
||||
}
|
||||
if strings.TrimSpace(streamLabel) == "" {
|
||||
return "", false, errors.New("load rtmanager stream offset: stream label must not be empty")
|
||||
}
|
||||
|
||||
value, err := store.client.Get(ctx, store.keys.StreamOffset(streamLabel)).Result()
|
||||
switch {
|
||||
case errors.Is(err, redis.Nil):
|
||||
return "", false, nil
|
||||
case err != nil:
|
||||
return "", false, fmt.Errorf("load rtmanager stream offset: %w", err)
|
||||
}
|
||||
return value, true, nil
|
||||
}
|
||||
|
||||
// Save stores entryID as the new offset for streamLabel. The key has no
|
||||
// TTL — offsets are durable and only overwritten by subsequent Saves.
|
||||
func (store *Store) Save(ctx context.Context, streamLabel, entryID string) error {
|
||||
if store == nil || store.client == nil {
|
||||
return errors.New("save rtmanager stream offset: nil store")
|
||||
}
|
||||
if ctx == nil {
|
||||
return errors.New("save rtmanager stream offset: nil context")
|
||||
}
|
||||
if strings.TrimSpace(streamLabel) == "" {
|
||||
return errors.New("save rtmanager stream offset: stream label must not be empty")
|
||||
}
|
||||
if strings.TrimSpace(entryID) == "" {
|
||||
return errors.New("save rtmanager stream offset: entry id must not be empty")
|
||||
}
|
||||
|
||||
if err := store.client.Set(ctx, store.keys.StreamOffset(streamLabel), entryID, 0).Err(); err != nil {
|
||||
return fmt.Errorf("save rtmanager stream offset: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// Ensure Store satisfies the ports.StreamOffsetStore interface at
|
||||
// compile time.
|
||||
var _ ports.StreamOffsetStore = (*Store)(nil)
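// Illustrative consumer sketch (hypothetical names: offsets is this Store,
// readAfter and handle stand in for the real stream consumer wiring):
//
//	lastID, found, err := offsets.Load(ctx, "start_jobs")
//	if err != nil {
//		return err
//	}
//	if !found {
//		lastID = "0-0" // no persisted offset yet: read the stream from the start
//	}
//	for {
//		entries, err := readAfter(ctx, lastID) // e.g. XREAD on the start-jobs stream
//		if err != nil {
//			return err
//		}
//		for _, entry := range entries {
//			if err := handle(ctx, entry); err != nil {
//				return err
//			}
//			if err := offsets.Save(ctx, "start_jobs", entry.ID); err != nil {
//				return err
//			}
//			lastID = entry.ID
//		}
//	}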
|
||||
@@ -0,0 +1,86 @@
|
||||
package streamoffsets_test
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/redisstate"
|
||||
"galaxy/rtmanager/internal/adapters/redisstate/streamoffsets"
|
||||
|
||||
"github.com/alicebob/miniredis/v2"
|
||||
"github.com/redis/go-redis/v9"
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func newOffsetStore(t *testing.T) (*streamoffsets.Store, *miniredis.Miniredis) {
|
||||
t.Helper()
|
||||
server := miniredis.RunT(t)
|
||||
client := redis.NewClient(&redis.Options{Addr: server.Addr()})
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
store, err := streamoffsets.New(streamoffsets.Config{Client: client})
|
||||
require.NoError(t, err)
|
||||
return store, server
|
||||
}
|
||||
|
||||
func TestNewRejectsNilClient(t *testing.T) {
|
||||
_, err := streamoffsets.New(streamoffsets.Config{})
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestLoadMissingReturnsNotFound(t *testing.T) {
|
||||
store, _ := newOffsetStore(t)
|
||||
|
||||
id, found, err := store.Load(context.Background(), "start_jobs")
|
||||
require.NoError(t, err)
|
||||
assert.False(t, found)
|
||||
assert.Empty(t, id)
|
||||
}
|
||||
|
||||
func TestSaveLoadRoundTrip(t *testing.T) {
|
||||
store, server := newOffsetStore(t)
|
||||
|
||||
require.NoError(t, store.Save(context.Background(), "start_jobs", "1700000000000-0"))
|
||||
|
||||
id, found, err := store.Load(context.Background(), "start_jobs")
|
||||
require.NoError(t, err)
|
||||
assert.True(t, found)
|
||||
assert.Equal(t, "1700000000000-0", id)
|
||||
|
||||
// The persisted key must follow the rtmanager keyspace prefix.
|
||||
expectedKey := redisstate.Keyspace{}.StreamOffset("start_jobs")
|
||||
assert.True(t, server.Exists(expectedKey),
|
||||
"key %q must exist after Save", expectedKey)
|
||||
}
|
||||
|
||||
func TestSaveOverwritesPriorValue(t *testing.T) {
|
||||
store, _ := newOffsetStore(t)
|
||||
|
||||
require.NoError(t, store.Save(context.Background(), "start_jobs", "100-0"))
|
||||
require.NoError(t, store.Save(context.Background(), "start_jobs", "200-0"))
|
||||
|
||||
id, found, err := store.Load(context.Background(), "start_jobs")
|
||||
require.NoError(t, err)
|
||||
assert.True(t, found)
|
||||
assert.Equal(t, "200-0", id)
|
||||
}
|
||||
|
||||
func TestLoadAndSaveRejectInvalidArguments(t *testing.T) {
|
||||
store, _ := newOffsetStore(t)
|
||||
|
||||
require.Error(t, store.Save(context.Background(), "", "100-0"))
|
||||
require.Error(t, store.Save(context.Background(), "start_jobs", ""))
|
||||
|
||||
_, _, err := store.Load(context.Background(), "")
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestKeyspaceStreamOffsetIsPrefixed(t *testing.T) {
|
||||
key := redisstate.Keyspace{}.StreamOffset("start_jobs")
|
||||
assert.NotEmpty(t, key)
|
||||
assert.Contains(t, key, "rtmanager:stream_offsets:")
|
||||
// base64url-encoded label must not contain raw colons or spaces.
|
||||
suffix := key[len("rtmanager:stream_offsets:"):]
|
||||
assert.NotContains(t, suffix, ":")
|
||||
}
|
||||
@@ -0,0 +1,367 @@
|
||||
package internalhttp
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"errors"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"path/filepath"
|
||||
"runtime"
|
||||
"strings"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/api/internalhttp/handlers"
|
||||
domainruntime "galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
"galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
"galaxy/rtmanager/internal/service/patchruntime"
|
||||
"galaxy/rtmanager/internal/service/restartruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
"galaxy/rtmanager/internal/service/stopruntime"
|
||||
|
||||
"github.com/getkin/kin-openapi/openapi3"
|
||||
"github.com/getkin/kin-openapi/openapi3filter"
|
||||
"github.com/getkin/kin-openapi/routers"
|
||||
"github.com/getkin/kin-openapi/routers/legacy"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// TestInternalRESTConformance loads the OpenAPI specification, drives
|
||||
// every runtime operation against the internal HTTP listener's handler
|
||||
// backed by stub services, and validates each response body against
|
||||
// the spec via `openapi3filter.ValidateResponse`. The test catches
|
||||
// drift between the wire shape produced by the handler layer and the
|
||||
// frozen contract; failure-path response shapes are validated by the
|
||||
// per-handler tests in `handlers/<op>_test.go`.
|
||||
func TestInternalRESTConformance(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
doc := loadConformanceSpec(t)
|
||||
|
||||
router, err := legacy.NewRouter(doc)
|
||||
require.NoError(t, err)
|
||||
|
||||
deps := newConformanceDeps(t)
|
||||
server, err := NewServer(newConformanceConfig(), Dependencies{
|
||||
Logger: nil,
|
||||
Telemetry: nil,
|
||||
Readiness: nil,
|
||||
RuntimeRecords: deps.records,
|
||||
StartRuntime: deps.start,
|
||||
StopRuntime: deps.stop,
|
||||
RestartRuntime: deps.restart,
|
||||
PatchRuntime: deps.patch,
|
||||
CleanupContainer: deps.cleanup,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
|
||||
cases := []conformanceCase{
|
||||
{
|
||||
name: "internalListRuntimes",
|
||||
method: http.MethodGet,
|
||||
path: "/api/v1/internal/runtimes",
|
||||
},
|
||||
{
|
||||
name: "internalGetRuntime",
|
||||
method: http.MethodGet,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID,
|
||||
},
|
||||
{
|
||||
name: "internalStartRuntime",
|
||||
method: http.MethodPost,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/start",
|
||||
contentType: "application/json",
|
||||
body: `{"image_ref":"galaxy/game:v1.2.3"}`,
|
||||
},
|
||||
{
|
||||
name: "internalStopRuntime",
|
||||
method: http.MethodPost,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/stop",
|
||||
contentType: "application/json",
|
||||
body: `{"reason":"admin_request"}`,
|
||||
},
|
||||
{
|
||||
name: "internalRestartRuntime",
|
||||
method: http.MethodPost,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/restart",
|
||||
},
|
||||
{
|
||||
name: "internalPatchRuntime",
|
||||
method: http.MethodPost,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/patch",
|
||||
contentType: "application/json",
|
||||
body: `{"image_ref":"galaxy/game:v1.2.4"}`,
|
||||
},
|
||||
{
|
||||
name: "internalCleanupRuntimeContainer",
|
||||
method: http.MethodDelete,
|
||||
path: "/api/v1/internal/runtimes/" + conformanceGameID + "/container",
|
||||
},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
runConformanceCase(t, server.handler, router, tc)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// conformanceGameID is the path variable used for every per-game
|
||||
// conformance request.
|
||||
const conformanceGameID = "game-conformance"
|
||||
|
||||
// conformanceServerURL mirrors the canonical `servers[0].url` entry in
|
||||
// `rtmanager/api/internal-openapi.yaml`. The legacy router matches
|
||||
// requests against this prefix; updating the spec's server URL
|
||||
// requires updating this constant.
|
||||
const conformanceServerURL = "http://localhost:8096"
|
||||
|
||||
// conformanceCase describes one request the conformance test drives.
|
||||
type conformanceCase struct {
|
||||
name string
|
||||
method string
|
||||
path string
|
||||
contentType string
|
||||
body string
|
||||
}
|
||||
|
||||
func runConformanceCase(t *testing.T, handler http.Handler, router routers.Router, tc conformanceCase) {
|
||||
t.Helper()
|
||||
|
||||
// Drive the handler with the path-only form so the listener's
|
||||
// http.ServeMux matches the registered routes (which use raw paths,
|
||||
// without the OpenAPI server URL prefix).
|
||||
var bodyReader io.Reader
|
||||
if tc.body != "" {
|
||||
bodyReader = strings.NewReader(tc.body)
|
||||
}
|
||||
request := httptest.NewRequest(tc.method, tc.path, bodyReader)
|
||||
if tc.contentType != "" {
|
||||
request.Header.Set("Content-Type", tc.contentType)
|
||||
}
|
||||
request.Header.Set("X-Galaxy-Caller", "admin")
|
||||
|
||||
recorder := httptest.NewRecorder()
|
||||
handler.ServeHTTP(recorder, request)
|
||||
require.Equalf(t, http.StatusOK, recorder.Code, "operation %s returned %d: %s", tc.name, recorder.Code, recorder.Body.String())
|
||||
|
||||
// kin-openapi's legacy router requires the request URL to match a
|
||||
// `servers[].url` entry; rebuild the validation request with the
|
||||
// canonical local server URL declared in the spec.
|
||||
validationURL := conformanceServerURL + tc.path
|
||||
validationRequest := httptest.NewRequest(tc.method, validationURL, bodyReaderFor(tc.body))
|
||||
if tc.contentType != "" {
|
||||
validationRequest.Header.Set("Content-Type", tc.contentType)
|
||||
}
|
||||
validationRequest.Header.Set("X-Galaxy-Caller", "admin")
|
||||
|
||||
route, pathParams, err := router.FindRoute(validationRequest)
|
||||
require.NoError(t, err)
|
||||
|
||||
requestInput := &openapi3filter.RequestValidationInput{
|
||||
Request: validationRequest,
|
||||
PathParams: pathParams,
|
||||
Route: route,
|
||||
Options: &openapi3filter.Options{
|
||||
IncludeResponseStatus: true,
|
||||
},
|
||||
}
|
||||
require.NoError(t, openapi3filter.ValidateRequest(context.Background(), requestInput))
|
||||
|
||||
responseInput := &openapi3filter.ResponseValidationInput{
|
||||
RequestValidationInput: requestInput,
|
||||
Status: recorder.Code,
|
||||
Header: recorder.Header(),
|
||||
Options: &openapi3filter.Options{
|
||||
IncludeResponseStatus: true,
|
||||
},
|
||||
}
|
||||
responseInput.SetBodyBytes(recorder.Body.Bytes())
|
||||
require.NoError(t, openapi3filter.ValidateResponse(context.Background(), responseInput))
|
||||
}
|
||||
|
||||
func loadConformanceSpec(t *testing.T) *openapi3.T {
|
||||
t.Helper()
|
||||
|
||||
_, thisFile, _, ok := runtime.Caller(0)
|
||||
require.True(t, ok)
|
||||
|
||||
specPath := filepath.Join(filepath.Dir(thisFile), "..", "..", "..", "api", "internal-openapi.yaml")
|
||||
loader := openapi3.NewLoader()
|
||||
doc, err := loader.LoadFromFile(specPath)
|
||||
require.NoError(t, err)
|
||||
require.NoError(t, doc.Validate(context.Background()))
|
||||
return doc
|
||||
}
|
||||
|
||||
func bodyReaderFor(raw string) io.Reader {
|
||||
if raw == "" {
|
||||
return http.NoBody
|
||||
}
|
||||
return bytes.NewBufferString(raw)
|
||||
}
|
||||
|
||||
// conformanceDeps groups the stub collaborators handed to the listener.
|
||||
type conformanceDeps struct {
|
||||
records *conformanceRecords
|
||||
start *conformanceStart
|
||||
stop *conformanceStop
|
||||
restart *conformanceRestart
|
||||
patch *conformancePatch
|
||||
cleanup *conformanceCleanup
|
||||
}
|
||||
|
||||
func newConformanceDeps(t *testing.T) *conformanceDeps {
|
||||
t.Helper()
|
||||
return &conformanceDeps{
|
||||
records: newConformanceRecords(),
|
||||
start: &conformanceStart{},
|
||||
stop: &conformanceStop{},
|
||||
restart: &conformanceRestart{},
|
||||
patch: &conformancePatch{},
|
||||
cleanup: &conformanceCleanup{},
|
||||
}
|
||||
}
|
||||
|
||||
func newConformanceConfig() Config {
|
||||
return Config{
|
||||
Addr: ":0",
|
||||
ReadHeaderTimeout: time.Second,
|
||||
ReadTimeout: time.Second,
|
||||
WriteTimeout: time.Second,
|
||||
IdleTimeout: time.Second,
|
||||
}
|
||||
}
|
||||
|
||||
// conformanceRecord builds a canonical running record used by every
|
||||
// stub service.
|
||||
func conformanceRecord() domainruntime.RuntimeRecord {
|
||||
started := time.Date(2026, 4, 26, 13, 0, 0, 0, time.UTC)
|
||||
return domainruntime.RuntimeRecord{
|
||||
GameID: conformanceGameID,
|
||||
Status: domainruntime.StatusRunning,
|
||||
CurrentContainerID: "container-conformance",
|
||||
CurrentImageRef: "galaxy/game:v1.2.3",
|
||||
EngineEndpoint: "http://galaxy-game-" + conformanceGameID + ":8080",
|
||||
StatePath: "/var/lib/galaxy/" + conformanceGameID,
|
||||
DockerNetwork: "galaxy-engine",
|
||||
StartedAt: &started,
|
||||
LastOpAt: started,
|
||||
CreatedAt: started,
|
||||
}
|
||||
}
|
||||
|
||||
// conformanceRecords is an in-memory record store seeded with one
|
||||
// canonical record so the get / list endpoints have something to
|
||||
// return.
|
||||
type conformanceRecords struct {
|
||||
mu sync.Mutex
|
||||
stored map[string]domainruntime.RuntimeRecord
|
||||
}
|
||||
|
||||
func newConformanceRecords() *conformanceRecords {
|
||||
return &conformanceRecords{
|
||||
stored: map[string]domainruntime.RuntimeRecord{
|
||||
conformanceGameID: conformanceRecord(),
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
func (s *conformanceRecords) Get(_ context.Context, gameID string) (domainruntime.RuntimeRecord, error) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
record, ok := s.stored[gameID]
|
||||
if !ok {
|
||||
return domainruntime.RuntimeRecord{}, domainruntime.ErrNotFound
|
||||
}
|
||||
return record, nil
|
||||
}
|
||||
|
||||
func (s *conformanceRecords) Upsert(_ context.Context, _ domainruntime.RuntimeRecord) error {
|
||||
return errors.New("not used in conformance test")
|
||||
}
|
||||
|
||||
func (s *conformanceRecords) UpdateStatus(_ context.Context, _ ports.UpdateStatusInput) error {
|
||||
return errors.New("not used in conformance test")
|
||||
}
|
||||
|
||||
func (s *conformanceRecords) ListByStatus(_ context.Context, _ domainruntime.Status) ([]domainruntime.RuntimeRecord, error) {
|
||||
return nil, errors.New("not used in conformance test")
|
||||
}
|
||||
|
||||
func (s *conformanceRecords) List(_ context.Context) ([]domainruntime.RuntimeRecord, error) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
out := make([]domainruntime.RuntimeRecord, 0, len(s.stored))
|
||||
for _, record := range s.stored {
|
||||
out = append(out, record)
|
||||
}
|
||||
return out, nil
|
||||
}
|
||||
|
||||
// conformanceStart is the stub StartService used by the conformance
|
||||
// test. Every Handle call returns the canonical record.
|
||||
type conformanceStart struct{}
|
||||
|
||||
func (s *conformanceStart) Handle(_ context.Context, _ startruntime.Input) (startruntime.Result, error) {
|
||||
return startruntime.Result{
|
||||
Record: conformanceRecord(),
|
||||
Outcome: "success",
|
||||
}, nil
|
||||
}
|
||||
|
||||
type conformanceStop struct{}
|
||||
|
||||
func (s *conformanceStop) Handle(_ context.Context, _ stopruntime.Input) (stopruntime.Result, error) {
|
||||
rec := conformanceRecord()
|
||||
rec.Status = domainruntime.StatusStopped
|
||||
stopped := rec.LastOpAt.Add(time.Second)
|
||||
rec.StoppedAt = &stopped
|
||||
rec.LastOpAt = stopped
|
||||
return stopruntime.Result{Record: rec, Outcome: "success"}, nil
|
||||
}
|
||||
|
||||
type conformanceRestart struct{}
|
||||
|
||||
func (s *conformanceRestart) Handle(_ context.Context, _ restartruntime.Input) (restartruntime.Result, error) {
|
||||
return restartruntime.Result{Record: conformanceRecord(), Outcome: "success"}, nil
|
||||
}
|
||||
|
||||
type conformancePatch struct{}
|
||||
|
||||
func (s *conformancePatch) Handle(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
|
||||
rec := conformanceRecord()
|
||||
if in.NewImageRef != "" {
|
||||
rec.CurrentImageRef = in.NewImageRef
|
||||
}
|
||||
return patchruntime.Result{Record: rec, Outcome: "success"}, nil
|
||||
}
|
||||
|
||||
type conformanceCleanup struct{}
|
||||
|
||||
func (s *conformanceCleanup) Handle(_ context.Context, _ cleanupcontainer.Input) (cleanupcontainer.Result, error) {
|
||||
rec := conformanceRecord()
|
||||
rec.Status = domainruntime.StatusRemoved
|
||||
rec.CurrentContainerID = ""
|
||||
removed := rec.LastOpAt.Add(time.Minute)
|
||||
rec.RemovedAt = &removed
|
||||
rec.LastOpAt = removed
|
||||
return cleanupcontainer.Result{Record: rec, Outcome: "success"}, nil
|
||||
}
|
||||
|
||||
// Compile-time guards: the stubs must satisfy the handler-level
|
||||
// service ports plus ports.RuntimeRecordStore so the listener accepts
|
||||
// them.
|
||||
var (
|
||||
_ handlers.StartService = (*conformanceStart)(nil)
|
||||
_ handlers.StopService = (*conformanceStop)(nil)
|
||||
_ handlers.RestartService = (*conformanceRestart)(nil)
|
||||
_ handlers.PatchService = (*conformancePatch)(nil)
|
||||
_ handlers.CleanupService = (*conformanceCleanup)(nil)
|
||||
_ ports.RuntimeRecordStore = (*conformanceRecords)(nil)
|
||||
)
|
||||
@@ -0,0 +1,55 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// newCleanupHandler returns the handler for
|
||||
// `DELETE /api/v1/internal/runtimes/{game_id}/container`. The OpenAPI
|
||||
// spec declares no request body for this operation; any client-provided
|
||||
// body is ignored.
|
||||
func newCleanupHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.cleanup")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.CleanupContainer == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"cleanup container service is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
result, err := deps.CleanupContainer.Handle(request.Context(), cleanupcontainer.Input{
|
||||
GameID: gameID,
|
||||
OpSource: resolveOpSource(request),
|
||||
SourceRef: requestSourceRef(request),
|
||||
})
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "cleanup container service errored",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"cleanup container service failed",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
if result.Outcome == operation.OutcomeFailure {
|
||||
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,238 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"io"
|
||||
"log/slog"
|
||||
"net/http"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// JSONContentType is the Content-Type used by every internal REST
|
||||
// response. Exported so the listener-level tests can match it without
|
||||
// re-declaring the constant.
|
||||
const JSONContentType = "application/json; charset=utf-8"
|
||||
|
||||
// gameIDPathParam is the name of the {game_id} path variable shared by
|
||||
// every per-game runtime endpoint.
|
||||
const gameIDPathParam = "game_id"
|
||||
|
||||
// callerHeader is the HTTP header that distinguishes Game Master from
|
||||
// Admin Service in the operation log. Documented in
|
||||
// `rtmanager/api/internal-openapi.yaml` and
|
||||
// `rtmanager/docs/services.md` §18.
|
||||
const callerHeader = "X-Galaxy-Caller"
|
||||
|
||||
// errorCodeDockerUnavailable mirrors the OpenAPI error code value. The
|
||||
// lifecycle services do not currently emit it (they use
|
||||
// `service_unavailable` for Docker daemon failures); the handler layer
|
||||
// maps it to 503 anyway so future producers do not require a handler
|
||||
// change.
|
||||
const errorCodeDockerUnavailable = "docker_unavailable"
|
||||
|
||||
// errorBody mirrors the `error` element of the OpenAPI ErrorResponse
|
||||
// schema.
|
||||
type errorBody struct {
|
||||
Code string `json:"code"`
|
||||
Message string `json:"message"`
|
||||
}
|
||||
|
||||
// errorResponse mirrors the OpenAPI ErrorResponse envelope.
|
||||
type errorResponse struct {
|
||||
Error errorBody `json:"error"`
|
||||
}
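// A 404 from the get handler, for instance, encodes as:
//
//	{"error":{"code":"not_found","message":"runtime record not found"}}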
|
||||
|
||||
// runtimeRecordResponse mirrors the OpenAPI RuntimeRecord schema.
|
||||
// Required fields use plain strings; nullable fields use pointers so an
|
||||
// absent value encodes as the JSON literal `null` (matches the
|
||||
// `nullable: true` declaration in the spec). Times are RFC3339 UTC.
|
||||
type runtimeRecordResponse struct {
|
||||
GameID string `json:"game_id"`
|
||||
Status string `json:"status"`
|
||||
CurrentContainerID *string `json:"current_container_id"`
|
||||
CurrentImageRef *string `json:"current_image_ref"`
|
||||
EngineEndpoint *string `json:"engine_endpoint"`
|
||||
StatePath string `json:"state_path"`
|
||||
DockerNetwork string `json:"docker_network"`
|
||||
StartedAt *string `json:"started_at"`
|
||||
StoppedAt *string `json:"stopped_at"`
|
||||
RemovedAt *string `json:"removed_at"`
|
||||
LastOpAt string `json:"last_op_at"`
|
||||
CreatedAt string `json:"created_at"`
|
||||
}
|
||||
|
||||
// runtimesListResponse mirrors the OpenAPI RuntimesList schema. Items
|
||||
// is always non-nil so the JSON form carries `[]` rather than `null`
|
||||
// for an empty result.
|
||||
type runtimesListResponse struct {
|
||||
Items []runtimeRecordResponse `json:"items"`
|
||||
}
|
||||
|
||||
// encodeRuntimeRecord turns a domain RuntimeRecord into its wire shape.
|
||||
func encodeRuntimeRecord(record runtime.RuntimeRecord) runtimeRecordResponse {
|
||||
resp := runtimeRecordResponse{
|
||||
GameID: record.GameID,
|
||||
Status: string(record.Status),
|
||||
StatePath: record.StatePath,
|
||||
DockerNetwork: record.DockerNetwork,
|
||||
LastOpAt: record.LastOpAt.UTC().Format(time.RFC3339Nano),
|
||||
CreatedAt: record.CreatedAt.UTC().Format(time.RFC3339Nano),
|
||||
}
|
||||
if record.CurrentContainerID != "" {
|
||||
v := record.CurrentContainerID
|
||||
resp.CurrentContainerID = &v
|
||||
}
|
||||
if record.CurrentImageRef != "" {
|
||||
v := record.CurrentImageRef
|
||||
resp.CurrentImageRef = &v
|
||||
}
|
||||
if record.EngineEndpoint != "" {
|
||||
v := record.EngineEndpoint
|
||||
resp.EngineEndpoint = &v
|
||||
}
|
||||
if record.StartedAt != nil {
|
||||
v := record.StartedAt.UTC().Format(time.RFC3339Nano)
|
||||
resp.StartedAt = &v
|
||||
}
|
||||
if record.StoppedAt != nil {
|
||||
v := record.StoppedAt.UTC().Format(time.RFC3339Nano)
|
||||
resp.StoppedAt = &v
|
||||
}
|
||||
if record.RemovedAt != nil {
|
||||
v := record.RemovedAt.UTC().Format(time.RFC3339Nano)
|
||||
resp.RemovedAt = &v
|
||||
}
|
||||
return resp
|
||||
}
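// As an illustration (field values are made up), a record whose container
// has been cleaned up encodes along these lines:
//
//	{
//	  "game_id": "game-001",
//	  "status": "removed",
//	  "current_container_id": null,
//	  "current_image_ref": "galaxy/game:v1.2.3",
//	  "engine_endpoint": "http://galaxy-game-game-001:8080",
//	  "state_path": "/var/lib/galaxy/game-001",
//	  "docker_network": "galaxy-engine",
//	  "started_at": "2026-04-26T13:00:00Z",
//	  "stopped_at": "2026-04-26T13:01:00Z",
//	  "removed_at": "2026-04-26T13:02:00Z",
//	  "last_op_at": "2026-04-26T13:02:00Z",
//	  "created_at": "2026-04-26T13:00:00Z"
//	}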
|
||||
|
||||
// encodeRuntimesList builds the wire shape returned by the list handler.
|
||||
// records may be nil (empty store); the result still carries an empty
|
||||
// items slice so the JSON form is `{"items":[]}`.
|
||||
func encodeRuntimesList(records []runtime.RuntimeRecord) runtimesListResponse {
|
||||
resp := runtimesListResponse{
|
||||
Items: make([]runtimeRecordResponse, 0, len(records)),
|
||||
}
|
||||
for _, record := range records {
|
||||
resp.Items = append(resp.Items, encodeRuntimeRecord(record))
|
||||
}
|
||||
return resp
|
||||
}
|
||||
|
||||
// writeJSON writes payload as a JSON response with the given status code.
|
||||
func writeJSON(writer http.ResponseWriter, statusCode int, payload any) {
|
||||
writer.Header().Set("Content-Type", JSONContentType)
|
||||
writer.WriteHeader(statusCode)
|
||||
_ = json.NewEncoder(writer).Encode(payload)
|
||||
}
|
||||
|
||||
// writeError writes the canonical error envelope at statusCode.
|
||||
func writeError(writer http.ResponseWriter, statusCode int, code, message string) {
|
||||
writeJSON(writer, statusCode, errorResponse{
|
||||
Error: errorBody{Code: code, Message: message},
|
||||
})
|
||||
}
|
||||
|
||||
// writeFailure writes the canonical error envelope using the HTTP
|
||||
// status mapped from code. Used by every lifecycle handler when its
|
||||
// service returns `Outcome=failure`.
|
||||
func writeFailure(writer http.ResponseWriter, code, message string) {
|
||||
writeError(writer, mapErrorCodeToStatus(code), code, message)
|
||||
}
|
||||
|
||||
// mapErrorCodeToStatus maps a stable error code to the HTTP status
|
||||
// declared by `rtmanager/api/internal-openapi.yaml`. Unknown codes
|
||||
// degrade to 500 so a future error code that ships ahead of its
|
||||
// handler-layer mapping still produces a structurally valid response.
|
||||
func mapErrorCodeToStatus(code string) int {
|
||||
switch code {
|
||||
case startruntime.ErrorCodeInvalidRequest,
|
||||
startruntime.ErrorCodeStartConfigInvalid,
|
||||
startruntime.ErrorCodeImageRefNotSemver:
|
||||
return http.StatusBadRequest
|
||||
case startruntime.ErrorCodeNotFound:
|
||||
return http.StatusNotFound
|
||||
case startruntime.ErrorCodeConflict,
|
||||
startruntime.ErrorCodeSemverPatchOnly:
|
||||
return http.StatusConflict
|
||||
case startruntime.ErrorCodeServiceUnavailable,
|
||||
errorCodeDockerUnavailable:
|
||||
return http.StatusServiceUnavailable
|
||||
case startruntime.ErrorCodeImagePullFailed,
|
||||
startruntime.ErrorCodeContainerStartFailed,
|
||||
startruntime.ErrorCodeInternal:
|
||||
return http.StatusInternalServerError
|
||||
default:
|
||||
return http.StatusInternalServerError
|
||||
}
|
||||
}
|
||||
|
||||
// decodeStrictJSON decodes one request body into target with strict
|
||||
// JSON semantics: unknown fields and trailing content after the first
// value are both rejected. Mirrors the helper used by lobby's internal HTTP layer.
|
||||
func decodeStrictJSON(body io.Reader, target any) error {
|
||||
decoder := json.NewDecoder(body)
|
||||
decoder.DisallowUnknownFields()
|
||||
if err := decoder.Decode(target); err != nil {
|
||||
return err
|
||||
}
|
||||
if decoder.More() {
|
||||
return errors.New("unexpected trailing content after JSON body")
|
||||
}
|
||||
return nil
|
||||
}
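// For example, against the start request body: `{"image_ref":"galaxy/game:v1.2.3"}`
// decodes cleanly, `{"image_ref":"galaxy/game:v1.2.3","extra":1}` fails on the
// unknown field, and `{"image_ref":"galaxy/game:v1.2.3"} {}` fails on the
// trailing content.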
|
||||
|
||||
// extractGameID pulls the {game_id} path variable from request. An empty
|
||||
// or whitespace-only value writes a `400 invalid_request` and returns
|
||||
// ok=false so callers can short-circuit.
|
||||
func extractGameID(writer http.ResponseWriter, request *http.Request) (string, bool) {
|
||||
raw := request.PathValue(gameIDPathParam)
|
||||
if strings.TrimSpace(raw) == "" {
|
||||
writeError(writer, http.StatusBadRequest,
|
||||
startruntime.ErrorCodeInvalidRequest,
|
||||
"game id is required",
|
||||
)
|
||||
return "", false
|
||||
}
|
||||
return raw, true
|
||||
}
|
||||
|
||||
// resolveOpSource maps the X-Galaxy-Caller header to an
|
||||
// `operation.OpSource`. Missing or unknown values default to
|
||||
// `OpSourceAdminRest`, matching the contract documented in
|
||||
// `rtmanager/api/internal-openapi.yaml`.
|
||||
func resolveOpSource(request *http.Request) operation.OpSource {
|
||||
switch strings.ToLower(strings.TrimSpace(request.Header.Get(callerHeader))) {
|
||||
case "gm":
|
||||
return operation.OpSourceGMRest
|
||||
default:
|
||||
return operation.OpSourceAdminRest
|
||||
}
|
||||
}
|
||||
|
||||
// requestSourceRef returns an opaque per-request reference recorded in
|
||||
// `operation_log.source_ref`. v1 reads the `X-Request-ID` header when
|
||||
// present so callers may correlate REST requests with audit rows; the
|
||||
// listener does not currently install a request-id middleware so the
|
||||
// header path is the only source.
|
||||
func requestSourceRef(request *http.Request) string {
|
||||
if v := strings.TrimSpace(request.Header.Get("X-Request-ID")); v != "" {
|
||||
return v
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
// loggerFor returns a logger annotated with the operation tag. Each
|
||||
// handler scopes its logs by op so operators filtering on
|
||||
// `op=internal_rest.start` see exactly the lifecycle they care about.
|
||||
func loggerFor(parent *slog.Logger, op string) *slog.Logger {
|
||||
if parent == nil {
|
||||
parent = slog.Default()
|
||||
}
|
||||
return parent.With("component", "internal_http.handlers", "op", op)
|
||||
}
|
||||
@@ -0,0 +1,197 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"io"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"strings"
|
||||
"sync"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// fixedClock is the wall-clock used to build canonical sample records
|
||||
// across the handler tests. UTC Sunday 1pm 2026-04-26 is far enough in
|
||||
// the future to be obvious in test output.
|
||||
var fixedClock = time.Date(2026, 4, 26, 13, 0, 0, 0, time.UTC)
|
||||
|
||||
// sampleRunningRecord returns a canonical running record used by every
|
||||
// happy-path test in this package.
|
||||
func sampleRunningRecord(t *testing.T) runtime.RuntimeRecord {
|
||||
t.Helper()
|
||||
started := fixedClock
|
||||
return runtime.RuntimeRecord{
|
||||
GameID: "game-test",
|
||||
Status: runtime.StatusRunning,
|
||||
CurrentContainerID: "container-test",
|
||||
CurrentImageRef: "galaxy/game:v1.2.3",
|
||||
EngineEndpoint: "http://galaxy-game-game-test:8080",
|
||||
StatePath: "/var/lib/galaxy/game-test",
|
||||
DockerNetwork: "galaxy-engine",
|
||||
StartedAt: &started,
|
||||
LastOpAt: fixedClock,
|
||||
CreatedAt: fixedClock,
|
||||
}
|
||||
}
|
||||
|
||||
// sampleStoppedRecord returns a canonical stopped record useful for
|
||||
// cleanup-handler and list-handler tests.
|
||||
func sampleStoppedRecord(t *testing.T) runtime.RuntimeRecord {
|
||||
t.Helper()
|
||||
started := fixedClock
|
||||
stopped := fixedClock.Add(time.Minute)
|
||||
return runtime.RuntimeRecord{
|
||||
GameID: "game-stopped",
|
||||
Status: runtime.StatusStopped,
|
||||
CurrentContainerID: "container-stopped",
|
||||
CurrentImageRef: "galaxy/game:v1.2.3",
|
||||
EngineEndpoint: "http://galaxy-game-game-stopped:8080",
|
||||
StatePath: "/var/lib/galaxy/game-stopped",
|
||||
DockerNetwork: "galaxy-engine",
|
||||
StartedAt: &started,
|
||||
StoppedAt: &stopped,
|
||||
LastOpAt: stopped,
|
||||
CreatedAt: fixedClock,
|
||||
}
|
||||
}
|
||||
|
||||
// drive routes one request through a full mux configured by Register.
|
||||
// It returns the captured ResponseRecorder so tests can assert on
|
||||
// status, headers, and body.
|
||||
func drive(t *testing.T, deps Dependencies, method, path string, headers http.Header, body io.Reader) *httptest.ResponseRecorder {
|
||||
t.Helper()
|
||||
|
||||
mux := http.NewServeMux()
|
||||
Register(mux, deps)
|
||||
|
||||
request := httptest.NewRequest(method, path, body)
|
||||
for key, values := range headers {
|
||||
for _, value := range values {
|
||||
request.Header.Add(key, value)
|
||||
}
|
||||
}
|
||||
|
||||
recorder := httptest.NewRecorder()
|
||||
mux.ServeHTTP(recorder, request)
|
||||
return recorder
|
||||
}
|
||||
|
||||
// decodeRecordResponse asserts that the response carried a 200 with
|
||||
// the canonical content type and decodes the record body.
|
||||
func decodeRecordResponse(t *testing.T, rec *httptest.ResponseRecorder) runtimeRecordResponse {
|
||||
t.Helper()
|
||||
require.Equalf(t, http.StatusOK, rec.Code, "expected 200, got body: %s", rec.Body.String())
|
||||
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
|
||||
|
||||
var resp runtimeRecordResponse
|
||||
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
|
||||
return resp
|
||||
}
|
||||
|
||||
// decodeErrorBody asserts the canonical error envelope and decodes it.
|
||||
func decodeErrorBody(t *testing.T, rec *httptest.ResponseRecorder, wantStatus int) errorBody {
|
||||
t.Helper()
|
||||
require.Equalf(t, wantStatus, rec.Code, "expected %d, got body: %s", wantStatus, rec.Body.String())
|
||||
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
|
||||
|
||||
var resp errorResponse
|
||||
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
|
||||
return resp.Error
|
||||
}
|
||||
|
||||
// fakeRuntimeRecords is an in-memory ports.RuntimeRecordStore used by
|
||||
// list / get tests. It is intentionally minimal — services use their
|
||||
// own fakes in `internal/service/<op>/service_test.go` and do not
|
||||
// share this helper.
|
||||
type fakeRuntimeRecords struct {
|
||||
mu sync.Mutex
|
||||
stored map[string]runtime.RuntimeRecord
|
||||
listErr error
|
||||
getErr error
|
||||
}
|
||||
|
||||
func newFakeRuntimeRecords() *fakeRuntimeRecords {
|
||||
return &fakeRuntimeRecords{stored: map[string]runtime.RuntimeRecord{}}
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) put(record runtime.RuntimeRecord) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
s.stored[record.GameID] = record
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) Get(_ context.Context, gameID string) (runtime.RuntimeRecord, error) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
if s.getErr != nil {
|
||||
return runtime.RuntimeRecord{}, s.getErr
|
||||
}
|
||||
record, ok := s.stored[gameID]
|
||||
if !ok {
|
||||
return runtime.RuntimeRecord{}, runtime.ErrNotFound
|
||||
}
|
||||
return record, nil
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) Upsert(_ context.Context, _ runtime.RuntimeRecord) error {
|
||||
return errors.New("not used in handler tests")
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) UpdateStatus(_ context.Context, _ ports.UpdateStatusInput) error {
|
||||
return errors.New("not used in handler tests")
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) ListByStatus(_ context.Context, _ runtime.Status) ([]runtime.RuntimeRecord, error) {
|
||||
return nil, errors.New("not used in handler tests")
|
||||
}
|
||||
|
||||
func (s *fakeRuntimeRecords) List(_ context.Context) ([]runtime.RuntimeRecord, error) {
|
||||
s.mu.Lock()
|
||||
defer s.mu.Unlock()
|
||||
if s.listErr != nil {
|
||||
return nil, s.listErr
|
||||
}
|
||||
if len(s.stored) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
records := make([]runtime.RuntimeRecord, 0, len(s.stored))
|
||||
for _, record := range s.stored {
|
||||
records = append(records, record)
|
||||
}
|
||||
return records, nil
|
||||
}
|
||||
|
||||
// jsonHeaders returns the default headers used by tests that send a
|
||||
// JSON body.
|
||||
func jsonHeaders() http.Header {
|
||||
h := http.Header{}
|
||||
h.Set("Content-Type", "application/json")
|
||||
return h
|
||||
}
|
||||
|
||||
// withCaller adds the X-Galaxy-Caller header to h and returns h. The
|
||||
// helper exists to keep test cases readable when the header is the
|
||||
// only difference between two table rows.
|
||||
func withCaller(h http.Header, value string) http.Header {
|
||||
if h == nil {
|
||||
h = http.Header{}
|
||||
}
|
||||
h.Set(callerHeader, value)
|
||||
return h
|
||||
}
|
||||
|
||||
// strReader builds an io.Reader from raw JSON.
|
||||
func strReader(raw string) io.Reader {
|
||||
return strings.NewReader(raw)
|
||||
}
|
||||
|
||||
// Compile-time assertions that the in-memory fake satisfies the port.
|
||||
var _ ports.RuntimeRecordStore = (*fakeRuntimeRecords)(nil)
|
||||
@@ -0,0 +1,55 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// newGetHandler returns the handler for
|
||||
// `GET /api/v1/internal/runtimes/{game_id}`. The handler reads
|
||||
// directly from the runtime record store and translates
|
||||
// `runtime.ErrNotFound` to `404 not_found`. Like list, it does not
|
||||
// run through the service layer and does not produce an operation_log
|
||||
// row.
|
||||
func newGetHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.get")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.RuntimeRecords == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"runtime records store is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
record, err := deps.RuntimeRecords.Get(request.Context(), gameID)
|
||||
if errors.Is(err, runtime.ErrNotFound) {
|
||||
writeError(writer, http.StatusNotFound,
|
||||
startruntime.ErrorCodeNotFound,
|
||||
"runtime record not found",
|
||||
)
|
||||
return
|
||||
}
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "get runtime record",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"failed to read runtime record",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,69 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"log/slog"
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
)
|
||||
|
||||
// Route paths registered by Register. The values match the paths frozen
|
||||
// by `rtmanager/api/internal-openapi.yaml` and checked by
|
||||
// `rtmanager/contract_openapi_test.go`.
|
||||
const (
|
||||
listRuntimesPath = "/api/v1/internal/runtimes"
|
||||
getRuntimePath = "/api/v1/internal/runtimes/{game_id}"
|
||||
startRuntimePath = "/api/v1/internal/runtimes/{game_id}/start"
|
||||
stopRuntimePath = "/api/v1/internal/runtimes/{game_id}/stop"
|
||||
restartRuntimePath = "/api/v1/internal/runtimes/{game_id}/restart"
|
||||
patchRuntimePath = "/api/v1/internal/runtimes/{game_id}/patch"
|
||||
cleanupRuntimePath = "/api/v1/internal/runtimes/{game_id}/container"
|
||||
)
|
||||
|
||||
// Dependencies bundles the collaborators required to serve the GM/Admin
|
||||
// REST surface. Any service may be nil for tests that exercise a
|
||||
// subset of the surface; in that case the unwired routes return
|
||||
// `500 internal_error` (mirrors lobby's "service is not wired"
|
||||
// pattern).
|
||||
type Dependencies struct {
|
||||
// Logger receives structured logs scoped per handler. nil falls back
|
||||
// to slog.Default.
|
||||
Logger *slog.Logger
|
||||
|
||||
// RuntimeRecords backs the read-only list and get handlers. They do
|
||||
// not produce operation_log rows because they do not mutate state.
|
||||
RuntimeRecords ports.RuntimeRecordStore
|
||||
|
||||
// StartRuntime executes the start lifecycle operation. Production
|
||||
// wiring passes `*startruntime.Service` (the concrete service
|
||||
// satisfies StartService).
|
||||
StartRuntime StartService
|
||||
|
||||
// StopRuntime executes the stop lifecycle operation.
|
||||
StopRuntime StopService
|
||||
|
||||
// RestartRuntime executes the restart lifecycle operation.
|
||||
RestartRuntime RestartService
|
||||
|
||||
// PatchRuntime executes the patch lifecycle operation.
|
||||
PatchRuntime PatchService
|
||||
|
||||
// CleanupContainer executes the cleanup_container lifecycle
|
||||
// operation.
|
||||
CleanupContainer CleanupService
|
||||
}
|
||||
|
||||
// Register attaches every internal REST route to mux using deps. Each
|
||||
// route reads its dependency lazily so a partially-wired Dependencies
|
||||
// (e.g., a probe-only listener test) does not crash; missing
|
||||
// dependencies surface as `500 internal_error`. Routes use Go 1.22
|
||||
// method-aware mux patterns.
|
||||
func Register(mux *http.ServeMux, deps Dependencies) {
|
||||
mux.HandleFunc("GET "+listRuntimesPath, newListHandler(deps))
|
||||
mux.HandleFunc("GET "+getRuntimePath, newGetHandler(deps))
|
||||
mux.HandleFunc("POST "+startRuntimePath, newStartHandler(deps))
|
||||
mux.HandleFunc("POST "+stopRuntimePath, newStopHandler(deps))
|
||||
mux.HandleFunc("POST "+restartRuntimePath, newRestartHandler(deps))
|
||||
mux.HandleFunc("POST "+patchRuntimePath, newPatchHandler(deps))
|
||||
mux.HandleFunc("DELETE "+cleanupRuntimePath, newCleanupHandler(deps))
|
||||
}
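// Illustrative wiring sketch (not part of this commit): a hypothetical
// caller builds the mux, registers the surface with whatever dependencies
// it has, and lets unwired routes answer `500 internal_error` as
// documented on Dependencies. The variable names are assumptions.
//
//	mux := http.NewServeMux()
//	handlers.Register(mux, handlers.Dependencies{
//		Logger:         slog.Default(),
//		RuntimeRecords: recordStore, // some ports.RuntimeRecordStore implementation
//		StartRuntime:   startSvc,    // e.g. a *startruntime.Service
//	})
//	_ = http.ListenAndServe("127.0.0.1:8081", mux)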
|
||||
@@ -0,0 +1,610 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"context"
|
||||
"net/http"
|
||||
"testing"
|
||||
|
||||
"galaxy/rtmanager/internal/api/internalhttp/handlers/mocks"
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/domain/runtime"
|
||||
"galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
"galaxy/rtmanager/internal/service/patchruntime"
|
||||
"galaxy/rtmanager/internal/service/restartruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
"galaxy/rtmanager/internal/service/stopruntime"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
"go.uber.org/mock/gomock"
|
||||
)
|
||||
|
||||
// Tests for the mutating handlers (start, stop, restart, patch,
|
||||
// cleanup). Each handler delegates to one lifecycle service through a
|
||||
// narrow `mockgen`-backed interface; the handler layer is responsible
|
||||
// for input parsing, the `X-Galaxy-Caller` → `op_source` mapping, and
|
||||
// the canonical `ErrorCode` → HTTP status table documented in
|
||||
// `rtmanager/docs/services.md` §18.
|
||||
|
||||
// --- start ---
|
||||
|
||||
func TestStartHandlerReturnsRecordOnSuccess(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
record := sampleRunningRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
|
||||
assert.Equal(t, "game-test", in.GameID)
|
||||
assert.Equal(t, "galaxy/game:v1.2.3", in.ImageRef)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
|
||||
return startruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
deps := Dependencies{StartRuntime: mock}
|
||||
rec := drive(t, deps, http.MethodPost, "/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "game-test", resp.GameID)
|
||||
assert.Equal(t, "running", resp.Status)
|
||||
}
|
||||
|
||||
func TestStartHandlerReturnsRecordOnReplayNoOp(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
record := sampleRunningRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.Any()).
|
||||
Return(startruntime.Result{
|
||||
Record: record,
|
||||
Outcome: operation.OutcomeSuccess,
|
||||
ErrorCode: startruntime.ErrorCodeReplayNoOp,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "game-test", resp.GameID)
|
||||
}
|
||||
|
||||
func TestStartHandlerMapsServiceFailures(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
cases := []struct {
|
||||
name string
|
||||
errorCode string
|
||||
wantStatus int
|
||||
}{
|
||||
{"start_config_invalid", startruntime.ErrorCodeStartConfigInvalid, http.StatusBadRequest},
|
||||
{"image_pull_failed", startruntime.ErrorCodeImagePullFailed, http.StatusInternalServerError},
|
||||
{"container_start_failed", startruntime.ErrorCodeContainerStartFailed, http.StatusInternalServerError},
|
||||
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
|
||||
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
|
||||
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.Any()).
|
||||
Return(startruntime.Result{
|
||||
Outcome: operation.OutcomeFailure,
|
||||
ErrorCode: tc.errorCode,
|
||||
ErrorMessage: "synthetic " + tc.name,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
|
||||
body := decodeErrorBody(t, rec, tc.wantStatus)
|
||||
assert.Equal(t, tc.errorCode, body.Code)
|
||||
assert.Equal(t, "synthetic "+tc.name, body.Message)
|
||||
})
|
||||
}
|
||||
}
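// The error-code → HTTP status mapping exercised above is centralised in
// mapErrorCodeToStatus (not shown in this diff). A sketch reconstructed
// only from the cases the handler tests assert; any behaviour beyond
// those cases is an assumption:
//
//	func mapErrorCodeToStatus(code string) int {
//		switch code {
//		case startruntime.ErrorCodeStartConfigInvalid,
//			startruntime.ErrorCodeInvalidRequest,
//			startruntime.ErrorCodeImageRefNotSemver:
//			return http.StatusBadRequest
//		case startruntime.ErrorCodeNotFound:
//			return http.StatusNotFound
//		case startruntime.ErrorCodeConflict,
//			startruntime.ErrorCodeSemverPatchOnly:
//			return http.StatusConflict
//		case startruntime.ErrorCodeServiceUnavailable:
//			return http.StatusServiceUnavailable
//		default:
//			return http.StatusInternalServerError
//		}
//	}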
|
||||
|
||||
func TestStartHandlerRejectsUnknownJSONFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"x","extra":"y"}`),
|
||||
)
|
||||
|
||||
body := decodeErrorBody(t, rec, http.StatusBadRequest)
|
||||
assert.Equal(t, "invalid_request", body.Code)
|
||||
}
|
||||
|
||||
func TestStartHandlerRejectsMalformedJSON(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":`),
|
||||
)
|
||||
|
||||
body := decodeErrorBody(t, rec, http.StatusBadRequest)
|
||||
assert.Equal(t, "invalid_request", body.Code)
|
||||
}
|
||||
|
||||
func TestStartHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
cases := []struct {
|
||||
header string
|
||||
want operation.OpSource
|
||||
hdrLabel string
|
||||
}{
|
||||
{"gm", operation.OpSourceGMRest, "gm"},
|
||||
{"GM", operation.OpSourceGMRest, "uppercase gm"},
|
||||
{"admin", operation.OpSourceAdminRest, "admin"},
|
||||
{"unknown", operation.OpSourceAdminRest, "unknown value"},
|
||||
{"", operation.OpSourceAdminRest, "missing header"},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.hdrLabel, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
record := sampleRunningRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
|
||||
assert.Equal(t, tc.want, in.OpSource)
|
||||
return startruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
headers := jsonHeaders()
|
||||
if tc.header != "" {
|
||||
headers = withCaller(headers, tc.header)
|
||||
}
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
headers,
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
})
|
||||
}
|
||||
}
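// resolveOpSource lives elsewhere in this package; the sketch below is an
// illustrative reconstruction of the mapping the table above pins down
// (case-insensitive "gm" selects the GM source, anything else, including
// a missing header, falls back to the admin source). Details beyond the
// asserted cases are assumptions.
//
//	func resolveOpSource(request *http.Request) operation.OpSource {
//		if strings.EqualFold(request.Header.Get(callerHeader), "gm") {
//			return operation.OpSourceGMRest
//		}
//		return operation.OpSourceAdminRest
//	}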
|
||||
|
||||
func TestStartHandlerForwardsXRequestIDAsSourceRef(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(startruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in startruntime.Input) (startruntime.Result, error) {
|
||||
assert.Equal(t, "req-42", in.SourceRef)
|
||||
return startruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
headers := jsonHeaders()
|
||||
headers.Set("X-Request-ID", "req-42")
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
headers,
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
}
|
||||
|
||||
func TestStartHandlerReturnsInternalErrorWhenServiceErrors(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStartService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.Any()).
|
||||
Return(startruntime.Result{}, assert.AnError)
|
||||
|
||||
rec := drive(t, Dependencies{StartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
func TestStartHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/start",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.3"}`),
|
||||
)
|
||||
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
// --- stop ---
|
||||
|
||||
func TestStopHandlerReturnsRecordOnSuccess(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStopService(ctrl)
|
||||
|
||||
record := sampleStoppedRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(stopruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in stopruntime.Input) (stopruntime.Result, error) {
|
||||
assert.Equal(t, "game-test", in.GameID)
|
||||
assert.Equal(t, stopruntime.StopReasonAdminRequest, in.Reason)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
|
||||
return stopruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/stop",
|
||||
jsonHeaders(),
|
||||
strReader(`{"reason":"admin_request"}`),
|
||||
)
|
||||
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "stopped", resp.Status)
|
||||
}
|
||||
|
||||
func TestStopHandlerMapsServiceFailures(t *testing.T) {
|
||||
t.Parallel()
|
||||
cases := []struct {
|
||||
name string
|
||||
errorCode string
|
||||
wantStatus int
|
||||
}{
|
||||
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
|
||||
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
|
||||
{"invalid_request", startruntime.ErrorCodeInvalidRequest, http.StatusBadRequest},
|
||||
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
|
||||
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
|
||||
}
|
||||
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStopService(ctrl)
|
||||
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(stopruntime.Result{
|
||||
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/stop",
|
||||
jsonHeaders(),
|
||||
strReader(`{"reason":"admin_request"}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, tc.wantStatus)
|
||||
assert.Equal(t, tc.errorCode, body.Code)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestStopHandlerRejectsUnknownJSONFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStopService(ctrl)
|
||||
|
||||
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/stop",
|
||||
jsonHeaders(),
|
||||
strReader(`{"reason":"admin_request","extra":1}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusBadRequest)
|
||||
assert.Equal(t, "invalid_request", body.Code)
|
||||
}
|
||||
|
||||
func TestStopHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockStopService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(stopruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in stopruntime.Input) (stopruntime.Result, error) {
|
||||
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
|
||||
return stopruntime.Result{Record: sampleStoppedRecord(t), Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{StopRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/stop",
|
||||
withCaller(jsonHeaders(), "gm"),
|
||||
strReader(`{"reason":"cancelled"}`),
|
||||
)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
}
|
||||
|
||||
func TestStopHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/stop",
|
||||
jsonHeaders(),
|
||||
strReader(`{"reason":"admin_request"}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
// --- restart ---
|
||||
|
||||
func TestRestartHandlerReturnsRecordOnSuccess(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockRestartService(ctrl)
|
||||
|
||||
record := sampleRunningRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(restartruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in restartruntime.Input) (restartruntime.Result, error) {
|
||||
assert.Equal(t, "game-test", in.GameID)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
|
||||
return restartruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
|
||||
)
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "running", resp.Status)
|
||||
}
|
||||
|
||||
func TestRestartHandlerMapsServiceFailures(t *testing.T) {
|
||||
t.Parallel()
|
||||
cases := []struct {
|
||||
name string
|
||||
errorCode string
|
||||
wantStatus int
|
||||
}{
|
||||
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
|
||||
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
|
||||
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
|
||||
{"internal_error", startruntime.ErrorCodeInternal, http.StatusInternalServerError},
|
||||
}
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockRestartService(ctrl)
|
||||
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(restartruntime.Result{
|
||||
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
|
||||
)
|
||||
body := decodeErrorBody(t, rec, tc.wantStatus)
|
||||
assert.Equal(t, tc.errorCode, body.Code)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestRestartHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockRestartService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(restartruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in restartruntime.Input) (restartruntime.Result, error) {
|
||||
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
|
||||
return restartruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{RestartRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/restart",
|
||||
withCaller(http.Header{}, "gm"), nil,
|
||||
)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
}
|
||||
|
||||
func TestRestartHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/restart", nil, nil,
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
// --- patch ---
|
||||
|
||||
func TestPatchHandlerReturnsRecordOnSuccess(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockPatchService(ctrl)
|
||||
|
||||
record := sampleRunningRecord(t)
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(patchruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
|
||||
assert.Equal(t, "game-test", in.GameID)
|
||||
assert.Equal(t, "galaxy/game:v1.2.4", in.NewImageRef)
|
||||
return patchruntime.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/patch",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
|
||||
)
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "running", resp.Status)
|
||||
}
|
||||
|
||||
func TestPatchHandlerMapsServiceFailures(t *testing.T) {
|
||||
t.Parallel()
|
||||
cases := []struct {
|
||||
name string
|
||||
errorCode string
|
||||
wantStatus int
|
||||
}{
|
||||
{"image_ref_not_semver", startruntime.ErrorCodeImageRefNotSemver, http.StatusBadRequest},
|
||||
{"semver_patch_only", startruntime.ErrorCodeSemverPatchOnly, http.StatusConflict},
|
||||
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
|
||||
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
|
||||
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
|
||||
}
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockPatchService(ctrl)
|
||||
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(patchruntime.Result{
|
||||
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/patch",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, tc.wantStatus)
|
||||
assert.Equal(t, tc.errorCode, body.Code)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestPatchHandlerRejectsUnknownJSONFields(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockPatchService(ctrl)
|
||||
|
||||
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/patch",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"x","unexpected":true}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusBadRequest)
|
||||
assert.Equal(t, "invalid_request", body.Code)
|
||||
}
|
||||
|
||||
func TestPatchHandlerHonoursXGalaxyCallerHeader(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockPatchService(ctrl)
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(patchruntime.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in patchruntime.Input) (patchruntime.Result, error) {
|
||||
assert.Equal(t, operation.OpSourceGMRest, in.OpSource)
|
||||
return patchruntime.Result{Record: sampleRunningRecord(t), Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{PatchRuntime: mock}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/patch",
|
||||
withCaller(jsonHeaders(), "gm"),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
|
||||
)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
}
|
||||
|
||||
func TestPatchHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodPost,
|
||||
"/api/v1/internal/runtimes/game-test/patch",
|
||||
jsonHeaders(),
|
||||
strReader(`{"image_ref":"galaxy/game:v1.2.4"}`),
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
// --- cleanup ---
|
||||
|
||||
func TestCleanupHandlerReturnsRecordOnSuccess(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockCleanupService(ctrl)
|
||||
|
||||
record := sampleStoppedRecord(t)
|
||||
record.Status = runtime.StatusRemoved
|
||||
record.CurrentContainerID = ""
|
||||
removed := record.LastOpAt
|
||||
record.RemovedAt = &removed
|
||||
|
||||
mock.EXPECT().
|
||||
Handle(gomock.Any(), gomock.AssignableToTypeOf(cleanupcontainer.Input{})).
|
||||
DoAndReturn(func(_ context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error) {
|
||||
assert.Equal(t, "game-stopped", in.GameID)
|
||||
assert.Equal(t, operation.OpSourceAdminRest, in.OpSource)
|
||||
return cleanupcontainer.Result{Record: record, Outcome: operation.OutcomeSuccess}, nil
|
||||
})
|
||||
|
||||
rec := drive(t, Dependencies{CleanupContainer: mock}, http.MethodDelete,
|
||||
"/api/v1/internal/runtimes/game-stopped/container", nil, nil,
|
||||
)
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "removed", resp.Status)
|
||||
assert.Nil(t, resp.CurrentContainerID, "container id must be null after cleanup")
|
||||
}
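// For illustration, the assertions above imply a response body roughly of
// this shape after a successful cleanup; the snake_case field names and
// the timestamp value are assumptions where the test does not pin them
// down:
//
//	{
//	  "game_id": "game-stopped",
//	  "status": "removed",
//	  "current_container_id": null,
//	  "removed_at": "2025-01-01T12:00:00Z"
//	}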
|
||||
|
||||
func TestCleanupHandlerMapsServiceFailures(t *testing.T) {
|
||||
t.Parallel()
|
||||
cases := []struct {
|
||||
name string
|
||||
errorCode string
|
||||
wantStatus int
|
||||
}{
|
||||
{"not_found", startruntime.ErrorCodeNotFound, http.StatusNotFound},
|
||||
{"conflict", startruntime.ErrorCodeConflict, http.StatusConflict},
|
||||
{"service_unavailable", startruntime.ErrorCodeServiceUnavailable, http.StatusServiceUnavailable},
|
||||
}
|
||||
for _, tc := range cases {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
t.Parallel()
|
||||
ctrl := gomock.NewController(t)
|
||||
mock := mocks.NewMockCleanupService(ctrl)
|
||||
mock.EXPECT().Handle(gomock.Any(), gomock.Any()).Return(cleanupcontainer.Result{
|
||||
Outcome: operation.OutcomeFailure, ErrorCode: tc.errorCode, ErrorMessage: tc.name,
|
||||
}, nil)
|
||||
|
||||
rec := drive(t, Dependencies{CleanupContainer: mock}, http.MethodDelete,
|
||||
"/api/v1/internal/runtimes/game-test/container", nil, nil,
|
||||
)
|
||||
body := decodeErrorBody(t, rec, tc.wantStatus)
|
||||
assert.Equal(t, tc.errorCode, body.Code)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestCleanupHandlerReturnsInternalErrorWhenServiceNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodDelete,
|
||||
"/api/v1/internal/runtimes/game-test/container", nil, nil,
|
||||
)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
@@ -0,0 +1,115 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"net/http"
|
||||
"testing"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
// Tests for the read-only handlers (`internalListRuntimes`,
|
||||
// `internalGetRuntime`). These bypass the service layer and read
|
||||
// directly from `ports.RuntimeRecordStore` — see
|
||||
// `rtmanager/docs/services.md` §18.
|
||||
|
||||
func TestListHandlerReturnsEmptyItemsForEmptyStore(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
deps := Dependencies{RuntimeRecords: newFakeRuntimeRecords()}
|
||||
rec := drive(t, deps, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
|
||||
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
require.Equal(t, JSONContentType, rec.Header().Get("Content-Type"))
|
||||
|
||||
var resp runtimesListResponse
|
||||
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
|
||||
require.NotNil(t, resp.Items, "items must never be nil")
|
||||
assert.Empty(t, resp.Items)
|
||||
}
|
||||
|
||||
func TestListHandlerReturnsEveryStoredRecord(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
store := newFakeRuntimeRecords()
|
||||
store.put(sampleRunningRecord(t))
|
||||
store.put(sampleStoppedRecord(t))
|
||||
|
||||
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
|
||||
var resp runtimesListResponse
|
||||
require.NoError(t, json.NewDecoder(rec.Body).Decode(&resp))
|
||||
require.Len(t, resp.Items, 2)
|
||||
|
||||
gotIDs := map[string]string{}
|
||||
for _, item := range resp.Items {
|
||||
gotIDs[item.GameID] = item.Status
|
||||
}
|
||||
assert.Equal(t, "running", gotIDs["game-test"])
|
||||
assert.Equal(t, "stopped", gotIDs["game-stopped"])
|
||||
}
|
||||
|
||||
func TestListHandlerReturnsInternalErrorWhenStoreFails(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
store := newFakeRuntimeRecords()
|
||||
store.listErr = errors.New("postgres exploded")
|
||||
|
||||
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
func TestListHandlerReturnsInternalErrorWhenStoreNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodGet, "/api/v1/internal/runtimes", nil, nil)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
func TestGetHandlerReturnsTheRecord(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
store := newFakeRuntimeRecords()
|
||||
record := sampleRunningRecord(t)
|
||||
store.put(record)
|
||||
|
||||
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
|
||||
resp := decodeRecordResponse(t, rec)
|
||||
assert.Equal(t, "game-test", resp.GameID)
|
||||
assert.Equal(t, "running", resp.Status)
|
||||
if assert.NotNil(t, resp.CurrentImageRef) {
|
||||
assert.Equal(t, "galaxy/game:v1.2.3", *resp.CurrentImageRef)
|
||||
}
|
||||
}
|
||||
|
||||
func TestGetHandlerReturnsNotFoundForMissingRecord(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{RuntimeRecords: newFakeRuntimeRecords()}, http.MethodGet, "/api/v1/internal/runtimes/game-missing", nil, nil)
|
||||
body := decodeErrorBody(t, rec, http.StatusNotFound)
|
||||
assert.Equal(t, "not_found", body.Code)
|
||||
}
|
||||
|
||||
func TestGetHandlerReturnsInternalErrorWhenStoreFails(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
store := newFakeRuntimeRecords()
|
||||
store.getErr = errors.New("transport blew up")
|
||||
|
||||
rec := drive(t, Dependencies{RuntimeRecords: store}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
|
||||
func TestGetHandlerReturnsInternalErrorWhenStoreNotWired(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
rec := drive(t, Dependencies{}, http.MethodGet, "/api/v1/internal/runtimes/game-test", nil, nil)
|
||||
body := decodeErrorBody(t, rec, http.StatusInternalServerError)
|
||||
assert.Equal(t, "internal_error", body.Code)
|
||||
}
|
||||
@@ -0,0 +1,38 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// newListHandler returns the handler for `GET /api/v1/internal/runtimes`.
|
||||
// The handler reads directly from `ports.RuntimeRecordStore.List` —
|
||||
// this surface is read-only and does not produce operation_log rows
|
||||
// (rationale: see `rtmanager/docs/services.md` §18).
|
||||
func newListHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.list")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.RuntimeRecords == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"runtime records store is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
records, err := deps.RuntimeRecords.List(request.Context())
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "list runtime records",
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"failed to list runtime records",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimesList(records))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,217 @@
|
||||
// Code generated by MockGen. DO NOT EDIT.
|
||||
// Source: galaxy/rtmanager/internal/api/internalhttp/handlers (interfaces: StartService,StopService,RestartService,PatchService,CleanupService)
|
||||
//
|
||||
// Generated by this command:
|
||||
//
|
||||
// mockgen -destination=mocks/mock_services.go -package=mocks galaxy/rtmanager/internal/api/internalhttp/handlers StartService,StopService,RestartService,PatchService,CleanupService
|
||||
//
|
||||
|
||||
// Package mocks is a generated GoMock package.
|
||||
package mocks
|
||||
|
||||
import (
|
||||
context "context"
|
||||
cleanupcontainer "galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
patchruntime "galaxy/rtmanager/internal/service/patchruntime"
|
||||
restartruntime "galaxy/rtmanager/internal/service/restartruntime"
|
||||
startruntime "galaxy/rtmanager/internal/service/startruntime"
|
||||
stopruntime "galaxy/rtmanager/internal/service/stopruntime"
|
||||
reflect "reflect"
|
||||
|
||||
gomock "go.uber.org/mock/gomock"
|
||||
)
|
||||
|
||||
// MockStartService is a mock of StartService interface.
|
||||
type MockStartService struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockStartServiceMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockStartServiceMockRecorder is the mock recorder for MockStartService.
|
||||
type MockStartServiceMockRecorder struct {
|
||||
mock *MockStartService
|
||||
}
|
||||
|
||||
// NewMockStartService creates a new mock instance.
|
||||
func NewMockStartService(ctrl *gomock.Controller) *MockStartService {
|
||||
mock := &MockStartService{ctrl: ctrl}
|
||||
mock.recorder = &MockStartServiceMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockStartService) EXPECT() *MockStartServiceMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// Handle mocks base method.
|
||||
func (m *MockStartService) Handle(ctx context.Context, in startruntime.Input) (startruntime.Result, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Handle", ctx, in)
|
||||
ret0, _ := ret[0].(startruntime.Result)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Handle indicates an expected call of Handle.
|
||||
func (mr *MockStartServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockStartService)(nil).Handle), ctx, in)
|
||||
}
|
||||
|
||||
// MockStopService is a mock of StopService interface.
|
||||
type MockStopService struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockStopServiceMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockStopServiceMockRecorder is the mock recorder for MockStopService.
|
||||
type MockStopServiceMockRecorder struct {
|
||||
mock *MockStopService
|
||||
}
|
||||
|
||||
// NewMockStopService creates a new mock instance.
|
||||
func NewMockStopService(ctrl *gomock.Controller) *MockStopService {
|
||||
mock := &MockStopService{ctrl: ctrl}
|
||||
mock.recorder = &MockStopServiceMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockStopService) EXPECT() *MockStopServiceMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// Handle mocks base method.
|
||||
func (m *MockStopService) Handle(ctx context.Context, in stopruntime.Input) (stopruntime.Result, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Handle", ctx, in)
|
||||
ret0, _ := ret[0].(stopruntime.Result)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Handle indicates an expected call of Handle.
|
||||
func (mr *MockStopServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockStopService)(nil).Handle), ctx, in)
|
||||
}
|
||||
|
||||
// MockRestartService is a mock of RestartService interface.
|
||||
type MockRestartService struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockRestartServiceMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockRestartServiceMockRecorder is the mock recorder for MockRestartService.
|
||||
type MockRestartServiceMockRecorder struct {
|
||||
mock *MockRestartService
|
||||
}
|
||||
|
||||
// NewMockRestartService creates a new mock instance.
|
||||
func NewMockRestartService(ctrl *gomock.Controller) *MockRestartService {
|
||||
mock := &MockRestartService{ctrl: ctrl}
|
||||
mock.recorder = &MockRestartServiceMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockRestartService) EXPECT() *MockRestartServiceMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// Handle mocks base method.
|
||||
func (m *MockRestartService) Handle(ctx context.Context, in restartruntime.Input) (restartruntime.Result, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Handle", ctx, in)
|
||||
ret0, _ := ret[0].(restartruntime.Result)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Handle indicates an expected call of Handle.
|
||||
func (mr *MockRestartServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockRestartService)(nil).Handle), ctx, in)
|
||||
}
|
||||
|
||||
// MockPatchService is a mock of PatchService interface.
|
||||
type MockPatchService struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockPatchServiceMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockPatchServiceMockRecorder is the mock recorder for MockPatchService.
|
||||
type MockPatchServiceMockRecorder struct {
|
||||
mock *MockPatchService
|
||||
}
|
||||
|
||||
// NewMockPatchService creates a new mock instance.
|
||||
func NewMockPatchService(ctrl *gomock.Controller) *MockPatchService {
|
||||
mock := &MockPatchService{ctrl: ctrl}
|
||||
mock.recorder = &MockPatchServiceMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockPatchService) EXPECT() *MockPatchServiceMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// Handle mocks base method.
|
||||
func (m *MockPatchService) Handle(ctx context.Context, in patchruntime.Input) (patchruntime.Result, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Handle", ctx, in)
|
||||
ret0, _ := ret[0].(patchruntime.Result)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Handle indicates an expected call of Handle.
|
||||
func (mr *MockPatchServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockPatchService)(nil).Handle), ctx, in)
|
||||
}
|
||||
|
||||
// MockCleanupService is a mock of CleanupService interface.
|
||||
type MockCleanupService struct {
|
||||
ctrl *gomock.Controller
|
||||
recorder *MockCleanupServiceMockRecorder
|
||||
isgomock struct{}
|
||||
}
|
||||
|
||||
// MockCleanupServiceMockRecorder is the mock recorder for MockCleanupService.
|
||||
type MockCleanupServiceMockRecorder struct {
|
||||
mock *MockCleanupService
|
||||
}
|
||||
|
||||
// NewMockCleanupService creates a new mock instance.
|
||||
func NewMockCleanupService(ctrl *gomock.Controller) *MockCleanupService {
|
||||
mock := &MockCleanupService{ctrl: ctrl}
|
||||
mock.recorder = &MockCleanupServiceMockRecorder{mock}
|
||||
return mock
|
||||
}
|
||||
|
||||
// EXPECT returns an object that allows the caller to indicate expected use.
|
||||
func (m *MockCleanupService) EXPECT() *MockCleanupServiceMockRecorder {
|
||||
return m.recorder
|
||||
}
|
||||
|
||||
// Handle mocks base method.
|
||||
func (m *MockCleanupService) Handle(ctx context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error) {
|
||||
m.ctrl.T.Helper()
|
||||
ret := m.ctrl.Call(m, "Handle", ctx, in)
|
||||
ret0, _ := ret[0].(cleanupcontainer.Result)
|
||||
ret1, _ := ret[1].(error)
|
||||
return ret0, ret1
|
||||
}
|
||||
|
||||
// Handle indicates an expected call of Handle.
|
||||
func (mr *MockCleanupServiceMockRecorder) Handle(ctx, in any) *gomock.Call {
|
||||
mr.mock.ctrl.T.Helper()
|
||||
return mr.mock.ctrl.RecordCallWithMethodType(mr.mock, "Handle", reflect.TypeOf((*MockCleanupService)(nil).Handle), ctx, in)
|
||||
}
|
||||
@@ -0,0 +1,71 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/service/patchruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// patchRequestBody mirrors the OpenAPI PatchRequest schema. The
|
||||
// service layer validates `image_ref` shape (semver, distribution
|
||||
// reference) and surfaces `image_ref_not_semver` /
|
||||
// `semver_patch_only` as needed.
|
||||
type patchRequestBody struct {
|
||||
ImageRef string `json:"image_ref"`
|
||||
}
|
||||
|
||||
// newPatchHandler returns the handler for
|
||||
// `POST /api/v1/internal/runtimes/{game_id}/patch`.
|
||||
func newPatchHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.patch")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.PatchRuntime == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"patch runtime service is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
var body patchRequestBody
|
||||
if err := decodeStrictJSON(request.Body, &body); err != nil {
|
||||
writeError(writer, http.StatusBadRequest,
|
||||
startruntime.ErrorCodeInvalidRequest,
|
||||
err.Error(),
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
result, err := deps.PatchRuntime.Handle(request.Context(), patchruntime.Input{
|
||||
GameID: gameID,
|
||||
NewImageRef: body.ImageRef,
|
||||
OpSource: resolveOpSource(request),
|
||||
SourceRef: requestSourceRef(request),
|
||||
})
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "patch runtime service errored",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"patch runtime service failed",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
if result.Outcome == operation.OutcomeFailure {
|
||||
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,55 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/service/restartruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// newRestartHandler returns the handler for
|
||||
// `POST /api/v1/internal/runtimes/{game_id}/restart`. The OpenAPI spec
|
||||
// declares no request body for this operation; any client-provided
|
||||
// body is ignored.
|
||||
func newRestartHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.restart")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.RestartRuntime == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"restart runtime service is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
result, err := deps.RestartRuntime.Handle(request.Context(), restartruntime.Input{
|
||||
GameID: gameID,
|
||||
OpSource: resolveOpSource(request),
|
||||
SourceRef: requestSourceRef(request),
|
||||
})
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "restart runtime service errored",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"restart runtime service failed",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
if result.Outcome == operation.OutcomeFailure {
|
||||
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,54 @@
|
||||
// Package handlers ships the GM/Admin-facing internal REST surface of
|
||||
// Runtime Manager. The package is consumed by
|
||||
// `galaxy/rtmanager/internal/api/internalhttp`; each handler delegates
|
||||
// to one of the lifecycle services in `internal/service/`
|
||||
// (`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
|
||||
// `cleanupcontainer`) or reads directly from `ports.RuntimeRecordStore`
|
||||
// (list / get).
|
||||
//
|
||||
// The interfaces declared in this file mirror the single `Handle`
|
||||
// method exposed by every concrete lifecycle service. Production wiring
|
||||
// passes the concrete service pointers; tests pass `mockgen`-generated
|
||||
// mocks. The narrow shape keeps the handler layer free of service
|
||||
// internals (lease tokens, telemetry, durable side effects) and matches
|
||||
// the repo-wide `mockgen` convention for wide / recorder ports.
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"context"
|
||||
|
||||
"galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
"galaxy/rtmanager/internal/service/patchruntime"
|
||||
"galaxy/rtmanager/internal/service/restartruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
"galaxy/rtmanager/internal/service/stopruntime"
|
||||
)
|
||||
|
||||
//go:generate go run go.uber.org/mock/mockgen -destination=mocks/mock_services.go -package=mocks galaxy/rtmanager/internal/api/internalhttp/handlers StartService,StopService,RestartService,PatchService,CleanupService
|
||||
|
||||
// StartService is the narrow port the start handler depends on. It
|
||||
// matches the public Handle method of `startruntime.Service`; the
|
||||
// concrete service satisfies the interface implicitly.
|
||||
type StartService interface {
|
||||
Handle(ctx context.Context, in startruntime.Input) (startruntime.Result, error)
|
||||
}
|
||||
|
||||
// StopService is the narrow port the stop handler depends on.
|
||||
type StopService interface {
|
||||
Handle(ctx context.Context, in stopruntime.Input) (stopruntime.Result, error)
|
||||
}
|
||||
|
||||
// RestartService is the narrow port the restart handler depends on.
|
||||
type RestartService interface {
|
||||
Handle(ctx context.Context, in restartruntime.Input) (restartruntime.Result, error)
|
||||
}
|
||||
|
||||
// PatchService is the narrow port the patch handler depends on.
|
||||
type PatchService interface {
|
||||
Handle(ctx context.Context, in patchruntime.Input) (patchruntime.Result, error)
|
||||
}
|
||||
|
||||
// CleanupService is the narrow port the cleanup handler depends on.
|
||||
type CleanupService interface {
|
||||
Handle(ctx context.Context, in cleanupcontainer.Input) (cleanupcontainer.Result, error)
|
||||
}
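// Because the concrete lifecycle services satisfy these ports implicitly,
// production wiring can pin the fit at compile time. An illustrative
// sketch: the assertions are not part of this commit, only
// *startruntime.Service is named explicitly above, and the remaining type
// names follow the `*<lifecycle>.Service` convention as an assumption:
//
//	var (
//		_ StartService   = (*startruntime.Service)(nil)
//		_ StopService    = (*stopruntime.Service)(nil)
//		_ RestartService = (*restartruntime.Service)(nil)
//		_ PatchService   = (*patchruntime.Service)(nil)
//		_ CleanupService = (*cleanupcontainer.Service)(nil)
//	)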
|
||||
@@ -0,0 +1,71 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
)
|
||||
|
||||
// startRequestBody mirrors the OpenAPI StartRequest schema. Only
|
||||
// `image_ref` is accepted; unknown fields are rejected by
|
||||
// decodeStrictJSON.
|
||||
type startRequestBody struct {
|
||||
ImageRef string `json:"image_ref"`
|
||||
}
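// decodeStrictJSON is a shared helper defined elsewhere in this package.
// A minimal sketch of the behaviour the request-body tests rely on:
// unknown fields and malformed JSON both fail, and the handlers turn that
// into `400 invalid_request`; the real implementation may differ:
//
//	func decodeStrictJSON(r io.Reader, dst any) error {
//		dec := json.NewDecoder(r)
//		dec.DisallowUnknownFields()
//		return dec.Decode(dst)
//	}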
|
||||
|
||||
// newStartHandler returns the handler for
|
||||
// `POST /api/v1/internal/runtimes/{game_id}/start`. The handler
|
||||
// delegates the entire lifecycle to `startruntime.Service`; failure
|
||||
// codes are mapped to HTTP statuses via mapErrorCodeToStatus.
|
||||
func newStartHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.start")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.StartRuntime == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"start runtime service is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
var body startRequestBody
|
||||
if err := decodeStrictJSON(request.Body, &body); err != nil {
|
||||
writeError(writer, http.StatusBadRequest,
|
||||
startruntime.ErrorCodeInvalidRequest,
|
||||
err.Error(),
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
result, err := deps.StartRuntime.Handle(request.Context(), startruntime.Input{
|
||||
GameID: gameID,
|
||||
ImageRef: body.ImageRef,
|
||||
OpSource: resolveOpSource(request),
|
||||
SourceRef: requestSourceRef(request),
|
||||
})
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "start runtime service errored",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"start runtime service failed",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
if result.Outcome == operation.OutcomeFailure {
|
||||
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,70 @@
|
||||
package handlers
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
|
||||
"galaxy/rtmanager/internal/domain/operation"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
"galaxy/rtmanager/internal/service/stopruntime"
|
||||
)
|
||||
|
||||
// stopRequestBody mirrors the OpenAPI StopRequest schema. The reason
|
||||
// enum is validated at the service layer (`stopruntime.Input.Validate`);
|
||||
// unknown values surface as `invalid_request`.
|
||||
type stopRequestBody struct {
|
||||
Reason string `json:"reason"`
|
||||
}
|
||||
|
||||
// newStopHandler returns the handler for
|
||||
// `POST /api/v1/internal/runtimes/{game_id}/stop`.
|
||||
func newStopHandler(deps Dependencies) http.HandlerFunc {
|
||||
logger := loggerFor(deps.Logger, "internal_rest.stop")
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if deps.StopRuntime == nil {
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"stop runtime service is not wired",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
gameID, ok := extractGameID(writer, request)
|
||||
if !ok {
|
||||
return
|
||||
}
|
||||
|
||||
var body stopRequestBody
|
||||
if err := decodeStrictJSON(request.Body, &body); err != nil {
|
||||
writeError(writer, http.StatusBadRequest,
|
||||
startruntime.ErrorCodeInvalidRequest,
|
||||
err.Error(),
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
result, err := deps.StopRuntime.Handle(request.Context(), stopruntime.Input{
|
||||
GameID: gameID,
|
||||
Reason: stopruntime.StopReason(body.Reason),
|
||||
OpSource: resolveOpSource(request),
|
||||
SourceRef: requestSourceRef(request),
|
||||
})
|
||||
if err != nil {
|
||||
logger.ErrorContext(request.Context(), "stop runtime service errored",
|
||||
"game_id", gameID,
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeError(writer, http.StatusInternalServerError,
|
||||
startruntime.ErrorCodeInternal,
|
||||
"stop runtime service failed",
|
||||
)
|
||||
return
|
||||
}
|
||||
|
||||
if result.Outcome == operation.OutcomeFailure {
|
||||
writeFailure(writer, result.ErrorCode, result.ErrorMessage)
|
||||
return
|
||||
}
|
||||
|
||||
writeJSON(writer, http.StatusOK, encodeRuntimeRecord(result.Record))
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,363 @@
|
||||
// Package internalhttp provides the trusted internal HTTP listener used
|
||||
// by the runnable Runtime Manager process. It exposes `/healthz` and
|
||||
// `/readyz` plus the GM/Admin REST surface backed by the lifecycle
|
||||
// services in `internal/service/`.
|
||||
package internalhttp
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"net"
|
||||
"net/http"
|
||||
"strconv"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/api/internalhttp/handlers"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
"galaxy/rtmanager/internal/telemetry"
|
||||
|
||||
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
|
||||
"go.opentelemetry.io/otel/attribute"
|
||||
)
|
||||
|
||||
const jsonContentType = "application/json; charset=utf-8"
|
||||
|
||||
// errorCodeServiceUnavailable mirrors the stable error code declared in
|
||||
// the Error Model section of `rtmanager/api/internal-openapi.yaml`.
|
||||
const errorCodeServiceUnavailable = "service_unavailable"
|
||||
|
||||
// HealthzPath and ReadyzPath are the internal probe routes documented in
|
||||
// `rtmanager/api/internal-openapi.yaml`.
|
||||
const (
|
||||
HealthzPath = "/healthz"
|
||||
ReadyzPath = "/readyz"
|
||||
)
|
||||
|
||||
// ReadinessProbe reports whether the dependencies the listener guards
|
||||
// (PostgreSQL, Redis, Docker) are reachable. A non-nil error is reported
|
||||
// to the caller as `503 service_unavailable` with the wrapped message.
|
||||
type ReadinessProbe interface {
|
||||
Check(ctx context.Context) error
|
||||
}
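// A readiness probe can be a plain function or a small composite over the
// PostgreSQL, Redis, and Docker checks. An illustrative sketch; the names
// are assumptions, not part of this commit:
//
//	type probeFunc func(ctx context.Context) error
//
//	func (f probeFunc) Check(ctx context.Context) error { return f(ctx) }
//
//	func allOf(probes ...ReadinessProbe) ReadinessProbe {
//		return probeFunc(func(ctx context.Context) error {
//			for _, probe := range probes {
//				if err := probe.Check(ctx); err != nil {
//					return err
//				}
//			}
//			return nil
//		})
//	}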
|
||||
|
||||
// Config describes the trusted internal HTTP listener owned by Runtime
|
||||
// Manager.
|
||||
type Config struct {
|
||||
// Addr is the TCP listen address used by the internal HTTP server.
|
||||
Addr string
|
||||
|
||||
// ReadHeaderTimeout bounds how long the listener may spend reading
|
||||
// request headers before the server rejects the connection.
|
||||
ReadHeaderTimeout time.Duration
|
||||
|
||||
// ReadTimeout bounds how long the listener may spend reading one
|
||||
// request.
|
||||
ReadTimeout time.Duration
|
||||
|
||||
// WriteTimeout bounds how long the listener may spend writing one
|
||||
// response.
|
||||
WriteTimeout time.Duration
|
||||
|
||||
// IdleTimeout bounds how long the listener keeps an idle keep-alive
|
||||
// connection open.
|
||||
IdleTimeout time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg contains a usable internal HTTP listener
|
||||
// configuration.
|
||||
func (cfg Config) Validate() error {
|
||||
switch {
|
||||
case cfg.Addr == "":
|
||||
return errors.New("internal HTTP addr must not be empty")
|
||||
case cfg.ReadHeaderTimeout <= 0:
|
||||
return errors.New("internal HTTP read header timeout must be positive")
|
||||
case cfg.ReadTimeout <= 0:
|
||||
return errors.New("internal HTTP read timeout must be positive")
|
||||
case cfg.WriteTimeout <= 0:
|
||||
return errors.New("internal HTTP write timeout must be positive")
|
||||
case cfg.IdleTimeout <= 0:
|
||||
return errors.New("internal HTTP idle timeout must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
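// For illustration, a configuration that passes Validate; the concrete
// values are examples only, not defaults shipped by this commit:
//
//	cfg := internalhttp.Config{
//		Addr:              "127.0.0.1:8081",
//		ReadHeaderTimeout: 5 * time.Second,
//		ReadTimeout:       10 * time.Second,
//		WriteTimeout:      15 * time.Second,
//		IdleTimeout:       60 * time.Second,
//	}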
|
||||
|
||||
// Dependencies describes the collaborators used by the internal HTTP
|
||||
// transport layer. The listener still works when the lifecycle service
|
||||
// fields are zero — handlers register but each returns
|
||||
// `500 internal_error` until the runtime wires the real services.
|
||||
type Dependencies struct {
|
||||
// Logger writes structured listener lifecycle logs. When nil,
|
||||
// slog.Default is used.
|
||||
Logger *slog.Logger
|
||||
|
||||
// Telemetry records low-cardinality probe metrics and lifecycle
|
||||
// events.
|
||||
Telemetry *telemetry.Runtime
|
||||
|
||||
// Readiness reports whether PG / Redis / Docker are reachable. A
|
||||
// nil readiness probe makes `/readyz` always answer `200`; the
|
||||
// runtime always supplies a real probe in production wiring.
|
||||
Readiness ReadinessProbe
|
||||
|
||||
// RuntimeRecords backs the read-only list/get handlers. When nil
|
||||
// those routes return `500 internal_error`.
|
||||
RuntimeRecords ports.RuntimeRecordStore
|
||||
|
||||
// StartRuntime, StopRuntime, RestartRuntime, PatchRuntime, and
|
||||
// CleanupContainer back the lifecycle handlers. Each accepts a
|
||||
// narrow interface so tests can pass `mockgen`-generated mocks;
|
||||
// production wiring passes the concrete `*<lifecycle>.Service`
|
||||
// pointer.
|
||||
StartRuntime handlers.StartService
|
||||
StopRuntime handlers.StopService
|
||||
RestartRuntime handlers.RestartService
|
||||
PatchRuntime handlers.PatchService
|
||||
CleanupContainer handlers.CleanupService
|
||||
}
|
||||
|
||||
// Server owns the trusted internal HTTP listener exposed by Runtime
|
||||
// Manager.
|
||||
type Server struct {
|
||||
cfg Config
|
||||
|
||||
handler http.Handler
|
||||
logger *slog.Logger
|
||||
metrics *telemetry.Runtime
|
||||
|
||||
stateMu sync.RWMutex
|
||||
server *http.Server
|
||||
listener net.Listener
|
||||
}
|
||||
|
||||
// NewServer constructs one trusted internal HTTP server for cfg and deps.
|
||||
func NewServer(cfg Config, deps Dependencies) (*Server, error) {
|
||||
if err := cfg.Validate(); err != nil {
|
||||
return nil, fmt.Errorf("new internal HTTP server: %w", err)
|
||||
}
|
||||
|
||||
logger := deps.Logger
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
|
||||
return &Server{
|
||||
cfg: cfg,
|
||||
handler: newHandler(deps, logger),
|
||||
logger: logger.With("component", "internal_http"),
|
||||
metrics: deps.Telemetry,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// Addr returns the currently bound listener address after Run is called.
|
||||
// It returns an empty string if the server has not yet bound a listener.
|
||||
func (server *Server) Addr() string {
|
||||
server.stateMu.RLock()
|
||||
defer server.stateMu.RUnlock()
|
||||
if server.listener == nil {
|
||||
return ""
|
||||
}
|
||||
|
||||
return server.listener.Addr().String()
|
||||
}
|
||||
|
||||
// Run binds the configured listener and serves the internal HTTP surface
|
||||
// until Shutdown closes the server.
|
||||
func (server *Server) Run(ctx context.Context) error {
|
||||
if ctx == nil {
|
||||
return errors.New("run internal HTTP server: nil context")
|
||||
}
|
||||
if err := ctx.Err(); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
listener, err := net.Listen("tcp", server.cfg.Addr)
|
||||
if err != nil {
|
||||
return fmt.Errorf("run internal HTTP server: listen on %q: %w", server.cfg.Addr, err)
|
||||
}
|
||||
|
||||
httpServer := &http.Server{
|
||||
Handler: server.handler,
|
||||
ReadHeaderTimeout: server.cfg.ReadHeaderTimeout,
|
||||
ReadTimeout: server.cfg.ReadTimeout,
|
||||
WriteTimeout: server.cfg.WriteTimeout,
|
||||
IdleTimeout: server.cfg.IdleTimeout,
|
||||
}
|
||||
|
||||
server.stateMu.Lock()
|
||||
server.server = httpServer
|
||||
server.listener = listener
|
||||
server.stateMu.Unlock()
|
||||
|
||||
server.logger.Info("rtmanager internal HTTP server started", "addr", listener.Addr().String())
|
||||
|
||||
defer func() {
|
||||
server.stateMu.Lock()
|
||||
server.server = nil
|
||||
server.listener = nil
|
||||
server.stateMu.Unlock()
|
||||
}()
|
||||
|
||||
err = httpServer.Serve(listener)
|
||||
switch {
|
||||
case err == nil:
|
||||
return nil
|
||||
case errors.Is(err, http.ErrServerClosed):
|
||||
server.logger.Info("rtmanager internal HTTP server stopped")
|
||||
return nil
|
||||
default:
|
||||
return fmt.Errorf("run internal HTTP server: serve on %q: %w", server.cfg.Addr, err)
|
||||
}
|
||||
}
|
||||
|
||||
// Shutdown gracefully stops the internal HTTP server within ctx.
|
||||
func (server *Server) Shutdown(ctx context.Context) error {
|
||||
if ctx == nil {
|
||||
return errors.New("shutdown internal HTTP server: nil context")
|
||||
}
|
||||
|
||||
server.stateMu.RLock()
|
||||
httpServer := server.server
|
||||
server.stateMu.RUnlock()
|
||||
|
||||
if httpServer == nil {
|
||||
return nil
|
||||
}
|
||||
|
||||
if err := httpServer.Shutdown(ctx); err != nil && !errors.Is(err, http.ErrServerClosed) {
|
||||
return fmt.Errorf("shutdown internal HTTP server: %w", err)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func newHandler(deps Dependencies, logger *slog.Logger) http.Handler {
|
||||
mux := http.NewServeMux()
|
||||
mux.HandleFunc("GET "+HealthzPath, handleHealthz)
|
||||
mux.HandleFunc("GET "+ReadyzPath, handleReadyz(deps.Readiness, logger))
|
||||
|
||||
handlers.Register(mux, handlers.Dependencies{
|
||||
Logger: logger,
|
||||
RuntimeRecords: deps.RuntimeRecords,
|
||||
StartRuntime: deps.StartRuntime,
|
||||
StopRuntime: deps.StopRuntime,
|
||||
RestartRuntime: deps.RestartRuntime,
|
||||
PatchRuntime: deps.PatchRuntime,
|
||||
CleanupContainer: deps.CleanupContainer,
|
||||
})
|
||||
|
||||
metrics := deps.Telemetry
|
||||
options := []otelhttp.Option{}
|
||||
if metrics != nil {
|
||||
options = append(options,
|
||||
otelhttp.WithTracerProvider(metrics.TracerProvider()),
|
||||
otelhttp.WithMeterProvider(metrics.MeterProvider()),
|
||||
)
|
||||
}
|
||||
|
||||
return otelhttp.NewHandler(withObservability(mux, metrics), "rtmanager.internal_http", options...)
|
||||
}
|
||||
|
||||
func withObservability(next http.Handler, metrics *telemetry.Runtime) http.Handler {
|
||||
return http.HandlerFunc(func(writer http.ResponseWriter, request *http.Request) {
|
||||
startedAt := time.Now()
|
||||
recorder := &statusRecorder{
|
||||
ResponseWriter: writer,
|
||||
statusCode: http.StatusOK,
|
||||
}
|
||||
|
||||
next.ServeHTTP(recorder, request)
|
||||
|
||||
route := request.Pattern
|
||||
switch recorder.statusCode {
|
||||
case http.StatusMethodNotAllowed:
|
||||
route = "method_not_allowed"
|
||||
case http.StatusNotFound:
|
||||
route = "not_found"
|
||||
case 0:
|
||||
route = "unmatched"
|
||||
}
|
||||
if route == "" {
|
||||
route = "unmatched"
|
||||
}
|
||||
|
||||
if metrics != nil {
|
||||
metrics.RecordInternalHTTPRequest(
|
||||
request.Context(),
|
||||
[]attribute.KeyValue{
|
||||
attribute.String("route", route),
|
||||
attribute.String("method", request.Method),
|
||||
attribute.String("status_code", strconv.Itoa(recorder.statusCode)),
|
||||
},
|
||||
time.Since(startedAt),
|
||||
)
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
func handleHealthz(writer http.ResponseWriter, _ *http.Request) {
|
||||
writeStatusResponse(writer, http.StatusOK, "ok")
|
||||
}
|
||||
|
||||
func handleReadyz(probe ReadinessProbe, logger *slog.Logger) http.HandlerFunc {
|
||||
return func(writer http.ResponseWriter, request *http.Request) {
|
||||
if probe == nil {
|
||||
writeStatusResponse(writer, http.StatusOK, "ready")
|
||||
return
|
||||
}
|
||||
|
||||
if err := probe.Check(request.Context()); err != nil {
|
||||
logger.WarnContext(request.Context(), "rtmanager readiness probe failed",
|
||||
"err", err.Error(),
|
||||
)
|
||||
writeServiceUnavailable(writer, err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
writeStatusResponse(writer, http.StatusOK, "ready")
|
||||
}
|
||||
}
|
||||
|
||||
func writeStatusResponse(writer http.ResponseWriter, statusCode int, status string) {
|
||||
writer.Header().Set("Content-Type", jsonContentType)
|
||||
writer.WriteHeader(statusCode)
|
||||
_ = json.NewEncoder(writer).Encode(statusResponse{Status: status})
|
||||
}
|
||||
|
||||
func writeServiceUnavailable(writer http.ResponseWriter, message string) {
|
||||
writer.Header().Set("Content-Type", jsonContentType)
|
||||
writer.WriteHeader(http.StatusServiceUnavailable)
|
||||
_ = json.NewEncoder(writer).Encode(errorResponse{
|
||||
Error: errorBody{
|
||||
Code: errorCodeServiceUnavailable,
|
||||
Message: message,
|
||||
},
|
||||
})
|
||||
}
|
||||
|
||||
type statusResponse struct {
|
||||
Status string `json:"status"`
|
||||
}
|
||||
|
||||
type errorBody struct {
|
||||
Code string `json:"code"`
|
||||
Message string `json:"message"`
|
||||
}
|
||||
|
||||
type errorResponse struct {
|
||||
Error errorBody `json:"error"`
|
||||
}
|
||||
|
||||
type statusRecorder struct {
|
||||
http.ResponseWriter
|
||||
statusCode int
|
||||
}
|
||||
|
||||
func (recorder *statusRecorder) WriteHeader(statusCode int) {
|
||||
recorder.statusCode = statusCode
|
||||
recorder.ResponseWriter.WriteHeader(statusCode)
|
||||
}
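
For orientation, the listener is meant to be driven by the process runtime rather than run standalone: construct it, serve it on a cancellable context, and call `Shutdown` with a bounded context once that context ends (`Run` only observes the context at startup; it serves until `Shutdown` closes the listener). A minimal sketch under that assumption — the address, timeouts, and empty `Dependencies` are illustrative, not the production wiring:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"

	"galaxy/rtmanager/internal/api/internalhttp"
)

func main() {
	server, err := internalhttp.NewServer(internalhttp.Config{
		Addr:              ":8096",
		ReadHeaderTimeout: 2 * time.Second,
		ReadTimeout:       5 * time.Second,
		WriteTimeout:      15 * time.Second,
		IdleTimeout:       60 * time.Second,
	}, internalhttp.Dependencies{}) // nil Logger → slog.Default; nil Readiness → /readyz always 200
	if err != nil {
		log.Fatal(err)
	}

	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	// Run serves until Shutdown closes the listener, so shut down when ctx ends.
	go func() {
		<-ctx.Done()
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		_ = server.Shutdown(shutdownCtx)
	}()

	if err := server.Run(ctx); err != nil {
		log.Fatal(err)
	}
}
```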
|
||||
@@ -0,0 +1,115 @@
|
||||
package internalhttp
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"net/http"
|
||||
"net/http/httptest"
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func newTestConfig() Config {
|
||||
return Config{
|
||||
Addr: ":0",
|
||||
ReadHeaderTimeout: time.Second,
|
||||
ReadTimeout: time.Second,
|
||||
WriteTimeout: time.Second,
|
||||
IdleTimeout: time.Second,
|
||||
}
|
||||
}
|
||||
|
||||
type stubReadiness struct {
|
||||
err error
|
||||
}
|
||||
|
||||
func (probe stubReadiness) Check(_ context.Context) error {
|
||||
return probe.err
|
||||
}
|
||||
|
||||
func newTestServer(t *testing.T, deps Dependencies) http.Handler {
|
||||
t.Helper()
|
||||
server, err := NewServer(newTestConfig(), deps)
|
||||
require.NoError(t, err)
|
||||
return server.handler
|
||||
}
|
||||
|
||||
func TestHealthzReturnsOK(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
handler := newTestServer(t, Dependencies{})
|
||||
|
||||
rec := httptest.NewRecorder()
|
||||
req := httptest.NewRequest(http.MethodGet, HealthzPath, nil)
|
||||
handler.ServeHTTP(rec, req)
|
||||
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
require.Equal(t, jsonContentType, rec.Header().Get("Content-Type"))
|
||||
|
||||
var body statusResponse
|
||||
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
|
||||
require.Equal(t, "ok", body.Status)
|
||||
}
|
||||
|
||||
func TestReadyzReturnsReadyWhenProbeIsNil(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
handler := newTestServer(t, Dependencies{})
|
||||
|
||||
rec := httptest.NewRecorder()
|
||||
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
|
||||
handler.ServeHTTP(rec, req)
|
||||
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
|
||||
var body statusResponse
|
||||
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
|
||||
require.Equal(t, "ready", body.Status)
|
||||
}
|
||||
|
||||
func TestReadyzReturnsReadyWhenProbeSucceeds(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
handler := newTestServer(t, Dependencies{Readiness: stubReadiness{}})
|
||||
|
||||
rec := httptest.NewRecorder()
|
||||
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
|
||||
handler.ServeHTTP(rec, req)
|
||||
|
||||
require.Equal(t, http.StatusOK, rec.Code)
|
||||
|
||||
var body statusResponse
|
||||
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
|
||||
require.Equal(t, "ready", body.Status)
|
||||
}
|
||||
|
||||
func TestReadyzReturnsServiceUnavailableWhenProbeFails(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
handler := newTestServer(t, Dependencies{
|
||||
Readiness: stubReadiness{err: errors.New("postgres ping: connection refused")},
|
||||
})
|
||||
|
||||
rec := httptest.NewRecorder()
|
||||
req := httptest.NewRequest(http.MethodGet, ReadyzPath, nil)
|
||||
handler.ServeHTTP(rec, req)
|
||||
|
||||
require.Equal(t, http.StatusServiceUnavailable, rec.Code)
|
||||
require.Equal(t, jsonContentType, rec.Header().Get("Content-Type"))
|
||||
|
||||
var body errorResponse
|
||||
require.NoError(t, json.Unmarshal(rec.Body.Bytes(), &body))
|
||||
require.Equal(t, errorCodeServiceUnavailable, body.Error.Code)
|
||||
require.True(t, strings.Contains(body.Error.Message, "postgres"))
|
||||
}
|
||||
|
||||
func TestNewServerRejectsInvalidConfig(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
_, err := NewServer(Config{}, Dependencies{})
|
||||
require.Error(t, err)
|
||||
}
|
||||
@@ -0,0 +1,170 @@
|
||||
// Package app wires the Runtime Manager process lifecycle and
|
||||
// coordinates component startup and graceful shutdown.
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"sync"
|
||||
|
||||
"galaxy/rtmanager/internal/config"
|
||||
)
|
||||
|
||||
// Component is a long-lived Runtime Manager subsystem that participates
|
||||
// in coordinated startup and graceful shutdown.
|
||||
type Component interface {
|
||||
// Run starts the component and blocks until it stops.
|
||||
Run(context.Context) error
|
||||
|
||||
// Shutdown stops the component within the provided timeout-bounded
|
||||
// context.
|
||||
Shutdown(context.Context) error
|
||||
}
|
||||
|
||||
// App owns the process-level lifecycle of Runtime Manager and its
|
||||
// registered components.
|
||||
type App struct {
|
||||
cfg config.Config
|
||||
components []Component
|
||||
}
|
||||
|
||||
// New constructs App with a defensive copy of the supplied components.
|
||||
func New(cfg config.Config, components ...Component) *App {
|
||||
clonedComponents := append([]Component(nil), components...)
|
||||
|
||||
return &App{
|
||||
cfg: cfg,
|
||||
components: clonedComponents,
|
||||
}
|
||||
}
|
||||
|
||||
// Run starts all configured components, waits for cancellation or the
|
||||
// first component failure, and then executes best-effort graceful
|
||||
// shutdown.
|
||||
func (app *App) Run(ctx context.Context) error {
|
||||
if ctx == nil {
|
||||
return errors.New("run rtmanager app: nil context")
|
||||
}
|
||||
if err := app.validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if len(app.components) == 0 {
|
||||
<-ctx.Done()
|
||||
return nil
|
||||
}
|
||||
|
||||
runCtx, cancel := context.WithCancel(ctx)
|
||||
defer cancel()
|
||||
|
||||
results := make(chan componentResult, len(app.components))
|
||||
var runWaitGroup sync.WaitGroup
|
||||
|
||||
for index, component := range app.components {
|
||||
runWaitGroup.Add(1)
|
||||
|
||||
go func(componentIndex int, component Component) {
|
||||
defer runWaitGroup.Done()
|
||||
results <- componentResult{
|
||||
index: componentIndex,
|
||||
err: component.Run(runCtx),
|
||||
}
|
||||
}(index, component)
|
||||
}
|
||||
|
||||
var runErr error
|
||||
|
||||
select {
|
||||
case <-ctx.Done():
|
||||
case result := <-results:
|
||||
runErr = classifyComponentResult(ctx, result)
|
||||
}
|
||||
|
||||
cancel()
|
||||
|
||||
shutdownErr := app.shutdownComponents()
|
||||
waitErr := app.waitForComponents(&runWaitGroup)
|
||||
|
||||
return errors.Join(runErr, shutdownErr, waitErr)
|
||||
}
|
||||
|
||||
type componentResult struct {
|
||||
index int
|
||||
err error
|
||||
}
|
||||
|
||||
func (app *App) validate() error {
|
||||
if app.cfg.ShutdownTimeout <= 0 {
|
||||
return fmt.Errorf("run rtmanager app: shutdown timeout must be positive, got %s", app.cfg.ShutdownTimeout)
|
||||
}
|
||||
|
||||
for index, component := range app.components {
|
||||
if component == nil {
|
||||
return fmt.Errorf("run rtmanager app: component %d is nil", index)
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func classifyComponentResult(parentCtx context.Context, result componentResult) error {
|
||||
switch {
|
||||
case result.err == nil:
|
||||
if parentCtx.Err() != nil {
|
||||
return nil
|
||||
}
|
||||
return fmt.Errorf("run rtmanager app: component %d exited without error before shutdown", result.index)
|
||||
case errors.Is(result.err, context.Canceled) && parentCtx.Err() != nil:
|
||||
return nil
|
||||
default:
|
||||
return fmt.Errorf("run rtmanager app: component %d: %w", result.index, result.err)
|
||||
}
|
||||
}
|
||||
|
||||
func (app *App) shutdownComponents() error {
|
||||
var shutdownWaitGroup sync.WaitGroup
|
||||
errs := make(chan error, len(app.components))
|
||||
|
||||
for index, component := range app.components {
|
||||
shutdownWaitGroup.Add(1)
|
||||
|
||||
go func(componentIndex int, component Component) {
|
||||
defer shutdownWaitGroup.Done()
|
||||
|
||||
shutdownCtx, cancel := context.WithTimeout(context.Background(), app.cfg.ShutdownTimeout)
|
||||
defer cancel()
|
||||
|
||||
if err := component.Shutdown(shutdownCtx); err != nil {
|
||||
errs <- fmt.Errorf("shutdown rtmanager component %d: %w", componentIndex, err)
|
||||
}
|
||||
}(index, component)
|
||||
}
|
||||
|
||||
shutdownWaitGroup.Wait()
|
||||
close(errs)
|
||||
|
||||
var joined error
|
||||
for err := range errs {
|
||||
joined = errors.Join(joined, err)
|
||||
}
|
||||
|
||||
return joined
|
||||
}
|
||||
|
||||
func (app *App) waitForComponents(runWaitGroup *sync.WaitGroup) error {
|
||||
done := make(chan struct{})
|
||||
go func() {
|
||||
runWaitGroup.Wait()
|
||||
close(done)
|
||||
}()
|
||||
|
||||
waitCtx, cancel := context.WithTimeout(context.Background(), app.cfg.ShutdownTimeout)
|
||||
defer cancel()
|
||||
|
||||
select {
|
||||
case <-done:
|
||||
return nil
|
||||
case <-waitCtx.Done():
|
||||
return fmt.Errorf("wait for rtmanager components: %w", waitCtx.Err())
|
||||
}
|
||||
}
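
The contract a Component has to follow is implicit in `Run` and `classifyComponentResult`: block inside `Run` until the run context is cancelled (returning `context.Canceled` once the parent context is cancelled counts as a clean exit, while returning `nil` before shutdown is flagged as a premature exit), and keep `Shutdown` safe to call even after `Run` has returned. A minimal sketch of such a component — the `heartbeat` type, its interval, and its logger are illustrative only:

```go
// heartbeat is an illustrative Component: it ticks until the run context is
// cancelled and treats Shutdown as a no-op because Run stops via cancellation.
type heartbeat struct {
	interval time.Duration
	logger   *slog.Logger
}

func (h *heartbeat) Run(ctx context.Context) error {
	ticker := time.NewTicker(h.interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// Returning ctx.Err() after the parent context is cancelled is
			// classified as a clean exit by classifyComponentResult.
			return ctx.Err()
		case <-ticker.C:
			h.logger.Info("rtmanager heartbeat")
		}
	}
}

func (h *heartbeat) Shutdown(context.Context) error {
	return nil
}
```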
|
||||
@@ -0,0 +1,137 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"sync/atomic"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/config"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
type fakeComponent struct {
|
||||
runErr error
|
||||
shutdownErr error
|
||||
runHook func(context.Context) error
|
||||
shutdownHook func(context.Context) error
|
||||
runCount atomic.Int32
|
||||
downCount atomic.Int32
|
||||
blockForCtx bool
|
||||
}
|
||||
|
||||
func (component *fakeComponent) Run(ctx context.Context) error {
|
||||
component.runCount.Add(1)
|
||||
if component.runHook != nil {
|
||||
return component.runHook(ctx)
|
||||
}
|
||||
if component.blockForCtx {
|
||||
<-ctx.Done()
|
||||
return ctx.Err()
|
||||
}
|
||||
|
||||
return component.runErr
|
||||
}
|
||||
|
||||
func (component *fakeComponent) Shutdown(ctx context.Context) error {
|
||||
component.downCount.Add(1)
|
||||
if component.shutdownHook != nil {
|
||||
return component.shutdownHook(ctx)
|
||||
}
|
||||
|
||||
return component.shutdownErr
|
||||
}
|
||||
|
||||
func newCfg() config.Config {
|
||||
return config.Config{ShutdownTimeout: time.Second}
|
||||
}
|
||||
|
||||
func TestAppRunWithoutComponentsBlocksUntilContextDone(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
app := New(newCfg())
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
cancel()
|
||||
|
||||
require.NoError(t, app.Run(ctx))
|
||||
}
|
||||
|
||||
func TestAppRunReturnsOnContextCancel(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
component := &fakeComponent{blockForCtx: true}
|
||||
app := New(newCfg(), component)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
go func() {
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
cancel()
|
||||
}()
|
||||
|
||||
require.NoError(t, app.Run(ctx))
|
||||
assert.EqualValues(t, 1, component.runCount.Load())
|
||||
assert.EqualValues(t, 1, component.downCount.Load())
|
||||
}
|
||||
|
||||
func TestAppRunPropagatesComponentFailure(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
failure := errors.New("boom")
|
||||
component := &fakeComponent{runErr: failure}
|
||||
app := New(newCfg(), component)
|
||||
|
||||
err := app.Run(context.Background())
|
||||
require.Error(t, err)
|
||||
require.ErrorIs(t, err, failure)
|
||||
assert.EqualValues(t, 1, component.downCount.Load())
|
||||
}
|
||||
|
||||
func TestAppRunFailsOnNilContext(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
app := New(newCfg())
|
||||
var ctx context.Context
|
||||
require.Error(t, app.Run(ctx))
|
||||
}
|
||||
|
||||
func TestAppRunFailsOnNonPositiveShutdownTimeout(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
app := New(config.Config{}, &fakeComponent{})
|
||||
require.Error(t, app.Run(context.Background()))
|
||||
}
|
||||
|
||||
func TestAppRunFailsOnNilComponent(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
app := New(newCfg(), nil)
|
||||
require.Error(t, app.Run(context.Background()))
|
||||
}
|
||||
|
||||
func TestAppRunFlagsCleanExitBeforeShutdown(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
component := &fakeComponent{}
|
||||
app := New(newCfg(), component)
|
||||
|
||||
err := app.Run(context.Background())
|
||||
require.Error(t, err)
|
||||
require.ErrorContains(t, err, "exited without error")
|
||||
}
|
||||
@@ -0,0 +1,85 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"errors"
|
||||
"fmt"
|
||||
"time"
|
||||
|
||||
"galaxy/redisconn"
|
||||
"galaxy/rtmanager/internal/config"
|
||||
"galaxy/rtmanager/internal/telemetry"
|
||||
|
||||
"github.com/docker/docker/client"
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// newRedisClient builds the master Redis client from cfg via the shared
|
||||
// `pkg/redisconn` helper. Replica clients are not opened in this iteration
|
||||
// per ARCHITECTURE.md §Persistence Backends; they will be wired when read
|
||||
// routing is introduced.
|
||||
func newRedisClient(cfg config.RedisConfig) *redis.Client {
|
||||
return redisconn.NewMasterClient(cfg.Conn)
|
||||
}
|
||||
|
||||
// instrumentRedisClient attaches the OpenTelemetry tracing and metrics
|
||||
// instrumentation to client when telemetryRuntime is available. The
|
||||
// actual instrumentation lives in `pkg/redisconn` so every Galaxy service
|
||||
// shares one surface.
|
||||
func instrumentRedisClient(redisClient *redis.Client, telemetryRuntime *telemetry.Runtime) error {
|
||||
if redisClient == nil {
|
||||
return errors.New("instrument redis client: nil client")
|
||||
}
|
||||
if telemetryRuntime == nil {
|
||||
return nil
|
||||
}
|
||||
return redisconn.Instrument(redisClient,
|
||||
redisconn.WithTracerProvider(telemetryRuntime.TracerProvider()),
|
||||
redisconn.WithMeterProvider(telemetryRuntime.MeterProvider()),
|
||||
)
|
||||
}
|
||||
|
||||
// pingRedis performs a single Redis PING bounded by
|
||||
// cfg.Conn.OperationTimeout to confirm that the configured Redis endpoint
|
||||
// is reachable at startup.
|
||||
func pingRedis(ctx context.Context, cfg config.RedisConfig, redisClient *redis.Client) error {
|
||||
return redisconn.Ping(ctx, redisClient, cfg.Conn.OperationTimeout)
|
||||
}
|
||||
|
||||
// newDockerClient constructs a Docker SDK client for cfg.Host with an
|
||||
// optional API version override. The bootstrap layer opens and pings
|
||||
// the client; the production Docker adapter wraps it for the service
|
||||
// layer.
|
||||
func newDockerClient(cfg config.DockerConfig) (*client.Client, error) {
|
||||
options := []client.Opt{client.WithHost(cfg.Host)}
|
||||
if cfg.APIVersion == "" {
|
||||
options = append(options, client.WithAPIVersionNegotiation())
|
||||
} else {
|
||||
options = append(options, client.WithVersion(cfg.APIVersion))
|
||||
}
|
||||
|
||||
docker, err := client.NewClientWithOpts(options...)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("new docker client: %w", err)
|
||||
}
|
||||
return docker, nil
|
||||
}
|
||||
|
||||
// pingDocker bounds one Docker daemon ping under timeout and returns a
|
||||
// wrapped error so startup failures are easy to spot in service logs.
|
||||
func pingDocker(ctx context.Context, dockerClient *client.Client, timeout time.Duration) error {
|
||||
if dockerClient == nil {
|
||||
return errors.New("ping docker: nil client")
|
||||
}
|
||||
if timeout <= 0 {
|
||||
return errors.New("ping docker: timeout must be positive")
|
||||
}
|
||||
|
||||
pingCtx, cancel := context.WithTimeout(ctx, timeout)
|
||||
defer cancel()
|
||||
|
||||
if _, err := dockerClient.Ping(pingCtx); err != nil {
|
||||
return fmt.Errorf("ping docker: %w", err)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
@@ -0,0 +1,82 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"galaxy/redisconn"
|
||||
"galaxy/rtmanager/internal/config"
|
||||
|
||||
"github.com/alicebob/miniredis/v2"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func newTestRedisCfg(addr string) config.RedisConfig {
|
||||
return config.RedisConfig{
|
||||
Conn: redisconn.Config{
|
||||
MasterAddr: addr,
|
||||
Password: "test",
|
||||
OperationTimeout: time.Second,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
func TestPingRedisSucceedsAgainstMiniredis(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
server := miniredis.RunT(t)
|
||||
|
||||
redisCfg := newTestRedisCfg(server.Addr())
|
||||
client := newRedisClient(redisCfg)
|
||||
t.Cleanup(func() { _ = client.Close() })
|
||||
|
||||
require.NoError(t, pingRedis(context.Background(), redisCfg, client))
|
||||
}
|
||||
|
||||
func TestPingRedisReturnsErrorWhenClosed(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
server := miniredis.RunT(t)
|
||||
|
||||
redisCfg := newTestRedisCfg(server.Addr())
|
||||
client := newRedisClient(redisCfg)
|
||||
require.NoError(t, client.Close())
|
||||
|
||||
require.Error(t, pingRedis(context.Background(), redisCfg, client))
|
||||
}
|
||||
|
||||
func TestNewDockerClientHonoursHostOverride(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
docker, err := newDockerClient(config.DockerConfig{
|
||||
Host: "unix:///var/run/docker.sock",
|
||||
APIVersion: "1.43",
|
||||
Network: "galaxy-net",
|
||||
LogDriver: "json-file",
|
||||
PullPolicy: config.ImagePullPolicyIfMissing,
|
||||
})
|
||||
require.NoError(t, err)
|
||||
require.NotNil(t, docker)
|
||||
require.NoError(t, docker.Close())
|
||||
}
|
||||
|
||||
func TestPingDockerRejectsNilClient(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
require.Error(t, pingDocker(context.Background(), nil, time.Second))
|
||||
}
|
||||
|
||||
func TestPingDockerRejectsNonPositiveTimeout(t *testing.T) {
|
||||
t.Parallel()
|
||||
|
||||
docker, err := newDockerClient(config.DockerConfig{
|
||||
Host: "unix:///var/run/docker.sock",
|
||||
Network: "galaxy-net",
|
||||
LogDriver: "json-file",
|
||||
})
|
||||
require.NoError(t, err)
|
||||
t.Cleanup(func() { _ = docker.Close() })
|
||||
|
||||
require.Error(t, pingDocker(context.Background(), docker, 0))
|
||||
}
|
||||
@@ -0,0 +1,262 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"errors"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/redisconn"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/migrations"
|
||||
"galaxy/rtmanager/internal/api/internalhttp"
|
||||
"galaxy/rtmanager/internal/config"
|
||||
"galaxy/rtmanager/internal/telemetry"
|
||||
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
"github.com/redis/go-redis/v9"
|
||||
)
|
||||
|
||||
// Runtime owns the runnable Runtime Manager process plus the cleanup
|
||||
// functions that release runtime resources after shutdown.
|
||||
type Runtime struct {
|
||||
cfg config.Config
|
||||
|
||||
app *App
|
||||
|
||||
wiring *wiring
|
||||
|
||||
internalServer *internalhttp.Server
|
||||
|
||||
cleanupFns []func() error
|
||||
}
|
||||
|
||||
// NewRuntime constructs the runnable Runtime Manager process from cfg.
|
||||
//
|
||||
// PostgreSQL migrations apply strictly before the internal HTTP listener
|
||||
// becomes ready. The runtime opens one shared `*redis.Client`, one
|
||||
// `*sql.DB`, one Docker SDK client, and one OpenTelemetry runtime; all
|
||||
// are released in reverse construction order on shutdown.
|
||||
func NewRuntime(ctx context.Context, cfg config.Config, logger *slog.Logger) (*Runtime, error) {
|
||||
if ctx == nil {
|
||||
return nil, errors.New("new rtmanager runtime: nil context")
|
||||
}
|
||||
if err := cfg.Validate(); err != nil {
|
||||
return nil, fmt.Errorf("new rtmanager runtime: %w", err)
|
||||
}
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
|
||||
runtime := &Runtime{
|
||||
cfg: cfg,
|
||||
}
|
||||
|
||||
cleanupOnError := func(err error) (*Runtime, error) {
|
||||
if cleanupErr := runtime.Close(); cleanupErr != nil {
|
||||
return nil, fmt.Errorf("%w; cleanup: %w", err, cleanupErr)
|
||||
}
|
||||
|
||||
return nil, err
|
||||
}
|
||||
|
||||
telemetryRuntime, err := telemetry.NewProcess(ctx, telemetry.ProcessConfig{
|
||||
ServiceName: cfg.Telemetry.ServiceName,
|
||||
TracesExporter: cfg.Telemetry.TracesExporter,
|
||||
MetricsExporter: cfg.Telemetry.MetricsExporter,
|
||||
TracesProtocol: cfg.Telemetry.TracesProtocol,
|
||||
MetricsProtocol: cfg.Telemetry.MetricsProtocol,
|
||||
StdoutTracesEnabled: cfg.Telemetry.StdoutTracesEnabled,
|
||||
StdoutMetricsEnabled: cfg.Telemetry.StdoutMetricsEnabled,
|
||||
}, logger)
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: telemetry: %w", err))
|
||||
}
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
|
||||
shutdownCtx, cancel := context.WithTimeout(context.Background(), cfg.ShutdownTimeout)
|
||||
defer cancel()
|
||||
return telemetryRuntime.Shutdown(shutdownCtx)
|
||||
})
|
||||
|
||||
redisClient := newRedisClient(cfg.Redis)
|
||||
if err := instrumentRedisClient(redisClient, telemetryRuntime); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
|
||||
}
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
|
||||
err := redisClient.Close()
|
||||
if errors.Is(err, redis.ErrClosed) {
|
||||
return nil
|
||||
}
|
||||
return err
|
||||
})
|
||||
if err := pingRedis(ctx, cfg.Redis, redisClient); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
|
||||
}
|
||||
|
||||
pgPool, err := postgres.OpenPrimary(ctx, cfg.Postgres.Conn,
|
||||
postgres.WithTracerProvider(telemetryRuntime.TracerProvider()),
|
||||
postgres.WithMeterProvider(telemetryRuntime.MeterProvider()),
|
||||
)
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: open postgres: %w", err))
|
||||
}
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, pgPool.Close)
|
||||
unregisterPGStats, err := postgres.InstrumentDBStats(pgPool,
|
||||
postgres.WithMeterProvider(telemetryRuntime.MeterProvider()),
|
||||
)
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: instrument postgres: %w", err))
|
||||
}
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, func() error {
|
||||
return unregisterPGStats()
|
||||
})
|
||||
if err := postgres.Ping(ctx, pgPool, cfg.Postgres.Conn.OperationTimeout); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: ping postgres: %w", err))
|
||||
}
|
||||
if err := postgres.RunMigrations(ctx, pgPool, migrations.FS(), "."); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: run postgres migrations: %w", err))
|
||||
}
|
||||
|
||||
dockerClient, err := newDockerClient(cfg.Docker)
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
|
||||
}
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, dockerClient.Close)
|
||||
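	// The Docker ping reuses the PostgreSQL operation timeout; DockerConfig
	// does not define a dedicated ping timeout (see also newReadinessProbe).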
if err := pingDocker(ctx, dockerClient, cfg.Postgres.Conn.OperationTimeout); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: %w", err))
|
||||
}
|
||||
|
||||
wiring, err := newWiring(cfg, redisClient, pgPool, dockerClient, time.Now, logger, telemetryRuntime)
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: wiring: %w", err))
|
||||
}
|
||||
runtime.wiring = wiring
|
||||
runtime.cleanupFns = append(runtime.cleanupFns, wiring.close)
|
||||
if err := wiring.registerTelemetryGauges(); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: register telemetry gauges: %w", err))
|
||||
}
|
||||
|
||||
if err := wiring.reconciler.ReconcileNow(ctx); err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: initial reconcile: %w", err))
|
||||
}
|
||||
|
||||
probe := newReadinessProbe(pgPool, redisClient, dockerClient, cfg)
|
||||
|
||||
internalServer, err := internalhttp.NewServer(internalhttp.Config{
|
||||
Addr: cfg.InternalHTTP.Addr,
|
||||
ReadHeaderTimeout: cfg.InternalHTTP.ReadHeaderTimeout,
|
||||
ReadTimeout: cfg.InternalHTTP.ReadTimeout,
|
||||
WriteTimeout: cfg.InternalHTTP.WriteTimeout,
|
||||
IdleTimeout: cfg.InternalHTTP.IdleTimeout,
|
||||
}, internalhttp.Dependencies{
|
||||
Logger: logger,
|
||||
Telemetry: telemetryRuntime,
|
||||
Readiness: probe,
|
||||
RuntimeRecords: wiring.runtimeRecordStore,
|
||||
StartRuntime: wiring.startRuntimeService,
|
||||
StopRuntime: wiring.stopRuntimeService,
|
||||
RestartRuntime: wiring.restartRuntimeService,
|
||||
PatchRuntime: wiring.patchRuntimeService,
|
||||
CleanupContainer: wiring.cleanupContainerService,
|
||||
})
|
||||
if err != nil {
|
||||
return cleanupOnError(fmt.Errorf("new rtmanager runtime: internal HTTP server: %w", err))
|
||||
}
|
||||
runtime.internalServer = internalServer
|
||||
|
||||
runtime.app = New(cfg,
|
||||
internalServer,
|
||||
wiring.startJobsConsumer,
|
||||
wiring.stopJobsConsumer,
|
||||
wiring.dockerEventsListener,
|
||||
wiring.healthProbeWorker,
|
||||
wiring.dockerInspectWorker,
|
||||
wiring.reconciler,
|
||||
wiring.containerCleanupWorker,
|
||||
)
|
||||
|
||||
return runtime, nil
|
||||
}
|
||||
|
||||
// InternalServer returns the internal HTTP server owned by runtime. It is
|
||||
// primarily exposed for tests; production code should not depend on it.
|
||||
func (runtime *Runtime) InternalServer() *internalhttp.Server {
|
||||
if runtime == nil {
|
||||
return nil
|
||||
}
|
||||
|
||||
return runtime.internalServer
|
||||
}
|
||||
|
||||
// Run serves the internal HTTP listener until ctx is canceled or one
|
||||
// component fails.
|
||||
func (runtime *Runtime) Run(ctx context.Context) error {
|
||||
if ctx == nil {
|
||||
return errors.New("run rtmanager runtime: nil context")
|
||||
}
|
||||
if runtime == nil {
|
||||
return errors.New("run rtmanager runtime: nil runtime")
|
||||
}
|
||||
if runtime.app == nil {
|
||||
return errors.New("run rtmanager runtime: nil app")
|
||||
}
|
||||
|
||||
return runtime.app.Run(ctx)
|
||||
}
|
||||
|
||||
// Close releases every runtime dependency in reverse construction order.
|
||||
// Close is safe to call multiple times.
|
||||
func (runtime *Runtime) Close() error {
|
||||
if runtime == nil {
|
||||
return nil
|
||||
}
|
||||
|
||||
var joined error
|
||||
for index := len(runtime.cleanupFns) - 1; index >= 0; index-- {
|
||||
if err := runtime.cleanupFns[index](); err != nil {
|
||||
joined = errors.Join(joined, err)
|
||||
}
|
||||
}
|
||||
runtime.cleanupFns = nil
|
||||
|
||||
return joined
|
||||
}
|
||||
|
||||
// readinessProbe pings every steady-state dependency the listener
// guards: the PostgreSQL primary, the Redis master, and the Docker
// daemon.
|
||||
type readinessProbe struct {
|
||||
pgPool *sql.DB
|
||||
redisClient *redis.Client
|
||||
dockerClient *dockerclient.Client
|
||||
|
||||
postgresTimeout time.Duration
|
||||
redisTimeout time.Duration
|
||||
dockerTimeout time.Duration
|
||||
}
|
||||
|
||||
func newReadinessProbe(pgPool *sql.DB, redisClient *redis.Client, dockerClient *dockerclient.Client, cfg config.Config) *readinessProbe {
|
||||
return &readinessProbe{
|
||||
pgPool: pgPool,
|
||||
redisClient: redisClient,
|
||||
dockerClient: dockerClient,
|
||||
postgresTimeout: cfg.Postgres.Conn.OperationTimeout,
|
||||
redisTimeout: cfg.Redis.Conn.OperationTimeout,
|
||||
dockerTimeout: cfg.Postgres.Conn.OperationTimeout,
|
||||
}
|
||||
}
|
||||
|
||||
// Check pings PostgreSQL, Redis, and Docker. The first failing
|
||||
// dependency aborts the check so callers see a single, actionable
|
||||
// error.
|
||||
func (probe *readinessProbe) Check(ctx context.Context) error {
|
||||
if err := postgres.Ping(ctx, probe.pgPool, probe.postgresTimeout); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := redisconn.Ping(ctx, probe.redisClient, probe.redisTimeout); err != nil {
|
||||
return err
|
||||
}
|
||||
return pingDocker(ctx, probe.dockerClient, probe.dockerTimeout)
|
||||
}
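
Pulled together, the intended entry point for the process is small: load configuration, build the runtime, run it on a signal-aware context, and always call `Close`. A minimal sketch under stated assumptions — `config.Load` is a hypothetical loader name (the config package reads `RTMANAGER_*` environment variables), and the real `cmd/` main may differ:

```go
package main

import (
	"context"
	"log/slog"
	"os"
	"os/signal"
	"syscall"

	"galaxy/rtmanager/internal/app"
	"galaxy/rtmanager/internal/config"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	cfg, err := config.Load() // hypothetical: whatever loader the config package exposes
	if err != nil {
		logger.Error("rtmanager: load config", "err", err)
		os.Exit(1)
	}

	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	runtime, err := app.NewRuntime(ctx, cfg, logger)
	if err != nil {
		logger.Error("rtmanager: build runtime", "err", err)
		os.Exit(1)
	}
	defer func() { _ = runtime.Close() }()

	if err := runtime.Run(ctx); err != nil {
		logger.Error("rtmanager: run", "err", err)
		os.Exit(1)
	}
}
```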
|
||||
@@ -0,0 +1,541 @@
|
||||
package app
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"errors"
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"net/http"
|
||||
"time"
|
||||
|
||||
"galaxy/rtmanager/internal/adapters/docker"
|
||||
"galaxy/rtmanager/internal/adapters/healtheventspublisher"
|
||||
"galaxy/rtmanager/internal/adapters/jobresultspublisher"
|
||||
"galaxy/rtmanager/internal/adapters/lobbyclient"
|
||||
"galaxy/rtmanager/internal/adapters/notificationpublisher"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/healthsnapshotstore"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/operationlogstore"
|
||||
"galaxy/rtmanager/internal/adapters/postgres/runtimerecordstore"
|
||||
"galaxy/rtmanager/internal/adapters/redisstate/gamelease"
|
||||
"galaxy/rtmanager/internal/adapters/redisstate/streamoffsets"
|
||||
"galaxy/rtmanager/internal/config"
|
||||
"galaxy/rtmanager/internal/ports"
|
||||
"galaxy/rtmanager/internal/service/cleanupcontainer"
|
||||
"galaxy/rtmanager/internal/service/patchruntime"
|
||||
"galaxy/rtmanager/internal/service/restartruntime"
|
||||
"galaxy/rtmanager/internal/service/startruntime"
|
||||
"galaxy/rtmanager/internal/service/stopruntime"
|
||||
"galaxy/rtmanager/internal/telemetry"
|
||||
"galaxy/rtmanager/internal/worker/containercleanup"
|
||||
"galaxy/rtmanager/internal/worker/dockerevents"
|
||||
"galaxy/rtmanager/internal/worker/dockerinspect"
|
||||
"galaxy/rtmanager/internal/worker/healthprobe"
|
||||
"galaxy/rtmanager/internal/worker/reconcile"
|
||||
"galaxy/rtmanager/internal/worker/startjobsconsumer"
|
||||
"galaxy/rtmanager/internal/worker/stopjobsconsumer"
|
||||
|
||||
dockerclient "github.com/docker/docker/client"
|
||||
"github.com/redis/go-redis/v9"
|
||||
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
|
||||
)
|
||||
|
||||
// wiring owns the process-level singletons constructed once during
|
||||
// `NewRuntime` and consumed by every worker and HTTP handler.
|
||||
//
|
||||
// The struct keeps the store / adapter / service singletons as typed
// fields so the later wiring stages and `runtime.go` can reference them
// without re-resolving dependencies.
|
||||
type wiring struct {
|
||||
cfg config.Config
|
||||
|
||||
redisClient *redis.Client
|
||||
pgPool *sql.DB
|
||||
dockerClient *dockerclient.Client
|
||||
|
||||
clock func() time.Time
|
||||
|
||||
logger *slog.Logger
|
||||
telemetry *telemetry.Runtime
|
||||
|
||||
// Persistence stores.
|
||||
runtimeRecordStore *runtimerecordstore.Store
|
||||
operationLogStore *operationlogstore.Store
|
||||
healthSnapshotStore *healthsnapshotstore.Store
|
||||
streamOffsetStore *streamoffsets.Store
|
||||
gameLeaseStore *gamelease.Store
|
||||
|
||||
// External adapters.
|
||||
dockerAdapter *docker.Client
|
||||
lobbyClient *lobbyclient.Client
|
||||
notificationPublisher *notificationpublisher.Publisher
|
||||
healthEventsPublisher *healtheventspublisher.Publisher
|
||||
jobResultsPublisher *jobresultspublisher.Publisher
|
||||
|
||||
// Service layer.
|
||||
startRuntimeService *startruntime.Service
|
||||
stopRuntimeService *stopruntime.Service
|
||||
restartRuntimeService *restartruntime.Service
|
||||
patchRuntimeService *patchruntime.Service
|
||||
cleanupContainerService *cleanupcontainer.Service
|
||||
|
||||
// Worker layer.
|
||||
startJobsConsumer *startjobsconsumer.Consumer
|
||||
stopJobsConsumer *stopjobsconsumer.Consumer
|
||||
dockerEventsListener *dockerevents.Listener
|
||||
healthProbeWorker *healthprobe.Worker
|
||||
dockerInspectWorker *dockerinspect.Worker
|
||||
reconciler *reconcile.Reconciler
|
||||
containerCleanupWorker *containercleanup.Worker
|
||||
|
||||
// closers releases adapter-level resources at runtime shutdown.
|
||||
closers []func() error
|
||||
}
|
||||
|
||||
// newWiring constructs the process-level dependency set, the persistence
|
||||
// stores, the external adapters, and the service layer. It validates
|
||||
// every required collaborator so callers can rely on them being non-nil.
|
||||
func newWiring(
|
||||
cfg config.Config,
|
||||
redisClient *redis.Client,
|
||||
pgPool *sql.DB,
|
||||
dockerClient *dockerclient.Client,
|
||||
clock func() time.Time,
|
||||
logger *slog.Logger,
|
||||
telemetryRuntime *telemetry.Runtime,
|
||||
) (*wiring, error) {
|
||||
if redisClient == nil {
|
||||
return nil, errors.New("new rtmanager wiring: nil redis client")
|
||||
}
|
||||
if pgPool == nil {
|
||||
return nil, errors.New("new rtmanager wiring: nil postgres pool")
|
||||
}
|
||||
if dockerClient == nil {
|
||||
return nil, errors.New("new rtmanager wiring: nil docker client")
|
||||
}
|
||||
if clock == nil {
|
||||
clock = time.Now
|
||||
}
|
||||
if logger == nil {
|
||||
logger = slog.Default()
|
||||
}
|
||||
if telemetryRuntime == nil {
|
||||
return nil, fmt.Errorf("new rtmanager wiring: nil telemetry runtime")
|
||||
}
|
||||
|
||||
w := &wiring{
|
||||
cfg: cfg,
|
||||
redisClient: redisClient,
|
||||
pgPool: pgPool,
|
||||
dockerClient: dockerClient,
|
||||
clock: clock,
|
||||
logger: logger,
|
||||
telemetry: telemetryRuntime,
|
||||
}
|
||||
|
||||
if err := w.buildPersistence(); err != nil {
|
||||
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
|
||||
}
|
||||
if err := w.buildAdapters(); err != nil {
|
||||
_ = w.close()
|
||||
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
|
||||
}
|
||||
if err := w.buildServices(); err != nil {
|
||||
_ = w.close()
|
||||
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
|
||||
}
|
||||
if err := w.buildWorkers(); err != nil {
|
||||
_ = w.close()
|
||||
return nil, fmt.Errorf("new rtmanager wiring: %w", err)
|
||||
}
|
||||
return w, nil
|
||||
}
|
||||
|
||||
func (w *wiring) buildPersistence() error {
|
||||
runtimeStore, err := runtimerecordstore.New(runtimerecordstore.Config{
|
||||
DB: w.pgPool,
|
||||
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("runtime record store: %w", err)
|
||||
}
|
||||
w.runtimeRecordStore = runtimeStore
|
||||
|
||||
operationStore, err := operationlogstore.New(operationlogstore.Config{
|
||||
DB: w.pgPool,
|
||||
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("operation log store: %w", err)
|
||||
}
|
||||
w.operationLogStore = operationStore
|
||||
|
||||
snapshotStore, err := healthsnapshotstore.New(healthsnapshotstore.Config{
|
||||
DB: w.pgPool,
|
||||
OperationTimeout: w.cfg.Postgres.Conn.OperationTimeout,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("health snapshot store: %w", err)
|
||||
}
|
||||
w.healthSnapshotStore = snapshotStore
|
||||
|
||||
offsetStore, err := streamoffsets.New(streamoffsets.Config{Client: w.redisClient})
|
||||
if err != nil {
|
||||
return fmt.Errorf("stream offset store: %w", err)
|
||||
}
|
||||
w.streamOffsetStore = offsetStore
|
||||
|
||||
leaseStore, err := gamelease.New(gamelease.Config{Client: w.redisClient})
|
||||
if err != nil {
|
||||
return fmt.Errorf("game lease store: %w", err)
|
||||
}
|
||||
w.gameLeaseStore = leaseStore
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func (w *wiring) buildAdapters() error {
|
||||
dockerAdapter, err := docker.NewClient(docker.Config{
|
||||
Docker: w.dockerClient,
|
||||
LogDriver: w.cfg.Docker.LogDriver,
|
||||
LogOpts: w.cfg.Docker.LogOpts,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("docker adapter: %w", err)
|
||||
}
|
||||
w.dockerAdapter = dockerAdapter
|
||||
|
||||
lobby, err := lobbyclient.NewClient(lobbyclient.Config{
|
||||
BaseURL: w.cfg.Lobby.BaseURL,
|
||||
RequestTimeout: w.cfg.Lobby.Timeout,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("lobby client: %w", err)
|
||||
}
|
||||
w.lobbyClient = lobby
|
||||
w.closers = append(w.closers, lobby.Close)
|
||||
|
||||
notificationPub, err := notificationpublisher.NewPublisher(notificationpublisher.Config{
|
||||
Client: w.redisClient,
|
||||
Stream: w.cfg.Streams.NotificationIntents,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("notification publisher: %w", err)
|
||||
}
|
||||
w.notificationPublisher = notificationPub
|
||||
|
||||
healthPub, err := healtheventspublisher.NewPublisher(healtheventspublisher.Config{
|
||||
Client: w.redisClient,
|
||||
Snapshots: w.healthSnapshotStore,
|
||||
Stream: w.cfg.Streams.HealthEvents,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("health events publisher: %w", err)
|
||||
}
|
||||
w.healthEventsPublisher = healthPub
|
||||
|
||||
jobResultsPub, err := jobresultspublisher.NewPublisher(jobresultspublisher.Config{
|
||||
Client: w.redisClient,
|
||||
Stream: w.cfg.Streams.JobResults,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("job results publisher: %w", err)
|
||||
}
|
||||
w.jobResultsPublisher = jobResultsPub
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func (w *wiring) buildServices() error {
|
||||
startService, err := startruntime.NewService(startruntime.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
Docker: w.dockerAdapter,
|
||||
Leases: w.gameLeaseStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
Notifications: w.notificationPublisher,
|
||||
Lobby: w.lobbyClient,
|
||||
Container: w.cfg.Container,
|
||||
DockerCfg: w.cfg.Docker,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Telemetry: w.telemetry,
|
||||
Logger: w.logger,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("start runtime service: %w", err)
|
||||
}
|
||||
w.startRuntimeService = startService
|
||||
|
||||
stopService, err := stopruntime.NewService(stopruntime.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
Docker: w.dockerAdapter,
|
||||
Leases: w.gameLeaseStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
Container: w.cfg.Container,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Telemetry: w.telemetry,
|
||||
Logger: w.logger,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("stop runtime service: %w", err)
|
||||
}
|
||||
w.stopRuntimeService = stopService
|
||||
|
||||
restartService, err := restartruntime.NewService(restartruntime.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
Docker: w.dockerAdapter,
|
||||
Leases: w.gameLeaseStore,
|
||||
StopService: stopService,
|
||||
StartService: startService,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Telemetry: w.telemetry,
|
||||
Logger: w.logger,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("restart runtime service: %w", err)
|
||||
}
|
||||
w.restartRuntimeService = restartService
|
||||
|
||||
patchService, err := patchruntime.NewService(patchruntime.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
Docker: w.dockerAdapter,
|
||||
Leases: w.gameLeaseStore,
|
||||
StopService: stopService,
|
||||
StartService: startService,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Telemetry: w.telemetry,
|
||||
Logger: w.logger,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("patch runtime service: %w", err)
|
||||
}
|
||||
w.patchRuntimeService = patchService
|
||||
|
||||
cleanupService, err := cleanupcontainer.NewService(cleanupcontainer.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
Docker: w.dockerAdapter,
|
||||
Leases: w.gameLeaseStore,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Telemetry: w.telemetry,
|
||||
Logger: w.logger,
|
||||
Clock: w.clock,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("cleanup container service: %w", err)
|
||||
}
|
||||
w.cleanupContainerService = cleanupService
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// buildWorkers constructs the asynchronous workers: the Lobby ↔ RTM
// start/stop stream consumers, the Docker events listener, the health
// probe and inspect workers, the reconciler, and the container cleanup
// worker. Every worker participates in the process lifecycle as an
// `app.Component`; `internal/app/runtime.go` passes them into `app.New`
// alongside the internal HTTP server.
|
||||
func (w *wiring) buildWorkers() error {
|
||||
startConsumer, err := startjobsconsumer.NewConsumer(startjobsconsumer.Config{
|
||||
Client: w.redisClient,
|
||||
Stream: w.cfg.Streams.StartJobs,
|
||||
BlockTimeout: w.cfg.Streams.BlockTimeout,
|
||||
StartService: w.startRuntimeService,
|
||||
JobResults: w.jobResultsPublisher,
|
||||
OffsetStore: w.streamOffsetStore,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("start jobs consumer: %w", err)
|
||||
}
|
||||
w.startJobsConsumer = startConsumer
|
||||
|
||||
stopConsumer, err := stopjobsconsumer.NewConsumer(stopjobsconsumer.Config{
|
||||
Client: w.redisClient,
|
||||
Stream: w.cfg.Streams.StopJobs,
|
||||
BlockTimeout: w.cfg.Streams.BlockTimeout,
|
||||
StopService: w.stopRuntimeService,
|
||||
JobResults: w.jobResultsPublisher,
|
||||
OffsetStore: w.streamOffsetStore,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("stop jobs consumer: %w", err)
|
||||
}
|
||||
w.stopJobsConsumer = stopConsumer
|
||||
|
||||
eventsListener, err := dockerevents.NewListener(dockerevents.Dependencies{
|
||||
Docker: w.dockerAdapter,
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
Telemetry: w.telemetry,
|
||||
Clock: w.clock,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("docker events listener: %w", err)
|
||||
}
|
||||
w.dockerEventsListener = eventsListener
|
||||
|
||||
probeHTTPClient, err := newProbeHTTPClient(w.telemetry)
|
||||
if err != nil {
|
||||
return fmt.Errorf("health probe http client: %w", err)
|
||||
}
|
||||
probeWorker, err := healthprobe.NewWorker(healthprobe.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
HTTPClient: probeHTTPClient,
|
||||
Telemetry: w.telemetry,
|
||||
Interval: w.cfg.Health.ProbeInterval,
|
||||
ProbeTimeout: w.cfg.Health.ProbeTimeout,
|
||||
FailuresThreshold: w.cfg.Health.ProbeFailuresThreshold,
|
||||
Clock: w.clock,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("health probe worker: %w", err)
|
||||
}
|
||||
w.healthProbeWorker = probeWorker
|
||||
|
||||
inspectWorker, err := dockerinspect.NewWorker(dockerinspect.Dependencies{
|
||||
Docker: w.dockerAdapter,
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
Telemetry: w.telemetry,
|
||||
Interval: w.cfg.Health.InspectInterval,
|
||||
Clock: w.clock,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("docker inspect worker: %w", err)
|
||||
}
|
||||
w.dockerInspectWorker = inspectWorker
|
||||
|
||||
reconciler, err := reconcile.NewReconciler(reconcile.Dependencies{
|
||||
Docker: w.dockerAdapter,
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
OperationLogs: w.operationLogStore,
|
||||
HealthEvents: w.healthEventsPublisher,
|
||||
Leases: w.gameLeaseStore,
|
||||
Telemetry: w.telemetry,
|
||||
DockerCfg: w.cfg.Docker,
|
||||
ContainerCfg: w.cfg.Container,
|
||||
Coordination: w.cfg.Coordination,
|
||||
Interval: w.cfg.Cleanup.ReconcileInterval,
|
||||
Clock: w.clock,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("reconciler: %w", err)
|
||||
}
|
||||
w.reconciler = reconciler
|
||||
|
||||
cleanupWorker, err := containercleanup.NewWorker(containercleanup.Dependencies{
|
||||
RuntimeRecords: w.runtimeRecordStore,
|
||||
Cleanup: w.cleanupContainerService,
|
||||
Retention: w.cfg.Container.Retention,
|
||||
Interval: w.cfg.Cleanup.CleanupInterval,
|
||||
Clock: w.clock,
|
||||
Logger: w.logger,
|
||||
})
|
||||
if err != nil {
|
||||
return fmt.Errorf("container cleanup worker: %w", err)
|
||||
}
|
||||
w.containerCleanupWorker = cleanupWorker
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// newProbeHTTPClient constructs the otelhttp-instrumented HTTP client
|
||||
// the active health probe uses to call engine `/healthz`. It clones
// http.DefaultTransport so the probe client never mutates or shares the
// process-wide default transport (mirrors the lobby internal client).
|
||||
func newProbeHTTPClient(telemetryRuntime *telemetry.Runtime) (*http.Client, error) {
|
||||
transport, ok := http.DefaultTransport.(*http.Transport)
|
||||
if !ok {
|
||||
return nil, errors.New("default http transport is not *http.Transport")
|
||||
}
|
||||
cloned := transport.Clone()
|
||||
instrumented := otelhttp.NewTransport(cloned,
|
||||
otelhttp.WithTracerProvider(telemetryRuntime.TracerProvider()),
|
||||
otelhttp.WithMeterProvider(telemetryRuntime.MeterProvider()),
|
||||
)
|
||||
return &http.Client{Transport: instrumented}, nil
|
||||
}
|
||||
|
||||
// registerTelemetryGauges installs the runtime-records-by-status gauge
|
||||
// callback so the telemetry runtime can observe the persistent store
|
||||
// without holding a strong reference to the wiring.
|
||||
func (w *wiring) registerTelemetryGauges() error {
|
||||
probe := newRuntimeRecordsProbe(w.runtimeRecordStore)
|
||||
return w.telemetry.RegisterGauges(telemetry.GaugeDependencies{
|
||||
RuntimeRecordsByStatus: probe,
|
||||
Logger: w.logger,
|
||||
})
|
||||
}
|
||||
|
||||
// close releases adapter-level resources owned by the wiring layer.
|
||||
// Returns the joined error of every closer; the caller is expected to
|
||||
// invoke this once during process shutdown.
|
||||
func (w *wiring) close() error {
|
||||
var joined error
|
||||
for index := len(w.closers) - 1; index >= 0; index-- {
|
||||
if err := w.closers[index](); err != nil {
|
||||
joined = errors.Join(joined, err)
|
||||
}
|
||||
}
|
||||
w.closers = nil
|
||||
return joined
|
||||
}
|
||||
|
||||
// runtimeRecordsProbe adapts runtimerecordstore.Store to
|
||||
// telemetry.RuntimeRecordsByStatusProbe by translating the typed status
|
||||
// keys into the string keys the gauge expects.
|
||||
type runtimeRecordsProbe struct {
|
||||
store *runtimerecordstore.Store
|
||||
}
|
||||
|
||||
func newRuntimeRecordsProbe(store *runtimerecordstore.Store) *runtimeRecordsProbe {
|
||||
return &runtimeRecordsProbe{store: store}
|
||||
}
|
||||
|
||||
func (p *runtimeRecordsProbe) CountByStatus(ctx context.Context) (map[string]int, error) {
|
||||
if p == nil || p.store == nil {
|
||||
return nil, errors.New("runtime records probe: nil store")
|
||||
}
|
||||
counts, err := p.store.CountByStatus(ctx)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
out := make(map[string]int, len(counts))
|
||||
for status, count := range counts {
|
||||
out[string(status)] = count
|
||||
}
|
||||
return out, nil
|
||||
}
|
||||
|
||||
// Compile-time assertions that the constructed adapters satisfy the
|
||||
// expected port surfaces; these prevent silent regressions when a
|
||||
// port shape changes.
|
||||
var (
|
||||
_ ports.RuntimeRecordStore = (*runtimerecordstore.Store)(nil)
|
||||
_ ports.OperationLogStore = (*operationlogstore.Store)(nil)
|
||||
_ ports.HealthSnapshotStore = (*healthsnapshotstore.Store)(nil)
|
||||
_ ports.StreamOffsetStore = (*streamoffsets.Store)(nil)
|
||||
_ ports.GameLeaseStore = (*gamelease.Store)(nil)
|
||||
_ ports.DockerClient = (*docker.Client)(nil)
|
||||
_ ports.LobbyInternalClient = (*lobbyclient.Client)(nil)
|
||||
_ ports.NotificationIntentPublisher = (*notificationpublisher.Publisher)(nil)
|
||||
_ ports.HealthEventPublisher = (*healtheventspublisher.Publisher)(nil)
|
||||
_ ports.JobResultPublisher = (*jobresultspublisher.Publisher)(nil)
|
||||
|
||||
_ Component = (*reconcile.Reconciler)(nil)
|
||||
_ Component = (*containercleanup.Worker)(nil)
|
||||
_ containercleanup.Cleaner = (*cleanupcontainer.Service)(nil)
|
||||
)
|
||||
|
||||
@@ -0,0 +1,632 @@
|
||||
// Package config loads the Runtime Manager process configuration from
|
||||
// environment variables.
|
||||
package config
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/redisconn"
|
||||
"galaxy/rtmanager/internal/telemetry"
|
||||
)
|
||||
|
||||
const (
|
||||
envPrefix = "RTMANAGER"
|
||||
|
||||
shutdownTimeoutEnvVar = "RTMANAGER_SHUTDOWN_TIMEOUT"
|
||||
logLevelEnvVar = "RTMANAGER_LOG_LEVEL"
|
||||
|
||||
internalHTTPAddrEnvVar = "RTMANAGER_INTERNAL_HTTP_ADDR"
|
||||
internalHTTPReadHeaderTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_READ_HEADER_TIMEOUT"
|
||||
internalHTTPReadTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_READ_TIMEOUT"
|
||||
internalHTTPWriteTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_WRITE_TIMEOUT"
|
||||
internalHTTPIdleTimeoutEnvVar = "RTMANAGER_INTERNAL_HTTP_IDLE_TIMEOUT"
|
||||
|
||||
dockerHostEnvVar = "RTMANAGER_DOCKER_HOST"
|
||||
dockerAPIVersionEnvVar = "RTMANAGER_DOCKER_API_VERSION"
|
||||
dockerNetworkEnvVar = "RTMANAGER_DOCKER_NETWORK"
|
||||
dockerLogDriverEnvVar = "RTMANAGER_DOCKER_LOG_DRIVER"
|
||||
dockerLogOptsEnvVar = "RTMANAGER_DOCKER_LOG_OPTS"
|
||||
imagePullPolicyEnvVar = "RTMANAGER_IMAGE_PULL_POLICY"
|
||||
|
||||
defaultCPUQuotaEnvVar = "RTMANAGER_DEFAULT_CPU_QUOTA"
|
||||
defaultMemoryEnvVar = "RTMANAGER_DEFAULT_MEMORY"
|
||||
defaultPIDsLimitEnvVar = "RTMANAGER_DEFAULT_PIDS_LIMIT"
|
||||
containerStopTimeoutSecondsEnvVar = "RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS"
|
||||
containerRetentionDaysEnvVar = "RTMANAGER_CONTAINER_RETENTION_DAYS"
|
||||
engineStateMountPathEnvVar = "RTMANAGER_ENGINE_STATE_MOUNT_PATH"
|
||||
engineStateEnvNameEnvVar = "RTMANAGER_ENGINE_STATE_ENV_NAME"
|
||||
gameStateDirModeEnvVar = "RTMANAGER_GAME_STATE_DIR_MODE"
|
||||
gameStateOwnerUIDEnvVar = "RTMANAGER_GAME_STATE_OWNER_UID"
|
||||
gameStateOwnerGIDEnvVar = "RTMANAGER_GAME_STATE_OWNER_GID"
|
||||
gameStateRootEnvVar = "RTMANAGER_GAME_STATE_ROOT"
|
||||
|
||||
startJobsStreamEnvVar = "RTMANAGER_REDIS_START_JOBS_STREAM"
|
||||
stopJobsStreamEnvVar = "RTMANAGER_REDIS_STOP_JOBS_STREAM"
|
||||
jobResultsStreamEnvVar = "RTMANAGER_REDIS_JOB_RESULTS_STREAM"
|
||||
healthEventsStreamEnvVar = "RTMANAGER_REDIS_HEALTH_EVENTS_STREAM"
|
||||
notificationIntentsStreamEnv = "RTMANAGER_NOTIFICATION_INTENTS_STREAM"
|
||||
streamBlockTimeoutEnvVar = "RTMANAGER_STREAM_BLOCK_TIMEOUT"
|
||||
|
||||
inspectIntervalEnvVar = "RTMANAGER_INSPECT_INTERVAL"
|
||||
probeIntervalEnvVar = "RTMANAGER_PROBE_INTERVAL"
|
||||
probeTimeoutEnvVar = "RTMANAGER_PROBE_TIMEOUT"
|
||||
probeFailuresThresholdEnvVar = "RTMANAGER_PROBE_FAILURES_THRESHOLD"
|
||||
|
||||
reconcileIntervalEnvVar = "RTMANAGER_RECONCILE_INTERVAL"
|
||||
cleanupIntervalEnvVar = "RTMANAGER_CLEANUP_INTERVAL"
|
||||
|
||||
gameLeaseTTLSecondsEnvVar = "RTMANAGER_GAME_LEASE_TTL_SECONDS"
|
||||
|
||||
lobbyInternalBaseURLEnvVar = "RTMANAGER_LOBBY_INTERNAL_BASE_URL"
|
||||
lobbyInternalTimeoutEnvVar = "RTMANAGER_LOBBY_INTERNAL_TIMEOUT"
|
||||
|
||||
otelServiceNameEnvVar = "OTEL_SERVICE_NAME"
|
||||
otelTracesExporterEnvVar = "OTEL_TRACES_EXPORTER"
|
||||
otelMetricsExporterEnvVar = "OTEL_METRICS_EXPORTER"
|
||||
otelExporterOTLPProtocolEnvVar = "OTEL_EXPORTER_OTLP_PROTOCOL"
|
||||
otelExporterOTLPTracesProtocolEnvVar = "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL"
|
||||
otelExporterOTLPMetricsProtocolEnvVar = "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL"
|
||||
otelStdoutTracesEnabledEnvVar = "RTMANAGER_OTEL_STDOUT_TRACES_ENABLED"
|
||||
otelStdoutMetricsEnabledEnvVar = "RTMANAGER_OTEL_STDOUT_METRICS_ENABLED"
|
||||
|
||||
defaultShutdownTimeout = 30 * time.Second
|
||||
defaultLogLevel = "info"
|
||||
defaultInternalHTTPAddr = ":8096"
|
||||
defaultReadHeaderTimeout = 2 * time.Second
|
||||
defaultReadTimeout = 5 * time.Second
|
||||
defaultWriteTimeout = 15 * time.Second
|
||||
defaultIdleTimeout = 60 * time.Second
|
||||
|
||||
defaultDockerHost = "unix:///var/run/docker.sock"
|
||||
defaultDockerNetwork = "galaxy-net"
|
||||
defaultDockerLogDriver = "json-file"
|
||||
defaultImagePullPolicy = ImagePullPolicyIfMissing
|
||||
|
||||
defaultCPUQuota = 1.0
|
||||
defaultMemory = "512m"
|
||||
defaultPIDsLimit = 512
|
||||
defaultContainerStopTimeout = 30 * time.Second
|
||||
defaultContainerRetention = 30 * 24 * time.Hour
|
||||
defaultEngineStateMountPath = "/var/lib/galaxy-game"
|
||||
defaultEngineStateEnvName = "GAME_STATE_PATH"
|
||||
defaultGameStateDirMode = 0o750
|
||||
|
||||
defaultStartJobsStream = "runtime:start_jobs"
|
||||
defaultStopJobsStream = "runtime:stop_jobs"
|
||||
defaultJobResultsStream = "runtime:job_results"
|
||||
defaultHealthEventsStream = "runtime:health_events"
|
||||
defaultNotificationIntentsKey = "notification:intents"
|
||||
defaultStreamBlockTimeout = 5 * time.Second
|
||||
|
||||
defaultInspectInterval = 30 * time.Second
|
||||
defaultProbeInterval = 15 * time.Second
|
||||
defaultProbeTimeout = 2 * time.Second
|
||||
defaultProbeFailuresThreshold = 3
|
||||
|
||||
defaultReconcileInterval = 5 * time.Minute
|
||||
defaultCleanupInterval = time.Hour
|
||||
|
||||
defaultGameLeaseTTL = 60 * time.Second
|
||||
|
||||
defaultLobbyInternalTimeout = 2 * time.Second
|
||||
|
||||
defaultOTelServiceName = "galaxy-rtmanager"
|
||||
)
|
||||
|
||||
// ImagePullPolicy enumerates the supported image pull policies. The start
|
||||
// service validates a producer-supplied `image_ref` against this policy at
|
||||
// start time.
|
||||
type ImagePullPolicy string
|
||||
|
||||
// Supported pull policies, frozen by `rtmanager/README.md` §Configuration.
|
||||
const (
|
||||
ImagePullPolicyIfMissing ImagePullPolicy = "if_missing"
|
||||
ImagePullPolicyAlways ImagePullPolicy = "always"
|
||||
ImagePullPolicyNever ImagePullPolicy = "never"
|
||||
)
|
||||
|
||||
// Validate reports whether p is one of the frozen pull policies.
|
||||
func (p ImagePullPolicy) Validate() error {
|
||||
switch p {
|
||||
case ImagePullPolicyIfMissing, ImagePullPolicyAlways, ImagePullPolicyNever:
|
||||
return nil
|
||||
default:
|
||||
return fmt.Errorf("image pull policy %q must be one of %q, %q, %q",
|
||||
p, ImagePullPolicyIfMissing, ImagePullPolicyAlways, ImagePullPolicyNever)
|
||||
}
|
||||
}
|
||||
|
||||
// Config stores the full Runtime Manager process configuration.
|
||||
type Config struct {
|
||||
// ShutdownTimeout bounds graceful shutdown of every long-lived
|
||||
// component.
|
||||
ShutdownTimeout time.Duration
|
||||
|
||||
// Logging configures the process-wide structured logger.
|
||||
Logging LoggingConfig
|
||||
|
||||
// InternalHTTP configures the trusted internal HTTP listener that
|
||||
// serves probes and the GM/Admin REST surface.
|
||||
InternalHTTP InternalHTTPConfig
|
||||
|
||||
// Docker configures the Docker SDK client RTM uses to drive the local
|
||||
// Docker daemon.
|
||||
Docker DockerConfig
|
||||
|
||||
// Postgres configures the PostgreSQL-backed durable store consumed via
|
||||
// `pkg/postgres`.
|
||||
Postgres PostgresConfig
|
||||
|
||||
// Redis configures the shared Redis connection topology consumed via
|
||||
// `pkg/redisconn`.
|
||||
Redis RedisConfig
|
||||
|
||||
// Streams stores the stable Redis Stream names RTM reads from and
|
||||
// writes to.
|
||||
Streams StreamsConfig
|
||||
|
||||
// Container stores the per-container defaults applied at start time
|
||||
// when the resolved image does not declare its own labels.
|
||||
Container ContainerConfig
|
||||
|
||||
// Health configures the periodic health-monitoring workers (events
|
||||
// listener, inspect, active probe).
|
||||
Health HealthConfig
|
||||
|
||||
// Cleanup configures the reconciler and container-cleanup workers.
|
||||
Cleanup CleanupConfig
|
||||
|
||||
// Coordination configures the per-game Redis lease used to serialise
|
||||
// operations across all entry points.
|
||||
Coordination CoordinationConfig
|
||||
|
||||
// Lobby configures the synchronous Lobby internal REST client used by
|
||||
// the start service for ancillary lookups.
|
||||
Lobby LobbyConfig
|
||||
|
||||
// Telemetry configures the process-wide OpenTelemetry runtime.
|
||||
Telemetry TelemetryConfig
|
||||
}
|
||||
|
||||
// LoggingConfig configures the process-wide structured logger.
|
||||
type LoggingConfig struct {
|
||||
// Level stores the process log level accepted by log/slog.
|
||||
Level string
|
||||
}
|
||||
|
||||
// InternalHTTPConfig configures the trusted internal HTTP listener.
|
||||
type InternalHTTPConfig struct {
|
||||
// Addr stores the TCP listen address.
|
||||
Addr string
|
||||
|
||||
// ReadHeaderTimeout bounds request-header reading.
|
||||
ReadHeaderTimeout time.Duration
|
||||
|
||||
// ReadTimeout bounds reading one request.
|
||||
ReadTimeout time.Duration
|
||||
|
||||
// WriteTimeout bounds writing one response.
|
||||
WriteTimeout time.Duration
|
||||
|
||||
// IdleTimeout bounds how long keep-alive connections stay open.
|
||||
IdleTimeout time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable internal HTTP listener
|
||||
// configuration.
|
||||
func (cfg InternalHTTPConfig) Validate() error {
|
||||
switch {
|
||||
case strings.TrimSpace(cfg.Addr) == "":
|
||||
return fmt.Errorf("internal HTTP addr must not be empty")
|
||||
case !isTCPAddr(cfg.Addr):
|
||||
return fmt.Errorf("internal HTTP addr %q must use host:port form", cfg.Addr)
|
||||
case cfg.ReadHeaderTimeout <= 0:
|
||||
return fmt.Errorf("internal HTTP read header timeout must be positive")
|
||||
case cfg.ReadTimeout <= 0:
|
||||
return fmt.Errorf("internal HTTP read timeout must be positive")
|
||||
case cfg.WriteTimeout <= 0:
|
||||
return fmt.Errorf("internal HTTP write timeout must be positive")
|
||||
case cfg.IdleTimeout <= 0:
|
||||
return fmt.Errorf("internal HTTP idle timeout must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// DockerConfig configures the Docker SDK client.
|
||||
type DockerConfig struct {
|
||||
// Host stores the Docker daemon endpoint (e.g.
|
||||
// `unix:///var/run/docker.sock`).
|
||||
Host string
|
||||
|
||||
// APIVersion overrides the Docker API version. Empty lets the SDK
|
||||
// negotiate.
|
||||
APIVersion string
|
||||
|
||||
// Network stores the user-defined Docker bridge network containers
|
||||
// attach to. Provisioned outside RTM; missing network is a fail-fast
|
||||
// condition at startup.
|
||||
Network string
|
||||
|
||||
// LogDriver stores the Docker logging driver applied to engine
|
||||
// containers.
|
||||
LogDriver string
|
||||
|
||||
// LogOpts stores the comma-separated `key=value` driver options.
|
||||
LogOpts string
|
||||
|
||||
// PullPolicy stores the configured image pull policy.
|
||||
PullPolicy ImagePullPolicy
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable Docker configuration.
|
||||
func (cfg DockerConfig) Validate() error {
|
||||
switch {
|
||||
case strings.TrimSpace(cfg.Host) == "":
|
||||
return fmt.Errorf("docker host must not be empty")
|
||||
case strings.TrimSpace(cfg.Network) == "":
|
||||
return fmt.Errorf("docker network must not be empty")
|
||||
case strings.TrimSpace(cfg.LogDriver) == "":
|
||||
return fmt.Errorf("docker log driver must not be empty")
|
||||
}
|
||||
return cfg.PullPolicy.Validate()
|
||||
}
|
||||
|
||||
// PostgresConfig configures the PostgreSQL-backed durable store consumed
|
||||
// via `pkg/postgres`.
|
||||
type PostgresConfig struct {
|
||||
// Conn carries the primary plus replica DSN topology and pool tuning.
|
||||
Conn postgres.Config
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable PostgreSQL configuration.
|
||||
func (cfg PostgresConfig) Validate() error {
|
||||
return cfg.Conn.Validate()
|
||||
}
|
||||
|
||||
// RedisConfig configures the Runtime Manager Redis connection topology.
|
||||
type RedisConfig struct {
|
||||
// Conn carries the connection topology (master, replicas, password,
|
||||
// db, per-call timeout).
|
||||
Conn redisconn.Config
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable Redis configuration.
|
||||
func (cfg RedisConfig) Validate() error {
|
||||
return cfg.Conn.Validate()
|
||||
}
|
||||
|
||||
// StreamsConfig stores the stable Redis Stream names used by Runtime
|
||||
// Manager.
|
||||
type StreamsConfig struct {
|
||||
// StartJobs stores the Redis Streams key Lobby writes start jobs to.
|
||||
StartJobs string
|
||||
|
||||
// StopJobs stores the Redis Streams key Lobby writes stop jobs to.
|
||||
StopJobs string
|
||||
|
||||
// JobResults stores the Redis Streams key RTM writes job outcomes
|
||||
// to.
|
||||
JobResults string
|
||||
|
||||
// HealthEvents stores the Redis Streams key RTM publishes
|
||||
// technical health events to.
|
||||
HealthEvents string
|
||||
|
||||
// NotificationIntents stores the Redis Streams key RTM publishes
|
||||
// admin-only notification intents to.
|
||||
NotificationIntents string
|
||||
|
||||
// BlockTimeout bounds the maximum blocking read window for stream
|
||||
// consumers.
|
||||
BlockTimeout time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores usable stream names.
|
||||
func (cfg StreamsConfig) Validate() error {
|
||||
switch {
|
||||
case strings.TrimSpace(cfg.StartJobs) == "":
|
||||
return fmt.Errorf("redis start jobs stream must not be empty")
|
||||
case strings.TrimSpace(cfg.StopJobs) == "":
|
||||
return fmt.Errorf("redis stop jobs stream must not be empty")
|
||||
case strings.TrimSpace(cfg.JobResults) == "":
|
||||
return fmt.Errorf("redis job results stream must not be empty")
|
||||
case strings.TrimSpace(cfg.HealthEvents) == "":
|
||||
return fmt.Errorf("redis health events stream must not be empty")
|
||||
case strings.TrimSpace(cfg.NotificationIntents) == "":
|
||||
return fmt.Errorf("redis notification intents stream must not be empty")
|
||||
case cfg.BlockTimeout <= 0:
|
||||
return fmt.Errorf("redis stream block timeout must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// ContainerConfig stores the per-container defaults applied at start
|
||||
// time. Resource defaults apply when the resolved engine image does not
|
||||
// expose `com.galaxy.cpu_quota` / `com.galaxy.memory` /
|
||||
// `com.galaxy.pids_limit` labels.
|
||||
type ContainerConfig struct {
|
||||
// DefaultCPUQuota is the fallback `--cpus` value applied when the
|
||||
// image does not declare `com.galaxy.cpu_quota`.
|
||||
DefaultCPUQuota float64
|
||||
|
||||
// DefaultMemory is the fallback `--memory` value applied when the
|
||||
// image does not declare `com.galaxy.memory`.
|
||||
DefaultMemory string
|
||||
|
||||
// DefaultPIDsLimit is the fallback `--pids-limit` value applied
|
||||
// when the image does not declare `com.galaxy.pids_limit`.
|
||||
DefaultPIDsLimit int
|
||||
|
||||
// StopTimeout bounds graceful container stop before Docker fires
|
||||
// SIGKILL.
|
||||
StopTimeout time.Duration
|
||||
|
||||
// Retention stores the TTL after which `status=stopped` containers
|
||||
// are removed by the cleanup worker.
|
||||
Retention time.Duration
|
||||
|
||||
// EngineStateMountPath is the in-container path the per-game state
|
||||
// directory is bind-mounted to.
|
||||
EngineStateMountPath string
|
||||
|
||||
// EngineStateEnvName is the env-var name forwarded to the engine
|
||||
// pointing at EngineStateMountPath.
|
||||
EngineStateEnvName string
|
||||
|
||||
// GameStateDirMode stores the unix permissions applied to the
|
||||
// per-game state directory on creation.
|
||||
GameStateDirMode uint32
|
||||
|
||||
// GameStateOwnerUID stores the unix uid applied to the per-game
|
||||
// state directory on creation.
|
||||
GameStateOwnerUID int
|
||||
|
||||
// GameStateOwnerGID stores the unix gid applied to the per-game
|
||||
// state directory on creation.
|
||||
GameStateOwnerGID int
|
||||
|
||||
// GameStateRoot is the host path under which per-game state
|
||||
// directories are created.
|
||||
GameStateRoot string
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores usable container defaults.
|
||||
func (cfg ContainerConfig) Validate() error {
|
||||
switch {
|
||||
case cfg.DefaultCPUQuota <= 0:
|
||||
return fmt.Errorf("default cpu quota must be positive")
|
||||
case strings.TrimSpace(cfg.DefaultMemory) == "":
|
||||
return fmt.Errorf("default memory must not be empty")
|
||||
case cfg.DefaultPIDsLimit <= 0:
|
||||
return fmt.Errorf("default pids limit must be positive")
|
||||
case cfg.StopTimeout <= 0:
|
||||
return fmt.Errorf("container stop timeout must be positive")
|
||||
case cfg.Retention <= 0:
|
||||
return fmt.Errorf("container retention must be positive")
|
||||
case strings.TrimSpace(cfg.EngineStateMountPath) == "":
|
||||
return fmt.Errorf("engine state mount path must not be empty")
|
||||
case strings.TrimSpace(cfg.EngineStateEnvName) == "":
|
||||
return fmt.Errorf("engine state env name must not be empty")
|
||||
case cfg.GameStateDirMode == 0:
|
||||
return fmt.Errorf("game state dir mode must be non-zero")
|
||||
case strings.TrimSpace(cfg.GameStateRoot) == "":
|
||||
return fmt.Errorf("game state root must not be empty")
|
||||
case !strings.HasPrefix(strings.TrimSpace(cfg.GameStateRoot), "/"):
|
||||
return fmt.Errorf("game state root %q must be an absolute path", cfg.GameStateRoot)
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// HealthConfig configures the periodic health-monitoring workers
|
||||
// (Docker events listener, periodic inspect, active probe).
|
||||
type HealthConfig struct {
|
||||
// InspectInterval is the period between two periodic Docker inspect
|
||||
// passes.
|
||||
InspectInterval time.Duration
|
||||
|
||||
// ProbeInterval is the period between two engine `/healthz` probe
|
||||
// rounds.
|
||||
ProbeInterval time.Duration
|
||||
|
||||
// ProbeTimeout bounds one engine `/healthz` request.
|
||||
ProbeTimeout time.Duration
|
||||
|
||||
// ProbeFailuresThreshold is the consecutive-failure count that
|
||||
// triggers a `probe_failed` event.
|
||||
ProbeFailuresThreshold int
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores usable health-monitoring settings.
|
||||
func (cfg HealthConfig) Validate() error {
|
||||
switch {
|
||||
case cfg.InspectInterval <= 0:
|
||||
return fmt.Errorf("inspect interval must be positive")
|
||||
case cfg.ProbeInterval <= 0:
|
||||
return fmt.Errorf("probe interval must be positive")
|
||||
case cfg.ProbeTimeout <= 0:
|
||||
return fmt.Errorf("probe timeout must be positive")
|
||||
case cfg.ProbeFailuresThreshold <= 0:
|
||||
return fmt.Errorf("probe failures threshold must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// CleanupConfig configures the reconciler and container-cleanup workers.
|
||||
type CleanupConfig struct {
|
||||
// ReconcileInterval is the period between two reconciler passes.
|
||||
ReconcileInterval time.Duration
|
||||
|
||||
// CleanupInterval is the period between two container-cleanup
|
||||
// passes.
|
||||
CleanupInterval time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores usable cleanup settings.
|
||||
func (cfg CleanupConfig) Validate() error {
|
||||
switch {
|
||||
case cfg.ReconcileInterval <= 0:
|
||||
return fmt.Errorf("reconcile interval must be positive")
|
||||
case cfg.CleanupInterval <= 0:
|
||||
return fmt.Errorf("cleanup interval must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// CoordinationConfig configures the per-game Redis lease.
|
||||
type CoordinationConfig struct {
|
||||
// GameLeaseTTL bounds the per-game lease lifetime renewed every
|
||||
// half-TTL while an operation runs.
|
||||
GameLeaseTTL time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable lease configuration.
|
||||
func (cfg CoordinationConfig) Validate() error {
|
||||
if cfg.GameLeaseTTL <= 0 {
|
||||
return fmt.Errorf("game lease ttl must be positive")
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// LobbyConfig configures the synchronous Lobby internal REST client.
|
||||
type LobbyConfig struct {
|
||||
// BaseURL stores the trusted Lobby internal listener base URL.
|
||||
BaseURL string
|
||||
|
||||
// Timeout bounds one Lobby internal request.
|
||||
Timeout time.Duration
|
||||
}
|
||||
|
||||
// Validate reports whether cfg stores a usable Lobby client
|
||||
// configuration.
|
||||
func (cfg LobbyConfig) Validate() error {
|
||||
switch {
|
||||
case strings.TrimSpace(cfg.BaseURL) == "":
|
||||
return fmt.Errorf("lobby internal base url must not be empty")
|
||||
case !isHTTPURL(cfg.BaseURL):
|
||||
return fmt.Errorf("lobby internal base url %q must be an absolute http(s) URL", cfg.BaseURL)
|
||||
case cfg.Timeout <= 0:
|
||||
return fmt.Errorf("lobby internal timeout must be positive")
|
||||
default:
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
// TelemetryConfig configures the Runtime Manager OpenTelemetry runtime.
|
||||
type TelemetryConfig struct {
|
||||
// ServiceName overrides the default OpenTelemetry service name.
|
||||
ServiceName string
|
||||
|
||||
// TracesExporter selects the external traces exporter. Supported
|
||||
// values are `none` and `otlp`.
|
||||
TracesExporter string
|
||||
|
||||
// MetricsExporter selects the external metrics exporter. Supported
|
||||
// values are `none` and `otlp`.
|
||||
MetricsExporter string
|
||||
|
||||
// TracesProtocol selects the OTLP traces protocol when
|
||||
// TracesExporter is `otlp`.
|
||||
TracesProtocol string
|
||||
|
||||
// MetricsProtocol selects the OTLP metrics protocol when
|
||||
// MetricsExporter is `otlp`.
|
||||
MetricsProtocol string
|
||||
|
||||
// StdoutTracesEnabled enables the additional stdout trace exporter
|
||||
// used for local development and debugging.
|
||||
StdoutTracesEnabled bool
|
||||
|
||||
// StdoutMetricsEnabled enables the additional stdout metric
|
||||
// exporter used for local development and debugging.
|
||||
StdoutMetricsEnabled bool
|
||||
}
|
||||
|
||||
// Validate reports whether cfg contains a supported OpenTelemetry
|
||||
// configuration.
|
||||
func (cfg TelemetryConfig) Validate() error {
|
||||
return telemetry.ProcessConfig{
|
||||
ServiceName: cfg.ServiceName,
|
||||
TracesExporter: cfg.TracesExporter,
|
||||
MetricsExporter: cfg.MetricsExporter,
|
||||
TracesProtocol: cfg.TracesProtocol,
|
||||
MetricsProtocol: cfg.MetricsProtocol,
|
||||
StdoutTracesEnabled: cfg.StdoutTracesEnabled,
|
||||
StdoutMetricsEnabled: cfg.StdoutMetricsEnabled,
|
||||
}.Validate()
|
||||
}
|
||||
|
||||
// DefaultConfig returns the default Runtime Manager process configuration.
|
||||
func DefaultConfig() Config {
|
||||
return Config{
|
||||
ShutdownTimeout: defaultShutdownTimeout,
|
||||
Logging: LoggingConfig{
|
||||
Level: defaultLogLevel,
|
||||
},
|
||||
InternalHTTP: InternalHTTPConfig{
|
||||
Addr: defaultInternalHTTPAddr,
|
||||
ReadHeaderTimeout: defaultReadHeaderTimeout,
|
||||
ReadTimeout: defaultReadTimeout,
|
||||
WriteTimeout: defaultWriteTimeout,
|
||||
IdleTimeout: defaultIdleTimeout,
|
||||
},
|
||||
Docker: DockerConfig{
|
||||
Host: defaultDockerHost,
|
||||
Network: defaultDockerNetwork,
|
||||
LogDriver: defaultDockerLogDriver,
|
||||
PullPolicy: defaultImagePullPolicy,
|
||||
},
|
||||
Postgres: PostgresConfig{
|
||||
Conn: postgres.DefaultConfig(),
|
||||
},
|
||||
Redis: RedisConfig{
|
||||
Conn: redisconn.DefaultConfig(),
|
||||
},
|
||||
Streams: StreamsConfig{
|
||||
StartJobs: defaultStartJobsStream,
|
||||
StopJobs: defaultStopJobsStream,
|
||||
JobResults: defaultJobResultsStream,
|
||||
HealthEvents: defaultHealthEventsStream,
|
||||
NotificationIntents: defaultNotificationIntentsKey,
|
||||
BlockTimeout: defaultStreamBlockTimeout,
|
||||
},
|
||||
Container: ContainerConfig{
|
||||
DefaultCPUQuota: defaultCPUQuota,
|
||||
DefaultMemory: defaultMemory,
|
||||
DefaultPIDsLimit: defaultPIDsLimit,
|
||||
StopTimeout: defaultContainerStopTimeout,
|
||||
Retention: defaultContainerRetention,
|
||||
EngineStateMountPath: defaultEngineStateMountPath,
|
||||
EngineStateEnvName: defaultEngineStateEnvName,
|
||||
GameStateDirMode: defaultGameStateDirMode,
|
||||
},
|
||||
Health: HealthConfig{
|
||||
InspectInterval: defaultInspectInterval,
|
||||
ProbeInterval: defaultProbeInterval,
|
||||
ProbeTimeout: defaultProbeTimeout,
|
||||
ProbeFailuresThreshold: defaultProbeFailuresThreshold,
|
||||
},
|
||||
Cleanup: CleanupConfig{
|
||||
ReconcileInterval: defaultReconcileInterval,
|
||||
CleanupInterval: defaultCleanupInterval,
|
||||
},
|
||||
Coordination: CoordinationConfig{
|
||||
GameLeaseTTL: defaultGameLeaseTTL,
|
||||
},
|
||||
Lobby: LobbyConfig{
|
||||
Timeout: defaultLobbyInternalTimeout,
|
||||
},
|
||||
Telemetry: TelemetryConfig{
|
||||
ServiceName: defaultOTelServiceName,
|
||||
TracesExporter: "none",
|
||||
MetricsExporter: "none",
|
||||
},
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,142 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"strings"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func validEnv(t *testing.T) {
|
||||
t.Helper()
|
||||
|
||||
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy?search_path=rtmanager&sslmode=disable")
|
||||
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
|
||||
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
|
||||
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "/var/lib/galaxy/games")
|
||||
t.Setenv("RTMANAGER_LOBBY_INTERNAL_BASE_URL", "http://lobby:8095")
|
||||
}
|
||||
|
||||
func TestLoadFromEnvAcceptsDefaults(t *testing.T) {
|
||||
validEnv(t)
|
||||
|
||||
cfg, err := LoadFromEnv()
|
||||
require.NoError(t, err)
|
||||
|
||||
require.Equal(t, ":8096", cfg.InternalHTTP.Addr)
|
||||
require.Equal(t, "unix:///var/run/docker.sock", cfg.Docker.Host)
|
||||
require.Equal(t, "galaxy-net", cfg.Docker.Network)
|
||||
require.Equal(t, "json-file", cfg.Docker.LogDriver)
|
||||
require.Equal(t, ImagePullPolicyIfMissing, cfg.Docker.PullPolicy)
|
||||
require.Equal(t, "runtime:start_jobs", cfg.Streams.StartJobs)
|
||||
require.Equal(t, "runtime:stop_jobs", cfg.Streams.StopJobs)
|
||||
require.Equal(t, "runtime:job_results", cfg.Streams.JobResults)
|
||||
require.Equal(t, "runtime:health_events", cfg.Streams.HealthEvents)
|
||||
require.Equal(t, "notification:intents", cfg.Streams.NotificationIntents)
|
||||
require.Equal(t, 30*time.Second, cfg.Container.StopTimeout)
|
||||
require.Equal(t, 30*24*time.Hour, cfg.Container.Retention)
|
||||
require.Equal(t, "/var/lib/galaxy-game", cfg.Container.EngineStateMountPath)
|
||||
require.Equal(t, "GAME_STATE_PATH", cfg.Container.EngineStateEnvName)
|
||||
require.EqualValues(t, 0o750, cfg.Container.GameStateDirMode)
|
||||
require.Equal(t, 60*time.Second, cfg.Coordination.GameLeaseTTL)
|
||||
require.Equal(t, "http://lobby:8095", cfg.Lobby.BaseURL)
|
||||
require.Equal(t, 2*time.Second, cfg.Lobby.Timeout)
|
||||
require.Equal(t, "galaxy-rtmanager", cfg.Telemetry.ServiceName)
|
||||
}
|
||||
|
||||
func TestLoadFromEnvHonoursOverrides(t *testing.T) {
|
||||
validEnv(t)
|
||||
t.Setenv("RTMANAGER_INTERNAL_HTTP_ADDR", ":9000")
|
||||
t.Setenv("RTMANAGER_DOCKER_NETWORK", "custom-net")
|
||||
t.Setenv("RTMANAGER_IMAGE_PULL_POLICY", "always")
|
||||
t.Setenv("RTMANAGER_REDIS_START_JOBS_STREAM", "custom:start_jobs")
|
||||
t.Setenv("RTMANAGER_GAME_LEASE_TTL_SECONDS", "120")
|
||||
t.Setenv("RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS", "45")
|
||||
t.Setenv("RTMANAGER_CONTAINER_RETENTION_DAYS", "7")
|
||||
t.Setenv("RTMANAGER_GAME_STATE_DIR_MODE", "0700")
|
||||
|
||||
cfg, err := LoadFromEnv()
|
||||
require.NoError(t, err)
|
||||
|
||||
require.Equal(t, ":9000", cfg.InternalHTTP.Addr)
|
||||
require.Equal(t, "custom-net", cfg.Docker.Network)
|
||||
require.Equal(t, ImagePullPolicyAlways, cfg.Docker.PullPolicy)
|
||||
require.Equal(t, "custom:start_jobs", cfg.Streams.StartJobs)
|
||||
require.Equal(t, 120*time.Second, cfg.Coordination.GameLeaseTTL)
|
||||
require.Equal(t, 45*time.Second, cfg.Container.StopTimeout)
|
||||
require.Equal(t, 7*24*time.Hour, cfg.Container.Retention)
|
||||
require.EqualValues(t, 0o700, cfg.Container.GameStateDirMode)
|
||||
}
|
||||
|
||||
func TestLoadFromEnvRejectsUnknownPullPolicy(t *testing.T) {
|
||||
validEnv(t)
|
||||
t.Setenv("RTMANAGER_IMAGE_PULL_POLICY", "weekly")
|
||||
|
||||
_, err := LoadFromEnv()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "image pull policy")
|
||||
}
|
||||
|
||||
func TestLoadFromEnvRequiresGameStateRoot(t *testing.T) {
|
||||
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy")
|
||||
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
|
||||
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
|
||||
t.Setenv("RTMANAGER_LOBBY_INTERNAL_BASE_URL", "http://lobby:8095")
|
||||
|
||||
_, err := LoadFromEnv()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "RTMANAGER_GAME_STATE_ROOT")
|
||||
}
|
||||
|
||||
func TestLoadFromEnvRequiresLobbyBaseURL(t *testing.T) {
|
||||
t.Setenv("RTMANAGER_POSTGRES_PRIMARY_DSN", "postgres://rtm:secret@localhost:5432/galaxy")
|
||||
t.Setenv("RTMANAGER_REDIS_MASTER_ADDR", "localhost:6379")
|
||||
t.Setenv("RTMANAGER_REDIS_PASSWORD", "secret")
|
||||
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "/var/lib/galaxy/games")
|
||||
|
||||
_, err := LoadFromEnv()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "RTMANAGER_LOBBY_INTERNAL_BASE_URL")
|
||||
}
|
||||
|
||||
func TestLoadFromEnvRejectsRelativeStateRoot(t *testing.T) {
|
||||
validEnv(t)
|
||||
t.Setenv("RTMANAGER_GAME_STATE_ROOT", "relative/path")
|
||||
|
||||
_, err := LoadFromEnv()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "absolute path")
|
||||
}
|
||||
|
||||
func TestLoadFromEnvRejectsBadLogLevel(t *testing.T) {
|
||||
validEnv(t)
|
||||
t.Setenv("RTMANAGER_LOG_LEVEL", "verbose")
|
||||
|
||||
_, err := LoadFromEnv()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "RTMANAGER_LOG_LEVEL")
|
||||
}
|
||||
|
||||
func TestImagePullPolicyValidate(t *testing.T) {
|
||||
require.NoError(t, ImagePullPolicyIfMissing.Validate())
|
||||
require.NoError(t, ImagePullPolicyAlways.Validate())
|
||||
require.NoError(t, ImagePullPolicyNever.Validate())
|
||||
require.Error(t, ImagePullPolicy("monthly").Validate())
|
||||
}
|
||||
|
||||
func TestInternalHTTPValidateRejectsBadAddr(t *testing.T) {
|
||||
cfg := DefaultConfig().InternalHTTP
|
||||
cfg.Addr = "not-an-addr"
|
||||
err := cfg.Validate()
|
||||
require.Error(t, err)
|
||||
require.Contains(t, err.Error(), "host:port")
|
||||
}
|
||||
|
||||
func TestStreamsValidateRequiresAllNames(t *testing.T) {
|
||||
cfg := DefaultConfig().Streams
|
||||
cfg.StartJobs = " "
|
||||
err := cfg.Validate()
|
||||
require.Error(t, err)
|
||||
require.True(t, strings.Contains(err.Error(), "start jobs"))
|
||||
}
|
||||
@@ -0,0 +1,319 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"galaxy/postgres"
|
||||
"galaxy/redisconn"
|
||||
)
|
||||
|
||||
// LoadFromEnv builds Config from environment variables and validates the
|
||||
// resulting configuration.
|
||||
func LoadFromEnv() (Config, error) {
|
||||
cfg := DefaultConfig()
|
||||
|
||||
var err error
|
||||
|
||||
cfg.ShutdownTimeout, err = durationEnv(shutdownTimeoutEnvVar, cfg.ShutdownTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Logging.Level = stringEnv(logLevelEnvVar, cfg.Logging.Level)
|
||||
|
||||
cfg.InternalHTTP.Addr = stringEnv(internalHTTPAddrEnvVar, cfg.InternalHTTP.Addr)
|
||||
cfg.InternalHTTP.ReadHeaderTimeout, err = durationEnv(internalHTTPReadHeaderTimeoutEnvVar, cfg.InternalHTTP.ReadHeaderTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.InternalHTTP.ReadTimeout, err = durationEnv(internalHTTPReadTimeoutEnvVar, cfg.InternalHTTP.ReadTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.InternalHTTP.WriteTimeout, err = durationEnv(internalHTTPWriteTimeoutEnvVar, cfg.InternalHTTP.WriteTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.InternalHTTP.IdleTimeout, err = durationEnv(internalHTTPIdleTimeoutEnvVar, cfg.InternalHTTP.IdleTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Docker.Host = stringEnv(dockerHostEnvVar, cfg.Docker.Host)
|
||||
cfg.Docker.APIVersion = stringEnv(dockerAPIVersionEnvVar, cfg.Docker.APIVersion)
|
||||
cfg.Docker.Network = stringEnv(dockerNetworkEnvVar, cfg.Docker.Network)
|
||||
cfg.Docker.LogDriver = stringEnv(dockerLogDriverEnvVar, cfg.Docker.LogDriver)
|
||||
cfg.Docker.LogOpts = stringEnv(dockerLogOptsEnvVar, cfg.Docker.LogOpts)
|
||||
if raw, ok := os.LookupEnv(imagePullPolicyEnvVar); ok {
|
||||
cfg.Docker.PullPolicy = ImagePullPolicy(strings.TrimSpace(raw))
|
||||
}
|
||||
|
||||
pgConn, err := postgres.LoadFromEnv(envPrefix)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Postgres.Conn = pgConn
|
||||
|
||||
redisConn, err := redisconn.LoadFromEnv(envPrefix)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Redis.Conn = redisConn
|
||||
|
||||
cfg.Streams.StartJobs = stringEnv(startJobsStreamEnvVar, cfg.Streams.StartJobs)
|
||||
cfg.Streams.StopJobs = stringEnv(stopJobsStreamEnvVar, cfg.Streams.StopJobs)
|
||||
cfg.Streams.JobResults = stringEnv(jobResultsStreamEnvVar, cfg.Streams.JobResults)
|
||||
cfg.Streams.HealthEvents = stringEnv(healthEventsStreamEnvVar, cfg.Streams.HealthEvents)
|
||||
cfg.Streams.NotificationIntents = stringEnv(notificationIntentsStreamEnv, cfg.Streams.NotificationIntents)
|
||||
cfg.Streams.BlockTimeout, err = durationEnv(streamBlockTimeoutEnvVar, cfg.Streams.BlockTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Container.DefaultCPUQuota, err = floatEnv(defaultCPUQuotaEnvVar, cfg.Container.DefaultCPUQuota)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.DefaultMemory = stringEnv(defaultMemoryEnvVar, cfg.Container.DefaultMemory)
|
||||
cfg.Container.DefaultPIDsLimit, err = intEnv(defaultPIDsLimitEnvVar, cfg.Container.DefaultPIDsLimit)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.StopTimeout, err = secondsEnv(containerStopTimeoutSecondsEnvVar, cfg.Container.StopTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.Retention, err = daysEnv(containerRetentionDaysEnvVar, cfg.Container.Retention)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.EngineStateMountPath = stringEnv(engineStateMountPathEnvVar, cfg.Container.EngineStateMountPath)
|
||||
cfg.Container.EngineStateEnvName = stringEnv(engineStateEnvNameEnvVar, cfg.Container.EngineStateEnvName)
|
||||
cfg.Container.GameStateDirMode, err = octalUint32Env(gameStateDirModeEnvVar, cfg.Container.GameStateDirMode)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.GameStateOwnerUID, err = intEnv(gameStateOwnerUIDEnvVar, cfg.Container.GameStateOwnerUID)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Container.GameStateOwnerGID, err = intEnv(gameStateOwnerGIDEnvVar, cfg.Container.GameStateOwnerGID)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
root, ok := os.LookupEnv(gameStateRootEnvVar)
|
||||
if !ok || strings.TrimSpace(root) == "" {
|
||||
return Config{}, fmt.Errorf("%s must be set", gameStateRootEnvVar)
|
||||
}
|
||||
cfg.Container.GameStateRoot = strings.TrimSpace(root)
|
||||
|
||||
cfg.Health.InspectInterval, err = durationEnv(inspectIntervalEnvVar, cfg.Health.InspectInterval)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Health.ProbeInterval, err = durationEnv(probeIntervalEnvVar, cfg.Health.ProbeInterval)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Health.ProbeTimeout, err = durationEnv(probeTimeoutEnvVar, cfg.Health.ProbeTimeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Health.ProbeFailuresThreshold, err = intEnv(probeFailuresThresholdEnvVar, cfg.Health.ProbeFailuresThreshold)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Cleanup.ReconcileInterval, err = durationEnv(reconcileIntervalEnvVar, cfg.Cleanup.ReconcileInterval)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Cleanup.CleanupInterval, err = durationEnv(cleanupIntervalEnvVar, cfg.Cleanup.CleanupInterval)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Coordination.GameLeaseTTL, err = secondsEnv(gameLeaseTTLSecondsEnvVar, cfg.Coordination.GameLeaseTTL)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
lobbyURL, ok := os.LookupEnv(lobbyInternalBaseURLEnvVar)
|
||||
if !ok || strings.TrimSpace(lobbyURL) == "" {
|
||||
return Config{}, fmt.Errorf("%s must be set", lobbyInternalBaseURLEnvVar)
|
||||
}
|
||||
cfg.Lobby.BaseURL = strings.TrimSpace(lobbyURL)
|
||||
cfg.Lobby.Timeout, err = durationEnv(lobbyInternalTimeoutEnvVar, cfg.Lobby.Timeout)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
cfg.Telemetry.ServiceName = stringEnv(otelServiceNameEnvVar, cfg.Telemetry.ServiceName)
|
||||
cfg.Telemetry.TracesExporter = normalizeExporterValue(stringEnv(otelTracesExporterEnvVar, cfg.Telemetry.TracesExporter))
|
||||
cfg.Telemetry.MetricsExporter = normalizeExporterValue(stringEnv(otelMetricsExporterEnvVar, cfg.Telemetry.MetricsExporter))
|
||||
cfg.Telemetry.TracesProtocol = normalizeProtocolValue(
|
||||
os.Getenv(otelExporterOTLPTracesProtocolEnvVar),
|
||||
os.Getenv(otelExporterOTLPProtocolEnvVar),
|
||||
cfg.Telemetry.TracesProtocol,
|
||||
)
|
||||
cfg.Telemetry.MetricsProtocol = normalizeProtocolValue(
|
||||
os.Getenv(otelExporterOTLPMetricsProtocolEnvVar),
|
||||
os.Getenv(otelExporterOTLPProtocolEnvVar),
|
||||
cfg.Telemetry.MetricsProtocol,
|
||||
)
|
||||
cfg.Telemetry.StdoutTracesEnabled, err = boolEnv(otelStdoutTracesEnabledEnvVar, cfg.Telemetry.StdoutTracesEnabled)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
cfg.Telemetry.StdoutMetricsEnabled, err = boolEnv(otelStdoutMetricsEnabledEnvVar, cfg.Telemetry.StdoutMetricsEnabled)
|
||||
if err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
if err := cfg.Validate(); err != nil {
|
||||
return Config{}, err
|
||||
}
|
||||
|
||||
return cfg, nil
|
||||
}
|
||||
|
||||
func stringEnv(name string, fallback string) string {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback
|
||||
}
|
||||
|
||||
return strings.TrimSpace(value)
|
||||
}
|
||||
|
||||
func durationEnv(name string, fallback time.Duration) (time.Duration, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := time.ParseDuration(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse duration: %w", name, err)
|
||||
}
|
||||
|
||||
return parsed, nil
|
||||
}
|
||||
|
||||
func secondsEnv(name string, fallback time.Duration) (time.Duration, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.Atoi(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse seconds: %w", name, err)
|
||||
}
|
||||
if parsed <= 0 {
|
||||
return 0, fmt.Errorf("%s: must be positive", name)
|
||||
}
|
||||
|
||||
return time.Duration(parsed) * time.Second, nil
|
||||
}
|
||||
|
||||
func daysEnv(name string, fallback time.Duration) (time.Duration, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.Atoi(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse days: %w", name, err)
|
||||
}
|
||||
if parsed <= 0 {
|
||||
return 0, fmt.Errorf("%s: must be positive", name)
|
||||
}
|
||||
|
||||
return time.Duration(parsed) * 24 * time.Hour, nil
|
||||
}
|
||||
|
||||
func intEnv(name string, fallback int) (int, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.Atoi(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse int: %w", name, err)
|
||||
}
|
||||
|
||||
return parsed, nil
|
||||
}
|
||||
|
||||
func floatEnv(name string, fallback float64) (float64, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.ParseFloat(strings.TrimSpace(value), 64)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse float: %w", name, err)
|
||||
}
|
||||
|
||||
return parsed, nil
|
||||
}
|
||||
|
||||
func boolEnv(name string, fallback bool) (bool, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.ParseBool(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return false, fmt.Errorf("%s: parse bool: %w", name, err)
|
||||
}
|
||||
|
||||
return parsed, nil
|
||||
}
|
||||
|
||||
func octalUint32Env(name string, fallback uint32) (uint32, error) {
|
||||
value, ok := os.LookupEnv(name)
|
||||
if !ok {
|
||||
return fallback, nil
|
||||
}
|
||||
|
||||
parsed, err := strconv.ParseUint(strings.TrimSpace(value), 8, 32)
|
||||
if err != nil {
|
||||
return 0, fmt.Errorf("%s: parse octal: %w", name, err)
|
||||
}
|
||||
|
||||
return uint32(parsed), nil
|
||||
}
|
||||
|
||||
func normalizeExporterValue(value string) string {
|
||||
trimmed := strings.TrimSpace(value)
|
||||
switch trimmed {
|
||||
case "", "none":
|
||||
return "none"
|
||||
default:
|
||||
return trimmed
|
||||
}
|
||||
}
|
||||
|
||||
func normalizeProtocolValue(primary string, fallback string, defaultValue string) string {
|
||||
primary = strings.TrimSpace(primary)
|
||||
if primary != "" {
|
||||
return primary
|
||||
}
|
||||
|
||||
fallback = strings.TrimSpace(fallback)
|
||||
if fallback != "" {
|
||||
return fallback
|
||||
}
|
||||
|
||||
return strings.TrimSpace(defaultValue)
|
||||
}
|
||||
@@ -0,0 +1,93 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"log/slog"
|
||||
"net"
|
||||
"net/url"
|
||||
"strings"
|
||||
)
|
||||
|
||||
// Validate reports whether cfg stores a usable Runtime Manager process
|
||||
// configuration.
|
||||
func (cfg Config) Validate() error {
|
||||
if cfg.ShutdownTimeout <= 0 {
|
||||
return fmt.Errorf("%s must be positive", shutdownTimeoutEnvVar)
|
||||
}
|
||||
if err := validateSlogLevel(cfg.Logging.Level); err != nil {
|
||||
return fmt.Errorf("%s: %w", logLevelEnvVar, err)
|
||||
}
|
||||
if err := cfg.InternalHTTP.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Docker.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Postgres.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Redis.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Streams.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Container.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Health.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Cleanup.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Coordination.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Lobby.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
if err := cfg.Telemetry.Validate(); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func validateSlogLevel(level string) error {
|
||||
var slogLevel slog.Level
|
||||
if err := slogLevel.UnmarshalText([]byte(strings.TrimSpace(level))); err != nil {
|
||||
return fmt.Errorf("invalid slog level %q: %w", level, err)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func isTCPAddr(value string) bool {
|
||||
host, port, err := net.SplitHostPort(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return false
|
||||
}
|
||||
|
||||
if port == "" {
|
||||
return false
|
||||
}
|
||||
if host == "" {
|
||||
return true
|
||||
}
|
||||
|
||||
return !strings.Contains(host, " ")
|
||||
}
|
||||
|
||||
func isHTTPURL(value string) bool {
|
||||
parsed, err := url.Parse(strings.TrimSpace(value))
|
||||
if err != nil {
|
||||
return false
|
||||
}
|
||||
|
||||
if parsed.Scheme != "http" && parsed.Scheme != "https" {
|
||||
return false
|
||||
}
|
||||
|
||||
return parsed.Host != ""
|
||||
}
|
||||
@@ -0,0 +1,231 @@
|
||||
// Package health defines the technical-health domain types owned by
|
||||
// Runtime Manager.
|
||||
//
|
||||
// EventType matches the `event_type` enum frozen in
|
||||
// `galaxy/rtmanager/api/runtime-health-asyncapi.yaml`. SnapshotStatus
|
||||
// matches the SQL CHECK on `health_snapshots.status` and is intentionally
|
||||
// narrower than EventType (the snapshot table collapses
|
||||
// `container_started → healthy` and drops `probe_recovered` per
|
||||
// `galaxy/rtmanager/README.md §Health Monitoring`).
|
||||
package health
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
// EventType identifies one entry on the `runtime:health_events` Redis
|
||||
// Stream. Used by the health-event publishers and consumers.
|
||||
type EventType string
|
||||
|
||||
const (
|
||||
// EventTypeContainerStarted reports a successful container start.
|
||||
EventTypeContainerStarted EventType = "container_started"
|
||||
|
||||
// EventTypeContainerExited reports a non-zero Docker `die` event.
|
||||
EventTypeContainerExited EventType = "container_exited"
|
||||
|
||||
// EventTypeContainerOOM reports a Docker `oom` event.
|
||||
EventTypeContainerOOM EventType = "container_oom"
|
||||
|
||||
// EventTypeContainerDisappeared reports that the listener observed
|
||||
// a `destroy` event for a record Runtime Manager did not initiate.
|
||||
EventTypeContainerDisappeared EventType = "container_disappeared"
|
||||
|
||||
// EventTypeInspectUnhealthy reports an unexpected outcome of the
|
||||
// periodic Docker inspect (RestartCount growth, unexpected status,
|
||||
// declared HEALTHCHECK reporting unhealthy).
|
||||
EventTypeInspectUnhealthy EventType = "inspect_unhealthy"
|
||||
|
||||
// EventTypeProbeFailed reports that the active HTTP probe crossed
|
||||
// the configured failure threshold.
|
||||
EventTypeProbeFailed EventType = "probe_failed"
|
||||
|
||||
// EventTypeProbeRecovered reports the first probe success after a
|
||||
// `probe_failed` event was published.
|
||||
EventTypeProbeRecovered EventType = "probe_recovered"
|
||||
)
|
||||
|
||||
// IsKnown reports whether eventType belongs to the frozen event-type
|
||||
// vocabulary.
|
||||
func (eventType EventType) IsKnown() bool {
|
||||
switch eventType {
|
||||
case EventTypeContainerStarted,
|
||||
EventTypeContainerExited,
|
||||
EventTypeContainerOOM,
|
||||
EventTypeContainerDisappeared,
|
||||
EventTypeInspectUnhealthy,
|
||||
EventTypeProbeFailed,
|
||||
EventTypeProbeRecovered:
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
// AllEventTypes returns the frozen list of every event-type value.
|
||||
func AllEventTypes() []EventType {
|
||||
return []EventType{
|
||||
EventTypeContainerStarted,
|
||||
EventTypeContainerExited,
|
||||
EventTypeContainerOOM,
|
||||
EventTypeContainerDisappeared,
|
||||
EventTypeInspectUnhealthy,
|
||||
EventTypeProbeFailed,
|
||||
EventTypeProbeRecovered,
|
||||
}
|
||||
}
|
||||
|
||||
// SnapshotStatus identifies one latest-observation status value stored
|
||||
// in the `health_snapshots.status` column. Distinct from EventType: the
|
||||
// table collapses `container_started → healthy` and never persists
|
||||
// `probe_recovered` (it is conveyed only as a `runtime:health_events`
|
||||
// entry with status=healthy in the next observation).
|
||||
type SnapshotStatus string
|
||||
|
||||
const (
|
||||
// SnapshotStatusHealthy reports that the most recent observation
|
||||
// found the container live and the engine probe responsive.
|
||||
SnapshotStatusHealthy SnapshotStatus = "healthy"
|
||||
|
||||
// SnapshotStatusProbeFailed reports that the active probe crossed
|
||||
// the failure threshold.
|
||||
SnapshotStatusProbeFailed SnapshotStatus = "probe_failed"
|
||||
|
||||
// SnapshotStatusExited reports that the container exited.
|
||||
SnapshotStatusExited SnapshotStatus = "exited"
|
||||
|
||||
// SnapshotStatusOOM reports that the container was killed by the
|
||||
// OOM killer.
|
||||
SnapshotStatusOOM SnapshotStatus = "oom"
|
||||
|
||||
// SnapshotStatusInspectUnhealthy reports that the periodic inspect
|
||||
// observed an unexpected state.
|
||||
SnapshotStatusInspectUnhealthy SnapshotStatus = "inspect_unhealthy"
|
||||
|
||||
// SnapshotStatusContainerDisappeared reports that Docker no longer
|
||||
// reports the container.
|
||||
SnapshotStatusContainerDisappeared SnapshotStatus = "container_disappeared"
|
||||
)
|
||||
|
||||
// IsKnown reports whether status belongs to the frozen snapshot-status
|
||||
// vocabulary.
|
||||
func (status SnapshotStatus) IsKnown() bool {
|
||||
switch status {
|
||||
case SnapshotStatusHealthy,
|
||||
SnapshotStatusProbeFailed,
|
||||
SnapshotStatusExited,
|
||||
SnapshotStatusOOM,
|
||||
SnapshotStatusInspectUnhealthy,
|
||||
SnapshotStatusContainerDisappeared:
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
// AllSnapshotStatuses returns the frozen list of every snapshot-status
|
||||
// value.
|
||||
func AllSnapshotStatuses() []SnapshotStatus {
|
||||
return []SnapshotStatus{
|
||||
SnapshotStatusHealthy,
|
||||
SnapshotStatusProbeFailed,
|
||||
SnapshotStatusExited,
|
||||
SnapshotStatusOOM,
|
||||
SnapshotStatusInspectUnhealthy,
|
||||
SnapshotStatusContainerDisappeared,
|
||||
}
|
||||
}
|
||||
|
||||
// SnapshotSource identifies the observation source that produced one
|
||||
// snapshot. Matches the SQL CHECK on `health_snapshots.source`.
|
||||
type SnapshotSource string
|
||||
|
||||
const (
|
||||
// SnapshotSourceDockerEvent reports that the latest observation
|
||||
// arrived through the Docker events listener.
|
||||
SnapshotSourceDockerEvent SnapshotSource = "docker_event"
|
||||
|
||||
// SnapshotSourceInspect reports that the latest observation arrived
|
||||
// through the periodic Docker inspect worker.
|
||||
SnapshotSourceInspect SnapshotSource = "inspect"
|
||||
|
||||
// SnapshotSourceProbe reports that the latest observation arrived
|
||||
// through the active HTTP probe.
|
||||
SnapshotSourceProbe SnapshotSource = "probe"
|
||||
)
|
||||
|
||||
// IsKnown reports whether source belongs to the frozen snapshot-source
|
||||
// vocabulary.
|
||||
func (source SnapshotSource) IsKnown() bool {
|
||||
switch source {
|
||||
case SnapshotSourceDockerEvent,
|
||||
SnapshotSourceInspect,
|
||||
SnapshotSourceProbe:
|
||||
return true
|
||||
default:
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
// AllSnapshotSources returns the frozen list of every snapshot-source
|
||||
// value.
|
||||
func AllSnapshotSources() []SnapshotSource {
|
||||
return []SnapshotSource{
|
||||
SnapshotSourceDockerEvent,
|
||||
SnapshotSourceInspect,
|
||||
SnapshotSourceProbe,
|
||||
}
|
||||
}
|
||||
|
||||
// HealthSnapshot stores the latest technical-health observation for one
|
||||
// game. One row per game_id; later observations overwrite.
|
||||
type HealthSnapshot struct {
|
||||
// GameID identifies the platform game.
|
||||
GameID string
|
||||
|
||||
// ContainerID stores the Docker container id observed by the
|
||||
// snapshot source. Empty when the source could not associate a
|
||||
// container (e.g., reconciler dispose for a record whose container
|
||||
// is already gone).
|
||||
ContainerID string
|
||||
|
||||
// Status stores the latest observed snapshot status.
|
||||
Status SnapshotStatus
|
||||
|
||||
// Source stores the observation source that produced this entry.
|
||||
Source SnapshotSource
|
||||
|
||||
// Details stores the source-specific JSON detail payload. Adapters
|
||||
// store and retrieve it verbatim. Empty / nil values are persisted
|
||||
// as the SQL default `{}`.
|
||||
Details json.RawMessage
|
||||
|
||||
// ObservedAt stores the wall-clock at which the source captured the
|
||||
// observation.
|
||||
ObservedAt time.Time
|
||||
}
|
||||
|
||||
// Validate reports whether snapshot satisfies the snapshot invariants
|
||||
// implied by the SQL CHECK constraints.
|
||||
func (snapshot HealthSnapshot) Validate() error {
|
||||
if strings.TrimSpace(snapshot.GameID) == "" {
|
||||
return fmt.Errorf("game id must not be empty")
|
||||
}
|
||||
if !snapshot.Status.IsKnown() {
|
||||
return fmt.Errorf("status %q is unsupported", snapshot.Status)
|
||||
}
|
||||
if !snapshot.Source.IsKnown() {
|
||||
return fmt.Errorf("source %q is unsupported", snapshot.Source)
|
||||
}
|
||||
if snapshot.ObservedAt.IsZero() {
|
||||
return fmt.Errorf("observed at must not be zero")
|
||||
}
|
||||
if len(snapshot.Details) > 0 && !json.Valid(snapshot.Details) {
|
||||
return fmt.Errorf("details must be valid JSON when non-empty")
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
@@ -0,0 +1,133 @@
|
package health

import (
	"encoding/json"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestEventTypeIsKnown(t *testing.T) {
	for _, eventType := range AllEventTypes() {
		assert.Truef(t, eventType.IsKnown(), "expected %q known", eventType)
	}

	assert.False(t, EventType("").IsKnown())
	assert.False(t, EventType("paused").IsKnown())
}

func TestAllEventTypesCoverFrozenSet(t *testing.T) {
	assert.ElementsMatch(t,
		[]EventType{
			EventTypeContainerStarted,
			EventTypeContainerExited,
			EventTypeContainerOOM,
			EventTypeContainerDisappeared,
			EventTypeInspectUnhealthy,
			EventTypeProbeFailed,
			EventTypeProbeRecovered,
		},
		AllEventTypes(),
	)
}

func TestSnapshotStatusIsKnown(t *testing.T) {
	for _, status := range AllSnapshotStatuses() {
		assert.Truef(t, status.IsKnown(), "expected %q known", status)
	}

	assert.False(t, SnapshotStatus("").IsKnown())
	assert.False(t, SnapshotStatus("starting").IsKnown())
	assert.False(t, SnapshotStatus("probe_recovered").IsKnown(),
		"snapshot status must not include event-only values")
	assert.False(t, SnapshotStatus("container_started").IsKnown(),
		"snapshot status must not include event-only values")
}

func TestAllSnapshotStatusesCoverFrozenSet(t *testing.T) {
	assert.ElementsMatch(t,
		[]SnapshotStatus{
			SnapshotStatusHealthy,
			SnapshotStatusProbeFailed,
			SnapshotStatusExited,
			SnapshotStatusOOM,
			SnapshotStatusInspectUnhealthy,
			SnapshotStatusContainerDisappeared,
		},
		AllSnapshotStatuses(),
	)
}

func TestSnapshotSourceIsKnown(t *testing.T) {
	for _, source := range AllSnapshotSources() {
		assert.Truef(t, source.IsKnown(), "expected %q known", source)
	}

	assert.False(t, SnapshotSource("").IsKnown())
	assert.False(t, SnapshotSource("manual").IsKnown())
}

func TestAllSnapshotSourcesCoverFrozenSet(t *testing.T) {
	assert.ElementsMatch(t,
		[]SnapshotSource{
			SnapshotSourceDockerEvent,
			SnapshotSourceInspect,
			SnapshotSourceProbe,
		},
		AllSnapshotSources(),
	)
}

func sampleSnapshot() HealthSnapshot {
	return HealthSnapshot{
		GameID:      "game-test",
		ContainerID: "container-1",
		Status:      SnapshotStatusHealthy,
		Source:      SnapshotSourceProbe,
		Details:     json.RawMessage(`{"prior_failure_count":0}`),
		ObservedAt:  time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC),
	}
}

func TestHealthSnapshotValidateHappy(t *testing.T) {
	require.NoError(t, sampleSnapshot().Validate())
}

func TestHealthSnapshotValidateAcceptsEmptyDetails(t *testing.T) {
	snapshot := sampleSnapshot()
	snapshot.Details = nil

	assert.NoError(t, snapshot.Validate())
}

func TestHealthSnapshotValidateAcceptsEmptyContainerID(t *testing.T) {
	snapshot := sampleSnapshot()
	snapshot.ContainerID = ""

	assert.NoError(t, snapshot.Validate())
}

func TestHealthSnapshotValidateRejects(t *testing.T) {
	tests := []struct {
		name   string
		mutate func(*HealthSnapshot)
	}{
		{"empty game id", func(s *HealthSnapshot) { s.GameID = "" }},
		{"unknown status", func(s *HealthSnapshot) { s.Status = "exotic" }},
		{"unknown source", func(s *HealthSnapshot) { s.Source = "exotic" }},
		{"zero observed at", func(s *HealthSnapshot) { s.ObservedAt = time.Time{} }},
		{"invalid details json", func(s *HealthSnapshot) {
			s.Details = json.RawMessage("not-json")
		}},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			snapshot := sampleSnapshot()
			tt.mutate(&snapshot)
			assert.Error(t, snapshot.Validate())
		})
	}
}
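The tests above pin down the `health` package's frozen vocabularies (`EventType`, `SnapshotStatus`, `SnapshotSource`) through `IsKnown` and the `All*` helpers; the type definitions themselves live elsewhere in this commit. As a minimal sketch of how a caller might lean on that surface, the snippet below guards an incoming event string before dispatching on it. `handleHealthEvent` is an illustrative helper, not part of the commit, and it assumes `fmt` is imported alongside the package's other imports.

```go
// handleHealthEvent is a hypothetical consumer-side guard: it converts a raw
// string (e.g. from the runtime:health_events stream payload) into an
// EventType and rejects values outside the frozen vocabulary before any
// further processing happens.
func handleHealthEvent(raw string) error {
	eventType := EventType(raw)
	if !eventType.IsKnown() {
		return fmt.Errorf("unknown health event type %q", eventType)
	}
	// ... dispatch on eventType ...
	return nil
}
```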
@@ -0,0 +1,245 @@
// Package operation defines the runtime-operation audit-log domain types
// owned by Runtime Manager.
//
// One OperationEntry maps to one row of the `operation_log` PostgreSQL
// table (see
// `galaxy/rtmanager/internal/adapters/postgres/migrations/00001_init.sql`).
// The OpKind / OpSource / Outcome enums match the SQL CHECK constraints
// verbatim and feed the telemetry counters declared in
// `galaxy/rtmanager/README.md §Observability`.
package operation

import (
	"fmt"
	"strings"
	"time"
)

// OpKind identifies the kind of operation Runtime Manager performed.
type OpKind string

const (
	// OpKindStart records a start lifecycle operation.
	OpKindStart OpKind = "start"

	// OpKindStop records a stop lifecycle operation.
	OpKindStop OpKind = "stop"

	// OpKindRestart records a restart lifecycle operation
	// (recreate with the same image_ref).
	OpKindRestart OpKind = "restart"

	// OpKindPatch records a semver-patch lifecycle operation
	// (recreate with a new image_ref).
	OpKindPatch OpKind = "patch"

	// OpKindCleanupContainer records a container removal performed by
	// the cleanup TTL worker or the admin DELETE endpoint.
	OpKindCleanupContainer OpKind = "cleanup_container"

	// OpKindReconcileAdopt records that the reconciler discovered an
	// unrecorded container labelled `com.galaxy.owner=rtmanager` and
	// inserted a runtime record for it.
	OpKindReconcileAdopt OpKind = "reconcile_adopt"

	// OpKindReconcileDispose records that the reconciler observed a
	// running record whose container is missing in Docker and marked it
	// as removed.
	OpKindReconcileDispose OpKind = "reconcile_dispose"
)

// IsKnown reports whether kind belongs to the frozen op-kind vocabulary.
func (kind OpKind) IsKnown() bool {
	switch kind {
	case OpKindStart,
		OpKindStop,
		OpKindRestart,
		OpKindPatch,
		OpKindCleanupContainer,
		OpKindReconcileAdopt,
		OpKindReconcileDispose:
		return true
	default:
		return false
	}
}

// AllOpKinds returns the frozen list of every op-kind value. The slice
// order is stable across calls.
func AllOpKinds() []OpKind {
	return []OpKind{
		OpKindStart,
		OpKindStop,
		OpKindRestart,
		OpKindPatch,
		OpKindCleanupContainer,
		OpKindReconcileAdopt,
		OpKindReconcileDispose,
	}
}

// OpSource identifies where one operation entered Runtime Manager.
type OpSource string

const (
	// OpSourceLobbyStream identifies entries triggered by the
	// `runtime:start_jobs` or `runtime:stop_jobs` Redis Stream consumer.
	OpSourceLobbyStream OpSource = "lobby_stream"

	// OpSourceGMRest identifies entries triggered by Game Master through
	// the internal REST surface.
	OpSourceGMRest OpSource = "gm_rest"

	// OpSourceAdminRest identifies entries triggered by Admin Service
	// through the internal REST surface.
	OpSourceAdminRest OpSource = "admin_rest"

	// OpSourceAutoTTL identifies entries triggered by the periodic
	// container-cleanup worker.
	OpSourceAutoTTL OpSource = "auto_ttl"

	// OpSourceAutoReconcile identifies entries triggered by the
	// reconciler at startup or on its periodic interval.
	OpSourceAutoReconcile OpSource = "auto_reconcile"
)

// IsKnown reports whether source belongs to the frozen op-source
// vocabulary.
func (source OpSource) IsKnown() bool {
	switch source {
	case OpSourceLobbyStream,
		OpSourceGMRest,
		OpSourceAdminRest,
		OpSourceAutoTTL,
		OpSourceAutoReconcile:
		return true
	default:
		return false
	}
}

// AllOpSources returns the frozen list of every op-source value. The
// slice order is stable across calls.
func AllOpSources() []OpSource {
	return []OpSource{
		OpSourceLobbyStream,
		OpSourceGMRest,
		OpSourceAdminRest,
		OpSourceAutoTTL,
		OpSourceAutoReconcile,
	}
}

// Outcome reports the high-level outcome of one operation.
type Outcome string

const (
	// OutcomeSuccess reports that the operation completed without
	// surfacing an error.
	OutcomeSuccess Outcome = "success"

	// OutcomeFailure reports that the operation surfaced a stable error
	// code recorded in OperationEntry.ErrorCode.
	OutcomeFailure Outcome = "failure"
)

// IsKnown reports whether outcome belongs to the frozen outcome
// vocabulary.
func (outcome Outcome) IsKnown() bool {
	switch outcome {
	case OutcomeSuccess, OutcomeFailure:
		return true
	default:
		return false
	}
}

// AllOutcomes returns the frozen list of every outcome value.
func AllOutcomes() []Outcome {
	return []Outcome{OutcomeSuccess, OutcomeFailure}
}

// OperationEntry stores one append-only audit row of the `operation_log`
// table. ID is zero on records that have not been persisted yet; the
// store assigns it from the table's bigserial column. FinishedAt is a
// pointer because the column is nullable for in-flight rows even though
// the lifecycle services finalise the row in the same transaction.
type OperationEntry struct {
	// ID identifies the persisted row. Zero before persistence.
	ID int64

	// GameID identifies the platform game this operation acted on.
	GameID string

	// OpKind classifies what the operation did.
	OpKind OpKind

	// OpSource classifies how the operation entered Runtime Manager.
	OpSource OpSource

	// SourceRef stores an opaque per-source reference such as a Redis
	// Stream entry id, a REST request id, or an admin user id. Empty
	// when the source does not provide one.
	SourceRef string

	// ImageRef stores the engine image reference associated with the
	// operation, when applicable. Empty for operations that do not
	// touch an image (e.g., cleanup_container).
	ImageRef string

	// ContainerID stores the Docker container id observed at the time
	// of the operation, when applicable.
	ContainerID string

	// Outcome reports whether the operation succeeded or failed.
	Outcome Outcome

	// ErrorCode stores the stable error code on failure. Empty on
	// success.
	ErrorCode string

	// ErrorMessage stores the operator-readable detail on failure.
	// Empty on success.
	ErrorMessage string

	// StartedAt stores the wall-clock at which the operation began.
	StartedAt time.Time

	// FinishedAt stores the wall-clock at which the operation
	// finalised. Nil for in-flight rows.
	FinishedAt *time.Time
}

// Validate reports whether entry satisfies the operation-log invariants
// implied by the SQL CHECK constraints and the README §Persistence
// Layout.
func (entry OperationEntry) Validate() error {
	if strings.TrimSpace(entry.GameID) == "" {
		return fmt.Errorf("game id must not be empty")
	}
	if !entry.OpKind.IsKnown() {
		return fmt.Errorf("op kind %q is unsupported", entry.OpKind)
	}
	if !entry.OpSource.IsKnown() {
		return fmt.Errorf("op source %q is unsupported", entry.OpSource)
	}
	if !entry.Outcome.IsKnown() {
		return fmt.Errorf("outcome %q is unsupported", entry.Outcome)
	}
	if entry.StartedAt.IsZero() {
		return fmt.Errorf("started at must not be zero")
	}
	if entry.FinishedAt != nil {
		if entry.FinishedAt.IsZero() {
			return fmt.Errorf("finished at must not be zero when present")
		}
		if entry.FinishedAt.Before(entry.StartedAt) {
			return fmt.Errorf("finished at must not be before started at")
		}
	}
	if entry.Outcome == OutcomeFailure && strings.TrimSpace(entry.ErrorCode) == "" {
		return fmt.Errorf("error code must not be empty for failure entries")
	}

	return nil
}
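For orientation, the sketch below shows how a lifecycle service might assemble and validate one audit row for a successful stream-triggered start before handing it to the store. `recordStart` is an illustrative helper, not part of the commit; it assumes it sits in the `operation` package and reuses the `fmt` and `time` imports above.

```go
// recordStart is a hypothetical example: it builds an OperationEntry for a
// start operation that entered via the lobby stream and checks the
// operation-log invariants with Validate before persistence.
func recordStart(gameID, imageRef, containerID, streamEntryID string, started, finished time.Time) (OperationEntry, error) {
	entry := OperationEntry{
		GameID:      gameID,
		OpKind:      OpKindStart,
		OpSource:    OpSourceLobbyStream,
		SourceRef:   streamEntryID, // Redis Stream entry id, per the SourceRef doc comment
		ImageRef:    imageRef,
		ContainerID: containerID,
		Outcome:     OutcomeSuccess,
		StartedAt:   started,
		FinishedAt:  &finished,
	}
	if err := entry.Validate(); err != nil {
		return OperationEntry{}, fmt.Errorf("operation entry invalid: %w", err)
	}
	return entry, nil
}
```

A failure row would differ only in `Outcome: OutcomeFailure` plus a non-empty `ErrorCode`, which Validate enforces.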
Some files were not shown because too many files have changed in this diff.