feat: runtime manager
# Service-Local Integration Suite

This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.

The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.

## 1. Build tag `integration`

The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:

```sh
make -C rtmanager integration
```

The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.
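The skip guard described above could look roughly like the following. This is a minimal sketch, not the harness's actual code: `dockerAvailable`, `requireDocker`, and the default socket path are hypothetical names and assumptions introduced for illustration, and a real helper might consult `DOCKER_HOST` or ping the daemon through the Docker client instead.

```go
package main

import (
	"fmt"
	"net"
	"testing"
	"time"
)

// defaultDockerSocket is the conventional Unix socket path. This is an
// assumption for the sketch; a real helper might honour DOCKER_HOST.
const defaultDockerSocket = "/var/run/docker.sock"

// dockerAvailable reports whether something accepts connections on the
// given Unix socket within a short timeout.
func dockerAvailable(socketPath string) bool {
	conn, err := net.DialTimeout("unix", socketPath, 500*time.Millisecond)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// requireDocker is a hypothetical harness guard: it skips the calling
// test, rather than failing it, when the daemon cannot be reached.
func requireDocker(t testing.TB, socketPath string) {
	t.Helper()
	if !dockerAvailable(socketPath) {
		t.Skipf("docker daemon not reachable at %s", socketPath)
	}
}

func main() {
	fmt.Println("docker available:", dockerAvailable(defaultDockerSocket))
}
```

Because the guard only skips at run time, a package containing it still compiles and vets without any build tag.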

## 2. Smoke test runs in the default `go test` pass

[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and falls back on
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.

## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess

The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:

- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
  the runtime context and call `runtime.Close()`; the goroutine
  driving `runtime.Run` returns with `context.Canceled` and the
  helper waits on it via the `runDone` channel. With a subprocess the
  equivalent dance requires SIGTERM, output capture, and graceful
  shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
  directly through the harness-owned pool and read every Redis stream
  through the harness-owned client. Both observe the exact wire shape
  Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
  default test output stays focused on assertions; flipping
  `EnvOptions.LogToStderr` lights up the runtime's structured logs
  for local debugging without requiring any subprocess plumbing.

The cross-service inter-process suite at `integration/lobbyrtm/`
re-uses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
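The deterministic-cleanup bullet in this section can be illustrated with a small sketch. It assumes a hypothetical `runner` interface exposing only the `Run` and `Close` methods named in the text, and uses a fake implementation in place of the real runtime; the actual type returned by `app.NewRuntime` and its error-handling rules may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// runner is a hypothetical stand-in for the value app.NewRuntime returns;
// only the two methods named in the text are modelled.
type runner interface {
	Run(ctx context.Context) error
	Close() error
}

// fakeRuntime blocks in Run until its context is cancelled.
type fakeRuntime struct{}

func (fakeRuntime) Run(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func (fakeRuntime) Close() error { return nil }

// start launches rt.Run on its own goroutine and returns a stop function
// that cancels the context, closes the runtime, and waits for Run to return.
func start(parent context.Context, rt runner) (stop func() error) {
	ctx, cancel := context.WithCancel(parent)
	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()

	return func() error {
		cancel()
		closeErr := rt.Close()
		runErr := <-runDone // Run has fully returned once this receive completes.
		if closeErr != nil {
			return closeErr
		}
		return runErr
	}
}

// isCleanShutdown treats context.Canceled as the expected outcome.
func isCleanShutdown(err error) bool { return errors.Is(err, context.Canceled) }

// demo runs one start/stop cycle against the fake runtime.
func demo() error { return start(context.Background(), fakeRuntime{})() }

func main() {
	fmt.Println("clean shutdown:", isCleanShutdown(demo()))
}
```

In a test, the returned `stop` function would typically be registered with `t.Cleanup`, so teardown needs no signals or output capture.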

## 4. `httptest.Server` stub for the Lobby internal client

Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
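A stub of this kind needs only a few lines of standard-library code. The sketch below is illustrative: `newLobbyStub`, `probe`, and `stubStatus` are hypothetical names, and the empty JSON body is an assumption, since the text specifies only the `200 OK` status.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newLobbyStub returns a server that answers every request with 200 OK.
// The JSON body is an assumption; the text only specifies the status.
func newLobbyStub() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{}`))
	}))
}

// probe issues a GET and returns the response status code.
func probe(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// stubStatus starts a stub, fetches a game through it, and shuts it down.
func stubStatus(gameID string) (int, error) {
	srv := newLobbyStub()
	defer srv.Close()
	return probe(srv.URL + "/api/v1/internal/games/" + gameID)
}

func main() {
	code, err := stubStatus("game-1")
	fmt.Println(code, err)
}
```

In a harness, `srv.URL` would presumably be the value supplied for `RTMANAGER_LOBBY_INTERNAL_BASE_URL`, with `srv.Close` registered as test cleanup.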

## 5. One built engine image, two semver-compatible tags

The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies: the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.

The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
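To make the two validation outcomes concrete, here is a rough sketch of a patch-only check over image tags. It is not Runtime Manager's implementation: `parseTag` and `checkPatch` are hypothetical, the parsing is deliberately simplified (it ignores everything after the first `-` or `+` in the tag), and only the two failure codes come from the text.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Sentinel errors named after the failure codes in the text.
var (
	errNotSemver = errors.New("image_ref_not_semver")
	errPatchOnly = errors.New("semver_patch_only")
)

// parseTag extracts major, minor, patch from the tag of an image ref such
// as "galaxy/game:1.0.0-rtm-it". Pre-release and build suffixes are ignored.
func parseTag(ref string) ([3]uint64, error) {
	var v [3]uint64
	i := strings.LastIndex(ref, ":")
	if i < 0 {
		return v, errNotSemver
	}
	core := ref[i+1:]
	if j := strings.IndexAny(core, "-+"); j >= 0 {
		core = core[:j]
	}
	parts := strings.Split(core, ".")
	if len(parts) != 3 {
		return v, errNotSemver
	}
	for k, p := range parts {
		n, err := strconv.ParseUint(p, 10, 32)
		if err != nil {
			return v, errNotSemver
		}
		v[k] = n
	}
	return v, nil
}

// checkPatch accepts the transition only when major and minor match.
func checkPatch(current, next string) error {
	cur, err := parseTag(current)
	if err != nil {
		return err
	}
	nxt, err := parseTag(next)
	if err != nil {
		return err
	}
	if cur[0] != nxt[0] || cur[1] != nxt[1] {
		return errPatchOnly
	}
	return nil
}

func main() {
	fmt.Println(checkPatch("galaxy/game:1.0.0-rtm-it", "galaxy/game:1.0.1-rtm-it"))
	fmt.Println(checkPatch("galaxy/game:1.0.0-rtm-it", "galaxy/game:1.1.0-rtm-it"))
	fmt.Println(checkPatch("galaxy/game:1.0.0-rtm-it", "galaxy/game:latest"))
}
```

The two integration tags pass this check because they differ only in the patch component, which is all the patch scenario needs.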

## 6. Per-test Docker network and per-test state root

`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix); combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.

`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
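The id scheme can be sketched as follows. This is a guess at the shape rather than the real `harness.IDFromTestName`: the slug rules, the `idFromTestName` signature (which takes the timestamp as a parameter so the result is reproducible), and `gameHostname` are all assumptions made for illustration. Only the nanosecond suffix and the `galaxy-game-{game_id}` hostname pattern come from the text.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// idFromTestName lowercases the test name, replaces every character
// outside [a-z0-9] with '-', and appends a nanosecond timestamp.
// The exact slug rules are an assumption of this sketch.
func idFromTestName(name string, nanos int64) string {
	slug := strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			return r
		}
		return '-'
	}, strings.ToLower(name))
	return fmt.Sprintf("%s-%d", slug, nanos)
}

// gameHostname applies the per-game DNS pattern from the text.
func gameHostname(gameID string) string { return "galaxy-game-" + gameID }

func main() {
	id := idFromTestName("TestLifecycle/start_stop", time.Now().UnixNano())
	fmt.Println(gameHostname(id))
}
```

With the timestamp in the id, repeated runs of the same test under `-count` produce different ids and therefore different hostnames.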

## 7. PostgreSQL and Redis containers shared per-package

Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~1–2 s
difference matters in CI.
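The once-per-package pattern can be sketched with `sync.Once` alone. The `fixture` type and the placeholder start function below are hypothetical; in the real harness the start function would launch a testcontainer and the cached value would be a pool or client rather than a string.

```go
package main

import (
	"fmt"
	"sync"
)

// fixture caches the result of one expensive start-up call.
type fixture struct {
	once sync.Once
	addr string
	err  error
}

// get runs start on the first call only; later calls return the cached
// result, including a cached error if start-up failed.
func (f *fixture) get(start func() (string, error)) (string, error) {
	f.once.Do(func() { f.addr, f.err = start() })
	return f.addr, f.err
}

// countStarts calls get n times and reports how often start actually ran.
func countStarts(n int) int {
	var f fixture
	starts := 0
	for i := 0; i < n; i++ {
		_, _ = f.get(func() (string, error) {
			starts++
			// Placeholder standing in for a testcontainer start-up.
			return "postgres://placeholder", nil
		})
	}
	return starts
}

func main() {
	fmt.Println("container starts:", countStarts(5))
}
```

Because the container outlives individual tests, isolation has to come from the per-test reset step rather than from the fixture itself.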

## 8. Engine image cache is intentionally retained between runs

`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.

## 9. Scenario coverage

The suite covers the four end-to-end flows operators care about:

- **lifecycle** (`lifecycle_test.go`): start → inspect → stop →
  restart → patch → stop → cleanup. The intermediate `stop` between
  `patch` and `cleanup` is intentional: the cleanup endpoint refuses
  to remove a running container per
  [`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`): duplicate start / stop entries
  surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`): external `docker rm` produces
  `container_disappeared`; manual `docker run` is adopted by the
  reconciler.
- **notification** (`notification_test.go`): unresolvable `image_ref`
  produces `runtime.image_pull_failed` plus a `failure` job_result.

## 10. Service-local scope only

This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.