# Service-Local Integration Suite
This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.
The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.
## 1. Build tag `integration`
The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:
```sh
make -C rtmanager integration
```
The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.
## 2. Smoke test runs in the default `go test` pass
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and skips itself via
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
the runtime context and call `runtime.Close()`; the goroutine
driving `runtime.Run` returns with `context.Canceled` and the
helper waits on it via the `runDone` channel. With a subprocess the
equivalent dance requires SIGTERM, output capture, and graceful
shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
directly through the harness-owned pool and read every Redis stream
through the harness-owned client. Both observe the exact wire shape
Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
default test output stays focused on assertions; flipping
`EnvOptions.LogToStderr` lights up the runtime's structured logs
for local debugging without requiring any subprocess plumbing.
The cross-service inter-process suite at `integration/lobbyrtm/`
reuses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
## 4. `httptest.Server` stub for the Lobby internal client
Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
## 5. One built engine image, two semver-compatible tags
The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies — the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
## 6. Per-test Docker network and per-test state root
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.
`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
## 7. PostgreSQL and Redis containers shared per-package
Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~12 s
difference matters in CI.
## 8. Engine image cache is intentionally retained between runs
`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.
## 9. Scenario coverage
The suite covers the four end-to-end flows operators care about:
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
restart → patch → stop → cleanup. The intermediate `stop` between
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
to remove a running container per
[`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`) — duplicate start / stop entries
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`) — external `docker rm` produces
`container_disappeared`; manual `docker run` is adopted by the
reconciler.
- **notification** (`notification_test.go`) — unresolvable `image_ref`
produces `runtime.image_pull_failed` plus a `failure` job_result.
## 10. Service-local scope only
This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.