# Service-Local Integration Suite

This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.

The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.

## 1. Build tag `integration`

The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:

```sh
make -C rtmanager integration
```

The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.

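The skip guard described above might look roughly like the sketch below. It is not the harness's actual code: `requireDocker`, `dockerReachable`, the `skipper` interface, and the socket-dial probe are all assumptions made for illustration. The interface is the subset of `*testing.T` the guard needs, so the same function can be called from a real test; the `recorder` type only exists to make the sketch runnable outside `go test`.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// skipper is the subset of *testing.T the guard needs (hypothetical name).
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// dockerReachable reports whether something accepts connections on the
// given Unix socket path. Any dial error is treated as "no daemon".
func dockerReachable(socketPath string) bool {
	conn, err := net.DialTimeout("unix", socketPath, 500*time.Millisecond)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// requireDocker skips the calling test when the daemon is unreachable.
func requireDocker(t skipper, socketPath string) {
	t.Helper()
	if !dockerReachable(socketPath) {
		t.Skipf("docker daemon not reachable at %s", socketPath)
	}
}

// recorder is a stand-in for *testing.T that only records the skip message.
// Unlike the real Skipf, it does not stop the caller.
type recorder struct{ skipped string }

func (r *recorder) Helper() {}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = fmt.Sprintf(format, args...)
}

func main() {
	r := &recorder{}
	requireDocker(r, "/nonexistent/docker.sock")
	fmt.Println(r.skipped)
}
```

Because the guard lives inside the helper rather than behind a build tag, the package still compiles and vets without Docker; only the code path that would touch the daemon is skipped.
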
## 2. Smoke test runs in the default `go test` pass

[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and falls back on
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.

## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess

The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:

- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
  the runtime context and call `runtime.Close()`; the goroutine
  driving `runtime.Run` returns with `context.Canceled` and the
  helper waits on it via the `runDone` channel. With a subprocess the
  equivalent dance requires SIGTERM, output capture, and graceful
  shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
  directly through the harness-owned pool and read every Redis stream
  through the harness-owned client. Both observe the exact wire shape
  Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
  default test output stays focused on assertions; flipping
  `EnvOptions.LogToStderr` lights up the runtime's structured logs
  for local debugging without requiring any subprocess plumbing.

The cross-service inter-process suite at `integration/lobbyrtm/`
re-uses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.

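The deterministic-cleanup bullet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's code: `fakeRuntime` stands in for whatever `app.NewRuntime` returns, its `Run`/`Close` shape is assumed from the prose, and `startInProcess` is a hypothetical helper name. Only the cancel, close, wait-on-`runDone` ordering is taken from the text.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeRuntime stands in for the value returned by app.NewRuntime.
// Its method set is an assumption made for this sketch.
type fakeRuntime struct{ closed bool }

// Run blocks until ctx is cancelled, then returns the context error.
func (r *fakeRuntime) Run(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// Close releases resources; here it only records that it was called.
func (r *fakeRuntime) Close() error {
	r.closed = true
	return nil
}

// startInProcess launches Run on its own goroutine and returns a stop
// function suitable for registering with t.Cleanup.
func startInProcess(rt *fakeRuntime) (stop func() error) {
	ctx, cancel := context.WithCancel(context.Background())
	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()

	return func() error {
		cancel()               // ask Run to return
		closeErr := rt.Close() // release runtime resources
		runErr := <-runDone    // wait until the Run goroutine has exited
		if !errors.Is(runErr, context.Canceled) {
			return fmt.Errorf("unexpected run result: %v", runErr)
		}
		return closeErr
	}
}

func main() {
	rt := &fakeRuntime{}
	stop := startInProcess(rt)
	fmt.Println(stop(), rt.closed)
}
```

In a real test the stop function would be registered once, for example `t.Cleanup(func() { _ = stop() })`, so no signal handling or child-process bookkeeping is involved.
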
## 4. `httptest.Server` stub for the Lobby internal client

Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.

## 5. One built engine image, two semver-compatible tags

The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies: the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.

The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.

## 6. Per-test Docker network and per-test state root

`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix). Combined with
the per-test network and state root, this makes the suite safe to
run with `-count` greater than one.

`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.

## 7. PostgreSQL and Redis containers shared per-package

Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~1–2 s
difference matters in CI.

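The start-once, reset-per-test pattern can be sketched without any container library. Everything below is illustrative: `fixture`, `sharedPostgres`, and `truncate` are hypothetical stand-ins for the real testcontainer handle and `TruncatePostgres`, and an in-memory map plays the role of table contents.

```go
package main

import (
	"fmt"
	"sync"
)

// fixture stands in for a started container plus its connection details.
type fixture struct {
	dsn  string
	rows map[string]int // stand-in for table contents
}

var (
	pgOnce   sync.Once
	pgShared *fixture
	pgStarts int // counts how often the expensive start path ran
)

// sharedPostgres starts the (fake) container on first use and returns the
// same instance on every later call within the package.
func sharedPostgres() *fixture {
	pgOnce.Do(func() {
		pgStarts++
		pgShared = &fixture{dsn: "postgres://stub", rows: map[string]int{}}
	})
	return pgShared
}

// truncate resets state so the next test starts on an empty store.
func (f *fixture) truncate() {
	f.rows = map[string]int{}
}

func main() {
	for _, name := range []string{"TestA", "TestB"} {
		pg := sharedPostgres()
		pg.truncate()
		pg.rows[name]++
		fmt.Println(name, "rows:", len(pg.rows), "starts:", pgStarts)
	}
}
```

Each simulated test sees exactly one row (its own) while the start counter stays at one, which is the isolation-versus-startup-cost trade described above.
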
## 8. Engine image cache is intentionally retained between runs

`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.

## 9. Scenario coverage

The suite covers the four end-to-end flows operators care about:

- **lifecycle** (`lifecycle_test.go`): start → inspect → stop →
  restart → patch → stop → cleanup. The intermediate `stop` between
  `patch` and `cleanup` is intentional: the cleanup endpoint refuses
  to remove a running container per
  [`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`): duplicate start / stop entries
  surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`): external `docker rm` produces
  `container_disappeared`; manual `docker run` is adopted by the
  reconciler.
- **notification** (`notification_test.go`): unresolvable `image_ref`
  produces `runtime.image_pull_failed` plus a `failure` job_result.

## 10. Service-local scope only

This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed) cross-
service streams.