feat: runtime manager

Ilia Denisov (committed via GitHub), 2026-04-28 20:39:18 +02:00
parent e0a99b346b · commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
# Service-Local Integration Suite
This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.
The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.
## 1. Build tag `integration`
The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:
```sh
make -C rtmanager integration
```
The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.
## 2. Smoke test runs in the default `go test` pass
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and skips itself via
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.
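The guard amounts to a runtime probe along these lines (a sketch only; the real `skipUnlessDockerAvailable` signature, timeout, and client options may differ):
```go
package docker

import (
	"context"
	"testing"
	"time"

	"github.com/docker/docker/client"
)

// skipUnlessDockerAvailable keeps smoke_test.go in the default pass:
// with a reachable daemon the test proceeds, otherwise it skips
// instead of failing.
func skipUnlessDockerAvailable(t *testing.T) *client.Client {
	t.Helper()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		t.Skipf("docker client unavailable: %v", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if _, err := cli.Ping(ctx); err != nil {
		t.Skipf("no reachable docker daemon: %v", err)
	}
	return cli
}
```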
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
the runtime context and call `runtime.Close()`; the goroutine
driving `runtime.Run` returns with `context.Canceled` and the
helper waits on it via the `runDone` channel. With a subprocess the
equivalent dance requires SIGTERM, output capture, and graceful
shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
directly through the harness-owned pool and read every Redis stream
through the harness-owned client. Both observe the exact wire shape
Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
default test output stays focused on assertions; flipping
`EnvOptions.LogToStderr` lights up the runtime's structured logs
for local debugging without requiring any subprocess plumbing.
The cross-service, inter-process suite at `integration/lobbyrtm/`
reuses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
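In-process, the cleanup dance from the first bullet reduces to a few lines. A sketch, assuming the `app.NewRuntime` / `Run` / `Close` shapes named above; the import path, helper name, config type, and discard-logger construction are illustrative:
```go
package harness

import (
	"context"
	"errors"
	"io"
	"log/slog"
	"testing"

	"galaxy/rtmanager/internal/app" // illustrative import path
)

// StartRuntime boots Runtime Manager in-process and wires deterministic
// shutdown into t.Cleanup: no SIGTERM, no child signal handler.
func StartRuntime(t *testing.T, cfg app.Config) *app.Runtime {
	t.Helper()
	ctx, cancel := context.WithCancel(context.Background())
	logger := slog.New(slog.NewTextHandler(io.Discard, nil)) // harness default: discard

	rt, err := app.NewRuntime(ctx, cfg, logger)
	if err != nil {
		cancel()
		t.Fatalf("NewRuntime: %v", err)
	}

	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()

	t.Cleanup(func() {
		cancel()
		if err := <-runDone; err != nil && !errors.Is(err, context.Canceled) {
			t.Errorf("runtime exited with unexpected error: %v", err)
		}
		_ = rt.Close()
	})
	return rt
}
```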
## 4. `httptest.Server` stub for the Lobby internal client
Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
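The whole stub fits in a helper like this (a sketch; `newLobbyStub` is an illustrative name, and the empty JSON body is an assumption, since only the `200 OK` matters to the start service):
```go
package harness

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// newLobbyStub returns a server whose base URL feeds
// RTMANAGER_LOBBY_INTERNAL_BASE_URL. Every request gets the same 200
// response, because the start service ignores everything but the status.
func newLobbyStub(t *testing.T) *httptest.Server {
	t.Helper()
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{}`))
	}))
	t.Cleanup(srv.Close)
	return srv
}
```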
## 5. One built engine image, two semver-compatible tags
The patch lifecycle expects the new and current image refs to share
the same major/minor version (otherwise validation fails with
`semver_patch_only`). Building two distinct images would multiply the per-run
build cost without changing what the test verifies — the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
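Aliasing is metadata-only: `client.ImageTag` (the Docker SDK call) points both refs at the image built once, so they share a digest by construction. A sketch; `builtRef` and the surrounding function are illustrative:
```go
package harness

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

const (
	tagCurrent = "galaxy/game:1.0.0-rtm-it"
	tagPatched = "galaxy/game:1.0.1-rtm-it"
)

// tagEngineImage points both semver-compatible tags at the single built
// image; no second `docker build` ever runs.
func tagEngineImage(ctx context.Context, cli *client.Client, builtRef string) error {
	for _, tag := range []string{tagCurrent, tagPatched} {
		if err := cli.ImageTag(ctx, builtRef, tag); err != nil {
			return fmt.Errorf("tagging %s: %w", tag, err)
		}
	}
	return nil
}
```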
The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
## 6. Per-test Docker network and per-test state root
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.
`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
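`EnsureNetwork` boils down to a create-plus-cleanup pair. A sketch: the client parameter, naming scheme, and options shown here are illustrative (the options struct matches recent Docker SDK versions):
```go
package harness

import (
	"context"
	"fmt"
	"strings"
	"testing"
	"time"

	"github.com/docker/docker/api/types/network"
	"github.com/docker/docker/client"
)

// EnsureNetwork creates a uniquely named bridge network and removes it
// when the test finishes, so back-to-back scenarios never collide on
// per-game DNS hostnames.
func EnsureNetwork(t *testing.T, cli *client.Client) string {
	t.Helper()
	// Docker network names reject "/" from subtest names; the nanosecond
	// suffix mirrors the harness.IDFromTestName pattern.
	name := fmt.Sprintf("rtm-it-%s-%d",
		strings.ReplaceAll(t.Name(), "/", "-"), time.Now().UnixNano())
	resp, err := cli.NetworkCreate(context.Background(), name,
		network.CreateOptions{Driver: "bridge"})
	if err != nil {
		t.Fatalf("network create: %v", err)
	}
	t.Cleanup(func() { _ = cli.NetworkRemove(context.Background(), resp.ID) })
	return name
}
```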
## 7. PostgreSQL and Redis containers shared per-package
Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while truncate/flush preserves isolation. The ~12 s
difference matters in CI.
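The pattern is a package-level `sync.Once` plus an explicit reset helper. A sketch for the PG side, assuming a pgx pool; the testcontainer boot is elided behind a hypothetical `RTM_IT_PG_DSN` env var to keep the example self-contained:
```go
package harness

import (
	"context"
	"os"
	"strings"
	"sync"
	"testing"

	"github.com/jackc/pgx/v5/pgxpool"
)

var (
	pgOnce sync.Once
	pgPool *pgxpool.Pool
	pgErr  error
)

// Postgres hands every test in the package the same pool; isolation comes
// from TruncatePostgres between scenarios, not from fresh containers.
func Postgres(t *testing.T) *pgxpool.Pool {
	t.Helper()
	pgOnce.Do(func() {
		// The real fixture boots a testcontainer here; the env DSN keeps
		// this sketch self-contained.
		pgPool, pgErr = pgxpool.New(context.Background(), os.Getenv("RTM_IT_PG_DSN"))
	})
	if pgErr != nil {
		t.Fatalf("postgres fixture: %v", pgErr)
	}
	return pgPool
}

// TruncatePostgres resets durable state so the next scenario starts empty.
func TruncatePostgres(t *testing.T, pool *pgxpool.Pool, tables ...string) {
	t.Helper()
	stmt := "TRUNCATE TABLE " + strings.Join(tables, ", ") + " CASCADE"
	if _, err := pool.Exec(context.Background(), stmt); err != nil {
		t.Fatalf("truncate: %v", err)
	}
}
```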
## 8. Engine image cache is intentionally retained between runs
`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.
## 9. Scenario coverage
The suite covers the four end-to-end flows operators care about:
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
restart → patch → stop → cleanup. The intermediate `stop` between
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
to remove a running container per
[`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`) — duplicate start / stop entries
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`) — external `docker rm` produces
`container_disappeared`; manual `docker run` is adopted by the
reconciler.
- **notification** (`notification_test.go`) — unresolvable `image_ref`
produces `runtime.image_pull_failed` plus a `failure` job_result.
## 10. Service-local scope only
This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.