galaxy-game/integration
Ilia Denisov a338ebf058
fix(integration): scope preclean to galaxy.stack=integration
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets
  `BACKEND_STACK_LABEL=integration` on every backend-under-test, so
  the engines spawned by integration carry
  `galaxy.stack=integration` in addition to `galaxy.backend=1`. The
  backend support for this env was added in the previous CI tidy-up
  PR (#13).

- `integration/scripts/preclean.sh` gains a multi-label AND filter
  helper and uses it to scope engine cleanup to the combination
  `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
  local-dev engines carry different `galaxy.stack` values, so the
  AND match leaves them alone.

- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
  the AND-scoping rule and the new integration backend stamp.

- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
  gets an "Update" section recording the root cause and the fix; the
  status is downgraded to "partially fixed" because the solo
  `workflow_dispatch` reproduction (which does NOT trigger
  integration) remains unexplained.

- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
  `docker restart galaxy-dev-backend` failure caused by the
  runner-workspace bind-mount that surfaced while diagnosing this
  issue. Workaround: `make -C tools/dev-deploy up` from the
  persistent checkout. Real fix is a follow-up (bake fixture into
  image or copy to named volume).

Verification:

- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
  `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
  returns nothing while the dev-deploy engine
  `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:37:55 +02:00

integration

End-to-end test suite for the Galaxy platform. The suite drives gateway from outside and verifies behaviour at the public boundary while backend and galaxy/game run as Docker containers managed by the test process via testcontainers-go.

For cross-cutting testing principles (unit vs integration boundaries, why testcontainers tests pin no-op observability providers, why infrastructure failures in this suite fail loudly instead of skipping) see docs/TESTING.md. This README focuses on the integration-specific runbook: prerequisites, entry points, labels, and per-test fixtures.

Prerequisites

  • A reachable Docker daemon (DOCKER_HOST or the local socket).
  • Go toolchain matching the workspace go.work directive.
  • Network access for the first run (postgres:16-alpine, axllent/mailpit, redis:7-alpine images are pulled). Subsequent runs reuse the local image cache.

Run

The recommended entry points are the Makefile targets:

make -C integration preclean          # idempotent leftover cleanup
make -C integration integration       # preclean + serial test run
make -C integration integration-step  # preclean + one-test-at-a-time

preclean removes stale containers and locally-built images from earlier runs; it never touches testcontainers-pulled service images (postgres:16-alpine, axllent/mailpit, redis:7-alpine, testcontainers/ryuk), so the cache stays warm. The cleanup keys off labels:

  • org.testcontainers=true — every container/network created by testcontainers-go (our backend/gateway/game and the postgres / redis / mailpit / ryuk service containers).
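  • galaxy.backend=1 combined with galaxy.stack=integration — engine instances spawned by the backend-under-test's runtime adapter directly on the host Docker daemon (see backend/internal/dockerclient/types.go). preclean requires both labels, so engines belonging to other stacks on the same daemon (dev-deploy, local dev) are left alone.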
  • galaxy.test.kind=integration-image — local builds of galaxy/{backend,gateway,game}:integration produced by testenv/images.go.

integration runs every test in the module sequentially (-p=1 -parallel=1), which is the recommended default on a slow or shared Docker daemon. integration-step runs the tests one at a time with a fresh preclean before each test and stops on the first failure; it is useful for isolating a flake or for building up to a full pass without losing context to subsequent tests.

Direct go test ./integration/... still works but does not pre-clean or serialise the suite; use it only against a hand-cleaned Docker daemon.

The suite builds three Docker images on demand from the workspace sources:

  • galaxy/backend:integration (backend/Dockerfile),
  • galaxy/gateway:integration (gateway/Dockerfile),
  • galaxy/game:integration (game/Dockerfile).

Each image is built once per go test invocation, guarded by a sync.Once inside testenv, and stamped with the galaxy.test.kind=integration-image label so preclean can wipe it on the next run. The first cold run is slow (~23 min on a developer machine); subsequent runs reuse the layer cache.
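
The once-per-invocation guard can be pictured with the following sketch. It assumes nothing about the actual contents of testenv/images.go: imageBuilder and Ensure are hypothetical names, and the build callback stands in for whatever builds the image.

```go
package main

import (
	"fmt"
	"sync"
)

// imageBuilder (hypothetical) runs build at most once per process and
// replays the stored result to every later caller.
type imageBuilder struct {
	once  sync.Once
	err   error
	build func() error
}

// Ensure triggers the build on first use and returns its cached error.
func (b *imageBuilder) Ensure() error {
	b.once.Do(func() { b.err = b.build() })
	return b.err
}

func main() {
	builds := 0
	backend := &imageBuilder{build: func() error {
		builds++ // stand-in for the real image build
		return nil
	}}

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = backend.Ensure()
		}()
	}
	wg.Wait()
	fmt.Println("builds:", builds) // prints: builds: 1
}
```

Storing the error next to the sync.Once means a failed build is reported to every caller instead of being retried silently.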

Skipping

Tests skip with a clear message when the Docker daemon is unreachable. Subsuites that require a live engine container (lobby_flow_test.go) also skip when the galaxy/game image cannot be built.
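
A cheap reachability probe of the kind such a skip could rest on is sketched below, using only the standard library. It is an assumption, not the suite's actual check: dockerEndpoint and dockerReachable are hypothetical helpers, only unix:// and tcp:// values of DOCKER_HOST are handled, and an accepted connection does not prove the daemon is healthy.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
	"time"
)

// defaultSocket is the usual local Docker socket path on Linux.
const defaultSocket = "/var/run/docker.sock"

// dockerEndpoint (hypothetical) turns a DOCKER_HOST value into a
// network/address pair usable with net.Dial.
func dockerEndpoint(dockerHost string) (network, address string, err error) {
	if dockerHost == "" {
		return "unix", defaultSocket, nil
	}
	u, err := url.Parse(dockerHost)
	if err != nil {
		return "", "", err
	}
	switch u.Scheme {
	case "unix":
		return "unix", u.Path, nil
	case "tcp":
		return "tcp", u.Host, nil
	default:
		return "", "", fmt.Errorf("unsupported DOCKER_HOST scheme %q", u.Scheme)
	}
}

// dockerReachable reports whether something accepts connections at the
// resolved endpoint within the timeout.
func dockerReachable(dockerHost string, timeout time.Duration) bool {
	network, address, err := dockerEndpoint(dockerHost)
	if err != nil {
		return false
	}
	conn, err := net.DialTimeout(network, address, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	fmt.Println("docker reachable:", dockerReachable(os.Getenv("DOCKER_HOST"), 2*time.Second))
}
```

Inside a test, a false result would translate into a t.Skip call with a message naming the endpoint that was tried.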

Layout

  • testenv/ — fixtures: Postgres, Redis, mailpit, GeoLite2 mmdb, image builders, backend/gateway runners, signed gRPC client (built on top of the public galaxy/gateway/authn package, no duplicated canonical-bytes code), mailpit HTTP client, EnrollPilots helper for runtime-driven scenarios that need ≥10 members, platform bootstrap.
  • *_test.go — one file per cross-service scenario.

The runtime-driven tests (runtime_lifecycle_test.go, engine_command_proxy_test.go) honour the engine's production contract len(races) >= 10: each registers ten extra pilots with synthetic Player01..Player10 race names and matching emails, has the owner invite each one, and has each pilot redeem the invite before admin force-start. Cold runs add ~30 s for the ten extra mailpit round-trips on top of the engine image build.
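
The ten synthetic pilots could be generated along these lines. The sketch only illustrates the naming scheme: the pilot type, the syntheticPilots helper, and the example.test email domain are assumptions, and the real EnrollPilots helper in testenv also performs the registration, invite, and redeem steps, which are not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// pilot (hypothetical) pairs a race name with its matching email address.
type pilot struct {
	RaceName string
	Email    string
}

// syntheticPilots returns n pilots named Player01..PlayerNN.
// The email domain is a placeholder, not the one the suite uses.
func syntheticPilots(n int) []pilot {
	pilots := make([]pilot, 0, n)
	for i := 1; i <= n; i++ {
		name := fmt.Sprintf("Player%02d", i)
		pilots = append(pilots, pilot{
			RaceName: name,
			Email:    strings.ToLower(name) + "@example.test",
		})
	}
	return pilots
}

func main() {
	for _, p := range syntheticPilots(10) {
		fmt.Println(p.RaceName, p.Email)
	}
}
```

Zero-padding the index keeps the names sorted in enrolment order, and ten entries satisfy the len(races) >= 10 contract once the owner's invites are redeemed.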

Determinism

  • Each test calls Bootstrap(t) to spin up a dedicated Postgres, Redis, mailpit, backend and gateway. Cross-test contamination is not possible.
  • Tests do not call t.Parallel(). Docker resource pressure makes parallel suites flaky on commodity hardware.
  • Gateway anti-abuse and body-size limits are loosened for the bulk of scenarios (so legitimate flows are not rate-limited mid-test) and intentionally tightened in gateway_edge_test.go so each protective mechanism can be observed firing.