fix(integration): scope preclean to galaxy.stack=integration

Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. On each post-merge auto-deploy the integration
preclean therefore removed the dev-sandbox engine; the new backend's
reconciler tick then observed `container disappeared` and cascaded
the sandbox into `cancelled`.
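
The overmatch can be reproduced without a Docker daemon. The sketch below is illustrative only: the `inventory` array, the container names, the `match_labels` helper and the `dev-deploy` stack value are all assumptions (the commit only states that dev-deploy engines carry a different `galaxy.stack` value).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inventory, one "name|label,label,..." entry per container.
# The dev-deploy stack value is assumed; only "differs from integration" is known.
inventory=(
    "integration-engine|galaxy.backend=1,galaxy.stack=integration"
    "dev-sandbox-engine|galaxy.backend=1,galaxy.stack=dev-deploy"
)

# match_labels LABEL... prints the name of every entry carrying ALL given labels.
match_labels() {
    local entry name labels wanted ok
    for entry in "${inventory[@]}"; do
        name="${entry%%|*}"
        labels=",${entry#*|},"
        ok=1
        for wanted in "$@"; do
            case "$labels" in
                *",$wanted,"*) ;;
                *) ok=0 ;;
            esac
        done
        if [ "$ok" -eq 1 ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Old scope: matches both engines, so the dev sandbox engine gets removed too.
match_labels "galaxy.backend=1"
# New scope: matches only the integration-owned engine.
match_labels "galaxy.backend=1" "galaxy.stack=integration"
```

With the single label both entries are selected; adding the second label narrows the match to the integration-owned entry.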

Fix:

- `integration/testenv/backend.go` now sets
  `BACKEND_STACK_LABEL=integration` on every backend-under-test, so
  the engines spawned by integration carry
  `galaxy.stack=integration` in addition to `galaxy.backend=1`.
  Backend support for this environment variable was added in the
  previous CI tidy-up PR (#13).

- `integration/scripts/preclean.sh` gains a multi-label AND filter
  helper and uses it to scope engine cleanup to the combination
  `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
  local-dev engines carry different `galaxy.stack` values, so the
  AND match leaves them alone.

- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
  the AND-scoping rule and the new integration backend stamp.

- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
  gets an "Update" section recording the root cause and the fix; the
  status is downgraded to "partially fixed" because the solo
  `workflow_dispatch` reproduction (which does NOT trigger
  integration) remains unexplained.

- `tools/dev-deploy/KNOWN-ISSUES.md` also documents, as a separate
  entry, the `docker restart galaxy-dev-backend` failure that surfaced
  while diagnosing this issue; it is caused by the runner-workspace
  bind-mount. Workaround: run `make -C tools/dev-deploy up` from the
  persistent checkout. The real fix is a follow-up (bake the fixture
  into the image or copy it to a named volume).
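
The AND scoping relies on passing one `--filter label=...` flag per label to `docker ps`, so that only containers carrying every listed label are returned. A minimal sketch of that idea follows; it is not the code in `preclean.sh`. The function name `list_containers_with_labels` and the `DOCKER_BIN` override are hypothetical, the latter added only so the snippet can be dry-run without a daemon.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical override so the sketch can be dry-run without a Docker daemon.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# list_containers_with_labels LABEL [LABEL...]
# Builds one `--filter label=...` pair per label and lists matching container IDs.
list_containers_with_labels() {
    if [ "$#" -eq 0 ]; then
        # Refuse to run unfiltered: that would list every container on the host.
        echo "list_containers_with_labels: at least one label is required" >&2
        return 1
    fi
    local filter_args=()
    local label
    for label in "$@"; do
        filter_args+=("--filter" "label=$label")
    done
    "$DOCKER_BIN" ps -aq "${filter_args[@]}"
}
```

A dry run with `DOCKER_BIN=echo list_containers_with_labels "galaxy.backend=1" "galaxy.stack=integration"` prints the command line that would be passed to Docker instead of executing it. Building the flags in an array keeps each label a separate argument, so labels containing spaces or glob characters are not re-split by the shell.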

Verification:

- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
  `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
  returns nothing while the dev-deploy engine
  `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ilia Denisov
2026-05-19 01:37:55 +02:00
parent f91cf6eb41
commit a338ebf058
4 changed files with 109 additions and 17 deletions
+20 -9
@@ -7,11 +7,15 @@
 # 1. Containers labelled `org.testcontainers=true` — every container
 # brought up by testcontainers-go (our backend/gateway/game plus
 # postgres/redis/mailpit/ryuk service containers).
-# 2. Containers labelled `galaxy.backend=1` — engine instances spawned
-# by backend's runtime adapter on the host Docker daemon (see
-# `backend/internal/dockerclient/types.go`). These do not carry
-# the testcontainers label because backend, not testcontainers,
-# creates them.
+# 2. Containers labelled `galaxy.backend=1` AND
+# `galaxy.stack=integration` — engine instances spawned by the
+# backend-under-test on the host Docker daemon (see
+# `backend/internal/dockerclient/types.go` and the
+# `BACKEND_STACK_LABEL=integration` env in
+# `integration/testenv/backend.go`). The stack-label filter is
+# what keeps dev-deploy / local-dev engines on the same host
+# safe — they carry `galaxy.backend=1` too but a different
+# `galaxy.stack` value, so the AND match leaves them alone.
 # 3. Networks labelled `org.testcontainers=true` — networks created
 # by testcontainers-go for cross-container wiring.
 # 4. Images labelled `galaxy.test.kind=integration-image` — local
@@ -22,14 +26,21 @@
 # What we never touch:
 # - Containers / images without one of the labels above.
 # - User-managed images and volumes.
+# - dev-deploy / local-dev engines (they share the `galaxy.backend=1`
+# label, but their `galaxy.stack` value differs from `integration`).
 set -euo pipefail
 remove_containers_with_label() {
-    local label="$1"
-    local description="$2"
+    local description="${!#}"
+    local labels=("${@:1:$#-1}")
+    local filter_args=()
+    local label
+    for label in "${labels[@]}"; do
+        filter_args+=("--filter" "label=$label")
+    done
     local ids
-    ids=$(docker ps -aq --filter "label=$label" 2>/dev/null || true)
+    ids=$(docker ps -aq "${filter_args[@]}" 2>/dev/null || true)
     if [ -z "$ids" ]; then
         return
     fi
@@ -81,7 +92,7 @@ if ! docker info >/dev/null 2>&1; then
 fi
 remove_containers_with_label "org.testcontainers=true" "testcontainers-managed containers"
-remove_containers_with_label "galaxy.backend=1" "backend-managed engine containers"
+remove_containers_with_label "galaxy.backend=1" "galaxy.stack=integration" "integration-owned engine containers"
 remove_networks_with_label "org.testcontainers=true" "testcontainers-managed networks"
 remove_images_with_label "galaxy.test.kind=integration-image" "integration-built images"
+7
@@ -85,6 +85,13 @@ func StartBackend(t *testing.T, opts BackendOptions) *BackendContainer {
 		"BACKEND_AUTH_CHALLENGE_THROTTLE_MAX": "100",
 		"BACKEND_MAIL_WORKER_INTERVAL": "500ms",
 		"BACKEND_NOTIFICATION_WORKER_INTERVAL": "500ms",
+		// Stamp galaxy.stack=integration on every engine container the
+		// backend-under-test spawns so the post-run preclean.sh can
+		// scope its cleanup to integration-owned engines and leave
+		// dev-deploy / local-dev stacks running on the same daemon
+		// untouched. See `integration/scripts/preclean.sh` and the
+		// "Container labels" section in `docs/ARCHITECTURE.md`.
+		"BACKEND_STACK_LABEL": "integration",
 	}
 	for k, v := range opts.Extra {
 		env[k] = v