fix(integration): scope preclean to galaxy.stack=integration #14

Merged
developer merged 1 commit from feature/preclean-stack-scope into development 2026-05-18 23:48:47 +00:00

Summary

Root cause for the long-standing "Dev Sandbox flips to cancelled after dev-deploy" symptom in push-triggered cycles: integration/scripts/preclean.sh was unscoped. It ran docker rm -f on every container labelled galaxy.backend=1, including the dev-deploy engines living on the same Docker daemon. Each post-merge auto-deploy had integration.yaml and dev-deploy.yaml running in parallel, and the integration preclean wiped the dev sandbox engine. The new backend's reconciler then observed "container disappeared" and cascaded the sandbox to cancelled.

Reproduction was captured live during the previous PR's merge:

23:10:40  dev-deploy backend pre-bootstrap reconciler tick: engine ALIVE
23:10:56  integration #190 preclean: removing 1 backend-managed engine containers  ← wipe
23:11:40  dev-deploy backend reconciler tick: container disappeared → game cancelled

Changes

  • integration/testenv/backend.go — pass BACKEND_STACK_LABEL=integration to every backend-under-test, so its engines carry galaxy.stack=integration in addition to galaxy.backend=1.
  • integration/scripts/preclean.sh — remove_containers_with_label now accepts multiple labels (AND-combined). The engine cleanup is scoped to galaxy.backend=1 AND galaxy.stack=integration. dev-deploy and local-dev engines stamped with different galaxy.stack values are no longer collateral damage.
  • docs/ARCHITECTURE.md — Container labels section refreshed to call out the AND-scoping rule.
  • tools/dev-deploy/KNOWN-ISSUES.md — sandbox-cancel entry downgraded from parked to partially fixed (push-triggered cycles); separate new entry for the docker restart galaxy-dev-backend failure caused by the runner-workspace bind-mount that surfaced while diagnosing this issue.
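
For readers who want to see the AND-scoping idea in isolation, here is a minimal sketch of a multi-label removal helper. It is not the actual contents of integration/scripts/preclean.sh: only the function name and the two labels come from this PR, while the body, the log messages, and the refusal to run without labels are assumptions made for illustration. It relies on Docker requiring a container to match every repeated `--filter label=...` flag, which is the same behaviour the live filter check in the test plan exercises.

```shell
# Hypothetical sketch, not the real preclean.sh.
# Removes every container that carries ALL of the labels passed as arguments.
remove_containers_with_label() {
  local labels label ids

  # Refuse to run unfiltered: with no labels, `docker ps -aq` lists every container.
  if [ "$#" -eq 0 ]; then
    echo "remove_containers_with_label: at least one label is required" >&2
    return 1
  fi
  labels="$*"

  # Rewrite the positional parameters in place so that each label
  # becomes the pair: --filter label=<label>
  for label in "$@"; do
    set -- "$@" --filter "label=${label}"
    shift
  done

  # Repeated label filters are AND-combined by Docker.
  ids="$(docker ps -aq "$@")"
  if [ -z "$ids" ]; then
    echo "preclean: no containers match ${labels}"
    return 0
  fi

  echo "preclean: removing containers matching ${labels}"
  # Word splitting on $ids is intentional: one container ID per word.
  docker rm -f $ids >/dev/null
}
```

Called as `remove_containers_with_label galaxy.backend=1 galaxy.stack=integration`, the sketch only selects engines that carry both labels, so an engine stamped with a different galaxy.stack value is never passed to `docker rm -f`.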

Test plan

  • go build ./backend/... ./integration/... — clean.
  • bash -n integration/scripts/preclean.sh — syntax OK.
  • Live filter check on dev host: AND filter returns nothing while the dev-deploy engine galaxy-game-80f3ce86-... keeps running.
  • Gitea CI: go-unit + integration green.
  • After merge: post-merge integration run does NOT wipe the dev-deploy engine; sandbox stays running.
developer added 1 commit 2026-05-18 23:38:15 +00:00
fix(integration): scope preclean to galaxy.stack=integration
Tests · Integration / integration (pull_request) Successful in 1m37s
a338ebf058
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets
  `BACKEND_STACK_LABEL=integration` on every backend-under-test, so
  the engines spawned by integration carry
  `galaxy.stack=integration` in addition to `galaxy.backend=1`. The
  backend support for this env was added in the previous CI tidy-up
  PR (#13).

- `integration/scripts/preclean.sh` gains a multi-label AND filter
  helper and uses it to scope engine cleanup to the combination
  `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
  local-dev engines carry different `galaxy.stack` values, so the
  AND match leaves them alone.

- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
  the AND-scoping rule and the new integration backend stamp.

- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
  gets an "Update" section recording the root cause and the fix; the
  status is downgraded to "partially fixed" because the solo
  `workflow_dispatch` reproduction (which does NOT trigger
  integration) remains unexplained.

- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
  `docker restart galaxy-dev-backend` failure caused by the
  runner-workspace bind-mount that surfaced while diagnosing this
  issue. Workaround: `make -C tools/dev-deploy up` from the
  persistent checkout. Real fix is a follow-up (bake fixture into
  image or copy to named volume).

Verification:

- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
  `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
  returns nothing while the dev-deploy engine
  `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
developer merged commit d19aa3aac5 into development 2026-05-18 23:48:47 +00:00
developer deleted branch feature/preclean-stack-scope 2026-05-18 23:48:47 +00:00