fix(integration): scope preclean to galaxy.stack=integration

Root cause for the long-standing "Dev Sandbox flips to cancelled after dev-deploy" symptom in push-triggered cycles: when `integration.yaml` runs in parallel with `dev-deploy.yaml`, its `integration/scripts/preclean.sh` issues a `docker rm -f` over every container labelled `galaxy.backend=1`. That label is stamped by the backend's runtime adapter on every engine it spawns, including the engines living in the long-lived dev-deploy environment on the same Docker daemon. Each post-merge auto-deploy therefore had the integration preclean wipe the dev-sandbox engine, and the new backend's reconciler tick observed `container disappeared` and cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets `BACKEND_STACK_LABEL=integration` on every backend-under-test, so the engines spawned by integration carry `galaxy.stack=integration` in addition to `galaxy.backend=1`. The backend support for this env was added in the previous CI tidy-up PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter helper and uses it to scope engine cleanup to the combination `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and local-dev engines carry different `galaxy.stack` values, so the AND match leaves them alone.
- `docs/ARCHITECTURE.md` ("Container labels"): refreshed to call out the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the sandbox-cancel entry gets an "Update" section recording the root cause and the fix; the status is downgraded to "partially fixed" because the solo `workflow_dispatch` reproduction (which does NOT trigger integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md`: separately, document the `docker restart galaxy-dev-backend` failure caused by the runner-workspace bind-mount that surfaced while diagnosing this issue. Workaround: `make -C tools/dev-deploy up` from the persistent checkout. Real fix is a follow-up (bake fixture into image or copy to named volume).

Verification:

- `go build ./backend/... ./integration/...`: clean.
- `bash -n integration/scripts/preclean.sh`: syntax OK.
- Live AND-filter check on the dev host: `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration` returns nothing while the dev-deploy engine `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
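
The multi-label AND filter described in the commit message can be sketched roughly as follows. This is an illustration, not the code from `integration/scripts/preclean.sh`: the function names `label_filter_args` and `remove_containers_with_labels` are hypothetical, and the sketch assumes label values contain no whitespace. It relies on `docker ps` returning only containers that carry every `--filter label=...` given, which is the behaviour the live check in the verification notes exercised.

```sh
# Sketch only: helper names are hypothetical, not taken from preclean.sh.

# Print the `docker ps` filter arguments for a set of labels,
# one token per line: "--filter", then "label=<value>".
label_filter_args() {
  for label in "$@"; do
    printf '%s\n' --filter "label=${label}"
  done
}

# Force-remove only the containers that carry ALL of the given labels.
remove_containers_with_labels() {
  if [ "$#" -eq 0 ]; then
    # Without a label the filter would match every container on the daemon.
    echo "preclean: refusing to run without at least one label" >&2
    return 2
  fi
  # Assumption: labels contain no whitespace, so word-splitting the
  # helper output into separate arguments is safe here.
  ids=$(docker ps -aq $(label_filter_args "$@"))
  if [ -z "$ids" ]; then
    echo "preclean: no containers match: $*"
    return 0
  fi
  echo "preclean: removing containers matching: $*"
  # Intentional word-splitting: one container ID per word.
  docker rm -f $ids
}
```

Under these assumptions, the integration cleanup would call `remove_containers_with_labels galaxy.backend=1 galaxy.stack=integration`. Refusing an empty label list is a deliberate guard against reintroducing an unscoped removal.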
2026-05-19 01:37:55 +02:00
parent f91cf6eb41
commit a338ebf058
4 changed files with 109 additions and 17 deletions
@@ -103,14 +103,42 @@ deliberate next reproduction with `docker events --since 0` armed
 end-to-end on the dev host, not just the chunk after backend
 recreate — would pin which step emits the `destroy` on the engine.

+### Update 2026-05-19: integration preclean identified as one cause
+
+A live reproduction during the post-merge auto-deploy cycle (Gitea
+run #188 dev-deploy plus parallel run #190 integration) pinned one
+clobbering source: `integration/scripts/preclean.sh` was unscoped
+and removed *every* container labelled `galaxy.backend=1`, including
+the dev-deploy engine. Timeline from the dev host:
+
+```text
+23:10:40  backend pre-bootstrap reconciler tick: engine alive
+23:10:40  dev_sandbox bootstrap: status=running
+23:10:56  preclean: removing 1 backend-managed engine containers  ← integration run #190
+23:11:40  reconciler: container disappeared → game cancelled
+```
+
+Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
+every integration backend (see
+`integration/testenv/backend.go`) and `preclean.sh` AND-combines
+`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
+local-dev engines stamped with different stack values are no longer
+collateral.
+
+This covers **push**-triggered cycles where `dev-deploy.yaml` and
+`integration.yaml` run on the same Gitea host. The original
+hypothesis (a `workflow_dispatch dev-deploy` solo run also losing
+the engine) is *not* explained by the integration fix — manual
+dispatches do not trigger `integration.yaml`. Keep this entry open
+until a solo-dispatch reproduction confirms whether the symptom
+still occurs.
+
 ### Status

-Parked. The bug is mildly disruptive (one redispatch + a manual
-`make seed-ui`-style follow-up brings the sandbox back) and the
-remaining hypotheses are speculative. If the symptom recurs, attach
-the next bad-window `docker events` capture to this entry and
-reopen. A `tools/dev-deploy/` rewrite may obviate the issue
-entirely; that is on the project owner's medium-term list.
+Partially fixed (push-triggered cycles). Solo `workflow_dispatch`
+reproductions still open. If the symptom recurs after the
+integration fix lands, capture `docker events --since 0` for the
+full dispatch window and attach here.

 ### Workaround in use today

@@ -131,3 +159,49 @@ row, `findOrCreateSandboxGame` creates a fresh one, and
 Unassigned. File an issue once we have the runtime / reconciler
 analysis above; reference this section in the issue body so future
 redeploys can short-circuit the diagnostic loop.
+
+## `docker restart galaxy-dev-backend` fails after the CI runner cleans up
+
+### Symptom
+
+`docker restart galaxy-dev-backend` from the host fails with:
+
+```text
+Error response from daemon: ... error mounting
+"/home/runner/.cache/act/<workspace>/hostexecutor/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
+to rootfs at "/var/lib/galaxy/geoip.mmdb": ... not a directory
+```
+
+The container ends up `Exited (127)` and never comes back.
+
+### Cause
+
+`tools/dev-deploy/docker-compose.yml` mounts the geoip database via
+a path relative to the compose file
+(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`). When
+the `dev-deploy.yaml` Gitea runner invokes `docker compose up` it
+resolves that relative path against the runner's ephemeral workspace
+under `/home/runner/.cache/act/<hash>/hostexecutor/tools/dev-deploy/`,
+so the bind-mount source baked into the running container points at
+that ephemeral path. The runner deletes the workspace once the
+workflow ends, the source disappears, and the next `docker restart`
+fails to remount it.
+
+### Workaround
+
+Bring the stack back up from a stable workspace, which re-binds the
+mount source to the persistent checkout:
+
+```sh
+make -C tools/dev-deploy up
+```
+
+This restarts every service (including the broken `galaxy-dev-backend`)
+with a stable source path.
+
+### Status
+
+Open. The clean fix is either to bake the geoip test fixture into
+the backend image (no host bind-mount) or to copy it onto a named
+volume during `dev-deploy.yaml` and bind that instead. Either change
+removes the runner-workspace dependency entirely.
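
One way to confirm the bind-mount cause described above before attempting a restart is to list the container's bind-mount sources and check which of them still exist on the host. The sketch below is not part of the repository: `bind_mount_sources` and `missing_paths` are hypothetical helper names, and it assumes the commands run on the Docker host itself so that the source paths are visible.

```sh
# Sketch only: helper names are hypothetical.

# Print the host-side source path of every bind mount on container $1.
bind_mount_sources() {
  docker inspect \
    --format '{{range .Mounts}}{{if eq .Type "bind"}}{{println .Source}}{{end}}{{end}}' \
    "$1"
}

# Read paths on stdin (one per line) and print those that no longer exist.
missing_paths() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue            # skip blank lines
    [ -e "$path" ] || printf '%s\n' "$path"
  done
}
```

Running `bind_mount_sources galaxy-dev-backend | missing_paths` would then print any source under the runner's deleted workspace. Empty output suggests the restart failure has a different cause.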