galaxy-game/tools/dev-deploy/KNOWN-ISSUES.md
Ilia Denisov a338ebf058
fix(integration): scope preclean to galaxy.stack=integration
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets
  `BACKEND_STACK_LABEL=integration` on every backend-under-test, so
  the engines spawned by integration carry
  `galaxy.stack=integration` in addition to `galaxy.backend=1`. The
  backend support for this env was added in the previous CI tidy-up
  PR (#13).

- `integration/scripts/preclean.sh` gains a multi-label AND filter
  helper and uses it to scope engine cleanup to the combination
  `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
  local-dev engines carry different `galaxy.stack` values, so the
  AND match leaves them alone.

- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
  the AND-scoping rule and the new integration backend stamp.

- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
  gets an "Update" section recording the root cause and the fix; the
  status is downgraded to "partially fixed" because the solo
  `workflow_dispatch` reproduction (which does NOT trigger
  integration) remains unexplained.

- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
  `docker restart galaxy-dev-backend` failure caused by the
  runner-workspace bind-mount that surfaced while diagnosing this
  issue. Workaround: `make -C tools/dev-deploy up` from the
  persistent checkout. Real fix is a follow-up (bake fixture into
  image or copy to named volume).

Verification:

- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
  `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
  returns nothing while the dev-deploy engine
  `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:37:55 +02:00


tools/dev-deploy/ — known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.

Dev Sandbox game flips to cancelled after a dev-deploy redispatch

Symptom

A previously running "Dev Sandbox" game (created by backend/internal/devsandbox) transitions to cancelled ~15 minutes after a dev-deploy.yaml workflow_dispatch run finishes. The user's browser session survives (the same device_session_id keeps working), but the lobby shows no game because the only game it had is now terminal. purgeTerminalSandboxGames does pick it up on the next boot, after which a fresh sandbox is created, but the first redispatch leaves the user with an empty lobby until the backend restarts again.

Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"

Between 20:24:40 (status=starting) and 20:39:40 (reconciler cancel) the backend logs are silent on the runtime / engine paths — no engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine container as missing.

docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine' returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot.

What has been ruled out

A live docker inspect on a healthy engine container shows:

Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove:    false
RestartPolicy: on-failure
NetworkMode:   galaxy-dev-internal

There are no com.docker.compose.* labels and AutoRemove is false, so --remove-orphans cannot reap the engine and a --rm-style self-destruct is not in play. Two redispatches captured under docker events --filter event=create,start,die,destroy,kill,stop confirmed that compose leaves the engine alone: across both runs the only die / destroy events were for galaxy-dev-{backend,api,caddy}. The live engine container survived both redispatches, and the reconciler that fires 60 seconds after the new backend boots correctly matched it through byGameID / byContainerID.

backend/internal/runtime/service.go only removes engine containers from the explicit runStop / runRestart / runPatch paths. There is no runtime.Service.Shutdown that proactively kills containers on backend exit, so a graceful SIGTERM to galaxy-dev-backend will not touch its child engine containers.

Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared (host-side docker prune cron jobs, a manual docker rm, an out-of-band dockerd restart, and an idle-state engine crash) were all rejected by the project owner: the dev host runs none of those periodic cleanups, no one manually removed the container, dockerd was not restarted in the window, and the engine binary does not crash while sitting idle between API calls.

Best remaining suspicion

Something the dev-deploy.yaml CI run does between the successful image builds and the final docker compose up -d --wait --remove-orphans clobbers the previously spawned engine container. The job runs these steps in order:

  1. docker build -t galaxy-engine:dev -f game/Dockerfile .
  2. docker compose build galaxy-backend galaxy-api
  3. docker run --rm alpine for the UI volume seed
  4. docker compose up -d --wait --remove-orphans

None of these should touch an unmanaged engine container, but the reproduction window points squarely inside this sequence. The next deliberate reproduction should arm docker events --since 0 before the deploy starts and keep it running for the entire job, captured end-to-end on the dev host rather than only the window after the backend is recreated. That capture would pin down which step emits the destroy on the engine.
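The end-to-end capture described above could be armed with something like the following sketch. The function name `watch_container_events` and the output path in the usage note are hypothetical. The event names mirror the ones already used in this entry, passed here as repeated `--filter event=` flags, and `--format '{{json .}}'` is chosen only to make the capture easy to grep afterwards.

```shell
#!/usr/bin/env bash
# Sketch only: stream container lifecycle events for the whole deploy window.
# watch_container_events is a hypothetical name, not an existing script.
watch_container_events() {
  # --since 0 asks for events from the earliest timestamp the daemon still
  # holds, then keeps streaming live events until the process is killed.
  docker events --since 0 \
    --filter type=container \
    --filter event=create \
    --filter event=start \
    --filter event=die \
    --filter event=destroy \
    --filter event=kill \
    --filter event=stop \
    --format '{{json .}}'
}
```

One way to use it is to start `watch_container_events > /tmp/dev-deploy-events.jsonl &` on the dev host before dispatching the workflow, then kill the background job once the deploy has finished and the reconciler tick has passed.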

Update 2026-05-19: integration preclean identified as one cause

A live reproduction during the post-merge auto-deploy cycle (Gitea run #188 dev-deploy plus parallel run #190 integration) pinned one clobbering source: integration/scripts/preclean.sh was unscoped and removed every container labelled galaxy.backend=1, including the dev-deploy engine. Timeline from the dev host:

23:10:40  backend pre-bootstrap reconciler tick: engine alive
23:10:40  dev_sandbox bootstrap: status=running
23:10:56  preclean: removing 1 backend-managed engine containers  ← integration run #190
23:11:40  reconciler: container disappeared → game cancelled

Fix landed: BACKEND_STACK_LABEL=integration is now passed to every integration backend (see integration/testenv/backend.go) and preclean.sh AND-combines galaxy.backend=1 with galaxy.stack=integration, so dev-deploy / local-dev engines stamped with different stack values are no longer removed as collateral.
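The AND-scoping can be illustrated with a minimal sketch. This is not the actual helper from preclean.sh, which is not reproduced in this document; the function names `containers_with_labels` and `remove_integration_engines` and the log wording are assumptions. It relies only on the behaviour already verified on the dev host: repeated `--filter label=...` flags on `docker ps` match containers that carry every listed label.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper names, not the real preclean.sh code.

# Print the IDs of all containers (running or stopped) that carry EVERY
# label passed as an argument, e.g. "galaxy.backend=1".
containers_with_labels() {
  local args=()
  local label
  for label in "$@"; do
    args+=(--filter "label=${label}")
  done
  docker ps -aq "${args[@]}"
}

# Remove only the engines spawned by integration backends. Engines from
# dev-deploy or local-dev carry a different galaxy.stack value and are skipped.
remove_integration_engines() {
  local ids=()
  mapfile -t ids < <(containers_with_labels "galaxy.backend=1" "galaxy.stack=integration")
  if [ "${#ids[@]}" -eq 0 ]; then
    echo "preclean: no integration engine containers to remove"
    return 0
  fi
  echo "preclean: removing ${#ids[@]} integration engine containers"
  docker rm -f "${ids[@]}"
}
```

Building the filter list from arguments keeps the rule in one place: adding a third label later only changes the call site, not the helper.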

This covers push-triggered cycles where dev-deploy.yaml and integration.yaml run on the same Gitea host. The original symptom (a solo workflow_dispatch dev-deploy run also losing the engine) is not explained by the integration fix, because manual dispatches do not trigger integration.yaml. Keep this entry open until a solo-dispatch reproduction confirms whether the symptom still occurs.

Status

Partially fixed (push-triggered cycles). Solo workflow_dispatch reproductions are still open. If the symptom recurs after the integration fix lands, capture docker events --since 0 for the full dispatch window and attach the output here.

Workaround in use today

When the sandbox game flips to cancelled, redispatch dev-deploy:

curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches

The next boot's purgeTerminalSandboxGames removes the cancelled row, findOrCreateSandboxGame creates a fresh one, and ensureMembershipsAndDrive puts the new game back to running.

Owner

Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.

docker restart galaxy-dev-backend fails after the CI runner cleans up

Symptom

docker restart galaxy-dev-backend from the host fails with:

Error response from daemon: ... error mounting
"/home/runner/.cache/act/<workspace>/hostexecutor/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
to rootfs at "/var/lib/galaxy/geoip.mmdb": ... not a directory

The container ends up Exited (127) and never comes back.

Cause

tools/dev-deploy/docker-compose.yml mounts the geoip database via a path relative to the compose file (../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb). When the dev-deploy.yaml Gitea runner invokes docker compose up it resolves that relative path against the runner's ephemeral workspace under /home/runner/.cache/act/<hash>/hostexecutor/tools/dev-deploy/, so the bind-mount source baked into the running container points at that ephemeral path. The runner deletes the workspace once the workflow ends, the source disappears, and the next docker restart fails to remount it.
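To confirm this failure mode on a container that refuses to restart, the bind-mount sources recorded in the container can be checked against the host filesystem. The sketch below is an illustration, not an existing script; `missing_bind_sources` is a hypothetical name. It uses `docker inspect` with a Go template over `.Mounts`, which also works on exited containers.

```shell
#!/usr/bin/env bash
# Sketch only: list bind-mount sources of a container that no longer exist
# on the host. missing_bind_sources is a hypothetical helper name.
missing_bind_sources() {
  local container="$1"
  local src
  # Emit one bind-mount source path per line, skipping named volumes.
  docker inspect \
    --format '{{range .Mounts}}{{if eq .Type "bind"}}{{println .Source}}{{end}}{{end}}' \
    "$container" |
    while IFS= read -r src; do
      [ -n "$src" ] || continue            # skip blank lines
      [ -e "$src" ] || echo "missing: $src"
    done
}
```

Running `missing_bind_sources galaxy-dev-backend` after the runner has cleaned up would be expected to report the geoip path under the runner workspace, assuming that mount is the only one affected.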

Workaround

Bring the stack back up from a stable workspace, which re-binds the mount source to the persistent checkout:

make -C tools/dev-deploy up

This restarts every service (including the broken galaxy-dev-backend) with a stable source path.

Status

Open. The clean fix is either to bake the geoip test fixture into the backend image (no host bind-mount) or to copy it onto a named volume during dev-deploy.yaml and bind that instead. Either change removes the runner-workspace dependency entirely.
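The named-volume option could look roughly like the sketch below, modelled on the throwaway alpine container the deploy already uses for the UI volume seed. Everything here is an assumption rather than the planned implementation: the function name `seed_geoip_volume`, the volume name in the usage note, and the target file name inside the volume are hypothetical, and the compose file and the backend's mount target would still need to be changed to use the volume.

```shell
#!/usr/bin/env bash
# Sketch only: copy the geoip fixture from the (ephemeral) workspace into a
# named volume so the running container no longer depends on the workspace.
seed_geoip_volume() {
  local src_file="$1"   # ABSOLUTE path to the .mmdb fixture in the workspace
  local volume="$2"     # named volume the backend would mount instead
  local src_dir src_name
  src_dir="$(dirname "$src_file")"
  src_name="$(basename "$src_file")"

  # Creating an existing volume is a no-op, so this is safe on every deploy.
  docker volume create "$volume" >/dev/null

  # Mount the fixture directory read-only and copy the file into the volume.
  docker run --rm \
    -v "${src_dir}:/src:ro" \
    -v "${volume}:/dst" \
    alpine cp "/src/${src_name}" /dst/geoip.mmdb
}
```

A deploy step might call it as `seed_geoip_volume "$PWD/pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb" galaxy-dev-geoip` from the repository root before `docker compose up`. The copy lives in the volume, so it survives the runner deleting its workspace.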