# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by `backend/internal/devsandbox`) transitions to `cancelled` roughly 15 minutes after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's browser session survives (the same `device_session_id` keeps working), but the lobby shows no game because the only game it had is now terminal. `purgeTerminalSandboxGames` does pick it up on the **next** boot and creates a fresh sandbox, but the first redispatch leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=
20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id= op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (the reconciler cancel) the backend logs are silent on the runtime / engine paths: no `engine spawned`, no `engine container started`, no `runtime transition` lines. The reconciler then fires and reports the engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` returns no rows during this window. The engine container is neither running nor stopped on the host, so it was either never spawned or was removed before the host snapshot.
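The one-shot `docker ps -a` snapshot above can be widened into a continuous capture. The sketch below is illustrative, not part of `tools/dev-deploy/`: the `engine_events` helper name is hypothetical, the `galaxy-engine:dev` tag is taken from the build step later in this entry, and the `--format` template fields are assumed to match this Docker version's event stream.

```sh
#!/bin/sh
# Keeps only die/destroy events for the engine image from a `docker events`
# stream pre-formatted as: TIME STATUS IMAGE ID (one event per line).
engine_events() {
  grep -E ' (die|destroy) galaxy-engine:dev '
}

# On the dev host, arm this BEFORE dispatching dev-deploy and leave it
# running for the whole job (requires Docker; shown here as a comment):
#   docker events --since 0 \
#     --format '{{.Time}} {{.Status}} {{.Actor.Attributes.image}} {{.ID}}' \
#     | engine_events | tee engine-events.log
```

The filter is deliberately dumb text matching so it can run under plain `sh` on the host; anything it catches pins the destroy to a wall-clock time that can be lined up against the CI job's step timestamps.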
### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0, galaxy.game_id=, org.opencontainers.image.title=galaxy-game-engine, com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`, so `--remove-orphans` cannot reap the engine and a `--rm`-style self-destruct is not in play.

Two redispatches captured under `docker events --filter event=create,start,die,destroy,kill,stop` also confirmed it: across both runs the only `die` / `destroy` events were for `galaxy-dev-{backend,api,caddy}`. The live engine container survived both redispatches, and the reconciler that fires 60 seconds after the new backend boots correctly matched it through `byGameID` / `byContainerID`.

`backend/internal/runtime/service.go` only removes engine containers from the explicit `runStop` / `runRestart` / `runPatch` paths. There is no `runtime.Service.Shutdown` that proactively kills containers on backend exit, so a graceful SIGTERM to `galaxy-dev-backend` will not touch its child engine containers.

### Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared (host-side `docker prune` cron jobs, a manual `docker rm`, an out-of-band `dockerd` restart, and an idle-state engine crash) were all rejected by the project owner: the dev host runs none of those periodic cleanups, no one manually removed the container, dockerd was not restarted in the window, and the engine binary does not crash while idling on API calls.

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between the successful image builds and the final `docker compose up -d --wait --remove-orphans` clobbers the previously spawned engine container. The chain at runtime contains:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`

None of these *should* touch an unmanaged engine container, but the reproduction window points squarely inside this sequence. A deliberate next reproduction with `docker events --since 0` armed *before* the deploy starts and kept live for the entire job (captured end-to-end on the dev host, not just the chunk after the backend recreate) would pin down which step emits the `destroy` on the engine.

### Status

Parked. The bug is mildly disruptive (one redispatch plus a manual `make seed-ui`-style follow-up brings the sandbox back) and the remaining hypotheses are speculative. If the symptom recurs, attach the next bad-window `docker events` capture to this entry and reopen. A `tools/dev-deploy/` rewrite may obviate the issue entirely; that is on the project owner's medium-term list.

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":""}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled row, `findOrCreateSandboxGame` creates a fresh one, and `ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.
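Since the redispatch workaround will likely be run repeatedly until the root cause lands, a failed dispatch should be visible immediately instead of silently leaving the lobby empty. The helper below is hypothetical (not part of `tools/dev-deploy/`), and the success codes it accepts are an assumption; check what this Gitea instance actually returns for a successful workflow dispatch.

```sh
#!/bin/sh
# Hypothetical helper: interpret the HTTP status code from the dispatch
# call and fail loudly on anything outside the 2xx range.
check_dispatch() {
  case "$1" in
    2??) echo "dev-deploy dispatched (HTTP $1)" ;;
    *)   echo "dispatch failed (HTTP $1)" >&2; return 1 ;;
  esac
}

# In real use the code would come from the documented curl call, e.g.:
#   code="$(curl -s -o /dev/null -w '%{http_code}' -X POST -n \
#     -H 'Content-Type: application/json' -d '{"ref":""}' \
#     https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches)"
#   check_dispatch "$code"
```

A non-zero exit here means the sandbox will *not* come back on the next boot, so the operator knows to retry rather than wait out the redeploy.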