# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by `backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's browser session survives (the same `device_session_id` keeps working), but the lobby shows no game because the only game it had is now terminal. `purgeTerminalSandboxGames` does pick it up on the **next** boot and creates a fresh sandbox — but the first redispatch leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=
20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id= op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (the reconciler cancel) the backend logs are silent on the runtime / engine paths — no `engine spawned`, no `engine container started`, no `runtime transition` lines. The reconciler then fires and reports the engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the `docker ps -a` snapshot was taken.

### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels:        galaxy.backend=1, galaxy.engine_version=0.1.0, galaxy.game_id=,
               org.opencontainers.image.title=galaxy-game-engine,
               com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove:    false
RestartPolicy: on-failure
NetworkMode:   galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`, so `--remove-orphans` cannot reap the engine and a `--rm`-style self-destruct is not in play.

Two redispatches were also captured under `docker events` (one `--filter event=…` per lifecycle event: `create`, `start`, `die`, `destroy`, `kill`, `stop`) and confirmed the same picture: across both runs the only `die` / `destroy` events were for `galaxy-dev-{backend,api,caddy}`. The live engine container survived both redispatches, and the reconciler that fires 60 seconds after the new backend boots correctly matched it through `byGameID` / `byContainerID`.

`backend/internal/runtime/service.go` only removes engine containers from the explicit `runStop` / `runRestart` / `runPatch` paths. There is no `runtime.Service.Shutdown` that proactively kills containers on backend exit, so a graceful SIGTERM to `galaxy-dev-backend` will not touch its child engine containers.
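For the next redispatch the same event capture can be left running in the background and diffed afterwards. A minimal sketch of that capture — the output path, the wait, and the final `grep` heuristic (which assumes every compose container carries the `galaxy-dev-` prefix) are assumptions, not the exact invocation used above:

```sh
# Background capture of container lifecycle events across a redispatch.
# One --filter per event type (docker does not OR comma-separated values).
docker events \
  --filter 'type=container' \
  --filter 'event=create' --filter 'event=start' \
  --filter 'event=die'    --filter 'event=destroy' \
  --filter 'event=kill'   --filter 'event=stop' \
  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
  > /tmp/redispatch-events.log &

# ...redispatch dev-deploy, wait out the ~15-minute window, then:
kill %1
# Any die/destroy row that is not a galaxy-dev-{backend,api,caddy}
# recreate is the reaper we are looking for.
grep -E 'die|destroy' /tmp/redispatch-events.log | grep -v 'galaxy-dev-'
```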
### Remaining hypotheses

1. **Engine self-crashed and was reaped by something host-side.** `RestartPolicy=on-failure` only retries within Docker's own limits; if the engine exited cleanly (status 0) Docker does not restart it, but it does keep the row in `docker ps -a`. The reproduction case had the engine missing from `docker ps -a` entirely, so a separate cleanup (a cron `docker container prune`, a host script, a manual `docker rm`) needs to be ruled out.
2. **An out-of-band Docker daemon restart dropped the container.** A `dockerd` restart that loses sight of an unmanaged container is rare, but it would explain why both the live tracking and `docker ps -a` are empty. Correlate the gap with `journalctl -u docker` on the host.
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`, and the engine exited on its own before `status=running`.** Bootstrap logs `status=starting` and is then silent until the reconciler fires 15 minutes later; the runtime row in that case should have been written with `status=engine_unreachable`, so any reproduction needs a `runtime_records` snapshot from the bad window — that table got wiped together with the cancelled game on the next boot, so the post-mortem currently lacks it.

### What to investigate next

- On the dev host: list cron jobs, systemd timers, and any custom shell script that periodically runs `docker container prune` or `docker system prune`. The host also runs gitea + crowdsec, so unrelated maintenance is plausible.
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until '2026-05-16 20:56:34'` for the original repro window — confirm whether the daemon logged a container removal in that gap.
- Re-run with backend logging level `debug` so the `runtime.scheduler` and `runtime.workers` paths surface their per-game timer / job decisions. The current `info` level says nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next boot purges it (see the sketch at the end of this section); the column set (`status`, `current_container_id`, `engine_endpoint`) tells whether the engine ever reached `running` or stopped at `engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out postgres-state ambiguity.

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":""}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled row, `findOrCreateSandboxGame` creates a fresh one, and `ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.
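### Evidence capture before redispatching

To make the next post-mortem self-contained, snapshot `runtime_records` before running the workaround — the next boot purges it together with the cancelled game. A minimal sketch; the postgres container name, database, and role are assumptions (this doc only names the `galaxy-dev-{backend,api,caddy}` services), so substitute the actual dev compose names:

```sh
# Hypothetical names: galaxy-dev-postgres / galaxy / galaxy are NOT
# confirmed by this doc — adjust to the real compose service and role.
docker exec galaxy-dev-postgres \
  psql -U galaxy -d galaxy -c \
  "COPY (SELECT * FROM runtime_records) TO STDOUT WITH CSV HEADER" \
  > "runtime_records-$(date +%Y%m%dT%H%M%S).csv"
```

Attach the CSV to the issue; the `status` column alone separates hypothesis 3 (`engine_unreachable`) from the other two.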