# `tools/dev-deploy/`: known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypotheses, and the open questions that have to be answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by `backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's browser session survives (the same `device_session_id` keeps working), but the lobby shows no game because the only game it had is now terminal.

`purgeTerminalSandboxGames` does pick it up on the **next** boot and creates a fresh sandbox, but the first redispatch leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=
20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id= op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel) the backend logs are silent on the runtime / engine paths: no `engine spawned`, no `engine container started`, no `runtime transition` lines. The reconciler then fires and reports the engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` returns no rows during this window. The engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot.

### Working hypotheses

1. **Race between `Start` returning and the runtime spawn writing the container record.** Bootstrap returns `status=starting` and the service layer's `Start` is supposed to drive the game to `running` via the runtime layer's container spawn. If the spawn fails silently, or the goroutine that owns it exits before persisting the runtime record, the reconciler later sees a `starting` game with no container and cancels it.
2. **`docker compose up -d --wait --remove-orphans` interaction.** `--remove-orphans` is documented as "remove containers for services not defined in the Compose file". Engine containers are spawned by the backend with their own labels, not under the compose project namespace, so they *should* be exempt, but it is worth verifying with `docker inspect` on a live engine container that none of its labels accidentally pin it to the `name: galaxy-dev` compose project.
3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm` semantics, a transient crash that exits the process leaves no record on the host. Combined with hypothesis 1, the reconciler's "container disappeared" branch is exactly the shape we observe.

### What to investigate before fixing

- Inspect `backend/internal/runtime/` (spawn / reconciler) for the exact path the engine takes from `status=starting` to either `running` or `start_failed`. Specifically: which goroutine owns the spawn, where its error is logged, and whether `start_failed` is reachable from the runtime reconciler path or only from the in-bootstrap `Start` call.
- Check the engine container's `Config.Labels`, `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with a deliberate redispatch and a `docker events --since 0` capture bracketing the deploy.
- Reproduce on a freshly seeded `clean-data` volume to rule out postgres-state ambiguity.

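
The label and auto-remove checks listed above could be scripted roughly as follows. This is a minimal sketch, not an existing tool in this repository: the helper names `classify_engine`, `inspect_engine`, and `capture_events` are hypothetical, and it assumes Compose marks the containers it owns with the `com.docker.compose.project` label. The engine label filter and the `galaxy-dev` project name are taken from this section.

```sh
#!/bin/sh
# Hypothetical helpers for the investigation steps; nothing runs until a
# function is called explicitly.

# Label filter from the diagnostic evidence in this section.
ENGINE_FILTER='label=org.opencontainers.image.title=galaxy-game-engine'

# Classify a container from two already-inspected values:
#   $1 = com.docker.compose.project label ("" when unset)
#   $2 = HostConfig.AutoRemove ("true" or "false")
classify_engine() {
    project=$1
    auto_remove=$2
    if [ "$project" = "galaxy-dev" ]; then
        echo "compose-owned"   # candidate for --remove-orphans (hypothesis 2)
    elif [ "$auto_remove" = "true" ]; then
        echo "auto-remove"     # vanishes on exit (hypothesis 3)
    else
        echo "unmanaged"       # neither mechanism explains a removal
    fi
}

# Inspect one live engine container by ID or name and classify it.
inspect_engine() {
    id=$1
    project=$(docker inspect \
        --format '{{ index .Config.Labels "com.docker.compose.project" }}' "$id")
    auto_remove=$(docker inspect --format '{{ .HostConfig.AutoRemove }}' "$id")
    classify_engine "$project" "$auto_remove"
}

# Stream engine container events starting at $1 (a timestamp or a relative
# value such as 30m). Start it before the redispatch; stop it with Ctrl-C.
capture_events() {
    docker events --since "$1" --filter 'type=container' --filter "$ENGINE_FILTER"
}
```

One way to use it: source the file, run `capture_events` in a second terminal before the redispatch, and call `inspect_engine` on a live engine container. A `compose-owned` result would point at hypothesis 2, `auto-remove` at hypothesis 3, and `unmanaged` would leave hypothesis 1 as the main suspect.
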
### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":""}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled row, `findOrCreateSandboxGame` creates a fresh one, and `ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.