Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot. `KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately. No code change — this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tools/dev-deploy/ — known issues
Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.
Dev Sandbox game flips to cancelled after a dev-deploy redispatch
Symptom
A previously running "Dev Sandbox" game (created by
backend/internal/devsandbox) transitions to cancelled ~15 minutes
after a dev-deploy.yaml workflow_dispatch run finishes. The user's
browser session survives (the same device_session_id keeps working),
but the lobby shows no game because the only game it had is now
terminal. purgeTerminalSandboxGames does pick it up on the next
boot and creates a fresh sandbox, but the first redispatch leaves
the user with an empty lobby until the backend restarts again.
Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
Between 20:24:40 (status=starting) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine
container as missing.
docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
Working hypotheses
- Race between Start returning and the runtime spawn writing the container record. Bootstrap returns status=starting and the service layer's Start is supposed to drive to running via the runtime layer's container spawn. If the spawn fails silently, or the goroutine that owns it exits before persisting the runtime record, the reconciler later sees a starting game with no container and cancels.
- docker compose up -d --wait --remove-orphans interaction. --remove-orphans is documented as "remove containers for services not defined in the Compose file". Engine containers are spawned by the backend with their own labels, not under the compose project namespace, so they should be exempt, but it is worth verifying with docker inspect on a live engine container that none of its labels accidentally pin it to the name: galaxy-dev compose project.
- Engine --rm lifecycle. If the engine spawn uses --rm semantics, a transient crash that exits the process leaves no record on the host. Combined with hypothesis 1, the reconciler's "container disappeared" branch is exactly the shape we observe.
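The first hypothesis can be modelled in a few lines to show why a silent spawn failure would end in the observed cancel. This is a toy model under stated assumptions: every type and function name below (game, store, spawn, reconcile) is hypothetical and does not mirror the real code in backend/internal/runtime.

```go
package main

import "fmt"

// game is a hypothetical stand-in for a lobby game row.
type game struct {
	ID     string
	Status string // "starting", "running" or "cancelled"
}

// store stands in for the persisted runtime records, keyed by game ID.
type store struct {
	containers map[string]string // game ID -> container ID
}

// spawn models the spawn path. When fail is true it returns without
// persisting a record and without surfacing an error, which is the
// "fails silently" case from the first hypothesis.
func spawn(s *store, g *game, fail bool) {
	if fail {
		return
	}
	s.containers[g.ID] = "container-for-" + g.ID
	g.Status = "running"
}

// reconcile cancels any non-terminal game that has no container record
// and returns the message it would log, or "" if nothing changed.
func reconcile(s *store, g *game) string {
	if g.Status == "cancelled" {
		return ""
	}
	if _, ok := s.containers[g.ID]; !ok {
		g.Status = "cancelled"
		return "container disappeared"
	}
	return ""
}

func main() {
	s := &store{containers: map[string]string{}}
	g := &game{ID: "new", Status: "starting"}
	spawn(s, g, true) // the spawn fails without leaving a trace
	msg := reconcile(s, g)
	fmt.Println(g.Status, msg) // prints: cancelled container disappeared
}
```

In this model the game stays in starting with no log output until the reconciler runs, which matches the silent window in the excerpt. It does not distinguish between a spawn that never ran and a container that was removed afterwards.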
What to investigate before fixing
- Inspect backend/internal/runtime/ (spawn / reconciler) for the exact path the engine takes from status=starting to either running or start_failed. Specifically: which goroutine owns the spawn, where its error is logged, and whether start_failed is reachable from the runtime reconciler path or only from the in-bootstrap Start call.
- Check the engine container's Config.Labels, HostConfig.AutoRemove, and the --remove-orphans semantics with a deliberate redispatch and docker events --since 0 capture bracketing the deploy.
- Reproduce on a freshly seeded clean-data volume to rule out postgres-state ambiguity.
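The label and auto-remove check can be scripted against the JSON that docker inspect prints for a live engine container. The sketch below is a minimal example, not existing tooling: analyse and finding are hypothetical names, and the sample JSON is made up rather than captured from a real container. It relies on two facts: docker inspect emits a JSON array with Config.Labels and HostConfig.AutoRemove fields, and Compose marks the containers it manages with the com.docker.compose.project label.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inspectEntry keeps only the docker inspect fields named in the checklist.
type inspectEntry struct {
	Config struct {
		Labels map[string]string
	}
	HostConfig struct {
		AutoRemove bool
	}
}

// finding summarises the properties relevant to the second and third
// hypotheses.
type finding struct {
	ComposeProject string // value of com.docker.compose.project, "" if absent
	AutoRemove     bool   // true when the container was created with --rm
}

// analyse parses the JSON array printed by docker inspect and reports on
// its first entry.
func analyse(raw []byte) (finding, error) {
	var entries []inspectEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return finding{}, err
	}
	if len(entries) == 0 {
		return finding{}, fmt.Errorf("docker inspect output contained no containers")
	}
	e := entries[0]
	return finding{
		ComposeProject: e.Config.Labels["com.docker.compose.project"],
		AutoRemove:     e.HostConfig.AutoRemove,
	}, nil
}

func main() {
	// Made-up sample; replace with real docker inspect output.
	sample := []byte(`[{"Config":{"Labels":{"org.opencontainers.image.title":"galaxy-game-engine"}},"HostConfig":{"AutoRemove":true}}]`)
	f, err := analyse(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compose project=%q auto-remove=%v\n", f.ComposeProject, f.AutoRemove)
}
```

A ComposeProject value of galaxy-dev on a real engine container would suggest that --remove-orphans can see it. AutoRemove set to true would be consistent with the --rm hypothesis.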
Workaround in use today
When the sandbox game flips to cancelled, redispatch dev-deploy:
curl -X POST -n -H 'Content-Type: application/json' \
-d '{"ref":"<branch>"}' \
https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
The next boot's purgeTerminalSandboxGames removes the cancelled
row, findOrCreateSandboxGame creates a fresh one, and
ensureMembershipsAndDrive puts the new game back to running.
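If the redispatch ever needs to be triggered from a Go helper instead of curl, the same request can be built as sketched below. This is a sketch under assumptions: newDispatchRequest is a hypothetical function and "my-branch" is a placeholder ref. Authentication is deliberately left out, because curl -n reads credentials from ~/.netrc and net/http does not do that on its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// dispatchURL is the endpoint used by the curl workaround.
const dispatchURL = "https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches"

// newDispatchRequest builds the POST that the curl command sends: a JSON
// body carrying the ref and a matching Content-Type header. The caller
// must add credentials before sending it.
func newDispatchRequest(url, ref string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"ref": ref})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// "my-branch" is a placeholder for the branch to deploy.
	req, err := newDispatchRequest(dispatchURL, "my-branch")
	if err != nil {
		panic(err)
	}
	// The request is only built here, not sent.
	fmt.Println(req.Method, req.URL.Host)
}
```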
Owner
Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.