tools/dev-deploy/ — known issues
Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.
Dev Sandbox game flips to cancelled after a dev-deploy redispatch
Symptom
A previously running "Dev Sandbox" game (created by
backend/internal/devsandbox) transitions to cancelled ~15 minutes
after a dev-deploy.yaml workflow_dispatch run finishes. The user's
browser session survives (the same device_session_id keeps working),
but the lobby shows no game because the only game it had is now
terminal. purgeTerminalSandboxGames does pick it up on the next
boot and creates a fresh sandbox — but the first redispatch leaves
the user with an empty lobby until backend restarts again.
Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
Between 20:24:40 (status=starting) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine
container as missing.
docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
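To catch the removal in the act on the next repro, one cheap option is a watcher that polls the same `docker ps -a` filter and records the moment the engine row vanishes. A minimal sketch — this is a hypothetical helper, not existing repo tooling; the command runner is injected so the polling logic can be exercised without a Docker daemon:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// runner abstracts command execution so the loop is testable
// without a live Docker daemon.
type runner func(name string, args ...string) (string, error)

func dockerRun(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).Output()
	return string(out), err
}

// engineIDs returns container IDs carrying the engine image-title
// label, using the same filter quoted in the evidence above.
func engineIDs(run runner) ([]string, error) {
	out, err := run("docker", "ps", "-aq",
		"--filter", "label=org.opencontainers.image.title=galaxy-game-engine")
	if err != nil {
		return nil, err
	}
	return strings.Fields(out), nil
}

// watch polls until engine containers that were previously visible
// disappear entirely, then reports when the first empty snapshot
// was taken.
func watch(run runner, interval time.Duration, ticks int) (time.Time, bool) {
	seen := false
	for i := 0; i < ticks; i++ {
		ids, err := engineIDs(run)
		if err == nil {
			if len(ids) > 0 {
				seen = true
			} else if seen {
				return time.Now(), true
			}
		}
		time.Sleep(interval)
	}
	return time.Time{}, false
}

func main() {
	// Poll every 5s for up to an hour.
	if t, ok := watch(dockerRun, 5*time.Second, 720); ok {
		fmt.Println("engine container disappeared at", t.Format(time.RFC3339))
	}
}
```

Left running across a redispatch, this narrows the removal to a ~5s window that can then be matched against the Docker daemon journal.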
What has been ruled out
A live docker inspect on a healthy engine container shows:
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
galaxy.game_id=<uuid>,
org.opencontainers.image.title=galaxy-game-engine,
com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
There are no com.docker.compose.* labels and AutoRemove=false,
so --remove-orphans cannot reap the engine and a --rm-style
self-destruct is not in play. Two redispatches captured under
docker events --filter event=create,start,die,destroy,kill,stop
also confirmed it: across both runs the only die / destroy
events were for galaxy-dev-{backend,api,caddy}. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through byGameID / byContainerID.
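For the next capture, the event stream can be filtered programmatically instead of eyeballed. A sketch of a per-line filter over `docker events --format '{{json .}}'` output, which emits one JSON object per event with the container's labels under `Actor.Attributes` — the helper itself is hypothetical, not existing tooling:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event mirrors only the fields of `docker events --format '{{json .}}'`
// needed here; container events include the container's labels in
// Actor.Attributes.
type event struct {
	Action string `json:"Action"`
	Actor  struct {
		ID         string            `json:"ID"`
		Attributes map[string]string `json:"Attributes"`
	} `json:"Actor"`
}

// engineDeath reports whether a raw event line is a die/destroy
// event for a galaxy-game-engine container.
func engineDeath(line []byte) (id string, ok bool) {
	var e event
	if err := json.Unmarshal(line, &e); err != nil {
		return "", false
	}
	if e.Action != "die" && e.Action != "destroy" {
		return "", false
	}
	if e.Actor.Attributes["org.opencontainers.image.title"] != "galaxy-game-engine" {
		return "", false
	}
	return e.Actor.ID, true
}

func main() {
	// Typically fed one line at a time from:
	//   docker events --format '{{json .}}'
	sample := []byte(`{"Action":"destroy","Actor":{"ID":"abc","Attributes":{"org.opencontainers.image.title":"galaxy-game-engine"}}}`)
	if id, ok := engineDeath(sample); ok {
		fmt.Println("engine container removed:", id)
	}
}
```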
backend/internal/runtime/service.go only removes engine
containers from the explicit runStop / runRestart / runPatch
paths. There is no runtime.Service.Shutdown that proactively
kills containers on backend exit, so a graceful SIGTERM to
galaxy-dev-backend will not touch its child engine containers.
Remaining hypotheses
- Engine self-crashed and was reaped by something host-side.
  `RestartPolicy=on-failure` only retries within Docker's own
  limits; if the engine exited cleanly (status 0) Docker does not
  restart it, but does keep the row in `docker ps -a`. The
  reproduction case had the engine missing from `docker ps -a`
  entirely, so a separate cleanup (cron `docker container prune`,
  a host script, manual `docker rm`) needs to be ruled out.
- An out-of-band Docker daemon restart dropped the container.
  A `dockerd` restart that loses sight of an unmanaged container
  is rare, but would explain why both the live tracking and
  `docker ps -a` are empty. Correlate the gap with
  `journalctl -u docker` on the host.
- `runStart` errored at `waitForEngineHealthz` or `Engine.Init`
  and the engine exited on its own before `status=running`.
  Bootstrap logs `status=starting` and then is silent until the
  reconciler 15 minutes later; the runtime row in that case should
  have been written with `status=engine_unreachable`, so any
  reproduction needs a `runtime_records` snapshot from the bad
  window — that table got wiped together with the cancelled game
  on the next boot, so the post-mortem currently lacks it.
What to investigate next
- On the dev host: list cron jobs, systemd timers, and any custom
  shell that periodically runs `docker container prune` or
  `docker system prune`. The host also runs gitea + crowdsec, so
  unrelated maintenance is plausible.
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00'
  --until '2026-05-16 20:56:34'` for the original repro window —
  confirm whether the daemon flagged a container removal in that
  gap.
- Re-run with backend logging level `debug` so the
  `runtime.scheduler` and `runtime.workers` paths surface their
  per-game timer / job decisions. The current `info` level says
  nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game before the next
  boot purges it; the column set (`status`,
  `current_container_id`, `engine_endpoint`) tells whether the
  engine ever reached `running` or stopped at
  `engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
  postgres-state ambiguity.
Workaround in use today
When the sandbox game flips to cancelled, redispatch dev-deploy:
curl -X POST -n -H 'Content-Type: application/json' \
-d '{"ref":"<branch>"}' \
https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
The next boot's purgeTerminalSandboxGames removes the cancelled
row, findOrCreateSandboxGame creates a fresh one, and
ensureMembershipsAndDrive puts the new game back to running.
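Should the workaround need automating (say, from a watchdog), the same dispatch call can be made from Go using Gitea token auth instead of curl's `-n`/.netrc. A sketch — the function name and env vars are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

// redispatch POSTs a workflow_dispatch for dev-deploy.yaml, the
// same call the curl workaround makes. baseURL is a parameter so
// the request shape can be tested against a stub server.
func redispatch(baseURL, token, ref string) error {
	body := bytes.NewBufferString(fmt.Sprintf(`{"ref":%q}`, ref))
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches",
		body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "token "+token) // Gitea API token auth
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("dispatch failed: %s", resp.Status)
	}
	return nil
}

func main() {
	err := redispatch("https://gitea.iliadenisov.ru",
		os.Getenv("GITEA_TOKEN"), os.Getenv("DEPLOY_REF"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```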
Owner
Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.