From 49f614926a82bd617a3e7248ac62f587a7c2f266 Mon Sep 17 00:00:00 2001
From: Ilia Denisov
Date: Sat, 16 May 2026 23:16:51 +0200
Subject: [PATCH] KNOWN-ISSUES: park sandbox-cancel; owner rejected
 host-side hypotheses

After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a short note that those
hypotheses were considered and rejected, narrow the open suspicion
to the `dev-deploy.yaml` job sequence (`docker build` +
`docker compose build` + the alpine `docker run --rm` for UI
seeding + `docker compose up -d --wait --remove-orphans`), and
park the entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy starts.

Co-Authored-By: Claude Opus 4.7
---
 tools/dev-deploy/KNOWN-ISSUES.md | 94 +++++++++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 40 deletions(-)

diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md
index b3b7277..32ab2d5 100644
--- a/tools/dev-deploy/KNOWN-ISSUES.md
+++ b/tools/dev-deploy/KNOWN-ISSUES.md
@@ -74,49 +74,63 @@
 paths. There is no `runtime.Service.Shutdown` that proactively
 kills containers on backend exit, so a graceful SIGTERM to
 `galaxy-dev-backend` will not touch its child engine containers.
 
-### Remaining hypotheses
+### Host-side hypotheses considered and rejected by the owner
 
-1. **Engine self-crashed and was reaped by something host-side.**
-   `RestartPolicy=on-failure` only retries within Docker's own
-   limits; if the engine exited cleanly (status 0) Docker does
-   not restart, but does keep the row in `docker ps -a`. The
-   reproduction case had the engine missing from `docker ps -a`
-   entirely, so a separate cleanup (cron `docker container prune`,
-   a host script, manual `docker rm`) needs to be ruled out.
-2. **An out-of-band Docker daemon restart dropped the container.**
-   A `dockerd` restart that loses sight of an unmanaged container
-   is rare, but would explain why both the live tracking and
-   `docker ps -a` are empty. Correlate the gap with
-   `journalctl -u docker` on the host.
-3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
-   and the engine exited on its own before `status=running`.**
-   Bootstrap logs `status=starting` and then is silent until the
-   reconciler 15 minutes later; the runtime row in that case
-   should have been written with `status=engine_unreachable`, so
-   any reproduction needs a `runtime_records` snapshot from the
-   bad window — that table got wiped together with the cancelled
-   game on the next boot, so the post-mortem currently lacks it.
+The natural follow-up suspects after compose itself was cleared —
+host-side `docker prune` cron jobs, a manual `docker rm`, an
+out-of-band `dockerd` restart, and an idle-state engine crash —
+were all rejected by the project owner: the dev host runs none of
+those periodic cleanups, no one manually removed the container,
+dockerd was not restarted in the window, and the engine binary
+does not crash while idling on API calls.
 
-### What to investigate next
+### Best remaining suspicion
 
-- On the dev host: list cron jobs, systemd timers, and any custom
-  shell that periodically runs `docker container prune` or
-  `docker system prune`. The host also runs gitea + crowdsec so
-  unrelated maintenance is plausible.
-- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
-  '2026-05-16 20:56:34'` for the original repro window — confirm
-  whether the daemon flagged a container removal in that gap.
-- Re-run with backend logging level `debug` so the
-  `runtime.scheduler` and `runtime.workers` paths surface their
-  per-game timer / job decisions. The current `info` level says
-  nothing between bootstrap and the reconciler.
-- Capture `runtime_records` for the broken game *before* the next
-  boot purges it; the column set
-  (`status`, `current_container_id`, `engine_endpoint`) tells
-  whether the engine ever reached `running` or stopped at
-  `engine_unreachable`.
-- Reproduce on a freshly seeded `clean-data` volume to rule out
-  postgres-state ambiguity.
+Something the `dev-deploy.yaml` CI run does between the successful
+image builds and the final `docker compose up -d --wait
+--remove-orphans` clobbers the previously spawned engine container.
+In run order, the job executes:
+
+1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
+2. `docker compose build galaxy-backend galaxy-api`
+3. `docker run --rm` alpine for the UI volume seed
+4. `docker compose up -d --wait --remove-orphans`
+
+None of these *should* touch an unmanaged engine container, but
+the reproduction window points squarely inside this sequence. A
+deliberate next reproduction with `docker events --since 0` armed
+*before* the deploy starts and live for the entire job — captured
+end-to-end on the dev host, not just the chunk after the backend
+recreate — would pin down which step emits the `destroy` on the
+engine container.
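+
+As a minimal sketch of arming that capture (the log path, the
+filter, and the `grep` pattern here are assumptions, not repo
+conventions):
+
+```sh
+# Arm on the dev host *before* triggering dev-deploy.yaml and
+# leave it running for the whole job, not just part of it.
+docker events --since 0 \
+  --filter 'type=container' \
+  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}} {{.ID}}' \
+  > /tmp/deploy-docker-events.log 2>&1 &
+echo $! > /tmp/deploy-docker-events.pid
+
+# ... trigger dev-deploy.yaml and wait for the job to finish ...
+# Then see which step removed the engine container:
+grep -E ' (kill|die|destroy) ' /tmp/deploy-docker-events.log
+kill "$(cat /tmp/deploy-docker-events.pid)"
+```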
+
+### Status
+
+Parked. The bug is mildly disruptive (one redispatch + a manual
+`make seed-ui`-style follow-up brings the sandbox back) and the
+remaining hypotheses are speculative. If the symptom recurs, attach
+the next bad-window `docker events` capture to this entry and
+reopen. A `tools/dev-deploy/` rewrite may obviate the issue
+entirely; that is on the project owner's medium-term list.
 
 ### Workaround in use today