KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses

After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ilia Denisov
2026-05-16 23:16:51 +02:00
parent cadb72b412
commit 49f614926a
+34 -40
@@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.
### Remaining hypotheses
### Host-side hypotheses considered and rejected by the owner
1. **Engine self-crashed and was reaped by something host-side.**
`RestartPolicy=on-failure` only retries within Docker's own
limits; if the engine exited cleanly (status 0) Docker does
not restart, but does keep the row in `docker ps -a`. The
reproduction case had the engine missing from `docker ps -a`
entirely, so a separate cleanup (cron `docker container prune`,
a host script, manual `docker rm`) needs to be ruled out.
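The restart-policy reasoning above can be written down as a toy
decision function (illustrative only: the function and verdict
names are ours, not Docker's API):

```python
# Toy model of the Docker restart-policy behavior described above.
# Illustrative only: names are ours, not Docker's API.

def dockerd_action(exit_code: int, policy: str,
                   retries_used: int, max_retries: int) -> str:
    """What the daemon does when a container's main process exits."""
    if policy == "on-failure":
        # max_retries == 0 means unlimited retries for on-failure
        if exit_code != 0 and (max_retries == 0 or retries_used < max_retries):
            return "restart"          # non-zero exit, retries remain
        return "keep-exited-row"      # clean exit or retries spent:
                                      # no restart, row stays in `docker ps -a`
    if policy == "always":
        return "restart"
    return "keep-exited-row"          # policy "no": never restarted

# A clean exit under on-failure leaves the row visible in `docker ps -a`.
print(dockerd_action(exit_code=0, policy="on-failure",
                     retries_used=0, max_retries=3))
```

The point for this entry: neither branch of `on-failure` deletes
the `docker ps -a` row, so a row missing entirely implies an
external remove.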
2. **An out-of-band Docker daemon restart dropped the container.**
A `dockerd` restart that loses sight of an unmanaged container
is rare, but would explain why both the live tracking and
`docker ps -a` are empty. Correlate the gap with
`journalctl -u docker` on the host.
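The correlation step could be scripted roughly like this (a
sketch: the journal timestamp layout and the systemd
"Docker Application Container Engine" unit-message text are
assumptions about the host, not captured output):

```python
# Sketch: scan a `journalctl -u docker` dump for daemon stop/start
# lines inside the repro window. Timestamp layout and message text
# are assumptions, not verified host output.
from datetime import datetime

WINDOW = (datetime(2026, 5, 16, 20, 50, 0),
          datetime(2026, 5, 16, 20, 56, 34))
RESTART_MARKERS = ("Starting Docker Application Container Engine",
                   "Stopped Docker Application Container Engine")

def restarts_in_window(journal_lines, year=2026):
    hits = []
    for line in journal_lines:
        # default journalctl prefix: "May 16 20:53:10 host unit[pid]: msg"
        try:
            ts = datetime.strptime(" ".join(line.split()[:3]),
                                   "%b %d %H:%M:%S").replace(year=year)
        except ValueError:
            continue
        if WINDOW[0] <= ts <= WINDOW[1] and any(m in line for m in RESTART_MARKERS):
            hits.append(line)
    return hits

sample = [
    "May 16 20:40:01 devhost systemd[1]: Started crowdsec.service.",
    "May 16 20:53:10 devhost systemd[1]: Stopped Docker Application Container Engine.",
]
print(restarts_in_window(sample))
```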
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
and the engine exited on its own before `status=running`.**
Bootstrap logs `status=starting` and then is silent until the
reconciler 15 minutes later; the runtime row in that case
should have been written with `status=engine_unreachable`, so
any reproduction needs a `runtime_records` snapshot from the
bad window — that table got wiped together with the cancelled
game on the next boot, so the post-mortem currently lacks it.
The natural follow-up suspects after compose was cleared
(host-side `docker prune` cron jobs, a manual `docker rm`, an
out-of-band `dockerd` restart, and an idle-state engine crash)
were all rejected by the project owner: the dev host runs none of
those periodic cleanups, no one manually removed the container,
dockerd was not restarted in the window, and the engine binary
does not crash while idling on API calls.
### What to investigate next
### Best remaining suspicion
- On the dev host: list cron jobs, systemd timers, and any custom
  shell script that periodically runs `docker container prune` or
  `docker system prune`. The host also runs gitea + crowdsec, so
  unrelated maintenance is plausible.
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
'2026-05-16 20:56:34'` for the original repro window — confirm
whether the daemon flagged a container removal in that gap.
- Re-run with backend logging level `debug` so the
`runtime.scheduler` and `runtime.workers` paths surface their
per-game timer / job decisions. The current `info` level says
nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next
boot purges it; the column set
(`status`, `current_container_id`, `engine_endpoint`) tells
whether the engine ever reached `running` or stopped at
`engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
postgres-state ambiguity.
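If a `runtime_records` snapshot does land before the next boot
purges it, triage is mechanical. A hypothetical helper (column
names come from the entry above; the verdict strings are ours):

```python
# Hypothetical triage helper for a captured runtime_records row.
# Column names are from this entry; verdict strings are ours.
def triage(row: dict) -> str:
    status = row.get("status")
    if status == "running" and row.get("current_container_id"):
        return "engine reached running; removal happened after startup"
    if status == "engine_unreachable":
        return "engine never answered healthz; exited during bootstrap"
    return f"inconclusive (status={status!r}); snapshot more columns"

print(triage({"status": "engine_unreachable",
              "current_container_id": "abc123",
              "engine_endpoint": "http://engine:8080"}))
```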
Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The chain at runtime contains:
1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`
None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job — captured
end-to-end on the dev host, not just the chunk after backend
recreate — would pin which step emits the `destroy` on the engine.
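That capture could be post-processed with a few lines of Python (a
sketch: it assumes the events were captured with
`--format '{{json .}}'` so each line is one JSON object, and the
"galaxy-engine" name fragment is our guess at the container name):

```python
# Sketch: find which deploy step destroyed the engine container in a
# `docker events --since 0 --format '{{json .}}'` capture (one JSON
# object per line). The "galaxy-engine" name match is an assumption.
import json

def engine_destroy_events(lines, name_fragment="galaxy-engine"):
    hits = []
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        if ev.get("Type") == "container" and ev.get("Action") in ("kill", "die", "destroy"):
            name = ev.get("Actor", {}).get("Attributes", {}).get("name", "")
            if name_fragment in name:
                hits.append((ev.get("time"), ev["Action"], name))
    return hits

capture = [
    '{"Type":"container","Action":"destroy",'
    '"Actor":{"ID":"a1","Attributes":{"name":"galaxy-engine-g42"}},'
    '"time":1765310163}',
    '{"Type":"image","Action":"tag",'
    '"Actor":{"ID":"sha256:b2","Attributes":{"name":"galaxy-engine:dev"}},'
    '"time":1765310170}',
]
print(engine_destroy_events(capture))
```

Sorting the hits by `time` against the CI job's step timestamps is
what would pin the offending step.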
### Status
Parked. The bug is mildly disruptive (one redispatch + a manual
`make seed-ui`-style follow-up brings the sandbox back) and the
remaining hypotheses are speculative. If the symptom recurs, attach
the next bad-window `docker events` capture to this entry and
reopen. A `tools/dev-deploy/` rewrite may obviate the issue
entirely; that is on the project owner's medium-term list.
### Workaround in use today