KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses

After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
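For that reopen condition, the capture has to be armed before the first
deploy step runs, not after. A minimal sketch of a helper that does this —
the script name, log name, and `--format` template are my own choices; only
the `docker events --since 0` invocation comes from this entry:

```shell
#!/bin/sh
# Write a small helper script that arms a docker-events capture.
# Run the generated script on the dev host BEFORE triggering the
# dev-deploy.yaml job, and leave it running for the entire deploy.
cat > arm-docker-events.sh <<'EOF'
#!/bin/sh
log="docker-events-$(date +%Y%m%dT%H%M%S).log"
# --since 0 replays the daemon's full event history, then streams live
# events; the filter narrows output to container lifecycle events so the
# log stays small enough to read after the deploy finishes.
docker events --since 0 \
  --filter type=container \
  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
  >"$log" 2>&1 &
echo "events capture armed, pid $!, log $log"
EOF
chmod +x arm-docker-events.sh
echo "wrote arm-docker-events.sh"
```

After the deploy completes, grepping the log for `destroy` (and the engine
container's name) should show which step in the job sequence removed it.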

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author: Ilia Denisov
Date:   2026-05-16 23:16:51 +02:00
parent cadb72b412
commit 49f614926a
+34 -40
@@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively
 kills containers on backend exit, so a graceful SIGTERM to
 `galaxy-dev-backend` will not touch its child engine containers.
 
-### Remaining hypotheses
+### Host-side hypotheses considered and rejected by the owner
 
-1. **Engine self-crashed and was reaped by something host-side.**
-   `RestartPolicy=on-failure` only retries within Docker's own
-   limits; if the engine exited cleanly (status 0) Docker does
-   not restart, but does keep the row in `docker ps -a`. The
-   reproduction case had the engine missing from `docker ps -a`
-   entirely, so a separate cleanup (cron `docker container prune`,
-   a host script, manual `docker rm`) needs to be ruled out.
-2. **An out-of-band Docker daemon restart dropped the container.**
-   A `dockerd` restart that loses sight of an unmanaged container
-   is rare, but would explain why both the live tracking and
-   `docker ps -a` are empty. Correlate the gap with
-   `journalctl -u docker` on the host.
-3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
-   and the engine exited on its own before `status=running`.**
-   Bootstrap logs `status=starting` and then is silent until the
-   reconciler 15 minutes later; the runtime row in that case
-   should have been written with `status=engine_unreachable`, so
-   any reproduction needs a `runtime_records` snapshot from the
-   bad window — that table got wiped together with the cancelled
-   game on the next boot, so the post-mortem currently lacks it.
+The natural follow-up suspects after compose was cleared — host-side
+`docker prune` cron jobs, a manual `docker rm`, an out-of-band
+`dockerd` restart, and an idle-state engine crash — were all
+rejected by the project owner: the dev host runs none of those
+periodic cleanups, no one manually removed the container, dockerd
+was not restarted in the window, and the engine binary does not
+crash while idling on API calls.
 
-### What to investigate next
+### Best remaining suspicion
 
-- On the dev host: list cron jobs, systemd timers, and any custom
-  shell that periodically runs `docker container prune` or
-  `docker system prune`. The host also runs gitea + crowdsec so
-  unrelated maintenance is plausible.
-- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
-  '2026-05-16 20:56:34'` for the original repro window — confirm
-  whether the daemon flagged a container removal in that gap.
-- Re-run with backend logging level `debug` so the
-  `runtime.scheduler` and `runtime.workers` paths surface their
-  per-game timer / job decisions. The current `info` level says
-  nothing between bootstrap and the reconciler.
-- Capture `runtime_records` for the broken game *before* the next
-  boot purges it; the column set
-  (`status`, `current_container_id`, `engine_endpoint`) tells
-  whether the engine ever reached `running` or stopped at
-  `engine_unreachable`.
-- Reproduce on a freshly seeded `clean-data` volume to rule out
-  postgres-state ambiguity.
+Something the `dev-deploy.yaml` CI run does between successful
+image builds and the final `docker compose up -d --wait
+--remove-orphans` clobbers the previously-spawned engine container.
+The chain at runtime contains:
+
+1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
+2. `docker compose build galaxy-backend galaxy-api`
+3. `docker run --rm` alpine for the UI volume seed
+4. `docker compose up -d --wait --remove-orphans`
+
+None of these *should* touch an unmanaged engine container, but
+the reproduction window points squarely inside this sequence. A
+deliberate next reproduction with `docker events --since 0` armed
+*before* the deploy starts and live for the entire job — captured
+end-to-end on the dev host, not just the chunk after backend
+recreate — would pin which step emits the `destroy` on the engine.
+
+### Status
+
+Parked. The bug is mildly disruptive (one redispatch + a manual
+`make seed-ui`-style follow-up brings the sandbox back) and the
+remaining hypotheses are speculative. If the symptom recurs, attach
+the next bad-window `docker events` capture to this entry and
+reopen. A `tools/dev-deploy/` rewrite may obviate the issue
+entirely; that is on the project owner's medium-term list.
 
 ### Workaround in use today