KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses

After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
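For that reopen condition, the capture has to be armed before the first
deploy step runs, not after. A minimal sketch of a helper that does this —
the script name, log name, and `--format` template are my own choices; only
the `docker events --since 0` invocation comes from this entry:

```shell
#!/bin/sh
# Write a small helper script that arms a docker-events capture.
# Run the generated script on the dev host BEFORE triggering the
# dev-deploy.yaml job, and leave it running for the entire deploy.
cat > arm-docker-events.sh <<'EOF'
#!/bin/sh
log="docker-events-$(date +%Y%m%dT%H%M%S).log"
# --since 0 replays the daemon's full event history, then streams live
# events; the filter narrows output to container lifecycle events so the
# log stays small enough to read after the deploy finishes.
docker events --since 0 \
  --filter type=container \
  --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
  >"$log" 2>&1 &
echo "events capture armed, pid $!, log $log"
EOF
chmod +x arm-docker-events.sh
echo "wrote arm-docker-events.sh"
```

After the deploy completes, grepping the log for `destroy` (and the engine
container's name) should show which step in the job sequence removed it.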

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author: Ilia Denisov
Date:   2026-05-16 23:16:51 +02:00
parent cadb72b412
commit 49f614926a
+34 -40
@@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively
 kills containers on backend exit, so a graceful SIGTERM to
 `galaxy-dev-backend` will not touch its child engine containers.
 
-### Remaining hypotheses
+### Host-side hypotheses considered and rejected by the owner
 
-1. **Engine self-crashed and was reaped by something host-side.**
-   `RestartPolicy=on-failure` only retries within Docker's own
-   limits; if the engine exited cleanly (status 0) Docker does
-   not restart, but does keep the row in `docker ps -a`. The
-   reproduction case had the engine missing from `docker ps -a`
-   entirely, so a separate cleanup (cron `docker container prune`,
-   a host script, manual `docker rm`) needs to be ruled out.
-2. **An out-of-band Docker daemon restart dropped the container.**
-   A `dockerd` restart that loses sight of an unmanaged container
-   is rare, but would explain why both the live tracking and
-   `docker ps -a` are empty. Correlate the gap with
-   `journalctl -u docker` on the host.
-3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
-   and the engine exited on its own before `status=running`.**
-   Bootstrap logs `status=starting` and then is silent until the
-   reconciler 15 minutes later; the runtime row in that case
-   should have been written with `status=engine_unreachable`, so
-   any reproduction needs a `runtime_records` snapshot from the
-   bad window — that table got wiped together with the cancelled
-   game on the next boot, so the post-mortem currently lacks it.
+The natural follow-up suspects after compose was cleared — host-side
+`docker prune` cron jobs, a manual `docker rm`, an out-of-band
+`dockerd` restart, and an idle-state engine crash — were all
+rejected by the project owner: the dev host runs none of those
+periodic cleanups, no one manually removed the container, dockerd
+was not restarted in the window, and the engine binary does not
+crash while idling on API calls.
 
-### What to investigate next
+### Best remaining suspicion
 
-- On the dev host: list cron jobs, systemd timers, and any custom
-  shell that periodically runs `docker container prune` or
-  `docker system prune`. The host also runs gitea + crowdsec so
-  unrelated maintenance is plausible.
-- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
-  '2026-05-16 20:56:34'` for the original repro window — confirm
-  whether the daemon flagged a container removal in that gap.
-- Re-run with backend logging level `debug` so the
-  `runtime.scheduler` and `runtime.workers` paths surface their
-  per-game timer / job decisions. The current `info` level says
-  nothing between bootstrap and the reconciler.
-- Capture `runtime_records` for the broken game *before* the next
-  boot purges it; the column set
-  (`status`, `current_container_id`, `engine_endpoint`) tells
-  whether the engine ever reached `running` or stopped at
-  `engine_unreachable`.
-- Reproduce on a freshly seeded `clean-data` volume to rule out
-  postgres-state ambiguity.
+Something the `dev-deploy.yaml` CI run does between successful
+image builds and the final `docker compose up -d --wait
+--remove-orphans` clobbers the previously-spawned engine container.
+The chain at runtime contains:
+
+1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
+2. `docker compose build galaxy-backend galaxy-api`
+3. `docker run --rm` alpine for the UI volume seed
+4. `docker compose up -d --wait --remove-orphans`
+
+None of these *should* touch an unmanaged engine container, but
+the reproduction window points squarely inside this sequence. A
+deliberate next reproduction with `docker events --since 0` armed
+*before* the deploy starts and live for the entire job — captured
+end-to-end on the dev host, not just the chunk after backend
+recreate — would pin which step emits the `destroy` on the engine.
+
+### Status
+
+Parked. The bug is mildly disruptive (one redispatch + a manual
+`make seed-ui`-style follow-up brings the sandbox back) and the
+remaining hypotheses are speculative. If the symptom recurs, attach
+the next bad-window `docker events` capture to this entry and
+reopen. A `tools/dev-deploy/` rewrite may obviate the issue
+entirely; that is on the project owner's medium-term list.
 
 ### Workaround in use today