KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses

After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ilia Denisov
2026-05-16 23:16:51 +02:00
parent cadb72b412
commit 49f614926a
+34 -40
@@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.
### Remaining hypotheses
### Host-side hypotheses considered and rejected by the owner
1. **Engine self-crashed and was reaped by something host-side.**
`RestartPolicy=on-failure` only retries within Docker's own
limits; if the engine exited cleanly (status 0) Docker does
not restart, but does keep the row in `docker ps -a`. The
reproduction case had the engine missing from `docker ps -a`
entirely, so a separate cleanup (cron `docker container prune`,
a host script, manual `docker rm`) needs to be ruled out.
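The restart-policy reasoning above can be written down as a toy
decision function (illustrative only: the function and verdict
names are ours, not Docker's API):

```python
# Toy model of the Docker restart-policy behavior described above.
# Illustrative only: names are ours, not Docker's API.

def dockerd_action(exit_code: int, policy: str,
                   retries_used: int, max_retries: int) -> str:
    """What the daemon does when a container's main process exits."""
    if policy == "on-failure":
        # max_retries == 0 means unlimited retries for on-failure
        if exit_code != 0 and (max_retries == 0 or retries_used < max_retries):
            return "restart"          # non-zero exit, retries remain
        return "keep-exited-row"      # clean exit or retries spent:
                                      # no restart, row stays in `docker ps -a`
    if policy == "always":
        return "restart"
    return "keep-exited-row"          # policy "no": never restarted

# A clean exit under on-failure leaves the row visible in `docker ps -a`.
print(dockerd_action(exit_code=0, policy="on-failure",
                     retries_used=0, max_retries=3))
```

The point for this entry: neither branch of `on-failure` deletes
the `docker ps -a` row, so a row missing entirely implies an
external remove.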
2. **An out-of-band Docker daemon restart dropped the container.**
A `dockerd` restart that loses sight of an unmanaged container
is rare, but would explain why both the live tracking and
`docker ps -a` are empty. Correlate the gap with
`journalctl -u docker` on the host.
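The correlation step could be scripted roughly like this (a
sketch: the journal timestamp layout and the systemd
"Docker Application Container Engine" unit-message text are
assumptions about the host, not captured output):

```python
# Sketch: scan a `journalctl -u docker` dump for daemon stop/start
# lines inside the repro window. Timestamp layout and message text
# are assumptions, not verified host output.
from datetime import datetime

WINDOW = (datetime(2026, 5, 16, 20, 50, 0),
          datetime(2026, 5, 16, 20, 56, 34))
RESTART_MARKERS = ("Starting Docker Application Container Engine",
                   "Stopped Docker Application Container Engine")

def restarts_in_window(journal_lines, year=2026):
    hits = []
    for line in journal_lines:
        # default journalctl prefix: "May 16 20:53:10 host unit[pid]: msg"
        try:
            ts = datetime.strptime(" ".join(line.split()[:3]),
                                   "%b %d %H:%M:%S").replace(year=year)
        except ValueError:
            continue
        if WINDOW[0] <= ts <= WINDOW[1] and any(m in line for m in RESTART_MARKERS):
            hits.append(line)
    return hits

sample = [
    "May 16 20:40:01 devhost systemd[1]: Started crowdsec.service.",
    "May 16 20:53:10 devhost systemd[1]: Stopped Docker Application Container Engine.",
]
print(restarts_in_window(sample))
```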
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
and the engine exited on its own before `status=running`.**
Bootstrap logs `status=starting` and then is silent until the
reconciler 15 minutes later; the runtime row in that case
should have been written with `status=engine_unreachable`, so
any reproduction needs a `runtime_records` snapshot from the
bad window — that table got wiped together with the cancelled
game on the next boot, so the post-mortem currently lacks it.
The natural follow-up suspects after compose was cleared
(host-side `docker prune` cron jobs, a manual `docker rm`, an
out-of-band `dockerd` restart, and an idle-state engine crash)
were all rejected by the project owner: the dev host runs none of
those periodic cleanups, no one manually removed the container,
dockerd was not restarted in the window, and the engine binary
does not crash while idling on API calls.
### What to investigate next
### Best remaining suspicion
- On the dev host: list cron jobs, systemd timers, and any custom
  shell script that periodically runs `docker container prune` or
  `docker system prune`. The host also runs gitea + crowdsec, so
  unrelated maintenance is plausible.
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
'2026-05-16 20:56:34'` for the original repro window — confirm
whether the daemon flagged a container removal in that gap.
- Re-run with backend logging level `debug` so the
`runtime.scheduler` and `runtime.workers` paths surface their
per-game timer / job decisions. The current `info` level says
nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next
boot purges it; the column set
(`status`, `current_container_id`, `engine_endpoint`) tells
whether the engine ever reached `running` or stopped at
`engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
postgres-state ambiguity.
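If a `runtime_records` snapshot does land before the next boot
purges it, triage is mechanical. A hypothetical helper (column
names come from the entry above; the verdict strings are ours):

```python
# Hypothetical triage helper for a captured runtime_records row.
# Column names are from this entry; verdict strings are ours.
def triage(row: dict) -> str:
    status = row.get("status")
    if status == "running" and row.get("current_container_id"):
        return "engine reached running; removal happened after startup"
    if status == "engine_unreachable":
        return "engine never answered healthz; exited during bootstrap"
    return f"inconclusive (status={status!r}); snapshot more columns"

print(triage({"status": "engine_unreachable",
              "current_container_id": "abc123",
              "engine_endpoint": "http://engine:8080"}))
```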
Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The chain at runtime contains:
1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`
None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job — captured
end-to-end on the dev host, not just the chunk after backend
recreate — would pin which step emits the `destroy` on the engine.
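That capture could be post-processed with a few lines of Python (a
sketch: it assumes the events were captured with
`--format '{{json .}}'` so each line is one JSON object, and the
"galaxy-engine" name fragment is our guess at the container name):

```python
# Sketch: find which deploy step destroyed the engine container in a
# `docker events --since 0 --format '{{json .}}'` capture (one JSON
# object per line). The "galaxy-engine" name match is an assumption.
import json

def engine_destroy_events(lines, name_fragment="galaxy-engine"):
    hits = []
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue
        if ev.get("Type") == "container" and ev.get("Action") in ("kill", "die", "destroy"):
            name = ev.get("Actor", {}).get("Attributes", {}).get("name", "")
            if name_fragment in name:
                hits.append((ev.get("time"), ev["Action"], name))
    return hits

capture = [
    '{"Type":"container","Action":"destroy",'
    '"Actor":{"ID":"a1","Attributes":{"name":"galaxy-engine-g42"}},'
    '"time":1765310163}',
    '{"Type":"image","Action":"tag",'
    '"Actor":{"ID":"sha256:b2","Attributes":{"name":"galaxy-engine:dev"}},'
    '"time":1765310170}',
]
print(engine_destroy_events(capture))
```

Sorting the hits by `time` against the CI job's step timestamps is
what would pin the offending step.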
### Status
Parked. The bug is mildly disruptive (one redispatch + a manual
`make seed-ui`-style follow-up brings the sandbox back) and the
remaining hypotheses are speculative. If the symptom recurs, attach
the next bad-window `docker events` capture to this entry and
reopen. A `tools/dev-deploy/` rewrite may obviate the issue
entirely; that is on the project owner's medium-term list.
### Workaround in use today