KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses
After the live investigation, the project owner confirmed that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs, with a fresh `docker events --since 0` capture armed before the deploy starts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -74,49 +74,43 @@
paths. There is no `runtime.Service.Shutdown` that proactively kills containers on backend exit, so a graceful SIGTERM to `galaxy-dev-backend` will not touch its child engine containers.
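To make that gap concrete, here is the shape of the shutdown hook the backend currently lacks. This is a hypothetical sketch, not project code; the `galaxy-engine-` name prefix and the docker-py calls shown in the comment are assumptions.

```python
def engine_container_names(names, prefix="galaxy-engine-"):
    """Select child engine containers by name; the prefix is an assumption."""
    return [n for n in names if n.startswith(prefix)]

# A graceful hook would walk the live containers and stop the matches,
# e.g. with the docker-py SDK (pip install docker):
#     client = docker.from_env()
#     for c in client.containers.list():
#         if engine_container_names([c.name]):
#             c.stop(timeout=10)
print(engine_container_names(["galaxy-engine-g42", "galaxy-backend"]))
# → ['galaxy-engine-g42']
```

Nothing of this shape runs today, which is why a clean backend exit leaves engine containers behind.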

### Host-side hypotheses considered and rejected by the owner

1. **Engine self-crashed and was reaped by something host-side.**
   `RestartPolicy=on-failure` only retries within Docker's own
   limits; if the engine exited cleanly (status 0), Docker does not
   restart it but keeps the row in `docker ps -a`. The reproduction
   case had the engine missing from `docker ps -a` entirely, so a
   separate cleanup (a cron `docker container prune`, a host script,
   a manual `docker rm`) had to be ruled out.
2. **An out-of-band Docker daemon restart dropped the container.**
   A `dockerd` restart that loses sight of an unmanaged container
   is rare, but would explain why both the live tracking and
   `docker ps -a` are empty. The proposed check was to correlate
   the gap with `journalctl -u docker` on the host.
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`,
   and the engine exited on its own before `status=running`.**
   Bootstrap logs `status=starting` and then is silent until the
   reconciler runs 15 minutes later. The runtime row in that case
   should have been written with `status=engine_unreachable`, so any
   reproduction needs a `runtime_records` snapshot from the bad
   window; that table got wiped together with the cancelled game on
   the next boot, so the post-mortem currently lacks it.
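Hypothesis 1 leans on Docker's restart-policy semantics, which are easy to mis-remember. A minimal model of the documented rules (the function and its name are illustrative, not project code):

```python
def will_restart(policy, exit_code, retries_exhausted=False):
    """Simplified model of Docker's documented restart policies; manual
    `docker stop` edge cases are not modeled. A clean exit under
    on-failure is NOT restarted, but the row still shows in
    `docker ps -a` unless something else removes the container."""
    if policy == "no":
        return False
    if policy in ("always", "unless-stopped"):
        return True
    if policy == "on-failure":
        return exit_code != 0 and not retries_exhausted
    raise ValueError(f"unknown policy: {policy}")

print(will_restart("on-failure", 0))  # → False (clean exit, no restart)
print(will_restart("on-failure", 1))  # → True  (failure, retried)
```

Under this model a clean engine exit explains a stopped container, but never one that vanished from `docker ps -a`.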

The natural follow-up suspects after compose was cleared (host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash) were all rejected
by the project owner: the dev host runs none of those periodic
cleanups, no one manually removed the container, dockerd was not
restarted in the window, and the engine binary does not crash while
idling on API calls.

### What to investigate next

- On the dev host: list cron jobs, systemd timers, and any custom
  shell that periodically runs `docker container prune` or
  `docker system prune`. The host also runs gitea + crowdsec, so
  unrelated maintenance is plausible. (Answered: per the owner, no
  such periodic cleanups run on this host.)
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
  '2026-05-16 20:56:34'` for the original repro window to confirm
  whether the daemon logged a container removal in that gap.
  (Answered: per the owner, dockerd was not restarted in the window.)
- Re-run with backend logging level `debug` so the
  `runtime.scheduler` and `runtime.workers` paths surface their
  per-game timer / job decisions. The current `info` level says
  nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next
  boot purges it; the column set
  (`status`, `current_container_id`, `engine_endpoint`) tells
  whether the engine ever reached `running` or stopped at
  `engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
  postgres-state ambiguity.
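Once debug logs exist, the bootstrap-to-reconciler silence can be located mechanically rather than by eyeballing. A sketch, assuming each log line starts with an ISO timestamp; the 15-minute threshold mirrors the reconciler interval noted above:

```python
from datetime import datetime, timedelta

def silent_gaps(lines, threshold=timedelta(minutes=15)):
    """Return (start, end) pairs of consecutive log timestamps that are
    further apart than `threshold`."""
    stamps = [datetime.fromisoformat(l.split()[0]) for l in lines if l.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]

log = [
    "2026-05-16T20:41:00 bootstrap status=starting",
    "2026-05-16T20:56:34 reconciler: engine gone, redispatching",
]
print(silent_gaps(log))  # one gap of roughly 15.5 minutes
```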

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The chain at runtime contains:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`
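Of these, only step 4 is documented to remove containers it did not itself create, and compose's `--remove-orphans` selection only considers containers carrying the project's compose labels. A simplified model of that selection (function name and sample data are illustrative):

```python
def orphan_candidates(containers, project, current_services):
    """Containers `docker compose up --remove-orphans` would remove:
    same compose project label, but a service no longer in the file.
    Simplified model of documented compose behavior; an unlabeled,
    manually-run container is never selected."""
    return [
        c["name"]
        for c in containers
        if c["labels"].get("com.docker.compose.project") == project
        and c["labels"].get("com.docker.compose.service") not in current_services
    ]

containers = [
    {"name": "galaxy-backend", "labels": {
        "com.docker.compose.project": "galaxy",
        "com.docker.compose.service": "galaxy-backend"}},
    # The engine container is spawned outside compose, so it has no labels.
    {"name": "galaxy-engine-g42", "labels": {}},
]
print(orphan_candidates(containers, "galaxy", {"galaxy-backend", "galaxy-api"}))
# → []
```

If the engine container really carries no compose labels, this model predicts `--remove-orphans` leaves it alone, which is why the capture below is needed to see what actually removes it.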

None of these *should* touch an unmanaged engine container, but the
reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job, captured
end-to-end on the dev host rather than just the chunk after the
backend recreate, would pin which step emits the `destroy` on the
engine.
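A sketch of how to read that capture afterwards, assuming events are recorded with `docker events --since 0 --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}'` and that engine containers share a `galaxy-engine-` name prefix (the prefix is an assumption):

```python
def engine_removals(event_lines, prefix="galaxy-engine-"):
    """Pick out lifecycle events that end an engine container. Each line
    is '<unix-time> <action> <container-name>' per the --format above."""
    hits = []
    for line in event_lines:
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("destroy", "die", "kill") \
                and parts[2].startswith(prefix):
            hits.append(tuple(parts))
    return hits

sample = [
    "1747428721 create galaxy-backend",
    "1747428723 destroy galaxy-engine-g42",
]
print(engine_removals(sample))
# → [('1747428723', 'destroy', 'galaxy-engine-g42')]
```

Cross-referencing the surviving timestamps against the CI job log should place the `destroy` inside one of the four steps above.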

### Status

Parked. The bug is mildly disruptive (one redispatch plus a manual
`make seed-ui`-style follow-up brings the sandbox back) and the
remaining hypotheses are speculative. If the symptom recurs, attach
the next bad-window `docker events` capture to this entry and
reopen. A `tools/dev-deploy/` rewrite may obviate the issue
entirely; that is on the project owner's medium-term list.

### Workaround in use today