KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses

After the live investigation, the project owner confirms that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs with a fresh `docker events --since 0` capture armed before the deploy starts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively
 kills containers on backend exit, so a graceful SIGTERM to
 `galaxy-dev-backend` will not touch its child engine containers.
 
-### Remaining hypotheses
+### Host-side hypotheses considered and rejected by the owner
 
-1. **Engine self-crashed and was reaped by something host-side.**
-   `RestartPolicy=on-failure` only retries within Docker's own
-   limits; if the engine exited cleanly (status 0) Docker does
-   not restart, but does keep the row in `docker ps -a`. The
-   reproduction case had the engine missing from `docker ps -a`
-   entirely, so a separate cleanup (cron `docker container prune`,
-   a host script, manual `docker rm`) needs to be ruled out.
-2. **An out-of-band Docker daemon restart dropped the container.**
-   A `dockerd` restart that loses sight of an unmanaged container
-   is rare, but would explain why both the live tracking and
-   `docker ps -a` are empty. Correlate the gap with
-   `journalctl -u docker` on the host.
-3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
-   and the engine exited on its own before `status=running`.**
-   Bootstrap logs `status=starting` and then is silent until the
-   reconciler 15 minutes later; the runtime row in that case
-   should have been written with `status=engine_unreachable`, so
-   any reproduction needs a `runtime_records` snapshot from the
-   bad window — that table got wiped together with the cancelled
-   game on the next boot, so the post-mortem currently lacks it.
+The natural follow-up suspects after compose was cleared — host-side
+`docker prune` cron jobs, a manual `docker rm`, an out-of-band
+`dockerd` restart, and an idle-state engine crash — were all
+rejected by the project owner: the dev host runs none of those
+periodic cleanups, no one manually removed the container, dockerd
+was not restarted in the window, and the engine binary does not
+crash while idling on API calls.
 
-### What to investigate next
+### Best remaining suspicion
 
-- On the dev host: list cron jobs, systemd timers, and any custom
-  shell that periodically runs `docker container prune` or
-  `docker system prune`. The host also runs gitea + crowdsec so
-  unrelated maintenance is plausible.
-- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
-  '2026-05-16 20:56:34'` for the original repro window — confirm
-  whether the daemon flagged a container removal in that gap.
-- Re-run with backend logging level `debug` so the
-  `runtime.scheduler` and `runtime.workers` paths surface their
-  per-game timer / job decisions. The current `info` level says
-  nothing between bootstrap and the reconciler.
-- Capture `runtime_records` for the broken game *before* the next
-  boot purges it; the column set
-  (`status`, `current_container_id`, `engine_endpoint`) tells
-  whether the engine ever reached `running` or stopped at
-  `engine_unreachable`.
-- Reproduce on a freshly seeded `clean-data` volume to rule out
-  postgres-state ambiguity.
+Something the `dev-deploy.yaml` CI run does between successful
+image builds and the final `docker compose up -d --wait
+--remove-orphans` clobbers the previously-spawned engine container.
+The chain at runtime contains:
+
+1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
+2. `docker compose build galaxy-backend galaxy-api`
+3. `docker run --rm` alpine for the UI volume seed
+4. `docker compose up -d --wait --remove-orphans`
+
+None of these *should* touch an unmanaged engine container, but
+the reproduction window points squarely inside this sequence. A
+deliberate next reproduction with `docker events --since 0` armed
+*before* the deploy starts and live for the entire job — captured
+end-to-end on the dev host, not just the chunk after backend
+recreate — would pin which step emits the `destroy` on the engine.
+
+### Status
+
+Parked. The bug is mildly disruptive (one redispatch + a manual
+`make seed-ui`-style follow-up brings the sandbox back) and the
+remaining hypotheses are speculative. If the symptom recurs, attach
+the next bad-window `docker events` capture to this entry and
+reopen. A `tools/dev-deploy/` rewrite may obviate the issue
+entirely; that is on the project owner's medium-term list.
 
 ### Workaround in use today
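
The `docker events --since 0` capture that the entry calls for could be armed like the sketch below. The container-name filter `galaxy-engine`, the log path, and the die/destroy/stop pattern are assumptions for illustration, not names taken from the repo:

```shell
#!/bin/sh
# Sketch: arm a whole-deploy docker-events capture BEFORE dev-deploy starts.
# ASSUMPTIONS: the engine container's name contains "galaxy-engine";
# /tmp/engine-events.log is a scratch path. Neither comes from the repo.

EVENTS_LOG=/tmp/engine-events.log
# Any die/destroy/stop on a container whose name mentions the engine.
ENGINE_EVENT_RE=' (die|destroy|stop) .*galaxy-engine'

if command -v docker >/dev/null 2>&1; then
  # --since 0 replays from the epoch, so nothing before the deploy is lost.
  docker events --since 0 \
    --filter 'type=container' \
    --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
    > "$EVENTS_LOG" &
  EVENTS_PID=$!

  # ... trigger the dev-deploy job here and wait for it to finish ...

  kill "$EVENTS_PID"
  # Which step emitted the destroy? Event timestamps line up with the CI log.
  grep -E "$ENGINE_EVENT_RE" "$EVENTS_LOG" || echo 'no engine destroy captured'
fi
```

Keeping the listener alive for the entire job, rather than starting it after the backend recreate, is the point: it is the only way to see a `destroy` emitted by the earlier build/seed steps.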
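
One cheap check during the next reproduction: `docker compose up --remove-orphans` keys on the `com.docker.compose.project` / `com.docker.compose.service` labels when deciding what to remove, so inspecting the engine container's labels would show whether step 4 can even see it. A sketch, where the container name `galaxy-engine-game42` is hypothetical:

```shell
#!/bin/sh
# Sketch: does the engine container carry compose labels at all?
# ASSUMPTION: "galaxy-engine-game42" is a stand-in for the real engine
# container name; substitute the name from `docker ps` on the dev host.

# Success if the inspected label output is non-blank, i.e. the container
# would be treated as compose-managed.
is_compose_managed() {
  [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

if command -v docker >/dev/null 2>&1; then
  labels=$(docker inspect --format \
    '{{index .Config.Labels "com.docker.compose.project"}} {{index .Config.Labels "com.docker.compose.service"}}' \
    galaxy-engine-game42 2>/dev/null)
  if is_compose_managed "$labels"; then
    echo 'engine container IS compose-labelled: --remove-orphans can touch it'
  else
    echo 'no compose labels: --remove-orphans should ignore it'
  fi
fi
```

If the labels come back empty, suspicion shifts away from step 4 and toward the earlier raw `docker build` / `docker run` steps in the chain.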