KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The repro window must therefore involve a separate trigger that
removed the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither
 running nor stopped on the host, so it either was never spawned or
 was removed before the host snapshot.
 
-### Working hypotheses
+### What has been ruled out
 
-1. **Race between `Start` returning and the runtime spawn writing the
-   container record.** Bootstrap returns `status=starting` and the
-   service layer's `Start` is supposed to drive to `running` via the
-   runtime layer's container spawn. If the spawn fails silently — or
-   the goroutine that owns it exits before persisting the runtime
-   record — the reconciler later sees a `starting` game with no
-   container and cancels.
-2. **`docker compose up -d --wait --remove-orphans` interaction.**
-   `--remove-orphans` is documented as "remove containers for
-   services not defined in the Compose file". Engine containers are
-   spawned by the backend with their own labels, not under the
-   compose project namespace, so they *should* be exempt — but it
-   is worth verifying with `docker inspect` on a live engine
-   container that none of its labels accidentally pin it to the
-   `name: galaxy-dev` compose project.
-3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
-   semantics, a transient crash that exits the process leaves no
-   record on the host. Combined with hypothesis 1, the reconciler's
-   "container disappeared" branch is exactly the shape we observe.
-
-### What to investigate before fixing
-
-- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
-  exact path the engine takes from `status=starting` to either
-  `running` or `start_failed`. Specifically: which goroutine owns
-  the spawn, where its error is logged, and whether `start_failed`
-  is reachable from the runtime reconciler path or only from the
-  in-bootstrap `Start` call.
-- Check the engine container's `Config.Labels`,
-  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
-  a deliberate redispatch and `docker events --since 0` capture
-  bracketing the deploy.
+A live `docker inspect` on a healthy engine container shows:
+
+```text
+Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
+        galaxy.game_id=<uuid>,
+        org.opencontainers.image.title=galaxy-game-engine,
+        com.galaxy.{cpu_quota,memory,pids_limit}
+AutoRemove: false
+RestartPolicy: on-failure
+NetworkMode: galaxy-dev-internal
+```
+
+There are no `com.docker.compose.*` labels and `AutoRemove=false`,
+so `--remove-orphans` cannot reap the engine and a `--rm`-style
+self-destruct is not in play. Two redispatches captured under
+`docker events --filter event=create,start,die,destroy,kill,stop`
+also confirmed it: across both runs the only `die` / `destroy`
+events were for `galaxy-dev-{backend,api,caddy}`. The live engine
+container survived both redispatches, and the reconciler that
+fires 60 seconds after the new backend boots correctly matched
+it through `byGameID` / `byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart, but does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (cron `docker container prune`,
+   a host script, manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and then is silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it.
+
+### What to investigate next
+
+- On the dev host: list cron jobs, systemd timers, and any custom
+  shell that periodically runs `docker container prune` or
+  `docker system prune`. The host also runs gitea + crowdsec so
+  unrelated maintenance is plausible.
+- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
+  '2026-05-16 20:56:34'` for the original repro window — confirm
+  whether the daemon flagged a container removal in that gap.
+- Re-run with backend logging level `debug` so the
+  `runtime.scheduler` and `runtime.workers` paths surface their
+  per-game timer / job decisions. The current `info` level says
+  nothing between bootstrap and the reconciler.
+- Capture `runtime_records` for the broken game *before* the next
+  boot purges it; the column set
+  (`status`, `current_container_id`, `engine_endpoint`) tells
+  whether the engine ever reached `running` or stopped at
+  `engine_unreachable`.
 - Reproduce on a freshly seeded `clean-data` volume to rule out
   postgres-state ambiguity.
 