KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap

A live `docker inspect` of an engine container, together with two
redispatch runs captured under `docker events`, confirms:

- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
  so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
  already running emitted `die` / `destroy` events only for
  `galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
  correctly matched the surviving engine in both cases
  (`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
  engine containers, so a graceful backend exit also leaves them
  alone.
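
The inspect / events checks above can be scripted. A minimal sketch,
assuming only standard `docker` CLI flags; the helper names are mine,
not the project's tooling:

```sh
# reap_exposure prints the two fields that decide whether a container
# could be reaped by `--remove-orphans` (compose project label) or by
# `--rm` semantics (HostConfig.AutoRemove).
reap_exposure() {
  local ctr="${1:?usage: reap_exposure <container>}"
  docker inspect --format \
    '{{.HostConfig.AutoRemove}} {{index .Config.Labels "com.docker.compose.project"}}' \
    "$ctr"
}

# engine_events brackets a redispatch: call it with timestamps around
# the deploy and look for die/destroy events naming the engine.
engine_events() {
  docker events --since "$1" --until "$2" \
    --filter type=container \
    --filter event=die --filter event=destroy
}
```

For a healthy engine the first helper should print `false` followed by
an empty compose-project field.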

The repro window therefore implies that some separate trigger removed
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
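
Those two cheapest checks can be sketched as a helper pair. The
`journalctl -u docker` unit is from the note above; cron file
locations and the grep patterns are assumptions:

```sh
# find_prune_jobs looks for host-side jobs that could have reaped the
# container: cron entries and systemd timers mentioning docker/prune.
find_prune_jobs() {
  { crontab -l 2>/dev/null; cat /etc/cron.d/* 2>/dev/null; } |
    grep -n 'docker.*prune' || true
  systemctl list-timers --all 2>/dev/null | grep -iE 'docker|prune' || true
}

# dockerd_removals greps the daemon journal for removal-shaped lines
# inside the repro window ($1/$2 feed --since/--until).
dockerd_removals() {
  journalctl -u docker --since "$1" --until "$2" 2>/dev/null |
    grep -Ei 'remov|prune|clean' || true
}
```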

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ilia Denisov · 2026-05-16 23:10:13 +02:00
commit cadb72b412 · parent 5177fef2ef
@@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither
 running nor stopped on the host, so it either was never spawned or
 was removed before the host snapshot.
-### Working hypotheses
+### What has been ruled out
-1. **Race between `Start` returning and the runtime spawn writing the
-   container record.** Bootstrap returns `status=starting` and the
-   service layer's `Start` is supposed to drive to `running` via the
-   runtime layer's container spawn. If the spawn fails silently — or
-   the goroutine that owns it exits before persisting the runtime
-   record — the reconciler later sees a `starting` game with no
-   container and cancels.
-2. **`docker compose up -d --wait --remove-orphans` interaction.**
-   `--remove-orphans` is documented as "remove containers for
-   services not defined in the Compose file". Engine containers are
-   spawned by the backend with their own labels, not under the
-   compose project namespace, so they *should* be exempt — but it
-   is worth verifying with `docker inspect` on a live engine
-   container that none of its labels accidentally pin it to the
-   `name: galaxy-dev` compose project.
-3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
-   semantics, a transient crash that exits the process leaves no
-   record on the host. Combined with hypothesis 1, the reconciler's
-   "container disappeared" branch is exactly the shape we observe.
+A live `docker inspect` on a healthy engine container shows:
-### What to investigate before fixing
+
+```text
+Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
+        galaxy.game_id=<uuid>,
+        org.opencontainers.image.title=galaxy-game-engine,
+        com.galaxy.{cpu_quota,memory,pids_limit}
+AutoRemove: false
+RestartPolicy: on-failure
+NetworkMode: galaxy-dev-internal
+```
-- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
-  exact path the engine takes from `status=starting` to either
-  `running` or `start_failed`. Specifically: which goroutine owns
-  the spawn, where its error is logged, and whether `start_failed`
-  is reachable from the runtime reconciler path or only from the
-  in-bootstrap `Start` call.
-- Check the engine container's `Config.Labels`,
-  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
-  a deliberate redispatch and `docker events --since 0` capture
-  bracketing the deploy.
+
+There are no `com.docker.compose.*` labels and `AutoRemove=false`,
+so `--remove-orphans` cannot reap the engine and a `--rm`-style
+self-destruct is not in play. Two redispatches captured under
+`docker events` (filtered to the container `create`, `start`, `die`,
+`destroy`, `kill`, and `stop` events) also confirmed it: across both
+runs the only `die` / `destroy` events were for
+`galaxy-dev-{backend,api,caddy}`. The live engine container survived
+both redispatches, and the reconciler that fires 60 seconds after
+the new backend boots correctly matched it through `byGameID` /
+`byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart it, but it does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (a cron `docker container prune`,
+   a host script, a manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but it would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`,
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and is then silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it.
+
+### What to investigate next
+
+- On the dev host: list cron jobs, systemd timers, and any custom
+  shell script that periodically runs `docker container prune` or
+  `docker system prune`. The host also runs gitea + crowdsec, so
+  unrelated maintenance is plausible.
+- Inspect `journalctl -u docker --since '2026-05-16 20:50:00'
+  --until '2026-05-16 20:56:34'` for the original repro window —
+  confirm whether the daemon logged a container removal in that gap.
+- Re-run with backend logging level `debug` so that the
+  `runtime.scheduler` and `runtime.workers` paths surface their
+  per-game timer / job decisions. The current `info` level says
+  nothing between bootstrap and the reconciler.
+- Capture `runtime_records` for the broken game *before* the next
+  boot purges it; the column set
+  (`status`, `current_container_id`, `engine_endpoint`) tells
+  whether the engine ever reached `running` or stopped at
+  `engine_unreachable`.
+- Reproduce on a freshly seeded `clean-data` volume to rule out
+  postgres-state ambiguity.
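
The `runtime_records` capture in the list above could be a one-liner
helper. A sketch only: the postgres container name, role, and database
are assumptions, while the column set is the one named in the note:

```sh
# Hypothetical snapshot of the runtime row before the next boot wipes
# it. galaxy-dev-postgres / galaxy user / galaxy db are assumptions.
snapshot_runtime_record() {
  local game_id="${1:?usage: snapshot_runtime_record <game-uuid>}"
  docker exec galaxy-dev-postgres psql -U galaxy -d galaxy -c \
    "SELECT status, current_container_id, engine_endpoint
       FROM runtime_records WHERE game_id = '${game_id}';"
}
```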