KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The repro window must therefore involve a separate trigger that
removed the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither
 running nor stopped on the host, so it either was never spawned or
 was removed before the host snapshot.
 
-### Working hypotheses
+### What has been ruled out
 
-1. **Race between `Start` returning and the runtime spawn writing the
-   container record.** Bootstrap returns `status=starting` and the
-   service layer's `Start` is supposed to drive to `running` via the
-   runtime layer's container spawn. If the spawn fails silently — or
-   the goroutine that owns it exits before persisting the runtime
-   record — the reconciler later sees a `starting` game with no
-   container and cancels.
-2. **`docker compose up -d --wait --remove-orphans` interaction.**
-   `--remove-orphans` is documented as "remove containers for
-   services not defined in the Compose file". Engine containers are
-   spawned by the backend with their own labels, not under the
-   compose project namespace, so they *should* be exempt — but it
-   is worth verifying with `docker inspect` on a live engine
-   container that none of its labels accidentally pin it to the
-   `name: galaxy-dev` compose project.
-3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
-   semantics, a transient crash that exits the process leaves no
-   record on the host. Combined with hypothesis 1, the reconciler's
-   "container disappeared" branch is exactly the shape we observe.
-
-### What to investigate before fixing
-
-- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
-  exact path the engine takes from `status=starting` to either
-  `running` or `start_failed`. Specifically: which goroutine owns
-  the spawn, where its error is logged, and whether `start_failed`
-  is reachable from the runtime reconciler path or only from the
-  in-bootstrap `Start` call.
-- Check the engine container's `Config.Labels`,
-  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
-  a deliberate redispatch and `docker events --since 0` capture
-  bracketing the deploy.
+A live `docker inspect` on a healthy engine container shows:
+
+```text
+Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
+        galaxy.game_id=<uuid>,
+        org.opencontainers.image.title=galaxy-game-engine,
+        com.galaxy.{cpu_quota,memory,pids_limit}
+AutoRemove: false
+RestartPolicy: on-failure
+NetworkMode: galaxy-dev-internal
+```
+
+There are no `com.docker.compose.*` labels and `AutoRemove=false`,
+so `--remove-orphans` cannot reap the engine and a `--rm`-style
+self-destruct is not in play. Two redispatches captured under
+`docker events --filter event=create,start,die,destroy,kill,stop`
+also confirmed it: across both runs the only `die` / `destroy`
+events were for `galaxy-dev-{backend,api,caddy}`. The live engine
+container survived both redispatches, and the reconciler that
+fires 60 seconds after the new backend boots correctly matched
+it through `byGameID` / `byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart, but does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (cron `docker container prune`,
+   a host script, manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and then is silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it.
+
+### What to investigate next
+
+- On the dev host: list cron jobs, systemd timers, and any custom
+  shell that periodically runs `docker container prune` or
+  `docker system prune`. The host also runs gitea + crowdsec so
+  unrelated maintenance is plausible.
+- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
+  '2026-05-16 20:56:34'` for the original repro window — confirm
+  whether the daemon flagged a container removal in that gap.
+- Re-run with backend logging level `debug` so the
+  `runtime.scheduler` and `runtime.workers` paths surface their
+  per-game timer / job decisions. The current `info` level says
+  nothing between bootstrap and the reconciler.
+- Capture `runtime_records` for the broken game *before* the next
+  boot purges it; the column set
+  (`status`, `current_container_id`, `engine_endpoint`) tells
+  whether the engine ever reached `running` or stopped at
+  `engine_unreachable`.
 - Reproduce on a freshly seeded `clean-data` volume to rule out
   postgres-state ambiguity.
 