KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap

A live `docker inspect` of an engine container, together with two
redispatch runs captured under `docker events`, confirms:

- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
  so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
  already running emitted `die` / `destroy` events only for
  `galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
  correctly matched the surviving engine in both cases
  (`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
  engine containers, so a graceful backend exit also leaves them
  alone.
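
The inspect / events checks above can be scripted. A minimal sketch,
assuming only standard `docker` CLI flags; the helper names are mine,
not the project's tooling:

```sh
# reap_exposure prints the two fields that decide whether a container
# could be reaped by `--remove-orphans` (compose project label) or by
# `--rm` semantics (HostConfig.AutoRemove).
reap_exposure() {
  local ctr="${1:?usage: reap_exposure <container>}"
  docker inspect --format \
    '{{.HostConfig.AutoRemove}} {{index .Config.Labels "com.docker.compose.project"}}' \
    "$ctr"
}

# engine_events brackets a redispatch: call it with timestamps around
# the deploy and look for die/destroy events naming the engine.
engine_events() {
  docker events --since "$1" --until "$2" \
    --filter type=container \
    --filter event=die --filter event=destroy
}
```

For a healthy engine the first helper should print `false` followed by
an empty compose-project field.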

The repro window therefore implies that some separate trigger removed
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
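
Those two cheapest checks can be sketched as a helper pair. The
`journalctl -u docker` unit is from the note above; cron file
locations and the grep patterns are assumptions:

```sh
# find_prune_jobs looks for host-side jobs that could have reaped the
# container: cron entries and systemd timers mentioning docker/prune.
find_prune_jobs() {
  { crontab -l 2>/dev/null; cat /etc/cron.d/* 2>/dev/null; } |
    grep -n 'docker.*prune' || true
  systemctl list-timers --all 2>/dev/null | grep -iE 'docker|prune' || true
}

# dockerd_removals greps the daemon journal for removal-shaped lines
# inside the repro window ($1/$2 feed --since/--until).
dockerd_removals() {
  journalctl -u docker --since "$1" --until "$2" 2>/dev/null |
    grep -Ei 'remov|prune|clean' || true
}
```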

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ilia Denisov · 2026-05-16 23:10:13 +02:00
commit cadb72b412 · parent 5177fef2ef
@@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither
 running nor stopped on the host, so it either was never spawned or
 was removed before the host snapshot.
-### Working hypotheses
+### What has been ruled out
-1. **Race between `Start` returning and the runtime spawn writing the
-   container record.** Bootstrap returns `status=starting` and the
-   service layer's `Start` is supposed to drive to `running` via the
-   runtime layer's container spawn. If the spawn fails silently — or
-   the goroutine that owns it exits before persisting the runtime
-   record — the reconciler later sees a `starting` game with no
-   container and cancels.
-2. **`docker compose up -d --wait --remove-orphans` interaction.**
-   `--remove-orphans` is documented as "remove containers for
-   services not defined in the Compose file". Engine containers are
-   spawned by the backend with their own labels, not under the
-   compose project namespace, so they *should* be exempt — but it
-   is worth verifying with `docker inspect` on a live engine
-   container that none of its labels accidentally pin it to the
-   `name: galaxy-dev` compose project.
-3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
-   semantics, a transient crash that exits the process leaves no
-   record on the host. Combined with hypothesis 1, the reconciler's
-   "container disappeared" branch is exactly the shape we observe.
+A live `docker inspect` on a healthy engine container shows:
-### What to investigate before fixing
+
+```text
+Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
+        galaxy.game_id=<uuid>,
+        org.opencontainers.image.title=galaxy-game-engine,
+        com.galaxy.{cpu_quota,memory,pids_limit}
+AutoRemove: false
+RestartPolicy: on-failure
+NetworkMode: galaxy-dev-internal
+```
-- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
-  exact path the engine takes from `status=starting` to either
-  `running` or `start_failed`. Specifically: which goroutine owns
-  the spawn, where its error is logged, and whether `start_failed`
-  is reachable from the runtime reconciler path or only from the
-  in-bootstrap `Start` call.
-- Check the engine container's `Config.Labels`,
-  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
-  a deliberate redispatch and `docker events --since 0` capture
-  bracketing the deploy.
+
+There are no `com.docker.compose.*` labels and `AutoRemove=false`,
+so `--remove-orphans` cannot reap the engine and a `--rm`-style
+self-destruct is not in play. Two redispatches captured under
+`docker events` (filtered to the container `create`, `start`, `die`,
+`destroy`, `kill`, and `stop` events) also confirmed it: across both
+runs the only `die` / `destroy` events were for
+`galaxy-dev-{backend,api,caddy}`. The live engine container survived
+both redispatches, and the reconciler that fires 60 seconds after
+the new backend boots correctly matched it through `byGameID` /
+`byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart it, but it does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (a cron `docker container prune`,
+   a host script, a manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but it would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`,
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and is then silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it.
+
+### What to investigate next
+
+- On the dev host: list cron jobs, systemd timers, and any custom
+  shell script that periodically runs `docker container prune` or
+  `docker system prune`. The host also runs gitea + crowdsec, so
+  unrelated maintenance is plausible.
+- Inspect `journalctl -u docker --since '2026-05-16 20:50:00'
+  --until '2026-05-16 20:56:34'` for the original repro window —
+  confirm whether the daemon logged a container removal in that gap.
+- Re-run with backend logging level `debug` so that the
+  `runtime.scheduler` and `runtime.workers` paths surface their
+  per-game timer / job decisions. The current `info` level says
+  nothing between bootstrap and the reconciler.
+- Capture `runtime_records` for the broken game *before* the next
+  boot purges it; the column set
+  (`status`, `current_container_id`, `engine_endpoint`) tells
+  whether the engine ever reached `running` or stopped at
+  `engine_unreachable`.
+- Reproduce on a freshly seeded `clean-data` volume to rule out
+  postgres-state ambiguity.
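
The `runtime_records` capture in the list above could be a one-liner
helper. A sketch only: the postgres container name, role, and database
are assumptions, while the column set is the one named in the note:

```sh
# Hypothetical snapshot of the runtime row before the next boot wipes
# it. galaxy-dev-postgres / galaxy user / galaxy db are assumptions.
snapshot_runtime_record() {
  local game_id="${1:?usage: snapshot_runtime_record <game-uuid>}"
  docker exec galaxy-dev-postgres psql -U galaxy -d galaxy -c \
    "SELECT status, current_container_id, engine_endpoint
       FROM runtime_records WHERE game_id = '${game_id}';"
}
```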