diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md
index 7e4dc4f..b3b7277 100644
--- a/tools/dev-deploy/KNOWN-ISSUES.md
+++ b/tools/dev-deploy/KNOWN-ISSUES.md
@@ -44,40 +44,77 @@
 returns no rows during this window — the engine container is neither
 running nor stopped on the host, so it either was never spawned or was
 removed before the host snapshot.
 
-### Working hypotheses
+### What has been ruled out
 
-1. **Race between `Start` returning and the runtime spawn writing the
-   container record.** Bootstrap returns `status=starting` and the
-   service layer's `Start` is supposed to drive to `running` via the
-   runtime layer's container spawn. If the spawn fails silently — or
-   the goroutine that owns it exits before persisting the runtime
-   record — the reconciler later sees a `starting` game with no
-   container and cancels.
-2. **`docker compose up -d --wait --remove-orphans` interaction.**
-   `--remove-orphans` is documented as "remove containers for
-   services not defined in the Compose file". Engine containers are
-   spawned by the backend with their own labels, not under the
-   compose project namespace, so they *should* be exempt — but it
-   is worth verifying with `docker inspect` on a live engine
-   container that none of its labels accidentally pin it to the
-   `name: galaxy-dev` compose project.
-3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
-   semantics, a transient crash that exits the process leaves no
-   record on the host. Combined with hypothesis 1, the reconciler's
-   "container disappeared" branch is exactly the shape we observe.
+A live `docker inspect` on a healthy engine container shows:
 
-### What to investigate before fixing
+```text
+Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
+        galaxy.game_id=,
+        org.opencontainers.image.title=galaxy-game-engine,
+        com.galaxy.{cpu_quota,memory,pids_limit}
+AutoRemove: false
+RestartPolicy: on-failure
+NetworkMode: galaxy-dev-internal
+```
 
-- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
-  exact path the engine takes from `status=starting` to either
-  `running` or `start_failed`. Specifically: which goroutine owns
-  the spawn, where its error is logged, and whether `start_failed`
-  is reachable from the runtime reconciler path or only from the
-  in-bootstrap `Start` call.
-- Check the engine container's `Config.Labels`,
-  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
-  a deliberate redispatch and `docker events --since 0` capture
-  bracketing the deploy.
+There are no `com.docker.compose.*` labels and `AutoRemove=false`,
+so `--remove-orphans` cannot reap the engine and a `--rm`-style
+self-destruct is not in play. Two redispatches captured with
+`docker events` (filtering on `create`, `start`, `die`, `destroy`,
+`kill`, and `stop` container events) also confirmed this: across
+both runs the only `die` / `destroy` events were for
+`galaxy-dev-{backend,api,caddy}`. The live engine container
+survived both redispatches, and the reconciler that fires 60
+seconds after the new backend boots correctly matched it through
+`byGameID` / `byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
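+
+A quick way to re-verify that claim against the source tree, sketched
+below. `ContainerRemove` assumes the backend drives Docker through the
+official Go SDK, which is not confirmed above, so the grep targets may
+need adjusting:
+
+```sh
+# Every call site that can delete an engine container should sit in
+# runStop / runRestart / runPatch; nothing should match a shutdown
+# or signal-handling path.
+grep -rn 'ContainerRemove' backend/internal/runtime/
+grep -rnE 'Shutdown|signal\.Notify|SIGTERM' backend/internal/runtime/
+```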
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only restarts on a non-zero exit (up
+   to any configured retry cap); a clean exit (status 0) is not
+   restarted, but Docker still keeps the stopped container visible
+   in `docker ps -a`. In the reproduction case the engine was
+   missing from `docker ps -a` entirely, so a separate cleanup (a
+   cron `docker container prune`, a host script, a manual
+   `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but it would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and is then silent until the
+   reconciler fires 15 minutes later. The runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table was wiped along with the cancelled game
+   on the next boot, so the post-mortem currently lacks it.
+
+### What to investigate next
+
+- On the dev host: list cron jobs, systemd timers, and any custom
+  shell script that periodically runs `docker container prune` or
+  `docker system prune`. The host also runs gitea + crowdsec, so
+  unrelated maintenance is plausible.
+- Inspect `journalctl -u docker --since '2026-05-16 20:50:00'
+  --until '2026-05-16 20:56:34'` for the original repro window —
+  confirm whether the daemon logged a container removal in that gap.
+- Re-run with backend logging level `debug` so the
+  `runtime.scheduler` and `runtime.workers` paths surface their
+  per-game timer / job decisions. The current `info` level logs
+  nothing between bootstrap and the reconciler.
+- Capture `runtime_records` for the broken game *before* the next
+  boot purges it (see the sketch after this list); the column set
+  (`status`, `current_container_id`, `engine_endpoint`) shows
+  whether the engine ever reached `running` or stopped at
+  `engine_unreachable`.
 - Reproduce on a freshly seeded `clean-data` volume to rule out
   postgres-state ambiguity.
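+
+As a starting point for that `runtime_records` capture, something
+like the sketch below could run right before the next backend
+restart. The postgres container name, database, and user are
+assumptions, not values confirmed anywhere in this document:
+
+```sh
+# Hypothetical snapshot helper — adjust the container name and psql
+# credentials to match the actual dev-deploy compose file first.
+OUT="runtime-records-$(date +%Y%m%dT%H%M%S).txt"
+docker exec galaxy-dev-postgres \
+  psql -U galaxy -d galaxy -c \
+  "SELECT status, current_container_id, engine_endpoint
+     FROM runtime_records;" > "$OUT"
+echo "runtime_records snapshot written to $OUT"
+```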