chore(dev-deploy): KNOWN-ISSUES entry for sandbox-cancel after redispatch #12

Merged
developer merged 3 commits from chore/dev-sandbox-cancel-todo into development 2026-05-16 21:17:37 +00:00
Showing only changes of commit cadb72b412
+68 -31
@@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.

### Working hypotheses

1. **Race between `Start` returning and the runtime spawn writing the
container record.** Bootstrap returns `status=starting` and the
service layer's `Start` is supposed to drive to `running` via the
runtime layer's container spawn. If the spawn fails silently — or
the goroutine that owns it exits before persisting the runtime
record — the reconciler later sees a `starting` game with no
container and cancels.
2. **`docker compose up -d --wait --remove-orphans` interaction.**
`--remove-orphans` is documented as "remove containers for
services not defined in the Compose file". Engine containers are
spawned by the backend with their own labels, not under the
compose project namespace, so they *should* be exempt — but it
is worth verifying with `docker inspect` on a live engine
container that none of its labels accidentally pin it to the
`name: galaxy-dev` compose project.
3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
semantics, a transient crash that exits the process leaves no
record on the host. Combined with hypothesis 1, the reconciler's
"container disappeared" branch is exactly the shape we observe.

### What to investigate before fixing

- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
exact path the engine takes from `status=starting` to either
`running` or `start_failed`. Specifically: which goroutine owns
the spawn, where its error is logged, and whether `start_failed`
is reachable from the runtime reconciler path or only from the
in-bootstrap `Start` call.
- Check the engine container's `Config.Labels`,
`HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
a deliberate redispatch and `docker events --since 0` capture
bracketing the deploy.

### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured under
`docker events --filter event=create,start,die,destroy,kill,stop`
also confirmed it: across both runs the only `die` / `destroy`
events were for `galaxy-dev-{backend,api,caddy}`. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through `byGameID` / `byContainerID`.
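
For anyone re-checking this later, roughly the following stock Docker CLI
calls reproduce that capture; the engine container name and the event
window are placeholders:

```sh
# Label set and lifecycle flags of a live engine container. Expect no
# com.docker.compose.* labels and AutoRemove=false.
docker inspect --format '{{json .Config.Labels}}' <engine-container>
docker inspect --format 'AutoRemove={{.HostConfig.AutoRemove}} Restart={{.HostConfig.RestartPolicy.Name}} Net={{.HostConfig.NetworkMode}}' <engine-container>

# Container lifecycle events bracketing a deliberate redispatch. Only the
# compose-managed galaxy-dev-{backend,api,caddy} containers should emit
# die/destroy events.
docker events --since '<deploy start>' --until '<deploy end>' \
  --filter type=container \
  --filter event=die --filter event=destroy
```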
`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.
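
A quick way to sanity-check that on the dev host, assuming the label set
from the inspect output above and that a redispatch delivers a plain
SIGTERM to the backend container:

```sh
# Engine containers carry the galaxy.backend=1 label and are not part of the
# compose project, so list them before and after stopping the backend.
docker ps --filter "label=galaxy.backend=1" --format '{{.ID}}  {{.Names}}  {{.Status}}'

# Graceful stop of the backend, the same signal a redispatch delivers.
docker kill --signal=SIGTERM galaxy-dev-backend

# The same engine containers should still be running afterwards; nothing in
# the backend's exit path removes them.
sleep 5
docker ps --filter "label=galaxy.backend=1" --format '{{.ID}}  {{.Names}}  {{.Status}}'
```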
### Remaining hypotheses
1. **Engine self-crashed and was reaped by something host-side.**
`RestartPolicy=on-failure` only restarts the container on a
non-zero exit; if the engine exited cleanly (status 0) Docker does
not restart it, but it does keep the row in `docker ps -a`. The
reproduction case had the engine missing from `docker ps -a`
entirely, so a separate cleanup (cron `docker container prune`,
a host script, manual `docker rm`) needs to be ruled out
(a quick check is sketched after this list).
2. **An out-of-band Docker daemon restart dropped the container.**
A `dockerd` restart that loses sight of an unmanaged container
is rare, but would explain why both the live tracking and
`docker ps -a` are empty. Correlate the gap with
`journalctl -u docker` on the host.
3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
and the engine exited on its own before `status=running`.**
Bootstrap logs `status=starting` and is then silent until the
reconciler fires 15 minutes later. The runtime row in that case
should have been written with `status=engine_unreachable`, so
any reproduction needs a `runtime_records` snapshot from the
bad window; that table was wiped together with the cancelled
game on the next boot, so the post-mortem currently lacks it.
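
For hypothesis 1, the docker-side half of the check is cheap (label taken
from the inspect output above); an engine that exited on its own and was
kept by Docker would show up here with its exit code:

```sh
# Includes exited containers; a clean self-exit would read e.g. "Exited (0) ...".
docker ps -a --filter "label=galaxy.backend=1" \
  --format '{{.Names}}  {{.Status}}  {{.Label "galaxy.game_id"}}'
```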
### What to investigate next
- On the dev host: list cron jobs, systemd timers, and any custom
shell scripts that periodically run `docker container prune` or
`docker system prune`. The host also runs gitea + crowdsec, so
unrelated maintenance is plausible (a consolidated sketch of these
host-side checks follows this list).
- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until
'2026-05-16 20:56:34'` for the original repro window — confirm
whether the daemon flagged a container removal in that gap.
- Re-run with backend logging level `debug` so the
`runtime.scheduler` and `runtime.workers` paths surface their
per-game timer / job decisions. The current `info` level says
nothing between bootstrap and the reconciler.
- Capture `runtime_records` for the broken game *before* the next
boot purges it; the column set
(`status`, `current_container_id`, `engine_endpoint`) tells
whether the engine ever reached `running` or stopped at
`engine_unreachable`.
- Reproduce on a freshly seeded `clean-data` volume to rule out
postgres-state ambiguity.
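
Most of the host-side items above fit into one pass. A sketch, assuming a
systemd host and a reachable postgres connection string; `$DATABASE_URL`,
the game id, and the `game_id` filter column are placeholders/assumptions,
and the journalctl window is the one quoted above:

```sh
# Anything host-side that could prune containers behind Docker's back.
crontab -l
ls /etc/cron.d /etc/cron.daily 2>/dev/null
systemctl list-timers --all

# Docker daemon log for the original repro window.
journalctl -u docker --since '2026-05-16 20:50:00' --until '2026-05-16 20:56:34'

# Snapshot runtime_records for the broken game before the next boot purges it.
psql "$DATABASE_URL" -c \
  "SELECT status, current_container_id, engine_endpoint
     FROM runtime_records WHERE game_id = '<uuid>';"
```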