From 5177fef2efb7bd086eb2152e150a8dc31be060e7 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 22:53:21 +0200 Subject: [PATCH 1/3] tools/dev-deploy: log the sandbox-cancellation TODO MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot. `KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately. No code change — this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch. Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 102 +++++++++++++++++++++++++++++++ tools/dev-deploy/README.md | 6 ++ 2 files changed, 108 insertions(+) create mode 100644 tools/dev-deploy/KNOWN-ISSUES.md diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md new file mode 100644 index 0000000..7e4dc4f --- /dev/null +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -0,0 +1,102 @@ +# `tools/dev-deploy/` — known issues + +Issues that surface in the long-lived dev environment but are not yet +fixed. Each entry lists the observed symptom, the diagnostic evidence, +the working hypothesis, and the open questions that have to be +answered before a fix lands. 
+ +## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch + +### Symptom + +A previously `running` "Dev Sandbox" game (created by +`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes +after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's +browser session survives (the same `device_session_id` keeps working), +but the lobby shows no game because the only game it had is now +terminal. `purgeTerminalSandboxGames` does pick it up on the **next** +boot and creates a fresh sandbox — but the first redispatch leaves +the user with an empty lobby until backend restarts again. + +### Diagnostic evidence + +Backend logs from the broken cycle (timestamps abbreviated): + +```text +20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled +20:24:40 dev_sandbox: memberships ensured count=20 game_id= +20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting +... +20:25:09 user mail sent failed (diplomail tables missing — unrelated) +... +20:39:40 lobby: game cancelled by runtime reconciler game_id= + op=reconcile status=removed message="container disappeared" +``` + +Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel) +the backend logs are silent on the runtime / engine paths — no +`engine spawned`, no `engine container started`, no `runtime +transition` lines. The reconciler then fires and reports the engine +container as missing. + +`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` +returns no rows during this window — the engine container is neither +running nor stopped on the host, so it either was never spawned or +was removed before the host snapshot. + +### Working hypotheses + +1. **Race between `Start` returning and the runtime spawn writing the + container record.** Bootstrap returns `status=starting` and the + service layer's `Start` is supposed to drive to `running` via the + runtime layer's container spawn. 
If the spawn fails silently — or + the goroutine that owns it exits before persisting the runtime + record — the reconciler later sees a `starting` game with no + container and cancels. +2. **`docker compose up -d --wait --remove-orphans` interaction.** + `--remove-orphans` is documented as "remove containers for + services not defined in the Compose file". Engine containers are + spawned by the backend with their own labels, not under the + compose project namespace, so they *should* be exempt — but it + is worth verifying with `docker inspect` on a live engine + container that none of its labels accidentally pin it to the + `name: galaxy-dev` compose project. +3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm` + semantics, a transient crash that exits the process leaves no + record on the host. Combined with hypothesis 1, the reconciler's + "container disappeared" branch is exactly the shape we observe. + +### What to investigate before fixing + +- Inspect `backend/internal/runtime/` (spawn / reconciler) for the + exact path the engine takes from `status=starting` to either + `running` or `start_failed`. Specifically: which goroutine owns + the spawn, where its error is logged, and whether `start_failed` + is reachable from the runtime reconciler path or only from the + in-bootstrap `Start` call. +- Check the engine container's `Config.Labels`, + `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with + a deliberate redispatch and `docker events --since 0` capture + bracketing the deploy. +- Reproduce on a freshly seeded `clean-data` volume to rule out + postgres-state ambiguity. 
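The "`docker events --since 0` capture bracketing the deploy" step can be made concrete with a small sketch. The capture command is shown as a comment; the sample log lines and the `galaxy-engine-g42` container name are invented for illustration, and the real name/label scheme should be confirmed against a live engine container first:

```shell
# Capture (armed on the dev host BEFORE the redispatch starts):
#   docker events --since 0 \
#     --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
#     > /tmp/deploy-events.log
# Triage afterwards: keep removal events, drop the compose-managed
# services, so anything left is a candidate for the missing engine.
# The log below is a fabricated stand-in for a real capture.
cat > /tmp/deploy-events.log <<'EOF'
1747427080 die galaxy-dev-backend
1747427081 destroy galaxy-dev-backend
1747427085 destroy galaxy-engine-g42
1747427099 start galaxy-dev-backend
EOF
grep -E ' (die|destroy) ' /tmp/deploy-events.log | grep -v ' galaxy-dev-'
```

In the healthy case the final pipeline prints nothing; here it surfaces the invented `destroy` row for `galaxy-engine-g42`, which is exactly the event the hypotheses above are trying to attribute to a cause.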
+ +### Workaround in use today + +When the sandbox game flips to `cancelled`, redispatch `dev-deploy`: + +```sh +curl -X POST -n -H 'Content-Type: application/json' \ + -d '{"ref":""}' \ + https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches +``` + +The next boot's `purgeTerminalSandboxGames` removes the cancelled +row, `findOrCreateSandboxGame` creates a fresh one, and +`ensureMembershipsAndDrive` puts the new game back to `running`. + +### Owner + +Unassigned. File an issue once we have the runtime / reconciler +analysis above; reference this section in the issue body so future +redeploys can short-circuit the diagnostic loop. diff --git a/tools/dev-deploy/README.md b/tools/dev-deploy/README.md index 1a04485..5d3f68c 100644 --- a/tools/dev-deploy/README.md +++ b/tools/dev-deploy/README.md @@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir - `.env.example` — non-secret defaults for the compose `${VAR:-}` expansions. Copy to `.env` if you want host-local overrides. +## Known issues + +See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface +in the long-lived dev environment but are not yet fixed (currently: +the sandbox game flipping to `cancelled` after a redispatch). + ## Relationship to other infrastructure - `tools/local-dev/` — single-developer playground, host-port mapped, -- 2.52.0 From cadb72b412108466819c20f28866b59fa7927be9 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 23:10:13 +0200 Subject: [PATCH 2/3] KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A live `docker inspect` of an engine container and two redispatch runs with `docker events` captured confirm: - Engine has no `com.docker.compose.*` labels and `AutoRemove=false`, so `--remove-orphans` cannot reap it. 
- Two consecutive `dev-deploy.yaml` redispatches with an engine already running emitted `die` / `destroy` events only for `galaxy-dev-{backend,api,caddy}` — never for the engine. - The reconciler tick that fires 60s after backend recreate correctly matched the surviving engine in both cases (`status=running` in both `games` and `runtime_records`). - `runtime.Service` has no `Shutdown` that proactively removes engine containers, so a graceful backend exit also leaves them alone. The repro window therefore needs a separate trigger that removed the engine container outside of compose. The new hypotheses point at host-side `docker prune` jobs, a `dockerd` restart that lost the container, or an early `Engine.Init` failure that exited the engine before `status=running` reached the runtime row. The investigation list now leads with `journalctl -u docker` and the host crontab — those are the cheapest checks to confirm or rule out next. Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 99 ++++++++++++++++++++++---------- 1 file changed, 68 insertions(+), 31 deletions(-) diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md index 7e4dc4f..b3b7277 100644 --- a/tools/dev-deploy/KNOWN-ISSUES.md +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot. -### Working hypotheses +### What has been ruled out -1. **Race between `Start` returning and the runtime spawn writing the - container record.** Bootstrap returns `status=starting` and the - service layer's `Start` is supposed to drive to `running` via the - runtime layer's container spawn. If the spawn fails silently — or - the goroutine that owns it exits before persisting the runtime - record — the reconciler later sees a `starting` game with no - container and cancels. -2. 
**`docker compose up -d --wait --remove-orphans` interaction.** - `--remove-orphans` is documented as "remove containers for - services not defined in the Compose file". Engine containers are - spawned by the backend with their own labels, not under the - compose project namespace, so they *should* be exempt — but it - is worth verifying with `docker inspect` on a live engine - container that none of its labels accidentally pin it to the - `name: galaxy-dev` compose project. -3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm` - semantics, a transient crash that exits the process leaves no - record on the host. Combined with hypothesis 1, the reconciler's - "container disappeared" branch is exactly the shape we observe. +A live `docker inspect` on a healthy engine container shows: -### What to investigate before fixing +```text +Labels: galaxy.backend=1, galaxy.engine_version=0.1.0, + galaxy.game_id=, + org.opencontainers.image.title=galaxy-game-engine, + com.galaxy.{cpu_quota,memory,pids_limit} +AutoRemove: false +RestartPolicy: on-failure +NetworkMode: galaxy-dev-internal +``` -- Inspect `backend/internal/runtime/` (spawn / reconciler) for the - exact path the engine takes from `status=starting` to either - `running` or `start_failed`. Specifically: which goroutine owns - the spawn, where its error is logged, and whether `start_failed` - is reachable from the runtime reconciler path or only from the - in-bootstrap `Start` call. -- Check the engine container's `Config.Labels`, - `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with - a deliberate redispatch and `docker events --since 0` capture - bracketing the deploy. +There are no `com.docker.compose.*` labels and `AutoRemove=false`, +so `--remove-orphans` cannot reap the engine and a `--rm`-style +self-destruct is not in play. 
Two redispatches captured under
+`docker events` (one `--filter event=NAME` flag per event type:
+`create`, `start`, `die`, `destroy`, `kill`, `stop`; the filter
+flag takes a single value, not a comma-separated list) also
+confirmed it: across both runs the only `die` / `destroy` events
+were for `galaxy-dev-{backend,api,caddy}`. The live engine
+container survived both redispatches, and the reconciler that
+fires 60 seconds after the new backend boots correctly matched
+it through `byGameID` / `byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart, but does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (cron `docker container prune`,
+   a host script, manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and then is silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it. 
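Once a bad-window snapshot exists, the split between the hypotheses above comes down to one column. A throwaway triage sketch, under loud assumptions: the two-column "status container_id" input shape and the status values are taken from the prose above, not verified against the real `runtime_records` schema:

```shell
# Hypothetical triage of a runtime_records snapshot: map each
# "status container_id" row onto the hypotheses above.
# (Status names are assumptions from this doc, not the real enum.)
triage() {
  while read -r status container_id; do
    case "$status" in
      running)
        echo "row reached running ($container_id): engine existed, removal was external" ;;
      engine_unreachable)
        echo "row stuck at engine_unreachable: spawn/init failure (hypothesis 3)" ;;
      *)
        echo "status=$status: inconclusive" ;;
    esac
  done
}
# Fabricated sample rows standing in for a real snapshot:
printf 'running c0ffee\nengine_unreachable -\n' | triage
```

The point of the sketch is the decision, not the plumbing: a row that reached `running` shifts suspicion host-side, while `engine_unreachable` points back at the spawn path.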
+ +### What to investigate next + +- On the dev host: list cron jobs, systemd timers, and any custom + shell that periodically runs `docker container prune` or + `docker system prune`. The host also runs gitea + crowdsec so + unrelated maintenance is plausible. +- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until + '2026-05-16 20:56:34'` for the original repro window — confirm + whether the daemon flagged a container removal in that gap. +- Re-run with backend logging level `debug` so the + `runtime.scheduler` and `runtime.workers` paths surface their + per-game timer / job decisions. The current `info` level says + nothing between bootstrap and the reconciler. +- Capture `runtime_records` for the broken game *before* the next + boot purges it; the column set + (`status`, `current_container_id`, `engine_endpoint`) tells + whether the engine ever reached `running` or stopped at + `engine_unreachable`. - Reproduce on a freshly seeded `clean-data` volume to rule out postgres-state ambiguity. -- 2.52.0 From 49f614926a82bd617a3e7248ac62f587a7c2f266 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 23:16:51 +0200 Subject: [PATCH 3/3] KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses After the live investigation, the project owner confirms that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs with a fresh `docker events --since 0` capture armed before the deploy starts. 
Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 74 +++++++++++++++----------------- 1 file changed, 34 insertions(+), 40 deletions(-) diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md index b3b7277..32ab2d5 100644 --- a/tools/dev-deploy/KNOWN-ISSUES.md +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively kills containers on backend exit, so a graceful SIGTERM to `galaxy-dev-backend` will not touch its child engine containers. -### Remaining hypotheses +### Host-side hypotheses considered and rejected by the owner -1. **Engine self-crashed and was reaped by something host-side.** - `RestartPolicy=on-failure` only retries within Docker's own - limits; if the engine exited cleanly (status 0) Docker does - not restart, but does keep the row in `docker ps -a`. The - reproduction case had the engine missing from `docker ps -a` - entirely, so a separate cleanup (cron `docker container prune`, - a host script, manual `docker rm`) needs to be ruled out. -2. **An out-of-band Docker daemon restart dropped the container.** - A `dockerd` restart that loses sight of an unmanaged container - is rare, but would explain why both the live tracking and - `docker ps -a` are empty. Correlate the gap with - `journalctl -u docker` on the host. -3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init` - and the engine exited on its own before `status=running`.** - Bootstrap logs `status=starting` and then is silent until the - reconciler 15 minutes later; the runtime row in that case - should have been written with `status=engine_unreachable`, so - any reproduction needs a `runtime_records` snapshot from the - bad window — that table got wiped together with the cancelled - game on the next boot, so the post-mortem currently lacks it. 
+The natural follow-up suspects after compose was cleared — host-side +`docker prune` cron jobs, a manual `docker rm`, an out-of-band +`dockerd` restart, and an idle-state engine crash — were all +rejected by the project owner: the dev host runs none of those +periodic cleanups, no one manually removed the container, dockerd +was not restarted in the window, and the engine binary does not +crash while idling on API calls. -### What to investigate next +### Best remaining suspicion -- On the dev host: list cron jobs, systemd timers, and any custom - shell that periodically runs `docker container prune` or - `docker system prune`. The host also runs gitea + crowdsec so - unrelated maintenance is plausible. -- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until - '2026-05-16 20:56:34'` for the original repro window — confirm - whether the daemon flagged a container removal in that gap. -- Re-run with backend logging level `debug` so the - `runtime.scheduler` and `runtime.workers` paths surface their - per-game timer / job decisions. The current `info` level says - nothing between bootstrap and the reconciler. -- Capture `runtime_records` for the broken game *before* the next - boot purges it; the column set - (`status`, `current_container_id`, `engine_endpoint`) tells - whether the engine ever reached `running` or stopped at - `engine_unreachable`. -- Reproduce on a freshly seeded `clean-data` volume to rule out - postgres-state ambiguity. +Something the `dev-deploy.yaml` CI run does between successful +image builds and the final `docker compose up -d --wait +--remove-orphans` clobbers the previously-spawned engine container. +The chain at runtime contains: + +1. `docker build -t galaxy-engine:dev -f game/Dockerfile .` +2. `docker compose build galaxy-backend galaxy-api` +3. `docker run --rm` alpine for the UI volume seed +4. 
`docker compose up -d --wait --remove-orphans` + +None of these *should* touch an unmanaged engine container, but +the reproduction window points squarely inside this sequence. A +deliberate next reproduction with `docker events --since 0` armed +*before* the deploy starts and live for the entire job — captured +end-to-end on the dev host, not just the chunk after backend +recreate — would pin which step emits the `destroy` on the engine. + +### Status + +Parked. The bug is mildly disruptive (one redispatch + a manual +`make seed-ui`-style follow-up brings the sandbox back) and the +remaining hypotheses are speculative. If the symptom recurs, attach +the next bad-window `docker events` capture to this entry and +reopen. A `tools/dev-deploy/` rewrite may obviate the issue +entirely; that is on the project owner's medium-term list. ### Workaround in use today -- 2.52.0