From 5177fef2efb7bd086eb2152e150a8dc31be060e7 Mon Sep 17 00:00:00 2001
From: Ilia Denisov
Date: Sat, 16 May 2026 22:53:21 +0200
Subject: [PATCH] tools/dev-deploy: log the sandbox-cancellation TODO
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never shows
up in `docker ps -a --filter label=galaxy-game-engine`, so either it
never spawned or it was removed before any host-side snapshot.

`KNOWN-ISSUES.md` records the symptom, the log excerpt, three working
hypotheses (runtime spawn race, `--remove-orphans` interaction, engine
`--rm` lifecycle), and the investigation checklist before opening an
issue. The README gets a one-line pointer so future redeploys land on
the doc immediately.

No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to rediscover
the diagnostic from scratch.

Co-Authored-By: Claude Opus 4.7
---
 tools/dev-deploy/KNOWN-ISSUES.md | 102 +++++++++++++++++++++++++++++++
 tools/dev-deploy/README.md       |   6 ++
 2 files changed, 108 insertions(+)
 create mode 100644 tools/dev-deploy/KNOWN-ISSUES.md

diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md
new file mode 100644
index 0000000..7e4dc4f
--- /dev/null
+++ b/tools/dev-deploy/KNOWN-ISSUES.md
@@ -0,0 +1,102 @@
+# `tools/dev-deploy/` — known issues
+
+Issues that surface in the long-lived dev environment but are not yet
+fixed. Each entry lists the observed symptom, the diagnostic evidence,
+the working hypothesis, and the open questions that have to be
+answered before a fix lands.
+
+## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
+
+### Symptom
+
+A previously `running` "Dev Sandbox" game (created by
+`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
+after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
+browser session survives (the same `device_session_id` keeps working),
+but the lobby shows no game because the only game it had is now
+terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
+boot and creates a fresh sandbox — but the first redispatch leaves
+the user with an empty lobby until backend restarts again.
+
+### Diagnostic evidence
+
+Backend logs from the broken cycle (timestamps abbreviated):
+
+```text
+20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled
+20:24:40 dev_sandbox: memberships ensured count=20 game_id=
+20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting
+...
+20:25:09 user mail sent failed (diplomail tables missing — unrelated)
+...
+20:39:40 lobby: game cancelled by runtime reconciler game_id=
+         op=reconcile status=removed message="container disappeared"
+```
+
+Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
+the backend logs are silent on the runtime / engine paths — no
+`engine spawned`, no `engine container started`, no `runtime
+transition` lines. The reconciler then fires and reports the engine
+container as missing.
+
+`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
+returns no rows during this window — the engine container is neither
+running nor stopped on the host, so it either was never spawned or
+was removed before the host snapshot.
+
+### Working hypotheses
+
+1. **Race between `Start` returning and the runtime spawn writing the
+   container record.** Bootstrap returns `status=starting` and the
+   service layer's `Start` is supposed to drive to `running` via the
+   runtime layer's container spawn. If the spawn fails silently — or
+   the goroutine that owns it exits before persisting the runtime
+   record — the reconciler later sees a `starting` game with no
+   container and cancels.
+2. **`docker compose up -d --wait --remove-orphans` interaction.**
+   `--remove-orphans` is documented as "remove containers for
+   services not defined in the Compose file". Engine containers are
+   spawned by the backend with their own labels, not under the
+   compose project namespace, so they *should* be exempt — but it
+   is worth verifying with `docker inspect` on a live engine
+   container that none of its labels accidentally pin it to the
+   `name: galaxy-dev` compose project.
+3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
+   semantics, a transient crash that exits the process leaves no
+   record on the host. Combined with hypothesis 1, the reconciler's
+   "container disappeared" branch is exactly the shape we observe.
+
+### What to investigate before fixing
+
+- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
+  exact path the engine takes from `status=starting` to either
+  `running` or `start_failed`. Specifically: which goroutine owns
+  the spawn, where its error is logged, and whether `start_failed`
+  is reachable from the runtime reconciler path or only from the
+  in-bootstrap `Start` call.
+- Check the engine container's `Config.Labels`,
+  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with
+  a deliberate redispatch and `docker events --since 0` capture
+  bracketing the deploy.
+- Reproduce on a freshly seeded `clean-data` volume to rule out
+  postgres-state ambiguity.
+
+### Workaround in use today
+
+When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
+
+```sh
+curl -X POST -n -H 'Content-Type: application/json' \
+  -d '{"ref":""}' \
+  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
+```
+
+The next boot's `purgeTerminalSandboxGames` removes the cancelled
+row, `findOrCreateSandboxGame` creates a fresh one, and
+`ensureMembershipsAndDrive` puts the new game back to `running`.
+
+### Owner
+
+Unassigned. File an issue once we have the runtime / reconciler
+analysis above; reference this section in the issue body so future
+redeploys can short-circuit the diagnostic loop.
diff --git a/tools/dev-deploy/README.md b/tools/dev-deploy/README.md
index 1a04485..5d3f68c 100644
--- a/tools/dev-deploy/README.md
+++ b/tools/dev-deploy/README.md
@@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir
 - `.env.example` — non-secret defaults for the compose `${VAR:-}`
   expansions. Copy to `.env` if you want host-local overrides.
 
+## Known issues
+
+See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
+in the long-lived dev environment but are not yet fixed (currently:
+the sandbox game flipping to `cancelled` after a redispatch).
+
 ## Relationship to other infrastructure
 
 - `tools/local-dev/` — single-developer playground, host-port mapped,