From 5177fef2efb7bd086eb2152e150a8dc31be060e7 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 22:53:21 +0200 Subject: [PATCH 1/3] tools/dev-deploy: log the sandbox-cancellation TODO MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot. `KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately. No code change — this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch. Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 102 +++++++++++++++++++++++++++++++ tools/dev-deploy/README.md | 6 ++ 2 files changed, 108 insertions(+) create mode 100644 tools/dev-deploy/KNOWN-ISSUES.md diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md new file mode 100644 index 0000000..7e4dc4f --- /dev/null +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -0,0 +1,102 @@ +# `tools/dev-deploy/` — known issues + +Issues that surface in the long-lived dev environment but are not yet +fixed. Each entry lists the observed symptom, the diagnostic evidence, +the working hypothesis, and the open questions that have to be +answered before a fix lands. 
+ +## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch + +### Symptom + +A previously `running` "Dev Sandbox" game (created by +`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes +after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's +browser session survives (the same `device_session_id` keeps working), +but the lobby shows no game because the only game it had is now +terminal. `purgeTerminalSandboxGames` does pick it up on the **next** +boot and creates a fresh sandbox — but the first redispatch leaves +the user with an empty lobby until backend restarts again. + +### Diagnostic evidence + +Backend logs from the broken cycle (timestamps abbreviated): + +```text +20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled +20:24:40 dev_sandbox: memberships ensured count=20 game_id= +20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting +... +20:25:09 user mail sent failed (diplomail tables missing — unrelated) +... +20:39:40 lobby: game cancelled by runtime reconciler game_id= + op=reconcile status=removed message="container disappeared" +``` + +Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel) +the backend logs are silent on the runtime / engine paths — no +`engine spawned`, no `engine container started`, no `runtime +transition` lines. The reconciler then fires and reports the engine +container as missing. + +`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` +returns no rows during this window — the engine container is neither +running nor stopped on the host, so it either was never spawned or +was removed before the host snapshot. + +### Working hypotheses + +1. **Race between `Start` returning and the runtime spawn writing the + container record.** Bootstrap returns `status=starting` and the + service layer's `Start` is supposed to drive to `running` via the + runtime layer's container spawn. 
If the spawn fails silently — or + the goroutine that owns it exits before persisting the runtime + record — the reconciler later sees a `starting` game with no + container and cancels. +2. **`docker compose up -d --wait --remove-orphans` interaction.** + `--remove-orphans` is documented as "remove containers for + services not defined in the Compose file". Engine containers are + spawned by the backend with their own labels, not under the + compose project namespace, so they *should* be exempt — but it + is worth verifying with `docker inspect` on a live engine + container that none of its labels accidentally pin it to the + `name: galaxy-dev` compose project. +3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm` + semantics, a transient crash that exits the process leaves no + record on the host. Combined with hypothesis 1, the reconciler's + "container disappeared" branch is exactly the shape we observe. + +### What to investigate before fixing + +- Inspect `backend/internal/runtime/` (spawn / reconciler) for the + exact path the engine takes from `status=starting` to either + `running` or `start_failed`. Specifically: which goroutine owns + the spawn, where its error is logged, and whether `start_failed` + is reachable from the runtime reconciler path or only from the + in-bootstrap `Start` call. +- Check the engine container's `Config.Labels`, + `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with + a deliberate redispatch and `docker events --since 0` capture + bracketing the deploy. +- Reproduce on a freshly seeded `clean-data` volume to rule out + postgres-state ambiguity. 
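The "`docker events --since 0` capture bracketing the deploy" step can be made concrete with a small sketch. The capture command is shown as a comment; the sample log lines and the `galaxy-engine-g42` container name are invented for illustration, and the real name/label scheme should be confirmed against a live engine container first:

```shell
# Capture (armed on the dev host BEFORE the redispatch starts):
#   docker events --since 0 \
#     --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}' \
#     > /tmp/deploy-events.log
# Triage afterwards: keep removal events, drop the compose-managed
# services, so anything left is a candidate for the missing engine.
# The log below is a fabricated stand-in for a real capture.
cat > /tmp/deploy-events.log <<'EOF'
1747427080 die galaxy-dev-backend
1747427081 destroy galaxy-dev-backend
1747427085 destroy galaxy-engine-g42
1747427099 start galaxy-dev-backend
EOF
grep -E ' (die|destroy) ' /tmp/deploy-events.log | grep -v ' galaxy-dev-'
```

In the healthy case the final pipeline prints nothing; here it surfaces the invented `destroy` row for `galaxy-engine-g42`, which is exactly the event the hypotheses above are trying to attribute to a cause.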
+ +### Workaround in use today + +When the sandbox game flips to `cancelled`, redispatch `dev-deploy`: + +```sh +curl -X POST -n -H 'Content-Type: application/json' \ + -d '{"ref":""}' \ + https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches +``` + +The next boot's `purgeTerminalSandboxGames` removes the cancelled +row, `findOrCreateSandboxGame` creates a fresh one, and +`ensureMembershipsAndDrive` puts the new game back to `running`. + +### Owner + +Unassigned. File an issue once we have the runtime / reconciler +analysis above; reference this section in the issue body so future +redeploys can short-circuit the diagnostic loop. diff --git a/tools/dev-deploy/README.md b/tools/dev-deploy/README.md index 1a04485..5d3f68c 100644 --- a/tools/dev-deploy/README.md +++ b/tools/dev-deploy/README.md @@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir - `.env.example` — non-secret defaults for the compose `${VAR:-}` expansions. Copy to `.env` if you want host-local overrides. +## Known issues + +See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface +in the long-lived dev environment but are not yet fixed (currently: +the sandbox game flipping to `cancelled` after a redispatch). + ## Relationship to other infrastructure - `tools/local-dev/` — single-developer playground, host-port mapped, -- 2.52.0 From cadb72b412108466819c20f28866b59fa7927be9 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 23:10:13 +0200 Subject: [PATCH 2/3] KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A live `docker inspect` of an engine container and two redispatch runs with `docker events` captured confirm: - Engine has no `com.docker.compose.*` labels and `AutoRemove=false`, so `--remove-orphans` cannot reap it. 
- Two consecutive `dev-deploy.yaml` redispatches with an engine already running emitted `die` / `destroy` events only for `galaxy-dev-{backend,api,caddy}` — never for the engine. - The reconciler tick that fires 60s after backend recreate correctly matched the surviving engine in both cases (`status=running` in both `games` and `runtime_records`). - `runtime.Service` has no `Shutdown` that proactively removes engine containers, so a graceful backend exit also leaves them alone. The repro window therefore needs a separate trigger that removed the engine container outside of compose. The new hypotheses point at host-side `docker prune` jobs, a `dockerd` restart that lost the container, or an early `Engine.Init` failure that exited the engine before `status=running` reached the runtime row. The investigation list now leads with `journalctl -u docker` and the host crontab — those are the cheapest checks to confirm or rule out next. Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 99 ++++++++++++++++++++++---------- 1 file changed, 68 insertions(+), 31 deletions(-) diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md index 7e4dc4f..b3b7277 100644 --- a/tools/dev-deploy/KNOWN-ISSUES.md +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -44,40 +44,77 @@ returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot. -### Working hypotheses +### What has been ruled out -1. **Race between `Start` returning and the runtime spawn writing the - container record.** Bootstrap returns `status=starting` and the - service layer's `Start` is supposed to drive to `running` via the - runtime layer's container spawn. If the spawn fails silently — or - the goroutine that owns it exits before persisting the runtime - record — the reconciler later sees a `starting` game with no - container and cancels. -2. 
**`docker compose up -d --wait --remove-orphans` interaction.** - `--remove-orphans` is documented as "remove containers for - services not defined in the Compose file". Engine containers are - spawned by the backend with their own labels, not under the - compose project namespace, so they *should* be exempt — but it - is worth verifying with `docker inspect` on a live engine - container that none of its labels accidentally pin it to the - `name: galaxy-dev` compose project. -3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm` - semantics, a transient crash that exits the process leaves no - record on the host. Combined with hypothesis 1, the reconciler's - "container disappeared" branch is exactly the shape we observe. +A live `docker inspect` on a healthy engine container shows: -### What to investigate before fixing +```text +Labels: galaxy.backend=1, galaxy.engine_version=0.1.0, + galaxy.game_id=, + org.opencontainers.image.title=galaxy-game-engine, + com.galaxy.{cpu_quota,memory,pids_limit} +AutoRemove: false +RestartPolicy: on-failure +NetworkMode: galaxy-dev-internal +``` -- Inspect `backend/internal/runtime/` (spawn / reconciler) for the - exact path the engine takes from `status=starting` to either - `running` or `start_failed`. Specifically: which goroutine owns - the spawn, where its error is logged, and whether `start_failed` - is reachable from the runtime reconciler path or only from the - in-bootstrap `Start` call. -- Check the engine container's `Config.Labels`, - `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with - a deliberate redispatch and `docker events --since 0` capture - bracketing the deploy. +There are no `com.docker.compose.*` labels and `AutoRemove=false`, +so `--remove-orphans` cannot reap the engine and a `--rm`-style +self-destruct is not in play. 
Two redispatches captured under
+`docker events` (one `--filter event=NAME` flag per event type:
+`create`, `start`, `die`, `destroy`, `kill`, `stop`; the filter
+flag takes a single value, not a comma-separated list) also
+confirmed it: across both runs the only `die` / `destroy` events
+were for `galaxy-dev-{backend,api,caddy}`. The live engine
+container survived both redispatches, and the reconciler that
+fires 60 seconds after the new backend boots correctly matched
+it through `byGameID` / `byContainerID`.
+
+`backend/internal/runtime/service.go` only removes engine
+containers from the explicit `runStop` / `runRestart` / `runPatch`
+paths. There is no `runtime.Service.Shutdown` that proactively
+kills containers on backend exit, so a graceful SIGTERM to
+`galaxy-dev-backend` will not touch its child engine containers.
+
+### Remaining hypotheses
+
+1. **Engine self-crashed and was reaped by something host-side.**
+   `RestartPolicy=on-failure` only retries within Docker's own
+   limits; if the engine exited cleanly (status 0) Docker does
+   not restart, but does keep the row in `docker ps -a`. The
+   reproduction case had the engine missing from `docker ps -a`
+   entirely, so a separate cleanup (cron `docker container prune`,
+   a host script, manual `docker rm`) needs to be ruled out.
+2. **An out-of-band Docker daemon restart dropped the container.**
+   A `dockerd` restart that loses sight of an unmanaged container
+   is rare, but would explain why both the live tracking and
+   `docker ps -a` are empty. Correlate the gap with
+   `journalctl -u docker` on the host.
+3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init`
+   and the engine exited on its own before `status=running`.**
+   Bootstrap logs `status=starting` and then is silent until the
+   reconciler 15 minutes later; the runtime row in that case
+   should have been written with `status=engine_unreachable`, so
+   any reproduction needs a `runtime_records` snapshot from the
+   bad window — that table got wiped together with the cancelled
+   game on the next boot, so the post-mortem currently lacks it. 
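Once a bad-window snapshot exists, the split between the hypotheses above comes down to one column. A throwaway triage sketch, under loud assumptions: the two-column "status container_id" input shape and the status values are taken from the prose above, not verified against the real `runtime_records` schema:

```shell
# Hypothetical triage of a runtime_records snapshot: map each
# "status container_id" row onto the hypotheses above.
# (Status names are assumptions from this doc, not the real enum.)
triage() {
  while read -r status container_id; do
    case "$status" in
      running)
        echo "row reached running ($container_id): engine existed, removal was external" ;;
      engine_unreachable)
        echo "row stuck at engine_unreachable: spawn/init failure (hypothesis 3)" ;;
      *)
        echo "status=$status: inconclusive" ;;
    esac
  done
}
# Fabricated sample rows standing in for a real snapshot:
printf 'running c0ffee\nengine_unreachable -\n' | triage
```

The point of the sketch is the decision, not the plumbing: a row that reached `running` shifts suspicion host-side, while `engine_unreachable` points back at the spawn path.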
+ +### What to investigate next + +- On the dev host: list cron jobs, systemd timers, and any custom + shell that periodically runs `docker container prune` or + `docker system prune`. The host also runs gitea + crowdsec so + unrelated maintenance is plausible. +- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until + '2026-05-16 20:56:34'` for the original repro window — confirm + whether the daemon flagged a container removal in that gap. +- Re-run with backend logging level `debug` so the + `runtime.scheduler` and `runtime.workers` paths surface their + per-game timer / job decisions. The current `info` level says + nothing between bootstrap and the reconciler. +- Capture `runtime_records` for the broken game *before* the next + boot purges it; the column set + (`status`, `current_container_id`, `engine_endpoint`) tells + whether the engine ever reached `running` or stopped at + `engine_unreachable`. - Reproduce on a freshly seeded `clean-data` volume to rule out postgres-state ambiguity. -- 2.52.0 From 49f614926a82bd617a3e7248ac62f587a7c2f266 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Sat, 16 May 2026 23:16:51 +0200 Subject: [PATCH 3/3] KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses After the live investigation, the project owner confirms that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs with a fresh `docker events --since 0` capture armed before the deploy starts. 
Co-Authored-By: Claude Opus 4.7 --- tools/dev-deploy/KNOWN-ISSUES.md | 74 +++++++++++++++----------------- 1 file changed, 34 insertions(+), 40 deletions(-) diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md index b3b7277..32ab2d5 100644 --- a/tools/dev-deploy/KNOWN-ISSUES.md +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -74,49 +74,43 @@ paths. There is no `runtime.Service.Shutdown` that proactively kills containers on backend exit, so a graceful SIGTERM to `galaxy-dev-backend` will not touch its child engine containers. -### Remaining hypotheses +### Host-side hypotheses considered and rejected by the owner -1. **Engine self-crashed and was reaped by something host-side.** - `RestartPolicy=on-failure` only retries within Docker's own - limits; if the engine exited cleanly (status 0) Docker does - not restart, but does keep the row in `docker ps -a`. The - reproduction case had the engine missing from `docker ps -a` - entirely, so a separate cleanup (cron `docker container prune`, - a host script, manual `docker rm`) needs to be ruled out. -2. **An out-of-band Docker daemon restart dropped the container.** - A `dockerd` restart that loses sight of an unmanaged container - is rare, but would explain why both the live tracking and - `docker ps -a` are empty. Correlate the gap with - `journalctl -u docker` on the host. -3. **`runStart` errored at `waitForEngineHealthz` or `Engine.Init` - and the engine exited on its own before `status=running`.** - Bootstrap logs `status=starting` and then is silent until the - reconciler 15 minutes later; the runtime row in that case - should have been written with `status=engine_unreachable`, so - any reproduction needs a `runtime_records` snapshot from the - bad window — that table got wiped together with the cancelled - game on the next boot, so the post-mortem currently lacks it. 
+The natural follow-up suspects after compose was cleared — host-side +`docker prune` cron jobs, a manual `docker rm`, an out-of-band +`dockerd` restart, and an idle-state engine crash — were all +rejected by the project owner: the dev host runs none of those +periodic cleanups, no one manually removed the container, dockerd +was not restarted in the window, and the engine binary does not +crash while idling on API calls. -### What to investigate next +### Best remaining suspicion -- On the dev host: list cron jobs, systemd timers, and any custom - shell that periodically runs `docker container prune` or - `docker system prune`. The host also runs gitea + crowdsec so - unrelated maintenance is plausible. -- Inspect `journalctl -u docker --since '2026-05-16 20:50:00' --until - '2026-05-16 20:56:34'` for the original repro window — confirm - whether the daemon flagged a container removal in that gap. -- Re-run with backend logging level `debug` so the - `runtime.scheduler` and `runtime.workers` paths surface their - per-game timer / job decisions. The current `info` level says - nothing between bootstrap and the reconciler. -- Capture `runtime_records` for the broken game *before* the next - boot purges it; the column set - (`status`, `current_container_id`, `engine_endpoint`) tells - whether the engine ever reached `running` or stopped at - `engine_unreachable`. -- Reproduce on a freshly seeded `clean-data` volume to rule out - postgres-state ambiguity. +Something the `dev-deploy.yaml` CI run does between successful +image builds and the final `docker compose up -d --wait +--remove-orphans` clobbers the previously-spawned engine container. +The chain at runtime contains: + +1. `docker build -t galaxy-engine:dev -f game/Dockerfile .` +2. `docker compose build galaxy-backend galaxy-api` +3. `docker run --rm` alpine for the UI volume seed +4. 
`docker compose up -d --wait --remove-orphans` + +None of these *should* touch an unmanaged engine container, but +the reproduction window points squarely inside this sequence. A +deliberate next reproduction with `docker events --since 0` armed +*before* the deploy starts and live for the entire job — captured +end-to-end on the dev host, not just the chunk after backend +recreate — would pin which step emits the `destroy` on the engine. + +### Status + +Parked. The bug is mildly disruptive (one redispatch + a manual +`make seed-ui`-style follow-up brings the sandbox back) and the +remaining hypotheses are speculative. If the symptom recurs, attach +the next bad-window `docker events` capture to this entry and +reopen. A `tools/dev-deploy/` rewrite may obviate the issue +entirely; that is on the project owner's medium-term list. ### Workaround in use today -- 2.52.0