chore(dev-deploy): KNOWN-ISSUES entry for sandbox-cancel after redispatch #12

Merged
developer merged 3 commits from chore/dev-sandbox-cancel-todo into development 2026-05-16 21:17:37 +00:00
2 changed files with 108 additions and 0 deletions
Showing only changes of commit 5177fef2ef
@@ -0,0 +1,102 @@
# `tools/dev-deploy/` — known issues
Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.
## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
### Symptom
A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` picks it up on the **next**
boot and a fresh sandbox is created, but until the backend restarts
again the first redispatch leaves the user with an empty lobby.
### Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
```
Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
`engine spawned`, no `engine container started`, no `runtime
transition` lines. The reconciler then fires and reports the engine
container as missing.
`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
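The silence between 20:24:40 and 20:39:40 can be re-checked against a saved copy of the backend log. This is a minimal sketch, assuming each log line starts with a zero-padded `HH:MM:SS` timestamp as in the abbreviated excerpt above; the function name `runtime_lines_in_window` and the file name `backend.log` are hypothetical.

```sh
# Print log lines inside [start, end] that mention the runtime / engine paths.
# Assumption: every line starts with a zero-padded HH:MM:SS timestamp, so a
# plain string comparison orders timestamps correctly within one day.
runtime_lines_in_window() {
  log_file=$1 start=$2 end=$3
  awk -v start="$start" -v end="$end" '
    $1 >= start && $1 <= end &&
    /engine spawned|engine container started|runtime transition/
  ' "$log_file"
}

# Usage (hypothetical file name); no output means the window is silent:
# runtime_lines_in_window backend.log 20:24:40 20:39:40
```

An empty result reproduces the observation above; any printed line narrows down where the spawn path actually got to.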
### Working hypotheses
1. **Race between `Start` returning and the runtime spawn writing the
container record.** Bootstrap returns `status=starting` and the
service layer's `Start` is supposed to drive to `running` via the
runtime layer's container spawn. If the spawn fails silently, or
the goroutine that owns it exits before persisting the runtime
record, the reconciler later sees a `starting` game with no
container and cancels it.
2. **`docker compose up -d --wait --remove-orphans` interaction.**
`--remove-orphans` is documented as "remove containers for
services not defined in the Compose file". Engine containers are
spawned by the backend with their own labels, not under the
compose project namespace, so they *should* be exempt — but it
is worth verifying with `docker inspect` on a live engine
container that none of its labels accidentally pin it to the
`name: galaxy-dev` compose project.
3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
semantics (`HostConfig.AutoRemove`), Docker removes the container
as soon as its process exits, so a transient crash leaves no
record on the host. Combined with hypothesis 1, the reconciler's
"container disappeared" branch is exactly the shape we observe.
### What to investigate before fixing
- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
exact path the engine takes from `status=starting` to either
`running` or `start_failed`. Specifically: which goroutine owns
the spawn, where its error is logged, and whether `start_failed`
is reachable from the runtime reconciler path or only from the
in-bootstrap `Start` call.
- Check the engine container's `Config.Labels` and
`HostConfig.AutoRemove`, and verify the `--remove-orphans`
semantics with a deliberate redispatch while a
`docker events --since 0` capture brackets the deploy.
- Reproduce on freshly wiped volumes (after `make clean-data`) to
rule out stale Postgres state as a factor.
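The label and auto-remove checks can be scripted. This is a sketch, assuming engine containers can be selected by the same `org.opencontainers.image.title` label used in the `docker ps` filter above and that at least one engine container exists when it runs; the function names are hypothetical. `com.docker.compose.project` is the label Compose uses to associate a container with a project, so a non-empty value would support hypothesis 2, and `auto_remove=true` would support hypothesis 3.

```sh
# Print name, Compose project label, and AutoRemove flag for each engine
# container. Assumption: the image-title label identifies engine containers.
inspect_engine_containers() {
  docker ps -aq --filter 'label=org.opencontainers.image.title=galaxy-game-engine' |
  while read -r id; do
    docker inspect --format \
      '{{.Name}} compose_project={{index .Config.Labels "com.docker.compose.project"}} auto_remove={{.HostConfig.AutoRemove}}' \
      "$id"
  done
}

# Stream container lifecycle events as JSON; extra arguments are passed to
# `docker events` (for example --since / --until to bracket the deploy).
capture_container_events() {
  docker events --filter type=container --format '{{json .}}' "$@"
}

# Usage: start the capture in one terminal before redispatching, then
# inspect while an engine container is still alive:
# capture_container_events --since 0 > events.jsonl
# inspect_engine_containers
```

A `destroy` event for the engine container in the captured stream would show when it was removed, which the `docker ps -a` snapshot alone cannot.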
### Workaround in use today
When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
```sh
# -n reads credentials from ~/.netrc; replace <branch> with the ref to deploy.
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```
The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` drives the new game back to `running`.
### Owner
Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.
@@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir
- `.env.example` — non-secret defaults for the compose `${VAR:-}`
expansions. Copy to `.env` if you want host-local overrides.
## Known issues
See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to `cancelled` after a redispatch).
## Relationship to other infrastructure
- `tools/local-dev/` — single-developer playground, host-port mapped,