tools/dev-deploy: log the sandbox-cancellation TODO

Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never shows
up in `docker ps -a --filter label=galaxy-game-engine`, so either it
never spawned or it was removed before any host-side snapshot.

`KNOWN-ISSUES.md` records the symptom, the log excerpt, three working
hypotheses (runtime spawn race, `--remove-orphans` interaction, engine
`--rm` lifecycle), and the investigation checklist before opening an
issue. The README gets a one-line pointer so future redeploys land on
the doc immediately.

No code change; this is the placeholder so the next person
investigating the cancellation pattern does not have to rediscover the
diagnostic from scratch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,102 @@
# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` `workflow_dispatch` run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot, after which a fresh sandbox is created, but the first redispatch
leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel),
exactly 15 minutes apart in this excerpt, the backend logs are silent
on the runtime / engine paths: no `engine spawned`, no
`engine container started`, no `runtime transition` lines. The
reconciler then fires and reports the engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window. The engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.

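A single `docker ps -a` run only shows that the container was absent at that instant. One way to narrow down "never spawned" versus "removed before the snapshot" is to poll the same filter across the whole window. A minimal sketch, assuming the engine image really carries the label used above; the function names, the output format, and the polling parameters are illustrative and not part of the repository:

```sh
# Hypothetical helper: print one timestamped line per engine container
# (running or exited) that currently matches the label filter.
snapshot_engine_containers() {
  now=$(date -u +%H:%M:%S)
  docker ps -a \
    --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
    --format '{{.ID}} {{.Status}} {{.Names}}' |
    while read -r line; do
      printf '%s %s\n' "$now" "$line"
    done
}

# Hypothetical polling loop: take $1 snapshots, sleeping $2 seconds after each.
watch_engine_containers() {
  count=$1
  interval=$2
  while [ "$count" -gt 0 ]; do
    snapshot_engine_containers
    count=$((count - 1))
    sleep "$interval"
  done
}
```

For example, `watch_engine_containers 900 1 | tee engine-ps.log` samples roughly once per second for at least 15 minutes. A container that lives for less than one interval can still slip through, so an empty log is suggestive rather than conclusive.
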
### Working hypotheses

1. **Race between `Start` returning and the runtime spawn writing the
   container record.** Bootstrap returns `status=starting` and the
   service layer's `Start` is supposed to drive the game to `running`
   via the runtime layer's container spawn. If the spawn fails
   silently, or the goroutine that owns it exits before persisting the
   runtime record, the reconciler later sees a `starting` game with no
   container and cancels it.
2. **`docker compose up -d --wait --remove-orphans` interaction.**
   `--remove-orphans` is documented as "remove containers for
   services not defined in the Compose file". Engine containers are
   spawned by the backend with their own labels, not under the
   compose project namespace, so they *should* be exempt. It is still
   worth verifying with `docker inspect` on a live engine
   container that none of its labels accidentally pin it to the
   `name: galaxy-dev` compose project.
3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
   semantics, a transient crash that exits the process leaves no
   record on the host. Combined with hypothesis 1, the reconciler's
   "container disappeared" branch is exactly the shape we observe.

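Hypotheses 2 and 3 each reduce to one field in the `docker inspect` output of a live engine container. A hedged sketch, assuming Compose marks the containers it manages with the `com.docker.compose.project` label (which, as far as we know, is what `--remove-orphans` uses to scope its cleanup); the helper name and its output format are illustrative:

```sh
# Hypothetical helper: report the two properties that hypotheses 2 and 3
# depend on, for one container ID or name passed as $1.
check_engine_container() {
  cid=$1
  # Empty output means the container carries no Compose project label.
  project=$(docker inspect \
    --format '{{ index .Config.Labels "com.docker.compose.project" }}' "$cid")
  # "true" when the container was created with --rm (AutoRemove).
  auto_remove=$(docker inspect --format '{{ .HostConfig.AutoRemove }}' "$cid")
  printf 'compose_project=%s auto_remove=%s\n' "${project:-<none>}" "$auto_remove"
}
```

Run it as `check_engine_container <container-id>` while a sandbox engine is alive. An output of `compose_project=galaxy-dev` would support hypothesis 2, and `auto_remove=true` would make hypothesis 3 plausible.
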
### What to investigate before fixing

- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
  exact path the engine takes from `status=starting` to either
  `running` or `start_failed`. Specifically: which goroutine owns
  the spawn, where its error is logged, and whether `start_failed`
  is reachable from the runtime reconciler path or only from the
  in-bootstrap `Start` call.
- Check the engine container's `Config.Labels` and
  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics, with
  a deliberate redispatch and a `docker events --since 0` capture
  bracketing the deploy.
- Reproduce on freshly wiped volumes (after `make clean-data`) to rule
  out Postgres-state ambiguity.

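The `docker events` capture from the checklist could look like the sketch below. It assumes the redispatch and the ~15 minute wait both happen between the two timestamps; the helper name and the output file name are hypothetical. Repeating the `event` filter matches any of the listed events, and an `--until` value in the past makes `docker events` print what it has and exit instead of streaming.

```sh
# Hypothetical helper: dump container lifecycle events between two Unix
# timestamps ($1 = since, $2 = until) as one JSON object per line.
engine_events_between() {
  since=$1
  until_ts=$2
  docker events \
    --since "$since" --until "$until_ts" \
    --filter type=container \
    --filter event=create --filter event=start \
    --filter event=die --filter event=destroy \
    --format '{{json .}}'
}

# Usage sketch (not executed here):
#   start=$(date +%s)
#   ... trigger the redispatch, wait for the game to flip to cancelled ...
#   engine_events_between "$start" "$(date +%s)" > deploy-events.jsonl
```

The daemon keeps only a bounded backlog of past events, so on a busy host it may be safer to leave `docker events` streaming in a second terminal for the whole window. A `create` followed by `die` and `destroy` for an engine container would point at hypotheses 2 or 3; no engine events at all would point at hypothesis 1.
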
### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
# -n makes curl read the API credentials from ~/.netrc.
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.
@@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir
- `.env.example` — non-secret defaults for the compose `${VAR:-}`
expansions. Copy to `.env` if you want host-local overrides.

## Known issues

See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to `cancelled` after a redispatch).

## Relationship to other infrastructure

- `tools/local-dev/` — single-developer playground, host-port mapped,