tools/dev-deploy: log the sandbox-cancellation TODO
Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot.

`KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately.

No code change: this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,102 @@
# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypotheses, and the open questions that have to be
answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` `workflow_dispatch` run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot, and a fresh sandbox is created then, but the first redispatch
leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel),
the backend logs are silent on the runtime / engine paths: no
`engine spawned`, no `engine container started`, and no
`runtime transition` lines. The reconciler then fires and reports the
engine container as missing.

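
One way to re-check that silence on a fresh log capture is to filter the window for the three marker strings named above. This is a minimal sketch, not part of the repository: the function name `runtime_lines_in_window` is hypothetical, and it assumes each log line starts with an `HH:MM:SS` timestamp as in the abbreviated excerpt (real backend logs likely carry a fuller timestamp, so the field comparison would need adjusting). Fed a saved log on stdin with the window bounds as arguments, it prints any matching lines; no output and a non-zero exit status suggest the window really is silent.

```sh
# Hypothetical helper: print log lines inside [start, end] that mention
# the runtime/engine markers. Assumes field 1 is an HH:MM:SS timestamp.
runtime_lines_in_window() {
  start="$1"
  end="$2"
  # Zero-padded HH:MM:SS values compare correctly as plain strings.
  awk -v s="$start" -v e="$end" '$1 >= s && $1 <= e' |
    grep -E 'engine spawned|engine container started|runtime transition'
}
```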

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window: the engine container is neither
running nor stopped on the host, so either it was never spawned or it
was removed before the host snapshot.

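
Because a single `docker ps -a` is only a point-in-time snapshot, polling it throughout the ~15-minute window would help separate "never spawned" from "removed between snapshots". The following is a sketch under assumptions: `watch_engine_containers` and its interval/count arguments are hypothetical, and the label filter is the one quoted above.

```sh
# Hypothetical helper: take a timestamped `docker ps -a` snapshot of
# engine containers every <interval> seconds, <count> times.
watch_engine_containers() {
  interval="${1:-5}"
  count="${2:-180}"   # 180 snapshots x 5 s = 15 minutes
  i=0
  while [ "$i" -lt "$count" ]; do
    date -u '+--- %H:%M:%S'
    docker ps -a \
      --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
      --format '{{.ID}} {{.Status}} {{.Names}}'
    i=$((i + 1))
    # Skip the sleep after the final snapshot.
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}
```

For example, `watch_engine_containers 5 180 > engine-ps.log`, started right after the dispatch, covers the whole window. A row that appears and later vanishes would point at removal rather than a missing spawn.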

### Working hypotheses

1. **Race between `Start` returning and the runtime spawn writing the
   container record.** Bootstrap returns `status=starting`, and the
   service layer's `Start` is supposed to drive the game to `running`
   via the runtime layer's container spawn. If the spawn fails
   silently, or the goroutine that owns it exits before persisting
   the runtime record, the reconciler later sees a `starting` game
   with no container and cancels it.
2. **`docker compose up -d --wait --remove-orphans` interaction.**
   `--remove-orphans` is documented as "remove containers for
   services not defined in the Compose file". Engine containers are
   spawned by the backend with their own labels, not under the
   compose project namespace, so they *should* be exempt. It is
   still worth verifying with `docker inspect` on a live engine
   container that none of its labels accidentally pin it to the
   `name: galaxy-dev` compose project.
3. **Engine `--rm` lifecycle.** If the engine spawn uses `--rm`
   semantics, a transient crash that exits the process leaves no
   record on the host. Combined with hypothesis 1, the reconciler's
   "container disappeared" branch is exactly the shape we observe.

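
Hypotheses 2 and 3 can both be checked against a live engine container with a single `docker inspect`. The sketch below is an illustration rather than existing tooling: `inspect_engine_lifecycle` is a hypothetical name, and the container ID or name is supplied by the caller. It relies on two Docker behaviours: Compose tags the containers it creates with the `com.docker.compose.project` label, and `--rm` is recorded as `HostConfig.AutoRemove`.

```sh
# Hypothetical helper: print the Compose project label (empty if unset)
# and the auto-remove flag for one container.
inspect_engine_lifecycle() {
  docker inspect --format \
    'compose_project={{ index .Config.Labels "com.docker.compose.project" }} auto_remove={{ .HostConfig.AutoRemove }}' \
    "$1"
}
```

An output containing `compose_project=galaxy-dev` would support hypothesis 2, and `auto_remove=true` would support hypothesis 3.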

### What to investigate before fixing

- Inspect `backend/internal/runtime/` (spawn / reconciler) for the
  exact path the engine takes from `status=starting` to either
  `running` or `start_failed`. Specifically: which goroutine owns
  the spawn, where its error is logged, and whether `start_failed`
  is reachable from the runtime reconciler path or only from the
  in-bootstrap `Start` call.
- Check the engine container's `Config.Labels` and
  `HostConfig.AutoRemove`, and the `--remove-orphans` semantics, with
  a deliberate redispatch and a `docker events --since 0` capture
  bracketing the deploy.
- Reproduce on freshly seeded volumes (after `make clean-data`) to
  rule out Postgres-state ambiguity.

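
The event capture from the second item could be bracketed around a redispatch roughly as follows. This is a sketch under assumptions: `capture_engine_events` is a hypothetical name, and the deploy step is whatever command the caller passes in. Note that it streams events live from just before the deploy instead of replaying history with `--since 0`, so it only sees what happens while the wrapped command runs.

```sh
# Hypothetical helper: record container events to <outfile> while the
# given command runs, then stop the capture.
capture_engine_events() {
  outfile="$1"
  shift
  # Stream container-level events (create, die, destroy, ...) as JSON.
  docker events --filter 'type=container' --format '{{json .}}' \
    > "$outfile" &
  events_pid=$!
  status=0
  "$@" || status=$?          # the deploy step being bracketed
  kill "$events_pid" 2>/dev/null || true
  wait "$events_pid" 2>/dev/null || true
  return "$status"
}
```

Since the reconciler fires about 15 minutes after the deploy, the wrapped command would need to keep running (or sleep) for that long for a late `destroy` event to land in the file.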

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
# -n reads credentials from ~/.netrc; replace <branch> with the ref to deploy.
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.
@@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir
- `.env.example` — non-secret defaults for the compose `${VAR:-}`
expansions. Copy to `.env` if you want host-local overrides.

## Known issues

See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to `cancelled` after a redispatch).

## Relationship to other infrastructure
- `tools/local-dev/` — single-developer playground, host-port mapped,