Merge pull request 'chore(dev-deploy): KNOWN-ISSUES entry for sandbox-cancel after redispatch' (#12) from chore/dev-sandbox-cancel-todo into development
@@ -0,0 +1,133 @@
# `tools/dev-deploy/` — known issues

Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch

### Symptom

A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot and creates a fresh sandbox — but the first redispatch leaves
the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
`engine spawned`, no `engine container started`, no `runtime
transition` lines. The reconciler then fires and reports the engine
container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
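One way to catch the exact disappearance moment on a future bad window is a small host-side watcher. This is a sketch, not an existing tool; only the label value is taken from the `docker ps` filter used above, and the helper name is invented for illustration:

```sh
# report_change PREV CUR — hypothetical helper: print a timestamped
# line whenever the engine-container snapshot changes, so the log shows
# exactly when the container vanished.
report_change() {
  prev=$1 cur=$2
  if [ "$cur" != "$prev" ]; then
    printf '%s %s\n' "$(date -u +%FT%TZ)" "${cur:-<no engine container>}"
  fi
}

# Polling loop on the dev host, using the same label filter as above:
#   prev=''
#   while sleep 5; do
#     cur=$(docker ps -a \
#       --filter 'label=org.opencontainers.image.title=galaxy-game-engine' \
#       --format '{{.ID}} {{.Status}}')
#     report_change "$prev" "$cur"
#     prev=$cur
#   done
```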

### What has been ruled out

A live `docker inspect` on a healthy engine container shows:

```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
        galaxy.game_id=<uuid>,
        org.opencontainers.image.title=galaxy-game-engine,
        com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```

There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured under
`docker events` (filtered to the `create`, `start`, `die`,
`destroy`, `kill`, and `stop` container events) also confirmed it:
across both runs the only `die` / `destroy` events were for
`galaxy-dev-{backend,api,caddy}`. The live engine container
survived both redispatches, and the reconciler that fires 60
seconds after the new backend boots correctly matched it through
`byGameID` / `byContainerID`.
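When the next bad-window capture exists, a filter along these lines can isolate anything fatal that was not the expected service churn. A sketch only: it assumes the capture was saved with default `docker events` formatting (attributes such as `name=` in a trailing parenthesized list), and the function name is invented:

```sh
# filter_fatal_events FILE — keep container die/destroy/kill events and
# drop the expected galaxy-dev-{backend,api,caddy} recreate churn.
# Whatever survives is a candidate for whatever killed the engine.
filter_fatal_events() {
  grep -E ' container (die|destroy|kill) ' "$1" \
    | grep -Ev 'name=galaxy-dev-(backend|api|caddy)[,)]'
}
```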

`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.

### Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared — host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash — were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between the successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The chain at runtime contains:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`

None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job — captured
end-to-end on the dev host, not just the chunk after the backend
recreate — would pin down which step emits the `destroy` on the
engine.
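Concretely, the capture could be armed like this on the dev host. A sketch under assumptions: the log path and the helper function are hypothetical, and the helper assumes the default `docker events` line layout (timestamp first):

```sh
# Arm the capture before dispatching the deploy and keep it alive for
# the entire job (run on the dev host, stop after the reconciler fires):
#   docker events --since 0 > /tmp/redispatch-events.log 2>&1 &
#   EVENTS_PID=$!
#   ...dispatch dev-deploy, wait through the ~15-minute window...
#   kill "$EVENTS_PID"

# first_destroy_ts FILE — timestamp of the first container `destroy`
# in the capture, to line up against the job's per-step timestamps.
first_destroy_ts() {
  awk '/ container destroy /{print $1; exit}' "$1"
}
```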

### Status

Parked. The bug is mildly disruptive (one redispatch + a manual
`make seed-ui`-style follow-up brings the sandbox back) and the
remaining hypotheses are speculative. If the symptom recurs, attach
the next bad-window `docker events` capture to this entry and
reopen. A `tools/dev-deploy/` rewrite may obviate the issue
entirely; that is on the project owner's medium-term list.

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.
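To confirm the recovery actually took on the fresh boot, grepping the backend logs for the two lines from the diagnostic evidence above is usually enough. A sketch; the helper name is invented, and the `galaxy-dev-backend` container name is assumed from the events capture:

```sh
# recovered LOGFILE — succeed only if the boot log shows both the purge
# of the cancelled row and a fresh bootstrap-complete line.
recovered() {
  grep -q 'dev_sandbox: purged terminal sandbox game' "$1" &&
    grep -q 'dev_sandbox: bootstrap complete' "$1"
}

# Usage on the dev host (capture the new boot's logs first):
#   docker logs galaxy-dev-backend > /tmp/boot.log 2>&1
#   recovered /tmp/boot.log && echo 'sandbox is back'
```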

### Owner

Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.

@@ -177,6 +177,12 @@ make clean-data Stop everything and wipe volumes + game-state dir

- `.env.example` — non-secret defaults for the compose `${VAR:-}`
  expansions. Copy to `.env` if you want host-local overrides.

## Known issues

See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to `cancelled` after a redispatch).

## Relationship to other infrastructure

- `tools/local-dev/` — single-developer playground, host-port mapped,