diff --git a/tools/dev-deploy/KNOWN-ISSUES.md b/tools/dev-deploy/KNOWN-ISSUES.md new file mode 100644 index 0000000..32ab2d5 --- /dev/null +++ b/tools/dev-deploy/KNOWN-ISSUES.md @@ -0,0 +1,133 @@ +# `tools/dev-deploy/` — known issues + +Issues that surface in the long-lived dev environment but are not yet +fixed. Each entry lists the observed symptom, the diagnostic evidence, +the working hypothesis, and the open questions that have to be +answered before a fix lands. + +## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch + +### Symptom + +A previously `running` "Dev Sandbox" game (created by +`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes +after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's +browser session survives (the same `device_session_id` keeps working), +but the lobby shows no game because the only game it had is now +terminal. `purgeTerminalSandboxGames` does pick it up on the **next** +boot and creates a fresh sandbox — but the first redispatch leaves +the user with an empty lobby until backend restarts again. + +### Diagnostic evidence + +Backend logs from the broken cycle (timestamps abbreviated): + +```text +20:24:40 dev_sandbox: purged terminal sandbox game game_id= status=cancelled +20:24:40 dev_sandbox: memberships ensured count=20 game_id= +20:24:40 dev_sandbox: bootstrap complete user_id= game_id= status=starting +... +20:25:09 user mail sent failed (diplomail tables missing — unrelated) +... +20:39:40 lobby: game cancelled by runtime reconciler game_id= + op=reconcile status=removed message="container disappeared" +``` + +Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel) +the backend logs are silent on the runtime / engine paths — no +`engine spawned`, no `engine container started`, no `runtime +transition` lines. The reconciler then fires and reports the engine +container as missing. 
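The silent window can be checked mechanically against a saved backend log. A hedged sketch (`window_runtime_lines` is a hypothetical helper, not part of the tooling, and the match patterns are lifted from the excerpt above rather than from the real log format):

```sh
# Sketch: print any runtime/engine lines between "bootstrap complete"
# and the reconciler cancel. An empty result confirms the silence
# described above; any hit would contradict it.
window_runtime_lines() {
  awk '/bootstrap complete/              { inwin = 1; next }
       /cancelled by runtime reconciler/ { inwin = 0 }
       inwin && /engine|runtime/' "$1"
}
```

Running it over the captured bad-window log is a cheap sanity check before trusting the "engine never spawned" reading.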
+ +`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` +returns no rows during this window — the engine container is neither +running nor stopped on the host, so it either was never spawned or +was removed before the host snapshot. + +### What has been ruled out + +A live `docker inspect` on a healthy engine container shows: + +```text +Labels: galaxy.backend=1, galaxy.engine_version=0.1.0, + galaxy.game_id=, + org.opencontainers.image.title=galaxy-game-engine, + com.galaxy.{cpu_quota,memory,pids_limit} +AutoRemove: false +RestartPolicy: on-failure +NetworkMode: galaxy-dev-internal +``` + +There are no `com.docker.compose.*` labels and `AutoRemove=false`, +so `--remove-orphans` cannot reap the engine and a `--rm`-style +self-destruct is not in play. Two redispatches captured under +`docker events --filter event=create,start,die,destroy,kill,stop` +also confirmed it: across both runs the only `die` / `destroy` +events were for `galaxy-dev-{backend,api,caddy}`. The live engine +container survived both redispatches, and the reconciler that +fires 60 seconds after the new backend boots correctly matched +it through `byGameID` / `byContainerID`. + +`backend/internal/runtime/service.go` only removes engine +containers from the explicit `runStop` / `runRestart` / `runPatch` +paths. There is no `runtime.Service.Shutdown` that proactively +kills containers on backend exit, so a graceful SIGTERM to +`galaxy-dev-backend` will not touch its child engine containers. 
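When the next bad-window events capture lands, a small filter makes an unexpected engine teardown stand out. A hedged sketch (`sift_events` is a hypothetical helper; it assumes a one-event-per-line capture shaped like `time action container-name`, e.g. from `docker events --format '{{.Time}} {{.Action}} {{.Actor.Attributes.name}}'`, which is a choice for future captures, not what the original capture used; the `galaxy-dev-*` names come from the redispatch captures above):

```sh
# Sketch: print die/destroy events that are NOT the expected compose
# services, so an engine-container teardown stands out immediately.
sift_events() {
  grep -E ' (die|destroy) ' "$1" | grep -Ev 'galaxy-dev-(backend|api|caddy)'
}
```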
### Host-side hypotheses considered and rejected by the owner

The natural follow-up suspects after compose was cleared — host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash — were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.

### Best remaining suspicion

Something the `dev-deploy.yaml` CI run does between the successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously spawned engine container.
At runtime the job runs, in order:

1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`

None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and kept live for the entire job —
captured end-to-end on the dev host, not just the chunk after the
backend recreate — would pin down which step emits the `destroy`
for the engine.

### Status

Parked. The bug is mildly disruptive (one redispatch plus a manual
`make seed-ui`-style follow-up brings the sandbox back) and the
remaining hypotheses are speculative. If the symptom recurs, attach
the next bad-window `docker events` capture to this entry and
reopen. A `tools/dev-deploy/` rewrite may obviate the issue
entirely; that is on the project owner's medium-term list.
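Should the reproduction be rearmed, bracketing each deploy step with an engine-container snapshot would narrow the culprit even without a full events capture. A hedged sketch (`lost_ids` is a hypothetical helper; the label filter comes from the `docker ps` check above; requires bash for process substitution):

```sh
# Sketch: report container IDs present in the "before" snapshot but
# missing from the "after" one. Snapshots would come from, e.g.:
#   docker ps -aq --filter 'label=org.opencontainers.image.title=galaxy-game-engine'
# taken immediately before and after each deploy step.
lost_ids() {
  comm -23 <(printf '%s\n' "$1" | sort) <(printf '%s\n' "$2" | sort)
}
```

A non-empty result from `lost_ids "$before" "$after"` around a single step would name the clobbering step directly.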
### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":""}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.

### Owner

Unassigned. File an issue that carries the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.

diff --git a/tools/dev-deploy/README.md b/tools/dev-deploy/README.md
index 1a04485..5d3f68c 100644
--- a/tools/dev-deploy/README.md
+++ b/tools/dev-deploy/README.md
@@ -177,6 +177,12 @@ make clean-data
 Stop everything and wipe volumes + game-state dir
 - `.env.example` — non-secret defaults for the compose `${VAR:-}` expansions.
   Copy to `.env` if you want host-local overrides.
+## Known issues
+
+See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
+in the long-lived dev environment but are not yet fixed (currently:
+the sandbox game flipping to `cancelled` after a redispatch).
+
 ## Relationship to other infrastructure

 - `tools/local-dev/` — single-developer playground, host-port mapped,