Ilia Denisov 5177fef2ef tools/dev-deploy: log the sandbox-cancellation TODO
2026-05-16 22:56:25 +02:00
# tools/dev-deploy/ — known issues

Issues that surface in the long-lived dev environment but are not yet fixed. Each entry lists the observed symptom, the diagnostic evidence, the working hypothesis, and the open questions that have to be answered before a fix lands.

## Dev Sandbox game flips to `cancelled` after a dev-deploy redispatch

### Symptom

A previously running "Dev Sandbox" game (created by `backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes after a `dev-deploy.yaml` `workflow_dispatch` run finishes. The user's browser session survives (the same `device_session_id` keeps working), but the lobby shows no game because the only game it had is now terminal. `purgeTerminalSandboxGames` does pick the cancelled game up on the next boot and creates a fresh sandbox — but the first redispatch leaves the user with an empty lobby until the backend restarts again.

### Diagnostic evidence

Backend logs from the broken cycle (timestamps abbreviated):

```
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
         op=reconcile status=removed message="container disappeared"
```

Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel) the backend logs are silent on the runtime / engine paths — no engine spawned, no engine container started, no runtime transition lines. The reconciler then fires and reports the engine container as missing.

`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'` returns no rows during this window — the engine container is neither running nor stopped on the host, so it either was never spawned or was removed before the host snapshot.

### Working hypotheses

1. Race between `Start` returning and the runtime spawn writing the container record. Bootstrap returns `status=starting`, and the service layer's `Start` is supposed to drive the game to `running` via the runtime layer's container spawn. If the spawn fails silently — or the goroutine that owns it exits before persisting the runtime record — the reconciler later sees a `starting` game with no container and cancels it.
2. `docker compose up -d --wait --remove-orphans` interaction. `--remove-orphans` is documented as "remove containers for services not defined in the Compose file". Engine containers are spawned by the backend with their own labels, not under the compose project namespace, so they should be exempt — but it is worth verifying with `docker inspect` on a live engine container that none of its labels accidentally pins it to the `name: galaxy-dev` compose project.
3. Engine `--rm` lifecycle. If the engine spawn uses `--rm` semantics, a transient crash that exits the process leaves no record on the host. Combined with hypothesis 1, the reconciler's "container disappeared" branch is exactly the shape we observe.
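The failure shape in hypothesis 1 can be sketched in a few lines of Go. None of these names come from `backend/internal/runtime` — `store`, `spawnEngine`, and `startAsync` are hypothetical stand-ins that only illustrate how a silently-exiting spawn goroutine leaves a `starting` game with no runtime record:

```go
package main

import (
	"errors"
	"fmt"
)

// store stands in for whatever persists the game -> container mapping.
type store struct{ containerID map[string]string }

func (s *store) persistRuntimeRecord(gameID, containerID string) {
	s.containerID[gameID] = containerID
}

// spawnEngine stands in for the docker spawn; here it always fails.
func spawnEngine(gameID string) (string, error) {
	return "", errors.New("transient docker error")
}

// startAsync is the buggy shape: the goroutine's error path returns before
// persisting anything, and the error never reaches the game record, so the
// game stays in "starting" with no container behind it.
func startAsync(s *store, gameID string) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		id, err := spawnEngine(gameID)
		if err != nil {
			return // silent exit: no runtime record, no start_failed transition
		}
		s.persistRuntimeRecord(gameID, id)
	}()
	<-done
}

func main() {
	s := &store{containerID: map[string]string{}}
	startAsync(s, "game_dev_sandbox")
	// A reconciler sweeping later finds a starting game with no container
	// record — exactly the "container disappeared" cancel in the logs.
	_, ok := s.containerID["game_dev_sandbox"]
	fmt.Println("runtime record persisted:", ok)
}
```

If the real code has this shape, the fix direction is to make the spawn goroutine's error path drive the game to `start_failed` (or at least log loudly) instead of returning silently.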

### What to investigate before fixing

- Inspect `backend/internal/runtime/` (spawn / reconciler) for the exact path the engine takes from `status=starting` to either `running` or `start_failed`. Specifically: which goroutine owns the spawn, where its error is logged, and whether `start_failed` is reachable from the runtime reconciler path or only from the in-bootstrap `Start` call.
- Check the engine container's `Config.Labels`, `HostConfig.AutoRemove`, and the `--remove-orphans` semantics with a deliberate redispatch and a `docker events --since 0` capture bracketing the deploy.
- Reproduce on a freshly seeded `clean-data` volume to rule out Postgres-state ambiguity.
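For the second checklist item, a small helper can pull the two relevant fields out of `docker inspect <container>` JSON: `HostConfig.AutoRemove` (which would explain a vanished container) and the `com.docker.compose.project` label (which Compose sets on its own containers, and which would make `--remove-orphans` a suspect if it ever appeared on an engine container). This is a minimal sketch, not project code; the sample JSON is an abbreviated stand-in for real inspect output:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inspectEntry models only the fields we care about from `docker inspect`.
type inspectEntry struct {
	HostConfig struct {
		AutoRemove bool `json:"AutoRemove"`
	} `json:"HostConfig"`
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// inspectChecks parses the JSON array docker prints and returns whether the
// container is AutoRemove and which compose project (if any) claims it.
func inspectChecks(raw []byte) (autoRemove bool, composeProject string, err error) {
	var entries []inspectEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return false, "", err
	}
	if len(entries) == 0 {
		return false, "", fmt.Errorf("no container in inspect output")
	}
	e := entries[0]
	return e.HostConfig.AutoRemove, e.Config.Labels["com.docker.compose.project"], nil
}

func main() {
	// Abbreviated stand-in for real `docker inspect` output on an engine container.
	raw := []byte(`[{"HostConfig":{"AutoRemove":true},
	  "Config":{"Labels":{"org.opencontainers.image.title":"galaxy-game-engine"}}}]`)
	autoRemove, project, err := inspectChecks(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("AutoRemove=%v composeProject=%q\n", autoRemove, project)
}
```

An empty `composeProject` exonerates `--remove-orphans`; `AutoRemove=true` supports hypothesis 3 (the `--rm` lifecycle leaving no host-side trace).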

### Workaround in use today

When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:

```sh
curl -X POST -n -H 'Content-Type: application/json' \
  -d '{"ref":"<branch>"}' \
  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```

The next boot's `purgeTerminalSandboxGames` removes the cancelled row, `findOrCreateSandboxGame` creates a fresh one, and `ensureMembershipsAndDrive` puts the new game back to `running`.
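The three function names in the boot recovery path come from the document; everything else in the following sketch (the in-memory `lobby` model, signatures, status strings) is an assumption, there only to make the purge → find-or-create → drive sequence concrete:

```go
package main

import "fmt"

type game struct {
	id     int
	status string
}

type lobby struct {
	games  []game
	nextID int
}

// purgeTerminalSandboxGames drops rows in a terminal status, as observed
// on the boot after a broken cycle.
func (l *lobby) purgeTerminalSandboxGames() {
	kept := l.games[:0]
	for _, g := range l.games {
		if g.status != "cancelled" && g.status != "start_failed" {
			kept = append(kept, g)
		}
	}
	l.games = kept
}

// findOrCreateSandboxGame returns a surviving sandbox or bootstraps a new one.
func (l *lobby) findOrCreateSandboxGame() *game {
	if len(l.games) > 0 {
		return &l.games[0]
	}
	l.nextID++
	l.games = append(l.games, game{id: l.nextID, status: "starting"})
	return &l.games[len(l.games)-1]
}

// ensureMembershipsAndDrive stands in for membership setup plus the drive to
// running — the step that silently fails in the broken cycle.
func (l *lobby) ensureMembershipsAndDrive(g *game) { g.status = "running" }

func main() {
	// Boot with one cancelled sandbox row, mirroring the 20:24:40 log lines.
	l := &lobby{games: []game{{id: 1, status: "cancelled"}}, nextID: 1}
	l.purgeTerminalSandboxGames()
	g := l.findOrCreateSandboxGame()
	l.ensureMembershipsAndDrive(g)
	fmt.Printf("game %d status=%s\n", g.id, g.status)
}
```

The workaround works precisely because this whole sequence reruns on every backend boot; the open bug is only that the drive-to-`running` step dies silently on the first redispatch.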

### Owner

Unassigned. File an issue once we have the runtime / reconciler analysis above; reference this section in the issue body so future redeploys can short-circuit the diagnostic loop.