refactor(dev): remove the dev-sandbox bootstrap everywhere

Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned
"Dev Sandbox" game and dummy users are removed so the dev contour
starts empty like prod; the separate legacy-report loader stays as
the test-data path.

- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
  BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
  .env.example; the generic engine-recycle / prune-broken-engines
  logic stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
  README + Makefile) and stale comments; DeleteGame and
  InsertMembershipDirect remain (exercised by lobby integration tests)

No app behaviour change beyond not auto-creating the sandbox game.
2026-05-31 22:28:03 +02:00
parent 26f1e62924
commit 0cae89cba2
17 changed files with 60 additions and 737 deletions
@@ -1,164 +1,8 @@
 # `tools/dev-deploy/` — known issues

-Issues that surface in the long-lived dev environment but are not yet
-fixed. Each entry lists the observed symptom, the diagnostic evidence,
-the working hypothesis, and the open questions that have to be
-answered before a fix lands.
-
-## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
-
-### Symptom
-
-A previously `running` "Dev Sandbox" game (created by
-`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
-after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
-browser session survives (the same `device_session_id` keeps working),
-but the lobby shows no game because the only game it had is now
-terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
-boot and creates a fresh sandbox — but the first redispatch leaves
-the user with an empty lobby until backend restarts again.
-
-### Diagnostic evidence
-
-Backend logs from the broken cycle (timestamps abbreviated):
-
-```text
-20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
-20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
-20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
-...
-20:25:09 user mail sent failed (diplomail tables missing — unrelated)
-...
-20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
-         op=reconcile status=removed message="container disappeared"
-```
-
-Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
-the backend logs are silent on the runtime / engine paths — no
-`engine spawned`, no `engine container started`, no `runtime
-transition` lines. The reconciler then fires and reports the engine
-container as missing.
-
-`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
-returns no rows during this window — the engine container is neither
-running nor stopped on the host, so it either was never spawned or
-was removed before the host snapshot.
-
-### What has been ruled out
-
-A live `docker inspect` on a healthy engine container shows:
-
-```text
-Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
-        galaxy.game_id=<uuid>,
-        org.opencontainers.image.title=galaxy-game-engine,
-        com.galaxy.{cpu_quota,memory,pids_limit}
-AutoRemove:    false
-RestartPolicy: on-failure
-NetworkMode:   galaxy-dev-internal
-```
-
-There are no `com.docker.compose.*` labels and `AutoRemove=false`,
-so `--remove-orphans` cannot reap the engine and a `--rm`-style
-self-destruct is not in play. Two redispatches captured under
-`docker events --filter event=create,start,die,destroy,kill,stop`
-also confirmed it: across both runs the only `die` / `destroy`
-events were for `galaxy-dev-{backend,api,caddy}`. The live engine
-container survived both redispatches, and the reconciler that
-fires 60 seconds after the new backend boots correctly matched
-it through `byGameID` / `byContainerID`.
-
-`backend/internal/runtime/service.go` only removes engine
-containers from the explicit `runStop` / `runRestart` / `runPatch`
-paths. There is no `runtime.Service.Shutdown` that proactively
-kills containers on backend exit, so a graceful SIGTERM to
-`galaxy-dev-backend` will not touch its child engine containers.
-
-### Host-side hypotheses considered and rejected by the owner
-
-The natural follow-up suspects after compose was cleared — host-side
-`docker prune` cron jobs, a manual `docker rm`, an out-of-band
-`dockerd` restart, and an idle-state engine crash — were all
-rejected by the project owner: the dev host runs none of those
-periodic cleanups, no one manually removed the container, dockerd
-was not restarted in the window, and the engine binary does not
-crash while idling on API calls.
-
-### Best remaining suspicion
-
-Something the `dev-deploy.yaml` CI run does between successful
-image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
-The chain at runtime contains:
-
-1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
-2. `docker compose build galaxy-backend galaxy-api`
-3. `docker run --rm` alpine for the UI volume seed
-4. `docker compose up -d --wait --remove-orphans`
-
-None of these *should* touch an unmanaged engine container, but
-the reproduction window points squarely inside this sequence. A
-deliberate next reproduction with `docker events --since 0` armed
-*before* the deploy starts and live for the entire job — captured
-end-to-end on the dev host, not just the chunk after backend
-recreate — would pin which step emits the `destroy` on the engine.
-
-### Update 2026-05-19: integration preclean identified as one cause
-
-A live reproduction during the post-merge auto-deploy cycle (Gitea
-run #188 dev-deploy plus parallel run #190 integration) pinned one
-clobbering source: `integration/scripts/preclean.sh` was unscoped
-and removed *every* container labelled `galaxy.backend=1`, including
-the dev-deploy engine. Timeline from the dev host:
-
-```text
-23:10:40  backend pre-bootstrap reconciler tick: engine alive
-23:10:40  dev_sandbox bootstrap: status=running
-23:10:56  preclean: removing 1 backend-managed engine containers  ← integration run #190
-23:11:40  reconciler: container disappeared → game cancelled
-```
-
-Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
-every integration backend (see
-`integration/testenv/backend.go`) and `preclean.sh` AND-combines
-`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
-local-dev engines stamped with different stack values are no longer
-collateral.
-
-This covers **push**-triggered cycles where `dev-deploy.yaml` and
-`integration.yaml` run on the same Gitea host. The original
-hypothesis (a `workflow_dispatch dev-deploy` solo run also losing
-the engine) is *not* explained by the integration fix — manual
-dispatches do not trigger `integration.yaml`. Keep this entry open
-until a solo-dispatch reproduction confirms whether the symptom
-still occurs.
-
-### Status
-
-Partially fixed (push-triggered cycles). Solo `workflow_dispatch`
-reproductions still open. If the symptom recurs after the
-integration fix lands, capture `docker events --since 0` for the
-full dispatch window and attach here.
-
-### Workaround in use today
-
-When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
-
-```sh
-curl -X POST -n -H 'Content-Type: application/json' \
-  -d '{"ref":"<branch>"}' \
-  https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
-```
-
-The next boot's `purgeTerminalSandboxGames` removes the cancelled
-row, `findOrCreateSandboxGame` creates a fresh one, and
-`ensureMembershipsAndDrive` puts the new game back to `running`.
-
-### Owner
-
-Unassigned. File an issue once we have the runtime / reconciler
-analysis above; reference this section in the issue body so future
-redeploys can short-circuit the diagnostic loop.
+Issues that surfaced in the long-lived dev environment. Each entry lists
+the observed symptom, the diagnostic evidence, and the fix or the open
+questions that have to be answered before a fix lands.

 ## `docker restart galaxy-dev-backend` fails after the CI runner cleans up