refactor(dev): remove the dev-sandbox bootstrap everywhere
Tests · Go / test (push) Successful in 1m59s

Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev
Sandbox" game and dummy users are removed so the dev contour starts
empty like prod; the separate legacy-report loader stays as the
test-data path.

- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
  BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
  .env.example; the generic engine-recycle / prune-broken-engines logic
  stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
  README + Makefile) and stale comments; DeleteGame and
  InsertMembershipDirect remain (exercised by lobby integration tests)

No app behaviour change beyond not auto-creating the sandbox game.
This commit is contained in:
Ilia Denisov
2026-05-31 22:28:03 +02:00
parent 26f1e62924
commit 0cae89cba2
17 changed files with 60 additions and 737 deletions
-6
View File
@@ -7,12 +7,6 @@
# baked into `docker-compose.yml`, so this file documents the knobs
# rather than driving them.
# Auto-provisioned sandbox bootstrap. Empty disables the bootstrap.
BACKEND_DEV_SANDBOX_EMAIL=dev@galaxy.lan
BACKEND_DEV_SANDBOX_ENGINE_IMAGE=galaxy-engine:dev
BACKEND_DEV_SANDBOX_ENGINE_VERSION=0.1.0
BACKEND_DEV_SANDBOX_PLAYER_COUNT=20
# `123456` short-circuits the email-code path for the dev account.
# This is also the docker-compose default — set the variable to an
# empty string here when the environment must rely on real Mailpit
+3 -159
View File
@@ -1,164 +1,8 @@
# `tools/dev-deploy/` — known issues
Issues that surface in the long-lived dev environment but are not yet
fixed. Each entry lists the observed symptom, the diagnostic evidence,
the working hypothesis, and the open questions that have to be
answered before a fix lands.
## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
### Symptom
A previously `running` "Dev Sandbox" game (created by
`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
browser session survives (the same `device_session_id` keeps working),
but the lobby shows no game because the only game it had is now
terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
boot and creates a fresh sandbox — but the first redispatch leaves
the user with an empty lobby until backend restarts again.
### Diagnostic evidence
Backend logs from the broken cycle (timestamps abbreviated):
```text
20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
...
20:25:09 user mail sent failed (diplomail tables missing — unrelated)
...
20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
op=reconcile status=removed message="container disappeared"
```
Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
the backend logs are silent on the runtime / engine paths — no
`engine spawned`, no `engine container started`, no `runtime
transition` lines. The reconciler then fires and reports the engine
container as missing.
`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
returns no rows during this window — the engine container is neither
running nor stopped on the host, so it either was never spawned or
was removed before the host snapshot.
### What has been ruled out
A live `docker inspect` on a healthy engine container shows:
```text
Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
galaxy.game_id=<uuid>,
org.opencontainers.image.title=galaxy-game-engine,
com.galaxy.{cpu_quota,memory,pids_limit}
AutoRemove: false
RestartPolicy: on-failure
NetworkMode: galaxy-dev-internal
```
There are no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap the engine and a `--rm`-style
self-destruct is not in play. Two redispatches captured under
`docker events --filter event=create,start,die,destroy,kill,stop`
also confirmed it: across both runs the only `die` / `destroy`
events were for `galaxy-dev-{backend,api,caddy}`. The live engine
container survived both redispatches, and the reconciler that
fires 60 seconds after the new backend boots correctly matched
it through `byGameID` / `byContainerID`.
`backend/internal/runtime/service.go` only removes engine
containers from the explicit `runStop` / `runRestart` / `runPatch`
paths. There is no `runtime.Service.Shutdown` that proactively
kills containers on backend exit, so a graceful SIGTERM to
`galaxy-dev-backend` will not touch its child engine containers.
### Host-side hypotheses considered and rejected by the owner
The natural follow-up suspects after compose was cleared — host-side
`docker prune` cron jobs, a manual `docker rm`, an out-of-band
`dockerd` restart, and an idle-state engine crash — were all
rejected by the project owner: the dev host runs none of those
periodic cleanups, no one manually removed the container, dockerd
was not restarted in the window, and the engine binary does not
crash while idling on API calls.
### Best remaining suspicion
Something the `dev-deploy.yaml` CI run does between successful
image builds and the final `docker compose up -d --wait
--remove-orphans` clobbers the previously-spawned engine container.
The chain at runtime contains:
1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
2. `docker compose build galaxy-backend galaxy-api`
3. `docker run --rm` alpine for the UI volume seed
4. `docker compose up -d --wait --remove-orphans`
None of these *should* touch an unmanaged engine container, but
the reproduction window points squarely inside this sequence. A
deliberate next reproduction with `docker events --since 0` armed
*before* the deploy starts and live for the entire job — captured
end-to-end on the dev host, not just the chunk after backend
recreate — would pin which step emits the `destroy` on the engine.
### Update 2026-05-19: integration preclean identified as one cause
A live reproduction during the post-merge auto-deploy cycle (Gitea
run #188 dev-deploy plus parallel run #190 integration) pinned one
clobbering source: `integration/scripts/preclean.sh` was unscoped
and removed *every* container labelled `galaxy.backend=1`, including
the dev-deploy engine. Timeline from the dev host:
```text
23:10:40 backend pre-bootstrap reconciler tick: engine alive
23:10:40 dev_sandbox bootstrap: status=running
23:10:56 preclean: removing 1 backend-managed engine containers ← integration run #190
23:11:40 reconciler: container disappeared → game cancelled
```
Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
every integration backend (see
`integration/testenv/backend.go`) and `preclean.sh` AND-combines
`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
local-dev engines stamped with different stack values are no longer
collateral.
This covers **push**-triggered cycles where `dev-deploy.yaml` and
`integration.yaml` run on the same Gitea host. The original
hypothesis (a `workflow_dispatch dev-deploy` solo run also losing
the engine) is *not* explained by the integration fix — manual
dispatches do not trigger `integration.yaml`. Keep this entry open
until a solo-dispatch reproduction confirms whether the symptom
still occurs.
### Status
Partially fixed (push-triggered cycles). Solo `workflow_dispatch`
reproductions still open. If the symptom recurs after the
integration fix lands, capture `docker events --since 0` for the
full dispatch window and attach here.
### Workaround in use today
When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
```sh
curl -X POST -n -H 'Content-Type: application/json' \
-d '{"ref":"<branch>"}' \
https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
```
The next boot's `purgeTerminalSandboxGames` removes the cancelled
row, `findOrCreateSandboxGame` creates a fresh one, and
`ensureMembershipsAndDrive` puts the new game back to `running`.
### Owner
Unassigned. File an issue once we have the runtime / reconciler
analysis above; reference this section in the issue body so future
redeploys can short-circuit the diagnostic loop.
Issues that surfaced in the long-lived dev environment. Each entry lists
the observed symptom, the diagnostic evidence, and the fix or the open
questions that have to be answered before a fix lands.
## `docker restart galaxy-dev-backend` fails after the CI runner cleans up
+9 -13
View File
@@ -114,8 +114,7 @@ calls `make clean-data`.
The same dev-mode email-code override as `tools/local-dev/` applies,
and the dev-deploy compose ships with it enabled by default:
1. Enter `dev@galaxy.lan` (or whatever `BACKEND_DEV_SANDBOX_EMAIL`
resolves to) in the login form.
1. Enter your email address in the login form.
2. Submit `123456` as the code — the docker-compose default for
`BACKEND_AUTH_DEV_FIXED_CODE` is `123456`, so the bcrypt-hashed
email code stays a fallback. To force real Mailpit codes (e.g. for
@@ -212,8 +211,7 @@ make clean-data Stop everything and wipe volumes + game-state dir
## Known issues
See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to `cancelled` after a redispatch).
in the long-lived dev environment but are not yet fixed.
## Deployment cadence
@@ -237,12 +235,12 @@ behind. There is no separate state to clean up between the two paths.
### Engine image drift recycle
`backend` spawns one engine container per game (the long-lived "Dev
Sandbox" plus any user-created games) and the reconciler reattaches
to whatever it finds with the `galaxy.stack=dev-deploy` label. That
reattach does not check the running container's image SHA against the
freshly-built `galaxy-engine:dev` tag, so an unchanged container would
otherwise keep serving the previous engine code after a redeploy.
`backend` spawns one engine container per running game and the
reconciler reattaches to whatever it finds with the
`galaxy.stack=dev-deploy` label. That reattach does not check the
running container's image SHA against the freshly-built
`galaxy-engine:dev` tag, so an unchanged container would otherwise
keep serving the previous engine code after a redeploy.
The `dev-deploy.yaml` workflow handles this in the
`Recycle engine containers on image drift` step. When `docker build`
@@ -250,9 +248,7 @@ produces a new `galaxy-engine:dev` SHA, the step compares it against
every running `galaxy-game-*` container and, for each drifted one,
stops the backend, removes the container, wipes its bind-mounted
state directory (Engine.Init() writes turn-0 over any pre-existing
`turn-N` files), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.
`turn-N` files), and cascade-deletes the lobby `games` row.
When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
-9
View File
@@ -127,15 +127,6 @@ services:
# bcrypt-hashed code is single-use). Set the var to an empty
# string in `.env` to disable.
BACKEND_AUTH_DEV_FIXED_CODE: ${BACKEND_AUTH_DEV_FIXED_CODE:-123456}
# Long-lived dev environment always bootstraps the "Dev Sandbox"
# game owned by this email so a freshly redeployed stack already
# has one ready-to-play game in the lobby. Set the variable to an
# empty string in `.env` to disable the bootstrap (e.g. for a
# cold-start QA pass).
BACKEND_DEV_SANDBOX_EMAIL: ${BACKEND_DEV_SANDBOX_EMAIL:-dev@galaxy.lan}
BACKEND_DEV_SANDBOX_ENGINE_IMAGE: ${BACKEND_DEV_SANDBOX_ENGINE_IMAGE:-galaxy-engine:dev}
BACKEND_DEV_SANDBOX_ENGINE_VERSION: ${BACKEND_DEV_SANDBOX_ENGINE_VERSION:-0.1.0}
BACKEND_DEV_SANDBOX_PLAYER_COUNT: ${BACKEND_DEV_SANDBOX_PLAYER_COUNT:-20}
volumes:
- /var/run/docker.sock:/var/run/docker.sock
# Per-game state directories live under the same absolute path
+7 -9
View File
@@ -22,7 +22,7 @@ help:
@echo " make up Build (if needed) and bring up the stack, wait until healthy"
@echo " make down Stop compose containers, leave engines + volumes intact"
@echo " make rebuild Force rebuild of backend / gateway images and bring up"
@echo " make build-engine Build the engine image $(ENGINE_IMAGE) used by the dev sandbox"
@echo " make build-engine Build the engine image $(ENGINE_IMAGE) used by running games"
@echo " make stop-engines Stop and remove only the per-game engine containers"
@echo " make prune-broken-engines Remove non-running engine containers Docker can't heal (run inside 'up')"
@echo " make clean Stop everything (incl. engines) and wipe volumes + game state"
@@ -37,8 +37,9 @@ help:
@echo " pnpm -C ui/frontend dev"
@echo "and open http://localhost:5173 (UI) plus http://localhost:8025 (Mailpit)."
@echo ""
@echo "Default login for the auto-provisioned dev sandbox: dev@local.test"
@echo "(see BACKEND_DEV_SANDBOX_EMAIL in .env). Login code: 123456."
@echo "Sign in with email-OTP; the fixed login code 123456 works when"
@echo "BACKEND_AUTH_DEV_FIXED_CODE is set in .env. No game is auto-provisioned —"
@echo "load a legacy report via the UI's DEV report loader to exercise the map."
up: build-engine prune-broken-engines
$(COMPOSE) up -d --wait
@@ -88,12 +89,9 @@ stop-engines:
# bind-mount source and leaves it stuck in `exited` / `created`
# state. This target prunes the husks before `compose up`; the
# backend's pre-bootstrap reconciler tick (`backend/cmd/backend/main.go`)
# then cascades the orphan runtime row to `removed`, the lobby
# cancels the game, and the dev-sandbox bootstrap purges the
# cancelled tile and provisions a fresh sandbox in the same
# `make up` cycle. Healthy `running` / `restarting` containers are
# left intact so a long-lived sandbox survives normal up/down
# cycles.
# then cascades the orphan runtime row to `removed` and the lobby
# cancels the game. Healthy `running` / `restarting` containers are
# left intact so a long-lived game survives normal up/down cycles.
prune-broken-engines:
@ids=""; \
for cid in $$(docker ps -aq \
+16 -50
View File
@@ -78,49 +78,24 @@ To force the second path (no fast-bypass), edit
`make rebuild` (or simply `docker compose up -d backend` to recreate
the backend with the new env).
## Auto-provisioned dev sandbox
## No auto-provisioned game
`make up` provisions a private game called **Dev Sandbox** owned by
the dev user (default `dev@local.test`). The flow is implemented in
`backend/internal/devsandbox` and runs on every backend boot when
`BACKEND_DEV_SANDBOX_EMAIL` is non-empty in `tools/local-dev/.env`.
Bootstrap is idempotent — re-running `make up` after a `make down`
finds the existing user, dummy participants, game, and memberships
without creating duplicates. If a previous boot crashed mid-way
(game stuck in `enrollment_open` or `ready_to_start`), the next boot
resumes the lifecycle.
To log in straight into the sandbox:
`make up` brings up the stack with an empty lobby — there is no
auto-provisioned game. Sign in with email-OTP (the fixed dev code
`123456` works when `BACKEND_AUTH_DEV_FIXED_CODE` is set in
`tools/local-dev/.env`):
1. `make -C tools/local-dev up`
2. `pnpm -C ui/frontend dev` (in another terminal)
3. Open <http://localhost:5173/login>, enter `dev@local.test`, then
the dev code `123456`.
4. The lobby shows **Dev Sandbox** in *My Games*; click in.
3. Open <http://localhost:5173/login>, enter your email, then the dev
code `123456`.
To disable the bootstrap, clear `BACKEND_DEV_SANDBOX_EMAIL` in
`tools/local-dev/.env` and `docker compose up -d backend` (or
`make rebuild`). Existing users / games are not removed.
Terminal sandbox games — anything in `cancelled`, `finished`, or
`start_failed` — are deleted on every boot before find-or-create
runs. The cascade declared in `00001_init.sql` removes the
matching memberships, applications, invites, runtime records,
and player mappings in the same write, so the dev user's lobby
shows exactly one running tile at all times. Cancelling the
sandbox manually and running `docker compose restart backend`
(or `make rebuild`) yields a fresh game without leaving dead
tiles behind.
The bootstrap requires:
- `galaxy-engine:local-dev` Docker image (`make build-engine`).
- `BACKEND_DEV_SANDBOX_ENGINE_VERSION` parses as plain semver
(`MAJOR.MINOR.PATCH`); the default `0.1.0` is what the bootstrap
registers in the `engine_versions` row that points at the image.
- `BACKEND_DEV_SANDBOX_PLAYER_COUNT` ≥ 20 (the engine's minimum;
19 deterministic dummies fill the slots so the single real user
can start the game).
To exercise the map and report views without running a full game, use
the UI's DEV **synthetic report loader**: convert a legacy `.REP` with
`tools/local-dev/legacy-report/` and load the resulting JSON through the
loader (see that tool's README). To play a real game, create one in the
lobby and let the engine (`galaxy-engine:local-dev`, built by
`make build-engine`) run it.
- A frozen turn schedule (`0 0 1 1 *` — once a year) so the visible
game state stays at turn 1 until you explicitly progress it.
@@ -239,24 +214,15 @@ make status docker compose ps
this in one cycle: `prune-broken-engines` (runs as part of `up`)
removes every engine container that is not in `running` /
`restarting` state, the backend's pre-bootstrap reconciler tick
cascades the orphan runtime row to `removed`, the lobby cancels
the matching sandbox game, and the dev-sandbox bootstrap purges
the cancelled tile and provisions a fresh sandbox with a brand
new state directory. To run the cleanup by hand without restarting
the rest of the stack, `make prune-broken-engines`.
cascades the orphan runtime row to `removed`, and the lobby cancels
the matching game. To run the cleanup by hand without restarting the
rest of the stack, `make prune-broken-engines`.
The cycle relies on the backend image carrying the pre-bootstrap
reconciler tick (`backend/cmd/backend/main.go`). `make up` reuses
the cached image, so after pulling this commit the first time you
must `make rebuild` once to bake the fix in. Future `make up`
cycles will heal in one shot.
If after the heal cycle the lobby still shows only a `cancelled`
sandbox tile and no running game, the running backend image
predates the pre-bootstrap reconciler tick — the periodic ticker
cancels the orphan after bootstrap has already returned, leaving
the lobby in the half-baked state. `make rebuild` recreates the
image and then `make up` lands a fresh sandbox.
- **`make up` reports a build error mentioning `pkg/cronutil`** —
upstream module list drifted; copy any new `pkg/<name>/` line into
the local-dev `backend.Dockerfile` / `gateway.Dockerfile` to match
-4
View File
@@ -122,10 +122,6 @@ services:
BACKEND_OTEL_TRACES_EXPORTER: none
BACKEND_OTEL_METRICS_EXPORTER: none
BACKEND_AUTH_DEV_FIXED_CODE: ${BACKEND_AUTH_DEV_FIXED_CODE:-}
BACKEND_DEV_SANDBOX_EMAIL: ${BACKEND_DEV_SANDBOX_EMAIL:-}
BACKEND_DEV_SANDBOX_ENGINE_IMAGE: ${BACKEND_DEV_SANDBOX_ENGINE_IMAGE:-}
BACKEND_DEV_SANDBOX_ENGINE_VERSION: ${BACKEND_DEV_SANDBOX_ENGINE_VERSION:-}
BACKEND_DEV_SANDBOX_PLAYER_COUNT: ${BACKEND_DEV_SANDBOX_PLAYER_COUNT:-}
volumes:
- /var/run/docker.sock:/var/run/docker.sock
# Per-game state directories live under the same absolute path