refactor(dev): remove the dev-sandbox bootstrap everywhere
Tests · Go / test (push) Successful in 1m59s
Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev Sandbox" game and dummy users are removed so the dev contour starts empty like prod; the separate legacy-report loader stays as the test-data path.

- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field, BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and .env.example; the generic engine-recycle / prune-broken-engines logic stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev README + Makefile) and stale comments; DeleteGame and InsertMembershipDirect remain (exercised by lobby integration tests)

No app behaviour change beyond not auto-creating the sandbox game.
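
Review aid (not part of the commit): a minimal Go sketch of how one might check that no `BACKEND_DEV_SANDBOX_*` variable survives in an env file or compose `environment:` block. The helper name `staleSandboxKeys` and the sample input are hypothetical; only the variable prefix comes from the commit itself.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sandboxPrefix is the env-variable prefix this commit strips.
const sandboxPrefix = "BACKEND_DEV_SANDBOX_"

// staleSandboxKeys is a hypothetical helper: it returns every variable
// name in envText that still starts with sandboxPrefix. Comment lines
// (starting with '#') and blank lines are skipped.
func staleSandboxKeys(envText string) []string {
	var keys []string
	sc := bufio.NewScanner(strings.NewReader(envText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Accept both `KEY=value` (.env) and `KEY: value` (compose map).
		name := line
		if i := strings.IndexAny(line, "=:"); i >= 0 {
			name = strings.TrimSpace(line[:i])
		}
		if strings.HasPrefix(name, sandboxPrefix) {
			keys = append(keys, name)
		}
	}
	return keys
}

func main() {
	// Assumed sample input, not taken from the repository.
	sample := "BACKEND_AUTH_DEV_FIXED_CODE=123456\n" +
		"# BACKEND_DEV_SANDBOX_EMAIL is gone\n" +
		"BACKEND_DEV_SANDBOX_PLAYER_COUNT=20\n"
	fmt.Println(staleSandboxKeys(sample)) // prints [BACKEND_DEV_SANDBOX_PLAYER_COUNT]
}
```

An empty result for `.env.example` and both compose files would be consistent with the strip described above; mentions inside comments are deliberately ignored.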
@@ -7,12 +7,6 @@
 # baked into `docker-compose.yml`, so this file documents the knobs
 # rather than driving them.

-# Auto-provisioned sandbox bootstrap. Empty disables the bootstrap.
-BACKEND_DEV_SANDBOX_EMAIL=dev@galaxy.lan
-BACKEND_DEV_SANDBOX_ENGINE_IMAGE=galaxy-engine:dev
-BACKEND_DEV_SANDBOX_ENGINE_VERSION=0.1.0
-BACKEND_DEV_SANDBOX_PLAYER_COUNT=20
-
 # `123456` short-circuits the email-code path for the dev account.
 # This is also the docker-compose default — set the variable to an
 # empty string here when the environment must rely on real Mailpit
@@ -1,164 +1,8 @@
 # `tools/dev-deploy/` — known issues

-Issues that surface in the long-lived dev environment but are not yet
-fixed. Each entry lists the observed symptom, the diagnostic evidence,
-the working hypothesis, and the open questions that have to be
-answered before a fix lands.
-
-## Dev Sandbox game flips to `cancelled` after a `dev-deploy` redispatch
-
-### Symptom
-
-A previously `running` "Dev Sandbox" game (created by
-`backend/internal/devsandbox`) transitions to `cancelled` ~15 minutes
-after a `dev-deploy.yaml` workflow_dispatch run finishes. The user's
-browser session survives (the same `device_session_id` keeps working),
-but the lobby shows no game because the only game it had is now
-terminal. `purgeTerminalSandboxGames` does pick it up on the **next**
-boot and creates a fresh sandbox — but the first redispatch leaves
-the user with an empty lobby until backend restarts again.
-
-### Diagnostic evidence
-
-Backend logs from the broken cycle (timestamps abbreviated):
-
-```text
-20:24:40 dev_sandbox: purged terminal sandbox game game_id=<prev> status=cancelled
-20:24:40 dev_sandbox: memberships ensured count=20 game_id=<new>
-20:24:40 dev_sandbox: bootstrap complete user_id=<owner> game_id=<new> status=starting
-...
-20:25:09 user mail sent failed (diplomail tables missing — unrelated)
-...
-20:39:40 lobby: game cancelled by runtime reconciler game_id=<new>
-op=reconcile status=removed message="container disappeared"
-```
-
-Between 20:24:40 (`status=starting`) and 20:39:40 (reconciler cancel)
-the backend logs are silent on the runtime / engine paths — no
-`engine spawned`, no `engine container started`, no `runtime
-transition` lines. The reconciler then fires and reports the engine
-container as missing.
-
-`docker ps -a --filter 'label=org.opencontainers.image.title=galaxy-game-engine'`
-returns no rows during this window — the engine container is neither
-running nor stopped on the host, so it either was never spawned or
-was removed before the host snapshot.
-
-### What has been ruled out
-
-A live `docker inspect` on a healthy engine container shows:
-
-```text
-Labels: galaxy.backend=1, galaxy.engine_version=0.1.0,
-galaxy.game_id=<uuid>,
-org.opencontainers.image.title=galaxy-game-engine,
-com.galaxy.{cpu_quota,memory,pids_limit}
-AutoRemove: false
-RestartPolicy: on-failure
-NetworkMode: galaxy-dev-internal
-```
-
-There are no `com.docker.compose.*` labels and `AutoRemove=false`,
-so `--remove-orphans` cannot reap the engine and a `--rm`-style
-self-destruct is not in play. Two redispatches captured under
-`docker events --filter event=create,start,die,destroy,kill,stop`
-also confirmed it: across both runs the only `die` / `destroy`
-events were for `galaxy-dev-{backend,api,caddy}`. The live engine
-container survived both redispatches, and the reconciler that
-fires 60 seconds after the new backend boots correctly matched
-it through `byGameID` / `byContainerID`.
-
-`backend/internal/runtime/service.go` only removes engine
-containers from the explicit `runStop` / `runRestart` / `runPatch`
-paths. There is no `runtime.Service.Shutdown` that proactively
-kills containers on backend exit, so a graceful SIGTERM to
-`galaxy-dev-backend` will not touch its child engine containers.
-
-### Host-side hypotheses considered and rejected by the owner
-
-The natural follow-up suspects after compose was cleared — host-side
-`docker prune` cron jobs, a manual `docker rm`, an out-of-band
-`dockerd` restart, and an idle-state engine crash — were all
-rejected by the project owner: the dev host runs none of those
-periodic cleanups, no one manually removed the container, dockerd
-was not restarted in the window, and the engine binary does not
-crash while idling on API calls.
-
-### Best remaining suspicion
-
-Something the `dev-deploy.yaml` CI run does between successful
-image builds and the final `docker compose up -d --wait
---remove-orphans` clobbers the previously-spawned engine container.
-The chain at runtime contains:
-
-1. `docker build -t galaxy-engine:dev -f game/Dockerfile .`
-2. `docker compose build galaxy-backend galaxy-api`
-3. `docker run --rm` alpine for the UI volume seed
-4. `docker compose up -d --wait --remove-orphans`
-
-None of these *should* touch an unmanaged engine container, but
-the reproduction window points squarely inside this sequence. A
-deliberate next reproduction with `docker events --since 0` armed
-*before* the deploy starts and live for the entire job — captured
-end-to-end on the dev host, not just the chunk after backend
-recreate — would pin which step emits the `destroy` on the engine.
-
-### Update 2026-05-19: integration preclean identified as one cause
-
-A live reproduction during the post-merge auto-deploy cycle (Gitea
-run #188 dev-deploy plus parallel run #190 integration) pinned one
-clobbering source: `integration/scripts/preclean.sh` was unscoped
-and removed *every* container labelled `galaxy.backend=1`, including
-the dev-deploy engine. Timeline from the dev host:
-
-```text
-23:10:40 backend pre-bootstrap reconciler tick: engine alive
-23:10:40 dev_sandbox bootstrap: status=running
-23:10:56 preclean: removing 1 backend-managed engine containers ← integration run #190
-23:11:40 reconciler: container disappeared → game cancelled
-```
-
-Fix landed: `BACKEND_STACK_LABEL=integration` is now passed to
-every integration backend (see
-`integration/testenv/backend.go`) and `preclean.sh` AND-combines
-`galaxy.backend=1` with `galaxy.stack=integration`, so dev-deploy /
-local-dev engines stamped with different stack values are no longer
-collateral.
-
-This covers **push**-triggered cycles where `dev-deploy.yaml` and
-`integration.yaml` run on the same Gitea host. The original
-hypothesis (a `workflow_dispatch dev-deploy` solo run also losing
-the engine) is *not* explained by the integration fix — manual
-dispatches do not trigger `integration.yaml`. Keep this entry open
-until a solo-dispatch reproduction confirms whether the symptom
-still occurs.
-
-### Status
-
-Partially fixed (push-triggered cycles). Solo `workflow_dispatch`
-reproductions still open. If the symptom recurs after the
-integration fix lands, capture `docker events --since 0` for the
-full dispatch window and attach here.
-
-### Workaround in use today
-
-When the sandbox game flips to `cancelled`, redispatch `dev-deploy`:
-
-```sh
-curl -X POST -n -H 'Content-Type: application/json' \
--d '{"ref":"<branch>"}' \
-https://gitea.iliadenisov.ru/api/v1/repos/developer/galaxy-game/actions/workflows/dev-deploy.yaml/dispatches
-```
-
-The next boot's `purgeTerminalSandboxGames` removes the cancelled
-row, `findOrCreateSandboxGame` creates a fresh one, and
-`ensureMembershipsAndDrive` puts the new game back to `running`.
-
-### Owner
-
-Unassigned. File an issue once we have the runtime / reconciler
-analysis above; reference this section in the issue body so future
-redeploys can short-circuit the diagnostic loop.
+Issues that surfaced in the long-lived dev environment. Each entry lists
+the observed symptom, the diagnostic evidence, and the fix or the open
+questions that have to be answered before a fix lands.

 ## `docker restart galaxy-dev-backend` fails after the CI runner cleans up

@@ -114,8 +114,7 @@ calls `make clean-data`.
 The same dev-mode email-code override as `tools/local-dev/` applies,
 and the dev-deploy compose ships with it enabled by default:

-1. Enter `dev@galaxy.lan` (or whatever `BACKEND_DEV_SANDBOX_EMAIL`
-resolves to) in the login form.
+1. Enter your email address in the login form.
 2. Submit `123456` as the code — the docker-compose default for
 `BACKEND_AUTH_DEV_FIXED_CODE` is `123456`, so the bcrypt-hashed
 email code stays a fallback. To force real Mailpit codes (e.g. for
@@ -212,8 +211,7 @@ make clean-data Stop everything and wipe volumes + game-state dir
 ## Known issues

 See [`KNOWN-ISSUES.md`](KNOWN-ISSUES.md) for symptoms that surface
-in the long-lived dev environment but are not yet fixed (currently:
-the sandbox game flipping to `cancelled` after a redispatch).
+in the long-lived dev environment but are not yet fixed.

 ## Deployment cadence

@@ -237,12 +235,12 @@ behind. There is no separate state to clean up between the two paths.

 ### Engine image drift recycle

-`backend` spawns one engine container per game (the long-lived "Dev
-Sandbox" plus any user-created games) and the reconciler reattaches
-to whatever it finds with the `galaxy.stack=dev-deploy` label. That
-reattach does not check the running container's image SHA against the
-freshly-built `galaxy-engine:dev` tag, so an unchanged container would
-otherwise keep serving the previous engine code after a redeploy.
+`backend` spawns one engine container per running game and the
+reconciler reattaches to whatever it finds with the
+`galaxy.stack=dev-deploy` label. That reattach does not check the
+running container's image SHA against the freshly-built
+`galaxy-engine:dev` tag, so an unchanged container would otherwise
+keep serving the previous engine code after a redeploy.

 The `dev-deploy.yaml` workflow handles this in the
 `Recycle engine containers on image drift` step. When `docker build`
@@ -250,9 +248,7 @@ produces a new `galaxy-engine:dev` SHA, the step compares it against
 every running `galaxy-game-*` container and, for each drifted one,
 stops the backend, removes the container, wipes its bind-mounted
 state directory (Engine.Init() writes turn-0 over any pre-existing
-`turn-N` files), and cascade-deletes the lobby `games` row. The
-`dev-sandbox` bootstrap on the next backend boot finds no live
-sandbox and provisions a fresh one on the new engine image.
+`turn-N` files), and cascade-deletes the lobby `games` row.

 When the engine sources are unchanged, the BuildKit cache hits and
 the SHA stays the same — the recycle step is a no-op and the running
@@ -127,15 +127,6 @@ services:
 # bcrypt-hashed code is single-use). Set the var to an empty
 # string in `.env` to disable.
 BACKEND_AUTH_DEV_FIXED_CODE: ${BACKEND_AUTH_DEV_FIXED_CODE:-123456}
-# Long-lived dev environment always bootstraps the "Dev Sandbox"
-# game owned by this email so a freshly redeployed stack already
-# has one ready-to-play game in the lobby. Set the variable to an
-# empty string in `.env` to disable the bootstrap (e.g. for a
-# cold-start QA pass).
-BACKEND_DEV_SANDBOX_EMAIL: ${BACKEND_DEV_SANDBOX_EMAIL:-dev@galaxy.lan}
-BACKEND_DEV_SANDBOX_ENGINE_IMAGE: ${BACKEND_DEV_SANDBOX_ENGINE_IMAGE:-galaxy-engine:dev}
-BACKEND_DEV_SANDBOX_ENGINE_VERSION: ${BACKEND_DEV_SANDBOX_ENGINE_VERSION:-0.1.0}
-BACKEND_DEV_SANDBOX_PLAYER_COUNT: ${BACKEND_DEV_SANDBOX_PLAYER_COUNT:-20}
 volumes:
 - /var/run/docker.sock:/var/run/docker.sock
 # Per-game state directories live under the same absolute path
@@ -22,7 +22,7 @@ help:
 @echo " make up Build (if needed) and bring up the stack, wait until healthy"
 @echo " make down Stop compose containers, leave engines + volumes intact"
 @echo " make rebuild Force rebuild of backend / gateway images and bring up"
-@echo " make build-engine Build the engine image $(ENGINE_IMAGE) used by the dev sandbox"
+@echo " make build-engine Build the engine image $(ENGINE_IMAGE) used by running games"
 @echo " make stop-engines Stop and remove only the per-game engine containers"
 @echo " make prune-broken-engines Remove non-running engine containers Docker can't heal (run inside 'up')"
 @echo " make clean Stop everything (incl. engines) and wipe volumes + game state"
@@ -37,8 +37,9 @@ help:
 @echo " pnpm -C ui/frontend dev"
 @echo "and open http://localhost:5173 (UI) plus http://localhost:8025 (Mailpit)."
 @echo ""
-@echo "Default login for the auto-provisioned dev sandbox: dev@local.test"
-@echo "(see BACKEND_DEV_SANDBOX_EMAIL in .env). Login code: 123456."
+@echo "Sign in with email-OTP; the fixed login code 123456 works when"
+@echo "BACKEND_AUTH_DEV_FIXED_CODE is set in .env. No game is auto-provisioned —"
+@echo "load a legacy report via the UI's DEV report loader to exercise the map."

 up: build-engine prune-broken-engines
 $(COMPOSE) up -d --wait
@@ -88,12 +89,9 @@ stop-engines:
 # bind-mount source and leaves it stuck in `exited` / `created`
 # state. This target prunes the husks before `compose up`; the
 # backend's pre-bootstrap reconciler tick (`backend/cmd/backend/main.go`)
-# then cascades the orphan runtime row to `removed`, the lobby
-# cancels the game, and the dev-sandbox bootstrap purges the
-# cancelled tile and provisions a fresh sandbox in the same
-# `make up` cycle. Healthy `running` / `restarting` containers are
-# left intact so a long-lived sandbox survives normal up/down
-# cycles.
+# then cascades the orphan runtime row to `removed` and the lobby
+# cancels the game. Healthy `running` / `restarting` containers are
+# left intact so a long-lived game survives normal up/down cycles.
 prune-broken-engines:
 @ids=""; \
 for cid in $$(docker ps -aq \
+16
-50
@@ -78,49 +78,24 @@ To force the second path (no fast-bypass), edit
 `make rebuild` (or simply `docker compose up -d backend` to recreate
 the backend with the new env).

-## Auto-provisioned dev sandbox
+## No auto-provisioned game

-`make up` provisions a private game called **Dev Sandbox** owned by
-the dev user (default `dev@local.test`). The flow is implemented in
-`backend/internal/devsandbox` and runs on every backend boot when
-`BACKEND_DEV_SANDBOX_EMAIL` is non-empty in `tools/local-dev/.env`.
-
-Bootstrap is idempotent — re-running `make up` after a `make down`
-finds the existing user, dummy participants, game, and memberships
-without creating duplicates. If a previous boot crashed mid-way
-(game stuck in `enrollment_open` or `ready_to_start`), the next boot
-resumes the lifecycle.
-
-To log in straight into the sandbox:
+`make up` brings up the stack with an empty lobby — there is no
+auto-provisioned game. Sign in with email-OTP (the fixed dev code
+`123456` works when `BACKEND_AUTH_DEV_FIXED_CODE` is set in
+`tools/local-dev/.env`):

 1. `make -C tools/local-dev up`
 2. `pnpm -C ui/frontend dev` (in another terminal)
-3. Open <http://localhost:5173/login>, enter `dev@local.test`, then
-the dev code `123456`.
-4. The lobby shows **Dev Sandbox** in *My Games*; click in.
+3. Open <http://localhost:5173/login>, enter your email, then the dev
+code `123456`.

-To disable the bootstrap, clear `BACKEND_DEV_SANDBOX_EMAIL` in
-`tools/local-dev/.env` and `docker compose up -d backend` (or
-`make rebuild`). Existing users / games are not removed.
-
-Terminal sandbox games — anything in `cancelled`, `finished`, or
-`start_failed` — are deleted on every boot before find-or-create
-runs. The cascade declared in `00001_init.sql` removes the
-matching memberships, applications, invites, runtime records,
-and player mappings in the same write, so the dev user's lobby
-shows exactly one running tile at all times. Cancelling the
-sandbox manually and running `docker compose restart backend`
-(or `make rebuild`) yields a fresh game without leaving dead
-tiles behind.
-
-The bootstrap requires:
-- `galaxy-engine:local-dev` Docker image (`make build-engine`).
-- `BACKEND_DEV_SANDBOX_ENGINE_VERSION` parses as plain semver
-(`MAJOR.MINOR.PATCH`); the default `0.1.0` is what the bootstrap
-registers in the `engine_versions` row that points at the image.
-- `BACKEND_DEV_SANDBOX_PLAYER_COUNT` ≥ 20 (the engine's minimum;
-19 deterministic dummies fill the slots so the single real user
-can start the game).
+To exercise the map and report views without running a full game, use
+the UI's DEV **synthetic report loader**: convert a legacy `.REP` with
+`tools/local-dev/legacy-report/` and load the resulting JSON through the
+loader (see that tool's README). To play a real game, create one in the
+lobby and let the engine (`galaxy-engine:local-dev`, built by
+`make build-engine`) run it.
 - A frozen turn schedule (`0 0 1 1 *` — once a year) so the visible
 game state stays at turn 1 until you explicitly progress it.

@@ -239,24 +214,15 @@ make status docker compose ps
 this in one cycle: `prune-broken-engines` (runs as part of `up`)
 removes every engine container that is not in `running` /
 `restarting` state, the backend's pre-bootstrap reconciler tick
-cascades the orphan runtime row to `removed`, the lobby cancels
-the matching sandbox game, and the dev-sandbox bootstrap purges
-the cancelled tile and provisions a fresh sandbox with a brand
-new state directory. To run the cleanup by hand without restarting
-the rest of the stack, `make prune-broken-engines`.
+cascades the orphan runtime row to `removed`, and the lobby cancels
+the matching game. To run the cleanup by hand without restarting the
+rest of the stack, `make prune-broken-engines`.

 The cycle relies on the backend image carrying the pre-bootstrap
 reconciler tick (`backend/cmd/backend/main.go`). `make up` reuses
 the cached image, so after pulling this commit the first time you
 must `make rebuild` once to bake the fix in. Future `make up`
 cycles will heal in one shot.
-
-If after the heal cycle the lobby still shows only a `cancelled`
-sandbox tile and no running game, the running backend image
-predates the pre-bootstrap reconciler tick — the periodic ticker
-cancels the orphan after bootstrap has already returned, leaving
-the lobby in the half-baked state. `make rebuild` recreates the
-image and then `make up` lands a fresh sandbox.
 - **`make up` reports a build error mentioning `pkg/cronutil`** —
 upstream module list drifted; copy any new `pkg/<name>/` line into
 the local-dev `backend.Dockerfile` / `gateway.Dockerfile` to match
@@ -122,10 +122,6 @@ services:
 BACKEND_OTEL_TRACES_EXPORTER: none
 BACKEND_OTEL_METRICS_EXPORTER: none
 BACKEND_AUTH_DEV_FIXED_CODE: ${BACKEND_AUTH_DEV_FIXED_CODE:-}
-BACKEND_DEV_SANDBOX_EMAIL: ${BACKEND_DEV_SANDBOX_EMAIL:-}
-BACKEND_DEV_SANDBOX_ENGINE_IMAGE: ${BACKEND_DEV_SANDBOX_ENGINE_IMAGE:-}
-BACKEND_DEV_SANDBOX_ENGINE_VERSION: ${BACKEND_DEV_SANDBOX_ENGINE_VERSION:-}
-BACKEND_DEV_SANDBOX_PLAYER_COUNT: ${BACKEND_DEV_SANDBOX_PLAYER_COUNT:-}
 volumes:
 - /var/run/docker.sock:/var/run/docker.sock
 # Per-game state directories live under the same absolute path