Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev
Sandbox" game and dummy users are removed so the dev contour starts
empty like prod; the separate legacy-report loader stays as the
test-data path.
- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
.env.example; the generic engine-recycle / prune-broken-engines logic
stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
README + Makefile) and stale comments; DeleteGame and
InsertMembershipDirect remain (exercised by lobby integration tests)
No app behaviour change beyond not auto-creating the sandbox game.
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:
- `tools/dev-deploy/docker-compose.yml`: mount
`galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
(placed right after the existing UI-volume seed) copies the
fixture from `pkg/geoip/test-data/test-data/` into the named
volume via an ephemeral alpine container, the same pattern UI
seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
the same copy from the persistent checkout. `up` and `rebuild`
now depend on it, so a hand-run `make -C tools/dev-deploy up`
populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
failure is downgraded to a "fixed" postmortem; the symptom,
cause, and where the fix lives are kept for future reference.
Verification on the dev host (this branch checked out):
$ make -C tools/dev-deploy up # populates the volume, brings stack healthy
$ docker restart galaxy-dev-backend # used to error "not a directory"
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
$ echo "ok" # backend up 6s, healthy
The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.
Fix:
- `integration/testenv/backend.go` now sets
`BACKEND_STACK_LABEL=integration` on every backend-under-test, so
the engines spawned by integration carry
`galaxy.stack=integration` in addition to `galaxy.backend=1`. The
backend support for this env was added in the previous CI tidy-up
PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter
helper and uses it to scope engine cleanup to the combination
`galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
local-dev engines carry different `galaxy.stack` values, so the
AND match leaves them alone.
- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
gets an "Update" section recording the root cause and the fix; the
status is downgraded to "partially fixed" because the solo
`workflow_dispatch` reproduction (which does NOT trigger
integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
`docker restart galaxy-dev-backend` failure caused by the
runner-workspace bind-mount that surfaced while diagnosing this
issue. Workaround: `make -C tools/dev-deploy up` from the
persistent checkout. Real fix is a follow-up (bake fixture into
image or copy to named volume).
Verification:
- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
`docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
returns nothing while the dev-deploy engine
`galaxy-game-80f3ce86-...` keeps running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.
Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The repro window therefore needs a separate trigger that removed
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never
shows up in `docker ps -a --filter label=galaxy-game-engine`, so
either it never spawned or it was removed before any host-side
snapshot.
`KNOWN-ISSUES.md` records the symptom, the log excerpt, three
working hypotheses (runtime spawn race, `--remove-orphans`
interaction, engine `--rm` lifecycle), and the investigation
checklist before opening an issue. The README gets a one-line
pointer so future redeploys land on the doc immediately.
No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to
rediscover the diagnostic from scratch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>