Root cause for the long-standing "Dev Sandbox flips to cancelled after dev-deploy" symptom in push-triggered cycles: when `integration.yaml` runs in parallel with `dev-deploy.yaml`, its `integration/scripts/preclean.sh` issues a `docker rm -f` over every container labelled `galaxy.backend=1`. That label is stamped by the backend's runtime adapter on every engine it spawns, including the engines living in the long-lived dev-deploy environment on the same Docker daemon. Each post-merge auto-deploy therefore had the integration preclean wipe the dev-sandbox engine, and the new backend's reconciler tick observed `container disappeared` and cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets `BACKEND_STACK_LABEL=integration` on every backend-under-test, so the engines spawned by integration carry `galaxy.stack=integration` in addition to `galaxy.backend=1`. The backend support for this env was added in the previous CI tidy-up PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter helper and uses it to scope engine cleanup to the combination `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and local-dev engines carry different `galaxy.stack` values, so the AND match leaves them alone.
- `docs/ARCHITECTURE.md` "Container labels": refreshed to call out the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the sandbox-cancel entry gets an "Update" section recording the root cause and the fix; the status is downgraded to "partially fixed" because the solo `workflow_dispatch` reproduction (which does NOT trigger integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md`: separately, document the `docker restart galaxy-dev-backend` failure caused by the runner-workspace bind-mount that surfaced while diagnosing this issue. Workaround: `make -C tools/dev-deploy up` from the persistent checkout. Real fix is a follow-up (bake fixture into image or copy to named volume).

Verification:

- `go build ./backend/... ./integration/...`: clean.
- `bash -n integration/scripts/preclean.sh`: syntax OK.
- Live AND-filter check on the dev host: `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration` returns nothing while the dev-deploy engine `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tools/dev-deploy/ — long-lived Galaxy dev environment
A docker-compose stack that runs the Galaxy backend, gateway, supporting
services, and a small Caddy in front of them, reachable through the host
Caddy at https://www.galaxy.lan and https://api.galaxy.lan. Used by
the dev-deploy.yaml Gitea Actions workflow as the canonical dev target
on every merge into the development branch, and runnable by hand
through this Makefile for local debugging of the deploy plumbing
itself.
This stack is not the developer's primary playground for UI work —
that role still belongs to tools/local-dev/,
which is faster (Vite HMR, host-side dev server) and isolated to one
developer. The two stacks coexist on the same host because every name
is distinct:
| | tools/local-dev/ | tools/dev-deploy/ |
|---|---|---|
| Compose project | local-dev | galaxy-dev |
| Container prefix | galaxy-local-dev-* | galaxy-dev-* |
| Network | galaxy-local-dev-net | galaxy-dev-internal, edge |
| Volumes | galaxy-local-dev-* | galaxy-dev-* |
| Host ports | 5433/6380/8025/8080/9090 | none (only edge network) |
| Game state | /tmp/galaxy-game-state | /var/lib/galaxy-dev/game-state |
| Engine image | galaxy-engine:local-dev | galaxy-engine:dev |
Prerequisites
The host must already provide:
- Docker daemon reachable as the user running `make` (member of the `docker` group, no sudo).
- An external bridge network named `edge` (or whatever `GALAXY_EDGE_NETWORK` overrides to):

      docker network create edge

- A host Caddy listening on `:80`/`:443`, attached to the `edge` network, and proxying `www.galaxy.lan` and `api.galaxy.lan` to `galaxy-caddy:80`. Example fragment for the host Caddyfile:

      www.galaxy.lan, api.galaxy.lan {
          tls internal
          reverse_proxy galaxy-caddy:80
      }

- Game-state directory writable by the user running `make`. Default is `${HOME}/.galaxy-dev/game-state`; `make up` creates it on demand. Override by exporting `GALAXY_DEV_GAME_STATE_DIR` (e.g. to `/var/lib/galaxy-dev/game-state` once the host is provisioned for it).
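The prerequisites above can be checked before the first `make up`. What follows is a minimal sketch, not something the Makefile is known to contain: the function names are hypothetical, and `DOCKER` is an assumed override hook that makes the helpers easy to dry-run. Only the `GALAXY_EDGE_NETWORK` and `GALAXY_DEV_GAME_STATE_DIR` variables and their defaults are taken from this README.

```shell
# Hypothetical preflight helpers; names are illustrative, not from the Makefile.
DOCKER=${DOCKER:-docker}

# Resolve the game-state directory using the default described above.
resolve_game_state_dir() {
  printf '%s\n' "${GALAXY_DEV_GAME_STATE_DIR:-${HOME}/.galaxy-dev/game-state}"
}

# Resolve the external edge network name (default: edge).
resolve_edge_network() {
  printf '%s\n' "${GALAXY_EDGE_NETWORK:-edge}"
}

# Return 0 only if the daemon answers without sudo, the edge network exists,
# and the game-state directory (or its nearest existing parent) is writable.
preflight() {
  local dir net
  dir=$(resolve_game_state_dir)
  net=$(resolve_edge_network)
  "$DOCKER" info >/dev/null 2>&1 || { echo "docker daemon not reachable" >&2; return 1; }
  "$DOCKER" network inspect "$net" >/dev/null 2>&1 || { echo "network $net missing" >&2; return 1; }
  # Walk up to the nearest existing directory, since make up creates the rest.
  while [ ! -e "$dir" ]; do dir=$(dirname "$dir"); done
  [ -w "$dir" ] || { echo "$dir is not writable" >&2; return 1; }
}
```

Calling `preflight && make -C tools/dev-deploy up` would then stop early with a one-line reason instead of failing partway through the compose startup. The host Caddy prerequisite is not covered here, because it cannot be verified from the Docker side alone.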
Bring it up
make -C tools/dev-deploy up
up (re)builds the local-dev backend and gateway images, makes sure the
engine image galaxy-engine:dev exists, and waits for healthchecks. It
does not seed the UI volume — that is normally done by CI. The first
time you run by hand:
make -C tools/dev-deploy seed-ui
make -C tools/dev-deploy up
make -C tools/dev-deploy health
seed-ui runs pnpm build in ui/frontend/, then copies the resulting
build/ tree into the galaxy-dev-ui-dist volume. Subsequent CI deploys
overwrite this volume automatically.
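The copy step described above is commonly done through a throwaway container that mounts both the build output and the named volume. The sketch below shows that pattern under stated assumptions: the real `seed-ui` recipe is not reproduced in this README, the function name `seed_ui_volume` and the `DOCKER` override are hypothetical, and `alpine:3` is just a convenient small image.

```shell
# Sketch of a seed step; the actual Makefile recipe may differ.
DOCKER=${DOCKER:-docker}

# Copy a built UI tree into a named volume through a throwaway container.
# $1: directory holding the build output, $2: volume name (optional).
seed_ui_volume() {
  local src volume
  volume=${2:-galaxy-dev-ui-dist}
  # Bind mounts need an absolute path, so resolve it and fail if it is missing.
  src=$(cd "$1" 2>/dev/null && pwd) || {
    echo "build output $1 not found; run pnpm build first" >&2
    return 1
  }
  # Clear the previous (non-hidden) files, then copy the new tree in.
  "$DOCKER" run --rm \
    -v "$src":/src:ro \
    -v "$volume":/dst \
    alpine:3 sh -c 'rm -rf /dst/* && cp -a /src/. /dst/'
}
```

A manual run would look like `(cd ui/frontend && pnpm build) && seed_ui_volume ui/frontend/build`. Mounting the source read-only keeps the container from touching the working tree.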
Daily flow
make -C tools/dev-deploy rebuild # rebuild backend/gateway images + up
make -C tools/dev-deploy logs # tail compose logs
make -C tools/dev-deploy health # probe https://*.galaxy.lan
make -C tools/dev-deploy down # stop, keep state
State persists in named volumes between up/down cycles. The
development branch keeps the dev environment continuously usable —
games created last week survive into this week unless somebody
calls make clean-data.
Logging in
The same dev-mode email-code override as tools/local-dev/ applies,
and the dev-deploy compose ships with it enabled by default:
- Enter `dev@galaxy.lan` (or whatever `BACKEND_DEV_SANDBOX_EMAIL` resolves to) in the login form.
- Submit `123456` as the code. The docker-compose default for `BACKEND_AUTH_DEV_FIXED_CODE` is `123456`, so the bcrypt-hashed email code stays a fallback. To force real Mailpit codes (e.g. for mail-flow QA), set `BACKEND_AUTH_DEV_FIXED_CODE=` (empty) in a local `.env` and `make rebuild`.

The fixed-code override is rejected by production env loaders, so it cannot leak into the prod environment.
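Whether an empty `BACKEND_AUTH_DEV_FIXED_CODE=` actually disables the fixed code depends on how the compose file expands the variable. Compose interpolation follows the same unset-versus-empty rule as POSIX shell expansion, so the difference can be illustrated in plain shell. Which form `docker-compose.yml` uses is not shown in this README and is an assumption to verify, for example with `docker compose config`.

```shell
# Illustration only: shell parameter expansion follows the same unset-vs-empty
# rule as Compose interpolation. The variable name comes from this README;
# which expansion form docker-compose.yml uses is an assumption to verify.
BACKEND_AUTH_DEV_FIXED_CODE=""          # what an empty line in .env amounts to

colon_form="${BACKEND_AUTH_DEV_FIXED_CODE:-123456}"   # empty counts as missing
plain_form="${BACKEND_AUTH_DEV_FIXED_CODE-123456}"    # empty is kept as-is

printf 'with :- the backend would see "%s"\n' "$colon_form"
printf 'with -  the backend would see "%s"\n' "$plain_form"
```

Run as written, the first line reports `123456` and the second an empty string. If the compose file relies on the `:-` form, an empty override would silently fall back to the default, so the rendered configuration is worth a look after editing `.env`.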
Networking
Browser
│ https://www.galaxy.lan, https://api.galaxy.lan
▼
host-Caddy (:80, :443, TLS, attached to `edge` network)
│ reverse_proxy *.galaxy.lan → galaxy-caddy:80
▼
galaxy-caddy (networks: edge + galaxy-dev-internal)
│ www.galaxy.lan → file_server /srv/galaxy-ui (volume galaxy-dev-ui-dist)
│ api.galaxy.lan → reverse_proxy galaxy-api:8080
▼
galaxy-dev-internal
├─ galaxy-api (gateway: :8080 REST, :9090 gRPC)
├─ galaxy-backend (backend: :8080 HTTP, :8081 gRPC push)
├─ galaxy-postgres (postgres: :5432)
├─ galaxy-redis (redis: :6379)
├─ galaxy-mailpit (mailpit: :8025 UI, :1025 SMTP)
└─ engine containers (spawned by backend on demand)
The compose project deliberately exposes no host ports. Diagnostics
that used to go through `localhost:8025` etc. now go through the
container network, for example `docker compose -f tools/dev-deploy/docker-compose.yml exec galaxy-mailpit wget -qO- localhost:8025/messages`.
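Because every diagnostic now starts with the same `docker compose -f ... exec` prefix, a small wrapper can cut down on typing. This is a hypothetical convenience, not something the Makefile is known to provide: `dev_exec`, `DOCKER`, and `COMPOSE_FILE_PATH` are illustrative names, and the path assumes the command is run from the repository root.

```shell
# Hypothetical wrapper around the compose exec command quoted in this section.
DOCKER=${DOCKER:-docker}
COMPOSE_FILE_PATH=${COMPOSE_FILE_PATH:-tools/dev-deploy/docker-compose.yml}

# Run a command inside a service container: dev_exec <service> <command...>
dev_exec() {
  local service=$1
  shift
  # -T skips pseudo-TTY allocation so the output can be piped or captured.
  "$DOCKER" compose -f "$COMPOSE_FILE_PATH" exec -T "$service" "$@"
}
```

With it, the Mailpit check from this section becomes `dev_exec galaxy-mailpit wget -qO- localhost:8025/messages`.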
Persistent state and schema changes
The dev Postgres volume galaxy-dev-postgres-data survives redeploys.
Schema deltas land as additive, sequence-numbered migration files
(backend/internal/postgres/migrations/0000N_*.sql) and pressly/goose
applies them on backend startup without operator action.
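Since the migration files are sequence-numbered, adding one means picking the next free prefix. The helper below is a hypothetical sketch of that bookkeeping: it assumes the five-digit `0000N_` prefix shown above and is not part of the repository tooling as far as this README says.

```shell
# Hypothetical helper: print the next free 5-digit migration prefix in a directory.
next_migration_number() {
  local dir=$1 max=0 f base num
  for f in "$dir"/[0-9][0-9][0-9][0-9][0-9]_*.sql; do
    [ -e "$f" ] || continue            # the glob matched nothing
    base=${f##*/}                      # strip the directory part
    num=${base%%_*}                    # keep the numeric prefix, e.g. 00007
    # Drop leading zeros so the value is never parsed as octal.
    num=$(printf '%s' "$num" | sed 's/^0*//')
    num=${num:-0}
    if [ "$num" -gt "$max" ]; then
      max=$num
    fi
  done
  printf '%05d\n' $((max + 1))
}
```

For example, `next_migration_number backend/internal/postgres/migrations` prints the prefix to use for the next file. Taking the maximum rather than counting files keeps the result correct even if an earlier number was skipped.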
Use make -C tools/dev-deploy clean-data only when you deliberately
want a fresh database (debugging schema drift, exercising the
bootstrap path from scratch, etc.):
make -C tools/dev-deploy clean-data
make -C tools/dev-deploy up
The same volume-persistence model applies to tools/local-dev/.
Make targets
make up            Build images, ensure engine image, bring stack up (waits for health)
make rebuild       Rebuild backend / gateway images (ignores cache), then up
make seed-ui       pnpm build + load build/ into galaxy-dev-ui-dist volume
make build-engine  Build galaxy-engine:dev (no-op if image already present)
make down          Stop containers, keep named volumes
make logs          Tail compose logs
make status        docker compose ps
make health        curl https://www.galaxy.lan + https://api.galaxy.lan/healthz
make psql          psql as galaxy@galaxy_backend
make clean-data    Stop everything and wipe volumes + game-state dir
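The `make health` target listed above is described as a curl against both hostnames. A retrying version of that probe might look like the sketch below. The function name, the `CURL` override, and the retry defaults are assumptions, and the real target may use different flags. It also assumes the host trusts the certificate that the host Caddy issues via `tls internal`.

```shell
# Sketch of a health probe; `make health` may differ in flags and retry policy.
CURL=${CURL:-curl}

# Poll a URL until it answers without an HTTP error or attempts run out.
# $1: URL, $2: attempts (default 10), $3: seconds between attempts (default 3).
wait_healthy() {
  local url=$1 attempts=${2:-10} pause=${3:-3} i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on 4xx/5xx responses.
    if "$CURL" --fail --silent --show-error --output /dev/null "$url"; then
      return 0
    fi
    # Pause only when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
  echo "$url did not become healthy after $attempts attempts" >&2
  return 1
}
```

A manual check after a deploy could then be `wait_healthy https://www.galaxy.lan && wait_healthy https://api.galaxy.lan/healthz`.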
Files
- `docker-compose.yml`: six services (postgres, redis, mailpit, galaxy-backend, galaxy-api, galaxy-caddy). Reuses the alpine-runtime Dockerfiles from `../local-dev/` so the backend healthcheck can run `wget`. Reuses the dev keypair from `../local-dev/keys/`.
- `Caddyfile.dev`: the application-routing Caddy config, mounted into `galaxy-caddy` at `/etc/caddy/Caddyfile`.
- `Caddyfile.prod`: placeholder for a future prod deployment; not used by this compose.
- `Makefile`: wrapper over `docker compose` with helpers for engine, UI seeding, health probes, and full wipe.
- `.env.example`: non-secret defaults for the compose `${VAR:-}` expansions. Copy to `.env` if you want host-local overrides.
Known issues
See KNOWN-ISSUES.md for symptoms that surface
in the long-lived dev environment but are not yet fixed (currently:
the sandbox game flipping to cancelled after a redispatch).
Deployment cadence
This environment is single-tenant: one live deployment, redeployed by
the dev-deploy.yaml workflow on every merge into development. PR
branches do not auto-deploy here — pushes to feature/* only run the
test workflows (go-unit, ui-test, integration).
To put a feature branch on the shared dev environment before its PR merges (e.g. to validate a UI flow against the real Caddy edge), run the workflow manually:
- Push the branch (`git push gitea HEAD`).
- Gitea UI → Actions → Deploy · Dev → Run workflow, pick the feature ref.
The deploy is idempotent — when the PR later merges into
development, the regular push trigger fires the same packaging and
healthcheck steps, overwriting whatever the manual dispatch left
behind. There is no separate state to clean up between the two paths.
Relationship to other infrastructure
- `tools/local-dev/`: single-developer playground, host-port mapped, Vite dev server on the side. Recommended for active UI work.
- `.gitea/workflows/dev-deploy.yaml`: the CI side of this stack. It builds images, seeds the UI volume, and runs `docker compose up -d` on every merge into `development`. The Makefile in this directory is what that workflow ultimately calls into.