galaxy-game

Author	SHA1	Message	Date
Ilia Denisov	0cae89cba2	refactor(dev): remove the dev-sandbox bootstrap everywhere Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev Sandbox" game and dummy users are removed so the dev contour starts empty like prod; the separate legacy-report loader stays as the test-data path. - delete backend/internal/devsandbox (package + tests) - drop the bootstrap call + DevSandboxConfig (struct, Config field, BACKEND_DEV_SANDBOX_* env, defaults, loader, validation) - strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and .env.example; the generic engine-recycle / prune-broken-engines logic stays (it serves real games) - update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev README + Makefile) and stale comments; DeleteGame and InsertMembershipDirect remain (exercised by lobby integration tests) No app behaviour change beyond not auto-creating the sandbox game.	2026-05-31 22:28:03 +02:00
Ilia Denisov	f70258849f	fix(dev-deploy): seed geoip onto a named volume `docker restart galaxy-dev-backend` failed with "not a directory" after every dev-deploy workflow run. Root cause: the compose file bind-mounted the geoip database via a relative path (`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`). When the Gitea runner invoked `docker compose up`, the path resolved against the runner's ephemeral workspace under `/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source baked into the running container therefore pointed at that ephemeral path; the runner deleted the workspace once the workflow finished, and any later `docker restart` could not remount. Replace the bind with a named volume `galaxy-dev-geoip-data`, seeded at deploy time: - `tools/dev-deploy/docker-compose.yml`: mount `galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative bind. Declare the volume in the top-level `volumes:` block. - `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step (placed right after the existing UI-volume seed) copies the fixture from `pkg/geoip/test-data/test-data/` into the named volume via an ephemeral alpine container, the same pattern UI seeding already uses. - `tools/dev-deploy/Makefile`: new `seed-geoip` target performs the same copy from the persistent checkout. `up` and `rebuild` now depend on it, so a hand-run `make -C tools/dev-deploy up` populates the volume without operator action. - `tools/dev-deploy/README.md`: updated the make-targets table to list `seed-geoip`. - `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart failure is downgraded to a "fixed" postmortem; the symptom, cause, and where the fix lives are kept for future reference. 
Verification on the dev host (this branch checked out): $ make -C tools/dev-deploy up # populates the volume, brings stack healthy $ docker restart galaxy-dev-backend # used to error "not a directory" $ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done $ echo "ok" # backend up 6s, healthy The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived both `make up` and `docker restart` untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:59:38 +02:00
Ilia Denisov	a338ebf058	fix(integration): scope preclean to galaxy.stack=integration Root cause for the long-standing "Dev Sandbox flips to cancelled after dev-deploy" symptom in push-triggered cycles: when `integration.yaml` runs in parallel with `dev-deploy.yaml`, its `integration/scripts/preclean.sh` issues a `docker rm -f` over every container labelled `galaxy.backend=1`. That label is stamped by the backend's runtime adapter on every engine it spawns — including the engines living in the long-lived dev-deploy environment on the same Docker daemon. Each post-merge auto-deploy therefore had the integration preclean wipe the dev-sandbox engine, and the new backend's reconciler tick observed `container disappeared` and cascaded the sandbox into `cancelled`. Fix: - `integration/testenv/backend.go` now sets `BACKEND_STACK_LABEL=integration` on every backend-under-test, so the engines spawned by integration carry `galaxy.stack=integration` in addition to `galaxy.backend=1`. The backend support for this env was added in the previous CI tidy-up PR (#13). - `integration/scripts/preclean.sh` gains a multi-label AND filter helper and uses it to scope engine cleanup to the combination `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and local-dev engines carry different `galaxy.stack` values, so the AND match leaves them alone. - `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out the AND-scoping rule and the new integration backend stamp. - `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry gets an "Update" section recording the root cause and the fix; the status is downgraded to "partially fixed" because the solo `workflow_dispatch` reproduction (which does NOT trigger integration) remains unexplained. 
- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the `docker restart galaxy-dev-backend` failure caused by the runner-workspace bind-mount that surfaced while diagnosing this issue. Workaround: `make -C tools/dev-deploy up` from the persistent checkout. Real fix is a follow-up (bake fixture into image or copy to named volume). Verification: - `go build ./backend/... ./integration/...` — clean. - `bash -n integration/scripts/preclean.sh` — syntax OK. - Live AND-filter check on the dev host: `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration` returns nothing while the dev-deploy engine `galaxy-game-80f3ce86-...` keeps running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:37:55 +02:00
Ilia Denisov	49f614926a	KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses After the live investigation, the project owner confirms that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs with a fresh `docker events --since 0` capture armed before the deploy starts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:16:51 +02:00
Ilia Denisov	cadb72b412	KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap A live `docker inspect` of an engine container and two redispatch runs with `docker events` captured confirm: - Engine has no `com.docker.compose.*` labels and `AutoRemove=false`, so `--remove-orphans` cannot reap it. - Two consecutive `dev-deploy.yaml` redispatches with an engine already running emitted `die` / `destroy` events only for `galaxy-dev-{backend,api,caddy}` — never for the engine. - The reconciler tick that fires 60s after backend recreate correctly matched the surviving engine in both cases (`status=running` in both `games` and `runtime_records`). - `runtime.Service` has no `Shutdown` that proactively removes engine containers, so a graceful backend exit also leaves them alone. The repro window therefore needs a separate trigger that removed the engine container outside of compose. The new hypotheses point at host-side `docker prune` jobs, a `dockerd` restart that lost the container, or an early `Engine.Init` failure that exited the engine before `status=running` reached the runtime row. The investigation list now leads with `journalctl -u docker` and the host crontab — those are the cheapest checks to confirm or rule out next. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:10:13 +02:00
Ilia Denisov	5177fef2ef	tools/dev-deploy: log the sandbox-cancellation TODO Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot. `KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately. No code change — this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 22:56:25 +02:00

6 Commits
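
Commit a338ebf058 scopes the integration preclean with a multi-label AND filter. The sketch below is one way such a helper could look; it is not the code in `integration/scripts/preclean.sh`. The function name `label_filter_args` is hypothetical. It relies on the same behaviour as the verification command quoted in that commit: repeating `--filter label=...` on `docker ps` matches only containers that carry every listed label.

```shell
# Hypothetical helper, not the actual preclean.sh implementation.
# Emits one "--filter label=<key>=<value>" pair per argument.
# Repeated label filters on `docker ps` are combined with AND.
label_filter_args() {
  for label in "$@"; do
    printf '%s %s ' '--filter' "label=$label"
  done
}

# Print the command instead of running it, so the sketch is safe to
# execute on a host without Docker. The label values come from the
# commit message; they contain no spaces, so word splitting is safe.
echo "docker ps -aq $(label_filter_args galaxy.backend=1 galaxy.stack=integration)"
```

To turn the sketch into an actual cleanup, the IDs printed by the `docker ps -aq` call could be passed to `docker rm -f`. Engines labelled with a different `galaxy.stack` value, such as those in dev-deploy or local-dev, would not match the filter and would be left alone.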