26f1e6292492bcbeb2424f4e4487b4e28af36827
19 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
27916bbe61 |
feat(admin-console): Stage 1 — pipe + skeleton behind the gateway
Add the server-rendered operator console at /_gm, exposed publicly through the gateway behind the existing admin_accounts Basic Auth.
Backend:
- new internal/adminconsole package (html/template Renderer, stateless HMAC CSRF signer, embedded stylesheet)
- /_gm route group reusing basicauth.Middleware(admin.Service) + a CSRF guard (per-operator token + same-origin check); dashboard landing page
- BACKEND_ADMIN_CONSOLE_CSRF_KEY config (per-process random fallback)
Gateway:
- new "admin" public route class (per-IP rate limit, body + GET/HEAD/POST method limits) classifying /_gm traffic
- reverse proxy to the backend /_gm surface, preserving Host and relaying the backend 401 Basic Auth challenge; 502 when the backend is unreachable
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_ADMIN_* config
dev-deploy:
- Caddy routes /_gm/* to the gateway
- bootstrap admin + stable CSRF key; enable Prometheus /metrics exporters on backend and gateway (forward-compat for a future Prometheus/Grafana stack)
Docs: ARCHITECTURE 14.1/16, FUNCTIONAL 10.2.1 (+ru mirror), backend and gateway READMEs, new backend/docs/admin-console.md.
Tests: renderer + CSRF unit tests; backend router auth/render/asset/CSRF; gateway classifier, proxy forwarding/Host/401/405/413/429/502. |
||
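The "stateless HMAC CSRF signer" with a per-operator token named in the commit above can be sketched as follows. This is a minimal illustration, not the code in internal/adminconsole: the function names `csrfToken` and `validCSRF`, the choice of SHA-256, and hex encoding are all assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// csrfToken derives a per-operator token as HMAC-SHA256(key, operator).
// Nothing is stored server-side; the token is recomputed on every request.
// (Hypothetical helper; the real signer may bind more fields.)
func csrfToken(key []byte, operator string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(operator))
	return hex.EncodeToString(mac.Sum(nil))
}

// validCSRF recomputes the expected token and compares it in constant time.
func validCSRF(key []byte, operator, token string) bool {
	expected := csrfToken(key, operator)
	return hmac.Equal([]byte(expected), []byte(token))
}

func main() {
	key := []byte("example-key-not-for-production")
	tok := csrfToken(key, "alice")
	fmt.Println(validCSRF(key, "alice", tok)) // same operator, same key
	fmt.Println(validCSRF(key, "bob", tok))   // token bound to another operator
}
```

Because the token depends only on the key and the operator name, a per-process random key invalidates tokens on restart, which is presumably why the dev-deploy stack pins a stable key.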
|
|
e038ea6154 |
fix(dev-deploy): recycle engine containers on galaxy-engine:dev SHA drift
`backend`'s reconciler adopts pre-existing `galaxy-game-*` containers without comparing their image SHA against the freshly-built `galaxy-engine:dev`, so a long-lived sandbox would otherwise keep serving the previous engine code after a redeploy. Issue #59 surfaced this: after the per-command-rejection fix was deployed via `workflow_dispatch`, the running sandbox container was still on the old image SHA and the browser kept seeing the 503/unavailable response.
Adds a `Recycle engine containers on image drift` step right before `Reap stray dev-deploy containers`. The step compares the new `galaxy-engine:dev` SHA against every running `galaxy-game-*` container and, on drift, stops the backend, removes the container, wipes the bind-mounted per-game state directory (Engine.Init() writes turn-0 over any pre-existing `turn-N` files, which would otherwise silently corrupt state), and cascade-deletes the lobby `games` row. The `dev-sandbox` bootstrap on the next backend boot finds no live sandbox and provisions a fresh one on the new engine image.
When the engine sources are unchanged, the BuildKit cache hits and the SHA stays the same, so the recycle step is a no-op and the running games keep their state across the deploy.
Verified end-to-end against the live dev environment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
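The drift check in the commit above reduces to comparing one image ID against each running engine container. The workflow step itself is shell; the Go sketch below only illustrates the selection rule, and the `container` type and `driftedEngines` name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// container is a hypothetical, simplified view of a running container.
type container struct {
	Name    string
	ImageID string
}

// driftedEngines returns the names of galaxy-game-* containers whose image ID
// differs from the freshly built engine image. Other containers are ignored.
func driftedEngines(builtImageID string, running []container) []string {
	var out []string
	for _, c := range running {
		if !strings.HasPrefix(c.Name, "galaxy-game-") {
			continue // not an engine container
		}
		if c.ImageID != builtImageID {
			out = append(out, c.Name)
		}
	}
	return out
}

func main() {
	running := []container{
		{Name: "galaxy-game-a", ImageID: "sha256:old"},
		{Name: "galaxy-game-b", ImageID: "sha256:new"},
		{Name: "galaxy-dev-backend", ImageID: "sha256:old"},
	}
	fmt.Println(driftedEngines("sha256:new", running))
}
```

An empty result corresponds to the no-op case described in the commit: unchanged engine sources produce the same SHA, so nothing is recycled.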
|
|
8565942392 |
feat(deploy): single-origin path-based deployment + project site
Serve the whole stack behind one host: site at /, game UI at /game/, gateway REST at /api + /healthz, Connect at /rpc (prefix stripped by the edge Caddy). The built artifact is domain-agnostic: the UI talks to the gateway same-origin via relative URLs, so the same bundle runs under any host with no rebuild and with CORS disabled.
- Rename the Connect proto service galaxy.gateway.v1.EdgeGateway -> edge.v1.Gateway; regenerate Go + TS; public path /rpc/edge.v1.Gateway.
- Move the game UI under base path /game (env BASE_PATH); make the manifest, service-worker scope, WASM loader, and all navigation base-aware via a withBase helper.
- Relative API + /rpc Connect prefix; Vite dev proxy mirrors the strip.
- Rewrite the edge Caddy (dev + prod) for path-based routing; empty CORS allow-lists (same-origin); single host.
- New VitePress project site (site/): i18n en/ru with switcher, LaTeX math, minimal monospace theme; built and served at /.
- dev-deploy compose/Makefile + CI (dev-deploy, prod-build, new site-build) build and seed the site; probes hit /, /game/, /healthz.
- Sync docs (ARCHITECTURE, gateway README/openapi, dev-deploy & local-dev READMEs, CLAUDE.md, ui/PLAN).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b85a9e1b9b |
fix(dev-deploy): explicit Cache-Control on the UI surface
Caddy's `file_server` did not set Cache-Control on the SvelteKit build, so browsers fell back to heuristic caching keyed off Last-Modified. On the long-lived dev environment the heuristic window leaves the previous deploy's `index.html` cached for minutes-to-hours, and Safari combined that with stale conditional requests into a visible multi-second freeze on every reload (the reproduction was "private window reloads instantly, normal window hangs; clearing Safari caches restores normal speed"). Push delivery itself works (heartbeat keeps the SubscribeEvents stream alive), but the bundle path stalls behind the browser revalidating a chain of stale chunks.
Mirror the standard SvelteKit cache split inside both Caddyfiles:
- `_app/immutable/*` (hash-named JS/CSS chunks Vite emits with content-addressed file names): `Cache-Control: public, max-age=31536000, immutable`. Safe to cache forever because the name changes whenever the content does, so the next deploy serves new files under new URLs.
- Everything else (`index.html` fallback via `try_files`, `env.js`, `version.json`, `core.wasm`, `wasm_exec.js`, `favicon.svg`): `Cache-Control: no-cache, must-revalidate`. The browser still uses the cached body when the ETag matches, but it always asks first; a fresh deploy reaches the user on the next reload without a manual cache clear.
Smoke-tested locally: a docker-run Caddy with this config returns the immutable header only for `/_app/immutable/*` and the no-cache header for `/index.html`, `/env.js`, and the SPA-fallback path `/some/route`. The Caddyfile passes `caddy validate` in both `Caddyfile.dev` and `Caddyfile.prod`; the pre-existing formatting warning on line 7 is untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
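The cache split in the commit above is a two-way decision on the request path. The real rule lives in the Caddyfiles; the Go function below is only a sketch of the same logic, and the name `cacheControlFor` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// cacheControlFor mirrors the split described in the commit message:
// content-addressed chunks are cached forever, everything else revalidates.
// (Hypothetical helper; the deployed rule is expressed in Caddy config.)
func cacheControlFor(path string) string {
	if strings.HasPrefix(path, "/_app/immutable/") {
		return "public, max-age=31536000, immutable"
	}
	return "no-cache, must-revalidate"
}

func main() {
	for _, p := range []string{"/_app/immutable/chunks/a1b2.js", "/index.html", "/env.js", "/some/route"} {
		fmt.Printf("%s -> %s\n", p, cacheControlFor(p))
	}
}
```

The three non-immutable paths in `main` are the ones the commit's smoke test checked.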
|
|
f70258849f |
fix(dev-deploy): seed geoip onto a named volume
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:
- `tools/dev-deploy/docker-compose.yml`: mount
`galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
(placed right after the existing UI-volume seed) copies the
fixture from `pkg/geoip/test-data/test-data/` into the named
volume via an ephemeral alpine container, the same pattern UI
seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
the same copy from the persistent checkout. `up` and `rebuild`
now depend on it, so a hand-run `make -C tools/dev-deploy up`
populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
failure is downgraded to a "fixed" postmortem; the symptom,
cause, and where the fix lives are kept for future reference.
Verification on the dev host (this branch checked out):
$ make -C tools/dev-deploy up # populates the volume, brings stack healthy
$ docker restart galaxy-dev-backend # used to error "not a directory"
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
$ echo "ok" # backend up 6s, healthy
The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a338ebf058 |
fix(integration): scope preclean to galaxy.stack=integration
Root cause for the long-standing "Dev Sandbox flips to cancelled after dev-deploy" symptom in push-triggered cycles: when `integration.yaml` runs in parallel with `dev-deploy.yaml`, its `integration/scripts/preclean.sh` issues a `docker rm -f` over every container labelled `galaxy.backend=1`. That label is stamped by the backend's runtime adapter on every engine it spawns, including the engines living in the long-lived dev-deploy environment on the same Docker daemon. Each post-merge auto-deploy therefore had the integration preclean wipe the dev-sandbox engine, and the new backend's reconciler tick observed `container disappeared` and cascaded the sandbox into `cancelled`.
Fix:
- `integration/testenv/backend.go` now sets `BACKEND_STACK_LABEL=integration` on every backend-under-test, so the engines spawned by integration carry `galaxy.stack=integration` in addition to `galaxy.backend=1`. The backend support for this env was added in the previous CI tidy-up PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter helper and uses it to scope engine cleanup to the combination `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and local-dev engines carry different `galaxy.stack` values, so the AND match leaves them alone.
- `docs/ARCHITECTURE.md` "Container labels": refreshed to call out the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the sandbox-cancel entry gets an "Update" section recording the root cause and the fix; the status is downgraded to "partially fixed" because the solo `workflow_dispatch` reproduction (which does NOT trigger integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md`: separately, document the `docker restart galaxy-dev-backend` failure caused by the runner-workspace bind-mount that surfaced while diagnosing this issue. Workaround: `make -C tools/dev-deploy up` from the persistent checkout. Real fix is a follow-up (bake fixture into image or copy to named volume).
Verification:
- `go build ./backend/... ./integration/...`: clean.
- `bash -n integration/scripts/preclean.sh`: syntax OK.
- Live AND-filter check on the dev host: `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration` returns nothing while the dev-deploy engine `galaxy-game-80f3ce86-...` keeps running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
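The AND-scoping rule from the commit above means a container is eligible for cleanup only when every required label matches. The helper in `preclean.sh` is a shell function; this Go sketch only illustrates the matching semantics, and `matchesAll` is a hypothetical name.

```go
package main

import "fmt"

// matchesAll reports whether labels contains every required key with the
// exact required value (logical AND across all required labels).
func matchesAll(labels, required map[string]string) bool {
	for k, want := range required {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	required := map[string]string{"galaxy.backend": "1", "galaxy.stack": "integration"}

	integrationEngine := map[string]string{"galaxy.backend": "1", "galaxy.stack": "integration"}
	devDeployEngine := map[string]string{"galaxy.backend": "1", "galaxy.stack": "dev-deploy"}

	fmt.Println(matchesAll(integrationEngine, required)) // eligible for preclean
	fmt.Println(matchesAll(devDeployEngine, required))   // left alone
}
```

Filtering on `galaxy.backend=1` alone would match both containers, which is the behaviour the commit identifies as the root cause.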
|
|
daed2690c1 |
fix(compose): keep galaxy.stack label on containers only
The previous commit stamped `galaxy.stack=<value>` on services, volumes, and networks. Putting it on volumes/networks changes their compose config-hash on every label revision, so `docker compose up` tries to recreate them — which on the long-lived dev environment either destroys the postgres data volume or deadlocks while trying to remove `galaxy-dev-internal` with containers still bound to it. Observed live: run #184 hung in compose recreate after the three stateful services were stopped, with no recovery. Containers alone are sufficient for the cleanup contract (we filter containers, not volumes or networks). Roll back the label on volumes and networks in both compose files and capture the rule in docs/ARCHITECTURE.md so the next contributor does not reintroduce it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a9087691a3 |
chore(ci): tidy CI/dev infra — drop local-ci, lift migration rule, scope by galaxy.stack label
Five connected cleanups across the dev/CI infrastructure:
1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
the legacy "offline workflow validator"; the per-stage CI gate now
runs on gitea.lan and the directory was only retained as a
fallback. Removing it leaves no operational dependency: backend,
gateway, and game code have no references; documentation that
pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
in this same change. Historical "Verified on local-ci run N"
markers in ui/PLAN.md are preserved unchanged.
2. Lift the pre-production single-migration rule. The rule forced
every schema delta into 00001_init.sql and required a manual
make clean-data wipe on every backward-incompatible change in
tools/dev-deploy/. Future schema deltas now land as additive
sequence-numbered files (00002_*.sql, …) that goose applies
automatically on backend startup; 00001_init.sql becomes an
immutable baseline. Authoring conventions live in
backend/internal/postgres/migrations/README.md. The chain may be
squashed back into a fresh 00001 as a deliberate one-time
operation before the first production deployment.
3. Document the deployment cadence. The dev environment is
single-tenant: pushes to feature/* run the test workflows
(go-unit, ui-test, integration) only; dev-deploy.yaml fires on
push to development. A workflow_dispatch override on
dev-deploy.yaml lets a developer preview a feature branch on the
shared dev environment before merge; the next merge into
development overwrites the manual deploy idempotently.
4. Scope compose-managed resources by an explicit
galaxy.stack=<local-dev|dev-deploy> label. Both compose files
stamp the label on every service, network, and named volume.
Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
engine-cleanup operations by (stack-label AND engine OCI title)
so they never touch unrelated workloads on the same daemon.
dev-deploy.yaml gains a pre-`compose up` step that reaps stale
exited/dead containers under the dev-deploy stack label.
5. Backend now stamps the same galaxy.stack=<value> label on every
engine container it spawns, sourced from a new BACKEND_STACK_LABEL
env var (empty → label not applied; legacy-safe). Both compose
files set it to their stack name (local-dev / dev-deploy). The
contract is recorded in docs/ARCHITECTURE.md under
"Container labels". A package-level test in
backend/internal/runtime exercises both the label-present and
label-absent paths.
No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
49f614926a |
KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses
After the live investigation, the project owner confirms that none of the host-side cleanup paths apply: no docker prune cron, no manual `docker rm`, no `dockerd` restart in the window, and the engine binary does not crash while idling on API calls. Replace the host-side hypothesis list with a one-line note that they were considered and rejected, narrow the open suspicion to the `dev-deploy.yaml` job sequence (`docker build` + `docker compose build` + the alpine `docker run --rm` for UI seeding + `docker compose up -d --wait --remove-orphans`), and park the entry. Reopen if the symptom recurs with a fresh `docker events --since 0` capture armed before the deploy starts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
cadb72b412 |
KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The repro window therefore needs a separate trigger that removed
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
5177fef2ef |
tools/dev-deploy: log the sandbox-cancellation TODO
Capture the diagnostic notes for the issue we hit after every `dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox" game ends up `cancelled` ~15 minutes later, with the runtime reconciler reporting "container disappeared". The engine never shows up in `docker ps -a --filter label=galaxy-game-engine`, so either it never spawned or it was removed before any host-side snapshot. `KNOWN-ISSUES.md` records the symptom, the log excerpt, three working hypotheses (runtime spawn race, `--remove-orphans` interaction, engine `--rm` lifecycle), and the investigation checklist before opening an issue. The README gets a one-line pointer so future redeploys land on the doc immediately. No code change — this is the placeholder so the next person investigating the cancellation pattern does not have to rediscover the diagnostic from scratch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
57e6c1d253 |
gateway: CORS allow-list for the authenticated Connect-Web surface
The public REST listener already exposes `GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS`; the authenticated Connect-Web listener on the separate gRPC port had no equivalent. That worked in `tools/local-dev` (Vite proxy makes everything same-origin) and would work in production once UI and gateway share a single hostname, but the long-lived dev environment serves the UI from `https://www.galaxy.lan` and the gateway from `https://api.galaxy.lan`, so every `/galaxy.gateway.v1.EdgeGateway/*` fetch failed in the browser with the WebKit "Load failed" generic message because the response carried no `Access-Control-Allow-Origin` header. Lobby rendered as "[unknown] Load failed" with no game.
Mirror the public-REST CORS surface for the authenticated handler:
- new env `GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS`;
- new `AuthenticatedGRPCConfig.CORSAllowedOrigins` field;
- new `grpcapi.withCORS` middleware wrapping the Connect mux;
- dev-deploy stack sets the env to `https://www.galaxy.lan`.
The middleware speaks plain net/http (the Connect handler is mounted on a ServeMux, not gin), handles preflight 204 immediately, and exposes the Connect-Web header set the browser needs to read the response (`Grpc-Status`, `Grpc-Message`, `Connect-Protocol-Version`). Empty allow-list disables the middleware, so production stays at "single hostname" by default.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
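A plain net/http CORS wrapper with the behaviour described in the commit above (allow-list, immediate 204 on preflight, exposed Connect-Web headers, disabled when the list is empty) might look like the sketch below. It is not the actual `grpcapi.withCORS`: the signature, the allowed-methods list, and echoing `Access-Control-Request-Headers` back are assumptions, and `probe` is a hypothetical helper used only to exercise the wrapper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// withCORS wraps next with an origin allow-list. An empty list returns next
// unchanged, so the middleware is effectively disabled.
func withCORS(allowed []string, next http.Handler) http.Handler {
	if len(allowed) == 0 {
		return next
	}
	set := make(map[string]struct{}, len(allowed))
	for _, o := range allowed {
		set[o] = struct{}{}
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		origin := r.Header.Get("Origin")
		if _, ok := set[origin]; ok {
			h := w.Header()
			h.Set("Access-Control-Allow-Origin", origin)
			h.Add("Vary", "Origin")
			h.Set("Access-Control-Expose-Headers", "Grpc-Status, Grpc-Message, Connect-Protocol-Version")
			// Preflight: answer immediately without reaching the handler.
			if r.Method == http.MethodOptions && r.Header.Get("Access-Control-Request-Method") != "" {
				h.Set("Access-Control-Allow-Methods", "GET, POST, OPTIONS") // assumed method set
				h.Set("Access-Control-Allow-Headers", r.Header.Get("Access-Control-Request-Headers"))
				w.WriteHeader(http.StatusNoContent)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one request through withCORS and returns the status code and
// the Access-Control-Allow-Origin response header.
func probe(allowed []string, method, origin string) (int, string) {
	next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(method, "/", nil)
	req.Header.Set("Origin", origin)
	if method == http.MethodOptions {
		req.Header.Set("Access-Control-Request-Method", "POST")
	}
	rec := httptest.NewRecorder()
	withCORS(allowed, next).ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("Access-Control-Allow-Origin")
}

func main() {
	allowed := []string{"https://www.galaxy.lan"}
	fmt.Println(probe(allowed, http.MethodOptions, "https://www.galaxy.lan"))
	fmt.Println(probe(allowed, http.MethodPost, "https://other.example"))
}
```

Requests from origins outside the allow-list still reach the handler but receive no CORS headers, so the browser blocks the response on its side.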
|
|
4b2a949f12 |
dev-deploy Caddy: route Connect-Web traffic to gateway :9090
`api.galaxy.lan` was proxying every path to `galaxy-api:8080` (the public REST listener), so authenticated Connect-Web calls (`/galaxy.gateway.v1.EdgeGateway/ExecuteCommand`, `/galaxy.gateway.v1.EdgeGateway/SubscribeEvents`) collapsed to a 404 from the public route table — the lobby loaded the static bundle but every authenticated query failed silently. Split routing by path: `/galaxy.gateway.v1.EdgeGateway/*` goes to the authenticated listener on `:9090`, everything else stays on `:8080`. Mirrors the Vite dev-server proxy in `ui/frontend/vite.config.ts`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
81917acc3e |
dev-deploy: enable Dev Sandbox bootstrap and synthetic-report loader
Two long-standing dev-environment ergonomics had not survived the move from the bespoke local-dev stack to the CI-driven dev-deploy:
1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the dev-deploy compose, so the auto-provisioned "Dev Sandbox" game never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan` as the default, which matches `.env.example` and lets a developer who logs in with that email find a ready-to-play game in the lobby.
2. The lobby's synthetic-report loader was gated on `import.meta.env.DEV`, which is true only for `vite dev` (the tools/local-dev path). The long-lived dev environment builds with `vite build` (production mode), so the section was always stripped from its bundle. Gate it on an explicit `VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in `.env.development` (preserves `pnpm dev` behaviour) and in the `dev-deploy.yaml` build step. The `prod-build.yaml` build path leaves the flag unset, so production stays clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
8bc75fd71b |
dev-deploy: default BACKEND_AUTH_DEV_FIXED_CODE to 123456
The long-lived dev environment now opts into the bcrypt-bypass on a fresh `up`/`rebuild` so a returning developer can sign in with `123456` even after the matching browser session was cleared (the real emailed code is single-use). Set the variable to an empty string in `.env` to force real Mailpit codes (mail-flow QA). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
1855e43699 |
gateway: add CORS allow-list for the public REST surface
Adds a `GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS` env-driven allow-list on the public REST server so the dev UI on https://www.galaxy.lan can call https://api.galaxy.lan without the browser blocking the cross-origin response. Defaults to empty (no CORS) so the production posture stays closed. The middleware mounts before route classification and anti-abuse, so OPTIONS preflights never charge against per-class rate-limit buckets. `tools/dev-deploy/docker-compose.yml` opts the dev gateway into a single allowed origin (`https://www.galaxy.lan`); local-dev keeps the defaults because Vite proxies through the same origin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bb74e3336e |
dev-deploy: restore GeoIP bind-mount, drop image bake
With the runner in host-mode, compose bind-mount paths resolve to real host paths the Docker daemon can see, so the GeoIP file no longer needs to be baked into the backend image to survive CI. Bring back the bind-mount of `pkg/geoip/test-data/.../mmdb`, matching how local-dev sources it. Image now only carries the backend binary, symmetric with the production `backend/Dockerfile`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0da360a644 |
dev-deploy: fix backend startup in CI
Two bugs surfaced on the first real merge into development:
1. `${{ env.HOME }}` evaluates to empty string at the workflow stage,
so GALAXY_DEV_GAME_STATE_DIR became `/.galaxy-dev/game-state`.
Resolve in the shell instead of YAML.
2. The compose bind-mount of GeoIP2-Country-Test.mmdb referenced a
path inside the runner's workspace volume, which the host Docker
daemon cannot see — it created an empty directory and the backend
crashed with "geoip database: is a directory" in a restart loop.
Bake the file into the backend image so dev-deploy no longer needs
a bind-mount; local-dev compose still mounts it on top for swap-in
during development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
00c79064fc |
tools/dev-deploy: long-lived dev environment behind host Caddy
A docker-compose stack that hosts postgres, redis, mailpit, backend, gateway, and an app-routing Caddy. Reachable through the host Caddy at https://www.galaxy.lan (static SPA) and https://api.galaxy.lan (REST + gRPC). Coexists with tools/local-dev/ and tools/local-ci/ by giving every name (compose project, container, network, volume) a distinct galaxy-dev-* prefix. State is persisted in named volumes; game-state lives under ${GALAXY_DEV_GAME_STATE_DIR:-$HOME/.galaxy-dev/game-state} so the default works for a non-root runner without sudo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |