24 Commits

Author SHA1 Message Date
Ilia Denisov 45815c27d9 fix(dev-deploy): probe Mailpit /mailpit/livez under MP_WEBROOT
MP_WEBROOT=/mailpit prefixes every Mailpit HTTP route, including the
/livez health endpoint. The container healthcheck still probed
http://localhost:8025/livez, which now 404s, so Mailpit reported
unhealthy; the backend depends_on it with condition: service_healthy
and never started, cascading to the gateway and Caddy and failing
`docker compose up --wait`. Point the healthcheck at /mailpit/livez.
2026-06-01 06:11:25 +02:00
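Note on 45815c27d9: the failure mode is easy to reproduce in isolation. The sketch below is not Mailpit code; it assumes a hypothetical handler (`newPrefixedHandler`) that, like a service started with a web root, mounts its liveness route only under the prefix, and shows why the unprefixed probe gets a 404.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newPrefixedHandler mimics a service that mounts every route, including
// its liveness endpoint, under a web root such as "/mailpit".
// Hypothetical helper for illustration only.
func newPrefixedHandler(webroot string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc(strings.TrimRight(webroot, "/")+"/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := newPrefixedHandler("/mailpit")
	fmt.Println(probe(h, "/livez"))         // the old healthcheck path misses the route
	fmt.Println(probe(h, "/mailpit/livez")) // the prefixed path reaches it
}
```

Any healthcheck that hard-codes a path has to be updated together with the web root, which is what this commit does.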
Ilia Denisov e11092234c feat(dev-deploy): expose Grafana + Mailpit UIs via Caddy; seed monitoring config
Deploy wiring for the observability stack (the services and collector
config landed in the previous commit):

- Caddyfile.dev: route /grafana/* to galaxy-grafana:3000 (Caddy
  sub-path mode, Grafana keeps its own login) and /mailpit/* to
  galaxy-mailpit:8025 behind dev basic-auth, so the captured-mail UI
  (every message, relayed or not) and Grafana are reachable through the
  single dev origin.
- dev-deploy.yaml: seed the monitoring config tree to a stable,
  reboot-surviving host path (GALAXY_DEV_MONITORING_DIR) before bringing
  the stack up, and inject the Grafana admin password from a Gitea
  secret (GALAXY_DEV_GRAFANA_ADMIN_PASSWORD; empty falls back to the
  compose default).
2026-06-01 05:46:19 +02:00
Ilia Denisov 84a0ccb23f feat(dev-deploy): full observability stack (Prometheus/Grafana/Loki/Tempo)
Stand up a production-mirror monitoring stack in the long-lived dev
contour, all on galaxy-dev-internal with no host ports (reached only via
the in-repo galaxy-dev-caddy):

- Prometheus scrapes backend:9100, gateway:9191, node-exporter and
  cadvisor (30s interval, 15d retention); Loki (7d) + promtail (Docker
  service discovery by the galaxy.stack=dev-deploy label) for logs;
  Tempo (3d) for traces.
- Backend and gateway now export OTLP traces to Tempo over plaintext
  gRPC on the internal network (OTEL_EXPORTER_OTLP_INSECURE).
- Grafana provisioned as code (Prometheus/Loki/Tempo datasources plus a
  starter dashboard), served under /grafana/ via Caddy sub-path mode;
  admin password from the GALAXY_DEV_GRAFANA_ADMIN_PASSWORD secret.
- Expose the Mailpit capture UI under /mailpit/ (Caddy basic-auth +
  MP_WEBROOT) so every captured message is readable regardless of relay.
- dev-deploy.yaml seeds the monitoring config to a stable, reboot-
  surviving host path and injects the Grafana admin secret.

Per-service memory limits keep the footprint within budget. All
collector config lives under tools/dev-deploy/monitoring/ for dev/prod
parity.
2026-05-31 23:39:06 +02:00
Ilia Denisov 7fb6a63c2b feat(dev-deploy): relay Mailpit to Gmail (Stage 3)
Keep Mailpit as the backend's SMTP submission point and turn on its
relay so OTP/notification mail addressed to the owner reaches a real
Gmail inbox, while everything else stays captured-only.

- mailpit gains --smtp-relay-config + --smtp-relay-matching (default
  non-routable, so an unconfigured stack only captures); relay.conf is
  mounted from a new galaxy-dev-mailpit-config volume
- tools/dev-deploy/mailpit/relay.conf.tmpl + a dev-deploy.yaml step that
  renders it from Gitea secrets (Gmail App Password, never committed)
  and seeds the volume; the GALAXY_DEV_MAIL_RELAY_MATCH var drives the
  relay-matching recipient
- backend SMTP config unchanged (still -> galaxy-mailpit:1025)
- dev-deploy README documents the relay + required secrets/vars

Verified locally: compose config valid; the rendered relay.conf is
accepted by mailpit v1.21.8 (relay + recipient-matching enabled).
Real Gmail delivery is verified at the dev-deploy preview once the
owner sets the secrets.
2026-05-31 22:44:32 +02:00
Ilia Denisov 0cae89cba2 refactor(dev): remove the dev-sandbox bootstrap everywhere
Tests · Go / test (push) Successful in 1m59s
Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev
Sandbox" game and dummy users are removed so the dev contour starts
empty like prod; the separate legacy-report loader stays as the
test-data path.

- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
  BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
  .env.example; the generic engine-recycle / prune-broken-engines logic
  stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
  README + Makefile) and stale comments; DeleteGame and
  InsertMembershipDirect remain (exercised by lobby integration tests)

No app behaviour change beyond not auto-creating the sandbox game.
2026-05-31 22:28:03 +02:00
Ilia Denisov 27916bbe61 feat(admin-console): Stage 1 — pipe + skeleton behind the gateway
Tests · Go / test (push) Successful in 2m0s
Add the server-rendered operator console at /_gm, exposed publicly through
the gateway behind the existing admin_accounts Basic Auth.

Backend:
- new internal/adminconsole package (html/template Renderer, stateless HMAC
  CSRF signer, embedded stylesheet)
- /_gm route group reusing basicauth.Middleware(admin.Service) + a CSRF guard
  (per-operator token + same-origin check); dashboard landing page
- BACKEND_ADMIN_CONSOLE_CSRF_KEY config (per-process random fallback)

Gateway:
- new "admin" public route class (per-IP rate limit, body + GET/HEAD/POST
  method limits) classifying /_gm traffic
- reverse proxy to the backend /_gm surface, preserving Host and relaying the
  backend 401 Basic Auth challenge; 502 when the backend is unreachable
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_ADMIN_* config

dev-deploy:
- Caddy routes /_gm/* to the gateway
- bootstrap admin + stable CSRF key; enable Prometheus /metrics exporters on
  backend and gateway (forward-compat for a future Prometheus/Grafana stack)

Docs: ARCHITECTURE 14.1/16, FUNCTIONAL 10.2.1 (+ru mirror), backend and
gateway READMEs, new backend/docs/admin-console.md.

Tests: renderer + CSRF unit tests; backend router auth/render/asset/CSRF;
gateway classifier, proxy forwarding/Host/401/405/413/429/502.
2026-05-31 19:50:15 +02:00
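Note on 27916bbe61: a stateless HMAC CSRF signer can be sketched as follows. This is an assumption about the general technique, not the code in internal/adminconsole; the function names `signToken` and `verifyToken`, the choice of SHA-256, and binding the token to the operator name alone are all hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signToken derives a per-operator CSRF token from a server-side key.
// Nothing is stored: the token can be recomputed on every request.
func signToken(key []byte, operator string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(operator))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyToken recomputes the MAC and compares it in constant time.
func verifyToken(key []byte, operator, token string) bool {
	got, err := hex.DecodeString(token)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(operator))
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	key := []byte("example-key-not-for-production")
	tok := signToken(key, "alice")
	fmt.Println(verifyToken(key, "alice", tok)) // token matches its operator
	fmt.Println(verifyToken(key, "bob", tok))   // token is rejected for another operator
}
```

With a per-process random key, tokens stop validating after a restart, which is presumably why the dev-deploy stack pins a stable key.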
Ilia Denisov e038ea6154 fix(dev-deploy): recycle engine containers on galaxy-engine:dev SHA drift
Tests · Integration / integration (pull_request) Successful in 1m48s
Tests · Go / test (pull_request) Successful in 2m1s
`backend`'s reconciler adopts pre-existing `galaxy-game-*` containers
without comparing their image SHA against the freshly-built
`galaxy-engine:dev`, so without this fix a long-lived sandbox keeps
serving the previous engine code after a redeploy. Issue #59 surfaced
this: after the per-command-rejection fix was deployed via
`workflow_dispatch`, the running sandbox container was still on the
old image SHA and the browser kept seeing the 503/unavailable response.

Adds a `Recycle engine containers on image drift` step right before
`Reap stray dev-deploy containers`. The step compares the new
`galaxy-engine:dev` SHA against every running `galaxy-game-*`
container and, on drift, stops the backend, removes the container,
wipes the bind-mounted per-game state directory (Engine.Init() writes
turn-0 over any pre-existing `turn-N` files — silent state corruption
otherwise), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.

When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
games keep their state across the deploy. Verified end-to-end against
the live dev environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 10:47:25 +02:00
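Note on e038ea6154: the drift check reduces to comparing one image ID against the image ID of each running engine container. The workflow step itself is shell; the Go sketch below only illustrates the comparison, and the function name `driftedContainers` and the sample IDs are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// driftedContainers returns the names of containers whose image ID differs
// from wantImageID, sorted for stable output. An empty result means the
// recycle step has nothing to do.
func driftedContainers(wantImageID string, running map[string]string) []string {
	var out []string
	for name, imageID := range running {
		if imageID != wantImageID {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical container names and image IDs.
	running := map[string]string{
		"galaxy-game-a": "sha256:old",
		"galaxy-game-b": "sha256:new",
	}
	fmt.Println(driftedContainers("sha256:new", running))
}
```

When the build cache hits and the image ID is unchanged, the function returns an empty slice, matching the no-op behaviour described in the commit.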
Ilia Denisov 8565942392 feat(deploy): single-origin path-based deployment + project site
Build · Site / build (push) Successful in 8s
Tests · Go / test (push) Successful in 2m22s
Tests · UI / test (push) Failing after 2m42s
Serve the whole stack behind one host: site at /, game UI at /game/,
gateway REST at /api + /healthz, Connect at /rpc (prefix stripped by the
edge Caddy). The built artifact is domain-agnostic — the UI talks to the
gateway same-origin via relative URLs, so the same bundle runs under any
host with no rebuild and with CORS disabled.

- Rename the Connect proto service galaxy.gateway.v1.EdgeGateway ->
  edge.v1.Gateway; regenerate Go + TS; public path /rpc/edge.v1.Gateway.
- Move the game UI under base path /game (env BASE_PATH); make the
  manifest, service-worker scope, WASM loader, and all navigation
  base-aware via a withBase helper.
- Relative API + /rpc Connect prefix; Vite dev proxy mirrors the strip.
- Rewrite the edge Caddy (dev + prod) for path-based routing; empty CORS
  allow-lists (same-origin); single host.
- New VitePress project site (site/): i18n en/ru with switcher, LaTeX
  math, minimal monospace theme; built and served at /.
- dev-deploy compose/Makefile + CI (dev-deploy, prod-build, new
  site-build) build and seed the site; probes hit /, /game/, /healthz.
- Sync docs (ARCHITECTURE, gateway README/openapi, dev-deploy &
  local-dev READMEs, CLAUDE.md, ui/PLAN).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:19:07 +02:00
Ilia Denisov b85a9e1b9b fix(dev-deploy): explicit Cache-Control on the UI surface
Caddy's `file_server` did not set Cache-Control on the SvelteKit
build, so browsers fell back to heuristic caching keyed off
Last-Modified. On the long-lived dev environment the heuristic
window leaves the previous deploy's `index.html` cached for
minutes-to-hours, and Safari combined that with stale conditional
requests into a visible multi-second freeze on every reload (the
reproduction was "private window reloads instantly, normal window
hangs; clearing Safari caches restores normal speed"). Push
delivery itself works — heartbeat keeps the SubscribeEvents stream
alive — but the bundle path stalls behind the browser revalidating
a chain of stale chunks.

Mirror the standard SvelteKit cache split inside both Caddyfiles:

- `_app/immutable/*` — hash-named JS/CSS chunks Vite emits with
  content-addressed file names — `Cache-Control:
  public, max-age=31536000, immutable`. Safe to cache forever
  because the name changes whenever the content does, so the next
  deploy serves new files under new URLs.
- Everything else (`index.html` fallback via `try_files`,
  `env.js`, `version.json`, `core.wasm`, `wasm_exec.js`,
  `favicon.svg`) — `Cache-Control: no-cache, must-revalidate`.
  The browser still uses the cached body when the ETag matches,
  but it always asks first; a fresh deploy reaches the user on
  the next reload without a manual cache clear.

Smoke-tested locally: a docker-run Caddy with this config returns
the immutable header only for `/_app/immutable/*` and the
no-cache header for `/index.html`, `/env.js`, and the SPA-fallback
path `/some/route`. The Caddyfile passes `caddy validate` in
both `Caddyfile.dev` and `Caddyfile.prod`; the pre-existing
formatting warning on line 7 is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 10:11:09 +02:00
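Note on b85a9e1b9b: the cache split is a single prefix test. The real rule lives in the Caddyfiles; the Go sketch below restates it so the two policies can be checked side by side. The function name `cacheControlFor` is hypothetical, and the header values are the ones quoted in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Content-addressed chunks: safe to cache for a year.
	immutablePolicy = "public, max-age=31536000, immutable"
	// Everything else: the browser must revalidate before reuse.
	revalidatePolicy = "no-cache, must-revalidate"
)

// cacheControlFor picks the Cache-Control value for a request path.
func cacheControlFor(path string) string {
	if strings.HasPrefix(path, "/_app/immutable/") {
		return immutablePolicy
	}
	return revalidatePolicy
}

func main() {
	for _, p := range []string{"/_app/immutable/chunks/a1b2.js", "/index.html", "/some/route"} {
		fmt.Printf("%s -> %s\n", p, cacheControlFor(p))
	}
}
```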
Ilia Denisov f70258849f fix(dev-deploy): seed geoip onto a named volume
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.

Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:

- `tools/dev-deploy/docker-compose.yml`: mount
  `galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
  bind. Declare the volume in the top-level `volumes:` block.

- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
  (placed right after the existing UI-volume seed) copies the
  fixture from `pkg/geoip/test-data/test-data/` into the named
  volume via an ephemeral alpine container, the same pattern UI
  seeding already uses.

- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
  the same copy from the persistent checkout. `up` and `rebuild`
  now depend on it, so a hand-run `make -C tools/dev-deploy up`
  populates the volume without operator action.

- `tools/dev-deploy/README.md`: updated the make-targets table to
  list `seed-geoip`.

- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
  failure is downgraded to a "fixed" postmortem; the symptom,
  cause, and where the fix lives are kept for future reference.

Verification on the dev host (this branch checked out):

  $ make -C tools/dev-deploy up                # populates the volume, brings stack healthy
  $ docker restart galaxy-dev-backend          # used to error "not a directory"
  $ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
  $ echo "ok"                                   # backend up 6s, healthy

The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:59:38 +02:00
Ilia Denisov a338ebf058 fix(integration): scope preclean to galaxy.stack=integration
Tests · Integration / integration (pull_request) Successful in 1m37s
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.

Fix:

- `integration/testenv/backend.go` now sets
  `BACKEND_STACK_LABEL=integration` on every backend-under-test, so
  the engines spawned by integration carry
  `galaxy.stack=integration` in addition to `galaxy.backend=1`. The
  backend support for this env was added in the previous CI tidy-up
  PR (#13).

- `integration/scripts/preclean.sh` gains a multi-label AND filter
  helper and uses it to scope engine cleanup to the combination
  `galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
  local-dev engines carry different `galaxy.stack` values, so the
  AND match leaves them alone.

- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
  the AND-scoping rule and the new integration backend stamp.

- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
  gets an "Update" section recording the root cause and the fix; the
  status is downgraded to "partially fixed" because the solo
  `workflow_dispatch` reproduction (which does NOT trigger
  integration) remains unexplained.

- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
  `docker restart galaxy-dev-backend` failure caused by the
  runner-workspace bind-mount that surfaced while diagnosing this
  issue. Workaround: `make -C tools/dev-deploy up` from the
  persistent checkout. The real fix is a follow-up (bake the fixture
  into the image or copy it to a named volume).

Verification:

- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
  `docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
  returns nothing while the dev-deploy engine
  `galaxy-game-80f3ce86-...` keeps running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:37:55 +02:00
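Note on a338ebf058: the AND-scoping rule means a container is selected only when every required label matches. The actual helper is a shell function in preclean.sh; the Go sketch below shows the same predicate, with the hypothetical name `matchesAll`.

```go
package main

import "fmt"

// matchesAll reports whether labels contains every key/value pair in
// required. A single missing or different label excludes the container.
func matchesAll(labels, required map[string]string) bool {
	for k, v := range required {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	required := map[string]string{"galaxy.backend": "1", "galaxy.stack": "integration"}

	integrationEngine := map[string]string{"galaxy.backend": "1", "galaxy.stack": "integration"}
	devDeployEngine := map[string]string{"galaxy.backend": "1", "galaxy.stack": "dev-deploy"}

	fmt.Println(matchesAll(integrationEngine, required)) // selected for cleanup
	fmt.Println(matchesAll(devDeployEngine, required))   // left alone
}
```

Filtering on `galaxy.backend=1` alone corresponds to a `required` map with one entry, which matches both engines and reproduces the original bug.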
Ilia Denisov daed2690c1 fix(compose): keep galaxy.stack label on containers only
Tests · Integration / integration (pull_request) Successful in 1m41s
Tests · Go / test (pull_request) Successful in 2m0s
The previous commit stamped `galaxy.stack=<value>` on services,
volumes, and networks. Putting it on volumes/networks changes their
compose config-hash on every label revision, so `docker compose up`
tries to recreate them — which on the long-lived dev environment
either destroys the postgres data volume or deadlocks while trying
to remove `galaxy-dev-internal` with containers still bound to it.
Observed live: run #184 hung in compose recreate after the three
stateful services were stopped, with no recovery.

Containers alone are sufficient for the cleanup contract (we filter
containers, not volumes or networks). Roll back the label on volumes
and networks in both compose files and capture the rule in
docs/ARCHITECTURE.md so the next contributor does not reintroduce it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:00:21 +02:00
Ilia Denisov a9087691a3 chore(ci): tidy CI/dev infra — drop local-ci, lift migration rule, scope by galaxy.stack label
Tests · Go / test (push) Successful in 2m6s
Tests · Go / test (pull_request) Successful in 3m1s
Tests · Integration / integration (pull_request) Successful in 1m42s
Five connected cleanups across the dev/CI infrastructure:

1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
   the legacy "offline workflow validator"; the per-stage CI gate now
   runs on gitea.lan and the directory was only retained as a
   fallback. Removing it leaves no operational dependency: backend,
   gateway, and game code have no references; documentation that
   pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
   tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
   in this same change. Historical "Verified on local-ci run N"
   markers in ui/PLAN.md are preserved unchanged.

2. Lift the pre-production single-migration rule. The rule forced
   every schema delta into 00001_init.sql and required a manual
   make clean-data wipe on every backward-incompatible change in
   tools/dev-deploy/. Future schema deltas now land as additive
   sequence-numbered files (00002_*.sql, …) that goose applies
   automatically on backend startup; 00001_init.sql becomes an
   immutable baseline. Authoring conventions live in
   backend/internal/postgres/migrations/README.md. The chain may be
   squashed back into a fresh 00001 as a deliberate one-time
   operation before the first production deployment.

3. Document the deployment cadence. The dev environment is
   single-tenant: pushes to feature/* run the test workflows
   (go-unit, ui-test, integration) only; dev-deploy.yaml fires on
   push to development. A workflow_dispatch override on
   dev-deploy.yaml lets a developer preview a feature branch on the
   shared dev environment before merge; the next merge into
   development overwrites the manual deploy idempotently.

4. Scope compose-managed resources by an explicit
   galaxy.stack=<local-dev|dev-deploy> label. Both compose files
   stamp the label on every service, network, and named volume.
   Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
   engine-cleanup operations by (stack-label AND engine OCI title)
   so they never touch unrelated workloads on the same daemon.
   dev-deploy.yaml gains a pre-`compose up` step that reaps stale
   exited/dead containers under the dev-deploy stack label.

5. Backend now stamps the same galaxy.stack=<value> label on every
   engine container it spawns, sourced from a new BACKEND_STACK_LABEL
   env var (empty → label not applied; legacy-safe). Both compose
   files set it to their stack name (local-dev / dev-deploy). The
   contract is recorded in docs/ARCHITECTURE.md under
   "Container labels". A package-level test in
   backend/internal/runtime exercises both the label-present and
   label-absent paths.

No tests regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:32:42 +02:00
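Note on a9087691a3, item 5: the "empty means no label" contract can be sketched as below. This is an illustration under assumptions, not the backend's runtime adapter; the function name `engineLabels` is hypothetical, and only the `galaxy.stack` key and the `BACKEND_STACK_LABEL` variable come from the commit.

```go
package main

import (
	"fmt"
	"os"
)

// engineLabels builds the label set for a spawned engine container.
// An empty stackLabel leaves galaxy.stack out entirely, so deployments
// that never set the variable behave as before.
func engineLabels(stackLabel string) map[string]string {
	labels := map[string]string{}
	if stackLabel != "" {
		labels["galaxy.stack"] = stackLabel
	}
	return labels
}

func main() {
	// In a real process the value would come from the environment.
	fmt.Println(engineLabels(os.Getenv("BACKEND_STACK_LABEL")))
	fmt.Println(engineLabels("dev-deploy"))
}
```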
Ilia Denisov 49f614926a KNOWN-ISSUES: park sandbox-cancel; owner rejected host-side hypotheses
After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.

Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:16:51 +02:00
Ilia Denisov cadb72b412 KNOWN-ISSUES: rule out compose orphan reap; narrow to host-side reap
Tests · UI / test (push) Successful in 2m36s
Tests · Go / test (push) Successful in 2m38s
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:

- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
  so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
  already running emitted `die` / `destroy` events only for
  `galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
  correctly matched the surviving engine in both cases
  (`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
  engine containers, so a graceful backend exit also leaves them
  alone.

The repro window therefore needs a separate trigger that removed
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:10:13 +02:00
Ilia Denisov 5177fef2ef tools/dev-deploy: log the sandbox-cancellation TODO
Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never
shows up in `docker ps -a --filter label=galaxy-game-engine`, so
either it never spawned or it was removed before any host-side
snapshot.

`KNOWN-ISSUES.md` records the symptom, the log excerpt, three
working hypotheses (runtime spawn race, `--remove-orphans`
interaction, engine `--rm` lifecycle), and the investigation
checklist before opening an issue. The README gets a one-line
pointer so future redeploys land on the doc immediately.

No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to
rediscover the diagnostic from scratch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:56:25 +02:00
Ilia Denisov 57e6c1d253 gateway: CORS allow-list for the authenticated Connect-Web surface
Tests · Go / test (push) Successful in 2m9s
Tests · Go / test (pull_request) Successful in 2m9s
Tests · Integration / integration (pull_request) Successful in 1m47s
Tests · UI / test (pull_request) Successful in 2m52s
The public REST listener already exposes
`GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS`; the authenticated
Connect-Web listener on the separate gRPC port had no equivalent.
That worked in `tools/local-dev` (Vite proxy makes everything
same-origin) and would work in production once UI and gateway share
a single hostname, but the long-lived dev environment serves the
UI from `https://www.galaxy.lan` and the gateway from
`https://api.galaxy.lan` — every `/galaxy.gateway.v1.EdgeGateway/*`
fetch failed in the browser with the WebKit "Load failed" generic
message because the response carried no `Access-Control-Allow-Origin`
header. Lobby rendered as "[unknown] Load failed" with no game.

Mirror the public-REST CORS surface for the authenticated handler:

- new env `GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS`;
- new `AuthenticatedGRPCConfig.CORSAllowedOrigins` field;
- new `grpcapi.withCORS` middleware wrapping the Connect mux;
- dev-deploy stack sets the env to `https://www.galaxy.lan`.

The middleware speaks plain net/http (the Connect handler is mounted
on a ServeMux, not gin), handles preflight 204 immediately, and
exposes the Connect-Web header set the browser needs to read the
response (`Grpc-Status`, `Grpc-Message`, `Connect-Protocol-Version`).
Empty allow-list disables the middleware — production stays at
"single hostname" by default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:15:11 +02:00
Ilia Denisov 4b2a949f12 dev-deploy Caddy: route Connect-Web traffic to gateway :9090
Tests · Integration / integration (pull_request) Successful in 1m44s
Tests · Go / test (pull_request) Successful in 2m6s
Tests · UI / test (pull_request) Successful in 2m27s
`api.galaxy.lan` was proxying every path to `galaxy-api:8080` (the
public REST listener), so authenticated Connect-Web calls
(`/galaxy.gateway.v1.EdgeGateway/ExecuteCommand`,
`/galaxy.gateway.v1.EdgeGateway/SubscribeEvents`) collapsed to a 404
from the public route table — the lobby loaded the static bundle
but every authenticated query failed silently.

Split routing by path: `/galaxy.gateway.v1.EdgeGateway/*` goes to
the authenticated listener on `:9090`, everything else stays on
`:8080`. Mirrors the Vite dev-server proxy in
`ui/frontend/vite.config.ts`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:03:55 +02:00
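Note on 4b2a949f12: the routing split is a prefix match on the request path. The real rule is a Caddyfile matcher; the Go sketch below restates it. The function name `upstreamFor` is hypothetical, and using the `galaxy-api` host for the `:9090` listener is an assumption (the commit names the host only for `:8080`).

```go
package main

import (
	"fmt"
	"strings"
)

// upstreamFor picks the gateway listener for a request path:
// Connect-Web calls go to the authenticated listener, everything
// else stays on the public REST listener.
func upstreamFor(path string) string {
	if strings.HasPrefix(path, "/galaxy.gateway.v1.EdgeGateway/") {
		return "galaxy-api:9090" // host name assumed
	}
	return "galaxy-api:8080"
}

func main() {
	for _, p := range []string{
		"/galaxy.gateway.v1.EdgeGateway/ExecuteCommand",
		"/galaxy.gateway.v1.EdgeGateway/SubscribeEvents",
		"/healthz",
	} {
		fmt.Printf("%s -> %s\n", p, upstreamFor(p))
	}
}
```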
Ilia Denisov 81917acc3e dev-deploy: enable Dev Sandbox bootstrap and synthetic-report loader
Tests · UI / test (push) Has been cancelled
Tests · Integration / integration (pull_request) Successful in 1m47s
Tests · Go / test (pull_request) Successful in 2m4s
Tests · UI / test (pull_request) Successful in 2m23s
Two long-standing dev-environment conveniences had not survived the
move from the bespoke local-dev stack to the CI-driven dev-deploy:

1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the
   dev-deploy compose, so the auto-provisioned "Dev Sandbox" game
   never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan`
   as the default — matches `.env.example` and lets a developer who
   logs in with that email find a ready-to-play game in the lobby.

2. The lobby's synthetic-report loader was gated on
   `import.meta.env.DEV`, which is true only for `vite dev` (the
   tools/local-dev path). The long-lived dev environment builds
   with `vite build` (production mode), so the section was always
   stripped from its bundle. Gate it on an explicit
   `VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in
   `.env.development` (preserves `pnpm dev` behaviour) and in the
   `dev-deploy.yaml` build step. The `prod-build.yaml` build path
   leaves the flag unset, so production stays clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 21:46:24 +02:00
Ilia Denisov 8bc75fd71b dev-deploy: default BACKEND_AUTH_DEV_FIXED_CODE to 123456
The long-lived dev environment now opts into the bcrypt-bypass on a
fresh `up`/`rebuild` so a returning developer can sign in with `123456`
even after the matching browser session was cleared (the real emailed
code is single-use). Set the variable to an empty string in `.env` to
force real Mailpit codes (mail-flow QA).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 12:41:32 +02:00
Ilia Denisov 1855e43699 gateway: add CORS allow-list for the public REST surface
Tests · Go / test (push) Successful in 1m42s
Tests · Go / test (pull_request) Successful in 1m45s
Tests · Integration / integration (pull_request) Successful in 1m36s
Adds a `GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS` env-driven allow-list
on the public REST server so the dev UI on https://www.galaxy.lan can
call https://api.galaxy.lan without the browser blocking the
cross-origin response. Defaults to empty (no CORS) so the production
posture stays closed.

The middleware mounts before route classification and anti-abuse, so
OPTIONS preflights never charge against per-class rate-limit buckets.

`tools/dev-deploy/docker-compose.yml` opts the dev gateway into a
single allowed origin (`https://www.galaxy.lan`); local-dev keeps the
defaults because Vite proxies through the same origin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:58:14 +02:00
Ilia Denisov bb74e3336e dev-deploy: restore GeoIP bind-mount, drop image bake
Tests · Integration / integration (pull_request) Successful in 2m14s
Tests · Go / test (pull_request) Successful in 2m19s
Tests · UI / test (pull_request) Failing after 51m17s
With the runner in host-mode, compose bind-mount paths resolve to
real host paths the Docker daemon can see, so the GeoIP file no
longer needs to be baked into the backend image to survive CI. Bring
back the bind-mount of `pkg/geoip/test-data/.../mmdb`, matching how
local-dev sources it. The image now carries only the backend binary,
symmetric with the production `backend/Dockerfile`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:04:11 +02:00
Ilia Denisov 0da360a644 dev-deploy: fix backend startup in CI
Two bugs surfaced on the first real merge into development:

1. `${{ env.HOME }}` evaluates to an empty string at the workflow stage,
   so GALAXY_DEV_GAME_STATE_DIR became `/.galaxy-dev/game-state`.
   Resolve in the shell instead of YAML.

2. The compose bind-mount of GeoIP2-Country-Test.mmdb referenced a
   path inside the runner's workspace volume, which the host Docker
   daemon cannot see — it created an empty directory and the backend
   crashed with "geoip database: is a directory" in a restart loop.
   Bake the file into the backend image so dev-deploy no longer needs
   a bind-mount; local-dev compose still mounts it on top for swap-in
   during development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:22:16 +02:00
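Note on 0da360a644, item 1: the bad path falls out of plain string concatenation with an empty home directory. The actual fix is in the workflow shell step; the Go sketch below only mirrors the default-expansion logic, and the function name `gameStateDir` and the injected `getenv` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// gameStateDir returns the explicit override when set, otherwise a
// default under the home directory. getenv is injected so the lookup
// can be exercised without touching the real environment.
func gameStateDir(getenv func(string) string) string {
	if dir := getenv("GALAXY_DEV_GAME_STATE_DIR"); dir != "" {
		return dir
	}
	// With an empty HOME this yields "/.galaxy-dev/game-state",
	// which is the broken path described in the commit.
	return getenv("HOME") + "/.galaxy-dev/game-state"
}

func main() {
	fmt.Println(gameStateDir(os.Getenv))
}
```

Resolving the home directory where it is actually defined (the shell on the runner) avoids the empty expansion.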
Ilia Denisov 00c79064fc tools/dev-deploy: long-lived dev environment behind host Caddy
A docker-compose stack that hosts postgres, redis, mailpit, backend,
gateway, and an app-routing Caddy. Reachable through the host Caddy at
https://www.galaxy.lan (static SPA) and https://api.galaxy.lan (REST +
gRPC). Coexists with tools/local-dev/ and tools/local-ci/ by giving
every name (compose project, container, network, volume) a distinct
galaxy-dev-* prefix.

State is persisted in named volumes; game-state lives under
${GALAXY_DEV_GAME_STATE_DIR:-$HOME/.galaxy-dev/game-state} so the
default works for a non-root runner without sudo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 23:26:35 +02:00