MP_WEBROOT=/mailpit prefixes every Mailpit HTTP route, including the
/livez health endpoint. The container healthcheck still probed
http://localhost:8025/livez, which now 404s, so Mailpit reported
unhealthy; the backend depends_on it with condition: service_healthy
and never started, cascading to the gateway and Caddy and failing
`docker compose up --wait`. Point the healthcheck at /mailpit/livez.
Stand up a production-mirror monitoring stack in the long-lived dev
contour, all on galaxy-dev-internal with no host ports (reached only via
the in-repo galaxy-dev-caddy):
- Prometheus scrapes backend:9100, gateway:9191, node-exporter and
cadvisor (30s interval, 15d retention); Loki (7d) + promtail (Docker
service discovery by the galaxy.stack=dev-deploy label) for logs;
Tempo (3d) for traces.
- Backend and gateway now export OTLP traces to Tempo over plaintext
gRPC on the internal network (OTEL_EXPORTER_OTLP_INSECURE).
- Grafana provisioned as code (Prometheus/Loki/Tempo datasources plus a
starter dashboard), served under /grafana/ via Caddy sub-path mode;
admin password from the GALAXY_DEV_GRAFANA_ADMIN_PASSWORD secret.
- Expose the Mailpit capture UI under /mailpit/ (Caddy basic-auth +
MP_WEBROOT) so every captured message is readable regardless of relay.
- dev-deploy.yaml seeds the monitoring config to a stable, reboot-
surviving host path and injects the Grafana admin secret.
Per-service memory limits keep the footprint within budget. All
collector config lives under tools/dev-deploy/monitoring/ for dev/prod
parity.
Keep Mailpit as the backend's SMTP submission point and turn on its
relay so OTP/notification mail addressed to the owner reaches a real
Gmail inbox, while everything else stays captured-only.
- mailpit gains --smtp-relay-config + --smtp-relay-matching (default
non-routable, so an unconfigured stack only captures); relay.conf is
mounted from a new galaxy-dev-mailpit-config volume
- tools/dev-deploy/mailpit/relay.conf.tmpl + a dev-deploy.yaml step that
renders it from Gitea secrets (Gmail App Password, never committed)
and seeds the volume; the GALAXY_DEV_MAIL_RELAY_MATCH var drives the
relay-matching recipient
- backend SMTP config unchanged (still -> galaxy-mailpit:1025)
- dev-deploy README documents the relay + required secrets/vars
Verified locally: compose config valid; the rendered relay.conf is
accepted by mailpit v1.21.8 (relay + recipient-matching enabled).
Real Gmail delivery is verified at the dev-deploy preview once the
owner sets the secrets.
Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev
Sandbox" game and dummy users are removed so the dev contour starts
empty like prod; the separate legacy-report loader stays as the
test-data path.
- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
.env.example; the generic engine-recycle / prune-broken-engines logic
stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
README + Makefile) and stale comments; DeleteGame and
InsertMembershipDirect remain (exercised by lobby integration tests)
No app behaviour change beyond not auto-creating the sandbox game.
Serve the whole stack behind one host: site at /, game UI at /game/,
gateway REST at /api + /healthz, Connect at /rpc (prefix stripped by the
edge Caddy). The built artifact is domain-agnostic — the UI talks to the
gateway same-origin via relative URLs, so the same bundle runs under any
host with no rebuild and with CORS disabled.
- Rename the Connect proto service galaxy.gateway.v1.EdgeGateway ->
edge.v1.Gateway; regenerate Go + TS; public path /rpc/edge.v1.Gateway.
- Move the game UI under base path /game (env BASE_PATH); make the
manifest, service-worker scope, WASM loader, and all navigation
base-aware via a withBase helper.
- Relative API + /rpc Connect prefix; Vite dev proxy mirrors the strip.
- Rewrite the edge Caddy (dev + prod) for path-based routing; empty CORS
allow-lists (same-origin); single host.
- New VitePress project site (site/): i18n en/ru with switcher, LaTeX
math, minimal monospace theme; built and served at /.
- dev-deploy compose/Makefile + CI (dev-deploy, prod-build, new
site-build) build and seed the site; probes hit /, /game/, /healthz.
- Sync docs (ARCHITECTURE, gateway README/openapi, dev-deploy &
local-dev READMEs, CLAUDE.md, ui/PLAN).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:
- `tools/dev-deploy/docker-compose.yml`: mount
`galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
(placed right after the existing UI-volume seed) copies the
fixture from `pkg/geoip/test-data/test-data/` into the named
volume via an ephemeral alpine container, the same pattern UI
seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
the same copy from the persistent checkout. `up` and `rebuild`
now depend on it, so a hand-run `make -C tools/dev-deploy up`
populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
failure is downgraded to a "fixed" postmortem; the symptom,
cause, and where the fix lives are kept for future reference.
Verification on the dev host (this branch checked out):
$ make -C tools/dev-deploy up # populates the volume, brings stack healthy
$ docker restart galaxy-dev-backend # used to error "not a directory"
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
$ echo "ok" # backend up 6s, healthy
The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit stamped `galaxy.stack=<value>` on services,
volumes, and networks. Putting it on volumes/networks changes their
compose config-hash on every label revision, so `docker compose up`
tries to recreate them — which on the long-lived dev environment
either destroys the postgres data volume or deadlocks while trying
to remove `galaxy-dev-internal` with containers still bound to it.
Observed live: run #184 hung in compose recreate after the three
stateful services were stopped, with no recovery.
Containers alone are sufficient for the cleanup contract (we filter
containers, not volumes or networks). Roll back the label on volumes
and networks in both compose files and capture the rule in
docs/ARCHITECTURE.md so the next contributor does not reintroduce it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five connected cleanups across the dev/CI infrastructure:
1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
the legacy "offline workflow validator"; the per-stage CI gate now
runs on gitea.lan and the directory was only retained as a
fallback. Removing it leaves no operational dependency: backend,
gateway, and game code have no references; documentation that
pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
in this same change. Historical "Verified on local-ci run N"
markers in ui/PLAN.md are preserved unchanged.
2. Lift the pre-production single-migration rule. The rule forced
every schema delta into 00001_init.sql and required a manual
make clean-data wipe on every backward-incompatible change in
tools/dev-deploy/. Future schema deltas now land as additive
sequence-numbered files (00002_*.sql, …) that goose applies
automatically on backend startup; 00001_init.sql becomes an
immutable baseline. Authoring conventions live in
backend/internal/postgres/migrations/README.md. The chain may be
squashed back into a fresh 00001 as a deliberate one-time
operation before the first production deployment.
3. Document the deployment cadence. The dev environment is
single-tenant: pushes to feature/* run the test workflows
(go-unit, ui-test, integration) only; dev-deploy.yaml fires on
push to development. A workflow_dispatch override on
dev-deploy.yaml lets a developer preview a feature branch on the
shared dev environment before merge; the next merge into
development overwrites the manual deploy idempotently.
4. Scope compose-managed resources by an explicit
galaxy.stack=<local-dev|dev-deploy> label. Both compose files
stamp the label on every service, network, and named volume.
Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
engine-cleanup operations by (stack-label AND engine OCI title)
so they never touch unrelated workloads on the same daemon.
dev-deploy.yaml gains a pre-`compose up` step that reaps stale
exited/dead containers under the dev-deploy stack label.
5. Backend now stamps the same galaxy.stack=<value> label on every
engine container it spawns, sourced from a new BACKEND_STACK_LABEL
env var (empty → label not applied; legacy-safe). Both compose
files set it to their stack name (local-dev / dev-deploy). The
contract is recorded in docs/ARCHITECTURE.md under
"Container labels". A package-level test in
backend/internal/runtime exercises both the label-present and
label-absent paths.
No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The public REST listener already exposes
`GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS`; the authenticated
Connect-Web listener on the separate gRPC port had no equivalent.
That worked in `tools/local-dev` (Vite proxy makes everything
same-origin) and would work in production once UI and gateway share
a single hostname, but the long-lived dev environment serves the
UI from `https://www.galaxy.lan` and the gateway from
`https://api.galaxy.lan` — every `/galaxy.gateway.v1.EdgeGateway/*`
fetch failed in the browser with the WebKit "Load failed" generic
message because the response carried no `Access-Control-Allow-Origin`
header. Lobby rendered as "[unknown] Load failed" with no game.
Mirror the public-REST CORS surface for the authenticated handler:
- new env `GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS`;
- new `AuthenticatedGRPCConfig.CORSAllowedOrigins` field;
- new `grpcapi.withCORS` middleware wrapping the Connect mux;
- dev-deploy stack sets the env to `https://www.galaxy.lan`.
The middleware speaks plain net/http (the Connect handler is mounted
on a ServeMux, not gin), handles preflight 204 immediately, and
exposes the Connect-Web header set the browser needs to read the
response (`Grpc-Status`, `Grpc-Message`, `Connect-Protocol-Version`).
Empty allow-list disables the middleware — production stays at
"single hostname" by default.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two long-standing dev-environment ergonomics had not survived the
move from the bespoke local-dev stack to the CI-driven dev-deploy:
1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the
dev-deploy compose, so the auto-provisioned "Dev Sandbox" game
never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan`
as the default — matches `.env.example` and lets a developer who
logs in with that email find a ready-to-play game in the lobby.
2. The lobby's synthetic-report loader was gated on
`import.meta.env.DEV`, which is true only for `vite dev` (the
tools/local-dev path). The long-lived dev environment builds
with `vite build` (production mode), so the section was always
stripped from its bundle. Gate it on an explicit
`VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in
`.env.development` (preserves `pnpm dev` behaviour) and in the
`dev-deploy.yaml` build step. The `prod-build.yaml` build path
leaves the flag unset, so production stays clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The long-lived dev environment now opts into the bcrypt-bypass on a
fresh `up`/`rebuild` so a returning developer can sign in with `123456`
even after the matching browser session was cleared (the real emailed
code is single-use). Set the variable to an empty string in `.env` to
force real Mailpit codes (mail-flow QA).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a `GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS` env-driven allow-list
on the public REST server so the dev UI on https://www.galaxy.lan can
call https://api.galaxy.lan without the browser blocking the
cross-origin response. Defaults to empty (no CORS) so the production
posture stays closed.
The middleware mounts before route classification and anti-abuse, so
OPTIONS preflights never charge against per-class rate-limit buckets.
`tools/dev-deploy/docker-compose.yml` opts the dev gateway into a
single allowed origin (`https://www.galaxy.lan`); local-dev keeps the
defaults because Vite proxies through the same origin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the runner in host-mode, compose bind-mount paths resolve to
real host paths the Docker daemon can see, so the GeoIP file no
longer needs to be baked into the backend image to survive CI. Bring
back the bind-mount of `pkg/geoip/test-data/.../mmdb`, matching how
local-dev sources it. Image now only carries the backend binary,
symmetric with the production `backend/Dockerfile`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced on the first real merge into development:
1. `${{ env.HOME }}` evaluates to empty string at the workflow stage,
so GALAXY_DEV_GAME_STATE_DIR became `/.galaxy-dev/game-state`.
Resolve in the shell instead of YAML.
2. The compose bind-mount of GeoIP2-Country-Test.mmdb referenced a
path inside the runner's workspace volume, which the host Docker
daemon cannot see — it created an empty directory and the backend
crashed with "geoip database: is a directory" in a restart loop.
Bake the file into the backend image so dev-deploy no longer needs
a bind-mount; local-dev compose still mounts it on top for swap-in
during development.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A docker-compose stack that hosts postgres, redis, mailpit, backend,
gateway, and an app-routing Caddy. Reachable through the host Caddy at
https://www.galaxy.lan (static SPA) and https://api.galaxy.lan (REST +
gRPC). Coexists with tools/local-dev/ and tools/local-ci/ by giving
every name (compose project, container, network, volume) a distinct
galaxy-dev-* prefix.
State is persisted in named volumes; game-state lives under
${GALAXY_DEV_GAME_STATE_DIR:-$HOME/.galaxy-dev/game-state} so the
default works for a non-root runner without sudo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>