After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.
Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A live `docker inspect` of an engine container and `docker events`
captures from two redispatch runs confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The repro window therefore needs a separate trigger that removes
the engine container outside of compose. The new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never
shows up in `docker ps -a --filter label=galaxy-game-engine`, so
either it never spawned or it was removed before any host-side
snapshot.
`KNOWN-ISSUES.md` records the symptom, the log excerpt, three
working hypotheses (runtime spawn race, `--remove-orphans`
interaction, engine `--rm` lifecycle), and the investigation
checklist before opening an issue. The README gets a one-line
pointer so future redeploys land on the doc immediately.
No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to
rediscover the diagnostic from scratch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The public REST listener already exposes
`GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS`; the authenticated
Connect-Web listener on the separate gRPC port had no equivalent.
That worked in `tools/local-dev` (Vite proxy makes everything
same-origin) and would work in production once UI and gateway share
a single hostname, but the long-lived dev environment serves the
UI from `https://www.galaxy.lan` and the gateway from
`https://api.galaxy.lan` — every `/galaxy.gateway.v1.EdgeGateway/*`
fetch failed in the browser with WebKit's generic "Load failed"
message, because the response carried no `Access-Control-Allow-Origin`
header. The lobby rendered as "[unknown] Load failed" with no game.
Mirror the public-REST CORS surface for the authenticated handler:
- new env `GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS`;
- new `AuthenticatedGRPCConfig.CORSAllowedOrigins` field;
- new `grpcapi.withCORS` middleware wrapping the Connect mux;
- dev-deploy stack sets the env to `https://www.galaxy.lan`.
The middleware speaks plain net/http (the Connect handler is mounted
on a ServeMux, not gin), handles preflight 204 immediately, and
exposes the Connect-Web header set the browser needs to read the
response (`Grpc-Status`, `Grpc-Message`, `Connect-Protocol-Version`).
Empty allow-list disables the middleware — production stays at
"single hostname" by default.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`api.galaxy.lan` was proxying every path to `galaxy-api:8080` (the
public REST listener), so authenticated Connect-Web calls
(`/galaxy.gateway.v1.EdgeGateway/ExecuteCommand`,
`/galaxy.gateway.v1.EdgeGateway/SubscribeEvents`) collapsed to a 404
from the public route table — the lobby loaded the static bundle
but every authenticated query failed silently.
Split routing by path: `/galaxy.gateway.v1.EdgeGateway/*` goes to
the authenticated listener on `:9090`, everything else stays on
`:8080`. Mirrors the Vite dev-server proxy in
`ui/frontend/vite.config.ts`.
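In Caddyfile terms the split looks roughly like this (the matcher path, upstream name, and ports are from this change; the site-block layout is an assumption):

```caddyfile
api.galaxy.lan {
	@connect path /galaxy.gateway.v1.EdgeGateway/*
	handle @connect {
		reverse_proxy galaxy-api:9090
	}
	handle {
		reverse_proxy galaxy-api:8080
	}
}
```

`handle` blocks are mutually exclusive, so the authenticated Connect paths peel off to `:9090` and everything else keeps the old `:8080` behaviour.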
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two long-standing dev-environment conveniences did not survive the
move from the bespoke local-dev stack to the CI-driven dev-deploy:
1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the
dev-deploy compose, so the auto-provisioned "Dev Sandbox" game
never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan`
as the default — matches `.env.example` and lets a developer who
logs in with that email find a ready-to-play game in the lobby.
2. The lobby's synthetic-report loader was gated on
`import.meta.env.DEV`, which is true only for `vite dev` (the
tools/local-dev path). The long-lived dev environment builds
with `vite build` (production mode), so the section was always
stripped from its bundle. Gate it on an explicit
`VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in
`.env.development` (preserves `pnpm dev` behaviour) and in the
`dev-deploy.yaml` build step. The `prod-build.yaml` build path
leaves the flag unset, so production stays clean.
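Fix 1 amounts to a compose default along these lines (service name illustrative); fix 2 is the one-line `VITE_GALAXY_DEV_AFFORDANCES=true` in `.env.development` and the dev-deploy build step:

```yaml
services:
  galaxy-dev-backend:
    environment:
      # ":-" keeps a host-level override possible while baking the default.
      BACKEND_DEV_SANDBOX_EMAIL: ${BACKEND_DEV_SANDBOX_EMAIL:-dev@galaxy.lan}
```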
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The long-lived dev environment now opts into the bcrypt-bypass on a
fresh `up`/`rebuild` so a returning developer can sign in with `123456`
even after the matching browser session was cleared (the real emailed
code is single-use). Set the variable to an empty string in `.env` to
force real Mailpit codes (mail-flow QA).
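Shape of the toggle in `.env` — the commit does not name the variable, so the name below is a placeholder, and whether the value carries the code itself is an assumption:

```shell
# Placeholder name; non-empty enables the bypass code shown above.
BACKEND_DEV_LOGIN_BYPASS_CODE=123456
# Empty forces the real single-use Mailpit codes:
# BACKEND_DEV_LOGIN_BYPASS_CODE=
```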
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a `GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS` env-driven allow-list
on the public REST server so the dev UI on https://www.galaxy.lan can
call https://api.galaxy.lan without the browser blocking the
cross-origin response. Defaults to empty (no CORS) so the production
posture stays closed.
The middleware mounts before route classification and anti-abuse, so
OPTIONS preflights never charge against per-class rate-limit buckets.
`tools/dev-deploy/docker-compose.yml` opts the dev gateway into a
single allowed origin (`https://www.galaxy.lan`); local-dev keeps the
defaults because Vite proxies through the same origin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the runner in host-mode, compose bind-mount paths resolve to
real host paths the Docker daemon can see, so the GeoIP file no
longer needs to be baked into the backend image to survive CI. Bring
back the bind-mount of `pkg/geoip/test-data/.../mmdb`, matching how
local-dev sources it. Image now only carries the backend binary,
symmetric with the production `backend/Dockerfile`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced on the first real merge into development:
1. `${{ env.HOME }}` evaluates to an empty string at the workflow
stage, so `GALAXY_DEV_GAME_STATE_DIR` became
`/.galaxy-dev/game-state`. Resolve `$HOME` in the shell instead of
in YAML.
2. The compose bind-mount of `GeoIP2-Country-Test.mmdb` referenced a
path inside the runner's workspace volume, which the host Docker
daemon cannot see — it created an empty directory and the backend
crashed with "geoip database: is a directory" in a restart loop.
Bake the file into the backend image so dev-deploy no longer needs
a bind-mount; local-dev compose still mounts it on top for swap-in
during development.
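The shell-side resolution for bug 1 looks roughly like this (step name illustrative; `$GITHUB_ENV` is the standard Actions mechanism for exporting to later steps):

```yaml
- name: Resolve game-state dir in the shell, where $HOME is populated
  run: echo "GALAXY_DEV_GAME_STATE_DIR=$HOME/.galaxy-dev/game-state" >> "$GITHUB_ENV"
```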
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A docker-compose stack that hosts postgres, redis, mailpit, backend,
gateway, and an app-routing Caddy. Reachable through the host Caddy at
https://www.galaxy.lan (static SPA) and https://api.galaxy.lan (REST +
gRPC). Coexists with tools/local-dev/ and tools/local-ci/ by giving
every name (compose project, container, network, volume) a distinct
galaxy-dev-* prefix.
State is persisted in named volumes; game-state lives under
${GALAXY_DEV_GAME_STATE_DIR:-$HOME/.galaxy-dev/game-state} so the
default works for a non-root runner without sudo.
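The game-state mount then reads as below — the expansion with its default is from this change, while the service name (beyond the galaxy-dev-* prefix) and container path are illustrative:

```yaml
services:
  galaxy-dev-backend:
    volumes:
      - ${GALAXY_DEV_GAME_STATE_DIR:-$HOME/.galaxy-dev/game-state}:/var/lib/galaxy/game-state
```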
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>