Tokenize every remaining component <style> — calculator, order tab,
inspectors, tables, report sections, lobby, auth, mail, battle viewer,
toasts, map overlays. A scripted pass handled the unambiguous core
palette (text/bg/surface/border/accent/danger/muted); the rest were
mapped to the semantic/grey tokens by role.
Remaining colour literals are the documented exceptions only: the
battle-scene SVG data-visualisation palette (fixed dark, like the WebGL
map canvas), overlay scrims (modal / map-canvas), and directional or
deliberate drop shadows. The default theme stays dark until the light
theme's coherence is signed off across the views.
Updates ui/docs/design-system.md (migration status + exceptions).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce the shared design-token system under
ui/frontend/src/lib/theme/: tokens.css (dark default + light palette,
plus spacing/radii/typography scales), base.css global baseline
(document background, text, token focus ring, selection), and
theme.svelte.ts (system/light/dark choice, persisted to localStorage,
applied via data-theme on <html>). A pre-paint guard in app.html
resolves the theme before the app boots to avoid a flash, and the theme
picker is wired into the previously-disabled account-menu stub.
Migrate the always-visible in-game chrome to the tokens (header, account
menu, sidebar, tab-bar, bottom-tabs, shell background): dark renders as
before, light comes for free. The default stays dark during the
incremental migration; the remaining view bodies migrate in F1b.
Docs: ui/docs/design-system.md (+ index entry). Test: tests/theme.test.ts.
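A minimal sketch of the choice-to-theme resolution described above, written as pure functions. The names `parseThemeChoice` and `resolveTheme` are hypothetical and the real theme.svelte.ts may be shaped differently; the fallback to "dark" mirrors the stated default and is otherwise an assumption.

```typescript
// Hypothetical names; only the system/light/dark choice and the dark default
// come from the commit message.
type ThemeChoice = "system" | "light" | "dark";
type ResolvedTheme = "light" | "dark";

// Interpret a persisted value (e.g. read from localStorage); anything
// unrecognised falls back to the dark default.
function parseThemeChoice(stored: string | null): ThemeChoice {
  return stored === "system" || stored === "light" || stored === "dark"
    ? stored
    : "dark";
}

// Turn the user's choice into the concrete value for data-theme on <html>.
function resolveTheme(choice: ThemeChoice, systemPrefersDark: boolean): ResolvedTheme {
  if (choice === "system") {
    return systemPrefersDark ? "dark" : "light";
  }
  return choice;
}
```

Keeping the resolution pure means the pre-paint guard and the runtime store could share it, with only the localStorage read and the media-query check living at the edges.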
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MVP web client (Phases 1-30) is complete; reorganize planning + living docs around that.
- PLAN.md kept as the staged MVP record (1-30) with a status block + pointers; removed the 31-36 stages, regression scenarios, and deferred-TODO section (moved out); fixed a stale cross-machine plan path.
- ui/PLAN-finalize.md (new): active web-finalization plan in 8 stages (visual system, a11y, i18n, error UX, PWA, build hygiene, docs, owner manual-QA loop); absorbs former Phases 33 and 35.
- ui/ROADMAP.md (new): post-MVP (Wails, Capacitor, realistic projection, acceptance + regression scenarios) and triaged deferred follow-ups.
- ui/docs/README.md (new): grouped topic-doc index.
- De-archaeologized all 20 ui/docs topic docs + ui/README.md + ui/core/README.md: stripped Phase-N build history, rewritten as current-state; deferred work now points at ROADMAP.md / PLAN-finalize.md. Docs-only; no code change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- custom load capped at cargo capacity (error when exceeded); full load shows the cargo capacity; zero cargo pins load to empty and disables the toggle
- per-input red border + tooltip for every invalid value (blocks, techs, load, MAT, modernization target); no value may be negative; locking a speed is disabled when drive is zero
- display every computed number (results + goal-seek back-solved input) rounded up to 3 decimals via a shared pkg/calc Ceil3 bridged to wasm; engine keeps its own round-to-nearest util.Fixed*
- modernization total upgrade cost spans two columns (single line)
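For illustration only, a TypeScript sketch of "round up to 3 decimals". The real Ceil3 lives in Go under pkg/calc and is bridged to wasm; `ceil3Sketch` is a hypothetical stand-in and may differ from it in edge cases.

```typescript
// Round a value up (towards +Infinity) to three decimal places.
// Caveat: inputs that carry binary floating-point error just above a
// 3-decimal boundary (e.g. 0.1 + 0.2) round up to the next step.
function ceil3Sketch(value: number): number {
  return Math.ceil(value * 1000) / 1000;
}

const displayedMass = ceil3Sketch(1.2341); // 1.235
const displayedWhole = ceil3Sketch(2);     // 2
```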
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the calculator's inputs into a page-level calculatorState singleton so they survive the sidebar unmounting the tab on a tab switch (the inspector auto-opens on a planet click). ensureGame resets the design when the active game changes.
While on the calculator, a planet click no longer switches to the inspector — the calculator consumes the selection in its planet area / reach circles. Halve the reach-circle stroke width.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fuse the standalone ship-class designer (Phases 17/18) into a sidebar calculator: live mass/speed/attack/defence/bombing results, a planet build-rate readout, single-target goal-seek, a modernization-cost mode, and auto reach circles on the map for the selected planet.
pkg/calc becomes the single source for the new math (no mirroring): extract BombingPower from the engine model and the per-turn ship-production loop from controller.ProduceShip into pkg/calc (engine now delegates), and add inverse goal-seek solvers in pkg/calc/solve.go. Thin-bridge the combat, planet-build, and solver functions through ui/core/calc + ui/wasm and rebuild core.wasm.
Remove the standalone designer view/route; the ship-classes table and the view/bottom menus open the calculator via a shared request store.
Docs: rewrite ui/PLAN.md Phase 30, adjust Phase 34 (realistic forecast + CAP/COL ownership), add ui/docs/calculator-ux.md, extend calc-bridge.md, fix navigation.md; remove ui/CALCULATOR.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`pnpm/action-setup@v4` defaults to installing pnpm in the shared
`~/setup-pnpm`. On the single host-mode runner $HOME is shared across
concurrent jobs, so when two pnpm jobs overlap (e.g. a post-merge
`dev-deploy` and `ui-test`, which sit in different concurrency groups)
their self-installers race and one fails with
`ENOTEMPTY ... rmdir '~/setup-pnpm/node_modules/.bin/store/v11/files'`
before the tests even run.
Point each step's `dest` at `${{ runner.temp }}/setup-pnpm` (a per-job
isolated directory) so concurrent jobs never share the install location.
The action still adds `dest` to PATH, so setup-node's pnpm cache and
later `pnpm` calls are unaffected; the pnpm package store stays shared
(safe — pnpm locks it). Applied to the three workflows that set up pnpm:
ui-test, dev-deploy, prod-build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stage 1 (render-on-demand) removed the idle / whole-system freeze, but
panning a loaded map with "visible hyperspace" on stayed heavy in Safari:
the fog still cut its visibility holes by opaque overpaint — on KNNTS041
that is ~260 near-world-sized opaque circles blended over the fog every
rendered frame, a fill-rate cliff for Safari's WebGPU / Apple's tile-based
GPU.
Replace the overpaint with an INVERSE stencil mask: setVisibilityFog now
draws the FOG_COLOR rectangle(s) into fogLayer and collects the visibility
circles into one Graphics set as fogLayer.setMask({ mask, inverse: true }),
so the fog shows everywhere except the union of the circles. Per-frame cost
drops from hundreds of blended opaque circle fills to one rect fill + a
stencil pass (no colour writes), which Apple's TBDR GPU handles cheaply,
and the fog stays fully vector — crisp at any zoom.
fogPaintOps and its unit tests are unchanged (the circle ops now feed the
mask instead of an overpaint). Verified with a high-contrast screenshot
during development (fog field with a correct circle-union hole) plus the
existing fog / render-on-demand e2e green on chromium + webkit.
Docs: renderer.md fog section + PLAN.md Phase 29 decision 9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase 29 visibility fog ("visible hyperspace") froze the whole UI on
large reports in Safari while staying smooth in Firefox. Root cause: the
fog is a layered overpaint (torus mode = 9 world-sized rects + 9xN
near-world-sized opaque circles, ~270 fills for KNNTS041) and Pixi's
continuous auto-render loop re-rasterised all of it every frame, even
while idle. Safari's WebGPU backend cannot sustain that fill rate, so the
main thread/compositor starved and the entire UI froze.
Stage 1 (vector-preserving, no rasterisation):
- Stop Pixi's auto-render loop (app.stop()) and paint on demand via a
single Ticker.shared flush gated on viewport.dirty (camera) plus an
internal requestRender() from every content mutation (fog / hide-set /
extras / wrap mode / resize / pick overlay). An idle map now does zero
GPU work per frame; plain hover paints nothing.
- Remove the decelerate (drag-inertia) plugin: a released drag stops
instantly (owner request) and the viewport goes idle immediately.
- Expose RendererHandle.getRenderCount() / getMapRenderCount for
deterministic e2e assertions.
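The dirty-gated flush can be sketched without Pixi. `RenderSchedulerSketch` is a hypothetical class; in the renderer the flush is driven by Ticker.shared and the camera's viewport.dirty flag, which this sketch folds into a single boolean.

```typescript
// Hypothetical stand-in for the render-on-demand gate: mutations mark the
// scene dirty, and the per-tick flush paints at most once per dirty period.
class RenderSchedulerSketch {
  private dirty = false;
  private renderCount = 0;
  private readonly paint: () => void;

  constructor(paint: () => void) {
    this.paint = paint;
  }

  // Called from every content mutation (fog, hide-set, resize, ...).
  requestRender(): void {
    this.dirty = true;
  }

  // Called once per ticker tick; an idle scene does no paint work.
  flush(): void {
    if (!this.dirty) return;
    this.dirty = false;
    this.paint();
    this.renderCount += 1;
  }

  // Counter in the spirit of getRenderCount(), for deterministic assertions.
  getRenderCount(): number {
    return this.renderCount;
  }
}
```

Several mutations between two ticks coalesce into one paint, which is what lets an e2e test assert that an idle map's render count stays flat.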
Tests: new map-toggles e2e specs (idle map does not repaint; released
drag does not coast) green on all four Playwright projects incl. WebKit.
Docs: renderer.md (render-on-demand section; fog section corrected to the
current single-fogLayer model; FPS note) and PLAN.md Phase 29 decision 8.
If Safari pan is still heavy after this, stage 2 will cut the overpaint
itself with an inverse stencil mask of the circle union (kept vector).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two regressions surfaced once visible-hyperspace was toggled on against a
real dev-deploy map:
1. On the zero-turn map the bg holes painted ON TOP of the planet
glyphs — every LOCAL planet looked like a hollow circle of
background colour instead of the planet pixel inside an
unfogged area.
2. On a legacy report with a drive tech that pushes the visibility
radius well past the world dimensions the bg circles overlapped
to cover the entire viewport. Combined with the wrong z-order
the result was a uniformly black canvas with every primitive
hidden.
The per-copy implementation added the fog container via
`copy.addChildAt(container, 0)` and trusted Pixi v8 to insert the
container at the start of the copy's children. Whether by a Pixi
quirk or by some interaction with how `populatePrimitives` orders
its `c.addChild(g)` calls, the fog ended up rendering after every
primitive in practice — the symptoms above are a perfect match for
that ordering.
Restructured the fog rendering so the z-order is structural
rather than relying on `addChildAt`:
- A single `fogLayer: Container` is added to the viewport BEFORE
the nine torus copies. Pixi renders viewport children in order,
so the layer is guaranteed to paint first; every copy renders
on top.
- `fogPaintOps` now emits world-space coordinates with wrap
offsets baked in (9 fog rects + 9 bg circles per visibility
entry in torus mode, 1 + N in no-wrap mode). The renderer
populates `fogLayer` with one `Graphics` per op — no per-copy
iteration on the fog side.
- The previous `fogGraphics: Container[]` closure state is gone.
Each `setVisibilityFog` flip drops every child of `fogLayer`
and rebuilds it. The dispose path drops the children
eagerly before `app.destroy({children: true})` walks the tree.
The fog-paint-ops test exercises the new contract: the no-wrap
path keeps one rect + N circles, the torus path expands to nine
rects + nine wrapped circles per entry (including the seam-fix
case at x = 950).
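A sketch of the op expansion under the new contract, assuming a 1000 x 1000 world for the x = 950 case (the world size is not stated here). `sketchFogOps` and its types are hypothetical and are not the real `fogPaintOps` signature.

```typescript
type WrapModeSketch = "torus" | "no-wrap";

interface FogCircleSketch { x: number; y: number; radius: number }

type FogOpSketch =
  | { kind: "rect"; x: number; y: number; width: number; height: number }
  | { kind: "circle"; x: number; y: number; radius: number };

// Emit world-space ops with the wrap offsets baked in: fog rects first,
// then the visibility circles. Torus mode repeats everything at the nine
// (dx, dy) offsets; no-wrap mode emits each shape once.
function sketchFogOps(
  circles: FogCircleSketch[],
  worldW: number,
  worldH: number,
  mode: WrapModeSketch,
): FogOpSketch[] {
  if (circles.length === 0 || worldW <= 0 || worldH <= 0) return [];
  const steps = mode === "torus" ? [-1, 0, 1] : [0];
  const ops: FogOpSketch[] = [];
  for (const dy of steps) {
    for (const dx of steps) {
      ops.push({ kind: "rect", x: dx * worldW, y: dy * worldH, width: worldW, height: worldH });
    }
  }
  for (const c of circles) {
    for (const dy of steps) {
      for (const dx of steps) {
        ops.push({ kind: "circle", x: c.x + dx * worldW, y: c.y + dy * worldH, radius: c.radius });
      }
    }
  }
  return ops;
}
```

With a circle at x = 950 in the assumed world, the torus expansion includes a copy at x = -50, which is the copy that keeps the hole continuous across the left seam.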
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two visible regressions in the in-game map's fog overlay surfaced
on dev-deploy:
1. With three LOCAL planets close together, only the last planet
glyph stayed visible inside the bg holes — the other two were
obscured. The previous implementation stacked the fog rectangle
plus every bg circle onto a single `Graphics` via repeated
`g.rect(...).fill(...).circle(...).fill(...)...`. Pixi v8's
multi-shape Graphics is supported in theory, but in practice
only the last shape's fill seems to land, dropping the earlier
bg holes (and the planet glyphs on top look like they vanished
along with their hole). Splitting each op onto its own
`Graphics` inside a per-copy `Container` removes the ambiguity
— one shape, one fill, one render pass.
2. A planet near the right world edge produced a "sector" — the
bg circle painted into the area past the seam, but the
neighbouring tile's fog rectangle then overpainted that bleed,
leaving a quarter-circle hole. In torus mode each visibility
circle is now drawn at the nine wrapped positions
(`(dx, dy) ∈ {-1, 0, 1}²`); the wrapped copies in the
neighbour-tile-aligned positions keep the hole continuous
across the seam. No-wrap mode keeps a single emission per
circle, because wrapped circles would leak into the visible
world rectangle as unwanted holes.
The `fogPaintOps` helper now takes the wrap mode as a parameter;
`tests/fog-paint-ops.test.ts` covers the torus expansion
(nine-wrap product per circle, the seam-fix case at x = 950) and
re-asserts the no-wrap path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifted the Phase 29 fog draw sequence out of `setVisibilityFog`
into a pure `fogPaintOps` helper that returns an ordered list of
fill operations (one fog rect, then one background-coloured
circle per visibility entry). The renderer now dispatches each op
straight onto a Pixi `Graphics`; the indirection lets the layered-
overpaint contract be tested without booting Pixi.
`tests/fog-paint-ops.test.ts` covers: empty input → no ops; single
circle → fog rect + bg circle in that order; multiple circles → N
bg circles after the fog rect; overlapping circles emitted
independently (the rendering order unions them); zero / negative
world dimensions → no ops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase 29 fog overlay rendered as a handful of random arc
segments instead of a clean union of holes around LOCAL planets
— Pixi v8's `Graphics.cut()` does not reliably subtract multiple
overlapping circles from a base path.
Replaced the cut-based approach with a layered overpaint: a
fog-tinted rectangle fills the world, then opaque background-
coloured circles are painted on top for every visibility circle.
The natural rendering order unions overlapping circles for free —
no geometry, no `cut()` quirks, one extra fill per circle.
Renamed the toggle from `visibilityFog` to `visibleHyperspace`
across the store, i18n strings, popover, tests, and docs. The
overlay still implements the visual "fog" effect at the renderer
level (FOG_COLOR, setVisibilityFog, getMapFog); the toggle is
named after the player-facing concept it controls — the portion
of the map that is visible (intelligence/scan coverage) — rather
than the obscured part.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous logic re-mounted the renderer whenever
`store.wrapMode` flipped, because the `sameSnapshot` gate
included `handle.getMode() === mode`. Pixi 8 does not reliably
re-initialise an `Application` on the same canvas — the symptom
showed up as the chromium tab silently closing during the
Phase 29 wrap-mode e2e ("Target page, context or browser has
been closed").
The renderer already exposes an in-place `setMode` that swaps
the wrap-clamp / torus-copy visibility synchronously while
preserving the camera; the playground-map.spec.ts wrap toggle
has been driving it for several phases without issue. Drop
mode from the snapshot gate and route the change through
`handle.setMode(mode)` instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run #217 surfaced three independent bugs that survived the first
fixup pass:
1. `visibleHighBitCount` masked the id with `(prim.id >>> 0) & 0xf…`,
but JS bitwise AND always returns a signed int32, so the result had
to be re-converted with `>>> 0` AFTER the AND, not before. The count
was always 0 on the previous run, which hid the next two bugs by
making the persistence test's high-bit-count assertions a
tautology.
2. `applyVisibilityState` was wrapped in `untrack`, so the
`toggles.X` reads inside `computeHiddenIds` / `computeFogCircles`
never landed in the effect's dependency set — toggling fog or any
marker / group / kind flag did not re-run the effect, so the
renderer never received the new hide / fog input. Explicit
`void toggles.X` reads now live at the top of the effect so every
key is tracked synchronously.
3. The wrap-mode radios fired on `onchange`, which Svelte 5
suppresses on a re-activation of an already-checked input — the
Playwright `.click()` flake on the second wrap test reflected the
missed event. Switched to `onclick` and short-circuited when the
target mode is already active.
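The signedness trap from item 1 in isolation. The mask value below is a hypothetical stand-in (the real constant is elided above), and both function names are illustrative.

```typescript
// Hypothetical prefix mask; not the constant used in the spec.
const HIGH_NIBBLE_MASK_SKETCH = 0xf0000000;

// Converting BEFORE the AND does not help: `&` coerces both operands to
// int32 and returns a signed int32.
function highPrefixBroken(id: number): number {
  return (id >>> 0) & HIGH_NIBBLE_MASK_SKETCH;
}

// Converting AFTER the AND yields the unsigned value that can be compared
// against literals such as 0x80000000.
function highPrefixFixed(id: number): number {
  return (id & HIGH_NIBBLE_MASK_SKETCH) >>> 0;
}

const storedId = 0x80000001 | 0;                       // -2147483647 once stored as int32
const brokenPrefix = highPrefixBroken(storedId);       // -2147483648
const fixedPrefix = highPrefixFixed(storedId);         // 2147483648
const matchesPrefix = fixedPrefix === 0x80000000;      // true
```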
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent bugs in `tests/e2e/map-toggles.spec.ts` made the
fresh-Phase-29 suite red on CI #216:
1. `visiblePlanets` filtered on `p.id < 1_000_000`, which JS interprets
in signed space — high-bit-prefix primitives (cargo route 0x80…,
battle 0xa0…, bombing 0xc0…) are stored as negative Numbers and
leaked into the planet list. Filter switched to a `0 < id < 1e7`
window that matches the engine planet-number range exactly.
2. The `visibleHighBitCount` helper now ToUint32-converts the id
before masking so the bitmask comparison works regardless of
whether the id is stored as positive or negative.
3. The fog and wrap-mode tests read the renderer state synchronously
after the click — the Svelte effect re-runs asynchronously, so the
tests saw stale state. Both now `waitForFunction` on the canonical
"settled" signal: empty fog circles for the fog flip, and a new
`getMapMode()` debug accessor for the wrap-mode remount.
Renderer side: registers a `MapModeProvider` next to the existing
camera / fog providers and exposes `getMapMode()` through the debug
surface.
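Why the one-sided comparison in item 1 leaked high-bit ids, as a small sketch. The sample id is hypothetical; only the `0 < id < 1e7` window comes from the fix.

```typescript
// Old filter: a single upper bound, which negative numbers always satisfy.
function oldPlanetFilterSketch(id: number): boolean {
  return id < 1_000_000;
}

// New filter: the two-sided window quoted in the commit message.
function planetWindowFilterSketch(id: number): boolean {
  return id > 0 && id < 1e7;
}

// A hypothetical high-bit-prefix id, stored as a negative Number.
const highBitId = 0x80000005 | 0;                           // -2147483643
const leakedBefore = oldPlanetFilterSketch(highBitId);      // true
const rejectedNow = !planetWindowFilterSketch(highBitId);   // true
```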
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the gear-icon popover on the map view with per-game persistence
of every category toggle plus the wrap-mode radio. Hide-by-id and
visibility-fog facilities land on the renderer so every flip applies
within one frame without a Pixi remount; the wrap-mode toggle keeps
its existing remount + camera-preserve path. A new server-side turn
force-resets every flag to defaults so a hidden category never makes
the player miss the next turn's news.
Also fixes the FligthDistance → FlightDistance typo in pkg/calc/race.go
(plus the single Go caller); the TS side keeps duplicating the formula
until a race-level WASM bridge lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`RandomName` builds the suffix as two independent `rand.Intn(1000)`
calls, so the two 4-digit halves collide on ~0.1% of runs. The
sub-test asserted `g[2] != g[3]`, which flakes whenever the same
value lands twice — once per ~1000 sub-runs per class, so across
the seven `PlanetClass` rows the integration suite hit it on
`#199 go-unit.yaml` against `feature/subscribe-events-heartbeat`
(`"0074"` collision).
Distinctness is not a property `RandomName` promises and is not
load-bearing for callers: `game/internal/controller/generate_game.go`
uses these names for planet labels and already tolerates duplicate
names across planets, so collisions inside one name are no worse
than collisions between names. Drop the assert; keep the format and
class-prefix checks, which are the actual contract.
Stress-tested with `-count=200`: 200 consecutive iterations × 7
classes = 1400 sub-runs without a single failure, where the prior
version would have flaked about once on average (1400 × 0.1% ≈ 1.4).
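The arithmetic behind that estimate, as a sketch that assumes the two draws are independent and uniform over 1000 values. `flakeStatsSketch` is a hypothetical helper.

```typescript
// Expected number of failures and the chance of at least one failure when
// each sub-run independently fails with probability perRunCollision.
function flakeStatsSketch(perRunCollision: number, subRuns: number) {
  return {
    expectedFailures: subRuns * perRunCollision,
    pAtLeastOne: 1 - Math.pow(1 - perRunCollision, subRuns),
  };
}

// Two independent draws from 1000 values match with probability 1/1000.
// Over 200 iterations x 7 classes: about 1.4 expected failures, and roughly
// a 75% chance of seeing at least one with the old assert in place.
const oldAssertStats = flakeStatsSketch(1 / 1000, 200 * 7);
```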
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caddy's `file_server` did not set Cache-Control on the SvelteKit
build, so browsers fell back to heuristic caching keyed off
Last-Modified. On the long-lived dev environment the heuristic
window leaves the previous deploy's `index.html` cached for
minutes-to-hours, and Safari combined that with stale conditional
requests into a visible multi-second freeze on every reload (the
reproduction was "private window reloads instantly, normal window
hangs; clearing Safari caches restores normal speed"). Push
delivery itself works — heartbeat keeps the SubscribeEvents stream
alive — but the bundle path stalls behind the browser revalidating
a chain of stale chunks.
Mirror the standard SvelteKit cache split inside both Caddyfiles:
- `_app/immutable/*` — hash-named JS/CSS chunks Vite emits with
content-addressed file names — `Cache-Control:
public, max-age=31536000, immutable`. Safe to cache forever
because the name changes whenever the content does, so the next
deploy serves new files under new URLs.
- Everything else (`index.html` fallback via `try_files`,
`env.js`, `version.json`, `core.wasm`, `wasm_exec.js`,
`favicon.svg`) — `Cache-Control: no-cache, must-revalidate`.
The browser still uses the cached body when the ETag matches,
but it always asks first; a fresh deploy reaches the user on
the next reload without a manual cache clear.
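The split itself reduces to one path test. The sketch below restates the rule in TypeScript for clarity; it is not Caddyfile syntax, and `cacheControlForSketch` is a hypothetical name.

```typescript
// Content-addressed chunks are cacheable forever; everything else must be
// revalidated on each load so a fresh deploy shows up on the next reload.
function cacheControlForSketch(path: string): string {
  return path.startsWith("/_app/immutable/")
    ? "public, max-age=31536000, immutable"
    : "no-cache, must-revalidate";
}
```

The SPA fallback falls on the revalidate side because any unknown route is served the `index.html` body.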
Smoke-tested locally: a docker-run Caddy with this config returns
the immutable header only for `/_app/immutable/*` and the
no-cache header for `/index.html`, `/env.js`, and the SPA-fallback
path `/some/route`. Both `Caddyfile.dev` and `Caddyfile.prod` pass
`caddy validate`; the pre-existing formatting warning on line 7 is
untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Browser fetch-streaming layers close response bodies they consider
idle after roughly 15-30 s without incoming bytes. Safari is the
most aggressive, but the symptom matters everywhere: a quiet
SubscribeEvents stream (lobby, between turns, mailbox empty) gets
torn down by the browser, the EventStream singleton reconnects with
backoff, and any push event that fires inside the reconnect window
is lost because `push.Hub` queues are not persisted across
subscription closes. The user-visible failure mode is the
intermittent "Fetch API cannot load … due to access control checks"
console error (a misleading WebKit symptom — CORS headers are
actually present) plus missed turn-ready / mail-received toasts.
Server-side fix: a silence-based heartbeat at the
`authenticatedPushStreamService` wrapper layer. After the signed
`gateway.server_time` bootstrap event, gateway wraps the bound
stream with `heartbeatingStream`. Every tail Send (fan-out, future
variants) resets the silence timer; when the timer elapses, a
goroutine emits `gateway.heartbeat` with only `EventType` set —
everything else stays at proto3 defaults, so the wire frame is
~45 bytes amortised. A `sendMu` serialises the heartbeat goroutine
with tail Sends because grpc.ServerStream.Send is not goroutine-safe.
The heartbeat is intentionally UNSIGNED: heartbeats carry no
payload, dispatch to no handler on the client, and an injected
heartbeat trivially causes no user-visible state change. TLS still
protects the wire and real events keep the signed envelope
unchanged. Documented in `docs/ARCHITECTURE.md` § 15 alongside the
per-scale bandwidth projection (100…100 000 clients × 15…60 s).
Config: new `GATEWAY_PUSH_HEARTBEAT_INTERVAL` (default `15s`,
`0s` disables). Telemetry: new
`gateway.push.heartbeats_sent{outcome}` counter so operators can
budget bandwidth and spot a sudden `outcome=error` bump as an
upstream-failing-before-flush signal.
Client (`ui/frontend/src/api/events.svelte.ts`): early `continue`
on `event.eventType === "gateway.heartbeat"` before `verifyEvent`,
`verifyPayloadHash`, or dispatch; the empty signature would otherwise
trip SignatureError and force a reconnect. A leading heartbeat still flips
`connectionStatus` to `connected` and resets backoff, because
receiving one is proof the stream is healthy.
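A sketch of the client-side ordering, with the stream reduced to an array and verification and dispatch passed in as callbacks. `consumeEventsSketch` and `PushEventSketch` are hypothetical; the real loop in events.svelte.ts is asynchronous and tracks more state.

```typescript
interface PushEventSketch {
  eventType: string;
  signature: string;
}

// Returns true once any frame has arrived, heartbeat or not, since either
// proves the stream is healthy. Heartbeats are skipped before verification
// because their signature is empty by design.
function consumeEventsSketch(
  events: PushEventSketch[],
  verify: (event: PushEventSketch) => void,
  dispatch: (event: PushEventSketch) => void,
): boolean {
  let connected = false;
  for (const current of events) {
    connected = true;
    if (current.eventType === "gateway.heartbeat") continue;
    verify(current); // throws on a bad signature
    dispatch(current);
  }
  return connected;
}
```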
Tests:
- `push_heartbeat_test.go`: unit tests for the wrapper — zero
interval returns nil, heartbeat fires after silence, real Send
resets the timer, Stop / context-cancel halt the goroutine,
Send errors propagate.
- `server_test.go`: integration tests through the full gateway
pipeline — heartbeat fires after the configured silence window,
zero interval keeps the stream silent.
- `config_test.go`: default applied, env-override parsed,
negative value rejected.
- `events.test.ts`: heartbeat skipped before verification + not
dispatched to handlers; leading heartbeat still flips
`connectionStatus` to `connected`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase 28 ConnectRPC migration of the battle viewer added a
guard in `lib/active-view/battle.svelte` that waits for the
surrounding layout to publish a `GalaxyClient` before issuing the
fetch. The in-game shell layout deliberately skips
`galaxyClient.set(...)` on the synthetic branch (gateway is not
reachable in synthetic mode), so for any battle opened from a
synthetic-report game the viewer sat on "loading battle…"
forever — `fetchBattle` was never called, so the synthetic-fixture
short-circuit it carries was unreachable.
Let the guard skip synthetic ids: `fetchBattle` already resolves
those through `lookupSyntheticBattle` and never touches the
client, so its signature widens to `GalaxyClient | null` and the
synthetic path passes `null`. The live path still waits for the
handle as before; a `null` client on the live path now fails
fast with a transport-level `BattleFetchError` instead of silently
sitting on `loading`.
Tests:
- Existing "loading placeholder" smoke now uses a non-synthetic
game id so it keeps asserting the live-path wait.
- Two new cases pin the synthetic behaviour: missing fixture →
`battle-not-found`; registered fixture → `BattleViewer` mounts.
Docs:
- `docs/FUNCTIONAL.md` §6.5 still described the pre-Phase-28
raw REST path. Updated to the signed ConnectRPC command and
noted the synthetic short-circuit. Russian mirror updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:
- `tools/dev-deploy/docker-compose.yml`: mount
`galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
bind. Declare the volume in the top-level `volumes:` block.
- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
(placed right after the existing UI-volume seed) copies the
fixture from `pkg/geoip/test-data/test-data/` into the named
volume via an ephemeral alpine container, the same pattern UI
seeding already uses.
- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
the same copy from the persistent checkout. `up` and `rebuild`
now depend on it, so a hand-run `make -C tools/dev-deploy up`
populates the volume without operator action.
- `tools/dev-deploy/README.md`: updated the make-targets table to
list `seed-geoip`.
- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
failure is downgraded to a "fixed" postmortem; the symptom,
cause, and where the fix lives are kept for future reference.
Verification on the dev host (this branch checked out):
$ make -C tools/dev-deploy up # populates the volume, brings stack healthy
$ docker restart galaxy-dev-backend # used to error "not a directory"
$ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
$ echo "ok" # backend up 6s, healthy
The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.
Fix:
- `integration/testenv/backend.go` now sets
`BACKEND_STACK_LABEL=integration` on every backend-under-test, so
the engines spawned by integration carry
`galaxy.stack=integration` in addition to `galaxy.backend=1`. The
backend support for this env var was added in the previous CI tidy-up
PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter
helper and uses it to scope engine cleanup to the combination
`galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
local-dev engines carry different `galaxy.stack` values, so the
AND match leaves them alone.
- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
gets an "Update" section recording the root cause and the fix; the
status is downgraded to "partially fixed" because the solo
`workflow_dispatch` reproduction (which does NOT trigger
integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
`docker restart galaxy-dev-backend` failure caused by the
runner-workspace bind-mount that surfaced while diagnosing this
issue. Workaround: `make -C tools/dev-deploy up` from the
persistent checkout. Real fix is a follow-up (bake fixture into
image or copy to named volume).
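The AND scoping relies on passing one `--filter label=...` flag per label. preclean.sh is a shell script; the TypeScript below only illustrates the argument shape, and `dockerPsLabelArgsSketch` is a hypothetical helper.

```typescript
// Build `docker ps` arguments that match containers carrying ALL of the
// given labels: one --filter flag per key/value pair.
function dockerPsLabelArgsSketch(labels: Record<string, string>): string[] {
  const args = ["ps", "-aq"];
  for (const [key, value] of Object.entries(labels)) {
    args.push("--filter", `label=${key}=${value}`);
  }
  return args;
}

const integrationScope = dockerPsLabelArgsSketch({
  "galaxy.backend": "1",
  "galaxy.stack": "integration",
});
```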
Verification:
- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
`docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
returns nothing while the dev-deploy engine
`galaxy-game-80f3ce86-...` keeps running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit stamped `galaxy.stack=<value>` on services,
volumes, and networks. Putting it on volumes/networks changes their
compose config-hash on every label revision, so `docker compose up`
tries to recreate them — which on the long-lived dev environment
either destroys the postgres data volume or deadlocks while trying
to remove `galaxy-dev-internal` with containers still bound to it.
Observed live: run #184 hung in compose recreate after the three
stateful services were stopped, with no recovery.
Containers alone are sufficient for the cleanup contract (we filter
containers, not volumes or networks). Roll back the label on volumes
and networks in both compose files and capture the rule in
docs/ARCHITECTURE.md so the next contributor does not reintroduce it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five connected cleanups across the dev/CI infrastructure:
1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
the legacy "offline workflow validator"; the per-stage CI gate now
runs on gitea.lan and the directory was only retained as a
fallback. Removing it leaves no operational dependency: backend,
gateway, and game code have no references; documentation that
pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
in this same change. Historical "Verified on local-ci run N"
markers in ui/PLAN.md are preserved unchanged.
2. Lift the pre-production single-migration rule. The rule forced
every schema delta into 00001_init.sql and required a manual
make clean-data wipe on every backward-incompatible change in
tools/dev-deploy/. Future schema deltas now land as additive
sequence-numbered files (00002_*.sql, …) that goose applies
automatically on backend startup; 00001_init.sql becomes an
immutable baseline. Authoring conventions live in
backend/internal/postgres/migrations/README.md. The chain may be
squashed back into a fresh 00001 as a deliberate one-time
operation before the first production deployment.
3. Document the deployment cadence. The dev environment is
single-tenant: pushes to feature/* run the test workflows
(go-unit, ui-test, integration) only; dev-deploy.yaml fires on
push to development. A workflow_dispatch override on
dev-deploy.yaml lets a developer preview a feature branch on the
shared dev environment before merge; the next merge into
development overwrites the manual deploy idempotently.
4. Scope compose-managed resources by an explicit
galaxy.stack=<local-dev|dev-deploy> label. Both compose files
stamp the label on every service, network, and named volume.
Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
engine-cleanup operations by (stack-label AND engine OCI title)
so they never touch unrelated workloads on the same daemon.
dev-deploy.yaml gains a pre-`compose up` step that reaps stale
exited/dead containers under the dev-deploy stack label.
5. Backend now stamps the same galaxy.stack=<value> label on every
engine container it spawns, sourced from a new BACKEND_STACK_LABEL
env var (empty → label not applied; legacy-safe). Both compose
files set it to their stack name (local-dev / dev-deploy). The
contract is recorded in docs/ARCHITECTURE.md under
"Container labels". A package-level test in
backend/internal/runtime exercises both the label-present and
label-absent paths.
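A minimal Go sketch of the labelling rule in items 4 and 5, not the actual backend/internal/runtime code. The names engineLabels and cleanupMatches and the "galaxy-game-engine" title value are assumptions for illustration; only the galaxy.stack key and BACKEND_STACK_LABEL come from this change.

```go
package main

import (
	"fmt"
	"os"
)

const (
	stackLabelKey = "galaxy.stack"
	// Standard OCI annotation key; assumed to be the "engine OCI title" label.
	ociTitleKey = "org.opencontainers.image.title"
)

// engineLabels copies base and adds the stack label only when stack is
// non-empty, so an unset BACKEND_STACK_LABEL leaves containers unlabelled.
func engineLabels(base map[string]string, stack string) map[string]string {
	out := make(map[string]string, len(base)+1)
	for k, v := range base {
		out[k] = v
	}
	if stack != "" {
		out[stackLabelKey] = stack
	}
	return out
}

// cleanupMatches mirrors the Makefile filter: a container is eligible for
// cleanup only if it carries the stack label AND the engine OCI title.
func cleanupMatches(labels map[string]string, stack, engineTitle string) bool {
	return stack != "" &&
		labels[stackLabelKey] == stack &&
		labels[ociTitleKey] == engineTitle
}

func main() {
	// Hypothetical title value, used here only for the demonstration.
	base := map[string]string{ociTitleKey: "galaxy-game-engine"}
	labels := engineLabels(base, os.Getenv("BACKEND_STACK_LABEL"))
	fmt.Println(labels)
	fmt.Println(cleanupMatches(labels, "dev-deploy", "galaxy-game-engine"))
}
```

Requiring both conditions is what keeps an unlabelled (legacy) engine container, or an unrelated workload that happens to share the stack label, out of the cleanup set.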
No tests regressed: go test ./backend/internal/{config,runtime,dockerclient}
is green, both compose files validate cleanly, and the backend, gateway,
and game modules all build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the live investigation, the project owner confirms that none
of the host-side cleanup paths apply: no docker prune cron, no
manual `docker rm`, no `dockerd` restart in the window, and the
engine binary does not crash while idling on API calls.
Replace the host-side hypothesis list with a one-line note that
they were considered and rejected, narrow the open suspicion to
the `dev-deploy.yaml` job sequence (`docker build` + `docker
compose build` + the alpine `docker run --rm` for UI seeding +
`docker compose up -d --wait --remove-orphans`), and park the
entry. Reopen if the symptom recurs with a fresh
`docker events --since 0` capture armed before the deploy
starts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A live `docker inspect` of an engine container and two redispatch
runs with `docker events` captured confirm:
- Engine has no `com.docker.compose.*` labels and `AutoRemove=false`,
so `--remove-orphans` cannot reap it.
- Two consecutive `dev-deploy.yaml` redispatches with an engine
already running emitted `die` / `destroy` events only for
`galaxy-dev-{backend,api,caddy}` — never for the engine.
- The reconciler tick that fires 60s after backend recreate
correctly matched the surviving engine in both cases
(`status=running` in both `games` and `runtime_records`).
- `runtime.Service` has no `Shutdown` that proactively removes
engine containers, so a graceful backend exit also leaves them
alone.
The disappearance seen in the repro window therefore needs a separate
trigger, one that removed the engine container outside of compose. The
new hypotheses point
at host-side `docker prune` jobs, a `dockerd` restart that lost the
container, or an early `Engine.Init` failure that exited the engine
before `status=running` reached the runtime row. The investigation
list now leads with `journalctl -u docker` and the host crontab —
those are the cheapest checks to confirm or rule out next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the diagnostic notes for the issue we hit after every
`dev-deploy.yaml` redispatch: the freshly-bootstrapped "Dev Sandbox"
game ends up `cancelled` ~15 minutes later, with the runtime
reconciler reporting "container disappeared". The engine never
shows up in `docker ps -a --filter label=galaxy-game-engine`, so
either it never spawned or it was removed before any host-side
snapshot.
`KNOWN-ISSUES.md` records the symptom, the log excerpt, three
working hypotheses (runtime spawn race, `--remove-orphans`
interaction, engine `--rm` lifecycle), and the checklist to work
through before opening an issue. The README gets a one-line
pointer so future redeploys land on the doc immediately.
No code change — this is the placeholder so the next person
investigating the cancellation pattern does not have to
rediscover the diagnostic from scratch.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two issues surfaced once the long-lived dev environment finally
reached the diplomail view:
1. `/sent` returns one row per recipient for broadcast and admin
fan-outs (so the admin tooling can render the materialised
audience). The list pane fed all rows into the stand-alone
bucket, so the `{#each entries as e (entryKey(e))}` key in
`thread-list.svelte` collapsed to the same `standalone:${id}`
for every recipient and Svelte 5 aborted the render with
`each_key_duplicate`. Dedupe stand-alones by `message_id` in
`buildEntries`.
2. The compose dialog exposed an `admin` kind toggle gated on
"owner of game". That was a Phase 28 plan decision, but admin
compose is an operator tool (server admin), not an in-game
action: game owners should not be able to broadcast admin
notifications. Drop the admin option, the audience
sub-toggles, and the admin path through `submit`. The
`MailStore.composeAdmin` wrapper and the backend RPC stay so
the future admin UI can call them.
Vitest covers the fan-out dedup with three rows sharing one
`message_id` collapsing to a single stand-alone entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The public REST listener already exposes
`GATEWAY_PUBLIC_HTTP_CORS_ALLOWED_ORIGINS`; the authenticated
Connect-Web listener on the separate gRPC port had no equivalent.
That worked in `tools/local-dev` (Vite proxy makes everything
same-origin) and would work in production once UI and gateway share
a single hostname, but the long-lived dev environment serves the
UI from `https://www.galaxy.lan` and the gateway from
`https://api.galaxy.lan` — every `/galaxy.gateway.v1.EdgeGateway/*`
fetch failed in the browser with WebKit's generic "Load failed"
message because the response carried no `Access-Control-Allow-Origin`
header. Lobby rendered as "[unknown] Load failed" with no game.
Mirror the public-REST CORS surface for the authenticated handler:
- new env `GATEWAY_AUTHENTICATED_GRPC_CORS_ALLOWED_ORIGINS`;
- new `AuthenticatedGRPCConfig.CORSAllowedOrigins` field;
- new `grpcapi.withCORS` middleware wrapping the Connect mux;
- dev-deploy stack sets the env to `https://www.galaxy.lan`.
The middleware speaks plain net/http (the Connect handler is mounted
on a ServeMux, not gin), handles preflight 204 immediately, and
exposes the Connect-Web header set the browser needs to read the
response (`Grpc-Status`, `Grpc-Message`, `Connect-Protocol-Version`).
Empty allow-list disables the middleware — production stays at
"single hostname" by default.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`api.galaxy.lan` was proxying every path to `galaxy-api:8080` (the
public REST listener), so authenticated Connect-Web calls
(`/galaxy.gateway.v1.EdgeGateway/ExecuteCommand`,
`/galaxy.gateway.v1.EdgeGateway/SubscribeEvents`) collapsed to a 404
from the public route table — the lobby loaded the static bundle
but every authenticated query failed silently.
Split routing by path: `/galaxy.gateway.v1.EdgeGateway/*` goes to
the authenticated listener on `:9090`, everything else stays on
`:8080`. Mirrors the Vite dev-server proxy in
`ui/frontend/vite.config.ts`.
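The routing rule itself lives in the reverse-proxy config; the Go function below only restates it as a sketch. The function name `upstreamFor` and the `galaxy-api` host for the `:9090` listener are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const edgeGatewayPrefix = "/galaxy.gateway.v1.EdgeGateway/"

// upstreamFor picks the listener for a request path: Connect-Web calls go
// to the authenticated listener, everything else to the public REST one.
func upstreamFor(path string) string {
	if strings.HasPrefix(path, edgeGatewayPrefix) {
		return "galaxy-api:9090" // authenticated listener (host assumed)
	}
	return "galaxy-api:8080" // public REST listener
}

func main() {
	fmt.Println(upstreamFor("/galaxy.gateway.v1.EdgeGateway/ExecuteCommand"))
	fmt.Println(upstreamFor("/index.html"))
}
```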
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two long-standing dev-environment conveniences had not survived the
move from the bespoke local-dev stack to the CI-driven dev-deploy:
1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the
dev-deploy compose, so the auto-provisioned "Dev Sandbox" game
never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan`
as the default — matches `.env.example` and lets a developer who
logs in with that email find a ready-to-play game in the lobby.
2. The lobby's synthetic-report loader was gated on
`import.meta.env.DEV`, which is true only for `vite dev` (the
tools/local-dev path). The long-lived dev environment builds
with `vite build` (production mode), so the section was always
stripped from its bundle. Gate it on an explicit
`VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in
`.env.development` (preserves `pnpm dev` behaviour) and in the
`dev-deploy.yaml` build step. The `prod-build.yaml` build path
leaves the flag unset, so production stays clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two problems showed up while trying to log into the long-lived dev
environment with the dev-fixed code `123456`:
1. `ConfirmEmailCode` checked the per-challenge attempts ceiling
*before* the dev-fixed-code override. A developer who burned past
`ChallengeMaxAttempts` on an existing un-consumed challenge (easy
to trigger when the throttle reuses one challenge_id) hit
`ErrTooManyAttempts` and the UI rendered "code expired or already
used" even though the fixed code was correct. Reorder so the
dev-fixed-code branch runs first and bypasses both the bcrypt
verify and the attempts gate. Production stays unaffected
because production loaders refuse to set `DevFixedCode`.
2. `dev-deploy.yaml` only fires on push to `development`, so the
matching docker-compose default change for
`BACKEND_AUTH_DEV_FIXED_CODE` could not reach the running stack
before this PR merged. Add `workflow_dispatch: {}` so a developer
can deploy any branch — typically a feature branch under review —
from the Gitea Actions UI without waiting for the merge.
Covered by a new `TestConfirmEmailCodeDevFixedCodeBypassesAttemptsCeiling`
integration test that burns through the ceiling with wrong codes
then proves the dev-fixed code still produces a session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>