Root cause for the long-standing "Dev Sandbox flips to cancelled
after dev-deploy" symptom in push-triggered cycles: when
`integration.yaml` runs in parallel with `dev-deploy.yaml`, its
`integration/scripts/preclean.sh` issues a `docker rm -f` over every
container labelled `galaxy.backend=1`. That label is stamped by the
backend's runtime adapter on every engine it spawns — including the
engines living in the long-lived dev-deploy environment on the same
Docker daemon. Each post-merge auto-deploy therefore had the
integration preclean wipe the dev-sandbox engine, and the new
backend's reconciler tick observed `container disappeared` and
cascaded the sandbox into `cancelled`.
Fix:
- `integration/testenv/backend.go` now sets
`BACKEND_STACK_LABEL=integration` on every backend-under-test, so
the engines spawned by integration carry
`galaxy.stack=integration` in addition to `galaxy.backend=1`. The
backend support for this env was added in the previous CI tidy-up
PR (#13).
- `integration/scripts/preclean.sh` gains a multi-label AND filter
helper and uses it to scope engine cleanup to the combination
`galaxy.backend=1 AND galaxy.stack=integration`. dev-deploy and
local-dev engines carry different `galaxy.stack` values, so the
AND match leaves them alone.
- `docs/ARCHITECTURE.md` "Container labels" — refreshed to call out
the AND-scoping rule and the new integration backend stamp.
- `tools/dev-deploy/KNOWN-ISSUES.md` — the sandbox-cancel entry
gets an "Update" section recording the root cause and the fix; the
status is downgraded to "partially fixed" because the solo
`workflow_dispatch` reproduction (which does NOT trigger
integration) remains unexplained.
- `tools/dev-deploy/KNOWN-ISSUES.md` — separately, document the
`docker restart galaxy-dev-backend` failure caused by the
runner-workspace bind-mount that surfaced while diagnosing this
issue. Workaround: `make -C tools/dev-deploy up` from the
persistent checkout. Real fix is a follow-up (bake fixture into
image or copy to named volume).
Verification:
- `go build ./backend/... ./integration/...` — clean.
- `bash -n integration/scripts/preclean.sh` — syntax OK.
- Live AND-filter check on the dev host:
`docker ps -aq --filter label=galaxy.backend=1 --filter label=galaxy.stack=integration`
returns nothing while the dev-deploy engine
`galaxy-game-80f3ce86-...` keeps running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit stamped `galaxy.stack=<value>` on services,
volumes, and networks. Putting it on volumes/networks changes their
compose config-hash on every label revision, so `docker compose up`
tries to recreate them — which on the long-lived dev environment
either destroys the postgres data volume or deadlocks while trying
to remove `galaxy-dev-internal` with containers still bound to it.
Observed live: run #184 hung in compose recreate after the three
stateful services were stopped, with no recovery.
Containers alone are sufficient for the cleanup contract (we filter
containers, not volumes or networks). Roll back the label on volumes
and networks in both compose files and capture the rule in
docs/ARCHITECTURE.md so the next contributor does not reintroduce it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five connected cleanups across the dev/CI infrastructure:
1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
the legacy "offline workflow validator"; the per-stage CI gate now
runs on gitea.lan and the directory was only retained as a
fallback. Removing it leaves no operational dependency: backend,
gateway, and game code have no references; documentation that
pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
in this same change. Historical "Verified on local-ci run N"
markers in ui/PLAN.md are preserved unchanged.
2. Lift the pre-production single-migration rule. The rule forced
every schema delta into 00001_init.sql and required a manual
make clean-data wipe on every backward-incompatible change in
tools/dev-deploy/. Future schema deltas now land as additive
sequence-numbered files (00002_*.sql, …) that goose applies
automatically on backend startup; 00001_init.sql becomes an
immutable baseline. Authoring conventions live in
backend/internal/postgres/migrations/README.md. The chain may be
squashed back into a fresh 00001 as a deliberate one-time
operation before the first production deployment.
3. Document the deployment cadence. The dev environment is
single-tenant: pushes to feature/* run the test workflows
(go-unit, ui-test, integration) only; dev-deploy.yaml fires on
push to development. A workflow_dispatch override on
dev-deploy.yaml lets a developer preview a feature branch on the
shared dev environment before merge; the next merge into
development overwrites the manual deploy idempotently.
4. Scope compose-managed resources by an explicit
galaxy.stack=<local-dev|dev-deploy> label. Both compose files
stamp the label on every service, network, and named volume.
Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
engine-cleanup operations by (stack-label AND engine OCI title)
so they never touch unrelated workloads on the same daemon.
dev-deploy.yaml gains a pre-`compose up` step that reaps stale
exited/dead containers under the dev-deploy stack label.
5. Backend now stamps the same galaxy.stack=<value> label on every
engine container it spawns, sourced from a new BACKEND_STACK_LABEL
env var (empty → label not applied; legacy-safe). Both compose
files set it to their stack name (local-dev / dev-deploy). The
contract is recorded in docs/ARCHITECTURE.md under
"Container labels". A package-level test in
backend/internal/runtime exercises both the label-present and
label-absent paths.
No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 28 of ui/PLAN.md needs a persistent player-to-player mail
channel; the existing `mail` package is a transactional email
outbox and the `notification` catalog is one-way platform events.
Stage A lands the schema (diplomail_messages / _recipients /
_translations), a single-recipient personal send/read/delete
service path, a `diplomail.message.received` push kind plumbed
through the notification pipeline, and an unread-counts endpoint
that drives the lobby badge. Admin / system mail, lifecycle hooks,
paid-tier broadcast, multi-game broadcast, bulk purge and language
detection / translation cache come in stages B–D.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the project guides with the branching/CI/environment changes
landed in the previous commits:
- CLAUDE.md: per-stage CI gate now closes against gitea.lan; describes
the main/development/feature/* flow and the workflow surface
- docs/ARCHITECTURE.md: new section 18 "CI and Environments" covering
branches, workflows, and the local-dev / dev-deploy / local-ci
triad; section numbering shifted accordingly
- tools/local-ci/README.md: marked as fallback (offline / runner
isolation only)
- tools/local-dev/README.md and ui/README.md: cross-link to
tools/dev-deploy/ for production-shaped testing
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backend now owns the turn-cutoff and pause guards the order tab
relies on: the scheduler flips runtime_status between
generation_in_progress and running around every engine tick, a
failed tick auto-pauses the game through OnRuntimeSnapshot, and a
new game.paused notification kind fans out alongside
game.turn.ready. The user-games handlers reject submits with
HTTP 409 turn_already_closed or game_paused depending on the
runtime state.
UI delegates auto-sync to a new OrderQueue: offline detection,
single retry on reconnect, conflict / paused classification.
OrderDraftStore surfaces conflictBanner / pausedBanner runes,
clears them on local mutation or on a game.turn.ready push via
resetForNewTurn. The order tab renders the matching banners and
the new conflict per-row badge; i18n bundles cover en + ru.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the first end-to-end command through the full pipeline:
inspector rename action → local order draft → user.games.order
submit → optimistic overlay on map / inspector → server hydration
on cache miss via the new user.games.order.get message type.
Backend: GET /api/v1/user/games/{id}/orders forwards to engine
GET /api/v1/order. Gateway parses the engine PUT response into the
extended UserGamesOrderResponse FBS envelope and adds
executeUserGamesOrderGet for the read-back path. Frontend ports
ValidateTypeName to TS, lands the inline rename editor + Submit
button, and exposes a renderedReport context so consumers see the
overlay-applied snapshot.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Extend pkg/model/lobby and pkg/schema/fbs/lobby.fbs with public-games
list, my-applications/invites lists, game-create, application-submit,
invite-redeem/decline. Mirror the matching transcoder pairs and Go
fixture round-trip tests.
- Wire the seven new lobby message types through
gateway/internal/backendclient/{routes,lobby_commands}.go with
per-command REST helpers, JSON-tolerant decoding of backend wire
shapes, and httptest-based unit coverage for success / 4xx / 5xx /
503 across each command.
- Introduce TS-side FlatBuffers via the `flatbuffers` runtime dep, a
`make fbs-ts` target driving flatc, and the generated bindings under
ui/frontend/src/proto/galaxy/fbs. Phase 7's `user.account.get` decode
now uses these bindings as well, closing the JSON.parse vs
FlatBuffers gap that would have failed against a real local stack.
- Replace the placeholder lobby with five sections (my games, pending
invitations, my applications, public games, create new game) and the
/lobby/create form. Submit-application uses an inline race-name
form on the public-game card; create-game keeps name / description /
turn_schedule / enrollment_ends_at always visible and the rest under
an Advanced toggle with TS-side defaults.
- Update lobby/+page.svelte to throw LobbyError on non-ok result codes;
GalaxyClient.executeCommand now returns { resultCode, payloadBytes }.
- Vitest binding round-trips, lobby.ts wrapper unit tests, lobby-page
+ lobby-create component tests, Playwright lobby-flow.spec covering
create / submit / accept across all four projects. Phase 7 e2e was
migrated to the FlatBuffers fixtures and to click+fill against the
Safari-autofill readonly inputs.
- Mark Phase 8 done in ui/PLAN.md, mirror the wire-format note into
Phase 7, append the new lobby commands to gateway/README.md and
docs/ARCHITECTURE.md, add ui/docs/lobby.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
KeyStore + Cache TS interfaces with WebCrypto non-extractable Ed25519
keys persisted via IndexedDB (idb), plus thin api/session.ts that
loads or creates the device session at app startup. Vitest unit
tests under fake-indexeddb cover both adapters; Playwright e2e
verifies the keypair survives reload and produces signatures still
verifiable under the persisted public key (gateway round-trip moves
to Phase 7's existing acceptance bullet).
Browser baseline: WebCrypto Ed25519 — Chrome >=137, Firefox >=130,
Safari >=17.4. No JS fallback; ui/docs/storage.md documents the
matrix and the WebKit non-determinism quirk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the native-gRPC server bootstrap with a single
`connectrpc.com/connect` HTTP/h2c listener. Connect-Go natively
serves Connect, gRPC, and gRPC-Web on the same port, so browsers can
now reach the authenticated surface without giving up the gRPC
framing native and desktop clients may use later. The decorator
stack (envelope → session → payload-hash → signature →
freshness/replay → rate-limit → routing/push) is reused unchanged
behind a small Connect → gRPC adapter and a `grpc.ServerStream`
shim around `*connect.ServerStream`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>