Commit Graph

29 Commits

Author SHA1 Message Date
Ilia Denisov e11092234c feat(dev-deploy): expose Grafana + Mailpit UIs via Caddy; seed monitoring config
Deploy wiring for the observability stack (the services and collector
config landed in the previous commit):

- Caddyfile.dev: route /grafana/* to galaxy-grafana:3000 (Caddy
  sub-path mode, Grafana keeps its own login) and /mailpit/* to
  galaxy-mailpit:8025 behind dev basic-auth, so the captured-mail UI
  (every message, relayed or not) and Grafana are reachable through the
  single dev origin.
- dev-deploy.yaml: seed the monitoring config tree to a stable,
  reboot-surviving host path (GALAXY_DEV_MONITORING_DIR) before bringing
  the stack up, and inject the Grafana admin password from a Gitea
  secret (GALAXY_DEV_GRAFANA_ADMIN_PASSWORD; empty falls back to the
  compose default).
2026-06-01 05:46:19 +02:00
Ilia Denisov 7fb6a63c2b feat(dev-deploy): relay Mailpit to Gmail (Stage 3)
Keep Mailpit as the backend's SMTP submission point and turn on its
relay so OTP/notification mail addressed to the owner reaches a real
Gmail inbox, while everything else stays captured-only.

- mailpit gains --smtp-relay-config + --smtp-relay-matching (default
  non-routable, so an unconfigured stack only captures); relay.conf is
  mounted from a new galaxy-dev-mailpit-config volume
- tools/dev-deploy/mailpit/relay.conf.tmpl + a dev-deploy.yaml step that
  renders it from Gitea secrets (Gmail App Password, never committed)
  and seeds the volume; the GALAXY_DEV_MAIL_RELAY_MATCH var drives the
  relay-matching recipient
- backend SMTP config unchanged (still -> galaxy-mailpit:1025)
- dev-deploy README documents the relay + required secrets/vars

Verified locally: compose config valid; the rendered relay.conf is
accepted by mailpit v1.21.8 (relay + recipient-matching enabled).
Real Gmail delivery is verified at the dev-deploy preview once the
owner sets the secrets.
2026-05-31 22:44:32 +02:00
Ilia Denisov 0cae89cba2 refactor(dev): remove the dev-sandbox bootstrap everywhere
Tests · Go / test (push) Successful in 1m59s
Stage 1 of the dev-as-prod-mirror rework. The auto-provisioned "Dev
Sandbox" game and dummy users are removed so the dev contour starts
empty like prod; the separate legacy-report loader stays as the
test-data path.

- delete backend/internal/devsandbox (package + tests)
- drop the bootstrap call + DevSandboxConfig (struct, Config field,
  BACKEND_DEV_SANDBOX_* env, defaults, loader, validation)
- strip BACKEND_DEV_SANDBOX_* from dev-deploy + local-dev compose and
  .env.example; the generic engine-recycle / prune-broken-engines logic
  stays (it serves real games)
- update tooling docs (dev-deploy README + KNOWN-ISSUES, local-dev
  README + Makefile) and stale comments; DeleteGame and
  InsertMembershipDirect remain (exercised by lobby integration tests)

No app behaviour change beyond not auto-creating the sandbox game.
2026-05-31 22:28:03 +02:00
Ilia Denisov eb549e6049 ci(ui-test): clean root-owned build artifacts so runner teardown succeeds
Tests · UI / test (push) Waiting to run
Tests · UI / test (pull_request) Successful in 3m24s
In host-mode the ui-test job runs as root, so vite (test:pwa),
svelte-kit and Playwright write build/, .svelte-kit/, test-results/ and
playwright-report/ root-owned into the shared host workspace. The
act_runner (non-root) then cannot remove them at teardown
("unlinkat ui/frontend/build: permission denied"), which spuriously
marks this or a sibling job that inherits the dirty workspace as failed
— it hit go-unit on the #83 merge even though every test passed.

Add an `if: always()` step that removes those generated dirs while the
job still has root, after the artifact uploads. Keeps the shared
workspace clean for the runner's own teardown and for later jobs.
2026-05-31 12:15:32 +02:00
Ilia Denisov 658ab7f6e7 chore(fbs): pin flatc toolchain to 25.9.23 and guard codegen drift
Tests · FBS codegen / codegen (push) Successful in 5s
Tests · Go / test (push) Successful in 2m29s
Tests · FBS codegen / codegen (pull_request) Successful in 6s
Tests · UI / test (push) Waiting to run
Tests · Integration / integration (pull_request) Successful in 1m46s
Tests · Go / test (pull_request) Successful in 3m20s
Tests · UI / test (pull_request) Successful in 3m19s
The committed FlatBuffers bindings were generated by flatc 25.x (the TS
runtime is flatbuffers@25.9.23), but nothing pinned the compiler, so a
regen on a box with an older flatc (Debian apt ships 23.5.26) silently
churns output and flips nullable-scalar builder defaults. PR #82 hit this
and shipped 5 report files from the wrong compiler.

Unify the whole toolchain on 25.9.23 (the only version available as an
npm package, a prebuilt flatc binary, and a Go tag) and make the bindings
reproducible:

- Downgrade the flatbuffers Go module 25.12.19 -> 25.9.23 (schema,
  transcoder, gateway, integration) so compiler and both runtimes match.
- Regenerate every schema with flatc 25.9.23. The only resulting change
  is order/command-item.ts: the lone straggler still on the old
  optional-scalar builder default (cmd_applied/cmd_error_code: 0 -> null).
  Inert in practice — the TS side never builds those response-only fields
  (the engine sets them in Go); the reader is unchanged.
- Pin the version in tooling: a flatc-check guard in ui/Makefile (fbs-ts)
  and a new pkg/schema/fbs/Makefile (fbs-go); both refuse a mismatched
  flatc and point at the release binary. Fix the stale apt install hint.
- Add a path-filtered CI guard (.gitea/workflows/fbs-codegen.yaml) that
  regenerates with the pinned flatc and fails on any diff.
- Document the pinned version and the regen commands in the schema README.
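The real guard lives in the Makefiles. A Go rendering of the same check might look like this, assuming `flatc --version` prints a line of the form `flatc version 25.9.23`:

```go
package main

import (
	"fmt"
	"strings"
)

const pinnedFlatc = "25.9.23"

// checkFlatc parses `flatc --version` output and refuses anything but the
// pinned release, pointing at the release binary instead of apt.
func checkFlatc(versionOutput string) error {
	fields := strings.Fields(versionOutput)
	if len(fields) == 0 {
		return fmt.Errorf("flatc: empty version output")
	}
	got := fields[len(fields)-1]
	if got != pinnedFlatc {
		return fmt.Errorf("flatc %s found, %s required: install the %s release binary", got, pinnedFlatc, pinnedFlatc)
	}
	return nil
}

func main() {
	fmt.Println(checkFlatc("flatc version 25.9.23"))
	fmt.Println(checkFlatc("flatc version 23.5.26")) // Debian apt build
}
```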

No wire-format change: Go build/vet, transcoder roundtrip + engine tests,
pnpm check and the full vitest suite (888) stay green.
2026-05-31 11:51:20 +02:00
Ilia Denisov e038ea6154 fix(dev-deploy): recycle engine containers on galaxy-engine:dev SHA drift
Tests · Integration / integration (pull_request) Successful in 1m48s
Tests · Go / test (pull_request) Successful in 2m1s
`backend`'s reconciler adopts pre-existing `galaxy-game-*` containers
without comparing their image SHA against the freshly-built
`galaxy-engine:dev`, so a long-lived sandbox would otherwise keep
serving the previous engine code after a redeploy. Issue #59 surfaced
this: after the per-command-rejection fix was deployed via
`workflow_dispatch`, the running sandbox container was still on the
old image SHA and the browser kept seeing the 503/unavailable response.

Adds a `Recycle engine containers on image drift` step right before
`Reap stray dev-deploy containers`. The step compares the new
`galaxy-engine:dev` SHA against every running `galaxy-game-*`
container and, on drift, stops the backend, removes the container,
wipes the bind-mounted per-game state directory (Engine.Init() writes
turn-0 over any pre-existing `turn-N` files — silent state corruption
otherwise), and cascade-deletes the lobby `games` row. The
`dev-sandbox` bootstrap on the next backend boot finds no live
sandbox and provisions a fresh one on the new engine image.

When the engine sources are unchanged, the BuildKit cache hits and
the SHA stays the same — the recycle step is a no-op and the running
games keep their state across the deploy. Verified end-to-end against
the live dev environment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 10:47:25 +02:00
Ilia Denisov 8565942392 feat(deploy): single-origin path-based deployment + project site
Build · Site / build (push) Successful in 8s
Tests · Go / test (push) Successful in 2m22s
Tests · UI / test (push) Failing after 2m42s
Serve the whole stack behind one host: site at /, game UI at /game/,
gateway REST at /api + /healthz, Connect at /rpc (prefix stripped by the
edge Caddy). The built artifact is domain-agnostic — the UI talks to the
gateway same-origin via relative URLs, so the same bundle runs under any
host with no rebuild and with CORS disabled.

- Rename the Connect proto service galaxy.gateway.v1.EdgeGateway ->
  edge.v1.Gateway; regenerate Go + TS; public path /rpc/edge.v1.Gateway.
- Move the game UI under base path /game (env BASE_PATH); make the
  manifest, service-worker scope, WASM loader, and all navigation
  base-aware via a withBase helper.
- Relative API + /rpc Connect prefix; Vite dev proxy mirrors the strip.
- Rewrite the edge Caddy (dev + prod) for path-based routing; empty CORS
  allow-lists (same-origin); single host.
- New VitePress project site (site/): i18n en/ru with switcher, LaTeX
  math, minimal monospace theme; built and served at /.
- dev-deploy compose/Makefile + CI (dev-deploy, prod-build, new
  site-build) build and seed the site; probes hit /, /game/, /healthz.
- Sync docs (ARCHITECTURE, gateway README/openapi, dev-deploy &
  local-dev READMEs, CLAUDE.md, ui/PLAN).
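The prefix stripping that the edge Caddy performs for /rpc can be illustrated with Go's `http.StripPrefix`. This is only an analogy for the Caddy config, not the deployed router. The echoing handler stands in for the gateway, and the `Hello` method and `/api/v1/games` path are hypothetical:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newEdge forwards /rpc/* with the prefix removed, so the gateway sees
// /edge.v1.Gateway/..., while /api/* and /healthz pass through unchanged.
func newEdge() http.Handler {
	gateway := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, r.URL.Path) // echo the path the gateway receives
	})
	mux := http.NewServeMux()
	mux.Handle("/rpc/", http.StripPrefix("/rpc", gateway))
	mux.Handle("/api/", gateway)
	mux.Handle("/healthz", gateway)
	return mux
}

// pathSeen returns the path the gateway handler observes for a request path.
func pathSeen(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, nil))
	return rec.Body.String()
}

func main() {
	edge := newEdge()
	fmt.Println(pathSeen(edge, "/rpc/edge.v1.Gateway/Hello"))
	fmt.Println(pathSeen(edge, "/api/v1/games"))
}
```

Because the UI uses relative URLs against the same origin, nothing in this mapping depends on the host name.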

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:19:07 +02:00
Ilia Denisov 04c7f6e68a feat(ui): installable offline PWA — service worker, manifest, icons (F5)
Tests · UI / test (push) Failing after 7m31s
Native SvelteKit service worker (src/service-worker.ts): a version-keyed
cache precaches the app shell + build artefacts (incl. core.wasm) +
static files; activate purges old caches; the gateway is never
intercepted; navigations fall back to the cached shell offline. Adds
static/manifest.webmanifest, a generated placeholder icon set
(scripts/gen-pwa-icons.mjs — dependency-free pure-Node PNG encoder), and
manifest / theme-color / apple-touch tags in app.html.

Gated by Playwright against a production preview (playwright.pwa.config.ts
+ tests/pwa/pwa.spec.ts via `pnpm test:pwa`, wired into ui-test):
manifest + installable icons, SW registration + a single version-keyed
cache, and offline shell load. Lighthouse is not used — its PWA category
was removed in v12.

Docs: ui/docs/pwa-strategy.md (+ index); F5 marked done.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:46:42 +02:00
Ilia Denisov b729036778 build(ui): build core.wasm in CI, stop committing the binary (F6)
Tests · UI / test (push) Successful in 3m48s
Tests · UI / test (pull_request) Successful in 2m35s
core.wasm and wasm_exec.js are no longer tracked (untracked + gitignored).
A reusable composite action .gitea/actions/build-wasm installs TinyGo
(actions/cache'd) and runs `make -C ui wasm`; it runs in all three
frontend-building workflows — ui-test (before Playwright; Vitest uses the
fake Core and needs no build), dev-deploy, and prod-build. ui-test gains a
Go setup (TinyGo shells out to Go); the deploy workflows already had one.

Docs: ui/docs/wasm-toolchain.md, ui/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 14:29:33 +02:00
Ilia Denisov b24d53b82f ci: install pnpm into a per-job dir to fix the host-runner setup race
Tests · UI / test (push) Waiting to run
Tests · UI / test (pull_request) Successful in 1m56s
`pnpm/action-setup@v4` defaults to installing pnpm in the shared
`~/setup-pnpm`. On the single host-mode runner $HOME is shared across
concurrent jobs, so when two pnpm jobs overlap (e.g. a post-merge
`dev-deploy` and `ui-test`, which sit in different concurrency groups)
their self-installers race and one fails with
`ENOTEMPTY ... rmdir '~/setup-pnpm/node_modules/.bin/store/v11/files'`
before the tests even run.

Point each step's `dest` at `${{ runner.temp }}/setup-pnpm` (a per-job
isolated directory) so concurrent jobs never share the install location.
The action still adds `dest` to PATH, so setup-node's pnpm cache and
later `pnpm` calls are unaffected; the pnpm package store stays shared
(safe — pnpm locks it). Applied to the three workflows that set up pnpm:
ui-test, dev-deploy, prod-build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:26:49 +02:00
Ilia Denisov f70258849f fix(dev-deploy): seed geoip onto a named volume
`docker restart galaxy-dev-backend` failed with "not a directory"
after every dev-deploy workflow run. Root cause: the compose file
bind-mounted the geoip database via a relative path
(`../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb`).
When the Gitea runner invoked `docker compose up`, the path
resolved against the runner's ephemeral workspace under
`/home/runner/.cache/act/<hash>/hostexecutor/...`. The bind source
baked into the running container therefore pointed at that
ephemeral path; the runner deleted the workspace once the workflow
finished, and any later `docker restart` could not remount.
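The failure mode comes down to path resolution. A relative bind source is resolved against the compose file's directory, so when the checkout is an ephemeral runner workspace, the absolute path baked into the container points inside that workspace. A Go sketch of the resolution follows. The `abc123` workspace hash is hypothetical:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveBindSource resolves a relative bind source against the directory
// holding the compose file. filepath.Join also cleans the ".." segments.
func resolveBindSource(composeDir, rel string) string {
	return filepath.Join(composeDir, rel)
}

func main() {
	// Hypothetical runner workspace standing in for .../act/<hash>/hostexecutor.
	composeDir := "/home/runner/.cache/act/abc123/hostexecutor/tools/dev-deploy"
	rel := "../../pkg/geoip/test-data/test-data/GeoIP2-Country-Test.mmdb"
	fmt.Println(resolveBindSource(composeDir, rel))
}
```

Once the runner deletes `/home/runner/.cache/act/abc123`, that source no longer exists, which is what the named volume avoids.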

Replace the bind with a named volume `galaxy-dev-geoip-data`,
seeded at deploy time:

- `tools/dev-deploy/docker-compose.yml`: mount
  `galaxy-dev-geoip-data:/var/lib/galaxy:ro` instead of a relative
  bind. Declare the volume in the top-level `volumes:` block.

- `.gitea/workflows/dev-deploy.yaml`: new `Seed geoip volume` step
  (placed right after the existing UI-volume seed) copies the
  fixture from `pkg/geoip/test-data/test-data/` into the named
  volume via an ephemeral alpine container, the same pattern UI
  seeding already uses.

- `tools/dev-deploy/Makefile`: new `seed-geoip` target performs
  the same copy from the persistent checkout. `up` and `rebuild`
  now depend on it, so a hand-run `make -C tools/dev-deploy up`
  populates the volume without operator action.

- `tools/dev-deploy/README.md`: updated the make-targets table to
  list `seed-geoip`.

- `tools/dev-deploy/KNOWN-ISSUES.md`: the entry for the restart
  failure is downgraded to a "fixed" postmortem; the symptom,
  cause, and where the fix lives are kept for future reference.

Verification on the dev host (this branch checked out):

  $ make -C tools/dev-deploy up                # populates the volume, brings stack healthy
  $ docker restart galaxy-dev-backend          # used to error "not a directory"
  $ until [ "$(docker inspect -f '{{.State.Health.Status}}' galaxy-dev-backend)" = "healthy" ]; do sleep 2; done
  $ echo "ok"                                   # backend up 6s, healthy

The pre-existing sandbox engine `galaxy-game-80f3ce86-...` survived
both `make up` and `docker restart` untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:59:38 +02:00
Ilia Denisov a9087691a3 chore(ci): tidy CI/dev infra — drop local-ci, lift migration rule, scope by galaxy.stack label
Tests · Go / test (push) Successful in 2m6s
Tests · Go / test (pull_request) Successful in 3m1s
Tests · Integration / integration (pull_request) Successful in 1m42s
Five connected cleanups across the dev/CI infrastructure:

1. Drop tools/local-ci/. The standalone Gitea + act_runner stack was
   the legacy "offline workflow validator"; the per-stage CI gate now
   runs on gitea.lan and the directory was only retained as a
   fallback. Removing it leaves no operational dependency: backend,
   gateway, and game code have no references; documentation that
   pointed at it (CLAUDE.md, docs/ARCHITECTURE.md, ui/docs/testing.md,
   tools/dev-deploy/README.md, tools/local-dev/README.md) is updated
   in this same change. Historical "Verified on local-ci run N"
   markers in ui/PLAN.md are preserved unchanged.

2. Lift the pre-production single-migration rule. The rule forced
   every schema delta into 00001_init.sql and required a manual
   make clean-data wipe on every backward-incompatible change in
   tools/dev-deploy/. Future schema deltas now land as additive
   sequence-numbered files (00002_*.sql, …) that goose applies
   automatically on backend startup; 00001_init.sql becomes an
   immutable baseline. Authoring conventions live in
   backend/internal/postgres/migrations/README.md. The chain may be
   squashed back into a fresh 00001 as a deliberate one-time
   operation before the first production deployment.

3. Document the deployment cadence. The dev environment is
   single-tenant: pushes to feature/* run the test workflows
   (go-unit, ui-test, integration) only; dev-deploy.yaml fires on
   push to development. A workflow_dispatch override on
   dev-deploy.yaml lets a developer preview a feature branch on the
   shared dev environment before merge; the next merge into
   development overwrites the manual deploy idempotently.

4. Scope compose-managed resources by an explicit
   galaxy.stack=<local-dev|dev-deploy> label. Both compose files
   stamp the label on every service, network, and named volume.
   Makefiles in tools/local-dev/ and tools/dev-deploy/ filter their
   engine-cleanup operations by (stack-label AND engine OCI title)
   so they never touch unrelated workloads on the same daemon.
   dev-deploy.yaml gains a pre-`compose up` step that reaps stale
   exited/dead containers under the dev-deploy stack label.

5. Backend now stamps the same galaxy.stack=<value> label on every
   engine container it spawns, sourced from a new BACKEND_STACK_LABEL
   env var (empty → label not applied; legacy-safe). Both compose
   files set it to their stack name (local-dev / dev-deploy). The
   contract is recorded in docs/ARCHITECTURE.md under
   "Container labels". A package-level test in
   backend/internal/runtime exercises both the label-present and
   label-absent paths.
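Points 4 and 5 can be sketched together: the backend stamps the stack label only when `BACKEND_STACK_LABEL` is non-empty, and cleanup touches a container only when both the stack label and the OCI title match. The function names and the `galaxy-engine` title value are hypothetical. `org.opencontainers.image.title` is the standard OCI title key:

```go
package main

import "fmt"

const stackLabel = "galaxy.stack"

// engineLabels returns the labels stamped on a spawned engine container.
// An empty stack value means the label is not applied (legacy-safe).
func engineLabels(stack string) map[string]string {
	labels := map[string]string{}
	if stack != "" {
		labels[stackLabel] = stack
	}
	return labels
}

// ownedEngine reports whether a cleanup pass may touch a container: the
// stack label AND the engine's OCI title must both match. An empty stack
// never matches, so unlabelled containers are left alone.
func ownedEngine(labels map[string]string, stack, engineTitle string) bool {
	return stack != "" &&
		labels[stackLabel] == stack &&
		labels["org.opencontainers.image.title"] == engineTitle
}

func main() {
	l := engineLabels("dev-deploy")
	l["org.opencontainers.image.title"] = "galaxy-engine" // hypothetical title
	fmt.Println(ownedEngine(l, "dev-deploy", "galaxy-engine"))
	fmt.Println(ownedEngine(l, "local-dev", "galaxy-engine"))
	fmt.Println(len(engineLabels("")))
}
```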

No tests intentionally regressed: go test ./backend/internal/{config,
runtime,dockerclient} is green, both compose files validate cleanly,
and the backend, gateway, and game modules all build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:32:42 +02:00
Ilia Denisov 81917acc3e dev-deploy: enable Dev Sandbox bootstrap and synthetic-report loader
Tests · UI / test (push) Has been cancelled
Tests · Integration / integration (pull_request) Successful in 1m47s
Tests · Go / test (pull_request) Successful in 2m4s
Tests · UI / test (pull_request) Successful in 2m23s
Two long-standing dev-environment ergonomics had not survived the
move from the bespoke local-dev stack to the CI-driven dev-deploy:

1. `BACKEND_DEV_SANDBOX_EMAIL` defaulted to an empty string in the
   dev-deploy compose, so the auto-provisioned "Dev Sandbox" game
   never appeared on `https://www.galaxy.lan`. Bake `dev@galaxy.lan`
   as the default — matches `.env.example` and lets a developer who
   logs in with that email find a ready-to-play game in the lobby.

2. The lobby's synthetic-report loader was gated on
   `import.meta.env.DEV`, which is true only for `vite dev` (the
   tools/local-dev path). The long-lived dev environment builds
   with `vite build` (production mode), so the section was always
   stripped from its bundle. Gate it on an explicit
   `VITE_GALAXY_DEV_AFFORDANCES` flag instead and set it both in
   `.env.development` (preserves `pnpm dev` behaviour) and in the
   `dev-deploy.yaml` build step. The `prod-build.yaml` build path
   leaves the flag unset, so production stays clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 21:46:24 +02:00
Ilia Denisov 859b157a59 auth dev-fixed-code bypasses attempts cap; dev-deploy gains manual dispatch
Tests · Go / test (pull_request) Successful in 2m9s
Tests · Go / test (push) Successful in 2m9s
Tests · Integration / integration (pull_request) Successful in 1m49s
Tests · UI / test (pull_request) Successful in 2m51s
Two problems showed up while trying to log into the long-lived dev
environment with the dev-fixed code `123456`:

1. `ConfirmEmailCode` checked the per-challenge attempts ceiling
   *before* the dev-fixed-code override. A developer who burned past
   `ChallengeMaxAttempts` on an existing un-consumed challenge (easy
   to trigger when the throttle reuses one challenge_id) hit
   `ErrTooManyAttempts` and the UI rendered "code expired or already
   used" even though the fixed code was correct. Reorder so the
   dev-fixed-code branch runs first and bypasses both the bcrypt
   verify and the attempts gate. Production stays unaffected
   because production loaders refuse to set `DevFixedCode`.

2. `dev-deploy.yaml` only fires on push to `development`, so the
   matching docker-compose default change for
   `BACKEND_AUTH_DEV_FIXED_CODE` could not reach the running stack
   before this PR merged. Add `workflow_dispatch: {}` so a developer
   can deploy any branch — typically a feature branch under review —
   from the Gitea Actions UI without waiting for the merge.

Covered by a new `TestConfirmEmailCodeDevFixedCodeBypassesAttemptsCeiling`
integration test that burns through the ceiling with wrong codes
then proves the dev-fixed code still produces a session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 21:28:30 +02:00
Ilia Denisov 1a0e3e992f ci/ui-test: queue runs in one bucket instead of cancelling
Tests · UI / test (push) Waiting to run
Tests · UI / test (pull_request) Successful in 2m20s
`cancel-in-progress: true` killed run #73 even though it was the
only ui-test in its concurrency group — Gitea appears to cancel the
in-progress job on its own under that setting in some edge cases.

Switch to a singleton group with `cancel-in-progress: false`. The
new behaviour is simple queueing: only one ui-test workflow runs at
a time across the repository, the rest wait. Vite-on-:5173 cannot
collide because there is never a second ui-test alive. The wall-time
hit is bounded — ui-test is ~2 minutes — and bursts are rare enough
that queueing is cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:51:54 +02:00
Ilia Denisov 6e6186a571 ci/ui-test: key concurrency by head sha, not gitea.ref
Tests · UI / test (push) Has been cancelled
Tests · UI / test (pull_request) Successful in 2m17s
`gitea.ref` differs between push (`refs/heads/<branch>`) and
pull_request (`refs/pull/N/head`) events even for the same commit,
so the two parallel runs land in different concurrency groups and
the Vite-on-:5173 collision is not suppressed. Switching the key to
the head sha (`gitea.event.pull_request.head.sha || gitea.sha`)
collapses both events into one bucket, leaving exactly one ui-test
alive per commit.
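The key expression's fallback behaviour can be sketched in Go. The `ui-test-` prefix is assumed to carry over from the earlier `ui-test-${{ gitea.ref }}` key, and the SHAs are hypothetical. On the PR event `gitea.sha` may hold a different commit, which is why the head SHA takes priority:

```go
package main

import "fmt"

// concurrencyKey mirrors `gitea.event.pull_request.head.sha || gitea.sha`:
// the PR head SHA when the event carries one, else the pushed commit SHA.
func concurrencyKey(prHeadSHA, sha string) string {
	if prHeadSHA != "" {
		return "ui-test-" + prHeadSHA
	}
	return "ui-test-" + sha
}

func main() {
	// A push and a pull_request event for the same commit share one group.
	fmt.Println(concurrencyKey("", "e3bb302"))        // push event
	fmt.Println(concurrencyKey("e3bb302", "9f0c1aa")) // pull_request event
}
```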

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:46:00 +02:00
Ilia Denisov e3bb30201d ci/ui-test: serialise per-ref + clear stale Vite before Playwright
Tests · UI / test (pull_request) Failing after 6s
Tests · UI / test (push) Successful in 2m21s
Two ui-test jobs cannot coexist on the same host: Playwright's
`webServer` spec spawns `pnpm dev` on :5173, and on a host-mode
runner the port lives in the host namespace shared by every job.
ui-test #67 hit "Error: http://localhost:5173 is already used"
because a parallel job's Vite still held the port.

Two changes:

1. `concurrency: ui-test-${{ gitea.ref }}` with `cancel-in-progress:
   true`. New push/PR runs against the same ref kill any earlier
   ui-test before starting, so we never have two `pnpm dev`s alive
   at once.
2. `pkill -f 'vite dev' || true` plus `fuser -k 5173/tcp` right
   before Playwright. Defence in depth in case the concurrency
   cancellation does not reap the spawned shell promptly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 08:42:08 +02:00
Ilia Denisov 2a95bf4a50 ci: re-enable actions cache now that the runner serves it
Tests · UI / test (push) Successful in 2m20s
Tests · Go / test (push) Failing after 2m21s
Tests · Go / test (pull_request) Successful in 1m40s
Tests · Integration / integration (pull_request) Successful in 1m46s
Tests · UI / test (pull_request) Successful in 2m2s
The Gitea Actions cache service now answers on 10.200.0.1:43513
(post nftables fix on the runner side). Turn `cache: true` and
`cache: pnpm` back on so setup-go/setup-node can use it for
cross-job tarball caching on top of the host-persistent caches we
already rely on.

The setup-* actions still tolerate the cache being unavailable, so
this is reversible to `cache: false` if the service goes away again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:39:39 +02:00
Ilia Denisov 8058f26397 ci: drop cache: setting in setup-go/setup-node
Tests · Go / test (push) Successful in 2m21s
Tests · UI / test (push) Successful in 2m22s
Tests · Go / test (pull_request) Successful in 3m14s
Tests · Integration / integration (pull_request) Successful in 1m37s
Tests · UI / test (pull_request) Successful in 2m7s
`cache: true` (setup-go) and `cache: pnpm` (setup-node) make the
actions push and pull tarballs through the Gitea Actions cache
service at 192.168.0.222:43513. That endpoint currently does not
answer, so every workflow burns minutes per run on reserveCache
retries before the action gives up.

In host-mode the real caches live under the runner user's $HOME
(~/go/pkg/mod, ~/.cache/go-build, ~/.local/share/pnpm,
~/.cache/ms-playwright) and persist between jobs without any
actions/cache plumbing. Turning `cache:` off avoids the zombie
retries and uses the local disk caches the runner already has warm.

Reviving the cache service is a separate TODO. Until then this is
the simpler and faster baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:39:22 +02:00
Ilia Denisov 9135991887 ci/ui-test: drop --with-deps now that runner is host-mode
Tests · Go / test (pull_request) Successful in 2m6s
Tests · UI / test (push) Failing after 2m32s
Tests · Integration / integration (pull_request) Successful in 1m52s
Tests · UI / test (pull_request) Successful in 2m3s
`playwright install --with-deps` shells out to `sudo apt-get install`
for the system libraries that headless browsers need. In a job
container that runs as root this is silent; on a host-mode runner sudo
prompts for a password that nobody can enter in a non-interactive job,
fails three times, and the step exits 1.

Drop --with-deps. The system .so libraries are installed once on the
host via `pnpm exec playwright install-deps` (or the equivalent
apt-get incantation); workflow runs only need to fetch the browser
binaries themselves, which live under the runner user's home and
need no privilege.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:59:45 +02:00
Ilia Denisov 4a88b24f4b ci: drop GIT_SSL_NO_VERIFY now that runner is host-mode
The act_runner now executes jobs natively on the host (no per-job
container), so actions/checkout uses the host's system CA store,
which already trusts the host-Caddy root CA. The workaround that
disabled TLS verification for `git fetch` is no longer needed and
just hides legitimate cert issues if they ever appear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:04:11 +02:00
Ilia Denisov 9ebb2e7f0f ci: rename workflows for Gitea UI readability
Tests · Go / test (push) Successful in 2m31s
Tests · Integration / integration (pull_request) Successful in 2m23s
Tests · Go / test (pull_request) Successful in 2m50s
Tests · UI / test (push) Successful in 13m2s
Tests · UI / test (pull_request) Successful in 13m22s
Switches the `name:` field on every workflow to the middle-dot (`·`) style:

  Tests · Go            (go-unit.yaml)
  Tests · UI            (ui-test.yaml)
  Tests · Integration   (integration.yaml)
  Deploy · Dev          (dev-deploy.yaml)
  Build · Prod          (prod-build.yaml)
  Deploy · Prod         (deploy-prod.yaml)

File names stay the same so existing path filters and any URL
references continue to work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:22:53 +02:00
Ilia Denisov 0da360a644 dev-deploy: fix backend startup in CI
Two bugs surfaced on the first real merge into development:

1. `${{ env.HOME }}` evaluates to an empty string at the workflow stage,
   so GALAXY_DEV_GAME_STATE_DIR became `/.galaxy-dev/game-state`.
   Resolve in the shell instead of YAML.

2. The compose bind-mount of GeoIP2-Country-Test.mmdb referenced a
   path inside the runner's workspace volume, which the host Docker
   daemon cannot see — it created an empty directory and the backend
   crashed with "geoip database: is a directory" in a restart loop.
   Bake the file into the backend image so dev-deploy no longer needs
   a bind-mount; local-dev compose still mounts it on top for swap-in
   during development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 00:22:16 +02:00
Ilia Denisov c6c5f3c8dd ci: skip TLS verify for actions/checkout on LAN Gitea
go-unit / test (push) Successful in 2m28s
go-unit / test (pull_request) Successful in 2m30s
integration / integration (pull_request) Successful in 2m20s
ui-test / test (push) Successful in 13m5s
ui-test / test (pull_request) Successful in 14m31s
The Gitea host serves https://gitea.iliadenisov.ru with a cert signed
by host-Caddy's internal CA, which the runner-image's CA bundle does
not trust. actions/checkout@v4 fails on `git fetch` as a result, so
every workflow on gitea.lan has been failing — visible only now that
we made gitea.lan the primary CI target.

Sets GIT_SSL_NO_VERIFY=true on every workflow as a quick fix. Safe in
practice because both endpoints sit on the same LAN. The long-term
fix is to bake the Caddy root CA into the runner image and drop this
env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 23:43:51 +02:00
Ilia Denisov f316952c12 ci: split workflows for linear development flow
Reshapes .gitea/workflows/ around the new main ← development ←
feature/* branching model:

- go-unit.yaml — Go unit tests, runs on push/PR matching Go paths
- ui-test.yaml — narrowed to Vitest + Playwright only (Go tests now
  live in go-unit.yaml)
- integration.yaml — testcontainers suite, fires on PR to
  development/main and on push to development
- dev-deploy.yaml — builds the stack and (re)deploys tools/dev-deploy/
  on every merge into development
- prod-build.yaml — builds prod images on push to main and uploads
  docker save bundles as artifacts (30-day retention)
- deploy-prod.yaml — workflow_dispatch placeholder for the future
  SSH-based rollout

ui-release.yaml is removed; its v* tag trigger is superseded by
prod-build.yaml plus the manual deploy-prod entry point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 23:26:46 +02:00
Ilia Denisov 39b7b2ef29 ci: skip docs-only triggers; document per-stage local-ci gate
ui-test workflow gains a `!**/*.md` negation so commits touching only
markdown (READMEs, PLAN.md updates, topic docs) no longer kick off the
full Go + Vitest + Playwright pipeline. Mixed commits keep triggering
because at least one positive path (`ui/**`, `gateway/**`, …) still
matches.
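The intended trigger rule (run if at least one changed file matches a positive pattern and is not markdown) can be approximated in Go. This simplifies glob matching to prefix and suffix checks, abbreviates the positive list to two entries, and uses hypothetical file paths. Gitea's actual `paths` evaluation is glob-based and order-sensitive:

```go
package main

import (
	"fmt"
	"strings"
)

// positivePrefixes stands in for the positive globs (`ui/**`, `gateway/**`, ...).
var positivePrefixes = []string{"ui/", "gateway/"}

// shouldTrigger approximates `paths:` ending in `!**/*.md`.
func shouldTrigger(changed []string) bool {
	for _, f := range changed {
		if strings.HasSuffix(f, ".md") {
			continue // excluded by the markdown negation
		}
		for _, p := range positivePrefixes {
			if strings.HasPrefix(f, p) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(shouldTrigger([]string{"ui/PLAN.md", "ui/docs/testing.md"}))     // docs-only
	fmt.Println(shouldTrigger([]string{"ui/PLAN.md", "ui/frontend/src/app.ts"})) // mixed
}
```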

Project CLAUDE.md adds a per-stage CI gate section so the local
Gitea Actions runner is exercised at the close of every stage from
any PLAN.md, with the push step pre-authorised.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 09:47:27 +02:00
Ilia Denisov dc1c9b109c phase 3 2026-05-07 09:40:37 +02:00
Ilia Denisov 1b5749bd31 fix: make ci green on a fresh runner
Two issues surfaced by the first end-to-end ui-test.yaml run on a
clean Linux runner that don't reproduce locally:

- pkg/geoip tests load fixtures from the pkg/geoip/test-data git
  submodule (MaxMind-DB). actions/checkout@v4 does not fetch
  submodules by default, so the fixture path is missing on the
  runner. Both ui-test and ui-release workflows now check out with
  submodules: recursive.

- pkg/util/TestWritable asserts that /usr/lib is not writable, which
  holds for unprivileged users but fails inside the catthehacker
  workflow container that runs as root. Skip that branch when
  os.Geteuid() == 0; under root, the "the writable dir is writable"
  branch still runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 08:35:34 +02:00
Ilia Denisov 7450006ed3 phase 2: ui testing infrastructure
Vitest + @testing-library/jest-dom matchers wired through tests/setup.ts.
Playwright with four projects: chromium-desktop, webkit-desktop,
chromium-mobile-iphone-13, chromium-mobile-pixel-5; traces and
screenshots retained on failure.

.gitea/workflows/ui-test.yaml runs Tier 1 on every push and pull
request: monorepo Go service tests (backend with -p 1 to dodge
testcontainer contention; gateway, game, every pkg/<name> module),
pnpm install --frozen-lockfile, playwright install --with-deps,
pnpm test, pnpm exec playwright test. Uploads playwright-report
and test-results on failure. Integration suite stays gated behind
make -C integration integration; deprecated client/ excluded.

.gitea/workflows/ui-release.yaml mirrors Tier 1 on v* tag push and
keeps commented placeholders for visual regression (Phase 33) and
macOS iOS smoke (Phase 32).

ui/docs/testing.md documents both tiers and the local invocations
that mirror what CI runs. ui/PLAN.md Phase 2 marked done; Phase 3
gains a bullet to extend the go test command with ./ui/core/...;
Phase 36 has the renamed release workflow path.

tools/local-ci/ ships a self-contained docker-compose for verifying
workflows against a local Gitea + arm64 act_runner before pushing
to a real instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 08:24:44 +02:00