scrabble-game/PRERELEASE.md
Ilia Denisov 7e75c32d07
R3: dashboards, docs and tracker bake-back
- Edge/UX dashboard: aggregate request-rate vs rejection-rate panel
  (gateway_rate_limited_total by class; no per-user labels).
- ARCHITECTURE §2/§11/§12/§13: body cap + explicit h2c sizing, the rate-limit
  observability pipeline and auto-flag policy, the admin-limiter note (and the
  caddy-path gap), the landing container topology; fixed the stale 120/min
  per-user figure.
- FUNCTIONAL (+_ru): the Throttled view and the reversible high-rate flag.
- gateway/backend/deploy READMEs, TESTING.md, root CLAUDE.md updated.
- PRERELEASE.md: R3 interview decisions + implementation refinements logged;
  tracker R3 -> done (this PR implements it; CI gates the merge).
2026-06-10 05:12:30 +02:00


Pre-release plan — hardening before Stage 18

Living tracker for the pre-release hardening pass that runs before Stage 18 (the prod cutover). Same discipline as PLAN.md: one phase per session, interview the owner on the open details at the start of each phase, bake every decision back into PLAN.md / docs/ / the affected READMEs / Go Doc comments in the same PR, get CI green, then mark the phase done. Phases run as feature/* → development PRs (the Stage 16 branch model); the owner approves+merges.

Why now: the system is feature-complete through Stage 17 and the test contour is green, but there is no prod data yet — schema, wire labels and the dictionary layout can still change for free. These phases spend that one-time freedom and harden the edge before prod. Each phase maps back to the owner's raw pre-release TODO list (numbers in the tracker).

Phase tracker

| # | Phase | Raw TODOs | Status |
| --- | --- | --- | --- |
| R1 | Schema & naming reset | 1 + 10 | done |
| R2 | Stress harness + contour observability + early run | 9a | done |
| R3 | Edge hardening | 2 + 8 + 3 | done |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| | Stage 18 (prod contour deploy) | | see PLAN.md |

Key findings (these reshaped the raw list — read before starting a phase)

  • R1 (TODO 1 + 10) is one cheap moment, now. Squashing the 12 goose migrations is safe precisely because there is no prod data and the contour DB is wiped. Folding the new variant labels (scrabble_ru/scrabble_en/erudit_ru) into that single baseline makes the rename need no data migration and no back-compat mapping. Today's labels (english/russian_scrabble/erudit) are persisted in games.variant, game_invitations.variant, in pkg/fbs and the UI — ~100 files, but a mechanical sweep on a clean DB.
  • R4 (TODO 4 + 5): the app is already push-first. Game state refreshes on your_turn/opponent_moved, the lobby on notify, chat on chat_message. The only genuine periodic server poll is lobby.poll (matchmaking, 2.5 s, ui/src/screens/NewGame.svelte). What remains is killing that one poll and enriching push events to carry payloads so the UI stops re-fetching after each signal.
  • R3 (TODO 2): identity forgery is already mitigated. Identity is always derived from the session (Authorization: Bearer → X-User-ID); the client cannot inject identity, the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real gaps are a missing request-body size limit (cheap DoS) and invisible rate-limit rejections (no log/metric/admin view; that is TODO 8). Static landing serving is not covered by the gateway token bucket (it only guards Execute).
  • R6 (TODO 7) scale: ~431 Stage N references across ~104 files (incl. the file name backend/internal/inttest/stage6_test.go). Code is the source of truth; docs/ describe current state; PLAN.md keeps the decision history.

Locked decisions (owner interview)

  • Stress test (TODO 9): early + final runs. Driver = edge protocol (Connect/FB through the gateway, moves generated by the solver) plus a separate gateway-hammer saturation test. Pacing = realistic (under limits) + saturation (ramp to the knee). Resource metrics = add cAdvisor + postgres_exporter to the contour (today only Go-runtime metrics exist). The harness stays in the repo for repeats.
  • Push (TODO 4 + 5): both — kill lobby.poll (use the existing match_found, keep poll as the ws-down fallback) and enrich push events with payloads.
  • Refactor (TODO 7): hygiene + structural changes by a reviewed list — behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
  • Landing (TODO 3): separate static container behind the project caddy (/ → landing, /app/ + /telegram/ → gateway); drop landing.html from the gateway go:embed.
  • Rate-abuse (TODO 8): metric + Grafana + admin view plus a conservative auto-flag — a soft, reversible "suspected high-rate" marker for operator review, tunable threshold, no auto-ban.

Phases

Each phase: read this tracker + the relevant docs/, interview the owner on the open details below, implement within scope, then update the tracker + docs/code and get CI green before marking it done.

R1 — Schema & naming reset (TODO 1 + 10) — first

Squash backend/internal/postgres/migrations/00001..00012 into one 00001_baseline.sql (method: pg_dump --schema-only from a fully-migrated DB → wrap as the goose baseline → prove a fresh migrate yields a schema identical to the 12-migration chain via the integration suite → delete the old files; keep goose). Bake the new variant labels into the baseline. Propagate scrabble_ru/scrabble_en/erudit_ru through the backend (engine.Variant/ParseVariant, registry.dictFiles, the CHECK values), the wire (pkg/fbs variant:string, regenerate FB) and the UI (lib/model.ts union, variants.ts, fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy ../scrabble-dictionary to a single source→dawg build point and align the dawg artifact names to the new labels (crosses into ../scrabble-solver's committed fixtures — keep them byte-identical). After merge, wipe the contour DB (drop the volume) so it re-provisions on the next deploy.

  • Critical files: backend/internal/postgres/migrations/, backend/internal/engine/{engine,registry}.go, pkg/fbs/scrabble.fbs, ui/src/lib/{model,variants}.ts, ../scrabble-dictionary/{Makefile,cmd/builddict,…}.
  • Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is one PR or split; how to script the contour DB wipe in the deploy.
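
A minimal Go sketch of how the label rename could look at the engine.Variant level. The VariantEnglish identifier and the three new labels come from this document; VariantRussian, VariantErudit and the map-based lookup are assumptions made for illustration, not the actual engine code.

```go
package main

import "fmt"

// Variant is an illustrative stand-in for engine.Variant.
type Variant int

const (
	VariantEnglish Variant = iota // identifier kept; wire label becomes scrabble_en
	VariantRussian                // hypothetical identifier
	VariantErudit                 // hypothetical identifier
)

// variantLabels holds the new persisted/wire labels.
var variantLabels = map[Variant]string{
	VariantEnglish: "scrabble_en",
	VariantRussian: "scrabble_ru",
	VariantErudit:  "erudit_ru",
}

// String returns the wire label, which is also what a CHECK constraint would list.
func (v Variant) String() string {
	if s, ok := variantLabels[v]; ok {
		return s
	}
	return fmt.Sprintf("Variant(%d)", int(v))
}

// ParseVariant accepts only the new labels. The old ones are rejected,
// matching the "no back-compat mapping" decision.
func ParseVariant(s string) (Variant, error) {
	for v, label := range variantLabels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	for _, s := range []string{"scrabble_en", "erudit_ru", "english"} {
		v, err := ParseVariant(s)
		fmt.Println(s, v, err)
	}
}
```

Because String is the single source of the label, anything derived from it changes together with the rename.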

R2 — Stress harness + contour observability + early run (TODO 9, part 1)

Build the reusable load harness as a new loadtest module in go.work (reuses pkg/fbs, connect-go, and scrabble-solver for legal-move generation): a seeder that inserts 1000 guest + 10000 durable accounts with pre-created sessions (token hashes) directly in the DB and hands the plaintext tokens to the client; a driver that runs N virtual users, each in 3–5 concurrent 2–4-player games, exercising submit-play / pass / exchange / nudge / chat / check-word / draft-move / profile-save through the edge protocol, in realistic (under rate limits) and saturation (ramp) modes; plus a separate gateway-hammer that deliberately exceeds limits to verify the limiter holds and measure its cost. Add cAdvisor + postgres_exporter to deploy/docker-compose.yml and a Grafana resource dashboard. Run the early pass against the freshly-wiped contour; produce a trip report (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.

  • Critical files: new loadtest/, deploy/docker-compose.yml, deploy/observability/*, docs/TESTING.md.
  • Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move for realistic game progress); run duration; the pass/fail bar.
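
The two driver policies named above (a mid-ranked solver move, a stepped ramp) can be sketched in Go as follows. This is a sketch under assumptions: the Move type, the sample words and scores, and both function names are hypothetical, and the real scrabble-solver API is not shown in this document. The 50 → 200 → 500 steps at 10 min each are the values later locked in the R2 refinements.

```go
package main

import (
	"fmt"
	"sort"
)

// Move is a hypothetical solver result.
type Move struct {
	Word  string
	Score int
}

// pickMidRanked returns the median-scoring legal move. It reports false when
// there is no legal move, in which case a driver would pass or exchange.
func pickMidRanked(moves []Move) (Move, bool) {
	if len(moves) == 0 {
		return Move{}, false
	}
	ranked := append([]Move(nil), moves...) // do not reorder the caller's slice
	sort.SliceStable(ranked, func(i, j int) bool { return ranked[i].Score > ranked[j].Score })
	return ranked[len(ranked)/2], true
}

// RampStep holds a target player count for a number of minutes.
type RampStep struct {
	Players int
	Minutes int
}

// ramp mirrors the moderate ramp recorded for R2.
var ramp = []RampStep{{50, 10}, {200, 10}, {500, 10}}

// targetPlayers returns the player count for the given elapsed minutes,
// or 0 once the ramp has finished.
func targetPlayers(steps []RampStep, elapsedMin int) int {
	end := 0
	for _, s := range steps {
		end += s.Minutes
		if elapsedMin < end {
			return s.Players
		}
	}
	return 0
}

func main() {
	moves := []Move{{"QI", 11}, {"QUIZ", 44}, {"ZIT", 24}} // made-up sample data
	if m, ok := pickMidRanked(moves); ok {
		fmt.Println(m.Word, m.Score)
	}
	fmt.Println(targetPlayers(ramp, 0), targetPlayers(ramp, 15), targetPlayers(ramp, 30))
}
```

Picking the median rather than the best move keeps games progressing at a plausible pace without making every virtual user play like the solver.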

R3 — Edge hardening (TODO 2 + 8 + 3)

Add a request-body size cap at the gateway h2c mux / Execute (e.g. ~1 MB). Add rate-limit observability: a gateway_rate_limited_total{class} counter + a structured log per rejection; an aggregate Grafana panel (request rate + rejection rate, so spikes are visible without per-user label cardinality, honouring the Stage 12/17 discipline); an admin-console view of recently throttled users/IPs (in-memory ring buffer, single-instance, reset-on-restart, like the active_users gauge). Add the conservative auto-flag: when a user is sustained-throttled past a tunable threshold, set a soft, reversible account.flagged_high_rate_at marker (baked into the R1 baseline) surfaced in the admin user list/detail. There is no auto-ban; the operator clears the marker. Split the landing into its own static container (deploy/ + a Caddyfile route / → landing) and drop landing.html from the gateway go:embed.

  • Critical files: gateway/internal/connectsrv/server.go, gateway/internal/ratelimit/, gateway/internal/connectsrv/metrics.go, backend/internal/adminconsole/, deploy/caddy/Caddyfile, deploy/docker-compose.yml, gateway/internal/webui/.
  • Open details: the auto-flag threshold/window + whether the marker is persisted vs in-memory; the landing image base (caddy vs nginx).
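
A minimal sketch of the body cap using the standard library's http.MaxBytesReader, the mechanism this tracker later records for the public mux. The names capBody, drain and statusFor are hypothetical. The plain HTTP 413 reply is a stand-in chosen for the sketch, since the tracker records that an oversized Execute surfaces as resource_exhausted on the Connect path.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxBodyBytes = 1 << 20 // 1 MiB, the default recorded for GATEWAY_MAX_BODY_BYTES

// capBody wraps every request body so that reads fail once limit bytes are exceeded.
func capBody(limit int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		next.ServeHTTP(w, r)
	})
}

// drain is a stand-in handler that consumes the whole body.
func drain(w http.ResponseWriter, r *http.Request) {
	if _, err := io.Copy(io.Discard, r.Body); err != nil {
		var tooBig *http.MaxBytesError
		if errors.As(err, &tooBig) {
			http.Error(w, "body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}

// statusFor posts n bytes through the capped handler and returns the HTTP status.
func statusFor(n int) int {
	body := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	rec := httptest.NewRecorder()
	capBody(maxBodyBytes, http.HandlerFunc(drain)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(1024), statusFor(maxBodyBytes+1))
}
```

Wrapping at the mux level bounds every handler behind it, so a handler that forgets to limit its own reads is still covered.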

R4 — Push enrichment + kill the last poll (TODO 4 + 5)

Replace lobby.poll with the existing match_found push (keep the poll as a ws-down fallback). Enrich your_turn/opponent_moved/notify to carry the state payload so the UI renders from the event without a follow-up game.state (removes the lobby↔game nav latency the owner noticed). Wire-contract change: pkg/fbs event payloads → backend notify emit → UI stream consumers (ui/src/lib/app.svelte.ts), with the per-game cache as the landing spot; regenerate FB.

  • Critical files: pkg/fbs/scrabble.fbs, backend/internal/notify/events.go, ui/src/lib/{app.svelte,transport}.ts, ui/src/screens/NewGame.svelte.
  • Open details: which events carry full vs delta payloads; the fallback-poll cadence when the stream is down.

R5 — Bundle slimming (TODO 6)

Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by language via dynamic imports; re-measure against the existing 100 KB-gzip budget (ui/scripts/bundle-size.mjs, ~82 KB today). If the win is marginal, stop — acceptable per the owner.

  • Critical files: ui/src/App.svelte, ui/vite.config.ts, ui/src/lib/i18n/.

R6 — Refactor + docs reconciliation + de-staging (TODO 7) — near last

Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical de-staging — remove Stage N/TODO-N references from code, comments and service READMEs (rename stage6_test.go); (b) docs↔code reconciliation — reconcile docs/ARCHITECTURE.md / docs/FUNCTIONAL.md(+_ru) against the code-as-truth, fixing drift and Go Doc comments; (c) structural changes by a reviewed list — surface a list of proposed optimizations / test-suite consolidations to the owner, apply only the approved, behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the regression gate. Incorporates the early-run (R2) bug fixes not already shipped.

  • Open details: the structural-changes list itself (owner-approved before applying); the test consolidation targets.

R7 — Final stress run + tuning (TODO 9, part 2) — before Stage 18

Re-run the R2 harness against the final, refactored system on a clean contour; analyse resource consumption across all components (gateway, backend, Postgres, the metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record the methodology + results in the repo.

Stage 18 (prod contour) then proceeds per PLAN.md.

Sequencing rationale

R1 first (cheapest now; everything builds on the final schema/naming and the stress test must run against it). R2 builds the harness and runs the early pass to surface bugs and a resource baseline that feed R3 and R6. R3/R4/R5 harden and improve the system. R6 (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and benefits from all accumulated bug knowledge. R7 validates the final system and tunes it. Then Stage 18.

Regression-safety discipline (cross-cutting)

  • Every phase is a feature/* → development PR; CI (unit + integration + ui behind the CI / gate check) must be green before the owner merges; watch the post-merge contour deploy with gitea-ci-watch.py.
  • R6 structural changes are behaviour-preserving, test-gated, and split from the mechanical sweeps; contentious items are owner-approved first.
  • The two stress runs (R2 early, R7 final) are the system-level regression gate.

Verification (per phase)

  • go build ./<module>/..., go vet, gofmt -l . clean, go test -count=1 ./<module>/...; UI: pnpm check && pnpm test:unit && pnpm build; the integration suite (-tags integration) for DB/schema changes; docker compose config for deploy changes; green CI on the PR + a healthy contour deploy.
  • R1: prove the squashed baseline yields a schema identical to the 12-migration chain (integration suite on a fresh DB) before deleting the old files.
  • R2/R7: the harness runs end-to-end against the contour; the trip report lists concrete defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.
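
For the R1 schema-identity check, one way to compare two pg_dump --schema-only outputs is a normalised line diff. The Go sketch below is illustrative only: normalize and onlyIn are hypothetical helpers, the two dump strings are made-up fragments, and the actual verification in this repo is the integration suite plus a manual dump diff.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// normalize drops blank lines and SQL line comments and trims whitespace,
// so that dump headers do not show up as differences.
func normalize(dump string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "--") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// onlyIn returns the lines of a that have no matching line in b
// (a multiset difference, so duplicated lines are counted).
func onlyIn(a, b []string) []string {
	seen := map[string]int{}
	for _, l := range b {
		seen[l]++
	}
	var out []string
	for _, l := range a {
		if seen[l] > 0 {
			seen[l]--
			continue
		}
		out = append(out, l)
	}
	return out
}

func main() {
	// Made-up fragments standing in for the two dumps.
	chain := "-- from the 12-migration chain\nCREATE TABLE games (\n  variant text CHECK (variant IN ('english'))\n);\n"
	baseline := "-- from 00001_baseline.sql\nCREATE TABLE games (\n  variant text CHECK (variant IN ('scrabble_en'))\n);\n"

	a, b := normalize(chain), normalize(baseline)
	fmt.Println("only in chain:   ", onlyIn(a, b))
	fmt.Println("only in baseline:", onlyIn(b, a))
}
```

An acceptable result is one where the only reported lines are the intended variant CHECK changes.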

Refinements logged during implementation

  • R1 (interview + implementation):

    • Variant labels english/russian_scrabble/erudit → scrabble_en/scrabble_ru/erudit_ru across the backend (engine.Variant.String/ParseVariant; the games/game_invitations variant CHECK in the baseline; GCG #lexicon and the variant metric attribute both flow from String), the wire (pkg/fbs variant is a string field, so values change with no FlatBuffers regen) and the UI (model.ts union, variants.ts records, codec/premiums/mocks/tests, the admin dictionary.gohtml). Kept: the Go enum identifiers (VariantEnglish…, internal) and the i18n display keys (new.english/new.russian/new.erudit, display-only). complaints.variant stays free-text (no CHECK, as before).
    • dawg filenames kept descriptive (en_sowpods/ru_scrabble/ru_erudit) — only the registry's Variant key carries the rename, so registry.go, the published scrabble-solver fixtures and the dictionary release artifact are untouched (decouples the three repos).
    • Migrations squashed 12 → one hand-written 00001_baseline.sql. Verified by a pg_dump --schema-only diff (the chain vs the baseline are identical but for the two intended variant-CHECK values) plus the green integration suite. No data migration (no production data).
    • Done (cross-repo + contour): the scrabble-dictionary tidy merged (PR #2) and was re-cut as the byte-identical v1.0.1 release for clean provenance (the backend stays on v1.0.0 — same bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the contour backend schema was wiped (DROP SCHEMA backend CASCADE + restart, not a volume drop) and re-migrated to the baseline — verified the new variant CHECK (scrabble_en/scrabble_ru/erudit_ru), games=0 and a clean boot.
  • R2 (interview + implementation):

    • Locked decisions: game assembly via invitations (real path, no robots; not direct game-row inserts); moderate ramp 50 → 200 → 500 at 10 min/step; diagnostic pass bar (no SLO gate); run as a one-shot container on scrabble-internal in this PR.
    • Harness = new scrabble/loadtest module (use ./loadtest + a replace scrabble/gateway for the dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions directly in Postgres (token hash mirrors backend/internal/session), drives players over the edge protocol, generates mid-ranked legal moves locally with the embedded scrabble-solver by replaying game.history (the edge carries no board — mirrors engine.ReplayBoard via the public API), and a gateway-hammer. Compact CLI (run / cleanup), distroless Dockerfile (DAWGs baked), Go unit tests.
    • Adding the module broke the other images' builds — backend/gateway/telegram Dockerfiles reduce the workspace but still referenced ./loadtest (not in their context); each now also -dropuse=./loadtest (backend/telegram additionally -dropreplace the gateway replace). Caught by the first deploy run; verified by building all four images.
    • Harness payload fixes found by the smoke pass: the draft DTO's rack_order is a string (was sent as [] → bad_request); the display-name validator forbids digits/colons, so the cleanup marker became a letters-only Zzloadtest so profile.update resends the seeded name. chat_not_your_turn / nudge_own_turn are by-design turn gates, correctly exercised.
    • Observability: added cAdvisor + postgres_exporter + the Scrabble — Resources dashboard + two Prometheus jobs. Finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker breaks its layer-ID resolution — the existing galaxy deploy has the same limit), so per-container CPU/RSS for the early pass was captured via docker stats. R7: adopt the otelcol docker_stats receiver (already the contrib image) for per-container metrics in Grafana.
    • Early run (2026-06-09): ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000 accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's single shared http2.Transport; the harness itself peaked at 86 % of a core on the same host, so the figures are pessimistic. Full trip report in ../loadtest/REPORT-R2.md; it feeds R3 (h2c MaxConcurrentStreams/timeouts, body-size cap), R6 and R7 (per-player transports, separate hardware, pool/limit sizing).
    • CI: ./loadtest/... added to the path filter + vet/build/test; go.work.sum carries the new deps.
  • R3 (interview + implementation):

    • Locked decisions: the flag column lands by editing the R1 baseline (+ a contour schema wipe after merge — no migration chain accrues before prod); auto-flag defaults 1000 rejected / 10 min (BACKEND_HIGHRATE_FLAG_THRESHOLD/_WINDOW, rolling window, set-once, operator clears, no auto-ban); landing image = caddy:2-alpine; throttle data flows gateway → backend (a 30 s per-key summary POST to the new /api/v1/internal/ratelimit/report, the existing trusted direction) with the episode window + flag rule in the backend (internal/ratewatch); rejection logging = Warn summary per key per window + Debug per rejection — a deliberate deviation from the phase's "structured log per rejection" (the R2 hammer would have logged ~522k lines in minutes); all three R2-report tails included (explicit h2c sizing, the session-resolve failure cause at Warn, reviving the admin limiter).
    • Body cap: GATEWAY_MAX_BODY_BYTES (default 1 MiB) as both the Connect per-message read limit and an http.MaxBytesReader wrap of the public mux; an oversized Execute is resource_exhausted.
    • Dead config found: AdminPerMinute/AdminBurst were never wired — the gateway /_gm mount is now 429-guarded per IP ahead of its Basic-Auth. The caddy-fronted contour path stays unlimited (stock caddy has no limiter) — an accepted gap, recorded in docs/ARCHITECTURE.md §12.
    • Landing split: a landing target in gateway/Dockerfile (the UI build stage is shared; identical compose build args keep it one cached build); the gateway drops landing.html from the embed and 308-redirects / → /app/; the contour caddy routes /app/, /telegram/ and the Connect path to the gateway and the catch-all to the landing container; the CI deploy probe now checks both / (landing) and /app/ (gateway).
    • Observability: gateway_rate_limited_total{class} (user/public/email/admin, aggregate-only) + a rate-vs-rejections panel on the Edge/UX dashboard; the admin console gains the Throttled page (the in-memory episode window, reset-on-restart like active_users, plus the flagged-account queue) and the flag badge / clear action on the user list / card.
    • The jet regen also restored the previously missing game_drafts/game_hidden generated models (their tables were added after the last jetgen run; no behaviour change).
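
The auto-flag rule from the R3 locked decisions above (1000 rejected / 10 min, rolling window, set-once, operator clears) could be sketched in Go as follows. Everything named here (Watch, Report, Clear, flagEvents) is hypothetical and is not the internal/ratewatch API. Treating the threshold as inclusive is an assumption, and the sketch is not concurrency-safe.

```go
package main

import (
	"fmt"
	"time"
)

// summary is one periodic per-key rejection report from the gateway.
type summary struct {
	at       time.Time
	rejected int
}

// Watch keeps a rolling window of summaries per key plus the set-once flag.
type Watch struct {
	threshold int
	window    time.Duration
	recent    map[string][]summary
	flaggedAt map[string]time.Time
}

func NewWatch(threshold int, window time.Duration) *Watch {
	return &Watch{
		threshold: threshold,
		window:    window,
		recent:    map[string][]summary{},
		flaggedAt: map[string]time.Time{},
	}
}

// Report records one summary and returns true only when this call sets the flag.
func (w *Watch) Report(key string, rejected int, now time.Time) bool {
	cutoff := now.Add(-w.window)
	kept := w.recent[key][:0]
	for _, s := range w.recent[key] {
		if s.at.After(cutoff) { // drop summaries that fell out of the window
			kept = append(kept, s)
		}
	}
	kept = append(kept, summary{at: now, rejected: rejected})
	w.recent[key] = kept

	total := 0
	for _, s := range kept {
		total += s.rejected
	}
	if total < w.threshold { // assumption: the threshold is inclusive
		return false
	}
	if _, already := w.flaggedAt[key]; already {
		return false // set-once: stays flagged until the operator clears it
	}
	w.flaggedAt[key] = now
	return true
}

// Clear is the operator action; there is no automatic unflag or ban.
func (w *Watch) Clear(key string) { delete(w.flaggedAt, key) }

// flagEvents feeds one summary every stepSec seconds for a single key and
// returns the indices of the reports that set the flag.
func flagEvents(threshold, windowSec, stepSec int, rejected []int) []int {
	w := NewWatch(threshold, time.Duration(windowSec)*time.Second)
	start := time.Unix(0, 0)
	var events []int
	for i, n := range rejected {
		now := start.Add(time.Duration(i*stepSec) * time.Second)
		if w.Report("user-1", n, now) {
			events = append(events, i)
		}
	}
	return events
}

func main() {
	fmt.Println(flagEvents(1000, 600, 30, []int{600, 400, 500}))
}
```

Keeping the flag set-once means a sustained burst produces a single operator-review item rather than one per report.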