scrabble-game/PRERELEASE.md
Ilia Denisov d4ef951db9
R5: bundle slimming — retarget the budget to the app, no code slimming
Analysed the real dist (gzip + sourcemap attribution): the bundle is already minified + tree-shaken and dominated by the Connect/FlatBuffers transport runtime + generated bindings + the Svelte runtime (~2/3 of main), so no in-scope code slimming is warranted. Lazy-loading was rejected (bundle-size.mjs sums every chunk -> zero total-size win, plus +N gateway fetches of latency); i18n lazy-load and chunk-collapsing likewise (caching/HTTP2).

Instead bundle-size.mjs now measures per HTML entry with three independent gates (app entry <=100 KB, Svelte+i18n shared <=30 KB, landing-own <=5 KB): the app's real payload is its entry chunk + the shared chunk (~97 KB), never landing.js. Same CLI + exit-code contract, CI step unchanged. Fixed the stale ~82 KB figure in the script and ui/README.md. No app code change.
2026-06-10 15:11:45 +02:00


Pre-release plan — hardening before Stage 18

Living tracker for the pre-release hardening pass that runs before Stage 18 (the prod cutover). Same discipline as PLAN.md: one phase per session, interview the owner on the open details at the start of each phase, bake every decision back into PLAN.md / docs/ / the affected READMEs / Go Doc comments in the same PR, get CI green, then mark the phase done. Phases run as feature/* → development PRs (the Stage 16 branch model); the owner approves+merges.

Why now: the system is feature-complete through Stage 17 and the test contour is green, but there is no prod data yet — schema, wire labels and the dictionary layout can still change for free. These phases spend that one-time freedom and harden the edge before prod. Each phase maps back to the owner's raw pre-release TODO list (numbers in the tracker).

Phase tracker

| # | Phase | Raw TODOs | Status |
|---|---|---|---|
| R1 | Schema & naming reset | 1 + 10 | done |
| R2 | Stress harness + contour observability + early run | 9a | done |
| R3 | Edge hardening | 2 + 8 + 3 | done |
| R4 | Push enrichment + kill the last poll | 4 + 5 | done |
| R5 | Bundle slimming | 6 | done |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| | Stage 18 — prod contour deploy | | see PLAN.md |

Key findings (these reshaped the raw list — read before starting a phase)

  • R1 (TODO 1 + 10) is one cheap moment, now. Squashing the 12 goose migrations is safe precisely because there is no prod data and the contour DB is wiped. Folding the new variant labels (scrabble_ru/scrabble_en/erudit_ru) into that single baseline makes the rename need no data migration and no back-compat mapping. Today's labels (english/russian_scrabble/erudit) are persisted in games.variant, game_invitations.variant, in pkg/fbs and the UI — ~100 files, but a mechanical sweep on a clean DB.
  • R4 (TODO 4 + 5): the app is already push-first. Game state refreshes on your_turn/opponent_moved, the lobby on notify, chat on chat_message. The only genuine periodic server poll is lobby.poll (matchmaking, 2.5 s, ui/src/screens/NewGame.svelte). What remains is killing that one poll and enriching push events to carry payloads so the UI stops re-fetching after each signal.
  • R3 (TODO 2): identity forgery is already mitigated. Identity is always derived from the session (Authorization: Bearer → X-User-ID); the client cannot inject identity, the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real gaps are a missing request-body size limit (cheap DoS) and invisible rate-limit rejections (no log/metric/admin view — that is TODO 8). Static landing serving is not covered by the gateway token bucket (it only guards Execute).
  • R6 (TODO 7) scale: ~431 Stage N references across ~104 files (incl. the file name backend/internal/inttest/stage6_test.go). Code is the source of truth; docs/ describe current state; PLAN.md keeps the decision history.

Locked decisions (owner interview)

  • Stress test (TODO 9): early + final runs. Driver = edge protocol (Connect/FB through the gateway, moves generated by the solver) plus a separate gateway-hammer saturation test. Pacing = realistic (under limits) + saturation (ramp to the knee). Resource metrics = add cAdvisor + postgres_exporter to the contour (today only Go-runtime metrics exist). The harness stays in the repo for repeats.
  • Push (TODO 4 + 5): both — kill lobby.poll (use the existing match_found, keep poll as the ws-down fallback) and enrich push events with payloads.
  • Refactor (TODO 7): hygiene + structural changes by a reviewed list — behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
  • Landing (TODO 3): separate static container behind the project caddy (/ → landing, /app/ + /telegram/ → gateway); drop landing.html from the gateway go:embed.
  • Rate-abuse (TODO 8): metric + Grafana + admin view plus a conservative auto-flag — a soft, reversible "suspected high-rate" marker for operator review, tunable threshold, no auto-ban.
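
The auto-flag rule in the last decision can be sketched as a rolling-window counter. This is a minimal illustration only: the type and function names (flagRule, report, watcher, minute) are hypothetical, and the 1000-rejections / 10-minute defaults are the ones recorded in the R3 refinements further down, not a reading of the real internal/ratewatch code.

```go
package main

import (
	"fmt"
	"time"
)

// flagRule is a hypothetical rolling-window rule: an account is flagged once
// its rejected-request total inside Window reaches Threshold.
type flagRule struct {
	Threshold int
	Window    time.Duration
}

// defaultRule uses the defaults noted in the R3 refinements (1000 / 10 min).
var defaultRule = flagRule{Threshold: 1000, Window: 10 * time.Minute}

// report is one per-key summary: rejections counted up to the instant At.
type report struct {
	At       time.Time
	Rejected int
}

// watcher keeps the recent summaries for a single account.
type watcher struct {
	rule      flagRule
	reports   []report
	FlaggedAt *time.Time // set once; only the operator clears it
}

// Observe records a summary, drops summaries older than the window, and sets
// the soft marker the first time the windowed total reaches the threshold.
func (w *watcher) Observe(r report) {
	w.reports = append(w.reports, r)
	cutoff := r.At.Add(-w.rule.Window)
	kept := w.reports[:0]
	total := 0
	for _, old := range w.reports {
		if old.At.After(cutoff) {
			kept = append(kept, old)
			total += old.Rejected
		}
	}
	w.reports = kept
	if w.FlaggedAt == nil && total >= w.rule.Threshold {
		at := r.At
		w.FlaggedAt = &at
	}
}

// Clear is the operator action: the marker is reversible and never a ban.
func (w *watcher) Clear() {
	w.FlaggedAt = nil
	w.reports = nil
}

// minute is a demo helper returning a fixed instant plus n minutes.
func minute(n int) time.Time {
	return time.Date(2026, 6, 9, 12, 0, 0, 0, time.UTC).Add(time.Duration(n) * time.Minute)
}

func main() {
	w := &watcher{rule: defaultRule}
	w.Observe(report{At: minute(0), Rejected: 600})
	w.Observe(report{At: minute(4), Rejected: 500})
	fmt.Println("flagged:", w.FlaggedAt != nil)
}
```

Keeping the marker set-once means a sustained burst produces a single operator-visible event rather than a stream of them; clearing it also resets the window so the same burst does not immediately re-flag the account.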

Phases

Each phase: read this tracker + the relevant docs/, interview the owner on the open details below, implement within scope, then update the tracker + docs/code and get CI green before marking it done.

R1 — Schema & naming reset (TODO 1 + 10) — first

Squash backend/internal/postgres/migrations/00001..00012 into one 00001_baseline.sql (method: pg_dump --schema-only from a fully-migrated DB → wrap as the goose baseline → prove a fresh migrate yields a schema identical to the 12-migration chain via the integration suite → delete the old files; keep goose). Bake the new variant labels into the baseline. Propagate scrabble_ru/scrabble_en/erudit_ru through the backend (engine.Variant/ParseVariant, registry.dictFiles, the CHECK values), the wire (pkg/fbs variant:string, regenerate FB) and the UI (lib/model.ts union, variants.ts, fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy ../scrabble-dictionary to a single source→dawg build point and align the dawg artifact names to the new labels (crosses into ../scrabble-solver's committed fixtures — keep them byte-identical). After merge, wipe the contour DB (drop the volume) so it re-provisions on the next deploy.

  • Critical files: backend/internal/postgres/migrations/, backend/internal/engine/{engine,registry}.go, pkg/fbs/scrabble.fbs, ui/src/lib/{model,variants}.ts, ../scrabble-dictionary/{Makefile,cmd/builddict,…}.
  • Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is one PR or split; how to script the contour DB wipe in the deploy.
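
As an illustration of the label rename described above, the sketch below maps internal enum identifiers to the new wire labels and rejects the old ones outright, since the clean DB needs no back-compat mapping. Only VariantEnglish is named in this document; VariantRussian, VariantErudit and the variantLabels table are hypothetical names for the example.

```go
package main

import "fmt"

// Variant is the internal enum; its identifiers stay stable across the rename.
type Variant int

const (
	VariantEnglish Variant = iota
	VariantRussian         // hypothetical identifier
	VariantErudit          // hypothetical identifier
)

// variantLabels holds the new persisted/wire labels.
var variantLabels = map[Variant]string{
	VariantEnglish: "scrabble_en",
	VariantRussian: "scrabble_ru",
	VariantErudit:  "erudit_ru",
}

// String returns the persisted label for a variant.
func (v Variant) String() string {
	if s, ok := variantLabels[v]; ok {
		return s
	}
	return "unknown"
}

// ParseVariant accepts only the new labels; old labels such as "english"
// are errors because no back-compat mapping is kept.
func ParseVariant(s string) (Variant, error) {
	for v, label := range variantLabels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	v, err := ParseVariant("scrabble_ru")
	fmt.Println(v, err)
	_, err = ParseVariant("russian_scrabble")
	fmt.Println(err)
}
```

Because everything downstream reads the label through String, a single table change of this kind is what lets the CHECK values, the wire field and the metric attribute move together.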

R2 — Stress harness + contour observability + early run (TODO 9, part 1)

Build the reusable load harness as a new loadtest module in go.work (reuses pkg/fbs, connect-go, and scrabble-solver for legal-move generation): a seeder that inserts 1000 guest + 10000 durable accounts with pre-created sessions (token hashes) directly in the DB and hands the plaintext tokens to the client; a driver that runs N virtual users, each in 3–5 concurrent 2–4-player games, exercising submit-play / pass / exchange / nudge / chat / check-word / draft-move / profile-save through the edge protocol, in realistic (under rate limits) and saturation (ramp) modes; plus a separate gateway-hammer that deliberately exceeds limits to verify the limiter holds and measure its cost. Add cAdvisor + postgres_exporter to deploy/docker-compose.yml and a Grafana resource dashboard. Run the early pass against the freshly-wiped contour; produce a trip report (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.

  • Critical files: new loadtest/, deploy/docker-compose.yml, deploy/observability/*, docs/TESTING.md.
  • Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move for realistic game progress); run duration; the pass/fail bar.
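
One possible reading of the "mid-ranked solver move" policy is to sort the solver's candidates by score and take the median. The sketch below assumes exactly that; the move type and pickMidRanked are hypothetical and do not reflect the scrabble-solver API.

```go
package main

import (
	"fmt"
	"sort"
)

// move is a hypothetical candidate as a solver might return it.
type move struct {
	Word  string
	Score int
}

// pickMidRanked ranks candidates by score (best first) and returns the one in
// the middle of the ranking. ok is false when there is no legal move, in
// which case a driver would pass or exchange instead.
func pickMidRanked(cands []move) (m move, ok bool) {
	if len(cands) == 0 {
		return move{}, false
	}
	ranked := append([]move(nil), cands...) // copy: leave the input untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	return ranked[len(ranked)/2], true
}

func main() {
	cands := []move{{"QI", 11}, {"ZAX", 19}, {"AT", 2}}
	m, ok := pickMidRanked(cands)
	fmt.Println(m.Word, m.Score, ok)
}
```

A median pick keeps games progressing at a plausible pace: always playing the top move would drain the bag unrealistically fast, while always playing the worst would stall scoring.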

R3 — Edge hardening (TODO 2 + 8 + 3)

Add a request-body size cap at the gateway h2c mux / Execute (e.g. ~1 MB). Add rate-limit observability: a gateway_rate_limited_total{class} counter + a structured log per rejection; an aggregate Grafana panel (request rate + rejection rate — spikes visible without per-user label cardinality, honouring the Stage 12/17 discipline); an admin-console view of recently throttled users/IPs (in-memory ring buffer, single-instance, reset-on-restart, like the active_users gauge). Add the conservative auto-flag: when a user is sustained-throttled past a tunable threshold, set a soft, reversible account.flagged_high_rate_at marker (baked into the R1 baseline) surfaced in the admin user list/detail — no auto-ban; the operator clears it. Split the landing into its own static container (deploy/ + a Caddyfile route / → landing) and drop landing.html from the gateway go:embed.

  • Critical files: gateway/internal/connectsrv/server.go, gateway/internal/ratelimit/, gateway/internal/connectsrv/metrics.go, backend/internal/adminconsole/, deploy/caddy/Caddyfile, deploy/docker-compose.yml, gateway/internal/webui/.
  • Open details: the auto-flag threshold/window + whether the marker is persisted vs in-memory; the landing image base (caddy vs nginx).
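
The request-body cap can be sketched with the standard library's http.MaxBytesReader wrapped around a mux. This is an assumption-laden illustration: limitBody, drainBody and postStatus are hypothetical names, the plain 413 response stands in for whatever error mapping the gateway actually uses, and the Connect per-message read limit is not shown.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBodyBytes is 1 MiB, in line with the ~1 MB cap suggested above.
const maxBodyBytes = 1 << 20

// limitBody wraps a handler so that reading more than limit bytes of the
// request body fails with an error instead of buffering unbounded input.
func limitBody(next http.Handler, limit int64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		next.ServeHTTP(w, r)
	})
}

// drainBody is a stand-in handler that reads the whole body.
func drainBody(w http.ResponseWriter, r *http.Request) {
	n, err := io.Copy(io.Discard, r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	fmt.Fprintf(w, "read %d bytes", n)
}

// postStatus sends an n-byte body through the capped handler and returns the
// HTTP status code it produced.
func postStatus(n int) int {
	h := limitBody(http.HandlerFunc(drainBody), maxBodyBytes)
	body := strings.NewReader(strings.Repeat("x", n))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("small body:", postStatus(1024))
	fmt.Println("oversized body:", postStatus(maxBodyBytes+1))
}
```

Wrapping the mux rather than individual handlers means every public route inherits the cap, which is the property a cheap-DoS fix needs.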

R4 — Push enrichment + kill the last poll (TODO 4 + 5)

Replace lobby.poll with the existing match_found push (keep the poll as a ws-down fallback). Enrich your_turn/opponent_moved/notify to carry the state payload so the UI renders from the event without a follow-up game.state (removes the lobby↔game nav latency the owner noticed). Wire-contract change: pkg/fbs event payloads → backend notify emit → UI stream consumers (ui/src/lib/app.svelte.ts), with the per-game cache as the landing spot; regenerate FB.

  • Critical files: pkg/fbs/scrabble.fbs, backend/internal/notify/events.go, ui/src/lib/{app.svelte,transport}.ts, ui/src/screens/NewGame.svelte.
  • Open details: which events carry full vs delta payloads; the fallback-poll cadence when the stream is down.
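
The refinements log below settles on delta events applied to a per-game cache keyed on move_count. The real reducer is the UI's lib/gamedelta.ts; the sketch here only illustrates the idempotent, gap-safe rule in Go, with hypothetical field and type names.

```go
package main

import "fmt"

// moveDelta is a hypothetical enriched event: one new move plus the move
// count the game has after that move.
type moveDelta struct {
	MoveCount int
	Seat      int
	Score     int
}

// gameCache is the client-side per-game state the delta advances.
type gameCache struct {
	MoveCount int
	Totals    map[int]int // seat -> running score
}

type outcome int

const (
	applied   outcome = iota // cache advanced by exactly one move
	duplicate                // already seen: ignore (idempotent)
	gap                      // a move was missed: caller must refetch
)

// applyMoveDelta advances the cache only when the delta is the very next
// move. It never mutates the cache on a duplicate or a gap.
func applyMoveDelta(c *gameCache, d moveDelta) outcome {
	switch {
	case d.MoveCount <= c.MoveCount:
		return duplicate
	case d.MoveCount != c.MoveCount+1:
		return gap
	}
	c.MoveCount = d.MoveCount
	c.Totals[d.Seat] += d.Score
	return applied
}

func main() {
	c := &gameCache{MoveCount: 3, Totals: map[int]int{0: 40, 1: 35}}
	fmt.Println(applyMoveDelta(c, moveDelta{MoveCount: 4, Seat: 1, Score: 12}) == applied)
	fmt.Println(applyMoveDelta(c, moveDelta{MoveCount: 4, Seat: 1, Score: 12}) == duplicate)
	fmt.Println(applyMoveDelta(c, moveDelta{MoveCount: 6, Seat: 1, Score: 8}) == gap)
}
```

On a gap the caller falls back to the existing game.state + game.history refetch, so a dropped event costs one extra round trip rather than a corrupted board.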

R5 — Bundle slimming (TODO 6) — done

Analysed the bundle against the 100 KB-gzip budget; no code slimming was warranted, and the budget metric was retargeted to measure the app correctly. The build already minifies + tree-shakes; the dominant cost is the Connect/FlatBuffers transport runtime + generated bindings + the Svelte runtime (≈⅔ of main's source is third-party/generated) — irreducible within scope. Lazy-loading was rejected: bundle-size.mjs sums every emitted chunk, so code-splitting yields no total-size win and adds request latency (+N gateway fetches on first navigation to a split screen). i18n lazy-load was skipped (the catalogs are a sliver of a Svelte-runtime-dominated shared chunk, and en must stay bundled as the MessageKey type source + fallback). Instead, bundle-size.mjs now measures per HTML entry, with three independent gates on the natural chunk boundaries — app entry ≤ 100 KB, the Svelte+i18n shared chunk ≤ 30 KB, the landing's own chunk ≤ 5 KB — since the app's real payload is its entry chunk plus the shared chunk (≈97 KB), while the landing (≈24 KB) is reported separately and kept minimal. Same CLI + exit-code contract, so the CI step is unchanged.

  • Critical files: ui/scripts/bundle-size.mjs; no app code changed.
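
The three-gate check can be illustrated as follows. The real script is ui/scripts/bundle-size.mjs and parses the built HTML; this Go sketch only shows the gating arithmetic, using the approximate gzip sizes quoted in the R5 refinements (≈74, ≈23, ≈1.6 KB). The gate type and failing function are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// gate pairs one measured chunk with its own budget, both in KB (gzip).
type gate struct {
	Name     string
	SizeKB   float64
	BudgetKB float64
}

// failing returns the names of the gates whose size exceeds their budget.
// Each gate is judged independently; there is no summed total to pass.
func failing(gates []gate) []string {
	var bad []string
	for _, g := range gates {
		if g.SizeKB > g.BudgetKB {
			bad = append(bad, g.Name)
		}
	}
	return bad
}

func main() {
	gates := []gate{
		{"app entry", 74, 100},
		{"shared (Svelte+i18n)", 23, 30},
		{"landing-own", 1.6, 5},
	}
	// The app's real payload is its entry chunk plus the shared chunk.
	fmt.Printf("app total: %.1f KB\n", gates[0].SizeKB+gates[1].SizeKB)
	if bad := failing(gates); len(bad) > 0 {
		fmt.Println("over budget:", bad)
		os.Exit(1) // non-zero exit fails the CI step
	}
	fmt.Println("all gates pass")
}
```

Gating per chunk instead of on one sum means growth in the landing cannot hide behind headroom in the app entry, and vice versa.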

R6 — Refactor + docs reconciliation + de-staging (TODO 7) — near last

Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical de-staging — remove Stage N/TODO-N references from code, comments and service READMEs (rename stage6_test.go); (b) docs↔code reconciliation — reconcile docs/ARCHITECTURE.md / docs/FUNCTIONAL.md(+_ru) against the code-as-truth, fixing drift and Go Doc comments; (c) structural changes by a reviewed list — surface a list of proposed optimizations / test-suite consolidations to the owner, apply only the approved, behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the regression gate. Incorporates the early-run (R2) bug fixes not already shipped.

  • Open details: the structural-changes list itself (owner-approved before applying); the test consolidation targets.

R7 — Final stress run + tuning (TODO 9, part 2) — before Stage 18

Re-run the R2 harness against the final, refactored system on a clean contour; analyse resource consumption across all components (gateway, backend, Postgres, the metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record the methodology + results in the repo.

Stage 18 (prod contour) then proceeds per PLAN.md.

Sequencing rationale

R1 first (cheapest now; everything builds on the final schema/naming and the stress test must run against it). R2 builds the harness and runs the early pass to surface bugs and a resource baseline that feed R3 and R6. R3/R4/R5 harden and improve the system. R6 (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and benefits from all accumulated bug knowledge. R7 validates the final system and tunes it. Then Stage 18.

Regression-safety discipline (cross-cutting)

  • Every phase is a feature/* → development PR; CI (unit + integration + ui behind the CI / gate check) must be green before the owner merges; watch the post-merge contour deploy with gitea-ci-watch.py.
  • R6 structural changes are behaviour-preserving, test-gated, and split from the mechanical sweeps; contentious items are owner-approved first.
  • The two stress runs (R2 early, R7 final) are the system-level regression gate.

Verification (per phase)

  • go build ./<module>/..., go vet, gofmt -l . clean, go test -count=1 ./<module>/...; UI: pnpm check && pnpm test:unit && pnpm build; the integration suite (-tags integration) for DB/schema changes; docker compose config for deploy changes; green CI on the PR + a healthy contour deploy.
  • R1: prove the squashed baseline yields a schema identical to the 12-migration chain (integration suite on a fresh DB) before deleting the old files.
  • R2/R7: the harness runs end-to-end against the contour; the trip report lists concrete defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.

Refinements logged during implementation

  • R1 (interview + implementation):

    • Variant labels english/russian_scrabble/erudit → scrabble_en/scrabble_ru/erudit_ru across the backend (engine.Variant.String/ParseVariant; the games/game_invitations variant CHECK in the baseline; GCG #lexicon and the variant metric attribute both flow from String), the wire (pkg/fbs variant is a string field — values change with no FlatBuffers regen) and the UI (model.ts union, variants.ts records, codec/premiums/mocks/tests, the admin dictionary.gohtml). Kept: the Go enum identifiers (VariantEnglish…, internal) and the i18n display keys (new.english/new.russian/new.erudit, display-only). complaints.variant stays free-text (no CHECK, as before).
    • dawg filenames kept descriptive (en_sowpods/ru_scrabble/ru_erudit) — only the registry's Variant key carries the rename, so registry.go, the published scrabble-solver fixtures and the dictionary release artifact are untouched (decouples the three repos).
    • Migrations squashed 12 → one hand-written 00001_baseline.sql. Verified by a pg_dump --schema-only diff (the chain vs the baseline are identical but for the two intended variant-CHECK values) plus the green integration suite. No data migration (no production data).
    • Done (cross-repo + contour): the scrabble-dictionary tidy merged (PR #2) and was re-cut as the byte-identical v1.0.1 release for clean provenance (the backend stays on v1.0.0 — same bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the contour backend schema was wiped (DROP SCHEMA backend CASCADE + restart, not a volume drop) and re-migrated to the baseline — verified the new variant CHECK (scrabble_en/scrabble_ru/erudit_ru), games=0 and a clean boot.
  • R2 (interview + implementation):

    • Locked decisions: game assembly via invitations (real path, no robots; not direct game-row inserts); moderate ramp 50 → 200 → 500 at 10 min/step; diagnostic pass bar (no SLO gate); run as a one-shot container on scrabble-internal in this PR.
    • Harness = new scrabble/loadtest module (use ./loadtest + a replace scrabble/gateway for the dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions directly in Postgres (token hash mirrors backend/internal/session), drives players over the edge protocol, generates mid-ranked legal moves locally with the embedded scrabble-solver by replaying game.history (the edge carries no board — mirrors engine.ReplayBoard via the public API), and includes a gateway-hammer. Compact CLI (run / cleanup), distroless Dockerfile (DAWGs baked), Go unit tests.
    • Adding the module broke the other images' builds — backend/gateway/telegram Dockerfiles reduce the workspace but still referenced ./loadtest (not in their context); each now also -dropuse=./loadtest (backend/telegram additionally -dropreplace the gateway replace). Caught by the first deploy run; verified by building all four images.
    • Harness payload fixes found by the smoke pass: the draft DTO's rack_order is a string (was sent as [] → bad_request); the display-name validator forbids digits/colons, so the cleanup marker became a letters-only Zzloadtest so profile.update resends the seeded name. chat_not_your_turn / nudge_own_turn are by-design turn gates, correctly exercised.
    • Observability: added cAdvisor + postgres_exporter + the Scrabble — Resources dashboard + two Prometheus jobs. Finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker breaks its layer-ID resolution — the existing galaxy deploy has the same limit), so per-container CPU/RSS for the early pass was captured via docker stats. R7: adopt the otelcol docker_stats receiver (already the contrib image) for per-container metrics in Grafana.
    • Early run (2026-06-09): ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000 accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's single shared http2.Transport; the harness itself peaked at 86 % of a core on the same host, so the figures are pessimistic. Full trip report in ../loadtest/REPORT-R2.md; it feeds R3 (h2c MaxConcurrentStreams/timeouts, body-size cap), R6 and R7 (per-player transports, separate hardware, pool/limit sizing).
    • CI: ./loadtest/... added to the path filter + vet/build/test; go.work.sum carries the new deps.
  • R3 (interview + implementation):

    • Locked decisions: the flag column lands by editing the R1 baseline (+ a contour schema wipe after merge — no migration chain accrues before prod); auto-flag defaults 1000 rejected / 10 min (BACKEND_HIGHRATE_FLAG_THRESHOLD/_WINDOW, rolling window, set-once, operator clears, no auto-ban); landing image = caddy:2-alpine; throttle data flows gateway → backend (a 30 s per-key summary POST to the new /api/v1/internal/ratelimit/report, the existing trusted direction) with the episode window + flag rule in the backend (internal/ratewatch); rejection logging = Warn summary per key per window + Debug per rejection — a deliberate deviation from the phase's "structured log per rejection" (the R2 hammer would have logged ~522k lines in minutes); all three R2-report tails included (explicit h2c sizing, the session-resolve failure cause at Warn, reviving the admin limiter).
    • Body cap: GATEWAY_MAX_BODY_BYTES (default 1 MiB) as both the Connect per-message read limit and an http.MaxBytesReader wrap of the public mux; an oversized Execute is resource_exhausted.
    • Dead config found: AdminPerMinute/AdminBurst were never wired — the gateway /_gm mount is now 429-guarded per IP ahead of its Basic-Auth. The caddy-fronted contour path stays unlimited (stock caddy has no limiter) — an accepted gap, recorded in docs/ARCHITECTURE.md §12.
    • Landing split: a landing target in gateway/Dockerfile (the UI build stage is shared; identical compose build args keep it one cached build); the gateway drops landing.html from the embed and 308-redirects / → /app/; the contour caddy routes /app/, /telegram/ and the Connect path to the gateway and the catch-all to the landing container; the CI deploy probe now checks both / (landing) and /app/ (gateway).
    • Observability: gateway_rate_limited_total{class} (user/public/email/admin, aggregate-only) + a rate-vs-rejections panel on the Edge/UX dashboard; the admin console gains the Throttled page (the in-memory episode window, reset-on-restart like active_users, plus the flagged-account queue) and the flag badge / clear action on the user list / card.
    • The jet regen also restored the previously missing game_drafts/game_hidden generated models (their tables were added after the last jetgen run; no behaviour change).
  • R4 (interview + implementation):

    • Locked decisions: delta-first, not full snapshots — an event carries only the new move and the UI applies it to its per-game cache, keyed on move_count (idempotent + gap-safe: a gap or the actor's own move falls back to a game.state + game.history refetch). match_found / game_started carry the recipient's initial StateView (instant lobby→game); the fallback refetch stays the existing two calls (no merged endpoint); the matchmaking poll runs only while the stream is down (2.5 s); all UI-state-changing events carry their payload (incl. lobby notify).
    • Enriched events (pkg/fbs trailing fields — backward-compatible, no FB regen of values, only the schema): opponent_moved (+move/game/bag_len), your_turn (+move_count), match_found (+state), game_over (+game), notify (+account/invitation/state). The pre-R4 opponent_moved scalars (seat/action/score/total) stay for wire back-compat, now redundant with move/game — slated for the R6 de-stage.
    • Encoding placement: the notify package keeps ownership of the FlatBuffers encoding (a new encode.go mirrors the gateway transcode but reads wire-agnostic notify.* input structs + engine.MoveRecord); the game/lobby/social services map their domain types to those structs, so the wire schema stays out of the domain. Flagged for R6: this partly duplicates the gateway encoders (different source types) — a candidate consolidation.
    • Actor self-fetch killed too (beyond literal "push"): the submit_play/pass/exchange/resign response (MoveResult) now returns the actor's refilled rack + bag size, so the mover renders the next turn from the response — Game.svelte's commit/pass/exchange/resign drop their await load().
    • match_found enrichment needs a per-seat initial state: lobby.GameCreator gained InitialState, and game.Service.InitialState builds the notify.PlayerState (rack re-encoded to wire indices, the variant alphabet embedded for a first-seen variant).
    • UI: a pure lib/gamedelta.ts reducer (applyMoveDelta / applyGameOver / seedInitialState, unit-tested) advances the cache; app.svelte seeds it on match_found / game_started; Game.svelte applies the delta (falling back to load() while composing, on a gap, or on its own move's new rack); NewGame.svelte polls only when app.streamAlive is false and guards its teardown so a push-delivered match is not cancelled.
    • notify (friends/invitations) scope: the backend carries the full account / invitation payload on the wire (per "all events → push"); the UI seeds the game cache from game_started but keeps its lightweight authoritative badge refresh (refreshNotifications, on the rare notify event + on foreground) rather than adding client-side friend/invitation caches — the per-move hot path is fully de-fetched, which was the goal. Deeper lobby-cache consumption is an easy follow-up.
    • No schema change (no migration); the contour needs no DB wipe. Tests: notify FB round-trips + emitMove delta + the gamedelta reducer; the e2e mock now emits the enriched delta.
  • R5 (interview + implementation):

    • No code slimming — by analysis. A gzip measure + sourcemap attribution of the real dist showed the app bundle is already minified + tree-shaken and dominated by the Connect/FlatBuffers transport runtime + generated FB/PB bindings (≈⅔ of main's source) and the Svelte runtime — all third-party/generated, irreducible within R5's scope. App-authored code carries no hand-trimmable fat.
    • Lazy-load rejected (screens and i18n): bundle-size.mjs sums every emitted chunk, so code-splitting moves bytes between chunks for zero total-size win while adding request latency (+N gateway fetches on first navigation to a split screen). i18n lazy-load additionally buys ≤3 KB (en-only users) at the cost of an async t(), and en must stay bundled (it is the MessageKey type source + fallback). Chunk-collapsing rejected too — keeping the near-static Svelte runtime in its own cacheable chunk is the recommended practice (an app deploy then re-busts only main, not the runtime), and HTTP/2 makes the extra preload request negligible.
    • Metric retargeted to the app. The two-entry build (index.html app + landing.html) makes Rollup hoist the code shared by both (Svelte runtime + i18n + aboutContent) into one preloaded chunk, so the app actually loads its entry chunk + the shared chunk (≈74 + ≈23 = ≈97 KB), never landing.js (≈1.6 KB). The old script summed all three chunks (98.8 KB), over-counting the app by landing.js. bundle-size.mjs now parses each built HTML for the JS it eagerly loads and gates three parts independently — app entry ≤ 100 KB, shared (Svelte+i18n) ≤ 30 KB, landing-own ≤ 5 KB — reporting the app total (≈97) and landing total (≈24.5). Same CLI + exit-code contract, so the CI step is unchanged.
    • No app/source/build change (App.svelte, lib/i18n/, vite.config.ts untouched); no schema change, no contour wipe. The stale "~82 KB" figure was corrected in bundle-size.mjs and ui/README.md.