# Pre-release plan — hardening before Stage 18

Living tracker for the pre-release hardening pass that runs **before Stage 18** (the prod cutover). Same discipline as [`PLAN.md`](PLAN.md): one phase per session, **interview the owner on the open details** at the start of each phase, bake every decision back into `PLAN.md` / `docs/` / the affected `README`s / Go Doc comments in the **same** PR, get CI green, then mark the phase done. Phases run as `feature/* → development` PRs (the Stage 16 branch model); the owner approves+merges.

**Why now:** the system is feature-complete through Stage 17 and the test contour is green, but there is **no prod data yet** — schema, wire labels and the dictionary layout can still change for free. These phases spend that one-time freedom and harden the edge before prod. Each phase maps back to the owner's raw pre-release TODO list (numbers in the tracker).

## Phase tracker

| # | Phase | Raw TODOs | Status |
|---|-------|-----------|--------|
| R1 | Schema & naming reset | 1 + 10 | **done** |
| R2 | Stress harness + contour observability + early run | 9a | **done** |
| R3 | Edge hardening | 2 + 8 + 3 | **done** |
| R4 | Push enrichment + kill the last poll | 4 + 5 | **done** |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |

## Key findings (these reshaped the raw list — read before starting a phase)

- **R1 (TODO 1 + 10) is one cheap moment, now.** Squashing the 12 goose migrations is safe precisely because there is no prod data and the contour DB is wiped. Folding the new variant labels (`scrabble_ru`/`scrabble_en`/`erudit_ru`) into that single baseline makes the rename need **no data migration and no back-compat mapping**.
  Today's labels (`english`/`russian_scrabble`/`erudit`) are persisted in `games.variant`, `game_invitations.variant`, in `pkg/fbs` and the UI — ~100 files, but a mechanical sweep on a clean DB.
- **R4 (TODO 4 + 5): the app is already push-first.** Game state refreshes on `your_turn`/`opponent_moved`, the lobby on `notify`, chat on `chat_message`. The **only** genuine periodic server poll is `lobby.poll` (matchmaking, 2.5 s, `ui/src/screens/NewGame.svelte`). What remains is killing that one poll **and** enriching push events to carry payloads so the UI stops re-fetching after each signal.
- **R3 (TODO 2): identity forgery is already mitigated.** Identity is always derived from the session (`Authorization: Bearer` → `X-User-ID`); the client cannot inject identity, the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real gaps are a missing **request-body size limit** (cheap DoS) and **invisible rate-limit rejections** (no log/metric/admin view — that is TODO 8). Static landing serving is **not** covered by the gateway token bucket (it only guards `Execute`).
- **R6 (TODO 7) scale:** ~431 `Stage N` references across ~104 files (incl. the file name `backend/internal/inttest/stage6_test.go`). Code is the source of truth; `docs/` describe current state; `PLAN.md` keeps the decision history.

## Locked decisions (owner interview)

- **Stress test (TODO 9):** **early + final** runs. Driver = **edge protocol** (Connect/FB through the gateway, moves generated by the solver) **plus a separate gateway-hammer** saturation test. Pacing = **realistic (under limits) + saturation (ramp to the knee)**. Resource metrics = **add cAdvisor + postgres_exporter to the contour** (today only Go-runtime metrics exist). The harness stays in the repo for repeats.
- **Push (TODO 4 + 5):** **both** — kill `lobby.poll` (use the existing `match_found`, keep poll as the ws-down fallback) **and** enrich push events with payloads.
- **Refactor (TODO 7):** **hygiene + structural changes by a reviewed list** — behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
- **Landing (TODO 3):** **separate static container** behind the project caddy (`/` → landing, `/app/` + `/telegram/` → gateway); drop `landing.html` from the gateway `go:embed`.
- **Rate-abuse (TODO 8):** metric + Grafana + admin view **plus a conservative auto-flag** — a *soft, reversible* "suspected high-rate" marker for operator review, tunable threshold, **no auto-ban**.

## Phases

Each phase: read this tracker + the relevant `docs/`, **interview the owner on the open details below**, implement within scope, then update the tracker + docs/code and get CI green before marking it done.

### R1 — Schema & naming reset *(TODO 1 + 10)* — first

Squash `backend/internal/postgres/migrations/00001..00012` into one `00001_baseline.sql` (method: `pg_dump --schema-only` from a fully-migrated DB → wrap as the goose baseline → prove a fresh migrate yields a schema identical to the 12-migration chain via the integration suite → delete the old files; keep goose). Bake the new variant labels into the baseline. Propagate `scrabble_ru`/`scrabble_en`/`erudit_ru` through the backend (`engine.Variant`/`ParseVariant`, `registry.dictFiles`, the CHECK values), the wire (`pkg/fbs` `variant:string`, regenerate FB) and the UI (`lib/model.ts` union, `variants.ts`, fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy `../scrabble-dictionary` to a single source→dawg build point and align the dawg artifact names to the new labels (crosses into `../scrabble-solver`'s committed fixtures — keep them byte-identical). After merge, **wipe the contour DB** (drop the volume) so it re-provisions on the next deploy.
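As a rough illustration of the rename on the Go side, the sketch below maps the three new labels to an enum and rejects the old ones, since the plan calls for no back-compat mapping. It is a minimal stand-in, not the real `engine` package: only `Variant`, `ParseVariant` and the `VariantEnglish` identifier come from this plan, while `VariantRussianScrabble`, `VariantErudit` and the `labels` table are hypothetical names.

```go
package main

import "fmt"

// Variant is a hypothetical stand-in for engine.Variant.
type Variant int

const (
	VariantEnglish         Variant = iota
	VariantRussianScrabble         // hypothetical identifier
	VariantErudit                  // hypothetical identifier
)

// labels holds the new persisted/wire labels named in this plan.
var labels = map[Variant]string{
	VariantEnglish:         "scrabble_en",
	VariantRussianScrabble: "scrabble_ru",
	VariantErudit:          "erudit_ru",
}

// String returns the persisted label, or "unknown" for an unmapped value.
func (v Variant) String() string {
	if s, ok := labels[v]; ok {
		return s
	}
	return "unknown"
}

// ParseVariant accepts only the new labels. The old ones (`english`,
// `russian_scrabble`, `erudit`) are rejected on purpose: with a clean DB
// there is nothing that could still carry them.
func ParseVariant(s string) (Variant, error) {
	for v, label := range labels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	for _, in := range []string{"scrabble_ru", "erudit_ru", "english"} {
		v, err := ParseVariant(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("parsed:", v)
	}
}
```

Keeping the label table in one place means the CHECK values, the wire strings and the metric attribute can all be derived from `String`, which is the property the rename relies on.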
- Critical files: `backend/internal/postgres/migrations/`, `backend/internal/engine/{engine,registry}.go`, `pkg/fbs/scrabble.fbs`, `ui/src/lib/{model,variants}.ts`, `../scrabble-dictionary/{Makefile,cmd/builddict,…}`.
- Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is one PR or split; how to script the contour DB wipe in the deploy.

### R2 — Stress harness + contour observability + early run *(TODO 9, part 1)*

Build the reusable load harness as a new `loadtest` module in `go.work` (reuses `pkg/fbs`, `connect-go`, and `scrabble-solver` for legal-move generation): a seeder that inserts **1000 guest + 10000 durable** accounts with pre-created sessions (token hashes) directly in the DB and hands the plaintext tokens to the client; a driver that runs N virtual users, each in 3–5 concurrent 2–4-player games, exercising submit-play / pass / exchange / nudge / chat / check-word / draft-move / profile-save through the **edge protocol**, in **realistic** (under rate limits) and **saturation** (ramp) modes; plus a separate **gateway-hammer** that deliberately exceeds limits to verify the limiter holds and measure its cost. Add **cAdvisor + postgres_exporter** to `deploy/docker-compose.yml` and a Grafana resource dashboard. Run the **early pass** against the freshly-wiped contour; produce a **trip report** (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.

- Critical files: new `loadtest/`, `deploy/docker-compose.yml`, `deploy/observability/*`, `docs/TESTING.md`.
- Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move for realistic game progress); run duration; the pass/fail bar.

### R3 — Edge hardening *(TODO 2 + 8 + 3)*

Add a **request-body size cap** at the gateway h2c mux / `Execute` (e.g. ~1 MB).
Add **rate-limit observability**: a `gateway_rate_limited_total{class}` counter + a structured log per rejection; an **aggregate** Grafana panel (request rate + rejection rate — spikes visible without per-user label cardinality, honouring the Stage 12/17 discipline); an **admin-console view** of recently throttled users/IPs (in-memory ring buffer, single-instance, reset-on-restart, like the `active_users` gauge). Add the **conservative auto-flag**: when a user is *sustained*-throttled past a tunable threshold, set a soft, reversible `account.flagged_high_rate_at` marker (baked into the R1 baseline) surfaced in the admin user list/detail — **no auto-ban**; the operator clears it. Split the **landing** into its own static container (`deploy/` + a Caddyfile route `/` → landing) and drop `landing.html` from the gateway `go:embed`.

- Critical files: `gateway/internal/connectsrv/server.go`, `gateway/internal/ratelimit/`, `gateway/internal/connectsrv/metrics.go`, `backend/internal/adminconsole/`, `deploy/caddy/Caddyfile`, `deploy/docker-compose.yml`, `gateway/internal/webui/`.
- Open details: the auto-flag threshold/window + whether the marker is persisted vs in-memory; the landing image base (caddy vs nginx).

### R4 — Push enrichment + kill the last poll *(TODO 4 + 5)*

Replace `lobby.poll` with the existing `match_found` push (keep the poll as a ws-down fallback). Enrich `your_turn`/`opponent_moved`/`notify` to carry the state payload so the UI renders from the event without a follow-up `game.state` (removes the lobby↔game nav latency the owner noticed). Wire-contract change: `pkg/fbs` event payloads → backend `notify` emit → UI stream consumers (`ui/src/lib/app.svelte.ts`), with the per-game cache as the landing spot; regenerate FB.

- Critical files: `pkg/fbs/scrabble.fbs`, `backend/internal/notify/events.go`, `ui/src/lib/{app.svelte,transport}.ts`, `ui/src/screens/NewGame.svelte`.
- Open details: which events carry full vs delta payloads; the fallback-poll cadence when the stream is down.

### R5 — Bundle slimming *(TODO 6)*

Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by language via dynamic imports; re-measure against the existing 100 KB-gzip budget (`ui/scripts/bundle-size.mjs`, ~82 KB today). If the win is marginal, stop — acceptable per the owner.

- Critical files: `ui/src/App.svelte`, `ui/vite.config.ts`, `ui/src/lib/i18n/`.

### R6 — Refactor + docs reconciliation + de-staging *(TODO 7)* — near last

Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical **de-staging** — remove `Stage N`/`TODO-N` references from code, comments and service READMEs (rename `stage6_test.go`); (b) **docs↔code reconciliation** — reconcile `docs/ARCHITECTURE.md` / `docs/FUNCTIONAL.md`(+`_ru`) against the code-as-truth, fixing drift and Go Doc comments; (c) **structural changes by a reviewed list** — surface a list of proposed optimizations / test-suite consolidations to the owner, apply only the approved, behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the regression gate. Incorporates the early-run (R2) bug fixes not already shipped.

- Open details: the structural-changes list itself (owner-approved before applying); the test consolidation targets.

### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18

Re-run the R2 harness against the final, refactored system on a clean contour; analyse resource consumption across **all** components (gateway, backend, Postgres, the metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record the methodology + results in the repo.

→ **Stage 18** (prod contour) then proceeds per [`PLAN.md`](PLAN.md).
## Sequencing rationale

`R1` first (cheapest now; everything builds on the final schema/naming and the stress test must run against it). `R2` builds the harness and runs the **early** pass to surface bugs and a resource baseline that feed `R3` and `R6`. `R3`/`R4`/`R5` harden and improve the system. `R6` (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and benefits from all accumulated bug knowledge. `R7` validates the final system and tunes it. Then Stage 18.

## Regression-safety discipline (cross-cutting)

- Every phase is a `feature/* → development` PR; CI (`unit` + `integration` + `ui` behind the `CI / gate` check) must be green before the owner merges; watch the post-merge contour deploy with `gitea-ci-watch.py`.
- `R6` structural changes are behaviour-preserving, test-gated, and split from the mechanical sweeps; contentious items are owner-approved first.
- The two stress runs (`R2` early, `R7` final) are the system-level regression gate.

## Verification (per phase)

- `go build ./...`, `go vet`, `gofmt -l .` clean, `go test -count=1 ./...`; UI: `pnpm check && pnpm test:unit && pnpm build`; the integration suite (`-tags integration`) for DB/schema changes; `docker compose config` for deploy changes; green CI on the PR + a healthy contour deploy.
- `R1`: prove the squashed baseline yields a schema identical to the 12-migration chain (integration suite on a fresh DB) **before** deleting the old files.
- `R2`/`R7`: the harness runs end-to-end against the contour; the trip report lists concrete defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.
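The `R1` check compares the schema produced by the 12-migration chain with the one produced by the squashed baseline. One way to make such a comparison mechanical is sketched below, assuming both `pg_dump --schema-only` outputs are already available as strings. `normalizeDump` and `onlyIn` are hypothetical helpers, not code from this repo, and a plain `diff` of the two dump files serves the same purpose.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDump drops blank lines and SQL comment lines ("-- ...") so that two
// schema dumps are compared on their statements alone.
func normalizeDump(dump string) []string {
	var out []string
	for _, line := range strings.Split(dump, "\n") {
		line = strings.TrimRight(line, " \t\r")
		if line == "" || strings.HasPrefix(line, "--") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// onlyIn returns the lines of a that do not appear anywhere in b.
func onlyIn(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, l := range b {
		seen[l] = true
	}
	var out []string
	for _, l := range a {
		if !seen[l] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	// Tiny made-up dumps; real input would be the two pg_dump outputs.
	chain := "-- from the migration chain\nCREATE TABLE games (id bigint);\n\nCREATE TABLE accounts (id bigint);\n"
	baseline := "-- from the baseline\nCREATE TABLE games (id bigint);\nCREATE TABLE accounts (id bigint);\n"

	c, b := normalizeDump(chain), normalizeDump(baseline)
	fmt.Println("only in chain:   ", onlyIn(c, b))
	fmt.Println("only in baseline:", onlyIn(b, c))
}
```

Two empty result lists mean the schemas match statement for statement; any intended difference shows up as a short, reviewable list.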
## Refinements logged during implementation

- **R1** (interview + implementation):
  - **Variant labels** `english`/`russian_scrabble`/`erudit` → **`scrabble_en`/`scrabble_ru`/`erudit_ru`** across the backend (`engine.Variant.String`/`ParseVariant`; the `games`/`game_invitations` `variant` CHECK in the baseline; GCG `#lexicon` and the `variant` metric attribute both flow from `String`), the wire (`pkg/fbs` `variant` is a `string` field — values change with **no FlatBuffers regen**) and the UI (`model.ts` union, `variants.ts` records, `codec`/`premiums`/mocks/tests, the admin `dictionary.gohtml`). **Kept:** the Go enum identifiers (`VariantEnglish`…, internal) and the i18n display keys (`new.english`/`new.russian`/`new.erudit`, display-only). `complaints.variant` stays free-text (no CHECK, as before).
  - **dawg filenames kept descriptive** (`en_sowpods`/`ru_scrabble`/`ru_erudit`) — only the registry's `Variant` key carries the rename, so `registry.go`, the published `scrabble-solver` fixtures and the dictionary release artifact are untouched (decouples the three repos).
  - **Migrations squashed** 12 → one hand-written `00001_baseline.sql`. Verified by a `pg_dump --schema-only` diff (the chain vs the baseline are **identical** but for the two intended variant-CHECK values) plus the green integration suite. **No data migration** (no production data).
  - **Done (cross-repo + contour):** the **`scrabble-dictionary` tidy** merged (PR #2) and was re-cut as the **byte-identical `v1.0.1`** release for clean provenance (the backend stays on `v1.0.0` — same bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`), `games`=0 and a clean boot.
- **R2** (interview + implementation):
  - **Locked decisions:** game assembly via **invitations** (real path, no robots; not direct game-row inserts); **moderate** ramp **50 → 200 → 500** at 10 min/step; **diagnostic** pass bar (no SLO gate); run as a **one-shot container on `scrabble-internal`** in this PR.
  - **Harness** = new `scrabble/loadtest` module (`use ./loadtest` + a `replace scrabble/gateway` for the dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions **directly in Postgres** (token hash mirrors `backend/internal/session`), drives players over the **edge protocol**, generates **mid-ranked legal moves locally** with the embedded `scrabble-solver` by replaying `game.history` (the edge carries no board — mirrors `engine.ReplayBoard` via the public API), and a **gateway-hammer**. Compact CLI (`run` / `cleanup`), distroless Dockerfile (DAWGs baked), Go unit tests.
  - **Adding the module broke the other images' builds** — backend/gateway/telegram Dockerfiles reduce the workspace but still referenced `./loadtest` (not in their context); each now also `-dropuse=./loadtest` (backend/telegram additionally `-dropreplace` the gateway replace). Caught by the first deploy run; verified by building all four images.
  - **Harness payload fixes found by the smoke pass:** the draft DTO's `rack_order` is a string (was sent as `[]` → `bad_request`); the display-name validator forbids digits/colons, so the cleanup marker became a letters-only `Zzloadtest` so `profile.update` resends the seeded name. `chat_not_your_turn` / `nudge_own_turn` are **by-design** turn gates, correctly exercised.
  - **Observability:** added **cAdvisor + postgres_exporter** + the **Scrabble — Resources** dashboard + two Prometheus jobs.
    **Finding:** cAdvisor yields only the root cgroup on the contour host (separate XFS `/var/lib/docker` breaks its layer-ID resolution — the existing galaxy deploy has the same limit), so per-container CPU/RSS for the early pass was captured via `docker stats`. **R7:** adopt the otelcol `docker_stats` receiver (already the contrib image) for per-container metrics in Grafana.
  - **Early run (2026-06-09):** ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000 accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the hammer (99.97 % rejected, p99 2 ms). **Top finding:** ~14 % `transport_error` on `game.state` at 500 players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's single shared `http2.Transport`; the harness itself peaked at 86 % of a core on the same host, so the figures are pessimistic. Full trip report in [`../loadtest/REPORT-R2.md`](../loadtest/REPORT-R2.md); it feeds R3 (h2c `MaxConcurrentStreams`/timeouts, body-size cap), R6 and R7 (per-player transports, separate hardware, pool/limit sizing).
  - **CI:** `./loadtest/...` added to the path filter + vet/build/test; `go.work.sum` carries the new deps.
- **R3** (interview + implementation):
  - **Locked decisions:** the flag column lands by **editing the R1 baseline** (+ a contour schema wipe after merge — no migration chain accrues before prod); auto-flag defaults **1000 rejected / 10 min** (`BACKEND_HIGHRATE_FLAG_THRESHOLD`/`_WINDOW`, rolling window, set-once, operator clears, no auto-ban); landing image = **caddy:2-alpine**; throttle data flows **gateway → backend** (a 30 s per-key summary POST to the new `/api/v1/internal/ratelimit/report`, the existing trusted direction) with the episode window + flag rule in the backend (`internal/ratewatch`); rejection logging = **Warn summary per key per window + Debug per rejection** — a deliberate deviation from the phase's "structured log per rejection" (the R2 hammer would have logged ~522k lines in minutes); all three R2-report tails included (explicit h2c sizing, the session-resolve failure cause at Warn, reviving the admin limiter).
  - **Body cap:** `GATEWAY_MAX_BODY_BYTES` (default 1 MiB) as both the Connect per-message read limit and an `http.MaxBytesReader` wrap of the public mux; an oversized Execute is `resource_exhausted`.
  - **Dead config found:** `AdminPerMinute`/`AdminBurst` were never wired — the gateway `/_gm` mount is now 429-guarded per IP ahead of its Basic-Auth. The caddy-fronted contour path stays unlimited (stock caddy has no limiter) — an accepted gap, recorded in `docs/ARCHITECTURE.md` §12.
  - **Landing split:** a `landing` target in `gateway/Dockerfile` (the UI build stage is shared; identical compose build args keep it one cached build); the gateway drops `landing.html` from the embed and 308-redirects `/` → `/app/`; the contour caddy routes `/app/`, `/telegram/` and the Connect path to the gateway and the catch-all to the landing container; the CI deploy probe now checks both `/` (landing) and `/app/` (gateway).
  - **Observability:** `gateway_rate_limited_total{class}` (user/public/email/admin, aggregate-only) + a rate-vs-rejections panel on the Edge/UX dashboard; the admin console gains the **Throttled** page (the in-memory episode window, reset-on-restart like `active_users`, plus the flagged-account queue) and the flag badge / clear action on the user list / card.
  - The jet regen also restored the previously missing `game_drafts`/`game_hidden` generated models (their tables were added after the last jetgen run; no behaviour change).
- **R4** (interview + implementation):
  - **Locked decisions:** **delta-first**, not full snapshots — an event carries only the new move and the UI applies it to its per-game cache, keyed on `move_count` (idempotent + gap-safe: a gap or the actor's own move falls back to a `game.state` + `game.history` refetch). `match_found` / `game_started` carry the recipient's **initial `StateView`** (instant lobby→game); the fallback refetch stays the existing two calls (no merged endpoint); the matchmaking poll runs **only while the stream is down** (2.5 s); **all** UI-state-changing events carry their payload (incl. lobby `notify`).
  - **Enriched events** (`pkg/fbs` trailing fields — backward-compatible, no FB regen of *values*, only the schema): `opponent_moved` (+`move`/`game`/`bag_len`), `your_turn` (+`move_count`), `match_found` (+`state`), `game_over` (+`game`), `notify` (+`account`/`invitation`/`state`). The pre-R4 `opponent_moved` scalars (`seat`/`action`/`score`/`total`) stay for wire back-compat, now redundant with `move`/`game` — slated for the R6 de-stage.
  - **Encoding placement:** the `notify` package keeps ownership of the FlatBuffers encoding (a new `encode.go` mirrors the gateway transcode but reads wire-agnostic `notify.*` input structs + `engine.MoveRecord`); the game/lobby/social services map their domain types to those structs, so the wire schema stays out of the domain.
    **Flagged for R6:** this partly duplicates the gateway encoders (different source types) — a candidate consolidation.
  - **Actor self-fetch killed too** (beyond literal "push"): the `submit_play`/`pass`/`exchange`/`resign` **response** (`MoveResult`) now returns the actor's refilled rack + bag size, so the mover renders the next turn from the response — `Game.svelte`'s `commit`/`pass`/`exchange`/`resign` drop their `await load()`.
  - **`match_found` enrichment** needs a per-seat initial state: `lobby.GameCreator` gained `InitialState`, and `game.Service.InitialState` builds the `notify.PlayerState` (rack re-encoded to wire indices, the variant alphabet embedded for a first-seen variant).
  - **UI:** a pure `lib/gamedelta.ts` reducer (`applyMoveDelta` / `applyGameOver` / `seedInitialState`, unit-tested) advances the cache; `app.svelte` seeds it on `match_found` / `game_started`; `Game.svelte` applies the delta (falling back to `load()` while composing, on a gap, or on its own move's new rack); `NewGame.svelte` polls only when `app.streamAlive` is false and guards its teardown so a push-delivered match is not cancelled.
  - **notify (friends/invitations) scope:** the backend carries the full account / invitation payload on the wire (per "all events → push"); the UI seeds the game cache from `game_started` but keeps its lightweight **authoritative** badge refresh (`refreshNotifications`, on the rare `notify` event + on foreground) rather than adding client-side friend/invitation caches — the per-move hot path is fully de-fetched, which was the goal. Deeper lobby-cache consumption is an easy follow-up.
  - **No schema change** (no migration); the contour needs no DB wipe. Tests: `notify` FB round-trips + `emitMove` delta + the `gamedelta` reducer; the e2e mock now emits the enriched delta.
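
The `move_count` rule for delta events can be illustrated in a few lines. The real reducer is the TypeScript `lib/gamedelta.ts`; the Go sketch below only mirrors the rule as described in the R4 locked decisions. Its types (`Move`, `Delta`, `GameCache`, `Outcome`) and the assumption that `move_count` is the number of moves in the game once the carried move is applied are illustrative, not taken from the code.

```go
package main

import "fmt"

// Move, Delta and GameCache are hypothetical shapes for illustration only.
type Move struct {
	Seat  int
	Score int
}

// Delta is one enriched push event. MoveCount is assumed to be the number of
// moves in the game once Move has been applied.
type Delta struct {
	MoveCount int
	Move      Move
}

// GameCache is the per-game state a client keeps between events.
type GameCache struct {
	MoveCount int
	Totals    []int // running score per seat
	History   []Move
}

type Outcome int

const (
	Applied     Outcome = iota // cache advanced by exactly one move
	Duplicate                  // event already reflected in the cache
	NeedRefetch                // caller should refetch state + history
)

// ApplyMoveDelta advances the cache only when the event is the next move in
// sequence and was made by someone else. Replays are ignored (idempotent);
// gaps and the player's own moves ask for a refetch (gap-safe).
func ApplyMoveDelta(c *GameCache, d Delta, selfSeat int) Outcome {
	switch {
	case d.MoveCount <= c.MoveCount:
		return Duplicate
	case d.MoveCount != c.MoveCount+1:
		return NeedRefetch
	case d.Move.Seat == selfSeat:
		return NeedRefetch // the event does not carry the mover's new rack
	case d.Move.Seat < 0 || d.Move.Seat >= len(c.Totals):
		return NeedRefetch // unknown seat: do not guess
	}
	c.History = append(c.History, d.Move)
	c.Totals[d.Move.Seat] += d.Move.Score
	c.MoveCount = d.MoveCount
	return Applied
}

func main() {
	cache := &GameCache{Totals: make([]int, 2)}
	events := []Delta{
		{MoveCount: 1, Move: Move{Seat: 1, Score: 12}},
		{MoveCount: 1, Move: Move{Seat: 1, Score: 12}}, // redelivery
		{MoveCount: 3, Move: Move{Seat: 1, Score: 20}}, // move 2 was missed
	}
	for _, e := range events {
		out := ApplyMoveDelta(cache, e, 0)
		fmt.Println("outcome:", out, "move_count:", cache.MoveCount, "totals:", cache.Totals)
	}
}
```

Treating anything other than "exactly the next move" as either a no-op or a refetch keeps the cache correct under redelivery and missed events, at the cost of an occasional extra round trip.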