Pre-release plan — hardening before Stage 18
Living tracker for the pre-release hardening pass that runs before Stage 18 (the
prod cutover). Same discipline as PLAN.md: one phase per session,
interview the owner on the open details at the start of each phase, bake every
decision back into PLAN.md / docs/ / the affected READMEs / Go Doc comments in
the same PR, get CI green, then mark the phase done. Phases run as
feature/* → development PRs (the Stage 16 branch model); the owner approves+merges.
Why now: the system is feature-complete through Stage 17 and the test contour is green, but there is no prod data yet — schema, wire labels and the dictionary layout can still change for free. These phases spend that one-time freedom and harden the edge before prod. Each phase maps back to the owner's raw pre-release TODO list (numbers in the tracker).
Phase tracker
| # | Phase | Raw TODOs | Status |
|---|---|---|---|
| R1 | Schema & naming reset | 1 + 10 | done |
| R2 | Stress harness + contour observability + early run | 9a | done |
| R3 | Edge hardening | 2 + 8 + 3 | done |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| → | Stage 18 — prod contour deploy | — | see PLAN.md |
Key findings (these reshaped the raw list — read before starting a phase)
- R1 (TODO 1 + 10) is one cheap moment, now. Squashing the 12 goose migrations is safe precisely because there is no prod data and the contour DB is wiped. Folding the new variant labels (scrabble_ru/scrabble_en/erudit_ru) into that single baseline makes the rename need no data migration and no back-compat mapping. Today's labels (english/russian_scrabble/erudit) are persisted in games.variant, game_invitations.variant, in pkg/fbs and the UI: ~100 files, but a mechanical sweep on a clean DB.
- R4 (TODO 4 + 5): the app is already push-first. Game state refreshes on your_turn/opponent_moved, the lobby on notify, chat on chat_message. The only genuine periodic server poll is lobby.poll (matchmaking, 2.5 s, ui/src/screens/NewGame.svelte). What remains is killing that one poll and enriching push events to carry payloads so the UI stops re-fetching after each signal.
- R3 (TODO 2): identity forgery is already mitigated. Identity is always derived from the session (Authorization: Bearer → X-User-ID); the client cannot inject identity, the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real gaps are a missing request-body size limit (cheap DoS) and invisible rate-limit rejections (no log/metric/admin view; that is TODO 8). Static landing serving is not covered by the gateway token bucket (it only guards Execute).
- R6 (TODO 7) scale: ~431 Stage N references across ~104 files (incl. the file name backend/internal/inttest/stage6_test.go). Code is the source of truth; docs/ describe current state; PLAN.md keeps the decision history.
Locked decisions (owner interview)
- Stress test (TODO 9): early + final runs. Driver = edge protocol (Connect/FB through the gateway, moves generated by the solver) plus a separate gateway-hammer saturation test. Pacing = realistic (under limits) + saturation (ramp to the knee). Resource metrics = add cAdvisor + postgres_exporter to the contour (today only Go-runtime metrics exist). The harness stays in the repo for repeats.
- Push (TODO 4 + 5): both. Kill lobby.poll (use the existing match_found, keep poll as the ws-down fallback) and enrich push events with payloads.
- Refactor (TODO 7): hygiene + structural changes by a reviewed list: behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
- Landing (TODO 3): separate static container behind the project caddy (/ → landing, /app/ + /telegram/ → gateway); drop landing.html from the gateway go:embed.
- Rate-abuse (TODO 8): metric + Grafana + admin view plus a conservative auto-flag: a soft, reversible "suspected high-rate" marker for operator review, tunable threshold, no auto-ban.
Phases
Each phase: read this tracker + the relevant docs/, interview the owner on the open
details below, implement within scope, then update the tracker + docs/code and get CI
green before marking it done.
R1 — Schema & naming reset (TODO 1 + 10) — first
Squash backend/internal/postgres/migrations/00001..00012 into one 00001_baseline.sql
(method: pg_dump --schema-only from a fully-migrated DB → wrap as the goose baseline →
prove a fresh migrate yields a schema identical to the 12-migration chain via the
integration suite → delete the old files; keep goose). Bake the new variant labels into the
baseline. Propagate scrabble_ru/scrabble_en/erudit_ru through the backend
(engine.Variant/ParseVariant, registry.dictFiles, the CHECK values), the wire
(pkg/fbs variant:string, regenerate FB) and the UI (lib/model.ts union, variants.ts,
fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy
../scrabble-dictionary to a single source→dawg build point and align the dawg artifact
names to the new labels (crosses into ../scrabble-solver's committed fixtures — keep them
byte-identical). After merge, wipe the contour DB (drop the volume) so it re-provisions
on the next deploy.
- Critical files: backend/internal/postgres/migrations/, backend/internal/engine/{engine,registry}.go, pkg/fbs/scrabble.fbs, ui/src/lib/{model,variants}.ts, ../scrabble-dictionary/{Makefile,cmd/builddict,…}.
- Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is one PR or split; how to script the contour DB wipe in the deploy.
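
The label propagation through engine.Variant/ParseVariant can be pictured with a minimal Go sketch. This is an illustration under assumptions, not the engine's actual code: the Variant type, the VariantRussian and VariantErudit identifiers and the variantLabels map are hypothetical; only the three wire labels come from this plan.

```go
package main

import "fmt"

// Variant is a hypothetical stand-in for the engine's variant enum.
// The identifiers below are assumptions made for illustration.
type Variant int

const (
	VariantEnglish Variant = iota
	VariantRussian
	VariantErudit
)

// variantLabels holds the persisted/wire labels introduced by the R1 rename.
var variantLabels = map[Variant]string{
	VariantEnglish: "scrabble_en",
	VariantRussian: "scrabble_ru",
	VariantErudit:  "erudit_ru",
}

// String returns the wire label for a known variant.
func (v Variant) String() string {
	if s, ok := variantLabels[v]; ok {
		return s
	}
	return fmt.Sprintf("Variant(%d)", int(v))
}

// ParseVariant accepts only the new labels; with a clean DB there is
// no back-compat mapping for the old ones.
func ParseVariant(s string) (Variant, error) {
	for v, label := range variantLabels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	v, err := ParseVariant("scrabble_ru")
	fmt.Println(v, err)

	_, err = ParseVariant("russian_scrabble") // an old label is rejected
	fmt.Println(err)
}
```

Keeping a single label table as the source for both String and ParseVariant is one way to make sure everything derived from String changes together.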
R2 — Stress harness + contour observability + early run (TODO 9, part 1)
Build the reusable load harness as a new loadtest module in go.work (reuses pkg/fbs,
connect-go, and scrabble-solver for legal-move generation): a seeder that inserts
1000 guest + 10000 durable accounts with pre-created sessions (token hashes) directly in
the DB and hands the plaintext tokens to the client; a driver that runs N virtual users,
each in 3–5 concurrent 2–4-player games, exercising submit-play / pass / exchange / nudge /
chat / check-word / draft-move / profile-save through the edge protocol, in
realistic (under rate limits) and saturation (ramp) modes; plus a separate
gateway-hammer that deliberately exceeds limits to verify the limiter holds and measure
its cost. Add cAdvisor + postgres_exporter to deploy/docker-compose.yml and a Grafana
resource dashboard. Run the early pass against the freshly-wiped contour; produce a
trip report (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.
- Critical files: new loadtest/, deploy/docker-compose.yml, deploy/observability/*, docs/TESTING.md.
- Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move for realistic game progress); run duration; the pass/fail bar.
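
One possible reading of the "mid-ranked solver move" policy is to take the median move by score. The sketch below assumes that reading; the Move type and pickMidRanked are hypothetical names, not the scrabble-solver API.

```go
package main

import (
	"fmt"
	"sort"
)

// Move is a hypothetical stand-in for one legal move returned by a solver.
type Move struct {
	Word  string
	Score int
}

// pickMidRanked returns the median-ranked move by score.
// ok is false when there is no legal move (the driver would pass or exchange).
func pickMidRanked(moves []Move) (m Move, ok bool) {
	if len(moves) == 0 {
		return Move{}, false
	}
	// Copy so the caller's slice order is left untouched.
	ranked := append([]Move(nil), moves...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	return ranked[len(ranked)/2], true
}

func main() {
	moves := []Move{
		{Word: "QI", Score: 11},
		{Word: "ZA", Score: 31},
		{Word: "AT", Score: 2},
		{Word: "JOKE", Score: 45},
		{Word: "TO", Score: 4},
	}
	m, ok := pickMidRanked(moves)
	fmt.Println(m.Word, m.Score, ok) // QI 11 true
}
```

A median pick avoids both always playing the top move (games end unrealistically fast) and always playing the weakest one (racks stagnate); whether that matches the intended policy is an open detail for the interview.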
R3 — Edge hardening (TODO 2 + 8 + 3)
Add a request-body size cap at the gateway h2c mux / Execute (e.g. ~1 MB). Add
rate-limit observability: a gateway_rate_limited_total{class} counter + a structured
log per rejection; an aggregate Grafana panel (request rate + rejection rate — spikes
visible without per-user label cardinality, honouring the Stage 12/17 discipline); an
admin-console view of recently throttled users/IPs (in-memory ring buffer,
single-instance, reset-on-restart, like the active_users gauge). Add the conservative
auto-flag: when a user is sustained-throttled past a tunable threshold, set a soft,
reversible account.flagged_high_rate_at marker (baked into the R1 baseline) surfaced in the
admin user list/detail — no auto-ban; the operator clears it. Split the landing into
its own static container (deploy/ + a Caddyfile route / → landing) and drop
landing.html from the gateway go:embed.
- Critical files: gateway/internal/connectsrv/server.go, gateway/internal/ratelimit/, gateway/internal/connectsrv/metrics.go, backend/internal/adminconsole/, deploy/caddy/Caddyfile, deploy/docker-compose.yml, gateway/internal/webui/.
- Open details: the auto-flag threshold/window + whether the marker is persisted vs in-memory; the landing image base (caddy vs nginx).
R4 — Push enrichment + kill the last poll (TODO 4 + 5)
Replace lobby.poll with the existing match_found push (keep the poll as a ws-down
fallback). Enrich your_turn/opponent_moved/notify to carry the state payload so the UI
renders from the event without a follow-up game.state (removes the lobby↔game nav latency
the owner noticed). Wire-contract change: pkg/fbs event payloads → backend notify emit →
UI stream consumers (ui/src/lib/app.svelte.ts), with the per-game cache as the landing
spot; regenerate FB.
- Critical files: pkg/fbs/scrabble.fbs, backend/internal/notify/events.go, ui/src/lib/{app.svelte,transport}.ts, ui/src/screens/NewGame.svelte.
- Open details: which events carry full vs delta payloads; the fallback-poll cadence when the stream is down.
R5 — Bundle slimming (TODO 6)
Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by
language via dynamic imports; re-measure against the existing 100 KB-gzip budget
(ui/scripts/bundle-size.mjs, ~82 KB today). If the win is marginal, stop — acceptable per
the owner.
- Critical files: ui/src/App.svelte, ui/vite.config.ts, ui/src/lib/i18n/.
R6 — Refactor + docs reconciliation + de-staging (TODO 7) — near last
Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical
de-staging — remove Stage N/TODO-N references from code, comments and service
READMEs (rename stage6_test.go); (b) docs↔code reconciliation — reconcile
docs/ARCHITECTURE.md / docs/FUNCTIONAL.md(+_ru) against the code-as-truth, fixing drift
and Go Doc comments; (c) structural changes by a reviewed list — surface a list of
proposed optimizations / test-suite consolidations to the owner, apply only the approved,
behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the
regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test consolidation targets.
R7 — Final stress run + tuning (TODO 9, part 2) — before Stage 18
Re-run the R2 harness against the final, refactored system on a clean contour; analyse resource consumption across all components (gateway, backend, Postgres, the metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record the methodology + results in the repo.
→ Stage 18 (prod contour) then proceeds per PLAN.md.
Sequencing rationale
R1 first (cheapest now; everything builds on the final schema/naming and the stress test
must run against it). R2 builds the harness and runs the early pass to surface bugs and
a resource baseline that feed R3 and R6. R3/R4/R5 harden and improve the system.
R6 (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and
benefits from all accumulated bug knowledge. R7 validates the final system and tunes it.
Then Stage 18.
Regression-safety discipline (cross-cutting)
- Every phase is a feature/* → development PR; CI (unit + integration + ui behind the CI / gate check) must be green before the owner merges; watch the post-merge contour deploy with gitea-ci-watch.py.
- R6 structural changes are behaviour-preserving, test-gated, and split from the mechanical sweeps; contentious items are owner-approved first.
- The two stress runs (R2 early, R7 final) are the system-level regression gate.
Verification (per phase)
- go build ./<module>/..., go vet, gofmt -l . clean, go test -count=1 ./<module>/...; UI: pnpm check && pnpm test:unit && pnpm build; the integration suite (-tags integration) for DB/schema changes; docker compose config for deploy changes; green CI on the PR + a healthy contour deploy.
- R1: prove the squashed baseline yields a schema identical to the 12-migration chain (integration suite on a fresh DB) before deleting the old files.
- R2/R7: the harness runs end-to-end against the contour; the trip report lists concrete defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.
Refinements logged during implementation
- R1 (interview + implementation):
  - Variant labels english/russian_scrabble/erudit → scrabble_en/scrabble_ru/erudit_ru across the backend (engine.Variant.String/ParseVariant; the games/game_invitations variant CHECK in the baseline; GCG #lexicon and the variant metric attribute both flow from String), the wire (pkg/fbs variant is a string field, so values change with no FlatBuffers regen) and the UI (model.ts union, variants.ts records, codec/premiums/mocks/tests, the admin dictionary.gohtml). Kept: the Go enum identifiers (VariantEnglish…, internal) and the i18n display keys (new.english/new.russian/new.erudit, display-only). complaints.variant stays free-text (no CHECK, as before).
  - dawg filenames kept descriptive (en_sowpods/ru_scrabble/ru_erudit): only the registry's Variant key carries the rename, so registry.go, the published scrabble-solver fixtures and the dictionary release artifact are untouched (decouples the three repos).
  - Migrations squashed 12 → one hand-written 00001_baseline.sql. Verified by a pg_dump --schema-only diff (the chain vs the baseline are identical but for the two intended variant-CHECK values) plus the green integration suite. No data migration (no production data).
  - Done (cross-repo + contour): the scrabble-dictionary tidy merged (PR #2) and was re-cut as the byte-identical v1.0.1 release for clean provenance (the backend stays on v1.0.0: same bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the contour backend schema was wiped (DROP SCHEMA backend CASCADE + restart, not a volume drop) and re-migrated to the baseline; verified the new variant CHECK (scrabble_en/scrabble_ru/erudit_ru), games=0 and a clean boot.
- R2 (interview + implementation):
  - Locked decisions: game assembly via invitations (real path, no robots; not direct game-row inserts); moderate ramp 50 → 200 → 500 at 10 min/step; diagnostic pass bar (no SLO gate); run as a one-shot container on scrabble-internal in this PR.
  - Harness = new scrabble/loadtest module (use ./loadtest + a replace scrabble/gateway for the dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions directly in Postgres (token hash mirrors backend/internal/session), drives players over the edge protocol, generates mid-ranked legal moves locally with the embedded scrabble-solver by replaying game.history (the edge carries no board; this mirrors engine.ReplayBoard via the public API), and includes a gateway-hammer. Compact CLI (run/cleanup), distroless Dockerfile (DAWGs baked), Go unit tests.
  - Adding the module broke the other images' builds: the backend/gateway/telegram Dockerfiles reduce the workspace but still referenced ./loadtest (not in their context); each now also -dropuse=./loadtest (backend/telegram additionally -dropreplace the gateway replace). Caught by the first deploy run; verified by building all four images.
  - Harness payload fixes found by the smoke pass: the draft DTO's rack_order is a string (was sent as [] → bad_request); the display-name validator forbids digits/colons, so the cleanup marker became a letters-only Zzloadtest so profile.update resends the seeded name. chat_not_your_turn/nudge_own_turn are by-design turn gates, correctly exercised.
  - Observability: added cAdvisor + postgres_exporter + the Scrabble — Resources dashboard + two Prometheus jobs. Finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker breaks its layer-ID resolution; the existing galaxy deploy has the same limit), so per-container CPU/RSS for the early pass was captured via docker stats. R7: adopt the otelcol docker_stats receiver (already the contrib image) for per-container metrics in Grafana.
  - Early run (2026-06-09): ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000 accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's single shared http2.Transport; the harness itself peaked at 86 % of a core on the same host, so the figures are pessimistic. Full trip report in ../loadtest/REPORT-R2.md; it feeds R3 (h2c MaxConcurrentStreams/timeouts, body-size cap), R6 and R7 (per-player transports, separate hardware, pool/limit sizing).
  - CI: ./loadtest/... added to the path filter + vet/build/test; go.work.sum carries the new deps.
- R3 (interview + implementation):
  - Locked decisions: the flag column lands by editing the R1 baseline (+ a contour schema wipe after merge, so no migration chain accrues before prod); auto-flag defaults 1000 rejected / 10 min (BACKEND_HIGHRATE_FLAG_THRESHOLD/_WINDOW, rolling window, set-once, operator clears, no auto-ban); landing image = caddy:2-alpine; throttle data flows gateway → backend (a 30 s per-key summary POST to the new /api/v1/internal/ratelimit/report, the existing trusted direction) with the episode window + flag rule in the backend (internal/ratewatch); rejection logging = Warn summary per key per window + Debug per rejection, a deliberate deviation from the phase's "structured log per rejection" (the R2 hammer would have logged ~522k lines in minutes); all three R2-report tails included (explicit h2c sizing, the session-resolve failure cause at Warn, reviving the admin limiter).
  - Body cap: GATEWAY_MAX_BODY_BYTES (default 1 MiB) as both the Connect per-message read limit and an http.MaxBytesReader wrap of the public mux; an oversized Execute is resource_exhausted.
  - Dead config found: AdminPerMinute/AdminBurst were never wired; the gateway /_gm mount is now 429-guarded per IP ahead of its Basic-Auth. The caddy-fronted contour path stays unlimited (stock caddy has no limiter), an accepted gap recorded in docs/ARCHITECTURE.md §12.
  - Landing split: a landing target in gateway/Dockerfile (the UI build stage is shared; identical compose build args keep it one cached build); the gateway drops landing.html from the embed and 308-redirects / → /app/; the contour caddy routes /app/, /telegram/ and the Connect path to the gateway and the catch-all to the landing container; the CI deploy probe now checks both / (landing) and /app/ (gateway).
  - Observability: gateway_rate_limited_total{class} (user/public/email/admin, aggregate-only) + a rate-vs-rejections panel on the Edge/UX dashboard; the admin console gains the Throttled page (the in-memory episode window, reset-on-restart like active_users, plus the flagged-account queue) and the flag badge / clear action on the user list / card.
  - The jet regen also restored the previously missing game_drafts/game_hidden generated models (their tables were added after the last jetgen run; no behaviour change).