R1: schema & naming reset — squash migrations, rename variants
CI / changes (pull_request) Successful in 2s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 11s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m8s

Squash the 12 goose migrations into one 00001_baseline.sql (there is no prod
data; verified schema-identical to the chain via a pg_dump diff + the green
integration suite) and rename the game-variant labels
english/russian_scrabble/erudit -> scrabble_en/scrabble_ru/erudit_ru across the
backend, the FlatBuffers wire values and the UI.

dawg filenames and the Go enum identifiers are unchanged; the i18n display keys
are kept. Adds PRERELEASE.md (the R1-R7 pre-release tracker), linked from
CLAUDE.md. Contour DB wipe and the scrabble-dictionary tidy are follow-ups.
This commit is contained in:
Ilia Denisov
2026-06-09 12:09:50 +02:00
parent 70e3fab512
commit 26aa154547
54 changed files with 688 additions and 675 deletions
+219
View File
@@ -0,0 +1,219 @@
# Pre-release plan — hardening before Stage 18
Living tracker for the pre-release hardening pass that runs **before Stage 18** (the
prod cutover). Same discipline as [`PLAN.md`](PLAN.md): one phase per session,
**interview the owner on the open details** at the start of each phase, bake every
decision back into `PLAN.md` / `docs/` / the affected `README`s / Go Doc comments in
the **same** PR, get CI green, then mark the phase done. Phases run as
`feature/* → development` PRs (the Stage 16 branch model); the owner approves+merges.
**Why now:** the system is feature-complete through Stage 17 and the test contour is
green, but there is **no prod data yet** — schema, wire labels and the dictionary
layout can still change for free. These phases spend that one-time freedom and harden
the edge before prod. Each phase maps back to the owner's raw pre-release TODO list
(numbers in the tracker).
## Phase tracker
| # | Phase | Raw TODOs | Status |
|---|-------|-----------|--------|
| R1 | Schema & naming reset | 1 + 10 | **in review** |
| R2 | Stress harness + contour observability + early run | 9a | todo |
| R3 | Edge hardening | 2 + 8 + 3 | todo |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |
## Key findings (these reshaped the raw list — read before starting a phase)
- **R1 (TODO 1 + 10) is one cheap moment, now.** Squashing the 12 goose migrations is
safe precisely because there is no prod data and the contour DB is wiped. Folding the
new variant labels (`scrabble_ru`/`scrabble_en`/`erudit_ru`) into that single baseline
makes the rename need **no data migration and no back-compat mapping**. Today's labels
(`english`/`russian_scrabble`/`erudit`) are persisted in `games.variant`,
`game_invitations.variant`, in `pkg/fbs` and the UI — ~100 files, but a mechanical sweep
on a clean DB.
- **R4 (TODO 4 + 5): the app is already push-first.** Game state refreshes on
`your_turn`/`opponent_moved`, the lobby on `notify`, chat on `chat_message`. The **only**
genuine periodic server poll is `lobby.poll` (matchmaking, 2.5 s,
`ui/src/screens/NewGame.svelte`). What remains is killing that one poll **and** enriching
push events to carry payloads so the UI stops re-fetching after each signal.
- **R3 (TODO 2): identity forgery is already mitigated.** Identity is always derived from
the session (`Authorization: Bearer``X-User-ID`); the client cannot inject identity,
the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real
gaps are a missing **request-body size limit** (cheap DoS) and **invisible rate-limit
rejections** (no log/metric/admin view — that is TODO 8). Static landing serving is **not**
covered by the gateway token bucket (it only guards `Execute`).
- **R6 (TODO 7) scale:** ~431 `Stage N` references across ~104 files (incl. the file name
`backend/internal/inttest/stage6_test.go`). Code is the source of truth; `docs/` describe
current state; `PLAN.md` keeps the decision history.
## Locked decisions (owner interview)
- **Stress test (TODO 9):** **early + final** runs. Driver = **edge protocol** (Connect/FB
through the gateway, moves generated by the solver) **plus a separate gateway-hammer**
saturation test. Pacing = **realistic (under limits) + saturation (ramp to the knee)**.
Resource metrics = **add cAdvisor + postgres_exporter to the contour** (today only
Go-runtime metrics exist). The harness stays in the repo for repeats.
- **Push (TODO 4 + 5):** **both** — kill `lobby.poll` (use the existing `match_found`, keep
poll as the ws-down fallback) **and** enrich push events with payloads.
- **Refactor (TODO 7):** **hygiene + structural changes by a reviewed list**
behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
- **Landing (TODO 3):** **separate static container** behind the project caddy
(`/` → landing, `/app/` + `/telegram/` → gateway); drop `landing.html` from the gateway
`go:embed`.
- **Rate-abuse (TODO 8):** metric + Grafana + admin view **plus a conservative auto-flag**
a *soft, reversible* "suspected high-rate" marker for operator review, tunable threshold,
**no auto-ban**.
## Phases
Each phase: read this tracker + the relevant `docs/`, **interview the owner on the open
details below**, implement within scope, then update the tracker + docs/code and get CI
green before marking it done.
### R1 — Schema & naming reset *(TODO 1 + 10)* — first
Squash `backend/internal/postgres/migrations/00001..00012` into one `00001_baseline.sql`
(method: `pg_dump --schema-only` from a fully-migrated DB → wrap as the goose baseline →
prove a fresh migrate yields a schema identical to the 12-migration chain via the
integration suite → delete the old files; keep goose). Bake the new variant labels into the
baseline. Propagate `scrabble_ru`/`scrabble_en`/`erudit_ru` through the backend
(`engine.Variant`/`ParseVariant`, `registry.dictFiles`, the CHECK values), the wire
(`pkg/fbs` `variant:string`, regenerate FB) and the UI (`lib/model.ts` union, `variants.ts`,
fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy
`../scrabble-dictionary` to a single source→dawg build point and align the dawg artifact
names to the new labels (crosses into `../scrabble-solver`'s committed fixtures — keep them
byte-identical). After merge, **wipe the contour DB** (drop the volume) so it re-provisions
on the next deploy.
- Critical files: `backend/internal/postgres/migrations/`,
`backend/internal/engine/{engine,registry}.go`, `pkg/fbs/scrabble.fbs`,
`ui/src/lib/{model,variants}.ts`, `../scrabble-dictionary/{Makefile,cmd/builddict,…}`.
- Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is
one PR or split; how to script the contour DB wipe in the deploy.
### R2 — Stress harness + contour observability + early run *(TODO 9, part 1)*
Build the reusable load harness as a new `loadtest` module in `go.work` (reuses `pkg/fbs`,
`connect-go`, and `scrabble-solver` for legal-move generation): a seeder that inserts
**1000 guest + 10000 durable** accounts with pre-created sessions (token hashes) directly in
the DB and hands the plaintext tokens to the client; a driver that runs N virtual users,
each in 35 concurrent 24-player games, exercising submit-play / pass / exchange / nudge /
chat / check-word / draft-move / profile-save through the **edge protocol**, in
**realistic** (under rate limits) and **saturation** (ramp) modes; plus a separate
**gateway-hammer** that deliberately exceeds limits to verify the limiter holds and measure
its cost. Add **cAdvisor + postgres_exporter** to `deploy/docker-compose.yml` and a Grafana
resource dashboard. Run the **early pass** against the freshly-wiped contour; produce a
**trip report** (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.
- Critical files: new `loadtest/`, `deploy/docker-compose.yml`, `deploy/observability/*`,
`docs/TESTING.md`.
- Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move
for realistic game progress); run duration; the pass/fail bar.
### R3 — Edge hardening *(TODO 2 + 8 + 3)*
Add a **request-body size cap** at the gateway h2c mux / `Execute` (e.g. ~1 MB). Add
**rate-limit observability**: a `gateway_rate_limited_total{class}` counter + a structured
log per rejection; an **aggregate** Grafana panel (request rate + rejection rate — spikes
visible without per-user label cardinality, honouring the Stage 12/17 discipline); an
**admin-console view** of recently throttled users/IPs (in-memory ring buffer, single-
instance, reset-on-restart, like the `active_users` gauge). Add the **conservative
auto-flag**: when a user is *sustained*-throttled past a tunable threshold, set a soft,
reversible `account.flagged_high_rate_at` marker (baked into the R1 baseline) surfaced in the
admin user list/detail — **no auto-ban**; the operator clears it. Split the **landing** into
its own static container (`deploy/` + a Caddyfile route `/` → landing) and drop
`landing.html` from the gateway `go:embed`.
- Critical files: `gateway/internal/connectsrv/server.go`, `gateway/internal/ratelimit/`,
`gateway/internal/connectsrv/metrics.go`, `backend/internal/adminconsole/`,
`deploy/caddy/Caddyfile`, `deploy/docker-compose.yml`, `gateway/internal/webui/`.
- Open details: the auto-flag threshold/window + whether the marker is persisted vs
in-memory; the landing image base (caddy vs nginx).
### R4 — Push enrichment + kill the last poll *(TODO 4 + 5)*
Replace `lobby.poll` with the existing `match_found` push (keep the poll as a ws-down
fallback). Enrich `your_turn`/`opponent_moved`/`notify` to carry the state payload so the UI
renders from the event without a follow-up `game.state` (removes the lobby↔game nav latency
the owner noticed). Wire-contract change: `pkg/fbs` event payloads → backend `notify` emit →
UI stream consumers (`ui/src/lib/app.svelte.ts`), with the per-game cache as the landing
spot; regenerate FB.
- Critical files: `pkg/fbs/scrabble.fbs`, `backend/internal/notify/events.go`,
`ui/src/lib/{app.svelte,transport}.ts`, `ui/src/screens/NewGame.svelte`.
- Open details: which events carry full vs delta payloads; the fallback-poll cadence when the
stream is down.
### R5 — Bundle slimming *(TODO 6)*
Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by
language via dynamic imports; re-measure against the existing 100 KB-gzip budget
(`ui/scripts/bundle-size.mjs`, ~82 KB today). If the win is marginal, stop — acceptable per
the owner.
- Critical files: `ui/src/App.svelte`, `ui/vite.config.ts`, `ui/src/lib/i18n/`.
### R6 — Refactor + docs reconciliation + de-staging *(TODO 7)* — near last
Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical
**de-staging** — remove `Stage N`/`TODO-N` references from code, comments and service
READMEs (rename `stage6_test.go`); (b) **docs↔code reconciliation** — reconcile
`docs/ARCHITECTURE.md` / `docs/FUNCTIONAL.md`(+`_ru`) against the code-as-truth, fixing drift
and Go Doc comments; (c) **structural changes by a reviewed list** — surface a list of
proposed optimizations / test-suite consolidations to the owner, apply only the approved,
behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the
regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test
consolidation targets.
### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18
Re-run the R2 harness against the final, refactored system on a clean contour; analyse
resource consumption across **all** components (gateway, backend, Postgres, the
metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record
the methodology + results in the repo.
**Stage 18** (prod contour) then proceeds per [`PLAN.md`](PLAN.md).
## Sequencing rationale
`R1` first (cheapest now; everything builds on the final schema/naming and the stress test
must run against it). `R2` builds the harness and runs the **early** pass to surface bugs and
a resource baseline that feed `R3` and `R6`. `R3`/`R4`/`R5` harden and improve the system.
`R6` (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and
benefits from all accumulated bug knowledge. `R7` validates the final system and tunes it.
Then Stage 18.
## Regression-safety discipline (cross-cutting)
- Every phase is a `feature/* → development` PR; CI (`unit` + `integration` + `ui` behind the
`CI / gate` check) must be green before the owner merges; watch the post-merge contour
deploy with `gitea-ci-watch.py`.
- `R6` structural changes are behaviour-preserving, test-gated, and split from the mechanical
sweeps; contentious items are owner-approved first.
- The two stress runs (`R2` early, `R7` final) are the system-level regression gate.
## Verification (per phase)
- `go build ./<module>/...`, `go vet`, `gofmt -l .` clean, `go test -count=1 ./<module>/...`;
UI: `pnpm check && pnpm test:unit && pnpm build`; the integration suite
(`-tags integration`) for DB/schema changes; `docker compose config` for deploy changes;
green CI on the PR + a healthy contour deploy.
- `R1`: prove the squashed baseline yields a schema identical to the 12-migration chain
(integration suite on a fresh DB) **before** deleting the old files.
- `R2`/`R7`: the harness runs end-to-end against the contour; the trip report lists concrete
defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.
## Refinements logged during implementation
- **R1** (interview + implementation):
- **Variant labels** `english`/`russian_scrabble`/`erudit`**`scrabble_en`/`scrabble_ru`/`erudit_ru`**
across the backend (`engine.Variant.String`/`ParseVariant`; the `games`/`game_invitations` `variant`
CHECK in the baseline; GCG `#lexicon` and the `variant` metric attribute both flow from `String`),
the wire (`pkg/fbs` `variant` is a `string` field — values change with **no FlatBuffers regen**) and
the UI (`model.ts` union, `variants.ts` records, `codec`/`premiums`/mocks/tests, the admin
`dictionary.gohtml`). **Kept:** the Go enum identifiers (`VariantEnglish`…, internal) and the i18n
display keys (`new.english`/`new.russian`/`new.erudit`, display-only). `complaints.variant` stays
free-text (no CHECK, as before).
- **dawg filenames kept descriptive** (`en_sowpods`/`ru_scrabble`/`ru_erudit`) — only the registry's
`Variant` key carries the rename, so `registry.go`, the published `scrabble-solver` fixtures and the
dictionary release artifact are untouched (decouples the three repos).
- **Migrations squashed** 12 → one hand-written `00001_baseline.sql`. Verified by a
`pg_dump --schema-only` diff (the chain vs the baseline are **identical** but for the two intended
variant-CHECK values) plus the green integration suite. **No data migration** (no production data).
- **Follow-ups:** post-merge, drop the contour `postgres-data` volume on the runner host so the next
deploy provisions a clean DB on the new baseline; the **`scrabble-dictionary` tidy** ships as its own
PR in that repo (sources consolidated to a single build point; byte-identical artifact).