scrabble-game/PRERELEASE.md
Ilia Denisov 8bfc44aad0
R1: mark done in PRERELEASE.md (post-merge close-out)
scrabble-game #31 + scrabble-dictionary #2 merged, v1.0.1 cut, contour DB wiped
and re-migrated to the baseline (verified).
2026-06-09 16:11:03 +02:00

# Pre-release plan — hardening before Stage 18
Living tracker for the pre-release hardening pass that runs **before Stage 18** (the
prod cutover). Same discipline as [`PLAN.md`](PLAN.md): one phase per session,
**interview the owner on the open details** at the start of each phase, bake every
decision back into `PLAN.md` / `docs/` / the affected `README`s / Go Doc comments in
the **same** PR, get CI green, then mark the phase done. Phases run as
`feature/* → development` PRs (the Stage 16 branch model); the owner approves+merges.
**Why now:** the system is feature-complete through Stage 17 and the test contour is
green, but there is **no prod data yet** — schema, wire labels and the dictionary
layout can still change for free. These phases spend that one-time freedom and harden
the edge before prod. Each phase maps back to the owner's raw pre-release TODO list
(numbers in the tracker).
## Phase tracker
| # | Phase | Raw TODOs | Status |
|---|-------|-----------|--------|
| R1 | Schema & naming reset | 1 + 10 | **done** |
| R2 | Stress harness + contour observability + early run | 9a | todo |
| R3 | Edge hardening | 2 + 8 + 3 | todo |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |
## Key findings (these reshaped the raw list — read before starting a phase)
- **R1 (TODO 1 + 10) is one cheap moment, now.** Squashing the 12 goose migrations is
safe precisely because there is no prod data and the contour DB is wiped. Folding the
new variant labels (`scrabble_ru`/`scrabble_en`/`erudit_ru`) into that single baseline
makes the rename need **no data migration and no back-compat mapping**. Today's labels
(`english`/`russian_scrabble`/`erudit`) are persisted in `games.variant`,
`game_invitations.variant`, in `pkg/fbs` and the UI — ~100 files, but a mechanical sweep
on a clean DB.
- **R4 (TODO 4 + 5): the app is already push-first.** Game state refreshes on
`your_turn`/`opponent_moved`, the lobby on `notify`, chat on `chat_message`. The **only**
genuine periodic server poll is `lobby.poll` (matchmaking, 2.5 s,
`ui/src/screens/NewGame.svelte`). What remains is killing that one poll **and** enriching
push events to carry payloads so the UI stops re-fetching after each signal.
- **R3 (TODO 2): identity forgery is already mitigated.** Identity is always derived from
the session (`Authorization: Bearer` → `X-User-ID`); the client cannot inject identity,
the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real
gaps are a missing **request-body size limit** (cheap DoS) and **invisible rate-limit
rejections** (no log/metric/admin view — that is TODO 8). Static landing serving is **not**
covered by the gateway token bucket (it only guards `Execute`).
- **R6 (TODO 7) scale:** ~431 `Stage N` references across ~104 files (incl. the file name
`backend/internal/inttest/stage6_test.go`). Code is the source of truth; `docs/` describe
current state; `PLAN.md` keeps the decision history.
## Locked decisions (owner interview)
- **Stress test (TODO 9):** **early + final** runs. Driver = **edge protocol** (Connect/FB
through the gateway, moves generated by the solver) **plus a separate gateway-hammer**
saturation test. Pacing = **realistic (under limits) + saturation (ramp to the knee)**.
Resource metrics = **add cAdvisor + postgres_exporter to the contour** (today only
Go-runtime metrics exist). The harness stays in the repo for repeats.
- **Push (TODO 4 + 5):** **both** — kill `lobby.poll` (use the existing `match_found`, keep
poll as the ws-down fallback) **and** enrich push events with payloads.
- **Refactor (TODO 7):** **hygiene + structural changes by a reviewed list**:
behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
- **Landing (TODO 3):** **separate static container** behind the project caddy
(`/` → landing, `/app/` + `/telegram/` → gateway); drop `landing.html` from the gateway
`go:embed`.
- **Rate-abuse (TODO 8):** metric + Grafana + admin view **plus a conservative auto-flag**:
a *soft, reversible* "suspected high-rate" marker for operator review, tunable threshold,
**no auto-ban**.
## Phases
Each phase: read this tracker + the relevant `docs/`, **interview the owner on the open
details below**, implement within scope, then update the tracker + docs/code and get CI
green before marking it done.
### R1 — Schema & naming reset *(TODO 1 + 10)* — first
Squash `backend/internal/postgres/migrations/00001..00012` into one `00001_baseline.sql`
(method: `pg_dump --schema-only` from a fully-migrated DB → wrap as the goose baseline →
prove a fresh migrate yields a schema identical to the 12-migration chain via the
integration suite → delete the old files; keep goose). Bake the new variant labels into the
baseline. Propagate `scrabble_ru`/`scrabble_en`/`erudit_ru` through the backend
(`engine.Variant`/`ParseVariant`, `registry.dictFiles`, the CHECK values), the wire
(`pkg/fbs` `variant:string`, regenerate FB) and the UI (`lib/model.ts` union, `variants.ts`,
fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy
`../scrabble-dictionary` to a single source→dawg build point and align the dawg artifact
names to the new labels (crosses into `../scrabble-solver`'s committed fixtures — keep them
byte-identical). After merge, **wipe the contour DB** (drop the volume) so it re-provisions
on the next deploy.
- Critical files: `backend/internal/postgres/migrations/`,
`backend/internal/engine/{engine,registry}.go`, `pkg/fbs/scrabble.fbs`,
`ui/src/lib/{model,variants}.ts`, `../scrabble-dictionary/{Makefile,cmd/builddict,…}`.
- Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is
one PR or split; how to script the contour DB wipe in the deploy.
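
As an illustration of the label propagation described above, a minimal sketch of a variant type whose `String`/`ParseVariant` pair speaks only the new labels. `engine.Variant`, `ParseVariant`, `VariantEnglish` and the three labels come from this plan; `VariantRussian`, `VariantErudit` and the lookup table are assumptions made for the example, and the real `engine` package may be organised differently.

```go
package main

import "fmt"

// Variant is a stand-in for engine.Variant.
type Variant int

const (
	VariantEnglish Variant = iota // identifier named in the plan
	VariantRussian                // hypothetical identifier
	VariantErudit                 // hypothetical identifier
)

// variantLabels holds the persisted and wire labels. Only the new labels are
// listed: with a wiped DB there is no back-compat mapping for the old ones.
var variantLabels = map[Variant]string{
	VariantEnglish: "scrabble_en",
	VariantRussian: "scrabble_ru",
	VariantErudit:  "erudit_ru",
}

// String returns the label stored in the DB and sent on the wire.
func (v Variant) String() string {
	if s, ok := variantLabels[v]; ok {
		return s
	}
	return fmt.Sprintf("Variant(%d)", int(v))
}

// ParseVariant maps a label back to its Variant and rejects anything else,
// including the retired labels.
func ParseVariant(s string) (Variant, error) {
	for v, label := range variantLabels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	for _, in := range []string{"scrabble_ru", "english"} {
		v, err := ParseVariant(in)
		fmt.Println(in, v, err)
	}
}
```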
### R2 — Stress harness + contour observability + early run *(TODO 9, part 1)*
Build the reusable load harness as a new `loadtest` module in `go.work` (reuses `pkg/fbs`,
`connect-go`, and `scrabble-solver` for legal-move generation): a seeder that inserts
**1000 guest + 10000 durable** accounts with pre-created sessions (token hashes) directly in
the DB and hands the plaintext tokens to the client; a driver that runs N virtual users,
each in 3-5 concurrent 2-4-player games, exercising submit-play / pass / exchange / nudge /
chat / check-word / draft-move / profile-save through the **edge protocol**, in
**realistic** (under rate limits) and **saturation** (ramp) modes; plus a separate
**gateway-hammer** that deliberately exceeds limits to verify the limiter holds and measure
its cost. Add **cAdvisor + postgres_exporter** to `deploy/docker-compose.yml` and a Grafana
resource dashboard. Run the **early pass** against the freshly-wiped contour; produce a
**trip report** (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.
- Critical files: new `loadtest/`, `deploy/docker-compose.yml`, `deploy/observability/*`,
`docs/TESTING.md`.
- Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move
for realistic game progress); run duration; the pass/fail bar.
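
The seeder's "pre-created sessions (token hashes)" step might look like the sketch below. It assumes the session table stores a hex-encoded SHA-256 of the bearer token; that hashing scheme, `seededSession` and both function names are assumptions for illustration and must be matched to whatever the backend actually verifies.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seededSession (hypothetical) pairs the plaintext token handed to the load
// driver with the hash a seeder would insert into the DB.
type seededSession struct {
	Token string
	Hash  string
}

// hashToken returns the hex-encoded SHA-256 of token (assumed scheme).
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newSeededSession draws 32 random bytes, hex-encodes them as the token and
// derives the hash to persist.
func newSeededSession() (seededSession, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return seededSession{}, err
	}
	token := hex.EncodeToString(raw)
	return seededSession{Token: token, Hash: hashToken(token)}, nil
}

func main() {
	for i := 0; i < 3; i++ {
		s, err := newSeededSession()
		if err != nil {
			panic(err)
		}
		// Only the hash would be written to the DB; the token goes to the driver.
		fmt.Printf("session %d: hash prefix %s\n", i, s.Hash[:12])
	}
}
```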
### R3 — Edge hardening *(TODO 2 + 8 + 3)*
Add a **request-body size cap** at the gateway h2c mux / `Execute` (e.g. ~1 MB). Add
**rate-limit observability**: a `gateway_rate_limited_total{class}` counter + a structured
log per rejection; an **aggregate** Grafana panel (request rate + rejection rate — spikes
visible without per-user label cardinality, honouring the Stage 12/17 discipline); an
**admin-console view** of recently throttled users/IPs (in-memory ring buffer,
single-instance, reset-on-restart, like the `active_users` gauge). Add the **conservative
auto-flag**: when a user is *sustained*-throttled past a tunable threshold, set a soft,
reversible `account.flagged_high_rate_at` marker (baked into the R1 baseline) surfaced in the
admin user list/detail — **no auto-ban**; the operator clears it. Split the **landing** into
its own static container (`deploy/` + a Caddyfile route `/` → landing) and drop
`landing.html` from the gateway `go:embed`.
- Critical files: `gateway/internal/connectsrv/server.go`, `gateway/internal/ratelimit/`,
`gateway/internal/connectsrv/metrics.go`, `backend/internal/adminconsole/`,
`deploy/caddy/Caddyfile`, `deploy/docker-compose.yml`, `gateway/internal/webui/`.
- Open details: the auto-flag threshold/window + whether the marker is persisted vs
in-memory; the landing image base (caddy vs nginx).
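
For the request-body size cap, one common approach in Go is to wrap the handler with the standard library's `http.MaxBytesReader`. The sketch below is a generic `net/http` illustration, not the gateway's actual wiring: `limitBody`, `echoSize` and `statusFor` are hypothetical names, and where the cap belongs in the h2c mux / `Execute` path is still an open design point.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxBodyBytes is the example ~1 MB figure from the plan.
const maxBodyBytes = 1 << 20

// limitBody caps the request body at limit bytes before next sees it.
func limitBody(next http.Handler, limit int64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		next.ServeHTTP(w, r)
	})
}

// echoSize drains the body; a read error here means the cap was exceeded
// (or the body was otherwise unreadable), so it answers 413.
func echoSize(w http.ResponseWriter, r *http.Request) {
	n, err := io.Copy(io.Discard, r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	fmt.Fprintf(w, "read %d bytes", n)
}

// statusFor sends a body of bodySize bytes through the limited handler and
// returns the response status code.
func statusFor(bodySize int, limit int64) int {
	h := limitBody(http.HandlerFunc(echoSize), limit)
	body := strings.NewReader(strings.Repeat("x", bodySize))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(10, 16))               // body under the cap
	fmt.Println(statusFor(64, 16))               // body over the cap
	fmt.Println(statusFor(1024, maxBodyBytes))   // typical small request
}
```

The limit is enforced while reading, so an oversized upload is cut off after `limit` bytes instead of being buffered in full.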
### R4 — Push enrichment + kill the last poll *(TODO 4 + 5)*
Replace `lobby.poll` with the existing `match_found` push (keep the poll as a ws-down
fallback). Enrich `your_turn`/`opponent_moved`/`notify` to carry the state payload so the UI
renders from the event without a follow-up `game.state` (removes the lobby↔game nav latency
the owner noticed). Wire-contract change: `pkg/fbs` event payloads → backend `notify` emit →
UI stream consumers (`ui/src/lib/app.svelte.ts`), with the per-game cache as the landing
spot; regenerate FB.
- Critical files: `pkg/fbs/scrabble.fbs`, `backend/internal/notify/events.go`,
`ui/src/lib/{app.svelte,transport}.ts`, `ui/src/screens/NewGame.svelte`.
- Open details: which events carry full vs delta payloads; the fallback-poll cadence when the
stream is down.
### R5 — Bundle slimming *(TODO 6)*
Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by
language via dynamic imports; re-measure against the existing 100 KB-gzip budget
(`ui/scripts/bundle-size.mjs`, ~82 KB today). If the win is marginal, stop — acceptable per
the owner.
- Critical files: `ui/src/App.svelte`, `ui/vite.config.ts`, `ui/src/lib/i18n/`.
### R6 — Refactor + docs reconciliation + de-staging *(TODO 7)* — near last
Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical
**de-staging** — remove `Stage N`/`TODO-N` references from code, comments and service
READMEs (rename `stage6_test.go`); (b) **docs↔code reconciliation** — reconcile
`docs/ARCHITECTURE.md` / `docs/FUNCTIONAL.md`(+`_ru`) against the code-as-truth, fixing drift
and Go Doc comments; (c) **structural changes by a reviewed list** — surface a list of
proposed optimizations / test-suite consolidations to the owner, apply only the approved,
behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the
regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test
consolidation targets.
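
To size and track pass (a), the de-staging sweep can be measured with a small reference counter. The pattern below is an assumption about what counts as a reference (`Stage N` and `TODO-N` with a numeric `N`); it does not catch file names such as `stage6_test.go`, which need a separate rename.

```go
package main

import (
	"fmt"
	"regexp"
)

// stageRef matches "Stage 12" or "TODO-7" style references (assumed pattern).
var stageRef = regexp.MustCompile(`\b(?:Stage \d+|TODO-\d+)\b`)

// countStageRefs returns how many references appear in text.
func countStageRefs(text string) int {
	return len(stageRef.FindAllString(text, -1))
}

func main() {
	src := "// Stage 12: rate limiter\n// see TODO-7 and Stage 17\nfunc staged() {}\n"
	fmt.Println(countStageRefs(src))
}
```

Run over every file in the repo (for example via `filepath.WalkDir`), the total gives a number that should reach zero when the pass is complete.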
### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18
Re-run the R2 harness against the final, refactored system on a clean contour; analyse
resource consumption across **all** components (gateway, backend, Postgres, the
metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record
the methodology + results in the repo.
**Stage 18** (prod contour) then proceeds per [`PLAN.md`](PLAN.md).
## Sequencing rationale
`R1` first (cheapest now; everything builds on the final schema/naming and the stress test
must run against it). `R2` builds the harness and runs the **early** pass to surface bugs and
a resource baseline that feed `R3` and `R6`. `R3`/`R4`/`R5` harden and improve the system.
`R6` (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and
benefits from all accumulated bug knowledge. `R7` validates the final system and tunes it.
Then Stage 18.
## Regression-safety discipline (cross-cutting)
- Every phase is a `feature/* → development` PR; CI (`unit` + `integration` + `ui` behind the
`CI / gate` check) must be green before the owner merges; watch the post-merge contour
deploy with `gitea-ci-watch.py`.
- `R6` structural changes are behaviour-preserving, test-gated, and split from the mechanical
sweeps; contentious items are owner-approved first.
- The two stress runs (`R2` early, `R7` final) are the system-level regression gate.
## Verification (per phase)
- `go build ./<module>/...`, `go vet`, `gofmt -l .` clean, `go test -count=1 ./<module>/...`;
UI: `pnpm check && pnpm test:unit && pnpm build`; the integration suite
(`-tags integration`) for DB/schema changes; `docker compose config` for deploy changes;
green CI on the PR + a healthy contour deploy.
- `R1`: prove the squashed baseline yields a schema identical to the 12-migration chain
(integration suite on a fresh DB) **before** deleting the old files.
- `R2`/`R7`: the harness runs end-to-end against the contour; the trip report lists concrete
defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.
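
The `R1` schema-identity check can be approximated by comparing two `pg_dump --schema-only` outputs after light normalization. This is a simplified, order-insensitive sketch with hypothetical helper names (`normalize`, `onlyIn`); a plain textual `diff` of the two dumps is the stricter check, and the sample dumps in `main` are invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize trims each line of a schema dump and drops blank lines and
// SQL comment lines, which carry no schema information.
func normalize(dump string) []string {
	var out []string
	for _, line := range strings.Split(dump, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "--") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// onlyIn returns the lines of a that have no matching line in b,
// respecting how many times each line occurs.
func onlyIn(a, b []string) []string {
	remaining := make(map[string]int)
	for _, l := range b {
		remaining[l]++
	}
	var diff []string
	for _, l := range a {
		if remaining[l] > 0 {
			remaining[l]--
			continue
		}
		diff = append(diff, l)
	}
	return diff
}

func main() {
	// Invented sample dumps: the migration chain vs the squashed baseline.
	chain := "-- chain\nCREATE TABLE games (id int);\nCHECK (variant IN ('english'));\n"
	baseline := "CREATE TABLE games (id int);\n\nCHECK (variant IN ('scrabble_en'));\n"

	a, b := normalize(chain), normalize(baseline)
	fmt.Println("only in chain:   ", onlyIn(a, b))
	fmt.Println("only in baseline:", onlyIn(b, a))
}
```

Any line reported by either direction should be explainable as an intended change before the old migration files are deleted.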
## Refinements logged during implementation
- **R1** (interview + implementation):
- **Variant labels** `english`/`russian_scrabble`/`erudit` → **`scrabble_en`/`scrabble_ru`/`erudit_ru`**
across the backend (`engine.Variant.String`/`ParseVariant`; the `games`/`game_invitations` `variant`
CHECK in the baseline; GCG `#lexicon` and the `variant` metric attribute both flow from `String`),
the wire (`pkg/fbs` `variant` is a `string` field — values change with **no FlatBuffers regen**) and
the UI (`model.ts` union, `variants.ts` records, `codec`/`premiums`/mocks/tests, the admin
`dictionary.gohtml`). **Kept:** the Go enum identifiers (`VariantEnglish`…, internal) and the i18n
display keys (`new.english`/`new.russian`/`new.erudit`, display-only). `complaints.variant` stays
free-text (no CHECK, as before).
- **dawg filenames kept descriptive** (`en_sowpods`/`ru_scrabble`/`ru_erudit`) — only the registry's
`Variant` key carries the rename, so `registry.go`, the published `scrabble-solver` fixtures and the
dictionary release artifact are untouched (decouples the three repos).
- **Migrations squashed** 12 → one hand-written `00001_baseline.sql`. Verified by a
`pg_dump --schema-only` diff (the chain vs the baseline are **identical** but for the two intended
variant-CHECK values) plus the green integration suite. **No data migration** (no production data).
- **Done (cross-repo + contour):** the **`scrabble-dictionary` tidy** merged (PR #2) and was re-cut as
the **byte-identical `v1.0.1`** release for clean provenance (the backend stays on `v1.0.0` — same
bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the
contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and
re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`),
`games`=0 and a clean boot.