# Pre-release plan — hardening before Stage 18

Living tracker for the pre-release hardening pass that runs **before Stage 18** (the
prod cutover). Same discipline as [`PLAN.md`](PLAN.md): one phase per session,
**interview the owner on the open details** at the start of each phase, bake every
decision back into `PLAN.md` / `docs/` / the affected `README`s / Go Doc comments in
the **same** PR, get CI green, then mark the phase done. Phases run as
`feature/* → development` PRs (the Stage 16 branch model); the owner approves+merges.

**Why now:** the system is feature-complete through Stage 17 and the test contour is
green, but there is **no prod data yet** — schema, wire labels and the dictionary
layout can still change for free. These phases spend that one-time freedom and harden
the edge before prod. Each phase maps back to the owner's raw pre-release TODO list
(numbers in the tracker).

## Phase tracker

| # | Phase | Raw TODOs | Status |
|---|-------|-----------|--------|
| R1 | Schema & naming reset | 1 + 10 | **done** |
| R2 | Stress harness + contour observability + early run | 9a | todo |
| R3 | Edge hardening | 2 + 8 + 3 | todo |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
| R6 | Refactor + docs reconciliation + de-staging | 7 | todo |
| R7 | Final stress run + tuning | 9b | todo |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |

## Key findings (these reshaped the raw list — read before starting a phase)

- **R1 (TODO 1 + 10) is one cheap moment, now.** Squashing the 12 goose migrations is
  safe precisely because there is no prod data and the contour DB is wiped. Folding the
  new variant labels (`scrabble_ru`/`scrabble_en`/`erudit_ru`) into that single baseline
  makes the rename need **no data migration and no back-compat mapping**. Today's labels
  (`english`/`russian_scrabble`/`erudit`) are persisted in `games.variant`,
  `game_invitations.variant`, in `pkg/fbs` and the UI — ~100 files, but a mechanical sweep
  on a clean DB.
- **R4 (TODO 4 + 5): the app is already push-first.** Game state refreshes on
  `your_turn`/`opponent_moved`, the lobby on `notify`, chat on `chat_message`. The **only**
  genuine periodic server poll is `lobby.poll` (matchmaking, 2.5 s,
  `ui/src/screens/NewGame.svelte`). What remains is killing that one poll **and** enriching
  push events to carry payloads so the UI stops re-fetching after each signal.
- **R3 (TODO 2): identity forgery is already mitigated.** Identity is always derived from
  the session (`Authorization: Bearer` → `X-User-ID`); the client cannot inject identity,
  the backend re-validates resource ownership, Telegram initData is HMAC-checked. The real
  gaps are a missing **request-body size limit** (cheap DoS) and **invisible rate-limit
  rejections** (no log/metric/admin view — that is TODO 8). Static landing serving is **not**
  covered by the gateway token bucket (it only guards `Execute`).
- **R6 (TODO 7) scale:** ~431 `Stage N` references across ~104 files (incl. the file name
  `backend/internal/inttest/stage6_test.go`). Code is the source of truth; `docs/` describe
  current state; `PLAN.md` keeps the decision history.

## Locked decisions (owner interview)

- **Stress test (TODO 9):** **early + final** runs. Driver = **edge protocol** (Connect/FB
  through the gateway, moves generated by the solver) **plus a separate gateway-hammer**
  saturation test. Pacing = **realistic (under limits) + saturation (ramp to the knee)**.
  Resource metrics = **add cAdvisor + postgres_exporter to the contour** (today only
  Go-runtime metrics exist). The harness stays in the repo for repeats.
- **Push (TODO 4 + 5):** **both** — kill `lobby.poll` (use the existing `match_found`, keep
  poll as the ws-down fallback) **and** enrich push events with payloads.
- **Refactor (TODO 7):** **hygiene + structural changes by a reviewed list** —
  behaviour-preserving, test-gated, contentious items surfaced to the owner before applying.
- **Landing (TODO 3):** **separate static container** behind the project caddy
  (`/` → landing, `/app/` + `/telegram/` → gateway); drop `landing.html` from the gateway
  `go:embed`.
- **Rate-abuse (TODO 8):** metric + Grafana + admin view **plus a conservative auto-flag** —
  a *soft, reversible* "suspected high-rate" marker for operator review, tunable threshold,
  **no auto-ban**.

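
The conservative auto-flag locked in above can be pictured as a sliding-window counter over rate-limit rejections. The sketch below is only an illustration under stated assumptions: the `flagger` type, its method names and the use of plain Unix seconds are hypothetical, the threshold/window values are placeholders (the real ones are an open R3 detail), and the map-based state is not goroutine-safe.

```go
package main

import "fmt"

// flagger counts rate-limit rejections per user inside a sliding window and
// sets a soft marker once a user crosses the threshold. Hypothetical names;
// not the gateway's actual types. Not safe for concurrent use as written.
type flagger struct {
	windowSec int64              // sliding window length, in seconds
	threshold int                // rejections inside the window that trigger the flag
	hits      map[string][]int64 // rejection timestamps (Unix seconds) per user
	flaggedAt map[string]int64   // soft marker: when the user was flagged
}

func newFlagger(windowSec int64, threshold int) *flagger {
	return &flagger{
		windowSec: windowSec,
		threshold: threshold,
		hits:      map[string][]int64{},
		flaggedAt: map[string]int64{},
	}
}

// recordRejection notes one throttled request and reports whether this call
// newly set the soft flag. It never blocks or bans the user.
func (f *flagger) recordRejection(user string, now int64) bool {
	cutoff := now - f.windowSec
	kept := f.hits[user][:0]
	for _, t := range f.hits[user] {
		if t > cutoff { // keep only rejections still inside the window
			kept = append(kept, t)
		}
	}
	kept = append(kept, now)
	f.hits[user] = kept

	if _, already := f.flaggedAt[user]; already || len(kept) < f.threshold {
		return false
	}
	f.flaggedAt[user] = now
	return true
}

// clear is the operator action: the marker is reversible.
func (f *flagger) clear(user string) {
	delete(f.flaggedAt, user)
	delete(f.hits, user)
}

func (f *flagger) isFlagged(user string) bool {
	_, ok := f.flaggedAt[user]
	return ok
}

func main() {
	f := newFlagger(60, 3) // placeholder values: 3 rejections within 60 s
	for i := int64(0); i < 3; i++ {
		fmt.Println("newly flagged:", f.recordRejection("u1", 100+i))
	}
	f.clear("u1")
	fmt.Println("flagged after operator clear:", f.isFlagged("u1"))
}
```

Rejections that are spread out beyond the window never accumulate, which is what keeps the flag conservative; whether the marker lives in memory like this or in a persisted column is one of the open details listed under R3.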
## Phases

Each phase: read this tracker + the relevant `docs/`, **interview the owner on the open
details below**, implement within scope, then update the tracker + docs/code and get CI
green before marking it done.

### R1 — Schema & naming reset *(TODO 1 + 10)* — first
Squash `backend/internal/postgres/migrations/00001..00012` into one `00001_baseline.sql`
(method: `pg_dump --schema-only` from a fully-migrated DB → wrap as the goose baseline →
prove a fresh migrate yields a schema identical to the 12-migration chain via the
integration suite → delete the old files; keep goose). Bake the new variant labels into the
baseline. Propagate `scrabble_ru`/`scrabble_en`/`erudit_ru` through the backend
(`engine.Variant`/`ParseVariant`, `registry.dictFiles`, the CHECK values), the wire
(`pkg/fbs` `variant:string`, regenerate FB) and the UI (`lib/model.ts` union, `variants.ts`,
fixtures, premium/alphabet keys, tests); i18n display keys stay display-only. Tidy
`../scrabble-dictionary` to a single source→dawg build point and align the dawg artifact
names to the new labels (crosses into `../scrabble-solver`'s committed fixtures — keep them
byte-identical). After merge, **wipe the contour DB** (drop the volume) so it re-provisions
on the next deploy.
- Critical files: `backend/internal/postgres/migrations/`,
  `backend/internal/engine/{engine,registry}.go`, `pkg/fbs/scrabble.fbs`,
  `ui/src/lib/{model,variants}.ts`, `../scrabble-dictionary/{Makefile,cmd/builddict,…}`.
- Open details to interview: the exact dawg filename scheme; whether the dict-repo tidy is
  one PR or split; how to script the contour DB wipe in the deploy.

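
To make the label rename concrete, here is a minimal sketch of how a `Variant` enum could map to the new wire labels. It is an assumption-laden illustration, not the contents of `backend/internal/engine`: the `VariantRussian` and `VariantErudit` identifiers and the `variantLabels` table are hypothetical, and only the three label strings come from this plan.

```go
package main

import "fmt"

// Variant is an illustrative stand-in for engine.Variant; the real type may differ.
type Variant int

const (
	VariantEnglish Variant = iota
	VariantRussian         // hypothetical identifier
	VariantErudit          // hypothetical identifier
)

// variantLabels holds the persisted/wire labels baked into the baseline CHECK.
var variantLabels = map[Variant]string{
	VariantEnglish: "scrabble_en",
	VariantRussian: "scrabble_ru",
	VariantErudit:  "erudit_ru",
}

// String returns the persisted label for a known variant.
func (v Variant) String() string {
	if s, ok := variantLabels[v]; ok {
		return s
	}
	return fmt.Sprintf("Variant(%d)", int(v))
}

// ParseVariant accepts only the new labels. There is deliberately no mapping
// from the old ones, mirroring the "no back-compat mapping" decision.
func ParseVariant(s string) (Variant, error) {
	for v, label := range variantLabels {
		if label == s {
			return v, nil
		}
	}
	return 0, fmt.Errorf("unknown variant %q", s)
}

func main() {
	for _, in := range []string{"scrabble_en", "erudit_ru", "english"} {
		v, err := ParseVariant(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("parsed:", v)
	}
}
```

Because the Go identifiers stay internal, only the strings in the table change; an old label such as `english` is simply rejected, which is acceptable only because no persisted rows carry it.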
### R2 — Stress harness + contour observability + early run *(TODO 9, part 1)*
Build the reusable load harness as a new `loadtest` module in `go.work` (reuses `pkg/fbs`,
`connect-go`, and `scrabble-solver` for legal-move generation): a seeder that inserts
**1000 guest + 10000 durable** accounts with pre-created sessions (token hashes) directly in
the DB and hands the plaintext tokens to the client; a driver that runs N virtual users,
each in 3–5 concurrent 2–4-player games, exercising submit-play / pass / exchange / nudge /
chat / check-word / draft-move / profile-save through the **edge protocol**, in
**realistic** (under rate limits) and **saturation** (ramp) modes; plus a separate
**gateway-hammer** that deliberately exceeds limits to verify the limiter holds and measure
its cost. Add **cAdvisor + postgres_exporter** to `deploy/docker-compose.yml` and a Grafana
resource dashboard. Run the **early pass** against the freshly-wiped contour; produce a
**trip report** (logic/concurrency bugs + a resource baseline) that feeds R3 and R6.
- Critical files: new `loadtest/`, `deploy/docker-compose.yml`, `deploy/observability/*`,
  `docs/TESTING.md`.
- Open details: the scale ramp steps; the move-selection policy (a mid-ranked solver move
  for realistic game progress); run duration; the pass/fail bar.

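
The seeder idea above (store a token hash in the DB, hand the plaintext to the load client) could look roughly like the sketch below. It assumes hex-encoded SHA-256 as the hash, which this plan does not confirm; the harness would have to use whatever scheme the backend's session lookup actually applies. The `seededSession`, `hashToken` and `seed` names are hypothetical, and the DB insert itself is left out.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seededSession pairs the plaintext token (given to the load client) with
// the hash that would be written to the session row. Hypothetical type.
type seededSession struct {
	Plaintext string
	Hash      string
}

// hashToken is an ASSUMED scheme: hex(SHA-256(token)).
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newSeededSession draws 32 random bytes and derives the stored hash.
func newSeededSession() (seededSession, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return seededSession{}, err
	}
	token := hex.EncodeToString(raw)
	return seededSession{Plaintext: token, Hash: hashToken(token)}, nil
}

// seed produces n sessions; a real seeder would bulk-insert the hashes and
// write the plaintext tokens to a file for the driver.
func seed(n int) ([]seededSession, error) {
	out := make([]seededSession, 0, n)
	for i := 0; i < n; i++ {
		s, err := newSeededSession()
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	sessions, err := seed(3)
	if err != nil {
		panic(err)
	}
	for _, s := range sessions {
		fmt.Printf("token=%s… hash=%s…\n", s.Plaintext[:8], s.Hash[:8])
	}
}
```

Seeding sessions this way skips the login path entirely, so the driver measures the edge protocol under load rather than account creation.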
### R3 — Edge hardening *(TODO 2 + 8 + 3)*
Add a **request-body size cap** at the gateway h2c mux / `Execute` (e.g. ~1 MB). Add
**rate-limit observability**: a `gateway_rate_limited_total{class}` counter + a structured
log per rejection; an **aggregate** Grafana panel (request rate + rejection rate — spikes
visible without per-user label cardinality, honouring the Stage 12/17 discipline); an
**admin-console view** of recently throttled users/IPs (in-memory ring buffer,
single-instance, reset-on-restart, like the `active_users` gauge). Add the **conservative
auto-flag**: when a user is *sustained*-throttled past a tunable threshold, set a soft,
reversible `account.flagged_high_rate_at` marker (baked into the R1 baseline) surfaced in the
admin user list/detail — **no auto-ban**; the operator clears it. Split the **landing** into
its own static container (`deploy/` + a Caddyfile route `/` → landing) and drop
`landing.html` from the gateway `go:embed`.
- Critical files: `gateway/internal/connectsrv/server.go`, `gateway/internal/ratelimit/`,
  `gateway/internal/connectsrv/metrics.go`, `backend/internal/adminconsole/`,
  `deploy/caddy/Caddyfile`, `deploy/docker-compose.yml`, `gateway/internal/webui/`.
- Open details: the auto-flag threshold/window + whether the marker is persisted vs
  in-memory; the landing image base (caddy vs nginx).

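
One way to picture the request-body cap is a plain `net/http` middleware built on the standard library's `http.MaxBytesReader`. This is a sketch under assumptions, not the gateway's code: `limitBody`, `echoSize` and `probe` are hypothetical helpers, the 1 MB figure is the plan's example value, and the real cap may sit at a different layer of the mux or the Connect handler.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// limitBody wraps next so that reading more than maxBytes from the request
// body fails instead of buffering an arbitrarily large payload.
func limitBody(maxBytes int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
		next.ServeHTTP(w, r)
	})
}

// echoSize stands in for the real handler: it drains the body and answers
// 413 if the cap was hit while reading.
func echoSize(w http.ResponseWriter, r *http.Request) {
	n, err := io.Copy(io.Discard, r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	fmt.Fprintf(w, "%d", n)
}

// probe sends a body of bodySize bytes through the capped handler and
// returns the resulting HTTP status code.
func probe(maxBytes int64, bodySize int) int {
	body := strings.NewReader(strings.Repeat("x", bodySize))
	req := httptest.NewRequest(http.MethodPost, "/", body)
	rec := httptest.NewRecorder()
	limitBody(maxBytes, http.HandlerFunc(echoSize)).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	const capBytes = 1 << 20 // ~1 MB, the example value from the plan
	fmt.Println("1 KiB body:", probe(capBytes, 1024))
	fmt.Println("2 MiB body:", probe(capBytes, 2<<20))
}
```

The cap is enforced lazily, while the handler reads, so a stand-in handler that never touches the body would not trip it; the real placement needs to cover every path that decodes a request.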
### R4 — Push enrichment + kill the last poll *(TODO 4 + 5)*
Replace `lobby.poll` with the existing `match_found` push (keep the poll as a ws-down
fallback). Enrich `your_turn`/`opponent_moved`/`notify` to carry the state payload so the UI
renders from the event without a follow-up `game.state` (removes the lobby↔game nav latency
the owner noticed). Wire-contract change: `pkg/fbs` event payloads → backend `notify` emit →
UI stream consumers (`ui/src/lib/app.svelte.ts`), with the per-game cache as the landing
spot; regenerate FB.
- Critical files: `pkg/fbs/scrabble.fbs`, `backend/internal/notify/events.go`,
  `ui/src/lib/{app.svelte,transport}.ts`, `ui/src/screens/NewGame.svelte`.
- Open details: which events carry full vs delta payloads; the fallback-poll cadence when the
  stream is down.

### R5 — Bundle slimming *(TODO 6)*
Lazy-load secondary screens (Friends/Stats/Settings/About/Profile) and i18n catalogs by
language via dynamic imports; re-measure against the existing 100 KB-gzip budget
(`ui/scripts/bundle-size.mjs`, ~82 KB today). If the win is marginal, stop — acceptable per
the owner.
- Critical files: `ui/src/App.svelte`, `ui/vite.config.ts`, `ui/src/lib/i18n/`.

### R6 — Refactor + docs reconciliation + de-staging *(TODO 7)* — near last
Behaviour-preserving only. Three separable, separately-committed passes: (a) mechanical
**de-staging** — remove `Stage N`/`TODO-N` references from code, comments and service
READMEs (rename `stage6_test.go`); (b) **docs↔code reconciliation** — reconcile
`docs/ARCHITECTURE.md` / `docs/FUNCTIONAL.md`(+`_ru`) against the code-as-truth, fixing drift
and Go Doc comments; (c) **structural changes by a reviewed list** — surface a list of
proposed optimizations / test-suite consolidations to the owner, apply only the approved,
behaviour-preserving, test-gated ones. The full suite + the final stress run (R7) are the
regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test
  consolidation targets.

### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18
Re-run the R2 harness against the final, refactored system on a clean contour; analyse
resource consumption across **all** components (gateway, backend, Postgres, the
metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
limits, cache TTLs, container limits, GOMAXPROCS, log levels). Apply the agreed tuning; record
the methodology + results in the repo.

→ **Stage 18** (prod contour) then proceeds per [`PLAN.md`](PLAN.md).

## Sequencing rationale

`R1` first (cheapest now; everything builds on the final schema/naming and the stress test
must run against it). `R2` builds the harness and runs the **early** pass to surface bugs and
a resource baseline that feed `R3` and `R6`. `R3`/`R4`/`R5` harden and improve the system.
`R6` (de-stage + reconcile + structural) runs near the end so it sweeps settled code once and
benefits from all accumulated bug knowledge. `R7` validates the final system and tunes it.
Then Stage 18.

## Regression-safety discipline (cross-cutting)

- Every phase is a `feature/* → development` PR; CI (`unit` + `integration` + `ui` behind the
  `CI / gate` check) must be green before the owner merges; watch the post-merge contour
  deploy with `gitea-ci-watch.py`.
- `R6` structural changes are behaviour-preserving, test-gated, and split from the mechanical
  sweeps; contentious items are owner-approved first.
- The two stress runs (`R2` early, `R7` final) are the system-level regression gate.

## Verification (per phase)

- `go build ./<module>/...`, `go vet`, `gofmt -l .` clean, `go test -count=1 ./<module>/...`;
  UI: `pnpm check && pnpm test:unit && pnpm build`; the integration suite
  (`-tags integration`) for DB/schema changes; `docker compose config` for deploy changes;
  green CI on the PR + a healthy contour deploy.
- `R1`: prove the squashed baseline yields a schema identical to the 12-migration chain
  (integration suite on a fresh DB) **before** deleting the old files.
- `R2`/`R7`: the harness runs end-to-end against the contour; the trip report lists concrete
  defects + a resource profile from the Grafana cAdvisor/postgres_exporter panels.

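
The `R1` schema-identity check above can be approximated by comparing two `pg_dump --schema-only` outputs after stripping comments and blank lines. The sketch below is a coarse illustration only: `normalize` and `diffLines` are hypothetical helpers, the comparison is order-insensitive (so it would not notice reordered statements), and the sample dumps are made-up fragments rather than the project's schema.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// normalize drops blank lines and SQL "--" comment lines and trims
// indentation, so cosmetic dump differences do not show up as diffs.
func normalize(dump string) []string {
	var out []string
	for _, line := range strings.Split(dump, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "--") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// diffLines returns the lines that appear in only one dump: "- " marks a
// line found only in the migration-chain dump, "+ " only in the baseline.
func diffLines(chain, baseline string) []string {
	counts := map[string]int{}
	for _, l := range normalize(chain) {
		counts[l]++
	}
	for _, l := range normalize(baseline) {
		counts[l]--
	}
	var out []string
	for line, n := range counts {
		for ; n > 0; n-- {
			out = append(out, "- "+line)
		}
		for ; n < 0; n++ {
			out = append(out, "+ "+line)
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

func main() {
	// Made-up fragments standing in for the two real dumps.
	chain := "-- from the 12-migration chain\nCREATE TABLE games (\n  variant text CHECK (variant IN ('english'))\n);\n"
	baseline := "CREATE TABLE games (\n  variant text CHECK (variant IN ('scrabble_en'))\n);\n"
	for _, d := range diffLines(chain, baseline) {
		fmt.Println(d)
	}
}
```

An empty result means the two dumps agree line for line; any remaining lines should be exactly the intended differences, such as the renamed variant CHECK values.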
## Refinements logged during implementation

- **R1** (interview + implementation):
  - **Variant labels** `english`/`russian_scrabble`/`erudit` → **`scrabble_en`/`scrabble_ru`/`erudit_ru`**
    across the backend (`engine.Variant.String`/`ParseVariant`; the `games`/`game_invitations` `variant`
    CHECK in the baseline; GCG `#lexicon` and the `variant` metric attribute both flow from `String`),
    the wire (`pkg/fbs` `variant` is a `string` field — values change with **no FlatBuffers regen**) and
    the UI (`model.ts` union, `variants.ts` records, `codec`/`premiums`/mocks/tests, the admin
    `dictionary.gohtml`). **Kept:** the Go enum identifiers (`VariantEnglish`…, internal) and the i18n
    display keys (`new.english`/`new.russian`/`new.erudit`, display-only). `complaints.variant` stays
    free-text (no CHECK, as before).
  - **dawg filenames kept descriptive** (`en_sowpods`/`ru_scrabble`/`ru_erudit`) — only the registry's
    `Variant` key carries the rename, so `registry.go`, the published `scrabble-solver` fixtures and the
    dictionary release artifact are untouched (decouples the three repos).
  - **Migrations squashed** 12 → one hand-written `00001_baseline.sql`. Verified by a
    `pg_dump --schema-only` diff (the chain vs the baseline are **identical** but for the two intended
    variant-CHECK values) plus the green integration suite. **No data migration** (no production data).
  - **Done (cross-repo + contour):** the **`scrabble-dictionary` tidy** merged (PR #2) and was re-cut as
    the **byte-identical `v1.0.1`** release for clean provenance (the backend stays on `v1.0.0` — same
    bytes, no rewire; the backend pulls a version-pinned release artifact, not master). Post-merge the
    contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and
    re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`),
    `games`=0 and a clean boot.