From a2265a122e46f9e9df795722d0aec3a91df335d4 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Wed, 10 Jun 2026 00:47:16 +0200 Subject: [PATCH] R2: early-pass trip report + mark R2 done Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held under the gateway-hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation (backend/gateway/Postgres each ~1 core), amplified by the harness's single shared http2.Transport (the harness itself peaked at 86 % of a core on the same host). Observability finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7 should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md; PRERELEASE refinements logged; R2 marked done. --- PRERELEASE.md | 35 ++++++++- loadtest/REPORT-R2.md | 162 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 196 insertions(+), 1 deletion(-) create mode 100644 loadtest/REPORT-R2.md diff --git a/PRERELEASE.md b/PRERELEASE.md index e723e2f..46cf842 100644 --- a/PRERELEASE.md +++ b/PRERELEASE.md @@ -18,7 +18,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l | # | Phase | Raw TODOs | Status | |---|-------|-----------|--------| | R1 | Schema & naming reset | 1 + 10 | **done** | -| R2 | Stress harness + contour observability + early run | 9a | todo | +| R2 | Stress harness + contour observability + early run | 9a | **done** | | R3 | Edge hardening | 2 + 8 + 3 | todo | | R4 | Push enrichment + kill the last poll | 4 + 5 | todo | | R5 | Bundle slimming | 6 | todo | @@ -220,3 +220,36 @@ Then Stage 18. contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`), `games`=0 and a clean boot. + +- **R2** (interview + implementation): + - **Locked decisions:** game assembly via **invitations** (real path, no robots; not direct game-row + inserts); **moderate** ramp **50 → 200 → 500** at 10 min/step; **diagnostic** pass bar (no SLO gate); + run as a **one-shot container on `scrabble-internal`** in this PR. + - **Harness** = new `scrabble/loadtest` module (`use ./loadtest` + a `replace scrabble/gateway` for the + dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions **directly in + Postgres** (token hash mirrors `backend/internal/session`), drives players over the **edge protocol**, + generates **mid-ranked legal moves locally** with the embedded `scrabble-solver` by replaying + `game.history` (the edge carries no board — mirrors `engine.ReplayBoard` via the public API), and a + **gateway-hammer**. Compact CLI (`run` / `cleanup`), distroless Dockerfile (DAWGs baked), Go unit tests. + - **Adding the module broke the other images' builds** — backend/gateway/telegram Dockerfiles reduce the + workspace but still referenced `./loadtest` (not in their context); each now also + `-dropuse=./loadtest` (backend/telegram additionally `-dropreplace` the gateway replace). Caught by the + first deploy run; verified by building all four images. + - **Harness payload fixes found by the smoke pass:** the draft DTO's `rack_order` is a string (was sent + as `[]` → `bad_request`); the display-name validator forbids digits/colons, so the cleanup marker + became a letters-only `Zzloadtest` so `profile.update` resends the seeded name. `chat_not_your_turn` / + `nudge_own_turn` are **by-design** turn gates, correctly exercised. + - **Observability:** added **cAdvisor + postgres_exporter** + the **Scrabble — Resources** dashboard + + two Prometheus jobs. **Finding:** cAdvisor yields only the root cgroup on the contour host (separate + XFS `/var/lib/docker` breaks its layer-ID resolution — the existing galaxy deploy has the same limit), + so per-container CPU/RSS for the early pass was captured via `docker stats`. **R7:** adopt the otelcol + `docker_stats` receiver (already the contrib image) for per-container metrics in Grafana. + - **Early run (2026-06-09):** ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000 + accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the + hammer (99.97 % rejected, p99 2 ms). **Top finding:** ~14 % `transport_error` on `game.state` at 500 + players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's + single shared `http2.Transport`; the harness itself peaked at 86 % of a core on the same host, so the + figures are pessimistic. Full trip report in [`../loadtest/REPORT-R2.md`](../loadtest/REPORT-R2.md); + it feeds R3 (h2c `MaxConcurrentStreams`/timeouts, body-size cap), R6 and R7 (per-player transports, + separate hardware, pool/limit sizing). + - **CI:** `./loadtest/...` added to the path filter + vet/build/test; `go.work.sum` carries the new deps. diff --git a/loadtest/REPORT-R2.md b/loadtest/REPORT-R2.md new file mode 100644 index 0000000..a119b34 --- /dev/null +++ b/loadtest/REPORT-R2.md @@ -0,0 +1,162 @@ +# R2 — early stress-run trip report + +The early stress pass for `PRERELEASE.md` R2. It exercises the system through the +**edge protocol** with the `scrabble/loadtest` harness, to surface logic/concurrency +bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and +R7 (final tuning). Pass bar: **diagnostic** — the run "passes" by completing without the +harness crashing; findings are recorded below, not gated. + +## Method + +- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the + `scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081` + directly, bypassing the host→gateway hairpin). +- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written + directly to Postgres (token hash matches `backend/internal/session`), so the driver + authenticates without the per-IP-limited auth ops. +- **Games:** assembled through the real **invitation** flow (`invitation.create` → + `invitation.accept`), 2–4 players each, no robots; variants spread over + scrabble_en / scrabble_ru / erudit_ru. +- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls + `game.state`, replays `game.history` and submits a **mid-ranked** legal move generated + locally by the embedded `scrabble-solver` (the edge carries no board), or + passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile / + stats. A separate **gateway-hammer** floods `games.list` from one account. +- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the + agreed moderate profile; harness and contour share this host's CPU). +- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container + CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and + per-service Go runtime metrics. + +## Run configuration + +``` +loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \ + --tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup +``` + +Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2 +exporters. Seeded population removed by `--cleanup` afterwards. + +## Findings + +### Validated (fixed within R2) +- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend + draft DTO's `rack_order` is a string (the harness sent `[]`). Fixed → `ok`. +- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the + editable-display-name validator (`backend/internal/account/profile.go`) forbids digits + and colons, but the seed marker was `lt:…`. Switched the marker to a distinctive + letters-only string → `ok`. Cleanup still matches it. + +### By-design behaviour (correctly exercised, not bugs) +- **`chat_not_your_turn`** — chat is gated to the sender's turn + (`backend/internal/social/chat.go`); off-turn posts are correctly rejected. +- **`nudge_own_turn`** — you nudge the player whose turn it is, so a nudge on your own + turn is correctly rejected. The harness nudges/chats at random ticks, so a share of + these codes is expected. + +### Observability gap (key R7 input) +- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory + registers, but per-container init fails — `failed to identify the read-write layer ID + … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory` — because this + host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's + `/rootfs` bind (the existing galaxy deployment on the same host has the same + limitation). So the **Scrabble — Resources** dashboard's per-container panels are empty + here, and per-container CPU/RSS for this run was captured via `docker stats` instead. + Postgres internals (`postgres_exporter`) and per-service Go runtime metrics + (`go_*` by `service_name`) work. **Recommendation for R7:** adopt the otelcol + **`docker_stats`** receiver (already the contrib image) — it reads per-container stats + via the docker API with no cgroup dependency — and/or run the final pass on hardware + where cAdvisor resolves containers. (Decision to confirm with the owner.) + +### Run results + +The ramp ran clean to 500 players with no harness crash, no deadlock and +`stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games). + +- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569. +- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at + scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved` + events, **2 798 games finished**. +- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the + 200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms). + Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms). +- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account; + **522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are + cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the + flood. The per-user limiter behaves as designed (R3 input: the cost is negligible). + +**Top finding — `transport_error` under saturation.** At 500 players ~14 % of +`game.state` calls (72 429 / 519 067) and a few % of the other ops returned a Connect +`transport_error` (not a domain code). It correlates with the CPU saturation below: the +backend/gateway are pinned near one core each while the host also runs the 86 %-core +harness, so the edge sheds load (resets/timeouts) at the knee. It is **amplified by a +harness artifact** — all 500 virtual players multiplex over a *single* shared +`http2.Transport`, so 500 persistent `Subscribe` streams plus Execute calls press on one +HTTP/2 connection's concurrent-stream limit; real clients each use their own connection. +**Actions:** R7 harness — give each player (or a pool) its own transport, and run on +hardware not shared with the contour; R3 — confirm the gateway's h2c +`MaxConcurrentStreams` and edge timeouts are sized for many persistent streams. + +**Minor findings:** +- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %) — transient + session-resolve failures under load; worth a glance in R3 but not material. +- one `internal` on `game.pass` (1 / 4 788). +- `game_finished` dominates `chat.nudge`/`chat.post` (≈ 3 900 each): the harness keeps + secondary ops on games that already ended. Harness refinement — drop finished games + from the rotation (R7). +- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate + gates, correctly exercised. + +## Resource baseline + +Per-container peak during step 3 (500 players), from `docker stats`: + +| container | peak CPU | memory | +|-----------|---------:|-------:| +| scrabble-backend | **99 %** (~1 core) | 91 MiB | +| scrabble-gateway | **93 %** | 76 MiB | +| scrabble-postgres | **90 %** | 69 MiB | +| scrabble-loadtest (harness) | **86 %** | 42 MiB | +| scrabble-otelcol | 10 % | 110 MiB | +| scrabble-tempo | 9 % | 446 MiB | +| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB | + +- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres + each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at + this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for + R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover. +- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the + step-3 latency and `transport_error` figures are pessimistic — the contour competed + with the generator for CPU. A clean ceiling needs separate hardware (R7). +- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache + hit ratio** (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here. +- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams + + per-request goroutines), telegram 49 — all stable, no leak across the ramp. + +## Recommendations feeding later phases +- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms) — add + the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on + `game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge + read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the + ~0.04 % transient `unauthenticated` resolves under load. +- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload + fixes; the run surfaced no deadlock or goroutine leak across the ramp. +- **R7 (final tuning + stress):** (1) fix the per-container observability gap — adopt the + otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the contour; + (2) refine the harness — per-player/pooled transports and dropping finished games from + the rotation — and run on hardware **not** shared with the contour; (3) size pools / + GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway, + Postgres at 500 players). + +## Re-running + +See [`README.md`](README.md). Briefly, from the repo root: + +```sh +docker build -f loadtest/Dockerfile -t scrabble-loadtest . +docker run --rm --name scrabble-loadtest --network scrabble-internal \ + -e POSTGRES_PASSWORD=… scrabble-loadtest run # add --reset on a re-run +``` + +The harness stays in the repo for the R7 repeat.