# R2 — early stress-run trip report

The early stress pass for `PRERELEASE.md` R2. It exercises the system through the **edge protocol** with the `scrabble/loadtest` harness, to surface logic/concurrency bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and R7 (final tuning). Pass bar: **diagnostic** — the run "passes" by completing without the harness crashing; findings are recorded below, not gated.

## Method

- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the `scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081` directly, bypassing the host→gateway hairpin).
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written directly to Postgres (token hash matches `backend/internal/session`), so the driver authenticates without the per-IP-limited auth ops.
- **Games:** assembled through the real **invitation** flow (`invitation.create` → `invitation.accept`), 2–4 players each, no robots; variants spread over scrabble_en / scrabble_ru / erudit_ru.
- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls `game.state`, replays `game.history` and submits a **mid-ranked** legal move generated locally by the embedded `scrabble-solver` (the edge carries no board), or passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile / stats. A separate **gateway-hammer** floods `games.list` from one account.
- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the agreed moderate profile; harness and contour share this host's CPU).
- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and per-service Go runtime metrics.
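
The **Play** step's "mid-ranked" choice can be pictured with a small sketch. It assumes that mid-ranked means the median candidate by score; the report does not define the ranking, and the `candidate` type, the `midRanked` helper and the sample words and scores are hypothetical stand-ins, not the `scrabble-solver` API.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for one legal move from the solver.
type candidate struct {
	Word  string
	Score int
}

// midRanked returns the median-scoring candidate. It reports false when
// there are no legal moves, in which case the virtual player would pass
// or exchange instead.
func midRanked(moves []candidate) (candidate, bool) {
	if len(moves) == 0 {
		return candidate{}, false
	}
	// Sort a copy, best score first, so the caller's slice is untouched.
	ranked := append([]candidate(nil), moves...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	return ranked[len(ranked)/2], true
}

func main() {
	// Illustrative values only.
	moves := []candidate{
		{"QUARTZ", 48}, {"RAT", 3}, {"TZAR", 13}, {"ART", 3}, {"QUART", 24},
	}
	if m, ok := midRanked(moves); ok {
		fmt.Println(m.Word, m.Score) // TZAR 13
	}
}
```

Picking from the middle of the ranking, rather than the best move, keeps games from ending unrealistically fast while still submitting only legal plays.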

## Run configuration

```
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
  --tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
```

Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2 exporters. Seeded population removed by `--cleanup` afterwards.

## Findings

### Validated (fixed within R2)

- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend draft DTO's `rack_order` is a string (the harness sent `[]`). Fixed → `ok`.
- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the editable-display-name validator (`backend/internal/account/profile.go`) forbids digits and colons, but the seed marker was `lt:…`. Switched the marker to a distinctive letters-only string → `ok`. Cleanup still matches it.

### By-design behaviour (correctly exercised, not bugs)

- **`chat_not_your_turn`** — chat is gated to the sender's turn (`backend/internal/social/chat.go`); off-turn posts are correctly rejected.
- **`nudge_own_turn`** — you nudge the player whose turn it is, so a nudge on your own turn is correctly rejected.

The harness nudges/chats at random ticks, so a share of these codes is expected.

### Observability gap (key R7 input)

- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory registers, but per-container init fails — `failed to identify the read-write layer ID … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory` — because this host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's `/rootfs` bind (the existing galaxy deployment on the same host has the same limitation). So the **Scrabble — Resources** dashboard's per-container panels are empty here, and per-container CPU/RSS for this run was captured via `docker stats` instead. Postgres internals (`postgres_exporter`) and per-service Go runtime metrics (`go_*` by `service_name`) work.

**Recommendation for R7:** adopt the otelcol **`docker_stats`** receiver (already the contrib image) — it reads per-container stats via the docker API with no cgroup dependency — and/or run the final pass on hardware where cAdvisor resolves containers. (Decision to confirm with the owner.)

### Run results

The ramp ran clean to 500 players with no harness crash, no deadlock and `stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games).

- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved` events, **2 798 games finished**.
- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the 200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms). Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account; **522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).

**Top finding — `transport_error` under saturation.** At 500 players ~14 % of `game.state` calls (72 429 / 519 067) and a few % of the other ops returned a Connect `transport_error` (not a domain code). It correlates with the CPU saturation below: the backend/gateway are pinned near one core each while the host also runs the 86 %-core harness, so the edge sheds load (resets/timeouts) at the knee. It is **amplified by a harness artifact** — all 500 virtual players multiplex over a *single* shared `http2.Transport`, so 500 persistent `Subscribe` streams plus Execute calls press on one HTTP/2 connection's concurrent-stream limit; real clients each use their own connection.
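
One way to remove the shared-transport artifact is to spread players over a pool of independent transports. The sketch below is a minimal illustration under stated assumptions, not the harness's code: `transportPool`, `newTransportPool`, `clientFor`, `streamsPerTransport` and `plainTransport` are hypothetical names, the pool size of 50 is arbitrary, and the factory returns a plain `http.Transport` as a stand-in for whatever HTTP/2 transport the driver actually builds.

```go
package main

import (
	"fmt"
	"net/http"
)

// transportPool spreads virtual players over a fixed number of independent
// transports, so no single connection carries every player's streams.
type transportPool struct {
	transports []http.RoundTripper
}

// newTransportPool builds size transports with the supplied factory.
// The factory is the assumption here: in the harness it would return
// the HTTP/2 transport type the driver already uses.
func newTransportPool(size int, factory func() http.RoundTripper) *transportPool {
	if size < 1 {
		size = 1
	}
	p := &transportPool{transports: make([]http.RoundTripper, size)}
	for i := range p.transports {
		p.transports[i] = factory()
	}
	return p
}

// clientFor returns a client bound to the transport that owns this player's
// slot. Players map to slots round-robin; player must be non-negative.
func (p *transportPool) clientFor(player int) *http.Client {
	return &http.Client{Transport: p.transports[player%len(p.transports)]}
}

// streamsPerTransport reports the worst-case number of players, and hence
// persistent Subscribe streams, that share one transport.
func (p *transportPool) streamsPerTransport(players int) int {
	n := len(p.transports)
	return (players + n - 1) / n
}

// plainTransport is a stand-in factory, not the harness's real transport.
func plainTransport() http.RoundTripper { return &http.Transport{} }

func main() {
	single := newTransportPool(1, plainTransport)
	pooled := newTransportPool(50, plainTransport)
	fmt.Println(single.streamsPerTransport(500)) // 500
	fmt.Println(pooled.streamsPerTransport(500)) // 10
}
```

A pool size equal to the player count gives each player its own transport, which is closest to real clients; a smaller pool trades that realism for fewer sockets on the load generator.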

**Actions:** R7 harness — give each player (or a pool) its own transport, and run on hardware not shared with the contour; R3 — confirm the gateway's h2c `MaxConcurrentStreams` and edge timeouts are sized for many persistent streams.

**Minor findings:**

- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %) — transient session-resolve failures under load; worth a glance in R3 but not material.
- one `internal` on `game.pass` (1 / 4 788).
- `game_finished` dominates `chat.nudge`/`chat.post` (≈ 3 900 each): the harness keeps secondary ops on games that already ended. Harness refinement — drop finished games from the rotation (R7).
- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate gates, correctly exercised.

## Resource baseline

Per-container peak during step 3 (500 players), from `docker stats`:

| container | peak CPU | memory |
|-----------|---------:|-------:|
| scrabble-backend | **99 %** (~1 core) | 91 MiB |
| scrabble-gateway | **93 %** | 76 MiB |
| scrabble-postgres | **90 %** | 69 MiB |
| scrabble-loadtest (harness) | **86 %** | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |

- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the step-3 latency and `transport_error` figures are pessimistic — the contour competed with the generator for CPU. A clean ceiling needs separate hardware (R7).
- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache hit ratio** (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here.
- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams + per-request goroutines), telegram 49 — all stable, no leak across the ramp.

## Recommendations feeding later phases

- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms) — add the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on `game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the ~0.04 % transient `unauthenticated` resolves under load.
- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload fixes; the run surfaced no deadlock or goroutine leak across the ramp.
- **R7 (final tuning + stress):** (1) fix the per-container observability gap — adopt the otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the contour; (2) refine the harness — per-player/pooled transports and dropping finished games from the rotation — and run on hardware **not** shared with the contour; (3) size pools / GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway, Postgres at 500 players).

## Re-running

See [`README.md`](README.md). Briefly, from the repo root:

```sh
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest run   # add --reset on a re-run
```

The harness stays in the repo for the R7 repeat.