R7: trip report + docs/tracker bake-back; mark R7 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 58s
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 58s
- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway cores; the burst run showing connection-bound behavior), and the prod-sizing recommendation for Stage 18. - loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor), the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links. - docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats receiver (cAdvisor removed); links to both trip reports. - CLAUDE.md: repo-layout line reflects docker_stats + per-service limits. - PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording the decisions, findings, applied tuning and validation. This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
This commit is contained in:
+25
-11
@@ -36,17 +36,21 @@ container on the contour's docker network (this bypasses the host→gateway hair
|
||||
# from the repo root
|
||||
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
|
||||
|
||||
docker run --rm --name scrabble-loadtest --network scrabble-internal \
|
||||
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
|
||||
-e POSTGRES_PASSWORD="$TEST_POSTGRES_PASSWORD" \
|
||||
scrabble-loadtest run
|
||||
```
|
||||
|
||||
Defaults assume the contour service names: `postgres:5432` and `gateway:8081`. The
|
||||
DAWGs are baked into the image (`/opt/dawg`, pinned to the dictionary release). Run with
|
||||
Each virtual player gets its own `edge.Client` (its own h2c connection), mirroring real
|
||||
clients rather than multiplexing every player over one transport. Defaults assume the
|
||||
contour service names: `postgres:5432` and `gateway:8081`. The DAWGs are baked into the
|
||||
image (`/opt/dawg`, pinned to the dictionary release). On a host shared with the contour,
|
||||
cap the harness (`--cpus=3`) so the contour keeps the spare cores. Run with
|
||||
`--name scrabble-loadtest` so the harness's own CPU/memory show up as a `scrabble-*`
|
||||
series in cAdvisor (keeping it separable from the system under test). Capture the
|
||||
resource baseline from the Grafana **Scrabble — Resources** dashboard
|
||||
(cAdvisor + postgres_exporter) while the run is in progress.
|
||||
series in the metrics (keeping it separable from the system under test). Capture the
|
||||
resource baseline from the Grafana **Scrabble — Resources** dashboard (the otelcol
|
||||
`docker_stats` receiver + postgres_exporter), or from `docker stats` directly, while the
|
||||
run is in progress.
|
||||
|
||||
## Commands & flags
|
||||
|
||||
@@ -80,15 +84,25 @@ DB wipe (`DROP SCHEMA backend CASCADE` + backend restart).
|
||||
```sh
|
||||
go build ./loadtest/...
|
||||
go vet ./loadtest/...
|
||||
BACKEND_DICT_DIR=../scrabble-solver/dawg go test -count=1 ./loadtest/...
|
||||
BACKEND_DICT_DIR="$PWD/../scrabble-solver/dawg" go test -count=1 ./loadtest/...
|
||||
```
|
||||
|
||||
The DAWG-backed `moves` test runs only when `BACKEND_DICT_DIR` is set (as the engine
|
||||
tests use); the pure logic (hashing, board replay, rack build, move selection, report)
|
||||
runs unconditionally.
|
||||
runs unconditionally. Use an **absolute** path (here via `$PWD`): `go test ./loadtest/...`
|
||||
runs each package from its own directory, so a relative `BACKEND_DICT_DIR` would not
|
||||
resolve.
|
||||
|
||||
## Trip reports
|
||||
|
||||
The two stress passes are written up in the repo: the early pass in
|
||||
[`REPORT-R2.md`](REPORT-R2.md) and the final, tuned pass in
|
||||
[`REPORT-R7.md`](REPORT-R7.md).
|
||||
|
||||
## Caveat
|
||||
|
||||
The harness shares the host CPU with the contour, so the early-pass resource baseline
|
||||
is read with the harness's own container series in mind; a cleaner number on separate
|
||||
hardware is future work. The moderate ramp keeps the generator from being the bottleneck.
|
||||
The harness shares the host CPU with the contour, so its own `scrabble-loadtest`
|
||||
container series is read alongside the system under test; capping it with `--cpus`
|
||||
keeps the contour's quota. Per-player transports (R7) removed the shared-transport
|
||||
artifact that inflated R2's `transport_error`, so the figures reflect the system. A
|
||||
fully isolated ceiling on separate hardware remains future work.
|
||||
|
||||
@@ -0,0 +1,185 @@
|
||||
# R7 — final stress-run trip report
|
||||
|
||||
The final pre-release stress pass for [`PRERELEASE.md`](../PRERELEASE.md) R7. It re-runs
|
||||
the R2 harness (`scrabble/loadtest`) against the **final, refactored system** on a
|
||||
freshly redeployed contour, to confirm the system holds at scale and to settle the
|
||||
resource sizing (container limits, `GOMAXPROCS`, pools, rate limits, log levels) before
|
||||
the Stage 18 prod cutover. Pass bar: **diagnostic + a tuning decision** — the run
|
||||
"passes" by completing cleanly; the per-container resource profile drives the tuning
|
||||
recorded below. Companion to the early pass, [`REPORT-R2.md`](REPORT-R2.md).
|
||||
|
||||
## What changed since the R2 pass
|
||||
|
||||
- **Harness — per-player transports.** Each virtual player now owns its `edge.Client`
|
||||
(its own `http2.Transport` / h2c connection carrying both its `Subscribe` stream and
|
||||
its `Execute` calls), instead of all players multiplexing over one shared transport.
|
||||
R2 traced the ~14 % `transport_error` on `game.state` at 500 players to that single
|
||||
shared connection's stream limit; per-player connections mirror real clients and
|
||||
remove the artifact, so this pass measures the system, not the harness.
|
||||
- **Harness — drop finished games.** `playTurn` reports a finished game and the player
|
||||
drops it from its rotation, so secondary ops stop hitting `game_finished` on ended
|
||||
games (the other R2 harness finding).
|
||||
- **Observability — otelcol `docker_stats`.** cAdvisor (which resolves only the root
|
||||
cgroup on this host — separate-XFS `/var/lib/docker`) is replaced by the otelcol
|
||||
`docker_stats` receiver, reading per-container CPU/memory/network from the Docker API.
|
||||
Per-container panels now populate on the contour host. (`api_version` pinned to 1.44;
|
||||
the daemon's minimum is 1.40.)
|
||||
- **Contour — container limits + `GOMAXPROCS`.** `deploy.resources.limits` now bound
|
||||
every service; the Go services pin `GOMAXPROCS` to their CPU limit so the runtime
|
||||
matches the cgroup quota. Starting values were generous over the R2 peak; this pass
|
||||
validates them and settles the agreed sizing (below).
|
||||
|
||||
## Method
|
||||
|
||||
Unchanged from R2 except for the per-player transports and the dropped-finished-games
|
||||
refinement above:
|
||||
|
||||
- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
|
||||
`scrabble-internal` docker network (reaching `postgres:5432` / `gateway:8081`
|
||||
directly), capped at `--cpus 3` so the contour keeps the host's spare cores.
|
||||
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
|
||||
straight to Postgres (token hash matches `backend/internal/session`).
|
||||
- **Games:** assembled through the real **invitation** flow, 2–4 players each, no
|
||||
robots; variants over scrabble_en / scrabble_ru / erudit_ru.
|
||||
- **Play:** each player holds a live `Subscribe` stream and, per tick, polls
|
||||
`game.state`, replays `game.history` and submits a **mid-ranked** legal move generated
|
||||
locally by the embedded `scrabble-solver`, or passes / exchanges; a fraction exercise
|
||||
nudge / chat / check-word / draft / profile / stats. A separate **gateway-hammer**
|
||||
floods `games.list` from one account.
|
||||
- **Scale:** the same moderate ramp **50 → 200 → 500** concurrent players, 10 min/step.
|
||||
- **Resource capture:** `docker stats` (docker API) sampled every ~20 s for per-container
|
||||
CPU/memory; the otelcol **`docker_stats`** receiver → Prometheus → the Grafana
|
||||
**Scrabble — Resources** dashboard for the same per-container series; `postgres_exporter`
|
||||
internals and per-service Go runtime metrics.
|
||||
|
||||
## Run configuration
|
||||
|
||||
```
|
||||
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
|
||||
-e POSTGRES_PASSWORD=… scrabble-loadtest \
|
||||
run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
|
||||
--tick 800ms --hammer-workers 20 --hammer-dur 15s --reset --cleanup
|
||||
```
|
||||
|
||||
Date: 2026-06-10. Contour: the R1-baseline schema, freshly redeployed with the R7
|
||||
container limits / `GOMAXPROCS` (backend/gateway/postgres capped at 2 cores + 512 MiB,
|
||||
`GOMAXPROCS=2`) and the `docker_stats` observability. Seeded population removed by
|
||||
`--cleanup` afterwards.
|
||||
|
||||
## Findings
|
||||
|
||||
The ramp ran clean to 500 players — no harness crash, no deadlock, `stream errors: 0` —
|
||||
and cleanup removed all 11 000 seeded accounts.
|
||||
|
||||
- **Volume (1827 s):** 821 680 edge calls (449.7 req/s incl. the hammer). Real gameplay
|
||||
at scale: **50 916 committed plays**, 4 817 passes, 2 931 games finished; 165 755
|
||||
`opponent_moved` + 54 864 `your_turn` events.
|
||||
- **The per-player transport fix worked.** `game.state` returned `transport_error` on
|
||||
**3 173 / 127 403 = 2.49 %** of calls — down from R2's ~14 % on the same step. Other
|
||||
ops were lower still (`game.history` 0.43 %, `game.submit_play` 0.28 %). The residual
|
||||
is the gateway bursting into its 2-core cap (see the profile below), not the harness.
|
||||
- **Dropping finished games worked.** `game_finished` on `chat.nudge` / `chat.post` fell
|
||||
to **35 / 36** (R2: ≈ 3 900 each) — secondary ops no longer hammer ended games.
|
||||
- **The limiter holds.** The gateway-hammer sent 565 152 `games.list`; **564 979
|
||||
(99.97 %) were `rate_limited`** (154 ok burst, 19 deadline), p99 = 2 ms, ~309 req/s of
|
||||
rejections sustained — unchanged from R2.
|
||||
- **Latency (peak):** `game.state` p50 ≈ 100 ms, p99 in the 2000 ms bucket (max 2549 ms);
|
||||
`game.submit_play` p50 100 / p99 1000 ms bucket. Lobby ops stayed fast
|
||||
(invitation / games.list p99 ≤ 10 ms). The p99 tail correlates with the gateway
|
||||
burst-throttling, not the backend (which stayed at ~0.85 core).
|
||||
|
||||
## Resource profile
|
||||
|
||||
Per-container peak during step 3 (500 players), with the R7 starting limits in force
|
||||
(backend/gateway/postgres capped at 2 cores / 512 MiB). Two CPU columns: `docker stats`
|
||||
samples a ~1 s window (catches bursts); the otelcol `docker_stats` receiver averages over
|
||||
its 30 s collection interval (smooths them) — they agree within sampling error, which
|
||||
validates the new observability path.
|
||||
|
||||
| container | CPU burst (1 s) | CPU sustained (30 s) | CPU cap | mem peak | mem cap |
|
||||
|-----------|----------------:|---------------------:|--------:|---------:|--------:|
|
||||
| scrabble-gateway | **217 %** (at cap) | ~145 % | 200 % | 167 MiB | 512 MiB |
|
||||
| scrabble-postgres | 138 % | ~153 % | 200 % | 117 MiB | 512 MiB |
|
||||
| scrabble-backend | 85 % | ~89 % | 200 % | 116 MiB | 512 MiB |
|
||||
| scrabble-tempo | 33 % | — | (none) | **1024 MiB** (at cap) | 1024 MiB |
|
||||
| scrabble-otelcol | 11 % | — | (none) | 131 MiB | 512 MiB |
|
||||
| scrabble-loadtest (harness) | 157 % | — | 300 % | 369 MiB | — |
|
||||
|
||||
- **The gateway is the binding constraint.** With one h2c connection per player it draws
|
||||
~1.45 cores sustained and **bursts to its 2-core cap** at 500 players, throttling
|
||||
briefly — the source of the 2.49 % `transport_error`. R2 saw only ~0.93 core because
|
||||
all 500 players shared one connection; the +~0.5 core is the realistic per-connection
|
||||
overhead (500 separate HTTP/2 connections). This is a sizing fact, not a regression.
|
||||
- **backend is over-provisioned** (~0.85 core vs a 2-core cap); **postgres** (~1.4 cores)
|
||||
has headroom; both stayed ≤ 120 MiB.
|
||||
- **tempo reached its 1 GiB memory cap** (R2: 446 MiB) — an OOM risk under sustained
|
||||
tracing.
|
||||
- **Postgres backends peaked at 28**, with the backend pool at its `MaxOpenConns=25` cap.
|
||||
Cache hit stayed ~100 % (no disk reads); CPU, not I/O, is the limit.
|
||||
- **docker log volume (30 min):** backend 14.2 MiB, gateway 4.6 MiB, postgres 0.04 MiB —
|
||||
the backend's per-request latency line at info dominates, and json-file logs had no
|
||||
rotation.
|
||||
|
||||
## Tuning applied
|
||||
|
||||
Agreed from the profile (all in `deploy/docker-compose.yml`; no code change — the pool
|
||||
is already env-driven):
|
||||
|
||||
| knob | from | to | why |
|
||||
|------|------|----|-----|
|
||||
| gateway CPU + `GOMAXPROCS` | 2 cores / 2 | **3 cores / 3** | it bursts into the 2-core cap at 500 players (the 2.49 % `transport_error`); 3 absorbs the bursts |
|
||||
| tempo memory | 1 GiB | **2 GiB** | it reached the 1 GiB cap (OOM risk) |
|
||||
| backend `MAX_OPEN_CONNS` | 25 | **40** | the pool sat at its 25-conn cap at peak; headroom trims the p99 tail |
|
||||
| docker logs | unbounded | **json-file 10m × 3** | bound the ~14 MiB / 30 min backend log; level stays `info` |
|
||||
|
||||
Left as-is: backend / postgres at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores — headroom
|
||||
is cheap on the shared host); the per-user rate limiter and `h2cMaxConcurrentStreams=250`
|
||||
(per-connection now, ~1 stream each — ample) and cache TTLs (no pressure observed).
|
||||
|
||||
### Validation re-run
|
||||
|
||||
Re-running the **same gradual ramp** (50 → 200 → 500) on the tuned contour confirms the
|
||||
fix:
|
||||
|
||||
- **`game.state` `transport_error` fell to 0.72 %** (853 / 119 051), down from 2.49 % at
|
||||
2 cores. The latency tail also improved — p99 in the 1000 ms bucket, max 1220 ms (was
|
||||
the 2000 ms bucket, max 2549 ms).
|
||||
- The **gateway peaked at ~2 cores** (≈196 % on the 30 s gauge) — now comfortably **under
|
||||
the 3-core cap**, so it no longer throttles. backend ~1 core, postgres ~1.3 cores.
|
||||
- **tempo peaked at ~1.27 GiB** — under the new 2 GiB cap (it would have OOM-ed at 1 GiB).
|
||||
- Drop-finished still holds (`game_finished` on chat 41/42); the limiter still rejects
|
||||
99.97 % of the hammer at p99 2 ms; `stream errors: 0`.
|
||||
|
||||
A separate **burst stress** (a single 100 → 500 jump — 400 players connecting at once)
|
||||
**pegged the gateway at 3 cores** (≈296 % sustained) and pushed `game.state`
|
||||
`transport_error` to 9.27 %. The gateway is **connection-CPU-bound and bursty**: average
|
||||
load is ~1 core, but a mass-simultaneous connection storm saturates whatever single-node
|
||||
cap it is given. Real arrivals are gradual (the canonical run), where 3 cores has
|
||||
headroom; the lever for a true arrival spike is **horizontal scaling**, not more cores per
|
||||
node — carried into the prod recommendation below.
|
||||
|
||||
## Prod-sizing recommendation (Stage 18)
|
||||
|
||||
The contour is **CPU-bound and gateway-led** at 500 concurrent players. Carry these to the
|
||||
prod contour env (the same compose, `PROD_*` values):
|
||||
|
||||
- **gateway: ≥ 3 cores** per ~500 concurrent players, `GOMAXPROCS` pinned to the limit —
|
||||
it scales with the **connection count**, not just the request rate; beyond one node's
|
||||
worth, scale the gateway **horizontally** rather than vertically.
|
||||
- **backend: ~1–2 cores**, pool 40 — comfortable; the work is light per request.
|
||||
- **postgres: ~2 cores / ≥ 512 MiB** — ~1.4 cores at 500 players, 100 % cache hit.
|
||||
- **tempo: ≥ 2 GiB**; the Go services run under ~170 MiB (256 MiB would suffice, 512 is
|
||||
safe); pin `GOMAXPROCS` to each CPU limit; keep json-file rotation.
|
||||
- Memory is not the constraint anywhere; CPU is.
|
||||
|
||||
## Re-running
|
||||
|
||||
See [`README.md`](README.md). Briefly, from the repo root:
|
||||
|
||||
```sh
|
||||
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
|
||||
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
|
||||
-e POSTGRES_PASSWORD=… scrabble-loadtest run --reset --cleanup
|
||||
```
|
||||
|
||||
The harness stays in the repo for future repeats.
|
||||
Reference in New Issue
Block a user