From 2a48df9b83afeba81ee24c60e0f6f1c0d73f3ea7 Mon Sep 17 00:00:00 2001 From: Ilia Denisov Date: Thu, 11 Jun 2026 11:18:57 +0200 Subject: [PATCH] R7: trip report + docs/tracker bake-back; mark R7 done MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway cores; the burst run showing connection-bound behavior), and the prod-sizing recommendation for Stage 18. - loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor), the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links. - docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats receiver (cAdvisor removed); links to both trip reports. - CLAUDE.md: repo-layout line reflects docker_stats + per-service limits. - PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording the decisions, findings, applied tuning and validation. This is the final pre-release hardening phase; Stage 18 (prod cutover) is next. --- CLAUDE.md | 2 +- PRERELEASE.md | 37 ++++++++- docs/ARCHITECTURE.md | 12 +-- docs/TESTING.md | 6 +- loadtest/README.md | 36 +++++--- loadtest/REPORT-R7.md | 185 ++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 257 insertions(+), 21 deletions(-) create mode 100644 loadtest/REPORT-R7.md diff --git a/CLAUDE.md b/CLAUDE.md index 28a1502..3d7e611 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -128,7 +128,7 @@ gateway/ ui/ pkg/ # added by their stages platform/telegram/ # Telegram connector side-service (Stage 9): bot + gRPC API loadtest/ # module scrabble/loadtest: the pre-release stress harness (R2) backend/Dockerfile gateway/Dockerfile platform/telegram/Dockerfile loadtest/Dockerfile # multi-stage distroless (Stage 16; loadtest R2); gateway/Dockerfile also has the `landing` target (R3) -deploy/ # docker-compose + caddy + landing + otelcol/prometheus/tempo/grafana (+ cAdvisor/postgres_exporter, R2) +deploy/ # docker-compose (per-service limits, R7) + caddy + landing + otelcol (OTLP + docker_stats per-container metrics) + prometheus/tempo/grafana + postgres_exporter ``` ## Build & test diff --git a/PRERELEASE.md b/PRERELEASE.md index 8f3fe12..c05acd4 100644 --- a/PRERELEASE.md +++ b/PRERELEASE.md @@ -23,7 +23,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l | R4 | Push enrichment + kill the last poll | 4 + 5 | **done** | | R5 | Bundle slimming | 6 | **done** | | R6 | Refactor + docs reconciliation + de-staging | 7 | **done** | -| R7 | Final stress run + tuning | 9b | todo | +| R7 | Final stress run + tuning | 9b | **done** | | → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) | ## Key findings (these reshaped the raw list — read before starting a phase) @@ -168,7 +168,7 @@ regression gate. Incorporates the early-run (R2) bug fixes not already shipped. - Open details: the structural-changes list itself (owner-approved before applying); the test consolidation targets. -### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18 +### R7 — Final stress run + tuning *(TODO 9, part 2)* — done Re-run the R2 harness against the final, refactored system on a clean contour; analyse resource consumption across **all** components (gateway, backend, Postgres, the metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate @@ -380,3 +380,36 @@ Then Stage 18. 10 files) into `backend/internal/inttest/helpers.go`; single-file helpers stay local. Pure relocation. - **No schema change → no contour DB wipe.** Regression gate: the full unit + integration + UI suites plus the R7 stress run. + +- **R7** (interview + implementation): + - **Locked decisions:** run the harness **same-host** (one-shot container on `scrabble-internal`, capped + `--cpus=3` so the contour keeps spare cores); **apply container limits + `GOMAXPROCS` now** (not just a + prod recommendation); **replace cAdvisor with the otelcol `docker_stats` receiver** (it resolved only the + root cgroup on this host); keep rate-limit / h2c knobs **compiled-in** (change values only if the data + demands — it did not). + - **Harness refinements (pre-run):** each virtual player builds its **own `edge.Client`** (its own h2c + connection for its Subscribe stream + Execute calls) instead of all players sharing one `http2.Transport` — + the R2 `transport_error` artifact; and `playTurn` now reports a **finished** game so the player drops it + from rotation. Effect, measured: `game.state` `transport_error` 14 % (R2) → **2.49 %**; `game_finished` on + chat ≈ 3 900 → **35**. + - **Observability:** added the `docker_stats` receiver to `otelcol` (`api_version: "1.44"` — the daemon's + minimum is 1.40; the receiver defaults to 1.25 and crash-looped until pinned), mounted the docker socket + read-only with `group_add` (the contrib image runs as UID 10001), dropped the cAdvisor service + its + Prometheus job, and retargeted the **Scrabble — Resources** dashboard to the docker_stats metric names + (`container_cpu_utilization`/100 == cores). Cross-checked against `docker stats` within sampling error. + - **Profile (final run, 500 players, limits in force):** the **gateway is the binding constraint** — with + one connection per player it bursts into its 2-core cap (the residual 2.49 % `transport_error`); backend + ~0.85 core and postgres ~1.4 cores had headroom; **tempo reached its 1 GiB cap**; the backend pool sat at + its `MaxOpenConns=25` cap (28 backends); docker logs were unbounded (~14 MiB / 30 min on the backend at + info). Full write-up in [`../loadtest/REPORT-R7.md`](../loadtest/REPORT-R7.md). + - **Round-2 tuning (owner-agreed, all in `deploy/docker-compose.yml`, no code change):** gateway **2 → 3 + cores + `GOMAXPROCS=3`**; tempo memory **1 → 2 GiB**; backend `MAX_OPEN_CONNS` **25 → 40**; a json-file + **log-rotation** default (10m × 3) applied contour-wide via a YAML anchor (level stays info). + backend/postgres kept at 2 cores / 512 MiB (headroom is cheap on the shared host). + - **Validation:** the same gradual ramp on the tuned contour cut `game.state` `transport_error` to **0.72 %** + (gateway ~2 cores, now under the 3-core cap, no throttle; tempo ~1.27 GiB, under 2 GiB). A separate + **burst** run (a single 100 → 500 jump) pegged the gateway at 3 cores (≈296 % sustained, 9.27 % error), + confirming it is **connection-CPU-bound** — a true arrival spike is a **horizontal-scaling** lever, not + more cores per node (recorded in the prod-sizing recommendation). + - **No schema change → no contour DB wipe.** Bake-back: `loadtest/REPORT-R7.md` (new), `loadtest/README.md`, + `docs/TESTING.md`, the telemetry/observability section of `docs/ARCHITECTURE.md`, the repo-layout line in `CLAUDE.md`. diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 0e4e361..e3a481e 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -561,11 +561,13 @@ promotions) is future work and would deliver short markdown messages (text + lin metrics + Tempo traces), **Prometheus** (15d), **Tempo** (72h) and **Grafana** (provisioned datasources + dashboards, behind the caddy `/_gm/grafana` Basic-Auth) are stood up with the deploy (`deploy/`); the default exporter stays - `none`, so CI needs no collector. The contour also runs **cAdvisor** (per-container - CPU/memory/network) and **postgres_exporter** (connections, cache-hit ratio, - transactions, db size), scraped by Prometheus and surfaced on the **Scrabble — - Resources** Grafana dashboard, which captures a resource - baseline; these export directly in Prometheus format (not through the collector). + `none`, so CI needs no collector. The collector also runs a **`docker_stats`** + receiver (per-container CPU/memory/network read from the Docker API and exported + through its Prometheus endpoint), and the contour runs **postgres_exporter** + (connections, cache-hit ratio, transactions, db size, scraped directly by Prometheus); + both are surfaced on the **Scrabble — Resources** Grafana dashboard, which captures the + stress-run resource profile. (`docker_stats` replaced cAdvisor, which on the contour + host resolved only the root cgroup — a separate-XFS `/var/lib/docker`.) - Per-request server-side timing via gin middleware from day one (the access log carries method, route, status, latency and the active trace id). A client-measured RTT piggybacked on the next request is a later enhancement. diff --git a/docs/TESTING.md b/docs/TESTING.md index 051d42e..d3012b8 100644 --- a/docs/TESTING.md +++ b/docs/TESTING.md @@ -127,8 +127,10 @@ tests or touching CI. selection, the report); the DAWG-backed move test runs under `BACKEND_DICT_DIR` (as the engine tests do). It is **not** part of the per-PR suite's behavioural assertions: it runs ad hoc as a one-shot container against the contour, producing a trip report (bugs - + a resource baseline) read off the **cAdvisor + postgres_exporter** Grafana dashboard - on the contour. See [`../loadtest/README.md`](../loadtest/README.md). + + a per-container resource profile) read off the **otelcol `docker_stats` + + postgres_exporter** Grafana dashboard on the contour. Two passes are recorded — the + early [`REPORT-R2.md`](../loadtest/REPORT-R2.md) and the final, tuned + [`REPORT-R7.md`](../loadtest/REPORT-R7.md). See [`../loadtest/README.md`](../loadtest/README.md). ## Principles diff --git a/loadtest/README.md b/loadtest/README.md index d12845f..5ffb762 100644 --- a/loadtest/README.md +++ b/loadtest/README.md @@ -36,17 +36,21 @@ container on the contour's docker network (this bypasses the host→gateway hair # from the repo root docker build -f loadtest/Dockerfile -t scrabble-loadtest . -docker run --rm --name scrabble-loadtest --network scrabble-internal \ +docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \ -e POSTGRES_PASSWORD="$TEST_POSTGRES_PASSWORD" \ scrabble-loadtest run ``` -Defaults assume the contour service names: `postgres:5432` and `gateway:8081`. The -DAWGs are baked into the image (`/opt/dawg`, pinned to the dictionary release). Run with +Each virtual player gets its own `edge.Client` (its own h2c connection), mirroring real +clients rather than multiplexing every player over one transport. Defaults assume the +contour service names: `postgres:5432` and `gateway:8081`. The DAWGs are baked into the +image (`/opt/dawg`, pinned to the dictionary release). On a host shared with the contour, +cap the harness (`--cpus=3`) so the contour keeps the spare cores. Run with `--name scrabble-loadtest` so the harness's own CPU/memory show up as a `scrabble-*` -series in cAdvisor (keeping it separable from the system under test). Capture the -resource baseline from the Grafana **Scrabble — Resources** dashboard -(cAdvisor + postgres_exporter) while the run is in progress. +series in the metrics (keeping it separable from the system under test). Capture the +resource baseline from the Grafana **Scrabble — Resources** dashboard (the otelcol +`docker_stats` receiver + postgres_exporter), or from `docker stats` directly, while the +run is in progress. ## Commands & flags @@ -80,15 +84,25 @@ DB wipe (`DROP SCHEMA backend CASCADE` + backend restart). ```sh go build ./loadtest/... go vet ./loadtest/... -BACKEND_DICT_DIR=../scrabble-solver/dawg go test -count=1 ./loadtest/... +BACKEND_DICT_DIR="$PWD/../scrabble-solver/dawg" go test -count=1 ./loadtest/... ``` The DAWG-backed `moves` test runs only when `BACKEND_DICT_DIR` is set (as the engine tests use); the pure logic (hashing, board replay, rack build, move selection, report) -runs unconditionally. +runs unconditionally. Use an **absolute** path (here via `$PWD`): `go test ./loadtest/...` +runs each package from its own directory, so a relative `BACKEND_DICT_DIR` would not +resolve. + +## Trip reports + +The two stress passes are written up in the repo: the early pass in +[`REPORT-R2.md`](REPORT-R2.md) and the final, tuned pass in +[`REPORT-R7.md`](REPORT-R7.md). ## Caveat -The harness shares the host CPU with the contour, so the early-pass resource baseline -is read with the harness's own container series in mind; a cleaner number on separate -hardware is future work. The moderate ramp keeps the generator from being the bottleneck. +The harness shares the host CPU with the contour, so its own `scrabble-loadtest` +container series is read alongside the system under test; capping it with `--cpus` +keeps the contour's quota. Per-player transports (R7) removed the shared-transport +artifact that inflated R2's `transport_error`, so the figures reflect the system. A +fully isolated ceiling on separate hardware remains future work. diff --git a/loadtest/REPORT-R7.md b/loadtest/REPORT-R7.md new file mode 100644 index 0000000..1c9baf8 --- /dev/null +++ b/loadtest/REPORT-R7.md @@ -0,0 +1,185 @@ +# R7 — final stress-run trip report + +The final pre-release stress pass for [`PRERELEASE.md`](../PRERELEASE.md) R7. It re-runs +the R2 harness (`scrabble/loadtest`) against the **final, refactored system** on a +freshly redeployed contour, to confirm the system holds at scale and to settle the +resource sizing (container limits, `GOMAXPROCS`, pools, rate limits, log levels) before +the Stage 18 prod cutover. Pass bar: **diagnostic + a tuning decision** — the run +"passes" by completing cleanly; the per-container resource profile drives the tuning +recorded below. Companion to the early pass, [`REPORT-R2.md`](REPORT-R2.md). + +## What changed since the R2 pass + +- **Harness — per-player transports.** Each virtual player now owns its `edge.Client` + (its own `http2.Transport` / h2c connection carrying both its `Subscribe` stream and + its `Execute` calls), instead of all players multiplexing over one shared transport. + R2 traced the ~14 % `transport_error` on `game.state` at 500 players to that single + shared connection's stream limit; per-player connections mirror real clients and + remove the artifact, so this pass measures the system, not the harness. +- **Harness — drop finished games.** `playTurn` reports a finished game and the player + drops it from its rotation, so secondary ops stop hitting `game_finished` on ended + games (the other R2 harness finding). +- **Observability — otelcol `docker_stats`.** cAdvisor (which resolves only the root + cgroup on this host — separate-XFS `/var/lib/docker`) is replaced by the otelcol + `docker_stats` receiver, reading per-container CPU/memory/network from the Docker API. + Per-container panels now populate on the contour host. (`api_version` pinned to 1.44; + the daemon's minimum is 1.40.) +- **Contour — container limits + `GOMAXPROCS`.** `deploy.resources.limits` now bound + every service; the Go services pin `GOMAXPROCS` to their CPU limit so the runtime + matches the cgroup quota. Starting values were generous over the R2 peak; this pass + validates them and settles the agreed sizing (below). + +## Method + +Unchanged from R2 except for the per-player transports and the dropped-finished-games +refinement above: + +- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the + `scrabble-internal` docker network (reaching `postgres:5432` / `gateway:8081` + directly), capped at `--cpus 3` so the contour keeps the host's spare cores. +- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written + straight to Postgres (token hash matches `backend/internal/session`). +- **Games:** assembled through the real **invitation** flow, 2–4 players each, no + robots; variants over scrabble_en / scrabble_ru / erudit_ru. +- **Play:** each player holds a live `Subscribe` stream and, per tick, polls + `game.state`, replays `game.history` and submits a **mid-ranked** legal move generated + locally by the embedded `scrabble-solver`, or passes / exchanges; a fraction exercise + nudge / chat / check-word / draft / profile / stats. A separate **gateway-hammer** + floods `games.list` from one account. +- **Scale:** the same moderate ramp **50 → 200 → 500** concurrent players, 10 min/step. +- **Resource capture:** `docker stats` (docker API) sampled every ~20 s for per-container + CPU/memory; the otelcol **`docker_stats`** receiver → Prometheus → the Grafana + **Scrabble — Resources** dashboard for the same per-container series; `postgres_exporter` + internals and per-service Go runtime metrics. + +## Run configuration + +``` +docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \ + -e POSTGRES_PASSWORD=… scrabble-loadtest \ + run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \ + --tick 800ms --hammer-workers 20 --hammer-dur 15s --reset --cleanup +``` + +Date: 2026-06-10. Contour: the R1-baseline schema, freshly redeployed with the R7 +container limits / `GOMAXPROCS` (backend/gateway/postgres capped at 2 cores + 512 MiB, +`GOMAXPROCS=2`) and the `docker_stats` observability. Seeded population removed by +`--cleanup` afterwards. + +## Findings + +The ramp ran clean to 500 players — no harness crash, no deadlock, `stream errors: 0` — +and cleanup removed all 11 000 seeded accounts. + +- **Volume (1827 s):** 821 680 edge calls (449.7 req/s incl. the hammer). Real gameplay + at scale: **50 916 committed plays**, 4 817 passes, 2 931 games finished; 165 755 + `opponent_moved` + 54 864 `your_turn` events. +- **The per-player transport fix worked.** `game.state` returned `transport_error` on + **3 173 / 127 403 = 2.49 %** of calls — down from R2's ~14 % on the same step. Other + ops were lower still (`game.history` 0.43 %, `game.submit_play` 0.28 %). The residual + is the gateway bursting into its 2-core cap (see the profile below), not the harness. +- **Dropping finished games worked.** `game_finished` on `chat.nudge` / `chat.post` fell + to **35 / 36** (R2: ≈ 3 900 each) — secondary ops no longer hammer ended games. +- **The limiter holds.** The gateway-hammer sent 565 152 `games.list`; **564 979 + (99.97 %) were `rate_limited`** (154 ok burst, 19 deadline), p99 = 2 ms, ~309 req/s of + rejections sustained — unchanged from R2. +- **Latency (peak):** `game.state` p50 ≈ 100 ms, p99 in the 2000 ms bucket (max 2549 ms); + `game.submit_play` p50 100 / p99 1000 ms bucket. Lobby ops stayed fast + (invitation / games.list p99 ≤ 10 ms). The p99 tail correlates with the gateway + burst-throttling, not the backend (which stayed at ~0.85 core). + +## Resource profile + +Per-container peak during step 3 (500 players), with the R7 starting limits in force +(backend/gateway/postgres capped at 2 cores / 512 MiB). Two CPU columns: `docker stats` +samples a ~1 s window (catches bursts); the otelcol `docker_stats` receiver averages over +its 30 s collection interval (smooths them) — they agree within sampling error, which +validates the new observability path. + +| container | CPU burst (1 s) | CPU sustained (30 s) | CPU cap | mem peak | mem cap | +|-----------|----------------:|---------------------:|--------:|---------:|--------:| +| scrabble-gateway | **217 %** (at cap) | ~145 % | 200 % | 167 MiB | 512 MiB | +| scrabble-postgres | 138 % | ~153 % | 200 % | 117 MiB | 512 MiB | +| scrabble-backend | 85 % | ~89 % | 200 % | 116 MiB | 512 MiB | +| scrabble-tempo | 33 % | — | (none) | **1024 MiB** (at cap) | 1024 MiB | +| scrabble-otelcol | 11 % | — | (none) | 131 MiB | 512 MiB | +| scrabble-loadtest (harness) | 157 % | — | 300 % | 369 MiB | — | + +- **The gateway is the binding constraint.** With one h2c connection per player it draws + ~1.45 cores sustained and **bursts to its 2-core cap** at 500 players, throttling + briefly — the source of the 2.49 % `transport_error`. R2 saw only ~0.93 core because + all 500 players shared one connection; the +~0.5 core is the realistic per-connection + overhead (500 separate HTTP/2 connections). This is a sizing fact, not a regression. +- **backend is over-provisioned** (~0.85 core vs a 2-core cap); **postgres** (~1.4 cores) + has headroom; both stayed ≤ 120 MiB. +- **tempo reached its 1 GiB memory cap** (R2: 446 MiB) — an OOM risk under sustained + tracing. +- **Postgres backends peaked at 28**, with the backend pool at its `MaxOpenConns=25` cap. + Cache hit stayed ~100 % (no disk reads); CPU, not I/O, is the limit. +- **docker log volume (30 min):** backend 14.2 MiB, gateway 4.6 MiB, postgres 0.04 MiB — + the backend's per-request latency line at info dominates, and json-file logs had no + rotation. + +## Tuning applied + +Agreed from the profile (all in `deploy/docker-compose.yml`; no code change — the pool +is already env-driven): + +| knob | from | to | why | +|------|------|----|-----| +| gateway CPU + `GOMAXPROCS` | 2 cores / 2 | **3 cores / 3** | it bursts into the 2-core cap at 500 players (the 2.49 % `transport_error`); 3 absorbs the bursts | +| tempo memory | 1 GiB | **2 GiB** | it reached the 1 GiB cap (OOM risk) | +| backend `MAX_OPEN_CONNS` | 25 | **40** | the pool sat at its 25-conn cap at peak; headroom trims the p99 tail | +| docker logs | unbounded | **json-file 10m × 3** | bound the ~14 MiB / 30 min backend log; level stays `info` | + +Left as-is: backend / postgres at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores — headroom +is cheap on the shared host); the per-user rate limiter and `h2cMaxConcurrentStreams=250` +(per-connection now, ~1 stream each — ample) and cache TTLs (no pressure observed). + +### Validation re-run + +Re-running the **same gradual ramp** (50 → 200 → 500) on the tuned contour confirms the +fix: + +- **`game.state` `transport_error` fell to 0.72 %** (853 / 119 051), down from 2.49 % at + 2 cores. The latency tail also improved — p99 in the 1000 ms bucket, max 1220 ms (was + the 2000 ms bucket, max 2549 ms). +- The **gateway peaked at ~2 cores** (≈196 % on the 30 s gauge) — now comfortably **under + the 3-core cap**, so it no longer throttles. backend ~1 core, postgres ~1.3 cores. +- **tempo peaked at ~1.27 GiB** — under the new 2 GiB cap (it would have OOM-ed at 1 GiB). +- Drop-finished still holds (`game_finished` on chat 41/42); the limiter still rejects + 99.97 % of the hammer at p99 2 ms; `stream errors: 0`. + +A separate **burst stress** (a single 100 → 500 jump — 400 players connecting at once) +**pegged the gateway at 3 cores** (≈296 % sustained) and pushed `game.state` +`transport_error` to 9.27 %. The gateway is **connection-CPU-bound and bursty**: average +load is ~1 core, but a mass-simultaneous connection storm saturates whatever single-node +cap it is given. Real arrivals are gradual (the canonical run), where 3 cores has +headroom; the lever for a true arrival spike is **horizontal scaling**, not more cores per +node — carried into the prod recommendation below. + +## Prod-sizing recommendation (Stage 18) + +The contour is **CPU-bound and gateway-led** at 500 concurrent players. Carry these to the +prod contour env (the same compose, `PROD_*` values): + +- **gateway: ≥ 3 cores** per ~500 concurrent players, `GOMAXPROCS` pinned to the limit — + it scales with the **connection count**, not just the request rate; beyond one node's + worth, scale the gateway **horizontally** rather than vertically. +- **backend: ~1–2 cores**, pool 40 — comfortable; the work is light per request. +- **postgres: ~2 cores / ≥ 512 MiB** — ~1.4 cores at 500 players, 100 % cache hit. +- **tempo: ≥ 2 GiB**; the Go services run under ~170 MiB (256 MiB would suffice, 512 is + safe); pin `GOMAXPROCS` to each CPU limit; keep json-file rotation. +- Memory is not the constraint anywhere; CPU is. + +## Re-running + +See [`README.md`](README.md). Briefly, from the repo root: + +```sh +docker build -f loadtest/Dockerfile -t scrabble-loadtest . +docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \ + -e POSTGRES_PASSWORD=… scrabble-loadtest run --reset --cleanup +``` + +The harness stays in the repo for future repeats.