R7: trip report + docs/tracker bake-back; mark R7 done

- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource
  profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway
  cores; the burst run showing connection-bound behavior), and the prod-sizing
  recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor),
  the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats
  receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording
  the decisions, findings, applied tuning and validation.

This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
Ilia Denisov
2026-06-11 11:18:57 +02:00
parent f23da88028
commit 2a48df9b83
6 changed files with 257 additions and 21 deletions
+35 -2
@@ -23,7 +23,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l
| R4 | Push enrichment + kill the last poll | 4 + 5 | **done** |
| R5 | Bundle slimming | 6 | **done** |
| R6 | Refactor + docs reconciliation + de-staging | 7 | **done** |
| R7 | Final stress run + tuning | 9b | todo |
| R7 | Final stress run + tuning | 9b | **done** |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |
## Key findings (these reshaped the raw list — read before starting a phase)
@@ -168,7 +168,7 @@ regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test
consolidation targets.
### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18
### R7 — Final stress run + tuning *(TODO 9, part 2)* — done
Re-run the R2 harness against the final, refactored system on a clean contour; analyse
resource consumption across **all** components (gateway, backend, Postgres, the
metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
@@ -380,3 +380,36 @@ Then Stage 18.
10 files) into `backend/internal/inttest/helpers.go`; single-file helpers stay local. Pure relocation.
- **No schema change → no contour DB wipe.** Regression gate: the full unit + integration + UI suites plus
the R7 stress run.
- **R7** (interview + implementation):
- **Locked decisions:** run the harness **same-host** (one-shot container on `scrabble-internal`, capped
`--cpus=3` so the contour keeps spare cores); **apply container limits + `GOMAXPROCS` now** (not just a
prod recommendation); **replace cAdvisor with the otelcol `docker_stats` receiver** (cAdvisor resolved only
the root cgroup on this host); keep rate-limit / h2c knobs **compiled-in** (change values only if the data
demands — it did not).
- **Harness refinements (pre-run):** each virtual player builds its **own `edge.Client`** (its own h2c
connection for its Subscribe stream + Execute calls) instead of all players sharing one `http2.Transport`,
which was the source of the R2 `transport_error` artifact; and `playTurn` now reports a **finished** game so the player drops it
from rotation. Effect, measured: `game.state` `transport_error` 14 % (R2) → **2.49 %**; `game_finished` on
chat ≈ 3 900 → **35**.
- **Observability:** added the `docker_stats` receiver to `otelcol` (`api_version: "1.44"` — the daemon's
minimum is 1.40; the receiver defaults to 1.25 and crash-looped until pinned), mounted the docker socket
read-only with `group_add` (the contrib image runs as UID 10001), dropped the cAdvisor service + its
Prometheus job, and retargeted the **Scrabble — Resources** dashboard to the docker_stats metric names
(`container_cpu_utilization`/100 == cores). Cross-checked against `docker stats` within sampling error.
- **Profile (final run, 500 players, limits in force):** the **gateway is the binding constraint** — with
one connection per player it bursts into its 2-core cap (the residual 2.49 % `transport_error`); backend
~0.85 core and postgres ~1.4 cores had headroom; **tempo reached its 1 GiB cap**; the backend pool sat at
its `MaxOpenConns=25` cap (28 backends); docker logs were unbounded (~14 MiB / 30 min on the backend at
info). Full write-up in [`../loadtest/REPORT-R7.md`](../loadtest/REPORT-R7.md).
- **Round-2 tuning (owner-agreed, all in `deploy/docker-compose.yml`, no code change):** gateway **2 → 3
cores + `GOMAXPROCS=3`**; tempo memory **1 → 2 GiB**; backend `MAX_OPEN_CONNS` **25 → 40**; a json-file
**log-rotation** default (10m × 3) applied contour-wide via a YAML anchor (level stays info).
Backend and postgres stay at 2 cores / 512 MiB (headroom is cheap on the shared host).
- **Validation:** the same gradual ramp on the tuned contour cut `game.state` `transport_error` to **0.72 %**
(gateway ~2 cores, now under the 3-core cap, no throttle; tempo ~1.27 GiB, under 2 GiB). A separate
**burst** run (a single 100 → 500 jump) pegged the gateway at 3 cores (≈296 % sustained, 9.27 % error),
confirming it is **connection-CPU-bound**: the answer to a true arrival spike is **horizontal scaling**, not
more cores per node (recorded in the prod-sizing recommendation).
- **No schema change → no contour DB wipe.** Bake-back: `loadtest/REPORT-R7.md` (new), `loadtest/README.md`,
`docs/TESTING.md`, the telemetry/observability section of `docs/ARCHITECTURE.md`, the repo-layout line in `CLAUDE.md`.
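
The per-player connection change in the harness refinements above can be illustrated with a small standalone sketch. It is not the harness code: the real `edge.Client` and its h2c setup are not shown in this diff, so the sketch uses plain `net/http` with HTTP/1.1 keep-alive as a stand-in, and the helper names (`countingServer`, `get`, `connectionsOpened`) are hypothetical. It only demonstrates the underlying idea, that clients sharing one transport share its connection while clients with their own transport each open their own.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countingServer (hypothetical helper) starts a test server that counts
// every new TCP connection it accepts.
func countingServer(conns *atomic.Int64) *httptest.Server {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "ok")
		}))
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			conns.Add(1)
		}
	}
	srv.Start()
	return srv
}

// get issues one request and drains the body so the connection can go
// back to the transport's idle pool for reuse.
func get(c *http.Client, url string) error {
	resp, err := c.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// connectionsOpened simulates `players` virtual players making one request
// each, in sequence, and reports how many connections the server accepted.
// With perPlayer=false all players share one transport (the R2 shape, by
// analogy); with perPlayer=true each player builds its own (the R7 shape).
func connectionsOpened(players int, perPlayer bool) (int64, error) {
	var conns atomic.Int64
	srv := countingServer(&conns)
	defer srv.Close()

	shared := &http.Client{Transport: &http.Transport{}}
	for i := 0; i < players; i++ {
		client := shared
		if perPlayer {
			client = &http.Client{Transport: &http.Transport{}}
		}
		if err := get(client, srv.URL); err != nil {
			return 0, err
		}
	}
	return conns.Load(), nil
}

func main() {
	shared, err := connectionsOpened(5, false)
	if err != nil {
		panic(err)
	}
	perPlayer, err := connectionsOpened(5, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("shared transport: %d connection(s); per-player: %d connection(s)\n",
		shared, perPlayer)
}
```

The trade-off matches the profile recorded above: per-player connections remove the shared-transport artifact but move the cost to the server side, which is consistent with the gateway showing up as connection-CPU-bound in the burst run.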