R7: trip report + docs/tracker bake-back; mark R7 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 58s
- loadtest/REPORT-R7.md: the final stress-run report, covering method, the 500-player resource profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway cores; the burst run showing connection-bound behavior), and the prod-sizing recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor), the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording the decisions, findings, applied tuning and validation.

This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
+35
-2
@@ -23,7 +23,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l
 | R4 | Push enrichment + kill the last poll | 4 + 5 | **done** |
 | R5 | Bundle slimming | 6 | **done** |
 | R6 | Refactor + docs reconciliation + de-staging | 7 | **done** |
-| R7 | Final stress run + tuning | 9b | todo |
+| R7 | Final stress run + tuning | 9b | **done** |
 | → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |

 ## Key findings (these reshaped the raw list — read before starting a phase)
@@ -168,7 +168,7 @@ regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
 - Open details: the structural-changes list itself (owner-approved before applying); the test
 consolidation targets.

-### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18
+### R7 — Final stress run + tuning *(TODO 9, part 2)* — done
 Re-run the R2 harness against the final, refactored system on a clean contour; analyse
 resource consumption across **all** components (gateway, backend, Postgres, the
 metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
@@ -380,3 +380,36 @@ Then Stage 18.
 10 files) into `backend/internal/inttest/helpers.go`; single-file helpers stay local. Pure relocation.
 - **No schema change → no contour DB wipe.** Regression gate: the full unit + integration + UI suites plus
 the R7 stress run.
+
+- **R7** (interview + implementation):
+- **Locked decisions:** run the harness **same-host** (one-shot container on `scrabble-internal`, capped
+`--cpus=3` so the contour keeps spare cores); **apply container limits + `GOMAXPROCS` now** (not just a
+prod recommendation); **replace cAdvisor with the otelcol `docker_stats` receiver** (it resolved only the
+root cgroup on this host); keep rate-limit / h2c knobs **compiled-in** (change values only if the data
+demands — it did not).
+- **Harness refinements (pre-run):** each virtual player builds its **own `edge.Client`** (its own h2c
+connection for its Subscribe stream + Execute calls) instead of all players sharing one `http2.Transport` —
+the R2 `transport_error` artifact; and `playTurn` now reports a **finished** game so the player drops it
+from rotation. Effect, measured: `game.state` `transport_error` 14 % (R2) → **2.49 %**; `game_finished` on
+chat ≈ 3 900 → **35**.
+- **Observability:** added the `docker_stats` receiver to `otelcol` (`api_version: "1.44"` — the daemon's
+minimum is 1.40; the receiver defaults to 1.25 and crash-looped until pinned), mounted the docker socket
+read-only with `group_add` (the contrib image runs as UID 10001), dropped the cAdvisor service + its
+Prometheus job, and retargeted the **Scrabble — Resources** dashboard to the docker_stats metric names
+(`container_cpu_utilization`/100 == cores). Cross-checked against `docker stats` within sampling error.
+- **Profile (final run, 500 players, limits in force):** the **gateway is the binding constraint** — with
+one connection per player it bursts into its 2-core cap (the residual 2.49 % `transport_error`); backend
+~0.85 core and postgres ~1.4 cores had headroom; **tempo reached its 1 GiB cap**; the backend pool sat at
+its `MaxOpenConns=25` cap (28 backends); docker logs were unbounded (~14 MiB / 30 min on the backend at
+info). Full write-up in [`../loadtest/REPORT-R7.md`](../loadtest/REPORT-R7.md).
+- **Round-2 tuning (owner-agreed, all in `deploy/docker-compose.yml`, no code change):** gateway **2 → 3
+cores + `GOMAXPROCS=3`**; tempo memory **1 → 2 GiB**; backend `MAX_OPEN_CONNS` **25 → 40**; a json-file
+**log-rotation** default (10m × 3) applied contour-wide via a YAML anchor (level stays info).
+backend/postgres kept at 2 cores / 512 MiB (headroom is cheap on the shared host).
+- **Validation:** the same gradual ramp on the tuned contour cut `game.state` `transport_error` to **0.72 %**
+(gateway ~2 cores, now under the 3-core cap, no throttle; tempo ~1.27 GiB, under 2 GiB). A separate
+**burst** run (a single 100 → 500 jump) pegged the gateway at 3 cores (≈296 % sustained, 9.27 % error),
+confirming it is **connection-CPU-bound** — a true arrival spike is a **horizontal-scaling** lever, not
+more cores per node (recorded in the prod-sizing recommendation).
+- **No schema change → no contour DB wipe.** Bake-back: `loadtest/REPORT-R7.md` (new), `loadtest/README.md`,
+`docs/TESTING.md`, the telemetry/observability section of `docs/ARCHITECTURE.md`, the repo-layout line in `CLAUDE.md`.
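
Reviewer note: the R7 entry reads gateway load two ways, as a docker_stats percentage (≈296 %) and as cores against a cap (3 cores). The Go sketch below only illustrates that conversion, using the stated rule `container_cpu_utilization`/100 == cores. The function names and the 95 % "saturated" threshold are assumptions made for this example; they are not taken from the harness or the dashboard.

```go
package main

import "fmt"

// coresFromUtilization converts a docker_stats-style CPU utilization
// percentage into cores (container_cpu_utilization / 100 == cores).
func coresFromUtilization(percent float64) float64 {
	return percent / 100
}

// capFraction reports which fraction of a container's CPU cap is in use.
func capFraction(percent, capCores float64) float64 {
	return coresFromUtilization(percent) / capCores
}

// saturated applies a hypothetical rule of thumb: a container counts as
// pegged when it uses at least 95 % of its cap. The threshold is an
// assumption for illustration only.
func saturated(percent, capCores float64) bool {
	return capFraction(percent, capCores) >= 0.95
}

func report(label string, percent, capCores float64) {
	fmt.Printf("%s: %.2f cores, %.0f%% of cap, saturated=%v\n",
		label,
		coresFromUtilization(percent),
		capFraction(percent, capCores)*100,
		saturated(percent, capCores))
}

func main() {
	// Burst run: about 296 % sustained against the 3-core gateway cap.
	report("burst", 296, 3)
	// Gradual ramp on the tuned contour: about 2 cores (200 %) under the same cap.
	report("ramp", 200, 3)
}
```

With the figures quoted in the entry, the burst run sits at roughly 99 % of the gateway cap and the gradual ramp at roughly 67 %, which matches the entry's reading that only the burst pegs the gateway.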