R7: final stress run + tuning #38

Merged
developer merged 6 commits from feature/r7-final-stress-tuning into development 2026-06-11 09:35:15 +00:00
Owner

PRERELEASE.md R7 — Final stress run + tuning (staged; in progress).

This PR lands the pre-run changes; the agreed tuning + REPORT-R7.md + the doc/tracker bake-back are pushed to this same branch after the final stress run (R7 is a measure → tune → re-verify phase, so the tuning numbers follow the run).

So far:

  1. Harness — per-player transports (loadtest/): each virtual player owns its h2c connection instead of multiplexing over one shared http2.Transport (the R2 transport_error artifact); finished games are dropped from the rotation.
  2. Observability — replace cAdvisor (root-cgroup-only on this host) with the otelcol docker_stats receiver; Resources dashboard retargeted.
  3. Container limits + GOMAXPROCS on the contour (generous R2-derived starting values; tightened to prod sizing in Round 2).

Still to come on this branch: the final run on the freshly-deployed contour, the agreed tuning values, loadtest/REPORT-R7.md, and marking R7 done in PRERELEASE.md (+ docs).

developer added 2 commits 2026-06-10 16:53:46 +00:00
Each virtual player now builds its own edge.Client (its own h2c connection
carrying both the Subscribe stream and the Execute calls), instead of every
player multiplexing over a single shared http2.Transport. The R2 trip report
traced the ~14% transport_error on game.state at 500 players to that single
shared transport; per-player connections mirror real clients and isolate the
artifact. The assembly burst and the gateway-hammer each get their own client.

playTurn now reports when a game has finished so playerLoop drops it from the
rotation (slices.DeleteFunc); once no active game remains the player idles while
still holding its stream. This stops secondary ops from hammering game_finished
on already-ended games (the other R2 harness finding).
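The rotation pruning described above can be sketched as follows. This is a minimal illustration, not the harness code: `gameID`, the `finished` map, and `pruneFinished` are hypothetical names, and `playTurn` here is a stand-in that only reports whether a game has ended.

```go
package main

import (
	"fmt"
	"slices"
)

// gameID is a hypothetical identifier type; the harness's real type may differ.
type gameID string

// playTurn is a stand-in for the harness function of the same name. Here it
// only reports whether the game has finished, driven by a fixed set.
func playTurn(id gameID, finished map[gameID]bool) (done bool) {
	return finished[id]
}

// pruneFinished plays one turn in every game of the rotation and drops the
// games that reported finished, returning the shortened rotation.
func pruneFinished(rotation []gameID, finished map[gameID]bool) []gameID {
	return slices.DeleteFunc(rotation, func(id gameID) bool {
		return playTurn(id, finished)
	})
}

func main() {
	rotation := []gameID{"g1", "g2", "g3"}
	finished := map[gameID]bool{"g2": true}

	rotation = pruneFinished(rotation, finished)
	fmt.Println(rotation) // g2 is no longer in the rotation

	if len(rotation) == 0 {
		// Assumed idle behavior: stop issuing ops but keep the stream open.
		fmt.Println("no active games: idle, keep the stream open")
	}
}
```

`slices.DeleteFunc` modifies the slice in place and returns the shortened slice, so the result must be assigned back to the rotation.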
R7: contour docker_stats observability + container limits/GOMAXPROCS
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m21s
c16f27475f
Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).
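The unit conversion noted above (utilization divided by 100 gives cores) can be written out as a small helper. This is an illustrative sketch: the function names and the sample values are assumptions, not part of the dashboard or the collector config.

```go
package main

import "fmt"

// coresFromUtilization converts a container.cpu.utilization sample, in
// percent where 100 means one fully busy core, into cores.
func coresFromUtilization(percent float64) float64 {
	return percent / 100
}

// shareOfLimit returns the fraction of a CPU limit (in cores) that a
// utilization sample consumes.
func shareOfLimit(percent, limitCores float64) float64 {
	return coresFromUtilization(percent) / limitCores
}

func main() {
	// Illustrative sample: 85% utilization against a 2-core limit.
	fmt.Printf("%.2f cores, %.1f%% of the limit\n",
		coresFromUtilization(85), 100*shareOfLimit(85, 2))
}
```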

Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
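One way to confirm at startup that the runtime setting matches the container CPU limit is sketched below. The Go runtime reads the `GOMAXPROCS` environment variable on its own; the `CPU_LIMIT_CORES` variable and the `checkGOMAXPROCS` helper are hypothetical and assume a whole-core limit.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// checkGOMAXPROCS compares the effective GOMAXPROCS with a whole-core CPU
// limit given as a string such as "3". It returns an error on mismatch.
func checkGOMAXPROCS(effective int, limit string) error {
	want, err := strconv.Atoi(limit)
	if err != nil {
		return fmt.Errorf("parse CPU limit %q: %w", limit, err)
	}
	if effective != want {
		return fmt.Errorf("GOMAXPROCS=%d but CPU limit is %d cores", effective, want)
	}
	return nil
}

func main() {
	// runtime.GOMAXPROCS(0) reports the current setting without changing it.
	effective := runtime.GOMAXPROCS(0)

	// CPU_LIMIT_CORES is a hypothetical variable carrying the container limit.
	limit := os.Getenv("CPU_LIMIT_CORES")
	if limit == "" {
		fmt.Println("CPU_LIMIT_CORES not set; skipping check")
		return
	}
	if err := checkGOMAXPROCS(effective, limit); err != nil {
		fmt.Println("warning:", err)
		return
	}
	fmt.Println("GOMAXPROCS matches the CPU limit:", effective)
}
```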
developer added 1 commit 2026-06-10 16:58:57 +00:00
R7: pin docker_stats api_version to 1.44
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m3s
8eee018728
The receiver defaults to Docker API 1.25, but the contour daemon's minimum is
1.40 (it speaks up to 1.54), so otelcol crash-looped on start with "client
version 1.25 is too old". Pinning api_version to 1.44 (accepted by both the
receiver's bundled client and the daemon) starts the receiver cleanly —
verified by running the image against the host socket ("Everything is ready",
no start error).
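The version mismatch can be illustrated with a small range check. This sketch models only the daemon side (minimum 1.40, maximum 1.54, as stated above); it is not the Docker client's negotiation code, and all names in it are hypothetical. Versions are compared numerically by part, since a plain string comparison would order "1.9" after "1.40".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// apiVersion is a Docker API version such as 1.44, split into its parts.
type apiVersion struct{ major, minor int }

// parseAPIVersion parses a "MAJOR.MINOR" string.
func parseAPIVersion(s string) (apiVersion, error) {
	majorStr, minorStr, ok := strings.Cut(s, ".")
	if !ok {
		return apiVersion{}, fmt.Errorf("version %q: want MAJOR.MINOR", s)
	}
	major, err := strconv.Atoi(majorStr)
	if err != nil {
		return apiVersion{}, fmt.Errorf("version %q: %w", s, err)
	}
	minor, err := strconv.Atoi(minorStr)
	if err != nil {
		return apiVersion{}, fmt.Errorf("version %q: %w", s, err)
	}
	return apiVersion{major, minor}, nil
}

// less reports whether v sorts before o, comparing major then minor.
func (v apiVersion) less(o apiVersion) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	return v.minor < o.minor
}

// daemonAccepts reports whether client lies within [lowest, highest].
func daemonAccepts(client, lowest, highest string) (bool, error) {
	var parsed [3]apiVersion
	for i, s := range []string{client, lowest, highest} {
		v, err := parseAPIVersion(s)
		if err != nil {
			return false, err
		}
		parsed[i] = v
	}
	c, lo, hi := parsed[0], parsed[1], parsed[2]
	return !c.less(lo) && !hi.less(c), nil
}

func main() {
	for _, client := range []string{"1.25", "1.44"} {
		ok, err := daemonAccepts(client, "1.40", "1.54")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("client %s accepted: %t\n", client, ok)
	}
}
```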
developer added 1 commit 2026-06-11 08:33:59 +00:00
R7: apply the agreed tuning from the final stress run
CI / changes (pull_request) Successful in 2s
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m23s
f23da88028
Round-2 tuning, decided from the 500-player resource profile:
- gateway: 2 -> 3 cores + GOMAXPROCS=3. It holds one h2c connection per player, so
  at 500 players its bursts hit the 2-core cap (~2.49% transport_error on game.state);
  3 cores absorb the bursts. The per-connection cost reflects realistic prod load.
- tempo: memory 1G -> 2G. It reached the 1 GiB cap during the run (OOM risk).
- backend Postgres pool: MAX_OPEN_CONNS 25 -> 40. The pool sat at its 25-conn cap
  (28 backends) at peak; headroom trims the p99 tail. Postgres (2c/512M) handles it.
- docker log volume: a json-file rotation default (10m x 3 = 30 MiB/container) applied
  contour-wide via a YAML anchor; the backend logs ~14 MiB / 30 min at info under load
  and was previously unbounded. Log level stays info.

backend/postgres stay at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores — headroom is cheap
on the shared host). A validation re-run confirms the gateway fix before merge.
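How the pool cap might be wired from the environment is sketched below. The source names only the `MAX_OPEN_CONNS` setting and its values; how the backend actually reads it is not shown, so `maxOpenConns` and `applyPool` are hypothetical helpers. `(*sql.DB).SetMaxOpenConns` is the standard `database/sql` call for capping open connections.

```go
package main

import (
	"database/sql"
	"fmt"
	"os"
	"strconv"
)

// maxOpenConns parses a pool size from raw and falls back to the given
// default when the value is missing, malformed, or not positive.
func maxOpenConns(raw string, fallback int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

// applyPool caps the number of open connections on an existing pool.
// Opening the *sql.DB (driver, DSN) is left out of this sketch.
func applyPool(db *sql.DB, n int) {
	db.SetMaxOpenConns(n)
}

func main() {
	// 40 is the tuned value from this commit; the fallback is an assumption.
	n := maxOpenConns(os.Getenv("MAX_OPEN_CONNS"), 40)
	fmt.Println("pool cap:", n)
}
```

A hard cap also bounds the load on Postgres, so the value should stay below the server's own connection limit.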
developer added 1 commit 2026-06-11 09:18:59 +00:00
R7: trip report + docs/tracker bake-back; mark R7 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 58s
2a48df9b83
- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource
  profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway
  cores; the burst run showing connection-bound behavior), and the prod-sizing
  recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor),
  the absolute BACKEND_DICT_DIR for ./loadtest/..., and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats
  receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording
  the decisions, findings, applied tuning and validation.

This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
developer added 1 commit 2026-06-11 09:32:11 +00:00
R7: add a VPS/VDS sizing table (min/avg/max) to the trip report
CI / changes (pull_request) Successful in 2s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 57s
225188e4b5
A practical single-host ordering guide — CPU cores, RAM, disk at three tiers —
grounded in the R7 profile (~5.5 cores / ~2.5 GiB peak at 500 players) and the
measured on-disk footprint (images ~2.4 GB; Tempo 3.1 GB at 72 h; the game DB
23 MiB and growing). Notes which knobs move disk (Tempo/Prometheus retention,
Postgres growth) and that the gateway scales horizontally past one host.
owner approved these changes 2026-06-11 09:34:24 +00:00
developer merged commit f8b6b7f2e3 into development 2026-06-11 09:35:15 +00:00
developer deleted branch feature/r7-final-stress-tuning 2026-06-11 09:35:15 +00:00
Reference: developer/scrabble-game#38