R7: final stress run + tuning #38

Open
developer wants to merge 3 commits from feature/r7-final-stress-tuning into development

PRERELEASE.md R7 — Final stress run + tuning (staged; in progress).

This PR lands the pre-run changes; the agreed tuning + REPORT-R7.md + the doc/tracker bake-back are pushed to this same branch after the final stress run (R7 is a measure → tune → re-verify phase, so the tuning numbers follow the run).

So far:

  1. Harness — per-player transports (loadtest/): each virtual player owns its h2c connection instead of multiplexing over one shared http2.Transport (the R2 transport_error artifact); finished games are dropped from the rotation.
  2. Observability — replace cAdvisor (root-cgroup-only on this host) with the otelcol docker_stats receiver; Resources dashboard retargeted.
  3. Container limits + GOMAXPROCS on the contour (generous R2-derived starting values, to be tightened to prod sizing in R7 Round 2, after the final stress run).

Still to come on this branch: the final run on the freshly-deployed contour, the agreed tuning values, loadtest/REPORT-R7.md, and marking R7 done in PRERELEASE.md (+ docs).

developer added 2 commits 2026-06-10 16:53:46 +00:00
Each virtual player now builds its own edge.Client (its own h2c connection
carrying both the Subscribe stream and the Execute calls), instead of every
player multiplexing over a single shared http2.Transport. The R2 trip report
traced the ~14% transport_error on game.state at 500 players to that single
shared transport; per-player connections mirror real clients and isolate the
artifact. The assembly burst and the gateway-hammer each get their own client.
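The ownership change described above can be sketched as follows. This is a minimal illustration, not the harness code: the `player` type and the `newPlayer`/`newPlayers` helpers are hypothetical names, and a plain `net/http` transport stands in for the h2c-configured one that `edge.Client` presumably wraps (its constructor is not shown in this PR).

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// player is a hypothetical stand-in for the harness's virtual player.
type player struct {
	id     int
	client *http.Client // owned by this player only, never shared
}

// newPlayer builds a client around its own transport, so this player's
// connections are never pooled or multiplexed with another player's.
func newPlayer(id int) *player {
	tr := &http.Transport{IdleConnTimeout: 90 * time.Second}
	return &player{id: id, client: &http.Client{Transport: tr}}
}

// newPlayers creates n players, each with an independent client.
func newPlayers(n int) []*player {
	ps := make([]*player, 0, n)
	for i := 0; i < n; i++ {
		ps = append(ps, newPlayer(i))
	}
	return ps
}

func main() {
	ps := newPlayers(3)
	// Each player holds a distinct transport instance.
	fmt.Println(ps[0].client.Transport != ps[1].client.Transport)
}
```

With a shared transport, a fault on the single underlying connection would surface on every player at once; with one transport per player, it stays confined to that player.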

playTurn now reports when a game has finished so playerLoop drops it from the
rotation (slices.DeleteFunc); once no active game remains the player idles while
still holding its stream. This stops secondary ops from hammering game_finished
on already-ended games (the other R2 harness finding).
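A minimal sketch of the rotation pruning, assuming games are tracked by string ID and that finished games are recorded in a set. The `dropFinished` helper and both data shapes are hypothetical; only `slices.DeleteFunc` is taken from the commit message.

```go
package main

import (
	"fmt"
	"slices"
)

// dropFinished removes every game ID marked as finished from the rotation.
// slices.DeleteFunc compacts in place and returns the shortened slice.
func dropFinished(rotation []string, finished map[string]bool) []string {
	return slices.DeleteFunc(rotation, func(id string) bool {
		return finished[id]
	})
}

func main() {
	rotation := []string{"g1", "g2", "g3"}
	finished := map[string]bool{"g2": true}

	rotation = dropFinished(rotation, finished)

	// An empty rotation is the signal for the player to idle on its stream.
	fmt.Println(rotation, len(rotation) == 0) // prints: [g1 g3] false
}
```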
R7: contour docker_stats observability + container limits/GOMAXPROCS
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m21s
c16f27475f
Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).
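The unit conversion used by the retargeted panels can be written out as a small worked example. The `coresFromUtilization` helper is a hypothetical name, and the sample values are illustrative, not measurements from the contour.

```go
package main

import "fmt"

// coresFromUtilization converts a container.cpu.utilization sample, a
// percentage where 100 means one full core, into a number of cores.
func coresFromUtilization(percent float64) float64 {
	return percent / 100
}

func main() {
	// Illustrative samples only.
	for _, p := range []float64{25, 100, 250} {
		fmt.Printf("%.0f%% -> %.2f cores\n", p, coresFromUtilization(p))
	}
}
```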

Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
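As a rough illustration of matching the runtime to a CPU limit: the Go runtime reads the `GOMAXPROCS` environment variable at startup, and the same value can be set programmatically. The sketch below assumes a fractional limit is rounded up with a floor of 1; that rounding rule, the `procsForLimit` helper, and the 2.0 limit are assumptions for the example, not values from this PR.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
)

// procsForLimit maps a CPU limit in cores to a GOMAXPROCS value.
// Rounding up with a minimum of 1 is an assumption of this sketch.
func procsForLimit(cpus float64) int {
	n := int(math.Ceil(cpus))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	limit := 2.0 // hypothetical container CPU limit, in cores

	// runtime.GOMAXPROCS(n) applies n and returns the previous setting;
	// calling it with 0 only queries the current value.
	prev := runtime.GOMAXPROCS(procsForLimit(limit))
	fmt.Printf("GOMAXPROCS %d -> %d\n", prev, runtime.GOMAXPROCS(0))
}
```

Without such a pin, a runtime that sizes itself from the host's CPU count can schedule more parallel work than the cgroup quota allows, which tends to show up as throttling rather than as higher throughput.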
developer added 1 commit 2026-06-10 16:58:57 +00:00
R7: pin docker_stats api_version to 1.44
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m3s
8eee018728
The receiver defaults to Docker API 1.25, but the contour daemon's minimum is
1.40 (it speaks up to 1.54), so otelcol crash-looped on start with "client
version 1.25 is too old". Pinning api_version to 1.44 (accepted by both the
receiver's bundled client and the daemon) starts the receiver cleanly —
verified by running the image against the host socket ("Everything is ready",
no start error).
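The version reasoning above amounts to a range check: a client API version is usable only if it falls between the daemon's minimum and maximum. A minimal sketch using the numbers quoted in the commit message; the `parseAPIVersion`, `compareAPI`, and `accepted` helpers are hypothetical and are not part of the receiver or the Docker client.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseAPIVersion splits a "major.minor" string into integers.
func parseAPIVersion(v string) (major, minor int, err error) {
	majStr, minStr, ok := strings.Cut(v, ".")
	if !ok {
		return 0, 0, fmt.Errorf("malformed API version %q", v)
	}
	if major, err = strconv.Atoi(majStr); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(minStr); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// compareAPI returns -1, 0 or 1. It compares numerically, because a string
// comparison would wrongly rank "1.9" above "1.40".
func compareAPI(a, b string) (int, error) {
	aMaj, aMin, err := parseAPIVersion(a)
	if err != nil {
		return 0, err
	}
	bMaj, bMin, err := parseAPIVersion(b)
	if err != nil {
		return 0, err
	}
	switch {
	case aMaj != bMaj && aMaj < bMaj, aMaj == bMaj && aMin < bMin:
		return -1, nil
	case aMaj == bMaj && aMin == bMin:
		return 0, nil
	}
	return 1, nil
}

// accepted reports whether client lies within [daemonMin, daemonMax].
func accepted(client, daemonMin, daemonMax string) (bool, error) {
	lo, err := compareAPI(client, daemonMin)
	if err != nil {
		return false, err
	}
	hi, err := compareAPI(client, daemonMax)
	if err != nil {
		return false, err
	}
	return lo >= 0 && hi <= 0, nil
}

func main() {
	for _, v := range []string{"1.25", "1.44"} {
		ok, err := accepted(v, "1.40", "1.54")
		fmt.Println(v, ok, err)
	}
}
```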

Reference: developer/scrabble-game#38