developer/scrabble-game

R7: final stress run + tuning #38

Merged

developer merged 6 commits from feature/r7-final-stress-tuning into development

2026-06-11 09:35:15 +00:00

Author	SHA1	Message	Date
Ilia Denisov	225188e4b5	R7: add a VPS/VDS sizing table (min/avg/max) to the trip report	2026-06-11 11:32:09 +02:00

A practical single-host ordering guide covering CPU cores, RAM, and disk at three tiers, grounded in the R7 profile (~5.5 cores / ~2.5 GiB peak at 500 players) and the measured on-disk footprint (images ~2.4 GB; Tempo 3.1 GB at 72 h; the game DB 23 MiB and growing). Notes which knobs move disk (Tempo/Prometheus retention, Postgres growth) and that the gateway scales horizontally past one host.

Ilia Denisov	2a48df9b83	R7: trip report + docs/tracker bake-back; mark R7 done	2026-06-11 11:18:57 +02:00

- loadtest/REPORT-R7.md: the final stress-run report, covering method, the 500-player resource profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway cores; the burst run showing connection-bound behavior), and the prod-sizing recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor), the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording the decisions, findings, applied tuning and validation.

This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.

Ilia Denisov	f23da88028	R7: apply the agreed tuning from the final stress run	2026-06-11 10:33:58 +02:00

Round-2 tuning, decided from the 500-player resource profile:

- gateway: 2 -> 3 cores + GOMAXPROCS=3. It holds one h2c connection per player, so at 500 players it burst into the 2-core cap (~2.49% transport_error on game.state); 3 cores absorb the bursts. The per-connection cost is the realistic prod load.
- tempo: memory 1G -> 2G. It reached the 1 GiB cap during the run (OOM risk).
- backend Postgres pool: MAX_OPEN_CONNS 25 -> 40. The pool sat at its 25-conn cap (28 backends) at peak; headroom trims the p99 tail. Postgres (2c/512M) handles it.
- docker log volume: a json-file rotation default (10m x 3 = 30 MiB/container) applied contour-wide via a YAML anchor; the backend logs ~14 MiB / 30 min at info under load and was previously unbounded. Log level stays info.

backend/postgres stay at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores; headroom is cheap on the shared host). A validation re-run confirms the gateway fix before merge.

Ilia Denisov	8eee018728	R7: pin docker_stats api_version to 1.44	2026-06-10 18:58:55 +02:00

The receiver defaults to Docker API 1.25, but the contour daemon's minimum is 1.40 (it speaks up to 1.54), so otelcol crash-looped on start with "client version 1.25 is too old". Pinning api_version to 1.44 (accepted by both the receiver's bundled client and the daemon) starts the receiver cleanly, verified by running the image against the host socket ("Everything is ready", no start error).

Ilia Denisov	c16f27475f	R7: contour docker_stats observability + container limits/GOMAXPROCS	2026-06-10 18:53:19 +02:00

Observability: replace cAdvisor (which resolves only the root cgroup on the contour host; separate-XFS /var/lib/docker) with the otelcol docker_stats receiver, which reads per-container CPU/memory/network straight from the Docker API and works the same in prod. The collector joins the host docker group (DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out through the existing prometheus exporter, so the cAdvisor scrape job and the privileged cAdvisor service are removed. The Resources dashboard panels are retargeted to the docker_stats metric names (container_name label; container.cpu.utilization/100 == cores).

Container limits: apply deploy.resources.limits (honoured by Compose v2) across the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime matches the cgroup quota. Starting values are generous over the R2 peak (~1 core / <=100 MiB per app service) to avoid skewing or OOM-killing the measurement run; they are tightened to the agreed prod sizing after the final stress run (R7 Round 2). The privileged VPN sidecar is left unconstrained.

Ilia Denisov	04263a17ca	R7: per-player transports + drop finished games in the load harness	2026-06-10 18:53:07 +02:00

Each virtual player now builds its own edge.Client (its own h2c connection carrying both the Subscribe stream and the Execute calls), instead of every player multiplexing over a single shared http2.Transport. The R2 trip report traced the ~14% transport_error on game.state at 500 players to that single shared transport; per-player connections mirror real clients and isolate the artifact. The assembly burst and the gateway-hammer each get their own client.

playTurn now reports when a game has finished so playerLoop drops it from the rotation (slices.DeleteFunc); once no active game remains the player idles while still holding its stream. This stops secondary ops from hammering game_finished on already-ended games (the other R2 harness finding).

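Commit 04263a17ca says playerLoop drops finished games from the rotation with slices.DeleteFunc. The harness code is not shown on this page, so the following is only a minimal sketch of that pattern: the `game` type, its fields, and `dropFinished` are hypothetical names chosen for illustration, not the harness's actual identifiers.

```go
package main

import (
	"fmt"
	"slices"
)

// game is a hypothetical stand-in for the harness's per-game state.
type game struct {
	ID       string
	Finished bool
}

// dropFinished removes finished games from the rotation in place and
// returns the shortened slice, keeping the remaining games in order.
func dropFinished(active []game) []game {
	return slices.DeleteFunc(active, func(g game) bool { return g.Finished })
}

func main() {
	active := []game{
		{ID: "g1"},
		{ID: "g2", Finished: true},
		{ID: "g3"},
	}
	active = dropFinished(active)
	fmt.Println(len(active), active[0].ID, active[1].ID) // prints "2 g1 g3"

	// Once the rotation is empty, a player would idle while still
	// holding its stream, as the commit message describes.
	if len(dropFinished([]game{{ID: "g4", Finished: true}})) == 0 {
		fmt.Println("no active games: idle")
	}
}
```

slices.DeleteFunc reuses the backing array, so the result must be assigned back to the rotation slice, as `main` does here.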
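Commits c16f27475f and f23da88028 pin GOMAXPROCS to each Go service's CPU limit (for example GOMAXPROCS=3 for the 3-core gateway) so the runtime matches the cgroup quota. The commits set the value explicitly; the sketch below only illustrates the relationship between a cgroup v2 `cpu.max` value and the matching GOMAXPROCS. The function name `procsFromCPUMax` and the choice to round fractional limits up are assumptions made for this example.

```go
package main

import (
	"fmt"
	"math"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax converts the contents of a cgroup v2 cpu.max file
// ("<quota> <period>", or "max <period>" when unlimited) into a
// GOMAXPROCS value. ok is false when there is no limit or the input
// is malformed. Fractional limits are rounded up (an assumption).
func procsFromCPUMax(s string) (n int, ok bool) {
	fields := strings.Fields(s)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	return int(math.Ceil(quota / period)), true
}

func main() {
	// "300000 100000" is what cpu.max holds for a 3-core limit
	// with the default 100 ms period.
	if n, ok := procsFromCPUMax("300000 100000"); ok {
		runtime.GOMAXPROCS(n)
		fmt.Println("GOMAXPROCS set to", n) // prints "GOMAXPROCS set to 3"
	}
}
```

Setting the GOMAXPROCS environment variable in the service definition, as the commits do, achieves the same result without any code in the service itself.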