A practical single-host ordering guide — CPU cores, RAM, disk at three tiers —
grounded in the R7 profile (~5.5 cores / ~2.5 GiB peak at 500 players) and the
measured on-disk footprint (images ~2.4 GB; Tempo 3.1 GB at 72 h; the game DB
23 MiB and growing). Notes which knobs move disk (Tempo/Prometheus retention,
Postgres growth) and that the gateway scales horizontally past one host.
- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource
profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway
cores; the burst run showing connection-bound behavior), and the prod-sizing
recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor),
the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats
receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording
the decisions, findings, applied tuning and validation.
This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
Each virtual player now builds its own edge.Client (its own h2c connection
carrying both the Subscribe stream and the Execute calls), instead of every
player multiplexing over a single shared http2.Transport. The R2 trip report
traced the ~14% transport_error on game.state at 500 players to that single
shared transport; per-player connections mirror real clients and isolate the
artifact. The assembly burst and the gateway-hammer each get their own client.
playTurn now reports when a game has finished so playerLoop drops it from the
rotation (slices.DeleteFunc); once no active game remains the player idles while
still holding its stream. This stops secondary ops from hammering game_finished
on already-ended games (the other R2 harness finding).
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped
clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no
crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held
under the gateway-hammer (99.97 % rejected, p99 2 ms).
Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation
(backend/gateway/Postgres each ~1 core), amplified by the harness's single shared
http2.Transport (the harness itself peaked at 86 % of a core on the same host).
Observability finding: cAdvisor yields only the root cgroup on the contour host
(separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7
should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md;
PRERELEASE refinements logged; R2 marked done.
- display-name marker: letters-only 'Zzloadtest' (the editable-name validator
forbids digits/colons), so profile.update resends the seeded name successfully.
- draft.save: rack_order is a string in the backend draft DTO (was sent as []),
fixing the bad_request.
Both confirmed ok against the contour. chat_not_your_turn / nudge_own_turn are
by-design turn gates (backend/internal/social/chat.go), correctly exercised.
New scrabble/loadtest module (the pre-release stress harness): seeds 1000 guest +
10000 durable accounts with pre-created sessions directly in Postgres (token hash
matches backend/internal/session), drives virtual players through the edge protocol
(real 2-4p games assembled via invitations, mid-ranked legal moves generated locally
by the embedded scrabble-solver — the edge carries no board, so the client replays
history), plus nudge/chat/check-word/draft/profile/stats and a gateway-hammer that
verifies the rate limiter. Prints a trip-report summary (per-op latency percentiles,
result codes, live-event tally). Go unit tests cover the pure pieces; the DAWG-backed
move test runs under BACKEND_DICT_DIR.
Contour: add cAdvisor + postgres_exporter + a 'Scrabble - Resources' Grafana
dashboard and the two Prometheus scrape jobs, for the R2/R7 stress-run resource
baseline.
CI: gate ./loadtest/... (path filter + vet/build/test). Docs: TESTING, ARCHITECTURE,
project CLAUDE repo layout.