Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held under the gateway-hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation (backend/gateway/Postgres each ~1 core), amplified by the harness's single shared http2.Transport (the harness itself peaked at 86 % of a core on the same host). Observability finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7 should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md; PRERELEASE refinements logged; R2 marked done.
R2 — early stress-run trip report
The early stress pass for PRERELEASE.md R2. It exercises the system through the
edge protocol with the scrabble/loadtest harness, to surface logic/concurrency
bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and
R7 (final tuning). Pass bar: diagnostic — the run "passes" by completing without the
harness crashing; findings are recorded below, not gated.
Method
- Driver: the scrabble/loadtest module, run as a one-shot container on the scrabble-internal docker network (reaching postgres:5432 and gateway:8081 directly, bypassing the host→gateway hairpin).
- Seed: 10 000 durable + 1 000 guest accounts with pre-created sessions written directly to Postgres (token hash matches backend/internal/session), so the driver authenticates without the per-IP-limited auth ops.
- Games: assembled through the real invitation flow (invitation.create → invitation.accept), 2–4 players each, no robots; variants spread over scrabble_en / scrabble_ru / erudit_ru.
- Play: each virtual player holds a live Subscribe stream and, per tick, polls game.state, replays game.history and submits a mid-ranked legal move generated locally by the embedded scrabble-solver (the edge carries no board), or passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile / stats. A separate gateway-hammer floods games.list from one account.
- Scale: moderate ramp 50 → 200 → 500 concurrent players, 10 min/step (the agreed moderate profile; harness and contour share this host's CPU).
- Resource capture: docker stats (docker API) sampled every 28 s for per-container CPU/memory; Prometheus for edge latency/throughput, postgres_exporter internals and per-service Go runtime metrics.
Run configuration
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
--tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2
exporters. Seeded population removed by --cleanup afterwards.
Findings
Validated (fixed within R2)
- Harness draft payload. draft.save first returned bad_request: the backend draft DTO's rack_order is a string (the harness sent []). Fixed → ok.
- Harness profile marker. profile.update first returned invalid_profile: the editable-display-name validator (backend/internal/account/profile.go) forbids digits and colons, but the seed marker was lt:…. Switched the marker to a distinctive letters-only string → ok. Cleanup still matches it.
By-design behaviour (correctly exercised, not bugs)
- chat_not_your_turn: chat is gated to the sender's turn (backend/internal/social/chat.go); off-turn posts are correctly rejected.
- nudge_own_turn: you nudge the player whose turn it is, so a nudge on your own turn is correctly rejected.

The harness nudges/chats at random ticks, so a share of these codes is expected.
Observability gap (key R7 input)
- cAdvisor yields only the root cgroup on the contour host. Its docker factory registers, but per-container init fails with "failed to identify the read-write layer ID … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory", because this host's /var/lib/docker is a separate XFS mount that is not visible under cAdvisor's /rootfs bind (the existing galaxy deployment on the same host has the same limitation). As a result, the per-container panels of the "Scrabble — Resources" dashboard are empty here, and per-container CPU/RSS for this run was captured via docker stats instead. Postgres internals (postgres_exporter) and per-service Go runtime metrics (go_* by service_name) work.
  Recommendation for R7: adopt the otelcol docker_stats receiver (otelcol already runs the contrib image), which reads per-container stats via the docker API with no cgroup dependency, and/or run the final pass on hardware where cAdvisor resolves containers. (Decision to confirm with the owner.)
Run results
The ramp ran clean to 500 players with no harness crash, no deadlock and
zero stream errors; cleanup removed all 11 000 seeded accounts (and their ~941 games).
- Ramp: step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
- Volume (30 min): 1.20 M total edge calls, 659 req/s average. Real gameplay at scale: 48 870 committed plays, 52 772 your_turn + 159 631 opponent_moved events, 2 798 games finished.
- Latency under load (peak, step 3): game.state p50 ≈ 100 ms, p90/p99 in the 200–500 ms buckets, max 849 ms; game.submit_play similar (p99 ≤ 500 ms, max 490 ms). Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
- Rate limiter holds. The gateway-hammer sent 522 667 games.list from one account; 522 486 (99.97 %) were rate_limited, only 135 ok (the burst). Rejections are cheap (p99 = 2 ms), and the gateway sustained ~16 k req/s of rejections during the flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).
Top finding — transport_error under saturation. At 500 players ~14 % of
game.state calls (72 429 / 519 067) and a few % of the other ops returned a Connect
transport_error (not a domain code). It correlates with the CPU saturation below: the
backend/gateway are pinned near one core each while the host also runs the 86 %-core
harness, so the edge sheds load (resets/timeouts) at the knee. It is amplified by a
harness artifact — all 500 virtual players multiplex over a single shared
http2.Transport, so 500 persistent Subscribe streams plus Execute calls press on one
HTTP/2 connection's concurrent-stream limit; real clients each use their own connection.
Actions: R7 harness — give each player (or a pool) its own transport, and run on
hardware not shared with the contour; R3 — confirm the gateway's h2c
MaxConcurrentStreams and edge timeouts are sized for many persistent streams.
Minor findings:
- unauthenticated on a tiny share (188 / 519 067 game.state, ~0.04 %): transient session-resolve failures under load; worth a glance in R3 but not material.
- one internal on game.pass (1 / 4 788).
- game_finished is the dominant rejection code on chat.nudge / chat.post (≈ 3 900 each): the harness keeps issuing secondary ops on games that have already ended. Harness refinement: drop finished games from the rotation (R7).
- nudge_own_turn / chat_not_your_turn / nudge_too_soon are the expected turn/rate gates, correctly exercised.
Resource baseline
Per-container peak during step 3 (500 players), from docker stats:
| container | peak CPU | memory |
|---|---|---|
| scrabble-backend | 99 % (~1 core) | 91 MiB |
| scrabble-gateway | 93 % | 76 MiB |
| scrabble-postgres | 90 % | 69 MiB |
| scrabble-loadtest (harness) | 86 % | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |
- The contour is CPU-bound at 500 concurrent players: backend, gateway and Postgres each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
- Caveat: the harness itself peaked at 86 % of a core on the same host, so the step-3 latency and transport_error figures are pessimistic: the contour competed with the generator for CPU. A clean ceiling needs separate hardware (R7).
- Postgres: peak 28 backend connections, ~5 581 commits/s at the peak, 100 % cache hit ratio (no disk reads). The DB was comfortable; CPU, not I/O, is its limit here.
- Goroutines: backend 638, gateway 1 698 (it holds the 500 Subscribe streams + per-request goroutines), telegram 49; all stable, no leak across the ramp.
Recommendations feeding later phases
- R3 (edge hardening): the per-user limiter holds (99.97 % rejected, p99 2 ms); add the per-IP body-size cap on top. Investigate the ~14 % transport_error on game.state at 500 players: confirm the gateway h2c MaxConcurrentStreams and edge read/write timeouts are sized for many persistent Subscribe streams, and glance at the ~0.04 % transient unauthenticated resolves under load.
- R6 (refactor): no logic bug forced a code change beyond the two harness-payload fixes; the run surfaced no deadlock or goroutine leak across the ramp.
- R7 (final tuning + stress): (1) fix the per-container observability gap by adopting the otelcol docker_stats receiver so Grafana shows per-container CPU/RSS on the contour; (2) refine the harness (per-player/pooled transports, dropping finished games from the rotation) and run on hardware not shared with the contour; (3) size pools / GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway, Postgres at 500 players).
Re-running
See README.md. Briefly, from the repo root:
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
-e POSTGRES_PASSWORD=… scrabble-loadtest run # add --reset on a re-run
The harness stays in the repo for the R7 repeat.