scrabble-game/loadtest/REPORT-R2.md
Ilia Denisov a2265a122e
R2: early-pass trip report + mark R2 done
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped
clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no
crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held
under the gateway-hammer (99.97 % rejected, p99 2 ms).

Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation
(backend/gateway/Postgres each ~1 core), amplified by the harness's single shared
http2.Transport (the harness itself peaked at 86 % of a core on the same host).
Observability finding: cAdvisor yields only the root cgroup on the contour host
(separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7
should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md;
PRERELEASE refinements logged; R2 marked done.
2026-06-10 00:47:16 +02:00


R2 — early stress-run trip report

The early stress pass for PRERELEASE.md R2. It exercises the system through the edge protocol with the scrabble/loadtest harness, to surface logic/concurrency bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and R7 (final tuning). Pass bar: diagnostic — the run "passes" by completing without the harness crashing; findings are recorded below, not gated.

Method

  • Driver: the scrabble/loadtest module, run as a one-shot container on the scrabble-internal docker network (reaching postgres:5432 and gateway:8081 directly, bypassing the host→gateway hairpin).
  • Seed: 10 000 durable + 1 000 guest accounts with pre-created sessions written directly to Postgres (token hash matches backend/internal/session), so the driver authenticates without the per-IP-limited auth ops.
  • Games: assembled through the real invitation flow (invitation.create → invitation.accept), 2 to 4 players each, no robots; variants spread over scrabble_en / scrabble_ru / erudit_ru.
  • Play: each virtual player holds a live Subscribe stream and, per tick, polls game.state, replays game.history and submits a mid-ranked legal move generated locally by the embedded scrabble-solver (the edge carries no board), or passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile / stats. A separate gateway-hammer floods games.list from one account.
  • Scale: moderate ramp 50 → 200 → 500 concurrent players, 10 min/step (the agreed moderate profile; harness and contour share this host's CPU).
  • Resource capture: docker stats (docker API) sampled every 28 s for per-container CPU/memory; Prometheus for edge latency/throughput, postgres_exporter internals and per-service Go runtime metrics.

Run configuration

loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
             --tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup

Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2 exporters. Seeded population removed by --cleanup afterwards.
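
As a cross-check of the profile above, the following minimal Go sketch derives the total ramp length from the --steps and --step-dur values and multiplies it by the 659 req/s average reported under Run results. ParseSteps and RampDuration are hypothetical helpers written for this illustration; they are not taken from the harness, whose flag parsing is not shown in this report.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ParseSteps turns a comma-separated list such as "50,200,500" into player
// counts. Hypothetical helper: the real harness may parse its flag differently.
func ParseSteps(s string) ([]int, error) {
	parts := strings.Split(s, ",")
	steps := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("bad step %q: %w", p, err)
		}
		steps = append(steps, n)
	}
	return steps, nil
}

// RampDuration assumes every step runs for the same duration (e.g. "10m").
func RampDuration(steps []int, stepDur string) (time.Duration, error) {
	d, err := time.ParseDuration(stepDur)
	if err != nil {
		return 0, err
	}
	return time.Duration(len(steps)) * d, nil
}

func main() {
	steps, err := ParseSteps("50,200,500")
	if err != nil {
		panic(err)
	}
	total, err := RampDuration(steps, "10m")
	if err != nil {
		panic(err)
	}
	// 659 req/s is the average quoted in the run results.
	calls := int(659 * total.Seconds())
	fmt.Printf("steps=%v total=%v calls at 659 req/s=%d\n", steps, total, calls)
}
```

Three 10-minute steps give the 30-minute window used for the volume figures, and 659 req/s over that window comes to roughly 1.19 M calls, in line with the reported total of about 1.2 M.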

Findings

Validated (fixed within R2)

  • Harness draft payload. draft.save first returned bad_request: the backend draft DTO's rack_order is a string (the harness sent []). Fixed → ok.
  • Harness profile marker. profile.update first returned invalid_profile: the editable-display-name validator (backend/internal/account/profile.go) forbids digits and colons, but the seed marker was lt:…. Switched the marker to a distinctive letters-only string → ok. Cleanup still matches it.

By-design behaviour (correctly exercised, not bugs)

  • chat_not_your_turn — chat is gated to the sender's turn (backend/internal/social/chat.go); off-turn posts are correctly rejected.
  • nudge_own_turn — you nudge the player whose turn it is, so a nudge on your own turn is correctly rejected. The harness nudges/chats at random ticks, so a share of these codes is expected.

Observability gap (key R7 input)

  • cAdvisor yields only the root cgroup on the contour host. Its docker factory registers, but per-container init fails — failed to identify the read-write layer ID … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory — because this host's /var/lib/docker is a separate XFS mount not visible under cAdvisor's /rootfs bind (the existing galaxy deployment on the same host has the same limitation). So the Scrabble — Resources dashboard's per-container panels are empty here, and per-container CPU/RSS for this run was captured via docker stats instead. Postgres internals (postgres_exporter) and per-service Go runtime metrics (go_* by service_name) work. Recommendation for R7: adopt the otelcol docker_stats receiver (already the contrib image) — it reads per-container stats via the docker API with no cgroup dependency — and/or run the final pass on hardware where cAdvisor resolves containers. (Decision to confirm with the owner.)

Run results

The ramp ran clean to 500 players with no harness crash, no deadlock and zero stream errors; cleanup removed all 11 000 seeded accounts (and their ~941 games).

  • Ramp: step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
  • Volume (30 min): 1.20 M total edge calls, 659 req/s average. Real gameplay at scale: 48 870 committed plays, 52 772 your_turn + 159 631 opponent_moved events, 2 798 games finished.
  • Latency under load (peak, step 3): game.state p50 ≈ 100 ms, p90/p99 in the 200 to 500 ms buckets, max 849 ms; game.submit_play similar (p99 ≤ 500 ms, max 490 ms). Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
  • Rate limiter holds. The gateway-hammer sent 522 667 games.list from one account; 522 486 (99.97 %) were rate_limited, only 135 ok (the burst). Rejections are cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).
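
The "small burst accepted, everything else rejected" pattern is what a token-bucket limiter produces under a flood. The sketch below is a generic token bucket, not the gateway's actual limiter (its implementation is not described in this report). The TokenBucket type, the burst size of 135 and the refill rate of 1 token/s are assumptions chosen for illustration; the real 135 ok responses would also include any tokens refilled during the flood. Time is passed in as plain seconds so the example is deterministic.

```go
package main

import "fmt"

// TokenBucket is a hypothetical per-user limiter: capacity is the burst size,
// refillPerSec the sustained rate. Neither value comes from the gateway config.
type TokenBucket struct {
	capacity     float64
	tokens       float64
	refillPerSec float64
	lastSec      float64
}

// NewTokenBucket starts with a full bucket at time nowSec.
func NewTokenBucket(capacity, refillPerSec, nowSec float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, refillPerSec: refillPerSec, lastSec: nowSec}
}

// Allow refills the bucket for the time elapsed since the previous call,
// capped at capacity, then spends one token if one is available.
func (b *TokenBucket) Allow(nowSec float64) bool {
	if elapsed := nowSec - b.lastSec; elapsed > 0 {
		b.tokens += elapsed * b.refillPerSec
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.lastSec = nowSec
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// Flood sends n back-to-back requests at one instant and counts the outcomes.
func Flood(b *TokenBucket, n int, nowSec float64) (ok, rejected int) {
	for i := 0; i < n; i++ {
		if b.Allow(nowSec) {
			ok++
		} else {
			rejected++
		}
	}
	return ok, rejected
}

func main() {
	b := NewTokenBucket(135, 1, 0) // illustrative burst and refill rate
	ok, rejected := Flood(b, 1000, 0)
	fmt.Printf("ok=%d rejected=%d\n", ok, rejected)
}
```

Rejection in this model is a clock read and a comparison, which is consistent with the low rejection latency observed, although the gateway's real cost profile depends on its own implementation.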

Top finding — transport_error under saturation. At 500 players ~14 % of game.state calls (72 429 / 519 067) and a few % of the other ops returned a Connect transport_error (not a domain code). It correlates with the CPU saturation below: the backend/gateway are pinned near one core each while the host also runs the 86 %-core harness, so the edge sheds load (resets/timeouts) at the knee. It is amplified by a harness artifact — all 500 virtual players multiplex over a single shared http2.Transport, so 500 persistent Subscribe streams plus Execute calls press on one HTTP/2 connection's concurrent-stream limit; real clients each use their own connection. Actions: R7 harness — give each player (or a pool) its own transport, and run on hardware not shared with the contour; R3 — confirm the gateway's h2c MaxConcurrentStreams and edge timeouts are sized for many persistent streams.
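
One way to approach the "own transport or a pool" action is sketched below: a small pool of independent HTTP clients with players assigned round-robin, so streams spread over several connections instead of one. ClientPool, NewClientPool and NewDefaultPool are hypothetical names, and the plain http.Transport is a stand-in; the harness would need to supply its own h2c-capable HTTP/2 transport in the factory. The pool size of 50 is an arbitrary example, not a recommendation.

```go
package main

import (
	"fmt"
	"net/http"
)

// ClientPool holds independent clients. Each client owns its transport and
// therefore its own connections.
type ClientPool struct {
	clients []*http.Client
}

// NewClientPool builds size clients, calling newTransport once per client so
// that no two clients share a connection pool.
func NewClientPool(size int, newTransport func() http.RoundTripper) *ClientPool {
	if size < 1 {
		size = 1
	}
	p := &ClientPool{clients: make([]*http.Client, size)}
	for i := range p.clients {
		p.clients[i] = &http.Client{Transport: newTransport()}
	}
	return p
}

// NewDefaultPool uses a plain http.Transport as a stand-in. Assumption: the
// real harness would return an HTTP/2 (h2c) transport from the factory instead.
func NewDefaultPool(size int) *ClientPool {
	return NewClientPool(size, func() http.RoundTripper { return &http.Transport{} })
}

// For returns the client for a non-negative player index (round-robin).
func (p *ClientPool) For(player int) *http.Client {
	return p.clients[player%len(p.clients)]
}

// MaxPlayersPerClient is the worst-case number of players sharing one client.
func (p *ClientPool) MaxPlayersPerClient(players int) int {
	return (players + len(p.clients) - 1) / len(p.clients)
}

func main() {
	pool := NewDefaultPool(50)
	fmt.Println("players per client at 500 players:", pool.MaxPlayersPerClient(500))
}
```

With 50 clients, 500 players put at most 10 persistent Subscribe streams on any one client, compared with 500 on the single shared transport. A pool size equal to the player count reduces to one transport per player.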

Minor findings:

  • unauthenticated on a tiny share (188 / 519 067 game.state, ~0.04 %) — transient session-resolve failures under load; worth a glance in R3 but not material.
  • one internal on game.pass (1 / 4 788).
  • game_finished dominates chat.nudge/chat.post (≈ 3 900 each): the harness keeps issuing secondary ops against games that have already ended. Harness refinement for R7: drop finished games from the rotation.
  • nudge_own_turn / chat_not_your_turn / nudge_too_soon are the expected turn/rate gates, correctly exercised.
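
The percentages quoted in this section follow directly from the raw counts. The short Go sketch below recomputes them; SharePercent is a helper introduced here for illustration, and all counts are copied from the findings above.

```go
package main

import "fmt"

// SharePercent returns part/total expressed as a percentage.
func SharePercent(part, total int) float64 {
	return 100 * float64(part) / float64(total)
}

func main() {
	// Counts taken from the run results in this report.
	fmt.Printf("transport_error on game.state: %.2f %%\n", SharePercent(72429, 519067))
	fmt.Printf("unauthenticated on game.state: %.3f %%\n", SharePercent(188, 519067))
	fmt.Printf("internal on game.pass:         %.3f %%\n", SharePercent(1, 4788))
	fmt.Printf("rate_limited on games.list:    %.2f %%\n", SharePercent(522486, 522667))
}
```

The results round to the ~14 %, ~0.04 % and 99.97 % figures used in the text.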

Resource baseline

Per-container peak during step 3 (500 players), from docker stats:

| container | peak CPU | memory |
|---|---|---|
| scrabble-backend | 99 % (~1 core) | 91 MiB |
| scrabble-gateway | 93 % | 76 MiB |
| scrabble-postgres | 90 % | 69 MiB |
| scrabble-loadtest (harness) | 86 % | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |

  • The contour is CPU-bound at 500 concurrent players: backend, gateway and Postgres each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
  • Caveat: the harness itself peaked at 86 % of a core on the same host, so the step-3 latency and transport_error figures are pessimistic — the contour competed with the generator for CPU. A clean ceiling needs separate hardware (R7).
  • Postgres: peak 28 backend connections, ~5 581 commits/s at the peak, 100 % cache hit ratio (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here.
  • Goroutines: backend 638, gateway 1 698 (it holds the 500 Subscribe streams + per-request goroutines), telegram 49 — all stable, no leak across the ramp.

Recommendations feeding later phases

  • R3 (edge hardening): the per-user limiter holds (99.97 % rejected, p99 2 ms) — add the per-IP body-size cap on top. Investigate the ~14 % transport_error on game.state at 500 players: confirm the gateway h2c MaxConcurrentStreams and edge read/write timeouts are sized for many persistent Subscribe streams, and glance at the ~0.04 % transient unauthenticated resolves under load.
  • R6 (refactor): no logic bug forced a code change beyond the two harness-payload fixes; the run surfaced no deadlock or goroutine leak across the ramp.
  • R7 (final tuning + stress): (1) fix the per-container observability gap — adopt the otelcol docker_stats receiver so Grafana shows per-container CPU/RSS on the contour; (2) refine the harness — per-player/pooled transports and dropping finished games from the rotation — and run on hardware not shared with the contour; (3) size pools / GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway, Postgres at 500 players).

Re-running

See README.md. Briefly, from the repo root:

docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest run    # add --reset on a re-run

The harness stays in the repo for the R7 repeat.