scrabble-game/loadtest/REPORT-R7.md
Ilia Denisov 225188e4b5
R7: add a VPS/VDS sizing table (min/avg/max) to the trip report
A practical single-host ordering guide — CPU cores, RAM, disk at three tiers —
grounded in the R7 profile (~5.5 cores / ~2.5 GiB peak at 500 players) and the
measured on-disk footprint (images ~2.4 GB; Tempo 3.1 GB at 72 h; the game DB
23 MiB and growing). Notes which knobs move disk (Tempo/Prometheus retention,
Postgres growth) and that the gateway scales horizontally past one host.
2026-06-11 11:32:09 +02:00


R7 — final stress-run trip report

The final pre-release stress pass for PRERELEASE.md R7. It re-runs the R2 harness (scrabble/loadtest) against the final, refactored system on a freshly redeployed contour, to confirm the system holds at scale and to settle the resource sizing (container limits, GOMAXPROCS, pools, rate limits, log levels) before the Stage 18 prod cutover. Pass bar: diagnostic + a tuning decision — the run "passes" by completing cleanly; the per-container resource profile drives the tuning recorded below. Companion to the early pass, REPORT-R2.md.

What changed since the R2 pass

  • Harness — per-player transports. Each virtual player now owns its edge.Client (its own http2.Transport / h2c connection carrying both its Subscribe stream and its Execute calls), instead of all players multiplexing over one shared transport. R2 traced the ~14 % transport_error on game.state at 500 players to that single shared connection's stream limit; per-player connections mirror real clients and remove the artifact, so this pass measures the system, not the harness.
  • Harness — drop finished games. playTurn reports a finished game and the player drops it from its rotation, so secondary ops stop hitting game_finished on ended games (the other R2 harness finding).
  • Observability — otelcol docker_stats. cAdvisor (which resolves only the root cgroup on this host — separate-XFS /var/lib/docker) is replaced by the otelcol docker_stats receiver, reading per-container CPU/memory/network from the Docker API. Per-container panels now populate on the contour host. (api_version pinned to 1.44; the daemon's minimum is 1.40.)
  • Contour — container limits + GOMAXPROCS. deploy.resources.limits now bound every service; the Go services pin GOMAXPROCS to their CPU limit so the runtime matches the cgroup quota. Starting values were generous over the R2 peak; this pass validates them and settles the agreed sizing (below).

Method

Unchanged from R2 except for the per-player transports and the dropped-finished-games refinement above:

  • Driver: the scrabble/loadtest module, run as a one-shot container on the scrabble-internal docker network (reaching postgres:5432 / gateway:8081 directly), capped at --cpus 3 so the contour keeps the host's spare cores.
  • Seed: 10 000 durable + 1 000 guest accounts with pre-created sessions written straight to Postgres (token hash matches backend/internal/session).
  • Games: assembled through the real invitation flow, 24 players each, no robots; variants over scrabble_en / scrabble_ru / erudit_ru.
  • Play: each player holds a live Subscribe stream and, per tick, polls game.state, replays game.history and submits a mid-ranked legal move generated locally by the embedded scrabble-solver, or passes / exchanges; a fraction exercise nudge / chat / check-word / draft / profile / stats. A separate gateway-hammer floods games.list from one account.
  • Scale: the same moderate ramp 50 → 200 → 500 concurrent players, 10 min/step.
  • Resource capture: docker stats (docker API) sampled every ~20 s for per-container CPU/memory; the otelcol docker_stats receiver → Prometheus → the Grafana Scrabble — Resources dashboard for the same per-container series; postgres_exporter internals and per-service Go runtime metrics.

Run configuration

docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest \
  run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
      --tick 800ms --hammer-workers 20 --hammer-dur 15s --reset --cleanup

Date: 2026-06-10. Contour: the R1-baseline schema, freshly redeployed with the R7 container limits / GOMAXPROCS (backend/gateway/postgres capped at 2 cores + 512 MiB, GOMAXPROCS=2) and the docker_stats observability. Seeded population removed by --cleanup afterwards.

Findings

The ramp ran clean to 500 players — no harness crash, no deadlock, stream errors: 0 — and cleanup removed all 11 000 seeded accounts.

  • Volume (1827 s): 821 680 edge calls (449.7 req/s incl. the hammer). Real gameplay at scale: 50 916 committed plays, 4 817 passes, 2 931 games finished; 165 755 opponent_moved + 54 864 your_turn events.
  • The per-player transport fix worked. game.state returned transport_error on 3 173 / 127 403 = 2.49 % of calls — down from R2's ~14 % on the same step. Other ops were lower still (game.history 0.43 %, game.submit_play 0.28 %). The residual is the gateway bursting into its 2-core cap (see the profile below), not the harness.
  • Dropping finished games worked. game_finished on chat.nudge / chat.post fell to 35 / 36 (R2: ≈ 3 900 each) — secondary ops no longer hammer ended games.
  • The limiter holds. The gateway-hammer sent 565 152 games.list; 564 979 (99.97 %) were rate_limited (154 ok burst, 19 deadline), p99 = 2 ms, ~309 req/s of rejections sustained — unchanged from R2.
  • Latency (peak): game.state p50 ≈ 100 ms, p99 in the 2000 ms bucket (max 2549 ms); game.submit_play p50 ≈ 100 ms, p99 in the 1000 ms bucket. Lobby ops stayed fast (invitation / games.list p99 ≤ 10 ms). The p99 tail correlates with the gateway burst-throttling, not the backend (which stayed at ~0.85 core).
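
The headline ratios in the findings can be re-derived from the raw counts. The sketch below is only a convenience check, not part of the harness; the helper names `pct` and `perSecond` are made up for illustration, and every number is copied from the bullets above.

```go
package main

import "fmt"

// pct returns part/total as a percentage.
func pct(part, total int) float64 {
	return 100 * float64(part) / float64(total)
}

// perSecond returns the mean rate of count events over a window of seconds.
func perSecond(count int, seconds float64) float64 {
	return float64(count) / seconds
}

func main() {
	const runSeconds = 1827 // run length reported in the Volume bullet

	fmt.Printf("game.state transport_error: %.2f %%\n", pct(3173, 127403))
	fmt.Printf("hammer rate_limited:        %.2f %%\n", pct(564979, 565152))
	fmt.Printf("edge calls:                 %.1f req/s\n", perSecond(821680, runSeconds))
	fmt.Printf("hammer rejections:          %.0f req/s\n", perSecond(564979, runSeconds))
}
```

The four printed values should line up with the 2.49 %, 99.97 %, 449.7 req/s and ~309 req/s quoted in the findings.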

Resource profile

Per-container peak during step 3 (500 players), with the R7 starting limits in force (backend/gateway/postgres capped at 2 cores / 512 MiB). Two CPU columns: docker stats samples a ~1 s window (catches bursts); the otelcol docker_stats receiver averages over its 30 s collection interval (smooths them) — they agree within sampling error, which validates the new observability path.

| container | CPU burst (1 s) | CPU sustained (30 s) | CPU cap | mem peak | mem cap |
|---|---|---|---|---|---|
| scrabble-gateway | 217 % (at cap) | ~145 % | 200 % | 167 MiB | 512 MiB |
| scrabble-postgres | 138 % | ~153 % | 200 % | 117 MiB | 512 MiB |
| scrabble-backend | 85 % | ~89 % | 200 % | 116 MiB | 512 MiB |
| scrabble-tempo | 33 % | | (none) | 1024 MiB (at cap) | 1024 MiB |
| scrabble-otelcol | 11 % | | (none) | 131 MiB | 512 MiB |
| scrabble-loadtest (harness) | 157 % | | 300 % | 369 MiB | |

  • The gateway is the binding constraint. With one h2c connection per player it draws ~1.45 cores sustained and bursts to its 2-core cap at 500 players, throttling briefly — the source of the 2.49 % transport_error. R2 saw only ~0.93 core because all 500 players shared one connection; the +~0.5 core is the realistic per-connection overhead (500 separate HTTP/2 connections). This is a sizing fact, not a regression.
  • backend is over-provisioned (~0.85 core vs a 2-core cap); postgres (~1.4 cores) has headroom; both stayed ≤ 120 MiB.
  • tempo reached its 1 GiB memory cap (R2: 446 MiB) — an OOM risk under sustained tracing.
  • Postgres backends peaked at 28, with the backend pool at its MaxOpenConns=25 cap. Cache hit stayed ~100 % (no disk reads); CPU, not I/O, is the limit.
  • docker log volume (30 min): backend 14.2 MiB, gateway 4.6 MiB, postgres 0.04 MiB — the backend's per-request latency line at info dominates, and json-file logs had no rotation.
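
One way to read the profile is to ask which peaks reached their limits. The sketch below is illustrative only: the `sample` type and the `atCap` / `binding` helpers are hypothetical names, the rows copy the burst-CPU and memory columns from the table above, and a cap of 0 stands in for "(none)".

```go
package main

import "fmt"

// sample is one row of the step-3 peak profile. A cap of 0 means "no limit".
type sample struct {
	name   string
	cpuPct float64 // burst CPU, % of one core
	cpuCap float64 // CPU limit, % of one core
	memMiB float64 // peak memory
	memCap float64 // memory limit
}

// profile copies the burst-CPU and memory columns from the report's table.
var profile = []sample{
	{"scrabble-gateway", 217, 200, 167, 512},
	{"scrabble-postgres", 138, 200, 117, 512},
	{"scrabble-backend", 85, 200, 116, 512},
	{"scrabble-tempo", 33, 0, 1024, 1024},
	{"scrabble-otelcol", 11, 0, 131, 512},
}

// atCap reports whether a peak reached its limit; a zero limit never binds.
func atCap(peak, limit float64) bool {
	return limit > 0 && peak >= limit
}

// binding lists every "container: resource" pair whose peak hit the cap.
func binding(rows []sample) []string {
	var out []string
	for _, r := range rows {
		if atCap(r.cpuPct, r.cpuCap) {
			out = append(out, r.name+": cpu")
		}
		if atCap(r.memMiB, r.memCap) {
			out = append(out, r.name+": memory")
		}
	}
	return out
}

func main() {
	for _, b := range binding(profile) {
		fmt.Println(b)
	}
}
```

With these rows only the gateway's CPU and tempo's memory are flagged, which are the two limits this pass goes on to raise.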

Tuning applied

Agreed from the profile (all in deploy/docker-compose.yml; no code change — the pool is already env-driven):

| knob | from | to | why |
|---|---|---|---|
| gateway CPU + GOMAXPROCS | 2 cores / 2 | 3 cores / 3 | it bursts into the 2-core cap at 500 players (the 2.49 % transport_error); 3 absorbs the bursts |
| tempo memory | 1 GiB | 2 GiB | it reached the 1 GiB cap (OOM risk) |
| backend MAX_OPEN_CONNS | 25 | 40 | the pool sat at its 25-conn cap at peak; headroom trims the p99 tail |
| docker logs | unbounded json-file | 10m × 3 | bound the ~14 MiB / 30 min backend log; level stays info |

Left as-is: backend / postgres at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores; headroom is cheap on the shared host); the per-user rate limiter; h2cMaxConcurrentStreams=250 (now per-connection at ~1 stream each, so ample); and the cache TTLs (no pressure observed).
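
A side effect of the log cap is that it bounds how far back container logs reach. As a rough sketch, assuming the 30-minute volumes measured in this run stay constant and treating the 10m × 3 cap as 30 MiB, the retained window can be estimated as below (`retainedMinutes` is a hypothetical helper, not something in the repo).

```go
package main

import "fmt"

// retainedMinutes estimates how many minutes of history a size-capped,
// rotated log keeps, given a volume written over an observation window.
func retainedMinutes(capMiB, writtenMiB, overMinutes float64) float64 {
	return capMiB / (writtenMiB / overMinutes)
}

func main() {
	const capMiB = 3 * 10 // json-file 10m x 3, taken as 30 MiB

	// Volumes are the 30-minute figures from the resource profile.
	fmt.Printf("backend keeps ~%.0f min at peak load\n", retainedMinutes(capMiB, 14.2, 30))
	fmt.Printf("gateway keeps ~%.0f min at peak load\n", retainedMinutes(capMiB, 4.6, 30))
}
```

At the measured peak rate this comes to roughly an hour of backend logs, so anything that must survive longer than that under load would need to be shipped elsewhere before rotation.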

Validation re-run

Re-running the same gradual ramp (50 → 200 → 500) on the tuned contour confirms the fix:

  • game.state transport_error fell to 0.72 % (853 / 119 051), down from 2.49 % at 2 cores. The latency tail also improved — p99 in the 1000 ms bucket, max 1220 ms (was the 2000 ms bucket, max 2549 ms).
  • The gateway peaked at ~2 cores (≈196 % on the 30 s gauge) — now comfortably under the 3-core cap, so it no longer throttles. backend ~1 core, postgres ~1.3 cores.
  • tempo peaked at ~1.27 GiB — under the new 2 GiB cap (it would have OOM-ed at 1 GiB).
  • Drop-finished still holds (game_finished on chat 41/42); the limiter still rejects 99.97 % of the hammer at p99 2 ms; stream errors: 0.

A separate burst stress (a single 100 → 500 jump — 400 players connecting at once) pegged the gateway at 3 cores (≈296 % sustained) and pushed game.state transport_error to 9.27 %. The gateway is connection-CPU-bound and bursty: average load is ~1 core, but a mass-simultaneous connection storm saturates whatever single-node cap it is given. Real arrivals are gradual (the canonical run), where 3 cores has headroom; the lever for a true arrival spike is horizontal scaling, not more cores per node — carried into the prod recommendation below.

Prod-sizing recommendation (Stage 18)

The contour is CPU-bound and gateway-led at 500 concurrent players. Carry these to the prod contour env (the same compose, PROD_* values):

  • gateway: ≥ 3 cores per ~500 concurrent players, GOMAXPROCS pinned to the limit — it scales with the connection count, not just the request rate; beyond one node's worth, scale the gateway horizontally rather than vertically.
  • backend: ~1–2 cores, pool 40: comfortable, since the work is light per request.
  • postgres: ~2 cores / ≥ 512 MiB — ~1.4 cores at 500 players, 100 % cache hit.
  • tempo: ≥ 2 GiB; the Go services run under ~170 MiB (256 MiB would suffice, 512 is safe); pin GOMAXPROCS to each CPU limit; keep json-file rotation.
  • Memory is not the constraint anywhere; CPU is.
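
The gateway rule of thumb can be turned into a small planning helper. This is a sketch under a loud assumption: it extrapolates linearly from the single tested point (3 cores per ~500 concurrent players), which this run did not measure beyond 500. The constants and function names are illustrative.

```go
package main

import "fmt"

const (
	playersPerNode = 500 // tested ceiling for one gateway node in this run
	coresPerNode   = 3   // gateway CPU limit and GOMAXPROCS after tuning
)

// gatewayNodes returns how many gateway replicas to plan for, assuming
// load scales linearly with concurrent players (an untested assumption).
func gatewayNodes(players int) int {
	if players <= 0 {
		return 0
	}
	return (players + playersPerNode - 1) / playersPerNode // ceiling division
}

// gatewayCores returns the total gateway cores across all replicas.
func gatewayCores(players int) int {
	return gatewayNodes(players) * coresPerNode
}

func main() {
	for _, p := range []int{150, 500, 1200} {
		fmt.Printf("%d players: %d node(s), %d cores\n", p, gatewayNodes(p), gatewayCores(p))
	}
}
```

The helper deliberately adds nodes instead of growing one node's core count, matching the recommendation to scale the gateway horizontally.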

VPS / VDS sizing (single-host contour)

The whole contour (the app + the observability stack) runs on one host via docker-compose. The tiers below are grounded in the R7 profile (≈5.5 cores / ≈2.5 GiB RAM peak at 500 concurrent players; ≈0.5 GiB idle) and the measured on-disk footprint: prod images ≈2.4 GB; the Tempo volume 3.1 GB at 72 h retention; Prometheus ≈12 GB at 15 d; the game DB 23 MiB and growing with history. CPU and disk needs grow with player count and retention; RAM has the most slack.

| tier | CPU | RAM | disk | handles |
|---|---|---|---|---|
| Minimum | 2 cores | 2 GiB | 20 GiB | up to ~150 concurrent; lower the compose limits (gateway 1.5 / backend·postgres 1 / tempo 1 GiB) to fit the box |
| Average (reasonable load) | 4 cores | 4 GiB | 40 GiB | ~300–400 concurrent comfortably; the tested 500 with occasional gateway burst-throttling |
| Maximum (worry-free) | 8 cores | 8 GiB | 80 GiB | 500+ concurrent with full gateway burst headroom (its 3-core cap) + room to grow; the compose limits fit as-is |

  • The per-service limits in docker-compose.yml are tuned for the Average/Maximum target (the gateway alone caps at 3 cores). On the Minimum tier, scale them down to match the host or the caps over-subscribe it.
  • Disk is dominated by observability retention + DB growth. Tempo (72 h traces) and Prometheus (15 d metrics) are the main levers — shorten the windows (or move Tempo to object storage) to cut disk; Postgres grows with game history, so budget for months of it; container logs are already capped (json-file 10m × 3 ≈ 30 MiB each).
  • RAM rarely binds: the contour peaks ≈2.5 GiB at 500 players and the sum of all configured limits is ≈5.6 GiB, so 8 GiB never strains.
  • Beyond one host's worth of players, scale the gateway horizontally (it is connection-CPU-bound) rather than ordering an ever-bigger box.
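
To see how much the retention knobs move disk, the measured footprint can be folded into a rough budget. The sketch below assumes Tempo and Prometheus volumes scale linearly with their retention windows, which the report does not verify, and it leaves out the OS, container logs and any safety margin. `diskGB` and the constant names are hypothetical.

```go
package main

import "fmt"

// Measured footprint from this report, in GB.
const (
	imagesGB     = 2.4  // prod images
	tempoGBAt72h = 3.1  // Tempo volume at 72 h retention
	promGBAt15d  = 12.0 // Prometheus volume at 15 d retention
)

// diskGB estimates contour disk use for the given retention windows and
// database size, assuming volume grows linearly with retention.
func diskGB(tempoHours, promDays, dbGB float64) float64 {
	return imagesGB + tempoGBAt72h*tempoHours/72 + promGBAt15d*promDays/15 + dbGB
}

func main() {
	const dbGB = 0.024 // the game DB today (23 MiB)

	fmt.Printf("current retention: %.1f GB\n", diskGB(72, 15, dbGB))
	fmt.Printf("halved retention:  %.1f GB\n", diskGB(36, 7.5, dbGB))
}
```

Under these assumptions the current windows already take most of the Minimum tier's 20 GiB, which is why shortening retention is the first lever on a small box.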

Re-running

See README.md. Briefly, from the repo root:

docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest run --reset --cleanup

The harness stays in the repo for future repeats.