Ilia Denisov a2265a122e
R2: early-pass trip report + mark R2 done
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped
clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no
crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held
under the gateway-hammer (99.97 % rejected, p99 2 ms).

Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation
(backend/gateway/Postgres each ~1 core), amplified by the harness's single shared
http2.Transport (the harness itself peaked at 86 % of a core on the same host).
Observability finding: cAdvisor yields only the root cgroup on the contour host
(separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7
should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md;
PRERELEASE refinements logged; R2 marked done.
2026-06-10 00:47:16 +02:00

# R2 — early stress-run trip report
The early stress pass for `PRERELEASE.md` R2. It exercises the system through the
**edge protocol** with the `scrabble/loadtest` harness, to surface logic/concurrency
bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and
R7 (final tuning). Pass bar: **diagnostic** — the run "passes" by completing without the
harness crashing; findings are recorded below, not gated.
## Method
- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
`scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081`
directly, bypassing the host→gateway hairpin).
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
directly to Postgres (token hash matches `backend/internal/session`), so the driver
authenticates without the per-IP-limited auth ops.
- **Games:** assembled through the real **invitation** flow (`invitation.create`, then
`invitation.accept`), 2–4 players each, no robots; variants spread over
scrabble_en / scrabble_ru / erudit_ru.
- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls
`game.state`, replays `game.history` and submits a **mid-ranked** legal move generated
locally by the embedded `scrabble-solver` (the edge carries no board), or
passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile /
stats. A separate **gateway-hammer** floods `games.list` from one account.
- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the
agreed moderate profile; harness and contour share this host's CPU).
- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container
CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and
per-service Go runtime metrics.
## Run configuration
```
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
--tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
```
Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2
exporters. Seeded population removed by `--cleanup` afterwards.
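
As a rough sanity bound on the volume figures, the ramp above implies a fixed tick budget. A minimal sketch in Python, assuming every virtual player ticks exactly once per `--tick` for the whole step, with no ramp-up or shutdown time (the harness may well behave differently):

```python
from fractions import Fraction

def tick_budget(steps, step_seconds, tick_seconds):
    """Upper bound on player ticks per step, under the idealised assumption
    that each player ticks once per tick interval for the whole step."""
    ticks_per_player = Fraction(step_seconds) / Fraction(tick_seconds)
    return [int(players * ticks_per_player) for players in steps]

# Values from the run configuration: --steps 50,200,500 --step-dur 10m --tick 800ms
per_step = tick_budget([50, 200, 500], step_seconds=600, tick_seconds=Fraction(4, 5))
print(per_step, sum(per_step))  # [37500, 150000, 375000] 562500
```

Under those assumptions the budget is 562 500 player ticks over the 30 minutes; real counts fall below it whenever ticks slip under load.
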
## Findings
### Validated (fixed within R2)
- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend
draft DTO's `rack_order` is a string (the harness sent `[]`). Fixed → `ok`.
- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the
editable-display-name validator (`backend/internal/account/profile.go`) forbids digits
and colons, but the seed marker was `lt:…`. Switched the marker to a distinctive
letters-only string → `ok`. Cleanup still matches it.
### By-design behaviour (correctly exercised, not bugs)
- **`chat_not_your_turn`** — chat is gated to the sender's turn
(`backend/internal/social/chat.go`); off-turn posts are correctly rejected.
- **`nudge_own_turn`** — you nudge the player whose turn it is, so a nudge on your own
turn is correctly rejected. The harness nudges/chats at random ticks, so a share of
these codes is expected.
### Observability gap (key R7 input)
- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory
registers, but per-container init fails — `failed to identify the read-write layer ID
… /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory` — because this
host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's
`/rootfs` bind (the existing galaxy deployment on the same host has the same
limitation). So the **Scrabble — Resources** dashboard's per-container panels are empty
here, and per-container CPU/RSS for this run was captured via `docker stats` instead.
Postgres internals (`postgres_exporter`) and per-service Go runtime metrics
(`go_*` by `service_name`) work. **Recommendation for R7:** adopt the otelcol
**`docker_stats`** receiver (the contour already runs the contrib image, which ships
it). It reads per-container stats via the docker API with no cgroup dependency.
Alternatively, or in addition, run the final pass on hardware where cAdvisor resolves
containers. (Decision to confirm with the owner.)
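
Until per-container metrics reach Prometheus, the `docker stats` fallback can be post-processed offline. A minimal Python sketch, assuming the samples were saved one JSON object per line with `docker stats --no-stream --format '{{json .}}'` (the report does not say how its samples were stored; the sample lines below are illustrative, not captured output):

```python
import json

def parse_sample(line):
    """Parse one JSON line from `docker stats --format '{{json .}}'`.
    Returns (container name, CPU percent as float, memory-used string)."""
    rec = json.loads(line)
    cpu = float(rec["CPUPerc"].rstrip("%"))
    used = rec["MemUsage"].split("/")[0].strip()
    return rec["Name"], cpu, used

def peak_cpu(lines):
    """Peak CPU percent per container across all sampled lines."""
    peaks = {}
    for line in lines:
        name, cpu, _ = parse_sample(line)
        peaks[name] = max(cpu, peaks.get(name, 0.0))
    return peaks

# Illustrative lines only; field names follow docker's JSON stats format.
samples = [
    '{"Name": "scrabble-backend", "CPUPerc": "71.20%", "MemUsage": "88MiB / 7.6GiB"}',
    '{"Name": "scrabble-backend", "CPUPerc": "99.00%", "MemUsage": "91MiB / 7.6GiB"}',
]
print(peak_cpu(samples))  # {'scrabble-backend': 99.0}
```
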
### Run results
The ramp ran clean to 500 players with no harness crash, no deadlock and
`stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games).
- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at
scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved`
events, **2 798 games finished**.
- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the
200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms).
Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account;
**522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are
cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the
flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).
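
The hammer result (a small burst of `ok`, then near-total rejection) is the shape a token-bucket limiter produces. The report does not describe the gateway's limiter, so the Python sketch below is a generic token bucket with made-up capacity and refill values, not the gateway's implementation:

```python
class TokenBucket:
    """Generic token bucket: up to `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical limits: burst of 10, refill 1 token/s; 1000 calls fired at the same instant.
bucket = TokenBucket(capacity=10, rate=1.0)
results = [bucket.allow(now=0.0) for _ in range(1000)]
print(results.count(True), results.count(False))  # 10 990
```

In a sustained flood the accepted calls are bounded by the burst plus the refill over the flood's duration, so the accepted share stays tiny however many requests arrive.
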
**Top finding — `transport_error` under saturation.** At 500 players ~14 % of
`game.state` calls (72 429 / 519 067) and a few % of the other ops returned a Connect
`transport_error` (not a domain code). It correlates with the CPU saturation below: the
backend/gateway are pinned near one core each while the host also runs the 86 %-core
harness, so the edge sheds load (resets/timeouts) at the knee. It is **amplified by a
harness artifact** — all 500 virtual players multiplex over a *single* shared
`http2.Transport`, so 500 persistent `Subscribe` streams plus Execute calls press on one
HTTP/2 connection's concurrent-stream limit; real clients each use their own connection.
**Actions:** R7 harness — give each player (or a pool) its own transport, and run on
hardware not shared with the contour; R3 — confirm the gateway's h2c
`MaxConcurrentStreams` and edge timeouts are sized for many persistent streams.
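
To see why a shared transport presses on the stream limit, count the long-lived streams per connection. A rough Python sketch, assuming players are spread round-robin over a pool of connections and each holds exactly one `Subscribe` stream; the limit of 250 is only a placeholder for the gateway's real `MaxConcurrentStreams`, which R3 still has to confirm:

```python
def streams_per_connection(players, pool_size):
    """Persistent streams on each connection when players are assigned round-robin."""
    counts = [0] * pool_size
    for player in range(players):
        counts[player % pool_size] += 1
    return counts

def headroom(players, pool_size, max_streams):
    """Streams left for short Execute calls on the busiest connection
    (a negative value means the persistent streams alone exceed the limit)."""
    return max_streams - max(streams_per_connection(players, pool_size))

ASSUMED_MAX_STREAMS = 250  # placeholder, not the gateway's confirmed setting

for pool in (1, 4, 500):
    print(pool, headroom(500, pool, ASSUMED_MAX_STREAMS))
```

With these placeholder numbers a single connection is over the limit before any Execute call is made, while even a small pool leaves room on every connection.
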
**Minor findings:**
- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %) — transient
session-resolve failures under load; worth a glance in R3 but not material.
- one `internal` on `game.pass` (1 / 4 788).
- `game_finished` dominates `chat.nudge`/`chat.post` (≈ 3 900 each): the harness keeps
issuing secondary ops on games that have already ended. Harness refinement for R7: drop
finished games from the rotation.
- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate
gates, correctly exercised.
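
The shares quoted in this section follow directly from the raw counts. A small Python check, using only figures from the report:

```python
def share_pct(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100.0 * count / total

counts = {
    "game.state transport_error": (72_429, 519_067),
    "game.state unauthenticated": (188, 519_067),
    "game.pass internal": (1, 4_788),
    "games.list rate_limited": (522_486, 522_667),
}
for name, (count, total) in counts.items():
    print(f"{name}: {share_pct(count, total):.2f} %")
```
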
## Resource baseline
Per-container peak during step 3 (500 players), from `docker stats`:
| container | peak CPU | memory |
|-----------|---------:|-------:|
| scrabble-backend | **99 %** (~1 core) | 91 MiB |
| scrabble-gateway | **93 %** | 76 MiB |
| scrabble-postgres | **90 %** | 69 MiB |
| scrabble-loadtest (harness) | **86 %** | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |
- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres
each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at
this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for
R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the
step-3 latency and `transport_error` figures are pessimistic — the contour competed
with the generator for CPU. A clean ceiling needs separate hardware (R7).
- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache
hit ratio** (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here.
- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams +
per-request goroutines), telegram 49 — all stable, no leak across the ramp.
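
The "~3 cores" figure is the sum of the three saturated services, and the harness adds most of a fourth core on the same host. A quick Python check from the table values (the peaks may not have coincided in time, so each sum is an upper estimate):

```python
peak_cpu_pct = {
    "scrabble-backend": 99,
    "scrabble-gateway": 93,
    "scrabble-postgres": 90,
    "scrabble-loadtest": 86,
}

def cores(names):
    """Sum of peak CPU for the named containers, in cores (100 % = 1 core)."""
    return sum(peak_cpu_pct[n] for n in names) / 100

system = cores(["scrabble-backend", "scrabble-gateway", "scrabble-postgres"])
with_harness = cores(peak_cpu_pct)
print(f"system ≈ {system:.2f} cores, with harness ≈ {with_harness:.2f} cores")
```
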
## Recommendations feeding later phases
- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms) — add
the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on
`game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge
read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the
~0.04 % transient `unauthenticated` resolves under load.
- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload
fixes; the run surfaced no deadlock or goroutine leak across the ramp.
- **R7 (final tuning + stress):** (1) fix the per-container observability gap — adopt the
otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the contour;
(2) refine the harness — per-player/pooled transports and dropping finished games from
the rotation — and run on hardware **not** shared with the contour; (3) size pools /
GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway,
Postgres at 500 players).
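
The harness refinement of dropping finished games could look like the following Python sketch. The harness source is not shown in this report, so the `Rotation` class and its method names are hypothetical:

```python
import random

class Rotation:
    """Game IDs a virtual player may pick for secondary ops (nudge, chat)."""

    def __init__(self, game_ids):
        self.active = list(game_ids)

    def mark_finished(self, game_id):
        # Call when an op returns `game_finished` or a finish event arrives.
        if game_id in self.active:
            self.active.remove(game_id)

    def pick(self, rng=random):
        # Returns None once every game has ended, so the caller can skip the op.
        return rng.choice(self.active) if self.active else None

rot = Rotation(["g1", "g2", "g3"])
rot.mark_finished("g2")
print(rot.active)  # ['g1', 'g3']
```
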
## Re-running
See [`README.md`](README.md). Briefly, from the repo root:
```sh
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
-e POSTGRES_PASSWORD=… scrabble-loadtest run # add --reset on a re-run
```
The harness stays in the repo for the R7 repeat.