scrabble-game/loadtest/REPORT-R2.md

# R2 — early stress-run trip report

This is the early stress pass for R2 of `PRERELEASE.md`. It exercises the system through the
**edge protocol** with the `scrabble/loadtest` harness to surface logic and concurrency
bugs and to capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and
R7 (final tuning). Pass bar: **diagnostic**. The run "passes" by completing without the
harness crashing; findings are recorded below, not gated.

## Method

- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
  `scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081`
  directly, bypassing the host→gateway hairpin).
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
  directly to Postgres (token hash matches `backend/internal/session`), so the driver
  authenticates without the per-IP-limited auth ops.
- **Games:** assembled through the real **invitation** flow (`invitation.create` →
  `invitation.accept`), 2–4 players each, no robots; variants spread over
  scrabble_en / scrabble_ru / erudit_ru.
- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls
  `game.state`, replays `game.history` and either submits a **mid-ranked** legal move
  generated locally by the embedded `scrabble-solver` (the edge carries no board) or
  passes/exchanges; a fraction of players also exercise nudge / chat / check-word /
  draft / profile / stats. A separate **gateway-hammer** floods `games.list` from one
  account.
- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the
  agreed moderate profile; harness and contour share this host's CPU).
- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container
  CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and
  per-service Go runtime metrics.
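
The "mid-ranked" move choice in the **Play** step can be sketched as follows. This is a minimal illustration, not the harness code: the `Move` type and `pickMidRanked` are hypothetical names, and taking the median by score is an assumption about what "mid-ranked" means.

```go
package main

import (
	"fmt"
	"sort"
)

// Move is a hypothetical stand-in for whatever the embedded solver returns.
type Move struct {
	Word  string
	Score int
}

// pickMidRanked returns the median-scored move, or false when there are no
// legal moves (the caller would then pass or exchange).
func pickMidRanked(moves []Move) (Move, bool) {
	if len(moves) == 0 {
		return Move{}, false
	}
	// Copy first so the caller's slice keeps its order.
	ranked := append([]Move(nil), moves...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return ranked[i].Score > ranked[j].Score
	})
	return ranked[len(ranked)/2], true
}

func main() {
	// Made-up candidate moves, for illustration only.
	moves := []Move{{"QUIZ", 44}, {"ZIT", 12}, {"IT", 2}, {"QI", 11}, {"QUIT", 26}}
	if m, ok := pickMidRanked(moves); ok {
		fmt.Println(m.Word, m.Score)
	}
}
```

Picking from the middle of the ranking keeps games moving at a realistic pace without always playing the solver's best move.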

## Run configuration

```sh
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
             --tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
```

Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2
exporters. Seeded population removed by `--cleanup` afterwards.

## Findings

### Validated (fixed within R2)
- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend
  draft DTO's `rack_order` is a string, but the harness sent an array (`[]`). With a
  string payload the call returns `ok`.
- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the
  editable-display-name validator (`backend/internal/account/profile.go`) forbids digits
  and colons, but the seed marker was `lt:…`. With the marker switched to a distinctive
  letters-only string the call returns `ok`, and cleanup still matches it.
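
Both harness fixes come down to simple contract mismatches, which the sketch below reproduces in isolation. It is an approximation under stated assumptions: `draftDTO` keeps only the `rack_order` field named above, and `validDisplayName` encodes only the two rules the report mentions (no digits, no colons); the real DTO and validator may check more.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode"
)

// draftDTO is a hypothetical reduction of the backend draft DTO; only the
// rack_order field (a string) is taken from the report.
type draftDTO struct {
	RackOrder string `json:"rack_order"`
}

// decodeDraft returns the decoding error, if any, for a JSON payload.
func decodeDraft(payload string) error {
	var d draftDTO
	return json.Unmarshal([]byte(payload), &d)
}

// validDisplayName is an assumed approximation of the validator: it rejects
// any digit or colon and accepts everything else.
func validDisplayName(name string) bool {
	for _, r := range name {
		if unicode.IsDigit(r) || r == ':' {
			return false
		}
	}
	return true
}

func main() {
	// A JSON array cannot be decoded into a Go string field.
	fmt.Println("array payload error:", decodeDraft(`{"rack_order":[]}`))
	fmt.Println("string payload error:", decodeDraft(`{"rack_order":""}`))

	// A marker with a colon fails; a letters-only marker passes.
	fmt.Println("lt:seed valid:", validDisplayName("lt:seed"))
	fmt.Println("loadtestmarker valid:", validDisplayName("loadtestmarker"))
}
```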

### By-design behaviour (correctly exercised, not bugs)
- **`chat_not_your_turn`** — chat is gated to the sender's turn
  (`backend/internal/social/chat.go`); off-turn posts are correctly rejected.
- **`nudge_own_turn`** — you nudge the player whose turn it is, so a nudge on your own
  turn is correctly rejected. The harness nudges/chats at random ticks, so a share of
  these codes is expected.
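
The two turn gates are mirror images of each other, as the following sketch shows. The result codes are quoted from the report; the function names and the seat-index model are hypothetical, and the real handlers apply further checks (for example `nudge_too_soon` and `game_finished`) that are left out here.

```go
package main

import "fmt"

// Result codes as quoted in the report; everything else here is hypothetical.
const (
	codeOK              = "ok"
	codeChatNotYourTurn = "chat_not_your_turn"
	codeNudgeOwnTurn    = "nudge_own_turn"
)

// checkChat allows a chat post only from the seat whose turn it is.
func checkChat(sender, toMove int) string {
	if sender != toMove {
		return codeChatNotYourTurn
	}
	return codeOK
}

// checkNudge allows a nudge only when it is someone else's turn.
func checkNudge(sender, toMove int) string {
	if sender == toMove {
		return codeNudgeOwnTurn
	}
	return codeOK
}

func main() {
	const toMove = 1 // seat 1 is on turn in this example
	for seat := 0; seat < 3; seat++ {
		fmt.Printf("seat %d: chat=%s nudge=%s\n",
			seat, checkChat(seat, toMove), checkNudge(seat, toMove))
	}
}
```

Because a given seat is always rejected by exactly one of the two gates, a harness that fires both ops at random ticks will always collect some of each code.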

### Observability gap (key R7 input)
- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory
  registers, but per-container init fails — `failed to identify the read-write layer ID
  … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory` — because this
  host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's
  `/rootfs` bind (the existing galaxy deployment on the same host has the same
  limitation). So the **Scrabble — Resources** dashboard's per-container panels are empty
  here, and per-container CPU/RSS for this run was captured via `docker stats` instead.
  Postgres internals (`postgres_exporter`) and per-service Go runtime metrics
  (`go_*` by `service_name`) work. **Recommendation for R7:** adopt the otelcol
  **`docker_stats`** receiver, which reads per-container stats via the docker API with no
  cgroup dependency (the collector already runs the contrib image), and/or run the final
  pass on hardware where cAdvisor resolves containers. (Decision to confirm with the owner.)
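
For reference, the per-container CPU figure that the docker API route yields can be derived from one stats sample as sketched below. The field names follow the Docker Engine API stats response and the formula is the one given in the API documentation; the sample values are synthetic, and `parseSample` and `cpuPercent` are illustrative names rather than harness functions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cpuStats mirrors the subset of the stats JSON needed for a CPU percentage.
type cpuStats struct {
	CPUUsage struct {
		TotalUsage uint64 `json:"total_usage"`
	} `json:"cpu_usage"`
	SystemCPUUsage uint64 `json:"system_cpu_usage"`
	OnlineCPUs     uint64 `json:"online_cpus"`
}

// statsSample holds the current and previous CPU counters plus memory usage.
type statsSample struct {
	CPU    cpuStats `json:"cpu_stats"`
	PreCPU cpuStats `json:"precpu_stats"`
	Memory struct {
		Usage uint64 `json:"usage"` // bytes; the docker CLI also subtracts page cache
	} `json:"memory_stats"`
}

// Synthetic sample, not taken from the run.
const sampleJSON = `{
  "cpu_stats":    {"cpu_usage": {"total_usage": 1500000000}, "system_cpu_usage": 8000000000, "online_cpus": 4},
  "precpu_stats": {"cpu_usage": {"total_usage": 1000000000}, "system_cpu_usage": 4000000000, "online_cpus": 4},
  "memory_stats": {"usage": 67108864}
}`

func parseSample(raw string) (statsSample, error) {
	var s statsSample
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// cpuPercent computes (cpu delta / system delta) * online CPUs * 100, so
// 100 % corresponds to one fully used core.
func cpuPercent(s statsSample) float64 {
	if s.CPU.SystemCPUUsage <= s.PreCPU.SystemCPUUsage ||
		s.CPU.CPUUsage.TotalUsage < s.PreCPU.CPUUsage.TotalUsage {
		return 0 // first sample or counter reset: no usable delta
	}
	cpuDelta := float64(s.CPU.CPUUsage.TotalUsage - s.PreCPU.CPUUsage.TotalUsage)
	sysDelta := float64(s.CPU.SystemCPUUsage - s.PreCPU.SystemCPUUsage)
	return cpuDelta / sysDelta * float64(s.CPU.OnlineCPUs) * 100
}

func main() {
	s, err := parseSample(sampleJSON)
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu=%.1f%% mem=%d MiB\n", cpuPercent(s), s.Memory.Usage>>20)
}
```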

### Run results

The ramp ran clean to 500 players with no harness crash, no deadlock and
`stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games).

- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at
  scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved`
  events, **2 798 games finished**.
- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the
  200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms).
  Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account;
  **522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are
  cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the
  flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).
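
The shape of the hammer result (a small number of `ok` responses, then nothing but `rate_limited`) is what a burst-then-refill limiter produces. The report does not say which algorithm the gateway uses, so the token bucket below is an assumption chosen for illustration, with made-up burst and rate values; it is single-goroutine code, and a real per-user limiter would add locking and a per-account map.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// bucket is a minimal token bucket: it starts full, refills at a fixed rate
// and never holds more than burst tokens.
type bucket struct {
	burst  float64
	rate   float64 // tokens added per second
	tokens float64
	last   time.Time
}

func newBucket(burst int, perSecond float64, now time.Time) *bucket {
	return &bucket{burst: float64(burst), rate: perSecond, tokens: float64(burst), last: now}
}

// allow refills for the time elapsed since the last call, then spends one
// token if one is available.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rate)
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// flood fires n back-to-back calls at one instant and counts the outcomes.
func flood(b *bucket, n int, at time.Time) (ok, limited int) {
	for i := 0; i < n; i++ {
		if b.allow(at) {
			ok++
		} else {
			limited++
		}
	}
	return ok, limited
}

// demo floods twice, two seconds apart, and returns the ok counts.
func demo() (first, second int) {
	start := time.Unix(0, 0)
	b := newBucket(5, 1, start) // illustrative values: burst 5, refill 1 per second
	first, _ = flood(b, 1000, start)
	second, _ = flood(b, 1000, start.Add(2*time.Second))
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println("ok in first flood:", first)
	fmt.Println("ok in second flood:", second)
}
```

Rejection stays cheap in this design because a rejected call only does a time subtraction and a comparison.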

**Top finding: `transport_error` under saturation.** At 500 players, ~14 % of
`game.state` calls (72 429 / 519 067) and a few percent of the other ops returned a Connect
`transport_error` (a transport-level failure, not a domain code). It correlates with the
CPU saturation below: the backend and gateway are each pinned near one core while the same
host also runs the harness at 86 % of a core, so the edge sheds load (resets/timeouts) at
the knee. It is **amplified by a harness artifact**: all 500 virtual players multiplex over
a *single* shared `http2.Transport`, so 500 persistent `Subscribe` streams plus the Execute
calls press on one HTTP/2 connection's concurrent-stream limit, whereas real clients each
use their own connection.
**Actions:** R7 harness: give each player (or a pool) its own transport, and run on
hardware not shared with the contour. R3: confirm the gateway's h2c
`MaxConcurrentStreams` and edge timeouts are sized for many persistent streams.
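
One way to spread virtual players over several connections is a small pool of clients, each with its own transport, sketched below. All names (`clientPool`, `newClientPool`, `forPlayer`, `defaultTransport`) are hypothetical, and the plain `http.Transport` is only a placeholder: the harness would pass a factory that returns its own h2c-capable transport.

```go
package main

import (
	"fmt"
	"net/http"
)

// clientPool shards virtual players across a fixed number of HTTP clients,
// each backed by its own RoundTripper and therefore its own connections.
type clientPool struct {
	clients []*http.Client
}

// newClientPool builds size clients. newTransport is called once per client
// so that no two clients share a transport.
func newClientPool(size int, newTransport func() http.RoundTripper) *clientPool {
	if size < 1 {
		size = 1
	}
	p := &clientPool{clients: make([]*http.Client, size)}
	for i := range p.clients {
		p.clients[i] = &http.Client{Transport: newTransport()}
	}
	return p
}

// forPlayer maps a non-negative player index to its client.
func (p *clientPool) forPlayer(i int) *http.Client {
	return p.clients[i%len(p.clients)]
}

// defaultTransport is a placeholder factory for this sketch only.
func defaultTransport() http.RoundTripper {
	return &http.Transport{}
}

func main() {
	const players, poolSize = 500, 50
	pool := newClientPool(poolSize, defaultTransport)

	// Count how many players land on each client.
	perClient := make(map[*http.Client]int)
	for i := 0; i < players; i++ {
		perClient[pool.forPlayer(i)]++
	}
	fmt.Println("clients:", len(perClient), "players on first client:", perClient[pool.forPlayer(0)])
}
```

A pool size equal to the player count gives every player its own connection; a smaller pool trades connection count against streams per connection.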

**Minor findings:**
- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %) — transient
  session-resolve failures under load; worth a glance in R3 but not material.
- one `internal` on `game.pass` (1 / 4 788).
- `game_finished` is the dominant result for `chat.nudge`/`chat.post` (≈ 3 900 each): the
  harness keeps issuing secondary ops against games that have already ended. Harness
  refinement for R7: drop finished games from the rotation.
- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate
  gates, correctly exercised.

## Resource baseline

Per-container peak during step 3 (500 players), from `docker stats`:

| container | peak CPU | memory |
|-----------|---------:|-------:|
| scrabble-backend | **99 %** (~1 core) | 91 MiB |
| scrabble-gateway | **93 %** | 76 MiB |
| scrabble-postgres | **90 %** | 69 MiB |
| scrabble-loadtest (harness) | **86 %** | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |

- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres
  each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at
  this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for
  R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the
  step-3 latency and `transport_error` figures are pessimistic — the contour competed
  with the generator for CPU. A clean ceiling needs separate hardware (R7).
- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache
  hit ratio** (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here.
- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams +
  per-request goroutines), telegram 49 — all stable, no leak across the ramp.

## Recommendations feeding later phases
- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms) — add
  the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on
  `game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge
  read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the
  ~0.04 % transient `unauthenticated` resolves under load.
- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload
  fixes; the run surfaced no deadlock or goroutine leak across the ramp.
- **R7 (final tuning + stress):** (1) fix the per-container observability gap — adopt the
  otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the contour;
  (2) refine the harness — per-player/pooled transports and dropping finished games from
  the rotation — and run on hardware **not** shared with the contour; (3) size pools /
  GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway,
  Postgres at 500 players).

## Re-running

See [`README.md`](README.md). Briefly, from the repo root:

```sh
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest run    # add --reset on a re-run
```

The harness stays in the repo for the R7 repeat.