Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held under the gateway-hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation (backend/gateway/Postgres each ~1 core), amplified by the harness's single shared http2.Transport (the harness itself peaked at 86 % of a core on the same host). Observability finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7 should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md; PRERELEASE refinements logged; R2 marked done.
# R2: early stress-run trip report

The early stress pass for `PRERELEASE.md` R2. It exercises the system through the
**edge protocol** with the `scrabble/loadtest` harness, to surface logic/concurrency
bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and
R7 (final tuning). Pass bar: **diagnostic**. The run "passes" by completing without the
harness crashing; findings are recorded below, not gated.

## Method

- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
  `scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081`
  directly, bypassing the host→gateway hairpin).
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
  directly to Postgres (token hash matches `backend/internal/session`), so the driver
  authenticates without the per-IP-limited auth ops.
- **Games:** assembled through the real **invitation** flow (`invitation.create` →
  `invitation.accept`), 2–4 players each, no robots; variants spread over
  scrabble_en / scrabble_ru / erudit_ru.
- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls
  `game.state`, replays `game.history` and either submits a **mid-ranked** legal move
  generated locally by the embedded `scrabble-solver` (the edge carries no board) or
  passes/exchanges. A fraction of the players also exercise nudge / chat / check-word /
  draft / profile / stats. A separate **gateway-hammer** floods `games.list` from one
  account.
- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the
  agreed moderate profile; harness and contour share this host's CPU).
- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container
  CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and
  per-service Go runtime metrics.

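
The resource-capture step can be sketched as below. This is a minimal sketch, assuming the sampler shells out to `docker stats --no-stream --format '{{json .}}'` and parses each JSON line; the report only says `docker stats` via the docker API, so the CLI route, the `sample` type and `parseStatsLine` are hypothetical names, and the input line in `main` is illustrative rather than captured output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// sample is one container's reading from a single `docker stats` JSON line.
type sample struct {
	Name     string
	CPUPct   float64 // percent of one core, e.g. 99.0
	MemUsage string  // kept as reported, e.g. "91MiB / 7.6GiB"
}

// parseStatsLine decodes one line produced by
// `docker stats --no-stream --format '{{json .}}'`.
func parseStatsLine(line string) (sample, error) {
	var raw struct {
		Name     string `json:"Name"`
		CPUPerc  string `json:"CPUPerc"`
		MemUsage string `json:"MemUsage"`
	}
	if err := json.Unmarshal([]byte(line), &raw); err != nil {
		return sample{}, err
	}
	// CPUPerc arrives as text such as "99.00%"; strip the sign before parsing.
	trimmed := strings.TrimSuffix(strings.TrimSpace(raw.CPUPerc), "%")
	cpu, err := strconv.ParseFloat(trimmed, 64)
	if err != nil {
		return sample{}, fmt.Errorf("bad CPUPerc %q: %w", raw.CPUPerc, err)
	}
	return sample{Name: raw.Name, CPUPct: cpu, MemUsage: raw.MemUsage}, nil
}

func main() {
	// Illustrative input only; the memory limit shown here is made up.
	line := `{"Name":"scrabble-backend","CPUPerc":"99.00%","MemUsage":"91MiB / 7.6GiB"}`
	s, err := parseStatsLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s cpu=%.1f%% mem=%s\n", s.Name, s.CPUPct, s.MemUsage)
}
```

Keeping the per-container peak is then a matter of tracking the maximum `CPUPct` per `Name` across samples.
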
## Run configuration

```sh
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
  --tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
```

Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2
exporters. Seeded population removed by `--cleanup` afterwards.

## Findings

### Validated (fixed within R2)

- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend
  draft DTO's `rack_order` is a string, but the harness sent `[]`. Fixed in the
  harness → `ok`.
- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the
  editable-display-name validator (`backend/internal/account/profile.go`) forbids digits
  and colons, but the seed marker was `lt:…`. Switched the marker to a distinctive
  letters-only string → `ok`. Cleanup still matches it.

### By-design behaviour (correctly exercised, not bugs)

- **`chat_not_your_turn`**: chat is gated to the sender's turn
  (`backend/internal/social/chat.go`); off-turn posts are correctly rejected.
- **`nudge_own_turn`**: you nudge the player whose turn it is, so a nudge on your own
  turn is correctly rejected. The harness nudges/chats at random ticks, so a share of
  these codes is expected.

### Observability gap (key R7 input)

- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory
  registers, but per-container init fails with `failed to identify the read-write layer ID
  … /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory`, because this
  host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's
  `/rootfs` bind (the existing galaxy deployment on the same host has the same
  limitation). As a result the **Scrabble — Resources** dashboard's per-container panels
  are empty here, and per-container CPU/RSS for this run was captured via `docker stats`
  instead. Postgres internals (`postgres_exporter`) and per-service Go runtime metrics
  (`go_*` by `service_name`) are unaffected. **Recommendation for R7:** adopt the
  **`docker_stats`** receiver in otelcol (already the contrib image), which reads
  per-container stats via the docker API with no cgroup dependency, and/or run the final
  pass on hardware where cAdvisor resolves containers. (Decision to confirm with the
  owner.)

### Run results

The ramp ran clean to 500 players with no harness crash, no deadlock and
`stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games).

- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at
  scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved`
  events, **2 798 games finished**.
- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the
  200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms).
  Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account;
  **522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are
  cheap (p99 = 2 ms) and the gateway sustained ~16 k req/s of rejections during the
  flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).

**Top finding: `transport_error` under saturation.** At 500 players ~14 % of
`game.state` calls (72 429 / 519 067) and a few percent of the other ops returned a
Connect `transport_error` (not a domain code). It correlates with the CPU saturation
reported under Resource baseline: the backend and gateway are pinned near one core each
while the host also runs the harness (itself at 86 % of a core), so the edge sheds load
(resets/timeouts) at the knee. It is **amplified by a harness artifact**: all 500 virtual
players multiplex over a *single* shared `http2.Transport`, so 500 persistent `Subscribe`
streams plus Execute calls press on one HTTP/2 connection's concurrent-stream limit; real
clients each use their own connection.
**Actions:** for the R7 harness, give each player (or a pool) its own transport, and run
on hardware not shared with the contour; for R3, confirm the gateway's h2c
`MaxConcurrentStreams` and edge timeouts are sized for many persistent streams.

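
The harness action (per-player or pooled transports) could look roughly like this. It is a sketch, not the harness's code: the harness uses an `http2.Transport`, whereas this example uses the standard library's `http.Transport` as a stand-in so it stays self-contained, and `clientPool`, `newClientPool`, `forPlayer` and the pool size are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// clientPool spreads virtual players over several independent transports so
// that no single connection carries every stream.
type clientPool struct {
	clients []*http.Client
}

// newClientPool builds size clients, each with its own transport and
// therefore its own set of connections.
func newClientPool(size int) *clientPool {
	if size < 1 {
		size = 1
	}
	p := &clientPool{clients: make([]*http.Client, size)}
	for i := range p.clients {
		p.clients[i] = &http.Client{
			// Stand-in: the harness would build its h2c-capable transport here.
			// No client-level Timeout, because Subscribe streams are long-lived.
			Transport: &http.Transport{IdleConnTimeout: 90 * time.Second},
		}
	}
	return p
}

// forPlayer returns the client assigned to a non-negative player index.
func (p *clientPool) forPlayer(player int) *http.Client {
	return p.clients[player%len(p.clients)]
}

func main() {
	pool := newClientPool(50) // hypothetical: 10 players per transport at 500 players
	perClient := make(map[*http.Client]int)
	for player := 0; player < 500; player++ {
		perClient[pool.forPlayer(player)]++
	}
	fmt.Println(len(perClient), perClient[pool.forPlayer(0)]) // prints "50 10"
}
```

A pool size equal to the player count gives one transport per player; a smaller pool trades fidelity to real clients for fewer sockets on the generator host.
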
**Minor findings:**

- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %): transient
  session-resolve failures under load; worth a glance in R3 but not material.
- one `internal` on `game.pass` (1 / 4 788).
- `game_finished` dominates `chat.nudge`/`chat.post` (≈ 3 900 each): the harness keeps
  issuing secondary ops on games that already ended. Harness refinement: drop finished
  games from the rotation (R7).
- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate
  gates, correctly exercised.

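
The "drop finished games from the rotation" refinement can be sketched as a small round-robin structure that forgets a game once an op on it returns `game_finished`. The `rotation` type and its methods are hypothetical, not the harness's actual data structure.

```go
package main

import "fmt"

// rotation holds the game IDs a virtual player still cycles through for
// secondary ops such as nudge and chat.
type rotation struct {
	ids  []string
	next int
}

// pick returns the next active game ID in round-robin order,
// or false when no active games remain.
func (r *rotation) pick() (string, bool) {
	if len(r.ids) == 0 {
		return "", false
	}
	id := r.ids[r.next%len(r.ids)]
	r.next = (r.next + 1) % len(r.ids)
	return id, true
}

// drop removes a game from the rotation; call it when an op on that game
// returns game_finished. Unknown IDs are ignored.
func (r *rotation) drop(id string) {
	for i, g := range r.ids {
		if g != id {
			continue
		}
		r.ids = append(r.ids[:i], r.ids[i+1:]...)
		if r.next > i {
			r.next-- // keep pointing at the same upcoming game
		}
		if len(r.ids) == 0 {
			r.next = 0
		} else {
			r.next %= len(r.ids)
		}
		return
	}
}

func main() {
	r := &rotation{ids: []string{"g1", "g2", "g3"}}
	first, _ := r.pick() // g1
	r.drop(first)        // pretend the op on g1 returned game_finished
	a, _ := r.pick()
	b, _ := r.pick()
	c, _ := r.pick()
	fmt.Println(a, b, c) // prints "g2 g3 g2"
}
```
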
## Resource baseline

Per-container peak during step 3 (500 players), from `docker stats`:

| container | peak CPU (% of one core) | memory |
|-----------|-------------------------:|-------:|
| scrabble-backend | **99 %** (~1 core) | 91 MiB |
| scrabble-gateway | **93 %** | 76 MiB |
| scrabble-postgres | **90 %** | 69 MiB |
| scrabble-loadtest (harness) | **86 %** | 42 MiB |
| scrabble-otelcol | 10 % | 110 MiB |
| scrabble-tempo | 9 % | 446 MiB |
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |

- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres
  each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at
  this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for
  R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the
  step-3 latency and `transport_error` figures are pessimistic: the contour competed
  with the generator for CPU. A clean ceiling needs separate hardware (R7).
- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache
  hit ratio** (no disk reads). The DB was comfortable; CPU, not I/O, is its limit here.
- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams +
  per-request goroutines), telegram 49. All stable, no leak across the ramp.

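
As a worked check of the "~3 cores" figure, the sketch below sums the three saturated services from the table above and derives a rough per-player cost. The per-player number assumes CPU scales linearly with players, which this run does not establish; `peakCPU` and `totalCores` are illustrative names.

```go
package main

import "fmt"

// peakCPU holds the step-3 peak CPU per service, in percent of one core,
// copied from the resource table.
var peakCPU = map[string]float64{
	"scrabble-backend":  99,
	"scrabble-gateway":  93,
	"scrabble-postgres": 90,
}

// totalCores sums per-container CPU percentages and converts them to cores.
func totalCores(pct map[string]float64) float64 {
	sum := 0.0
	for _, p := range pct {
		sum += p
	}
	return sum / 100
}

func main() {
	const players = 500
	cores := totalCores(peakCPU)
	fmt.Printf("contour: %.2f cores at %d players\n", cores, players)
	// Linear-scaling assumption: a planning aid, not a measured figure.
	fmt.Printf("about %.2f millicores per player\n", cores*1000/players)
}
```

This prints 2.82 cores and about 5.64 millicores per player; since the harness shared the host, both are best read as pessimistic starting points for R7 sizing.
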
## Recommendations feeding later phases

- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms); add
  the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on
  `game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge
  read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the
  ~0.04 % transient `unauthenticated` resolves under load.
- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload
  fixes; the run surfaced no deadlock or goroutine leak across the ramp.
- **R7 (final tuning + stress):** (1) fix the per-container observability gap by adopting
  the otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the
  contour; (2) refine the harness (per-player/pooled transports, dropping finished games
  from the rotation) and run on hardware **not** shared with the contour; (3) size pools /
  GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway,
  Postgres at 500 players).

## Re-running

See [`README.md`](README.md). Briefly, from the repo root:

```sh
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \
  -e POSTGRES_PASSWORD=… scrabble-loadtest run   # add --reset on a re-run
```

The harness stays in the repo for the R7 repeat.