R2: early-pass trip report + mark R2 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 57s
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 57s
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held under the gateway-hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation (backend/gateway/Postgres each ~1 core), amplified by the harness's single shared http2.Transport (the harness itself peaked at 86 % of a core on the same host). Observability finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7 should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md; PRERELEASE refinements logged; R2 marked done.
This commit is contained in:
+34
-1
@@ -18,7 +18,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l
|
|||||||
| # | Phase | Raw TODOs | Status |
|
| # | Phase | Raw TODOs | Status |
|
||||||
|---|-------|-----------|--------|
|
|---|-------|-----------|--------|
|
||||||
| R1 | Schema & naming reset | 1 + 10 | **done** |
|
| R1 | Schema & naming reset | 1 + 10 | **done** |
|
||||||
| R2 | Stress harness + contour observability + early run | 9a | todo |
|
| R2 | Stress harness + contour observability + early run | 9a | **done** |
|
||||||
| R3 | Edge hardening | 2 + 8 + 3 | todo |
|
| R3 | Edge hardening | 2 + 8 + 3 | todo |
|
||||||
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
|
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
|
||||||
| R5 | Bundle slimming | 6 | todo |
|
| R5 | Bundle slimming | 6 | todo |
|
||||||
@@ -220,3 +220,36 @@ Then Stage 18.
|
|||||||
contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and
|
contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and
|
||||||
re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`),
|
re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`),
|
||||||
`games`=0 and a clean boot.
|
`games`=0 and a clean boot.
|
||||||
|
|
||||||
|
- **R2** (interview + implementation):
|
||||||
|
- **Locked decisions:** game assembly via **invitations** (real path, no robots; not direct game-row
|
||||||
|
inserts); **moderate** ramp **50 → 200 → 500** at 10 min/step; **diagnostic** pass bar (no SLO gate);
|
||||||
|
run as a **one-shot container on `scrabble-internal`** in this PR.
|
||||||
|
- **Harness** = new `scrabble/loadtest` module (`use ./loadtest` + a `replace scrabble/gateway` for the
|
||||||
|
dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions **directly in
|
||||||
|
Postgres** (token hash mirrors `backend/internal/session`), drives players over the **edge protocol**,
|
||||||
|
generates **mid-ranked legal moves locally** with the embedded `scrabble-solver` by replaying
|
||||||
|
`game.history` (the edge carries no board — mirrors `engine.ReplayBoard` via the public API), and a
|
||||||
|
**gateway-hammer**. Compact CLI (`run` / `cleanup`), distroless Dockerfile (DAWGs baked), Go unit tests.
|
||||||
|
- **Adding the module broke the other images' builds** — backend/gateway/telegram Dockerfiles reduce the
|
||||||
|
workspace but still referenced `./loadtest` (not in their context); each now also
|
||||||
|
`-dropuse=./loadtest` (backend/telegram additionally `-dropreplace` the gateway replace). Caught by the
|
||||||
|
first deploy run; verified by building all four images.
|
||||||
|
- **Harness payload fixes found by the smoke pass:** the draft DTO's `rack_order` is a string (was sent
|
||||||
|
as `[]` → `bad_request`); the display-name validator forbids digits/colons, so the cleanup marker
|
||||||
|
became a letters-only `Zzloadtest` so `profile.update` resends the seeded name. `chat_not_your_turn` /
|
||||||
|
`nudge_own_turn` are **by-design** turn gates, correctly exercised.
|
||||||
|
- **Observability:** added **cAdvisor + postgres_exporter** + the **Scrabble — Resources** dashboard +
|
||||||
|
two Prometheus jobs. **Finding:** cAdvisor yields only the root cgroup on the contour host (separate
|
||||||
|
XFS `/var/lib/docker` breaks its layer-ID resolution — the existing galaxy deploy has the same limit),
|
||||||
|
so per-container CPU/RSS for the early pass was captured via `docker stats`. **R7:** adopt the otelcol
|
||||||
|
`docker_stats` receiver (already the contrib image) for per-container metrics in Grafana.
|
||||||
|
- **Early run (2026-06-09):** ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000
|
||||||
|
accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the
|
||||||
|
hammer (99.97 % rejected, p99 2 ms). **Top finding:** ~14 % `transport_error` on `game.state` at 500
|
||||||
|
players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's
|
||||||
|
single shared `http2.Transport`; the harness itself peaked at 86 % of a core on the same host, so the
|
||||||
|
figures are pessimistic. Full trip report in [`../loadtest/REPORT-R2.md`](../loadtest/REPORT-R2.md);
|
||||||
|
it feeds R3 (h2c `MaxConcurrentStreams`/timeouts, body-size cap), R6 and R7 (per-player transports,
|
||||||
|
separate hardware, pool/limit sizing).
|
||||||
|
- **CI:** `./loadtest/...` added to the path filter + vet/build/test; `go.work.sum` carries the new deps.
|
||||||
|
|||||||
@@ -0,0 +1,162 @@
|
|||||||
|
# R2 — early stress-run trip report
|
||||||
|
|
||||||
|
The early stress pass for `PRERELEASE.md` R2. It exercises the system through the
|
||||||
|
**edge protocol** with the `scrabble/loadtest` harness, to surface logic/concurrency
|
||||||
|
bugs and capture a resource baseline that feeds R3 (edge hardening), R6 (refactor) and
|
||||||
|
R7 (final tuning). Pass bar: **diagnostic** — the run "passes" by completing without the
|
||||||
|
harness crashing; findings are recorded below, not gated.
|
||||||
|
|
||||||
|
## Method
|
||||||
|
|
||||||
|
- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
|
||||||
|
`scrabble-internal` docker network (reaching `postgres:5432` and `gateway:8081`
|
||||||
|
directly, bypassing the host→gateway hairpin).
|
||||||
|
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
|
||||||
|
directly to Postgres (token hash matches `backend/internal/session`), so the driver
|
||||||
|
authenticates without the per-IP-limited auth ops.
|
||||||
|
- **Games:** assembled through the real **invitation** flow (`invitation.create` →
|
||||||
|
`invitation.accept`), 2–4 players each, no robots; variants spread over
|
||||||
|
scrabble_en / scrabble_ru / erudit_ru.
|
||||||
|
- **Play:** each virtual player holds a live `Subscribe` stream and, per tick, polls
|
||||||
|
`game.state`, replays `game.history` and submits a **mid-ranked** legal move generated
|
||||||
|
locally by the embedded `scrabble-solver` (the edge carries no board), or
|
||||||
|
passes/exchanges; a fraction exercise nudge / chat / check-word / draft / profile /
|
||||||
|
stats. A separate **gateway-hammer** floods `games.list` from one account.
|
||||||
|
- **Scale:** moderate ramp **50 → 200 → 500** concurrent players, 10 min/step (the
|
||||||
|
agreed moderate profile; harness and contour share this host's CPU).
|
||||||
|
- **Resource capture:** `docker stats` (docker API) sampled every 28 s for per-container
|
||||||
|
CPU/memory; Prometheus for edge latency/throughput, `postgres_exporter` internals and
|
||||||
|
per-service Go runtime metrics.
|
||||||
|
|
||||||
|
## Run configuration
|
||||||
|
|
||||||
|
```
|
||||||
|
loadtest run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
|
||||||
|
--tick 800ms --hammer-workers 20 --hammer-dur 15s --cleanup
|
||||||
|
```
|
||||||
|
|
||||||
|
Date: 2026-06-09. Contour: the R1-baseline schema, freshly deployed with the R2
|
||||||
|
exporters. Seeded population removed by `--cleanup` afterwards.
|
||||||
|
|
||||||
|
## Findings
|
||||||
|
|
||||||
|
### Validated (fixed within R2)
|
||||||
|
- **Harness draft payload.** `draft.save` first returned `bad_request`: the backend
|
||||||
|
draft DTO's `rack_order` is a string (the harness sent `[]`). Fixed → `ok`.
|
||||||
|
- **Harness profile marker.** `profile.update` first returned `invalid_profile`: the
|
||||||
|
editable-display-name validator (`backend/internal/account/profile.go`) forbids digits
|
||||||
|
and colons, but the seed marker was `lt:…`. Switched the marker to a distinctive
|
||||||
|
letters-only string → `ok`. Cleanup still matches it.
|
||||||
|
|
||||||
|
### By-design behaviour (correctly exercised, not bugs)
|
||||||
|
- **`chat_not_your_turn`** — chat is gated to the sender's turn
|
||||||
|
(`backend/internal/social/chat.go`); off-turn posts are correctly rejected.
|
||||||
|
- **`nudge_own_turn`** — you nudge the player whose turn it is, so a nudge on your own
|
||||||
|
turn is correctly rejected. The harness nudges/chats at random ticks, so a share of
|
||||||
|
these codes is expected.
|
||||||
|
|
||||||
|
### Observability gap (key R7 input)
|
||||||
|
- **cAdvisor yields only the root cgroup on the contour host.** Its docker factory
|
||||||
|
registers, but per-container init fails — `failed to identify the read-write layer ID
|
||||||
|
… /rootfs/var/lib/docker/image/overlayfs/…: no such file or directory` — because this
|
||||||
|
host's `/var/lib/docker` is a **separate XFS mount** not visible under cAdvisor's
|
||||||
|
`/rootfs` bind (the existing galaxy deployment on the same host has the same
|
||||||
|
limitation). So the **Scrabble — Resources** dashboard's per-container panels are empty
|
||||||
|
here, and per-container CPU/RSS for this run was captured via `docker stats` instead.
|
||||||
|
Postgres internals (`postgres_exporter`) and per-service Go runtime metrics
|
||||||
|
(`go_*` by `service_name`) work. **Recommendation for R7:** adopt the otelcol
|
||||||
|
**`docker_stats`** receiver (already the contrib image) — it reads per-container stats
|
||||||
|
via the docker API with no cgroup dependency — and/or run the final pass on hardware
|
||||||
|
where cAdvisor resolves containers. (Decision to confirm with the owner.)
|
||||||
|
|
||||||
|
### Run results
|
||||||
|
|
||||||
|
The ramp ran clean to 500 players with no harness crash, no deadlock and
|
||||||
|
`stream errors: 0`; cleanup removed all 11 000 seeded accounts (and their ~941 games).
|
||||||
|
|
||||||
|
- **Ramp:** step 1 = 50 players / 90 games, step 2 = 200 / 282, step 3 = 500 / 569.
|
||||||
|
- **Volume (30 min):** 1.20 M total edge calls, 659 req/s average. Real gameplay at
|
||||||
|
scale: **48 870 committed plays**, 52 772 `your_turn` + 159 631 `opponent_moved`
|
||||||
|
events, **2 798 games finished**.
|
||||||
|
- **Latency under load (peak, step 3):** `game.state` p50 ≈ 100 ms, p90/p99 in the
|
||||||
|
200–500 ms buckets, max 849 ms; `game.submit_play` similar (p99 ≤ 500 ms, max 490 ms).
|
||||||
|
Lobby ops stayed fast (invitation/games.list p99 ≤ 10 ms).
|
||||||
|
- **Rate limiter holds.** The gateway-hammer sent 522 667 `games.list` from one account;
|
||||||
|
**522 486 (99.97 %) were `rate_limited`**, only 135 `ok` (the burst). Rejections are
|
||||||
|
cheap — p99 = 2 ms — and the gateway sustained ~16 k req/s of rejections during the
|
||||||
|
flood. The per-user limiter behaves as designed (R3 input: the cost is negligible).
|
||||||
|
|
||||||
|
**Top finding — `transport_error` under saturation.** At 500 players ~14 % of
|
||||||
|
`game.state` calls (72 429 / 519 067) and a few % of the other ops returned a Connect
|
||||||
|
`transport_error` (not a domain code). It correlates with the CPU saturation below: the
|
||||||
|
backend/gateway are pinned near one core each while the host also runs the 86 %-core
|
||||||
|
harness, so the edge sheds load (resets/timeouts) at the knee. It is **amplified by a
|
||||||
|
harness artifact** — all 500 virtual players multiplex over a *single* shared
|
||||||
|
`http2.Transport`, so 500 persistent `Subscribe` streams plus Execute calls press on one
|
||||||
|
HTTP/2 connection's concurrent-stream limit; real clients each use their own connection.
|
||||||
|
**Actions:** R7 harness — give each player (or a pool) its own transport, and run on
|
||||||
|
hardware not shared with the contour; R3 — confirm the gateway's h2c
|
||||||
|
`MaxConcurrentStreams` and edge timeouts are sized for many persistent streams.
|
||||||
|
|
||||||
|
**Minor findings:**
|
||||||
|
- `unauthenticated` on a tiny share (188 / 519 067 `game.state`, ~0.04 %) — transient
|
||||||
|
session-resolve failures under load; worth a glance in R3 but not material.
|
||||||
|
- one `internal` on `game.pass` (1 / 4 788).
|
||||||
|
- `game_finished` dominates `chat.nudge`/`chat.post` (≈ 3 900 each): the harness keeps
|
||||||
|
secondary ops on games that already ended. Harness refinement — drop finished games
|
||||||
|
from the rotation (R7).
|
||||||
|
- `nudge_own_turn` / `chat_not_your_turn` / `nudge_too_soon` are the expected turn/rate
|
||||||
|
gates, correctly exercised.
|
||||||
|
|
||||||
|
## Resource baseline
|
||||||
|
|
||||||
|
Per-container peak during step 3 (500 players), from `docker stats`:
|
||||||
|
|
||||||
|
| container | peak CPU | memory |
|
||||||
|
|-----------|---------:|-------:|
|
||||||
|
| scrabble-backend | **99 %** (~1 core) | 91 MiB |
|
||||||
|
| scrabble-gateway | **93 %** | 76 MiB |
|
||||||
|
| scrabble-postgres | **90 %** | 69 MiB |
|
||||||
|
| scrabble-loadtest (harness) | **86 %** | 42 MiB |
|
||||||
|
| scrabble-otelcol | 10 % | 110 MiB |
|
||||||
|
| scrabble-tempo | 9 % | 446 MiB |
|
||||||
|
| prometheus / postgres-exporter | ~0 % | 46 / 16 MiB |
|
||||||
|
|
||||||
|
- **The contour is CPU-bound at 500 concurrent players:** backend, gateway and Postgres
|
||||||
|
each saturate ~1 core (single-instance MVP config), so the system draws ~3 cores at
|
||||||
|
this scale; memory is modest (≤ 100 MiB per Go service). This is the sizing input for
|
||||||
|
R7 (pool sizes, GOMAXPROCS, container limits) and the prod cutover.
|
||||||
|
- **Caveat:** the harness itself peaked at **86 % of a core** on the *same host*, so the
|
||||||
|
step-3 latency and `transport_error` figures are pessimistic — the contour competed
|
||||||
|
with the generator for CPU. A clean ceiling needs separate hardware (R7).
|
||||||
|
- **Postgres:** peak 28 backend connections, ~5 581 commits/s at the peak, **100 % cache
|
||||||
|
hit ratio** (no disk reads) — the DB was comfortable; CPU, not I/O, is its limit here.
|
||||||
|
- **Goroutines:** backend 638, gateway **1 698** (it holds the 500 `Subscribe` streams +
|
||||||
|
per-request goroutines), telegram 49 — all stable, no leak across the ramp.
|
||||||
|
|
||||||
|
## Recommendations feeding later phases
|
||||||
|
- **R3 (edge hardening):** the per-user limiter holds (99.97 % rejected, p99 2 ms) — add
|
||||||
|
the per-IP body-size cap on top. Investigate the **~14 % `transport_error` on
|
||||||
|
`game.state` at 500 players**: confirm the gateway h2c `MaxConcurrentStreams` and edge
|
||||||
|
read/write timeouts are sized for many persistent `Subscribe` streams, and glance at the
|
||||||
|
~0.04 % transient `unauthenticated` resolves under load.
|
||||||
|
- **R6 (refactor):** no logic bug forced a code change beyond the two harness-payload
|
||||||
|
fixes; the run surfaced no deadlock or goroutine leak across the ramp.
|
||||||
|
- **R7 (final tuning + stress):** (1) fix the per-container observability gap — adopt the
|
||||||
|
otelcol `docker_stats` receiver so Grafana shows per-container CPU/RSS on the contour;
|
||||||
|
(2) refine the harness — per-player/pooled transports and dropping finished games from
|
||||||
|
the rotation — and run on hardware **not** shared with the contour; (3) size pools /
|
||||||
|
GOMAXPROCS / container limits from the CPU-bound peak (~1 core each for backend, gateway,
|
||||||
|
Postgres at 500 players).
|
||||||
|
|
||||||
|
## Re-running
|
||||||
|
|
||||||
|
See [`README.md`](README.md). Briefly, from the repo root:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
|
||||||
|
docker run --rm --name scrabble-loadtest --network scrabble-internal \
|
||||||
|
-e POSTGRES_PASSWORD=… scrabble-loadtest run # add --reset on a re-run
|
||||||
|
```
|
||||||
|
|
||||||
|
The harness stays in the repo for the R7 repeat.
|
||||||
Reference in New Issue
Block a user