R2: early-pass trip report + mark R2 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 13s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 57s
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held under the gateway-hammer (99.97 % rejected, p99 2 ms). Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation (backend/gateway/Postgres each ~1 core), amplified by the harness's single shared http2.Transport (the harness itself peaked at 86 % of a core on the same host). Observability finding: cAdvisor yields only the root cgroup on the contour host (separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7 should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md; PRERELEASE refinements logged; R2 marked done.
+34
-1
@@ -18,7 +18,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l
| # | Phase | Raw TODOs | Status |
|---|-------|-----------|--------|
| R1 | Schema & naming reset | 1 + 10 | **done** |
-| R2 | Stress harness + contour observability + early run | 9a | todo |
+| R2 | Stress harness + contour observability + early run | 9a | **done** |
| R3 | Edge hardening | 2 + 8 + 3 | todo |
| R4 | Push enrichment + kill the last poll | 4 + 5 | todo |
| R5 | Bundle slimming | 6 | todo |
@@ -220,3 +220,36 @@ Then Stage 18.
contour `backend` schema was wiped (`DROP SCHEMA backend CASCADE` + restart, not a volume drop) and
re-migrated to the baseline — verified the new variant CHECK (`scrabble_en/scrabble_ru/erudit_ru`),
`games`=0 and a clean boot.
- **R2** (interview + implementation):
- **Locked decisions:** game assembly via **invitations** (real path, no robots; not direct game-row
inserts); **moderate** ramp **50 → 200 → 500** at 10 min/step; **diagnostic** pass bar (no SLO gate);
run as a **one-shot container on `scrabble-internal`** in this PR.
- **Harness** = new `scrabble/loadtest` module (`use ./loadtest` + a `replace scrabble/gateway` for the
dot-free edge-proto import). It seeds 1000 guest + 10000 durable accounts + sessions **directly in
Postgres** (token hash mirrors `backend/internal/session`), drives players over the **edge protocol**,
generates **mid-ranked legal moves locally** with the embedded `scrabble-solver` by replaying
`game.history` (the edge carries no board, so this mirrors `engine.ReplayBoard` via the public API), and runs a
**gateway-hammer**. Compact CLI (`run` / `cleanup`), distroless Dockerfile (DAWGs baked), Go unit tests.
- **Adding the module broke the other images' builds** — backend/gateway/telegram Dockerfiles reduce the
workspace but still referenced `./loadtest` (not in their context); each now also passes
`-dropuse=./loadtest` (backend/telegram additionally pass `-dropreplace` for the gateway replace). Caught by the
first deploy run; verified by building all four images.
- **Harness payload fixes found by the smoke pass:** the draft DTO's `rack_order` is a string (was sent
as `[]` → `bad_request`); the display-name validator forbids digits/colons, so the cleanup marker
became the letters-only `Zzloadtest`, which lets `profile.update` resend the seeded name. `chat_not_your_turn` /
`nudge_own_turn` are **by-design** turn gates, correctly exercised.
- **Observability:** added **cAdvisor + postgres_exporter** + the **Scrabble — Resources** dashboard +
two Prometheus jobs. **Finding:** cAdvisor yields only the root cgroup on the contour host (separate
XFS `/var/lib/docker` breaks its layer-ID resolution — the existing galaxy deploy has the same limit),
so per-container CPU/RSS for the early pass was captured via `docker stats`. **R7:** adopt the otelcol
`docker_stats` receiver (already the contrib image) for per-container metrics in Grafana.
- **Early run (2026-06-09):** ramped clean to 500 players, no crash/deadlock, cleanup removed all 11000
accounts. 1.2 M edge calls, 48 870 plays, 2 798 games finished; the per-user limiter held under the
hammer (99.97 % rejected, p99 2 ms). **Top finding:** ~14 % `transport_error` on `game.state` at 500
players, under CPU saturation (backend/gateway/Postgres each ~1 core) and amplified by the harness's
single shared `http2.Transport`; the harness itself peaked at 86 % of a core on the same host, so the
figures are pessimistic. Full trip report in [`../loadtest/REPORT-R2.md`](../loadtest/REPORT-R2.md);
it feeds R3 (h2c `MaxConcurrentStreams`/timeouts, body-size cap), R6 and R7 (per-player transports,
separate hardware, pool/limit sizing).
- **CI:** `./loadtest/...` added to the path filter + vet/build/test; `go.work.sum` carries the new deps.
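
The `rack_order` payload fix noted above can be illustrated with a minimal Go sketch. The `draftDTO` type and the `decodeDraft` helper are hypothetical stand-ins (the real draft DTO is not shown in this PR); the only facts taken from the notes are the field name `rack_order`, that it is a string, and that sending `[]` was rejected.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// draftDTO is a hypothetical stand-in for the draft payload. Only the
// rack_order field name and its string type come from the notes above.
type draftDTO struct {
	RackOrder string `json:"rack_order"`
}

// decodeDraft mimics a server-side decode of the draft payload.
func decodeDraft(raw string) (draftDTO, error) {
	var d draftDTO
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

func main() {
	// An empty array is what the harness originally sent. Decoding it into
	// a string field fails, which a server could surface as bad_request.
	if _, err := decodeDraft(`{"rack_order":[]}`); err != nil {
		fmt.Println("array rejected:", err)
	}

	// A string value decodes without error.
	d, err := decodeDraft(`{"rack_order":""}`)
	fmt.Printf("rack_order=%q err=%v\n", d.RackOrder, err)
}
```

Typing the field as `string` on the client side as well means the mismatch is caught at compile time in the harness rather than by the smoke pass.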