R7: trip report + docs/tracker bake-back; mark R7 done
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 9s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 37s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 58s

- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource
  profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway
  cores; the burst run showing connection-bound behavior), and the prod-sizing
  recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor),
  the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats
  receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording
  the decisions, findings, applied tuning and validation.

This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
This commit is contained in:
Ilia Denisov
2026-06-11 11:18:57 +02:00
parent f23da88028
commit 2a48df9b83
6 changed files with 257 additions and 21 deletions
+1 -1
View File
@@ -128,7 +128,7 @@ gateway/ ui/ pkg/ # added by their stages
platform/telegram/ # Telegram connector side-service (Stage 9): bot + gRPC API platform/telegram/ # Telegram connector side-service (Stage 9): bot + gRPC API
loadtest/ # module scrabble/loadtest: the pre-release stress harness (R2) loadtest/ # module scrabble/loadtest: the pre-release stress harness (R2)
backend/Dockerfile gateway/Dockerfile platform/telegram/Dockerfile loadtest/Dockerfile # multi-stage distroless (Stage 16; loadtest R2); gateway/Dockerfile also has the `landing` target (R3) backend/Dockerfile gateway/Dockerfile platform/telegram/Dockerfile loadtest/Dockerfile # multi-stage distroless (Stage 16; loadtest R2); gateway/Dockerfile also has the `landing` target (R3)
deploy/ # docker-compose + caddy + landing + otelcol/prometheus/tempo/grafana (+ cAdvisor/postgres_exporter, R2) deploy/ # docker-compose (per-service limits, R7) + caddy + landing + otelcol (OTLP + docker_stats per-container metrics) + prometheus/tempo/grafana + postgres_exporter
``` ```
## Build & test ## Build & test
+35 -2
View File
@@ -23,7 +23,7 @@ the edge before prod. Each phase maps back to the owner's raw pre-release TODO l
| R4 | Push enrichment + kill the last poll | 4 + 5 | **done** | | R4 | Push enrichment + kill the last poll | 4 + 5 | **done** |
| R5 | Bundle slimming | 6 | **done** | | R5 | Bundle slimming | 6 | **done** |
| R6 | Refactor + docs reconciliation + de-staging | 7 | **done** | | R6 | Refactor + docs reconciliation + de-staging | 7 | **done** |
| R7 | Final stress run + tuning | 9b | todo | | R7 | Final stress run + tuning | 9b | **done** |
| → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) | | → | Stage 18 — prod contour deploy | — | see [`PLAN.md`](PLAN.md) |
## Key findings (these reshaped the raw list — read before starting a phase) ## Key findings (these reshaped the raw list — read before starting a phase)
@@ -168,7 +168,7 @@ regression gate. Incorporates the early-run (R2) bug fixes not already shipped.
- Open details: the structural-changes list itself (owner-approved before applying); the test - Open details: the structural-changes list itself (owner-approved before applying); the test
consolidation targets. consolidation targets.
### R7 — Final stress run + tuning *(TODO 9, part 2)* — before Stage 18 ### R7 — Final stress run + tuning *(TODO 9, part 2)* — done
Re-run the R2 harness against the final, refactored system on a clean contour; analyse Re-run the R2 harness against the final, refactored system on a clean contour; analyse
resource consumption across **all** components (gateway, backend, Postgres, the resource consumption across **all** components (gateway, backend, Postgres, the
metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate metrics/observability stack, docker log volume) and agree the tuning (pool sizes, rate
@@ -380,3 +380,36 @@ Then Stage 18.
10 files) into `backend/internal/inttest/helpers.go`; single-file helpers stay local. Pure relocation. 10 files) into `backend/internal/inttest/helpers.go`; single-file helpers stay local. Pure relocation.
- **No schema change → no contour DB wipe.** Regression gate: the full unit + integration + UI suites plus - **No schema change → no contour DB wipe.** Regression gate: the full unit + integration + UI suites plus
the R7 stress run. the R7 stress run.
- **R7** (interview + implementation):
- **Locked decisions:** run the harness **same-host** (one-shot container on `scrabble-internal`, capped
`--cpus=3` so the contour keeps spare cores); **apply container limits + `GOMAXPROCS` now** (not just a
prod recommendation); **replace cAdvisor with the otelcol `docker_stats` receiver** (it resolved only the
root cgroup on this host); keep rate-limit / h2c knobs **compiled-in** (change values only if the data
demands — it did not).
- **Harness refinements (pre-run):** each virtual player builds its **own `edge.Client`** (its own h2c
connection for its Subscribe stream + Execute calls) instead of all players sharing one `http2.Transport`
the R2 `transport_error` artifact; and `playTurn` now reports a **finished** game so the player drops it
from rotation. Effect, measured: `game.state` `transport_error` 14 % (R2) → **2.49 %**; `game_finished` on
chat ≈ 3 900 → **35**.
- **Observability:** added the `docker_stats` receiver to `otelcol` (`api_version: "1.44"` — the daemon's
minimum is 1.40; the receiver defaults to 1.25 and crash-looped until pinned), mounted the docker socket
read-only with `group_add` (the contrib image runs as UID 10001), dropped the cAdvisor service + its
Prometheus job, and retargeted the **Scrabble — Resources** dashboard to the docker_stats metric names
(`container_cpu_utilization`/100 == cores). Cross-checked against `docker stats` within sampling error.
- **Profile (final run, 500 players, limits in force):** the **gateway is the binding constraint** — with
one connection per player it bursts into its 2-core cap (the residual 2.49 % `transport_error`); backend
~0.85 core and postgres ~1.4 cores had headroom; **tempo reached its 1 GiB cap**; the backend pool sat at
its `MaxOpenConns=25` cap (28 backends); docker logs were unbounded (~14 MiB / 30 min on the backend at
info). Full write-up in [`../loadtest/REPORT-R7.md`](../loadtest/REPORT-R7.md).
- **Round-2 tuning (owner-agreed, all in `deploy/docker-compose.yml`, no code change):** gateway **2 → 3
cores + `GOMAXPROCS=3`**; tempo memory **1 → 2 GiB**; backend `MAX_OPEN_CONNS` **25 → 40**; a json-file
**log-rotation** default (10m × 3) applied contour-wide via a YAML anchor (level stays info).
backend/postgres kept at 2 cores / 512 MiB (headroom is cheap on the shared host).
- **Validation:** the same gradual ramp on the tuned contour cut `game.state` `transport_error` to **0.72 %**
(gateway ~2 cores, now under the 3-core cap, no throttle; tempo ~1.27 GiB, under 2 GiB). A separate
**burst** run (a single 100 → 500 jump) pegged the gateway at 3 cores (≈296 % sustained, 9.27 % error),
confirming it is **connection-CPU-bound** — a true arrival spike is a **horizontal-scaling** lever, not
more cores per node (recorded in the prod-sizing recommendation).
- **No schema change → no contour DB wipe.** Bake-back: `loadtest/REPORT-R7.md` (new), `loadtest/README.md`,
`docs/TESTING.md`, the telemetry/observability section of `docs/ARCHITECTURE.md`, the repo-layout line in `CLAUDE.md`.
+7 -5
View File
@@ -561,11 +561,13 @@ promotions) is future work and would deliver short markdown messages (text + lin
metrics + Tempo traces), **Prometheus** (15d), **Tempo** (72h) and **Grafana** metrics + Tempo traces), **Prometheus** (15d), **Tempo** (72h) and **Grafana**
(provisioned datasources + dashboards, behind the caddy `/_gm/grafana` Basic-Auth) (provisioned datasources + dashboards, behind the caddy `/_gm/grafana` Basic-Auth)
are stood up with the deploy (`deploy/`); the default exporter stays are stood up with the deploy (`deploy/`); the default exporter stays
`none`, so CI needs no collector. The contour also runs **cAdvisor** (per-container `none`, so CI needs no collector. The collector also runs a **`docker_stats`**
CPU/memory/network) and **postgres_exporter** (connections, cache-hit ratio, receiver (per-container CPU/memory/network read from the Docker API and exported
transactions, db size), scraped by Prometheus and surfaced on the **Scrabble — through its Prometheus endpoint), and the contour runs **postgres_exporter**
Resources** Grafana dashboard, which captures a resource (connections, cache-hit ratio, transactions, db size, scraped directly by Prometheus);
baseline; these export directly in Prometheus format (not through the collector). both are surfaced on the **Scrabble — Resources** Grafana dashboard, which captures the
stress-run resource profile. (`docker_stats` replaced cAdvisor, which on the contour
host resolved only the root cgroup — a separate-XFS `/var/lib/docker`.)
- Per-request server-side timing via gin middleware from day one (the access log - Per-request server-side timing via gin middleware from day one (the access log
carries method, route, status, latency and the active trace id). A carries method, route, status, latency and the active trace id). A
client-measured RTT piggybacked on the next request is a later enhancement. client-measured RTT piggybacked on the next request is a later enhancement.
+4 -2
View File
@@ -127,8 +127,10 @@ tests or touching CI.
selection, the report); the DAWG-backed move test runs under `BACKEND_DICT_DIR` (as the selection, the report); the DAWG-backed move test runs under `BACKEND_DICT_DIR` (as the
engine tests do). It is **not** part of the per-PR suite's behavioural assertions: it engine tests do). It is **not** part of the per-PR suite's behavioural assertions: it
runs ad hoc as a one-shot container against the contour, producing a trip report (bugs runs ad hoc as a one-shot container against the contour, producing a trip report (bugs
+ a resource baseline) read off the **cAdvisor + postgres_exporter** Grafana dashboard + a per-container resource profile) read off the **otelcol `docker_stats` +
on the contour. See [`../loadtest/README.md`](../loadtest/README.md). postgres_exporter** Grafana dashboard on the contour. Two passes are recorded — the
early [`REPORT-R2.md`](../loadtest/REPORT-R2.md) and the final, tuned
[`REPORT-R7.md`](../loadtest/REPORT-R7.md). See [`../loadtest/README.md`](../loadtest/README.md).
## Principles ## Principles
+25 -11
View File
@@ -36,17 +36,21 @@ container on the contour's docker network (this bypasses the host→gateway hair
# from the repo root # from the repo root
docker build -f loadtest/Dockerfile -t scrabble-loadtest . docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --name scrabble-loadtest --network scrabble-internal \ docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
-e POSTGRES_PASSWORD="$TEST_POSTGRES_PASSWORD" \ -e POSTGRES_PASSWORD="$TEST_POSTGRES_PASSWORD" \
scrabble-loadtest run scrabble-loadtest run
``` ```
Defaults assume the contour service names: `postgres:5432` and `gateway:8081`. The Each virtual player gets its own `edge.Client` (its own h2c connection), mirroring real
DAWGs are baked into the image (`/opt/dawg`, pinned to the dictionary release). Run with clients rather than multiplexing every player over one transport. Defaults assume the
contour service names: `postgres:5432` and `gateway:8081`. The DAWGs are baked into the
image (`/opt/dawg`, pinned to the dictionary release). On a host shared with the contour,
cap the harness (`--cpus=3`) so the contour keeps the spare cores. Run with
`--name scrabble-loadtest` so the harness's own CPU/memory show up as a `scrabble-*` `--name scrabble-loadtest` so the harness's own CPU/memory show up as a `scrabble-*`
series in cAdvisor (keeping it separable from the system under test). Capture the series in the metrics (keeping it separable from the system under test). Capture the
resource baseline from the Grafana **Scrabble — Resources** dashboard resource baseline from the Grafana **Scrabble — Resources** dashboard (the otelcol
(cAdvisor + postgres_exporter) while the run is in progress. `docker_stats` receiver + postgres_exporter), or from `docker stats` directly, while the
run is in progress.
## Commands & flags ## Commands & flags
@@ -80,15 +84,25 @@ DB wipe (`DROP SCHEMA backend CASCADE` + backend restart).
```sh ```sh
go build ./loadtest/... go build ./loadtest/...
go vet ./loadtest/... go vet ./loadtest/...
BACKEND_DICT_DIR=../scrabble-solver/dawg go test -count=1 ./loadtest/... BACKEND_DICT_DIR="$PWD/../scrabble-solver/dawg" go test -count=1 ./loadtest/...
``` ```
The DAWG-backed `moves` test runs only when `BACKEND_DICT_DIR` is set (as the engine The DAWG-backed `moves` test runs only when `BACKEND_DICT_DIR` is set (as the engine
tests use); the pure logic (hashing, board replay, rack build, move selection, report) tests use); the pure logic (hashing, board replay, rack build, move selection, report)
runs unconditionally. runs unconditionally. Use an **absolute** path (here via `$PWD`): `go test ./loadtest/...`
runs each package from its own directory, so a relative `BACKEND_DICT_DIR` would not
resolve.
## Trip reports
The two stress passes are written up in the repo: the early pass in
[`REPORT-R2.md`](REPORT-R2.md) and the final, tuned pass in
[`REPORT-R7.md`](REPORT-R7.md).
## Caveat ## Caveat
The harness shares the host CPU with the contour, so the early-pass resource baseline The harness shares the host CPU with the contour, so its own `scrabble-loadtest`
is read with the harness's own container series in mind; a cleaner number on separate container series is read alongside the system under test; capping it with `--cpus`
hardware is future work. The moderate ramp keeps the generator from being the bottleneck. keeps the contour's quota. Per-player transports (R7) removed the shared-transport
artifact that inflated R2's `transport_error`, so the figures reflect the system. A
fully isolated ceiling on separate hardware remains future work.
+185
View File
@@ -0,0 +1,185 @@
# R7 — final stress-run trip report
The final pre-release stress pass for [`PRERELEASE.md`](../PRERELEASE.md) R7. It re-runs
the R2 harness (`scrabble/loadtest`) against the **final, refactored system** on a
freshly redeployed contour, to confirm the system holds at scale and to settle the
resource sizing (container limits, `GOMAXPROCS`, pools, rate limits, log levels) before
the Stage 18 prod cutover. Pass bar: **diagnostic + a tuning decision** — the run
"passes" by completing cleanly; the per-container resource profile drives the tuning
recorded below. Companion to the early pass, [`REPORT-R2.md`](REPORT-R2.md).
## What changed since the R2 pass
- **Harness — per-player transports.** Each virtual player now owns its `edge.Client`
(its own `http2.Transport` / h2c connection carrying both its `Subscribe` stream and
its `Execute` calls), instead of all players multiplexing over one shared transport.
R2 traced the ~14 % `transport_error` on `game.state` at 500 players to that single
shared connection's stream limit; per-player connections mirror real clients and
remove the artifact, so this pass measures the system, not the harness.
- **Harness — drop finished games.** `playTurn` reports a finished game and the player
drops it from its rotation, so secondary ops stop hitting `game_finished` on ended
games (the other R2 harness finding).
- **Observability — otelcol `docker_stats`.** cAdvisor (which resolves only the root
cgroup on this host — separate-XFS `/var/lib/docker`) is replaced by the otelcol
`docker_stats` receiver, reading per-container CPU/memory/network from the Docker API.
Per-container panels now populate on the contour host. (`api_version` pinned to 1.44;
the daemon's minimum is 1.40.)
- **Contour — container limits + `GOMAXPROCS`.** `deploy.resources.limits` now bound
every service; the Go services pin `GOMAXPROCS` to their CPU limit so the runtime
matches the cgroup quota. Starting values were generous over the R2 peak; this pass
validates them and settles the agreed sizing (below).
## Method
Unchanged from R2 except for the per-player transports and the dropped-finished-games
refinement above:
- **Driver:** the `scrabble/loadtest` module, run as a one-shot container on the
`scrabble-internal` docker network (reaching `postgres:5432` / `gateway:8081`
directly), capped at `--cpus 3` so the contour keeps the host's spare cores.
- **Seed:** 10 000 durable + 1 000 guest accounts with pre-created sessions written
straight to Postgres (token hash matches `backend/internal/session`).
- **Games:** assembled through the real **invitation** flow, 24 players each, no
robots; variants over scrabble_en / scrabble_ru / erudit_ru.
- **Play:** each player holds a live `Subscribe` stream and, per tick, polls
`game.state`, replays `game.history` and submits a **mid-ranked** legal move generated
locally by the embedded `scrabble-solver`, or passes / exchanges; a fraction exercise
nudge / chat / check-word / draft / profile / stats. A separate **gateway-hammer**
floods `games.list` from one account.
- **Scale:** the same moderate ramp **50 → 200 → 500** concurrent players, 10 min/step.
- **Resource capture:** `docker stats` (docker API) sampled every ~20 s for per-container
CPU/memory; the otelcol **`docker_stats`** receiver → Prometheus → the Grafana
**Scrabble — Resources** dashboard for the same per-container series; `postgres_exporter`
internals and per-service Go runtime metrics.
## Run configuration
```
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
-e POSTGRES_PASSWORD=… scrabble-loadtest \
run --durable 10000 --guest 1000 --steps 50,200,500 --step-dur 10m \
--tick 800ms --hammer-workers 20 --hammer-dur 15s --reset --cleanup
```
Date: 2026-06-10. Contour: the R1-baseline schema, freshly redeployed with the R7
container limits / `GOMAXPROCS` (backend/gateway/postgres capped at 2 cores + 512 MiB,
`GOMAXPROCS=2`) and the `docker_stats` observability. Seeded population removed by
`--cleanup` afterwards.
## Findings
The ramp ran clean to 500 players — no harness crash, no deadlock, `stream errors: 0`
and cleanup removed all 11 000 seeded accounts.
- **Volume (1827 s):** 821 680 edge calls (449.7 req/s incl. the hammer). Real gameplay
at scale: **50 916 committed plays**, 4 817 passes, 2 931 games finished; 165 755
`opponent_moved` + 54 864 `your_turn` events.
- **The per-player transport fix worked.** `game.state` returned `transport_error` on
**3 173 / 127 403 = 2.49 %** of calls — down from R2's ~14 % on the same step. Other
ops were lower still (`game.history` 0.43 %, `game.submit_play` 0.28 %). The residual
is the gateway bursting into its 2-core cap (see the profile below), not the harness.
- **Dropping finished games worked.** `game_finished` on `chat.nudge` / `chat.post` fell
to **35 / 36** (R2: ≈ 3 900 each) — secondary ops no longer hammer ended games.
- **The limiter holds.** The gateway-hammer sent 565 152 `games.list`; **564 979
(99.97 %) were `rate_limited`** (154 ok burst, 19 deadline), p99 = 2 ms, ~309 req/s of
rejections sustained — unchanged from R2.
- **Latency (peak):** `game.state` p50 ≈ 100 ms, p99 in the 2000 ms bucket (max 2549 ms);
`game.submit_play` p50 100 / p99 1000 ms bucket. Lobby ops stayed fast
(invitation / games.list p99 ≤ 10 ms). The p99 tail correlates with the gateway
burst-throttling, not the backend (which stayed at ~0.85 core).
## Resource profile
Per-container peak during step 3 (500 players), with the R7 starting limits in force
(backend/gateway/postgres capped at 2 cores / 512 MiB). Two CPU columns: `docker stats`
samples a ~1 s window (catches bursts); the otelcol `docker_stats` receiver averages over
its 30 s collection interval (smooths them) — they agree within sampling error, which
validates the new observability path.
| container | CPU burst (1 s) | CPU sustained (30 s) | CPU cap | mem peak | mem cap |
|-----------|----------------:|---------------------:|--------:|---------:|--------:|
| scrabble-gateway | **217 %** (at cap) | ~145 % | 200 % | 167 MiB | 512 MiB |
| scrabble-postgres | 138 % | ~153 % | 200 % | 117 MiB | 512 MiB |
| scrabble-backend | 85 % | ~89 % | 200 % | 116 MiB | 512 MiB |
| scrabble-tempo | 33 % | — | (none) | **1024 MiB** (at cap) | 1024 MiB |
| scrabble-otelcol | 11 % | — | (none) | 131 MiB | 512 MiB |
| scrabble-loadtest (harness) | 157 % | — | 300 % | 369 MiB | — |
- **The gateway is the binding constraint.** With one h2c connection per player it draws
~1.45 cores sustained and **bursts to its 2-core cap** at 500 players, throttling
briefly — the source of the 2.49 % `transport_error`. R2 saw only ~0.93 core because
all 500 players shared one connection; the +~0.5 core is the realistic per-connection
overhead (500 separate HTTP/2 connections). This is a sizing fact, not a regression.
- **backend is over-provisioned** (~0.85 core vs a 2-core cap); **postgres** (~1.4 cores)
has headroom; both stayed ≤ 120 MiB.
- **tempo reached its 1 GiB memory cap** (R2: 446 MiB) — an OOM risk under sustained
tracing.
- **Postgres backends peaked at 28**, with the backend pool at its `MaxOpenConns=25` cap.
Cache hit stayed ~100 % (no disk reads); CPU, not I/O, is the limit.
- **docker log volume (30 min):** backend 14.2 MiB, gateway 4.6 MiB, postgres 0.04 MiB —
the backend's per-request latency line at info dominates, and json-file logs had no
rotation.
## Tuning applied
Agreed from the profile (all in `deploy/docker-compose.yml`; no code change — the pool
is already env-driven):
| knob | from | to | why |
|------|------|----|-----|
| gateway CPU + `GOMAXPROCS` | 2 cores / 2 | **3 cores / 3** | it bursts into the 2-core cap at 500 players (the 2.49 % `transport_error`); 3 absorbs the bursts |
| tempo memory | 1 GiB | **2 GiB** | it reached the 1 GiB cap (OOM risk) |
| backend `MAX_OPEN_CONNS` | 25 | **40** | the pool sat at its 25-conn cap at peak; headroom trims the p99 tail |
| docker logs | unbounded | **json-file 10m × 3** | bound the ~14 MiB / 30 min backend log; level stays `info` |
Left as-is: backend / postgres at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores — headroom
is cheap on the shared host); the per-user rate limiter and `h2cMaxConcurrentStreams=250`
(per-connection now, ~1 stream each — ample) and cache TTLs (no pressure observed).
### Validation re-run
Re-running the **same gradual ramp** (50 → 200 → 500) on the tuned contour confirms the
fix:
- **`game.state` `transport_error` fell to 0.72 %** (853 / 119 051), down from 2.49 % at
2 cores. The latency tail also improved — p99 in the 1000 ms bucket, max 1220 ms (was
the 2000 ms bucket, max 2549 ms).
- The **gateway peaked at ~2 cores** (≈196 % on the 30 s gauge) — now comfortably **under
the 3-core cap**, so it no longer throttles. backend ~1 core, postgres ~1.3 cores.
- **tempo peaked at ~1.27 GiB** — under the new 2 GiB cap (it would have OOM-ed at 1 GiB).
- Drop-finished still holds (`game_finished` on chat 41/42); the limiter still rejects
99.97 % of the hammer at p99 2 ms; `stream errors: 0`.
A separate **burst stress** (a single 100 → 500 jump — 400 players connecting at once)
**pegged the gateway at 3 cores** (≈296 % sustained) and pushed `game.state`
`transport_error` to 9.27 %. The gateway is **connection-CPU-bound and bursty**: average
load is ~1 core, but a mass-simultaneous connection storm saturates whatever single-node
cap it is given. Real arrivals are gradual (the canonical run), where 3 cores has
headroom; the lever for a true arrival spike is **horizontal scaling**, not more cores per
node — carried into the prod recommendation below.
## Prod-sizing recommendation (Stage 18)
The contour is **CPU-bound and gateway-led** at 500 concurrent players. Carry these to the
prod contour env (the same compose, `PROD_*` values):
- **gateway: ≥ 3 cores** per ~500 concurrent players, `GOMAXPROCS` pinned to the limit —
it scales with the **connection count**, not just the request rate; beyond one node's
worth, scale the gateway **horizontally** rather than vertically.
- **backend: ~12 cores**, pool 40 — comfortable; the work is light per request.
- **postgres: ~2 cores / ≥ 512 MiB** — ~1.4 cores at 500 players, 100 % cache hit.
- **tempo: ≥ 2 GiB**; the Go services run under ~170 MiB (256 MiB would suffice, 512 is
safe); pin `GOMAXPROCS` to each CPU limit; keep json-file rotation.
- Memory is not the constraint anywhere; CPU is.
## Re-running
See [`README.md`](README.md). Briefly, from the repo root:
```sh
docker build -f loadtest/Dockerfile -t scrabble-loadtest .
docker run --rm --cpus=3 --name scrabble-loadtest --network scrabble-internal \
-e POSTGRES_PASSWORD=… scrabble-loadtest run --reset --cleanup
```
The harness stays in the repo for future repeats.