- loadtest/REPORT-R7.md: the final stress-run report — method, the 500-player resource
profile, the agreed tuning, the validation (transport_error 2.49% -> 0.72% at 3 gateway
cores; the burst run showing connection-bound behavior), and the prod-sizing
recommendation for Stage 18.
- loadtest/README.md: per-player transports, --cpus capping, docker_stats (was cAdvisor),
the absolute BACKEND_DICT_DIR for ./loadtest/... , and report links.
- docs/TESTING.md + docs/ARCHITECTURE.md: observability now uses the otelcol docker_stats
receiver (cAdvisor removed); links to both trip reports.
- CLAUDE.md: repo-layout line reflects docker_stats + per-service limits.
- PRERELEASE.md: R7 marked done in the tracker + heading; a Refinements entry recording
the decisions, findings, applied tuning and validation.
This is the final pre-release hardening phase; Stage 18 (prod cutover) is next.
Round-2 tuning, decided from the 500-player resource profile:
- gateway: 2 -> 3 cores + GOMAXPROCS=3. It holds one h2c connection per player, so
at 500 players it burst into the 2-core cap (~2.49% transport_error on game.state);
3 cores absorbs the bursts. The per-connection cost is the realistic prod load.
- tempo: memory 1G -> 2G. It reached the 1 GiB cap during the run (OOM risk).
- backend Postgres pool: MAX_OPEN_CONNS 25 -> 40. The pool sat at its 25-conn cap
(28 backends) at peak; headroom trims the p99 tail. Postgres (2c/512M) handles it.
- docker log volume: a json-file rotation default (10m x 3 = 30 MiB/container) applied
contour-wide via a YAML anchor; the backend logs ~14 MiB / 30 min at info under load
and was previously unbounded. Log level stays info.
backend/postgres stay at 2 cores / 512 MiB (peak ~0.85 / ~1.4 cores — headroom is cheap
on the shared host). A validation re-run confirms the gateway fix before merge.
The receiver defaults to Docker API 1.25, but the contour daemon's minimum is
1.40 (it speaks up to 1.54), so otelcol crash-looped on start with "client
version 1.25 is too old". Pinning api_version to 1.44 (accepted by both the
receiver's bundled client and the daemon) starts the receiver cleanly —
verified by running the image against the host socket ("Everything is ready",
no start error).
Observability: replace cAdvisor (which resolves only the root cgroup on the
contour host — separate-XFS /var/lib/docker) with the otelcol docker_stats
receiver, which reads per-container CPU/memory/network straight from the Docker
API and works the same in prod. The collector joins the host docker group
(DOCKER_GID, default 989) and mounts the socket read-only; its metrics flow out
through the existing prometheus exporter, so the cAdvisor scrape job and the
privileged cAdvisor service are removed. The Resources dashboard panels are
retargeted to the docker_stats metric names (container_name label;
container.cpu.utilization/100 == cores).
Container limits: apply deploy.resources.limits (honoured by Compose v2) across
the contour and pin GOMAXPROCS to the CPU limit on the Go services so the runtime
matches the cgroup quota. Starting values are generous over the R2 peak (~1 core /
<=100 MiB per app service) to avoid skewing or OOM-killing the measurement run;
they are tightened to the agreed prod sizing after the final stress run (R7
Round 2). The privileged VPN sidecar is left unconstrained.
Each virtual player now builds its own edge.Client (its own h2c connection
carrying both the Subscribe stream and the Execute calls), instead of every
player multiplexing over a single shared http2.Transport. The R2 trip report
traced the ~14% transport_error on game.state at 500 players to that single
shared transport; per-player connections mirror real clients and isolate the
artifact. The assembly burst and the gateway-hammer each get their own client.
playTurn now reports when a game has finished so playerLoop drops it from the
rotation (slices.DeleteFunc); once no active game remains the player idles while
still holding its stream. This stops secondary ops from hammering game_finished
on already-ended games (the other R2 harness finding).
Move the cross-file integration fixtures — the service constructors
(newGameService/newSocialService/newRobotService/newMatchmaker), the game-assembly
helpers (newMirror/newGameWithSeats/newDraftGame), account provisioning
(provisionAccount/provisionGuest) and the stats reader — out of the domain test
files (newGameService alone was used by 10 files) into a single
backend/internal/inttest/helpers.go. Helpers used by a single file stay local.
Pure relocation: the helper bodies are unchanged, no test logic changes; the
imports the moves left unused are pruned. go vet -tags=integration is clean.
Extract the FlatBuffers builders for the wire tables shared by the backend push
encoder and the gateway edge transcoder — GameView, MoveRecord, StateView,
AccountRef, Invitation and their nested rows — into a new scrabble/pkg/wire
package. Both callers keep their local builder signatures (no call sites move)
but now map their own source types (the backend's notify.* payloads and the
decoded engine.MoveRecord; the gateway's backendclient.* REST DTOs) to neutral
wire.* structs and delegate the construction to package wire, the single
definition of the nested-table layout.
Behaviour-preserving: the verified-identical field sets mean the wire bytes
decode the same, and the notify + transcode round-trip tests pass unchanged. The
fiddly Start/Add/End + reverse-prepend vector boilerplate now lives once; the two
encode files shrink while pkg/wire carries the shared logic.
These pre-R4 summary scalars on OpponentMovedEvent were redundant with the
move/game delta and read by nobody — the UI codec and mock take only
move/game/bag_len, and the gateway forwards the push payload verbatim. Removed
from scrabble.fbs, the notify emit (notify/events.go) and the round-trip test;
regenerated the FB Go + TS bindings. No prod data, so the wire-slot renumber is
free and there is no DB change.
Pass (a) removed a stale "(reaping abandoned guest rows is deferred — TODO-3)"
note from ARCHITECTURE §3, but guest reaping is implemented (the background
reaper, BACKEND_GUEST_REAP_INTERVAL / BACKEND_GUEST_RETENTION, covered by
inttest). State the current behaviour instead.
A full section-by-section review of ARCHITECTURE / FUNCTIONAL (+_ru) / TESTING /
UI_DESIGN against the code found no other drift — each R-phase baked its own docs,
and FUNCTIONAL/TESTING already describe the reaper correctly.
Analysed the real dist (gzip + sourcemap attribution): the bundle is already minified + tree-shaken and dominated by the Connect/FlatBuffers transport runtime + generated bindings + the Svelte runtime (~2/3 of main), so no in-scope code slimming is warranted. Lazy-loading was rejected (bundle-size.mjs sums every chunk -> zero total-size win, plus +N gateway fetches of latency); i18n lazy-load and chunk-collapsing likewise (caching/HTTP2).
Instead bundle-size.mjs now measures per HTML entry with three independent gates (app entry <=100 KB, Svelte+i18n shared <=30 KB, landing-own <=5 KB): the app's real payload is its entry chunk + the shared chunk (~97 KB), never landing.js. Same CLI + exit-code contract, CI step unchanged. Fixed the stale ~82 KB figure in the script and ui/README.md. No app code change.
Enrich the in-app live stream into a delta channel so the UI renders a move from the event without a follow-up game.state, and make the matchmaking poll a stream-down fallback.
- pkg/fbs: trailing fields on opponent_moved (move+game+bag_len), your_turn (move_count), match_found (state), game_over (game), notify (account/invitation/state), MoveResult (rack+bag_len); regenerate Go + TS.
- backend: notify owns the FB encoding (encode.go + payload.go input structs); game/lobby/social map their domain types in. emitMove builds the move delta; game.Service.InitialState feeds match_found/game_started the recipient's initial StateView; friends/invitations notify carry their account/invitation. The move-commit response (submit_play/pass/exchange/resign) returns the actor's refilled rack + bag size.
- gateway: MoveResult transcode carries rack+bag_len.
- ui: pure lib/gamedelta.ts reducer advances the per-game cache keyed on move_count (idempotent + gap-safe); app.svelte seeds the cache on match_found/game_started; Game.svelte applies the delta (commit/pass/exchange/resign drop their load()); NewGame polls only while app.streamAlive is false.
- docs: ARCHITECTURE §10, FUNCTIONAL(+ru), backend/gateway/ui READMEs; PRERELEASE R4 marked done + Refinements.
- gateway/Dockerfile gains a `landing` target: caddy:2-alpine + the shared
Vite build (identical build args keep the ui stage a single cached build);
the gateway target drops landing.html from the embed.
- The contour caddy routes /app/, /telegram/ and the Connect path to the
gateway; the catch-all — the landing at / and any stray path — goes to the
new landing service, so junk traffic is absorbed by static file serving.
- deploy/landing/Caddyfile mirrors the webui caching (immutable assets,
no-cache shells) and falls back unknown paths to the landing shell.
- The gateway's / now 308-redirects to /app/ (keeps a local no-caddy run
usable); webui placeholder landing.html removed.
- CI deploy probe checks both / (landing) and /app/ (gateway).
Verified: both images build; the landing container serves landing.html at /
(no-cache) with junk-path fallback; the gateway image redirects / to /app/
and carries no landing content.
- accounts.flagged_high_rate_at baked into the R1 baseline (no prod data; the
contour schema is wiped after merge); jet regenerated — the regen also picks
up the previously missing game_drafts/game_hidden models.
- account.Store: FlagHighRate (set-once), ClearHighRateFlag, the flag in
GetByID/ListUsers and a ListFlaggedHighRate review queue.
- New internal/ratewatch: ingests the gateway rejection reports, keeps a
bounded in-memory episode window for the console and applies the
conservative auto-flag (1000 rejected / 10 min, BACKEND_HIGHRATE_FLAG_*).
- POST /api/v1/internal/ratelimit/report (network-trusted, like
sessions/resolve).
- Admin console: Throttled page (episodes + flagged accounts), a high-rate
badge in the user list, the marker + operator clear action on the user card.
- Tests: ratewatch unit suite, report-route handler test, renderer cases,
integration coverage for the store round-trip and the console flow.
- GATEWAY_MAX_BODY_BYTES (1 MiB): connect WithReadMaxBytes + http.MaxBytesReader
on the public mux; explicit http2.Server MaxConcurrentStreams/IdleTimeout and
an http.Server ReadHeaderTimeout (R2 report follow-up).
- gateway_rate_limited_total{class} counter, Debug per rejection, a rejection
tracker drained every 30 s into a Warn summary per key and a report POST to
/api/v1/internal/ratelimit/report (feeds the admin view + auto-flag).
- The dead AdminPerMinute/AdminBurst policy now guards the /_gm mount (429),
ahead of its Basic-Auth.
- resolve() logs the cause of infra session-resolve failures at Warn (the
transient unauthenticated dips from the R2 run); unknown tokens stay silent.
Ran the moderate early pass (50/200/500, 10 min/step) against the contour: ramped
clean to 500 players, 1.2 M edge calls, 48 870 plays, 2 798 games finished, no
crash/deadlock; cleanup removed all 11 000 seeded accounts. The per-user limiter held
under the gateway-hammer (99.97 % rejected, p99 2 ms).
Top finding: ~14 % transport_error on game.state at 500 players under CPU saturation
(backend/gateway/Postgres each ~1 core), amplified by the harness's single shared
http2.Transport (the harness itself peaked at 86 % of a core on the same host).
Observability finding: cAdvisor yields only the root cgroup on the contour host
(separate XFS /var/lib/docker); per-container metrics captured via docker stats; R7
should adopt the otelcol docker_stats receiver. Full report in loadtest/REPORT-R2.md;
PRERELEASE refinements logged; R2 marked done.
- display-name marker: letters-only 'Zzloadtest' (the editable-name validator
forbids digits/colons), so profile.update resends the seeded name successfully.
- draft.save: rack_order is a string in the backend draft DTO (was sent as []),
fixing the bad_request.
Both confirmed ok against the contour. chat_not_your_turn / nudge_own_turn are
by-design turn gates (backend/internal/social/chat.go), correctly exercised.
Adding the loadtest module to go.work (use ./loadtest + the scrabble/gateway
replace it needs) broke the other services' Docker builds: their reduced
workspace still referenced ./loadtest (not in their build context), failing with
'cannot load module loadtest: open loadtest/go.mod: no such file or directory'.
Each service Dockerfile now also -dropuse=./loadtest; backend and telegram (which
do not COPY ./gateway) additionally -dropreplace the loadtest-only scrabble/gateway
replace. Verified by building all three images plus loadtest locally.
New scrabble/loadtest module (the pre-release stress harness): seeds 1000 guest +
10000 durable accounts with pre-created sessions directly in Postgres (token hash
matches backend/internal/session), drives virtual players through the edge protocol
(real 2-4p games assembled via invitations, mid-ranked legal moves generated locally
by the embedded scrabble-solver — the edge carries no board, so the client replays
history), plus nudge/chat/check-word/draft/profile/stats and a gateway-hammer that
verifies the rate limiter. Prints a trip-report summary (per-op latency percentiles,
result codes, live-event tally). Go unit tests cover the pure pieces; the DAWG-backed
move test runs under BACKEND_DICT_DIR.
Contour: add cAdvisor + postgres_exporter + a 'Scrabble - Resources' Grafana
dashboard and the two Prometheus scrape jobs, for the R2/R7 stress-run resource
baseline.
CI: gate ./loadtest/... (path filter + vet/build/test). Docs: TESTING, ARCHITECTURE,
project CLAUDE repo layout.
Squash the 12 goose migrations into one 00001_baseline.sql (there is no prod
data; verified schema-identical to the chain via a pg_dump diff + the green
integration suite) and rename the game-variant labels
english/russian_scrabble/erudit -> scrabble_en/scrabble_ru/erudit_ru across the
backend, the FlatBuffers wire values and the UI.
dawg filenames and the Go enum identifiers are unchanged; the i18n display keys
are kept. Adds PRERELEASE.md (the R1-R7 pre-release tracker), linked from
CLAUDE.md. Contour DB wipe and the scrabble-dictionary tidy are follow-ups.
Two owner-reported defects from a live contour game.
A. Frequency: the robot's proactive nudge fired hourly for 12h+ (a 12h idle threshold
then the 1h cooldown, uncapped). Replaced with a lengthening, randomized schedule
(proactiveNudgeGap): the first nudge ~60-90 min into the human's turn, each later gap
growing toward 1-6h (uniform sample in [60min, ceil], ceil ramping 90min->6h over 12h
of idle, measured from the previous nudge), so a long wait gets a handful of
increasingly-spaced reminders instead of a stream.
B. Language: out-of-app push routed by the recipient's GLOBAL service_language
(last-login-wins), so after re-logging via the RU bot an English game's nudges came
from the RU bot. Now a game push (your_turn, game_over, nudge, match_found) carries
the game's own language (engine.Variant.Language) on push.Event, and the gateway
routes by it (falling back to service_language for non-game pushes). The New-Game
variant-gating guarantees the game's bot is one the player has started, so delivery is
never blocked.
Tests: proactiveNudgeGap unit + retimed TestRobotProactiveNudge; TestVariantLanguage;
emit your_turn/game_over language; TestNudgeRoutedByGameLanguage integration. Docs:
ARCHITECTURE (§7 nudge, §10/§13 routing), FUNCTIONAL (+ _ru), PLAN tracker.
display_name validation gains a rule: at most 5 special characters — the '.' / '_'
punctuation (spaces, which separate words, don't count) — so a still-well-formed name
can't be mostly punctuation. Mirrored in the Go ValidateDisplayName and the UI
validDisplayName; both unit-tested (5 ok, 6 rejected, 'J. R. R. Tolkien' ok). Docs:
FUNCTIONAL (+ _ru).
A dropped/reset/timed-out connection can surface as a Connect code other than
Unavailable (Canceled/DeadlineExceeded/Unknown/…) which fell through to the generic
'internal' -> a red 'something went wrong' toast appeared alongside the Connecting
spinner. Now toGatewayError (moved to the pure retry.ts, unit-tested) collapses every
transport-level code to 'unavailable' so it is retried + flips offline; and handleError
suppresses the toast for any connection code AND whenever the app is mid-reconnect
(!connection.online), covering the race where a unary error lands before the stream
reports the drop. Genuine server-internal / domain errors still toast while online.
Following the in-game bar, the Connecting indicator now also visually disables the
other proactive (server-sending) controls while offline: chat send + nudge, profile
save / link email|telegram / merge-confirm, friends (redeem, get-code, accept/decline,
unfriend, block, unblock), New Game (auto-match variant + send-invitation) and the
lobby hide ❌. Purely local controls (board/rack/reset, menus, navigation, settings,
copy-code) stay live. Each reads the global connection.online signal; full e2e + check
green.
Connectivity failures become state, not a toast on every attempt. A global online
signal (lib/connection.svelte.ts) flips on a transport unavailable / rate_limited and
on the live stream's drop, driving a pure-CSS header spinner + 'Connecting…' in place
of the title and softly disabling the in-game server actions (commit / exchange / pass
/ hint; local board/rack/reset stay live).
- transport: exec auto-retries with capped exponential backoff — every op on a
rate-limit (rejected before processing, safe), reads only on unavailable (a mutation
is never blindly re-sent, to avoid double-applying one whose response was lost; its
button is disabled while offline so the player re-issues on reconnect). A reachability
watcher (profile.get probe) and any successful traffic clear the signal.
- the old red error.unavailable toast is gone (handleError suppresses connection codes;
the indicator replaces it). A server-data screen still opens with the spinner and
fills on reconnect (global indicator + read auto-retry), so navigation is never dead.
- pure retry policy unit-tested (retry.ts); a mock-only window.__conn hook drives a
Chromium+WebKit e2e (indicator shows offline, the action disables, both clear on
reconnect). Full suite + build green.
- docs: ARCHITECTURE transport note, FUNCTIONAL (+ _ru), PLAN tracker (incl. #1 — the
bot already drains all updates, no change).
Also records #1 as investigated/no-change in PLAN. Other server-action buttons (chat
send, profile save, …) still degrade to a safe no-op offline; visual disable is easy to
extend.
The Telegram 'your turn' notification now names the opponent and recaps their last
move (voiced as the opponent: «{name}: my move — «WORD». Score 120:95» for a scoring
play; a short 'swapped / passed, your turn' otherwise), and a new game-over
notification reports the result + final score when a game ends by any path (closing
play, all-pass, resign, timeout). Scores are recipient-first (the reader's score
leads), 2-4 players (120:95:80).
- schema: YourTurnEvent gains opponent_name/last_action/last_word/score_line
(appended, backward-compatible); new GameOverEvent{result, score_line}. Go + UI
bindings regenerated (flatc 23.5.26 + pnpm codegen).
- backend: notify.YourTurn enriched + notify.GameOver; emitMove resolves the mover's
name and emits per-recipient (your_turn to the next mover, game_over to every seat),
with recipient-first score lines built in one place.
- gateway: game_over joins the out-of-app whitelist (routing.go).
- connector: render builds the enriched your_turn + game_over text per language (en/ru).
- tests: notify round-trip (enriched + game_over), emit (enriched fields + game_over to
all seats / per-seat result), connector render (en/ru), routing; integration replay
(play → your_turn with real name; resign → game_over) green.
- docs: ARCHITECTURE push catalog + out-of-app set, FUNCTIONAL (+ _ru), PLAN tracker.
Owner review: the '>' on an active game row should be a real tap target that opens
the game, like the rest of the row — not inert. The chevron now navigates (kept out
of the tab order / a11y tree since the row's main button already does the same), and
active-row swipes no longer suppress the tap. Adds an e2e for the chevron navigation.
A player can remove a finished game from their own 'my games' list. The action is
per-account, finished-only and irreversible (the game stays for the other players;
there is no un-hide).
- backend: migration 00012 game_hidden(account_id, game_id); store HideGame +
hiddenGameIDs + ListGamesForAccount filtering; service HideGame (seat + finished
checks, reusing ErrNotAPlayer / ErrGameActive); POST /api/v1/user/games/:id/hide.
- gateway: game.hide edge op (reuses GameActionRequest -> Ack) + backendclient.HideGame.
- ui: finished rows reveal a delete via swipe-left (touch) or a kebab tap (desktop),
active rows get an inert chevron for icon alignment; optimistic removal + lobby-cache
sync; mock + transport + client wiring; lobby.hideGame label (en/ru).
- tests: integration (active->ErrGameActive, outsider->ErrNotAPlayer, per-account,
idempotent), gateway transcode round-trip, mock e2e (kebab -> delete); hardened a
pre-existing chat-screen .back transition flake surfaced by the new test's timing.
- docs: ARCHITECTURE persistence list, FUNCTIONAL (+ _ru) lobby story, PLAN tracker.
The 'lobby is back' rule slid the chat/check back-to-the-game forward. Direction is now
computed from route depth (lobby < game < chat/check): shallower = back, deeper = forward.
- Chat and word-check are now routed screens (/game/:id/chat, /game/:id/check) with a
header back to the game and no tab-bar, replacing their modals. The soft keyboard just
resizes the visible viewport (tracked into --vvh, which the Screen height uses since iOS
does not shrink dvh for the keyboard) with the input pinned to the bottom: no modal
relayout, no page jump. Supersedes the earlier bottom-sheet Modal attempt.
- A new chat message raises an unread badge on the in-game hamburger + the Chat menu row
(per game, cleared on opening the chat), mirroring the lobby badge.
- TG native back + the header back chevron return chat/check to their game.
- Exposes --tg-safe-top (device notch) for the finalised TG-fullscreen header.
Tests: e2e for chat/check opening as their own screens + back. Docs: PLAN, FUNCTIONAL(+ru).
The sender name and message body are user-controlled; a leading =, +, -, @, tab or
CR in the CSV export would execute as a formula when a moderator opens it in a
spreadsheet. csvSafe() prefixes such values with a single quote. Unit-tested.
A right-aligned 'Export CSV ↓' link in the filter row downloads /_gm/messages.csv
with the active filters (game / sender / name / ext masks), exporting every matching
message (capped at 100k) regardless of the page window — columns time, source,
sender_id, sender, ip, message, game_id.
- Edge-swipe back now listens at the window in the CAPTURE phase (the board's
pointer handlers can't swallow it) and is no longer skipped inside Telegram
(where the owner tests it).
- TG-fullscreen header: expose the device safe-area top (--tg-safe-top) and
centre the title + menu pair within Telegram's nav band ([safe-top,
content-top]) below the notch, keeping the band's height — lining up with
Telegram's own controls.
- DnD auto-zoom-on-hover delay reduced 1000ms -> 700ms.
(Client-IP: diagnosed as the owner's home-router SNAT — the host caddy already
receives 192.168.0.1 with no XFF, so the real IP is lost upstream of our stack;
correct in prod. No code change.)