R3: dashboards, docs and tracker bake-back
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m7s

- Edge/UX dashboard: aggregate request-rate vs rejection-rate panel
  (gateway_rate_limited_total by class; no per-user labels).
- ARCHITECTURE §2/§11/§12/§13: body cap + explicit h2c sizing, the rate-limit
  observability pipeline and auto-flag policy, the admin-limiter note (and the
  caddy-path gap), the landing container topology; fixed the stale 120/min
  per-user figure.
- FUNCTIONAL (+_ru): the Throttled view and the reversible high-rate flag.
- gateway/backend/deploy READMEs, TESTING.md, root CLAUDE.md updated.
- PRERELEASE.md: R3 interview decisions + implementation refinements logged;
  tracker R3 -> done (this PR implements it; CI gates the merge).
This commit is contained in:
Ilia Denisov
2026-06-10 05:12:17 +02:00
parent f20a4b49ff
commit 7e75c32d07
10 changed files with 144 additions and 29 deletions
+44 -14
View File
@@ -98,6 +98,15 @@ dropped). Horizontal scaling is explicit future work.
response was lost — its button is disabled while offline and the player re-issues it on
reconnect). A reachability watcher (a lightweight `profile.get` probe) clears the signal when no
other traffic is in flight; the live `Subscribe` stream's drop/recovery feeds the same signal.
**Edge hardening (R3):** every request body on the public listener is capped at
`GATEWAY_MAX_BODY_BYTES` (default 1 MiB — far above any legitimate payload), both at the HTTP
layer (`http.MaxBytesReader`) and as the Connect per-message read limit, so an oversized
`Execute` is refused (`resource_exhausted`) without buffering. The h2c server carries explicit
sizing: `MaxConcurrentStreams` 250 (the x/net default made visible — a real client holds one
`Subscribe` stream plus a few unary calls) and a 3-minute connection `IdleTimeout` (a live
`Subscribe` stream keeps its connection active, so only abandoned connections are reaped); the
`http.Server` sets only `ReadHeaderTimeout` (10 s) — Read/WriteTimeout would kill the stream.
R7 revisits the exact values under load.
- **Alphabet on the wire (Stage 13)**: live play exchanges **alphabet indices**, not
concrete letters. The rack (`StateView.rack`), the `SubmitPlay`/`Evaluate` tiles, the
`Exchange` tiles and the `CheckWord` word are `ubyte` indices into the variant's alphabet
@@ -572,6 +581,21 @@ promotions) is future work and would deliver short markdown messages (text + lin
distinct accounts that performed an authenticated edge action in the window. The
gauge is single-process by design (single-instance MVP, §10): it is correct for one
gateway, resets on restart, and is a live operational figure, not a billing count.
- **Rate-limit observability (R3):** every limiter rejection increments the gateway
counter `gateway_rate_limited_total` (`class` = user/public/email/admin — aggregate
only, honouring the no-per-user-label discipline above) and logs one **Debug** line;
a gateway reporter drains the per-key rejection tracker every 30 s, emits one **Warn**
summary per throttled key and posts the report to the backend
(`POST /api/v1/internal/ratelimit/report`, network-trusted like `sessions/resolve`).
The backend's `ratewatch` keeps a bounded in-memory episode window (single-instance,
resets on restart, like `active_users`) surfaced on the admin console's **Throttled**
page next to the flagged-account review queue, and applies the **conservative
auto-flag**: an account sustaining `BACKEND_HIGHRATE_FLAG_THRESHOLD` rejected calls
(default 1000) within `BACKEND_HIGHRATE_FLAG_WINDOW` (default 10 min) gets the soft,
reversible `accounts.flagged_high_rate_at` marker — set once, shown in the user
list/detail, cleared by the operator, **never an automatic ban** and never a request
gate. The Edge/UX dashboard graphs the aggregate request rate against the rejection
rate by class.
- Unauthenticated `GET /healthz` (liveness) and `GET /readyz` (readiness — the
database answers a bounded ping and the session cache is warmed).
- The backend serves a **second listener** — a gRPC server
@@ -582,12 +606,12 @@ promotions) is future work and would deliver short markdown messages (text + lin
| Concern | Enforced by |
| --- | --- |
| Public rate limiting / anti-abuse | gateway |
| Public rate limiting / anti-abuse | gateway (per-IP public/email/admin classes, per-user authenticated class; a request body cap of `GATEWAY_MAX_BODY_BYTES`; rejections are metered, summarised to the backend and surfaced in the admin console with a conservative reversible auto-flag — R3, §11) |
| Telegram initData validation (bot-token HMAC) | the Telegram connector; the gateway delegates it over gRPC, so the bot token lives only in the connector |
| Session minting; email-code / guest validation | gateway (with backend) |
| Session → `user_id` resolution, `X-User-ID` injection | gateway |
| Authorisation, ownership, state transitions | backend (`X-User-ID` is the sole identity input) |
| Admin authentication | a single Basic-Auth gate on `/_gm/*`, forwarded **verbatim** to the backend's server-rendered admin console (and, in the deployed contour, routing `/_gm/grafana/*` to Grafana). In the deploy the **caddy** owns this gate (§13); a local non-caddy run uses the gateway's own `GATEWAY_ADMIN_*` proxy. The backend trusts the proxy (no admin principal) and guards its state-changing POSTs with a **same-origin** check — the console's CSRF defence. No operator identity is tracked |
| Admin authentication | a single Basic-Auth gate on `/_gm/*`, forwarded **verbatim** to the backend's server-rendered admin console (and, in the deployed contour, routing `/_gm/grafana/*` to Grafana). In the deploy the **caddy** owns this gate (§13); a local non-caddy run uses the gateway's own `GATEWAY_ADMIN_*` proxy, which the per-IP admin limiter class guards ahead of its Basic-Auth (R3) — the caddy-fronted path has no limiter (stock caddy), an accepted gap. The backend trusts the proxy (no admin principal) and guards its state-changing POSTs with a **same-origin** check — the console's CSRF defence. No operator identity is tracked |
| backend ↔ gateway ↔ connector trust | the network (only gateway may reach backend; the connector serves unauthenticated gRPC on the internal segment) |
This is an explicit, accepted MVP risk: compromise of the gateway↔backend
@@ -597,7 +621,7 @@ mutual auth is a future hardening step.
**Short numeric codes** (email confirm-codes and Stage 8 friend codes) are stored
only as SHA-256 hashes and are short-lived and single-use. The unauthenticated
email path carries a tight per-IP sub-limit (5 / 10 min); the **friend-code redeem**
is authenticated, so it rides the per-user limit (120 / min) and is further bounded
is authenticated, so it rides the per-user limit (300 / min) and is further bounded
by the code's 12 h TTL, single use, and **one live code per issuer** (which caps the
valid-code population). Brute-forcing a 6-digit friend code within these limits is an
accepted MVP risk with low blast radius (an unwanted friendship is removable/blockable);
@@ -605,22 +629,27 @@ a dedicated redeem sub-limit or a longer code is the hardening step if abuse app
## 13. Deployment (informational)
Single public origin, path-routed. The gateway **embeds** the static UI build
(`go:embed`, baked in by a node stage in `gateway/Dockerfile`). The Vite build has two
entries: a lightweight **landing page** served at `/`, and the game **SPA** served at
Single public origin, path-routed. The Vite build has two entries: a lightweight
**landing page** and the game **SPA**. The gateway **embeds** the SPA build
(`go:embed`, baked in by a node stage in `gateway/Dockerfile`) and serves it at
`/app/` (web) and `/telegram/` (the Telegram Mini App; outside Telegram that path
redirects to the root — the client-side guard). Hash-named `/assets/*` are served
redirects to the root — the client-side guard); a stray hit on the gateway's `/`
308-redirects to `/app/`. The **landing** ships in its own static container (R3): the
`landing` target of `gateway/Dockerfile` (caddy:2-alpine + the same Vite build,
`deploy/landing/Caddyfile`) serves it at `/`, so stray public traffic is absorbed by
static file serving and never reaches the Go edge. Hash-named `/assets/*` are served
`immutable` (a relaunch is a cache hit, not a re-download); the HTML shells are
`no-cache` so a new deploy is picked up. An in-compose **caddy** is the
contour's edge: it owns a single `/_gm` Basic-Auth and routes `/_gm/grafana/*` to
**Grafana** (anonymous-admin, so the one shared login gates it with no per-user
Grafana accounts) and the rest of `/_gm/*` to the backend-rendered **admin console**;
everything else (`/`, `/app/`, `/telegram/`, the Connect edge) goes to the gateway. The
`no-cache` so a new deploy is picked up — both containers apply the same caching. An
in-compose **caddy** is the contour's edge: it owns a single `/_gm` Basic-Auth and
routes `/_gm/grafana/*` to **Grafana** (anonymous-admin, so the one shared login gates
it with no per-user Grafana accounts) and the rest of `/_gm/*` to the backend-rendered
**admin console**; `/app/`, `/telegram/` and the Connect path go to the gateway; the
catch-all — notably the landing at `/` — goes to the landing container. The
**Telegram connector** runs as a separate container with **no public ingress** — it
long-polls Telegram and egresses through a VPN sidecar, answering only internal gRPC.
The full contour (`deploy/docker-compose.yml`) runs one `gateway`, one `backend`,
one Postgres, the connector (+ its VPN sidecar) and the **observability stack**
one Postgres, the static `landing`, the connector (+ its VPN sidecar) and the **observability stack**
OTel Collector (OTLP/gRPC ingest → Prometheus metrics + Tempo traces) and Grafana
with provisioned datasources and dashboards. All three services export OTLP to the
collector; the connector shares the VPN sidecar's netns, so its `AWG_CONF` must not
@@ -633,7 +662,8 @@ network (project-scoped DNS); only caddy joins the shared external `edge` networ
Two contours, two secret/variable prefixes (`TEST_` / `PROD_`):
- **Test** (Stage 16): auto-deploys on a PR into — or a push to — `development`
(`.gitea/workflows/ci.yaml``docker compose up -d --build` on the Gitea runner
host, then a `GET /` probe through caddy). The host caddy terminates TLS and
host, then `GET /` + `GET /app/` probes through caddy — the landing container and
the gateway, R3). The host caddy terminates TLS and
forwards the domain to `scrabble:80`, so the in-compose caddy serves plain HTTP
(`CADDY_SITE_ADDRESS=:80`). The in-compose caddy **trusts X-Forwarded-For from
private-range upstreams** (`trusted_proxies private_ranges`), so the real client IP —