R3: dashboards, docs and tracker bake-back
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m7s
CI / changes (pull_request) Successful in 1s
CI / unit (pull_request) Successful in 8s
CI / integration (pull_request) Successful in 12s
CI / ui (pull_request) Successful in 36s
CI / gate (pull_request) Successful in 0s
CI / deploy (pull_request) Successful in 1m7s
- Edge/UX dashboard: aggregate request-rate vs rejection-rate panel (gateway_rate_limited_total by class; no per-user labels). - ARCHITECTURE §2/§11/§12/§13: body cap + explicit h2c sizing, the rate-limit observability pipeline and auto-flag policy, the admin-limiter note (and the caddy-path gap), the landing container topology; fixed the stale 120/min per-user figure. - FUNCTIONAL (+_ru): the Throttled view and the reversible high-rate flag. - gateway/backend/deploy READMEs, TESTING.md, root CLAUDE.md updated. - PRERELEASE.md: R3 interview decisions + implementation refinements logged; tracker R3 -> done (this PR implements it; CI gates the merge).
This commit is contained in:
+44
-14
@@ -98,6 +98,15 @@ dropped). Horizontal scaling is explicit future work.
|
||||
response was lost — its button is disabled while offline and the player re-issues it on
|
||||
reconnect). A reachability watcher (a lightweight `profile.get` probe) clears the signal when no
|
||||
other traffic is in flight; the live `Subscribe` stream's drop/recovery feeds the same signal.
|
||||
**Edge hardening (R3):** every request body on the public listener is capped at
|
||||
`GATEWAY_MAX_BODY_BYTES` (default 1 MiB — far above any legitimate payload), both at the HTTP
|
||||
layer (`http.MaxBytesReader`) and as the Connect per-message read limit, so an oversized
|
||||
`Execute` is refused (`resource_exhausted`) without buffering. The h2c server carries explicit
|
||||
sizing: `MaxConcurrentStreams` 250 (the x/net default made visible — a real client holds one
|
||||
`Subscribe` stream plus a few unary calls) and a 3-minute connection `IdleTimeout` (a live
|
||||
`Subscribe` stream keeps its connection active, so only abandoned connections are reaped); the
|
||||
`http.Server` sets only `ReadHeaderTimeout` (10 s) — Read/WriteTimeout would kill the stream.
|
||||
R7 revisits the exact values under load.
|
||||
- **Alphabet on the wire (Stage 13)**: live play exchanges **alphabet indices**, not
|
||||
concrete letters. The rack (`StateView.rack`), the `SubmitPlay`/`Evaluate` tiles, the
|
||||
`Exchange` tiles and the `CheckWord` word are `ubyte` indices into the variant's alphabet
|
||||
@@ -572,6 +581,21 @@ promotions) is future work and would deliver short markdown messages (text + lin
|
||||
distinct accounts that performed an authenticated edge action in the window. The
|
||||
gauge is single-process by design (single-instance MVP, §10): it is correct for one
|
||||
gateway, resets on restart, and is a live operational figure, not a billing count.
|
||||
- **Rate-limit observability (R3):** every limiter rejection increments the gateway
|
||||
counter `gateway_rate_limited_total` (`class` = user/public/email/admin — aggregate
|
||||
only, honouring the no-per-user-label discipline above) and logs one **Debug** line;
|
||||
a gateway reporter drains the per-key rejection tracker every 30 s, emits one **Warn**
|
||||
summary per throttled key and posts the report to the backend
|
||||
(`POST /api/v1/internal/ratelimit/report`, network-trusted like `sessions/resolve`).
|
||||
The backend's `ratewatch` keeps a bounded in-memory episode window (single-instance,
|
||||
resets on restart, like `active_users`) surfaced on the admin console's **Throttled**
|
||||
page next to the flagged-account review queue, and applies the **conservative
|
||||
auto-flag**: an account sustaining `BACKEND_HIGHRATE_FLAG_THRESHOLD` rejected calls
|
||||
(default 1000) within `BACKEND_HIGHRATE_FLAG_WINDOW` (default 10 min) gets the soft,
|
||||
reversible `accounts.flagged_high_rate_at` marker — set once, shown in the user
|
||||
list/detail, cleared by the operator, **never an automatic ban** and never a request
|
||||
gate. The Edge/UX dashboard graphs the aggregate request rate against the rejection
|
||||
rate by class.
|
||||
- Unauthenticated `GET /healthz` (liveness) and `GET /readyz` (readiness — the
|
||||
database answers a bounded ping and the session cache is warmed).
|
||||
- The backend serves a **second listener** — a gRPC server
|
||||
@@ -582,12 +606,12 @@ promotions) is future work and would deliver short markdown messages (text + lin
|
||||
|
||||
| Concern | Enforced by |
|
||||
| --- | --- |
|
||||
| Public rate limiting / anti-abuse | gateway |
|
||||
| Public rate limiting / anti-abuse | gateway (per-IP public/email/admin classes, per-user authenticated class; a request body cap of `GATEWAY_MAX_BODY_BYTES`; rejections are metered, summarised to the backend and surfaced in the admin console with a conservative reversible auto-flag — R3, §11) |
|
||||
| Telegram initData validation (bot-token HMAC) | the Telegram connector; the gateway delegates it over gRPC, so the bot token lives only in the connector |
|
||||
| Session minting; email-code / guest validation | gateway (with backend) |
|
||||
| Session → `user_id` resolution, `X-User-ID` injection | gateway |
|
||||
| Authorisation, ownership, state transitions | backend (`X-User-ID` is the sole identity input) |
|
||||
| Admin authentication | a single Basic-Auth gate on `/_gm/*`, forwarded **verbatim** to the backend's server-rendered admin console (and, in the deployed contour, routing `/_gm/grafana/*` to Grafana). In the deploy the **caddy** owns this gate (§13); a local non-caddy run uses the gateway's own `GATEWAY_ADMIN_*` proxy. The backend trusts the proxy (no admin principal) and guards its state-changing POSTs with a **same-origin** check — the console's CSRF defence. No operator identity is tracked |
|
||||
| Admin authentication | a single Basic-Auth gate on `/_gm/*`, forwarded **verbatim** to the backend's server-rendered admin console (and, in the deployed contour, routing `/_gm/grafana/*` to Grafana). In the deploy the **caddy** owns this gate (§13); a local non-caddy run uses the gateway's own `GATEWAY_ADMIN_*` proxy, which the per-IP admin limiter class guards ahead of its Basic-Auth (R3) — the caddy-fronted path has no limiter (stock caddy), an accepted gap. The backend trusts the proxy (no admin principal) and guards its state-changing POSTs with a **same-origin** check — the console's CSRF defence. No operator identity is tracked |
|
||||
| backend ↔ gateway ↔ connector trust | the network (only gateway may reach backend; the connector serves unauthenticated gRPC on the internal segment) |
|
||||
|
||||
This is an explicit, accepted MVP risk: compromise of the gateway↔backend
|
||||
@@ -597,7 +621,7 @@ mutual auth is a future hardening step.
|
||||
**Short numeric codes** (email confirm-codes and Stage 8 friend codes) are stored
|
||||
only as SHA-256 hashes and are short-lived and single-use. The unauthenticated
|
||||
email path carries a tight per-IP sub-limit (5 / 10 min); the **friend-code redeem**
|
||||
is authenticated, so it rides the per-user limit (120 / min) and is further bounded
|
||||
is authenticated, so it rides the per-user limit (300 / min) and is further bounded
|
||||
by the code's 12 h TTL, single use, and **one live code per issuer** (which caps the
|
||||
valid-code population). Brute-forcing a 6-digit friend code within these limits is an
|
||||
accepted MVP risk with low blast radius (an unwanted friendship is removable/blockable);
|
||||
@@ -605,22 +629,27 @@ a dedicated redeem sub-limit or a longer code is the hardening step if abuse app
|
||||
|
||||
## 13. Deployment (informational)
|
||||
|
||||
Single public origin, path-routed. The gateway **embeds** the static UI build
|
||||
(`go:embed`, baked in by a node stage in `gateway/Dockerfile`). The Vite build has two
|
||||
entries: a lightweight **landing page** served at `/`, and the game **SPA** served at
|
||||
Single public origin, path-routed. The Vite build has two entries: a lightweight
|
||||
**landing page** and the game **SPA**. The gateway **embeds** the SPA build
|
||||
(`go:embed`, baked in by a node stage in `gateway/Dockerfile`) and serves it at
|
||||
`/app/` (web) and `/telegram/` (the Telegram Mini App; outside Telegram that path
|
||||
redirects to the root — the client-side guard). Hash-named `/assets/*` are served
|
||||
redirects to the root — the client-side guard); a stray hit on the gateway's `/`
|
||||
308-redirects to `/app/`. The **landing** ships in its own static container (R3): the
|
||||
`landing` target of `gateway/Dockerfile` (caddy:2-alpine + the same Vite build,
|
||||
`deploy/landing/Caddyfile`) serves it at `/`, so stray public traffic is absorbed by
|
||||
static file serving and never reaches the Go edge. Hash-named `/assets/*` are served
|
||||
`immutable` (a relaunch is a cache hit, not a re-download); the HTML shells are
|
||||
`no-cache` so a new deploy is picked up. An in-compose **caddy** is the
|
||||
contour's edge: it owns a single `/_gm` Basic-Auth and routes `/_gm/grafana/*` to
|
||||
**Grafana** (anonymous-admin, so the one shared login gates it with no per-user
|
||||
Grafana accounts) and the rest of `/_gm/*` to the backend-rendered **admin console**;
|
||||
everything else (`/`, `/app/`, `/telegram/`, the Connect edge) goes to the gateway. The
|
||||
`no-cache` so a new deploy is picked up — both containers apply the same caching. An
|
||||
in-compose **caddy** is the contour's edge: it owns a single `/_gm` Basic-Auth and
|
||||
routes `/_gm/grafana/*` to **Grafana** (anonymous-admin, so the one shared login gates
|
||||
it with no per-user Grafana accounts) and the rest of `/_gm/*` to the backend-rendered
|
||||
**admin console**; `/app/`, `/telegram/` and the Connect path go to the gateway; the
|
||||
catch-all — notably the landing at `/` — goes to the landing container. The
|
||||
**Telegram connector** runs as a separate container with **no public ingress** — it
|
||||
long-polls Telegram and egresses through a VPN sidecar, answering only internal gRPC.
|
||||
|
||||
The full contour (`deploy/docker-compose.yml`) runs one `gateway`, one `backend`,
|
||||
one Postgres, the connector (+ its VPN sidecar) and the **observability stack** —
|
||||
one Postgres, the static `landing`, the connector (+ its VPN sidecar) and the **observability stack** —
|
||||
OTel Collector (OTLP/gRPC ingest → Prometheus metrics + Tempo traces) and Grafana
|
||||
with provisioned datasources and dashboards. All three services export OTLP to the
|
||||
collector; the connector shares the VPN sidecar's netns, so its `AWG_CONF` must not
|
||||
@@ -633,7 +662,8 @@ network (project-scoped DNS); only caddy joins the shared external `edge` networ
|
||||
Two contours, two secret/variable prefixes (`TEST_` / `PROD_`):
|
||||
- **Test** (Stage 16): auto-deploys on a PR into — or a push to — `development`
|
||||
(`.gitea/workflows/ci.yaml` → `docker compose up -d --build` on the Gitea runner
|
||||
host, then a `GET /` probe through caddy). The host caddy terminates TLS and
|
||||
host, then `GET /` + `GET /app/` probes through caddy — the landing container and
|
||||
the gateway, R3). The host caddy terminates TLS and
|
||||
forwards the domain to `scrabble:80`, so the in-compose caddy serves plain HTTP
|
||||
(`CADDY_SITE_ADDRESS=:80`). The in-compose caddy **trusts X-Forwarded-For from
|
||||
private-range upstreams** (`trusted_proxies private_ranges`), so the real client IP —
|
||||
|
||||
@@ -171,3 +171,11 @@ applied after a reload). When a Telegram connector is configured an operator can
|
||||
**message a user** (by their Telegram identity) or **post to the game channel**.
|
||||
State-changing actions are protected by a same-origin check; the console tracks no
|
||||
operator identity.
|
||||
|
||||
The console also surfaces **rate-limit abuse** (R3): a **Throttled** page lists the
|
||||
recently throttled users/IPs the gateway reported (an in-memory window — it resets on
|
||||
a backend restart) and the accounts currently carrying the soft **high-rate flag**. An
|
||||
account sustaining rejections past a tunable threshold is flagged automatically —
|
||||
the marker is reversible, shown as a badge in the user list and on the user card, and
|
||||
**never blocks play**; the operator reviews and clears it from the user card. There is
|
||||
no automatic ban.
|
||||
|
||||
@@ -175,3 +175,11 @@ identity, их игры) и **игры** (сводка + места), разби
|
||||
подключён Telegram-коннектор, оператор также может **написать пользователю** (по его
|
||||
Telegram-identity) или **отправить пост в игровой канал**. Изменяющие действия
|
||||
защищены проверкой same-origin; личность оператора не отслеживается.
|
||||
|
||||
Консоль также показывает **злоупотребление лимитами** (R3): страница **Throttled**
|
||||
перечисляет недавно затроттленных пользователей/IP по отчётам gateway (окно в памяти —
|
||||
сбрасывается при рестарте backend) и аккаунты с действующим мягким **high-rate
|
||||
флагом**. Аккаунт, устойчиво превышающий настраиваемый порог отказов, помечается
|
||||
автоматически — маркер обратим, виден бейджем в списке пользователей и на карточке
|
||||
аккаунта и **никогда не блокирует игру**; оператор рассматривает и снимает его с
|
||||
карточки пользователя. Автоматического бана нет.
|
||||
|
||||
+14
-2
@@ -76,7 +76,14 @@ tests or touching CI.
|
||||
unsubscribe), the transcode round-trips (FlatBuffers↔JSON, X-User-ID
|
||||
forwarding, nested GameView, domain-code surfacing), the admin Basic-Auth
|
||||
reverse proxy (401 / forward), and a full Connect `Execute` path end to end
|
||||
(guest auth, unauthenticated rejection, unknown message type). The backend gains
|
||||
(guest auth, unauthenticated rejection, unknown message type). **R3** adds the
|
||||
edge-hardening cases: an oversized `Execute` payload is refused
|
||||
(`resource_exhausted`, the `GATEWAY_MAX_BODY_BYTES` cap), a limiter rejection
|
||||
lands in `gateway_rate_limited_total{class}` and the rejection tracker
|
||||
(drain/aggregate unit tests), the report POST reaches
|
||||
`/api/v1/internal/ratelimit/report` with the agreed JSON shape, the `/_gm`
|
||||
mount is 429-guarded by the per-IP admin class, and the gateway's `/`
|
||||
308-redirects to `/app/` (the landing left the embed). The backend gains
|
||||
the **guest** lifecycle (a guest plays an auto-match to a natural end yet accrues
|
||||
no statistics) and the **email-as-login** flow (request/verify, returning user)
|
||||
in `inttest`. Stage 8 adds gateway transcode round-trips for the new social/account
|
||||
@@ -92,7 +99,12 @@ tests or touching CI.
|
||||
404 when not). Postgres-backed `inttest` drives the **complaint resolution →
|
||||
dictionary-change pipeline** (file → resolve with a disposition → pending change → mark
|
||||
applied), the admin **list/count** read queries, and the **/_gm console over HTTP**
|
||||
(pages render; a resolve POST needs a same-origin header).
|
||||
(pages render; a resolve POST needs a same-origin header). **R3** adds `ratewatch`
|
||||
unit tests (window accumulation, the auto-flag threshold + expiry, the bounded
|
||||
episode map), the account-store **high-rate flag round-trip** (set-once / clear /
|
||||
re-flag) and a console flow in `inttest`: a gateway report auto-flags the account,
|
||||
the **Throttled** page shows the episode and the flagged queue, the user card
|
||||
carries the marker and the CSRF-guarded **Clear** reverses it.
|
||||
- **Observability & performance** *(Stage 12)* — `pkg/telemetry` unit-tests the exporter
|
||||
selection (`none`/`stdout`/`otlp` build providers; OTLP constructs with no collector;
|
||||
the nil-runtime fallback). The domain metrics are exercised through a manual
|
||||
|
||||
Reference in New Issue
Block a user