R3: gateway edge hardening — body cap, h2c sizing, rate-limit observability

- GATEWAY_MAX_BODY_BYTES (default 1 MiB): connect.WithReadMaxBytes plus
  http.MaxBytesReader on the public mux; explicit http2.Server
  MaxConcurrentStreams/IdleTimeout and an http.Server ReadHeaderTimeout
  (R2 report follow-up).
- gateway_rate_limited_total{class} counter, one Debug line per rejection,
  and a rejection tracker drained every 30 s into a per-key Warn summary and
  a report POST to /api/v1/internal/ratelimit/report (feeds the admin view
  and the auto-flag).
- The previously unused AdminPerMinute/AdminBurst policy now guards the /_gm
  mount (429), ahead of its Basic-Auth.
- resolve() logs the cause of infrastructure session-resolve failures at Warn
  (the transient unauthenticated dips seen in the R2 run); unknown tokens
  stay silent.
2026-06-10 01:58:48 +02:00
parent c23ac94c4e
commit 8878711cf3
12 changed files with 549 additions and 35 deletions
@@ -23,7 +23,7 @@ proto/edge/v1/          # Connect envelope contract (committed generated Go)
 internal/config/        # GATEWAY_* env config
 internal/backendclient/ # typed REST client (+ X-User-ID) and push gRPC client
 internal/session/       # in-memory session cache (LRU/TTL, backend fallback)
-internal/ratelimit/     # token-bucket limiter (golang.org/x/time/rate)
+internal/ratelimit/     # token-bucket limiter (golang.org/x/time/rate) + the rejection tracker (R3)
 internal/connector/     # gRPC client to the Telegram connector (initData validate, out-of-app push) + routing
 internal/push/          # live-event fan-out hub (per-user client streams)
 internal/transcode/     # FlatBuffers<->REST bridge + message_type registry
@@ -79,12 +79,21 @@ connector (`ValidateLoginWidget`) and forward the trusted `external_id`. These
 | `GATEWAY_SESSION_TTL` | `10m` | cached session lifetime |
 | `GATEWAY_SESSION_CACHE_MAX` | `50000` | cached session cap |
 | `GATEWAY_PUSH_HEARTBEAT_INTERVAL` | `10s` | live-stream keep-alive (an immediate heartbeat also fires on open, under the ~15s edge idle timeout) |
+| `GATEWAY_MAX_BODY_BYTES` | `1048576` | caps each request body and each Connect message read (1 MiB); an oversized Execute is rejected with `resource_exhausted` (R3) |
 | `GATEWAY_SERVICE_NAME` | `scrabble-gateway` | OpenTelemetry `service.name` |
 | `GATEWAY_OTEL_TRACES_EXPORTER` | `none` | `none`, `stdout` or `otlp` (gRPC; endpoint from `OTEL_EXPORTER_OTLP_*`) |
 | `GATEWAY_OTEL_METRICS_EXPORTER` | `none` | `none`, `stdout` or `otlp` |

 Rate-limit defaults (built-in): public 30/min·IP (burst 10), authenticated
-120/min·user (burst 40), admin 60/min·IP (burst 20), email-code 5/10 min·IP.
+300/min·user (burst 80, raised in Stage 17 for multi-device play), admin
+60/min·IP (burst 20, guarding the `/_gm` mount ahead of its Basic-Auth),
+email-code 5/10 min·IP (burst 2).
+
+Every rejection increments `gateway_rate_limited_total{class}`
+(`user`/`public`/`email`/`admin`) and logs one Debug line. Every 30 s a
+reporter drains the per-key rejection tracker, emits one Warn summary per
+throttled key and posts the report to the backend (`/api/v1/internal/ratelimit/report`),
+which feeds the admin console's throttled view and the high-rate auto-flag (R3).

 ## Run