feat(gateway): unsigned gateway.heartbeat keeps Safari push streams alive

Browser fetch-streaming layers close response bodies they consider
idle after roughly 15-30 s without incoming bytes. Safari is the
most aggressive, but the symptom matters everywhere: a quiet
SubscribeEvents stream (lobby, between turns, mailbox empty) gets
torn down by the browser, the EventStream singleton reconnects with
backoff, and any push event that fires inside the reconnect window
is lost because `push.Hub` queues are not persisted across
subscription closes. The user-visible failure mode is the
intermittent "Fetch API cannot load … due to access control checks"
console error (a misleading WebKit symptom — CORS headers are
actually present) plus missed turn-ready / mail-received toasts.

Server-side fix: a silence-based heartbeat at the
`authenticatedPushStreamService` wrapper layer. After the signed
`gateway.server_time` bootstrap event, gateway wraps the bound
stream with `heartbeatingStream`. Every tail Send (fan-out, future
variants) resets the silence timer; when the timer elapses, a
goroutine emits `gateway.heartbeat` with only `EventType` set —
everything else stays at proto3 defaults, so the wire frame is
~45 bytes amortised. A `sendMu` serialises the heartbeat goroutine
with tail Sends because grpc.ServerStream.Send is not goroutine-safe.
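
A minimal sketch of the wrapper described above, under stated assumptions: `Event`, `Sender`, `recorder`, and `runDemo` are hypothetical stand-ins (the real wrapper sits on the gRPC server stream), and the sketch tracks the time of the last Send instead of resetting a timer, which gives the same silence-based behaviour with less timer bookkeeping.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Event and Sender are hypothetical stand-ins for the gateway's push
// event message and the Send side of a server stream.
type Event struct{ EventType string }

type Sender interface{ Send(*Event) error }

// heartbeatingStream emits a heartbeat after `interval` of silence.
type heartbeatingStream struct {
	inner    Sender
	interval time.Duration
	sendMu   sync.Mutex // serialises real Sends with heartbeat Sends
	lastSend time.Time  // guarded by sendMu
	stop     chan struct{}
	stopOnce sync.Once
}

// newHeartbeatingStream returns nil when the interval disables heartbeats.
func newHeartbeatingStream(inner Sender, interval time.Duration) *heartbeatingStream {
	if interval <= 0 {
		return nil
	}
	h := &heartbeatingStream{inner: inner, interval: interval,
		lastSend: time.Now(), stop: make(chan struct{})}
	go h.loop()
	return h
}

// Send forwards a real event and restarts the silence window.
func (h *heartbeatingStream) Send(ev *Event) error {
	h.sendMu.Lock()
	defer h.sendMu.Unlock()
	h.lastSend = time.Now()
	return h.inner.Send(ev)
}

// Stop halts the heartbeat goroutine; calling it twice is safe.
func (h *heartbeatingStream) Stop() { h.stopOnce.Do(func() { close(h.stop) }) }

func (h *heartbeatingStream) loop() {
	for {
		h.sendMu.Lock()
		wait := h.interval - time.Since(h.lastSend)
		if wait <= 0 {
			// Silence window elapsed: only EventType is set.
			err := h.inner.Send(&Event{EventType: "gateway.heartbeat"})
			h.lastSend = time.Now()
			wait = h.interval
			if err != nil {
				h.sendMu.Unlock()
				return // the stream is broken; stop ticking
			}
		}
		h.sendMu.Unlock()
		select {
		case <-time.After(wait):
		case <-h.stop:
			return
		}
	}
}

// recorder is a test double that remembers the event types it was sent.
type recorder struct {
	mu    sync.Mutex
	types []string
}

func (r *recorder) Send(ev *Event) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.types = append(r.types, ev.EventType)
	return nil
}

// runDemo sends one real event, then stays silent for 220 ms with a
// 50 ms interval, and returns the event types the recorder saw.
func runDemo() []string {
	rec := &recorder{}
	h := newHeartbeatingStream(rec, 50*time.Millisecond)
	_ = h.Send(&Event{EventType: "game.turn.ready"})
	time.Sleep(220 * time.Millisecond)
	h.Stop()
	rec.mu.Lock()
	defer rec.mu.Unlock()
	return append([]string(nil), rec.types...)
}

func main() { fmt.Println(runDemo()) }
```

Holding `sendMu` across both the real Send and the heartbeat Send is what keeps the two goroutines from writing to the stream concurrently.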

The heartbeat is intentionally UNSIGNED: heartbeats carry no
payload and dispatch to no handler on the client, so an injected
heartbeat cannot cause a user-visible state change. TLS still
protects the wire and real events keep the signed envelope
unchanged. Documented in `docs/ARCHITECTURE.md` § 15 alongside the
per-scale bandwidth projection (100…100 000 clients × 15…60 s).

Config: new `GATEWAY_PUSH_HEARTBEAT_INTERVAL` (default `15s`,
`0s` disables). Telemetry: new
`gateway.push.heartbeats_sent{outcome}` counter so operators can
budget bandwidth and treat a sudden `outcome=error` bump as a sign
that the upstream connection is failing before the gateway can flush.
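
The config rules above (default `15s`, `0s` disables, negative rejected) could be parsed roughly as follows. This is a sketch only: `parseHeartbeatInterval` and `defaultHeartbeatInterval` are hypothetical names, and the gateway's real config loader may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultHeartbeatInterval mirrors the documented default of 15s.
const defaultHeartbeatInterval = 15 * time.Second

// parseHeartbeatInterval interprets the raw environment value.
// An empty value yields the default, zero disables heartbeats, and a
// negative or malformed duration is rejected.
func parseHeartbeatInterval(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultHeartbeatInterval, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("GATEWAY_PUSH_HEARTBEAT_INTERVAL: %w", err)
	}
	if d < 0 {
		return 0, fmt.Errorf("GATEWAY_PUSH_HEARTBEAT_INTERVAL: must not be negative, got %s", d)
	}
	return d, nil
}

func main() {
	d, err := parseHeartbeatInterval(os.Getenv("GATEWAY_PUSH_HEARTBEAT_INTERVAL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if d == 0 {
		fmt.Println("heartbeats disabled")
		return
	}
	fmt.Println("heartbeat interval:", d)
}
```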

Client (`ui/frontend/src/api/events.svelte.ts`): early `continue`
on `event.eventType === "gateway.heartbeat"` before `verifyEvent`,
`verifyPayloadHash`, or dispatch; an empty signature would otherwise
trip `SignatureError` and force a reconnect. A leading heartbeat still flips
`connectionStatus` to `connected` and resets backoff, because
receiving one is proof the stream is healthy.

Tests:
- `push_heartbeat_test.go`: unit tests for the wrapper — zero
  interval returns nil, heartbeat fires after silence, real Send
  resets the timer, Stop / context-cancel halt the goroutine,
  Send errors propagate.
- `server_test.go`: integration tests through the full gateway
  pipeline — heartbeat fires after the configured silence window,
  zero interval keeps the stream silent.
- `config_test.go`: default applied, env-override parsed,
  negative value rejected.
- `events.test.ts`: heartbeat skipped before verification + not
  dispatched to handlers; leading heartbeat still flips
  `connectionStatus` to `connected`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ilia Denisov
2026-05-19 09:29:29 +02:00
parent 8f84075c4b
commit 14b65389ef
14 changed files with 787 additions and 12 deletions
@@ -691,6 +691,55 @@ stream opens is `event_type = gateway.server_time`, reusing the opening
`request_id` as `event_id` and carrying `server_time_ms` so clients can
calibrate offset without a separate time request.
#### Unsigned `gateway.heartbeat` keepalive
Browser fetch-streaming layers (notably WebKit/Safari) close response
bodies they consider idle after roughly 15-30 seconds without
incoming bytes. A push stream in a quiet game (no `game.turn.ready`,
no diplomatic mail) would otherwise be torn down and reopened
repeatedly; events that fire during the reconnect window vanish
because `push.Hub` queues are not persisted across subscription
closes. To keep the body active, the gateway emits a
`gateway.heartbeat` event after `GATEWAY_PUSH_HEARTBEAT_INTERVAL` of
stream silence (default `15s`; set to `0s` to disable). Every real
event resets the silence timer, so the heartbeat fires rarely on
busy streams.

Heartbeats are sent **unsigned**: every field except `event_type` is
left at its protobuf default and no Ed25519 signature is computed.
The client short-circuits on the `gateway.heartbeat` type before
calling `verifyEvent` / `verifyPayloadHash` and never dispatches the
event to handlers. Leaving the heartbeat unsigned is deliberate:
heartbeats carry no payload that the UI acts on, so an injected
heartbeat cannot cause any user-visible state change.
TLS still protects the wire and the rest of the signed envelope is
unchanged for real events.
##### Wire cost projections

| Clients | 15 s | 30 s | 60 s |
| ------: | ---: | ---: | ---: |
| 100 | 25 MB/day | 13 MB/day | 6 MB/day |
| 1 000 | 250 MB/day | 125 MB/day | 62 MB/day |
| 10 000 | 2.5 GB/day | 1.3 GB/day | 0.6 GB/day |
| 100 000 | 25 GB/day | 12.5 GB/day | 6 GB/day |

Per-heartbeat budget at ~45 bytes on the wire (proto + Connect
framing + HTTP/2 DATA header + amortised TLS overhead), worst case
when no real event ever displaces a tick. Active streams trade
heartbeat traffic for real-event traffic 1:1, so the table is the
upper bound at the chosen interval. Larger deployments that are
willing to accept a marginally higher Safari reconnect risk can
raise `GATEWAY_PUSH_HEARTBEAT_INTERVAL` toward `30s` to roughly
halve the cost; setting `0s` reclaims all bytes, but the visible
Safari reconnect loop returns.
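
The worst-case figures follow from simple arithmetic: clients × (86 400 s ÷ interval) × ~45 bytes. The sketch below reproduces that calculation; `worstCaseBytesPerDay` is a hypothetical helper, and the table's values appear to be rounded versions of what it computes (for example, 100 clients at 15 s comes to 25.92 MB/day).

```go
package main

import "fmt"

const (
	heartbeatBytes = 45    // approximate per-heartbeat wire cost from the text
	secondsPerDay  = 86400 // 24 h
)

// worstCaseBytesPerDay assumes every stream stays silent all day, so a
// heartbeat fires once per interval on every client stream.
func worstCaseBytesPerDay(clients, intervalSeconds int) int64 {
	return int64(clients) * secondsPerDay * heartbeatBytes / int64(intervalSeconds)
}

func main() {
	intervals := []int{15, 30, 60}
	for _, clients := range []int{100, 1000, 10000, 100000} {
		fmt.Printf("%7d clients:", clients)
		for _, s := range intervals {
			mb := float64(worstCaseBytesPerDay(clients, s)) / 1e6
			fmt.Printf("  %2d s = %9.2f MB/day", s, mb)
		}
		fmt.Println()
	}
}
```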
Observability: every emission increments the
`gateway.push.heartbeats_sent{outcome}` counter, where
`outcome=sent` is the steady-state line item the operator budgets
bandwidth against and a sudden `outcome=error` bump means the
upstream connection is failing before the gateway can flush.
### Verification order at gateway
Before any payload is forwarded to backend, gateway must:
@@ -820,6 +820,18 @@ internal hub. The first frame the client receives is a
gateway-signed bootstrap event carrying the current server time, so
the client can calibrate its local clock without a separate request.
While the stream is open, gateway tracks a silence timer; if no real
event has been forwarded for `GATEWAY_PUSH_HEARTBEAT_INTERVAL`
(default `15s`, `0s` disables), gateway emits an unsigned
`gateway.heartbeat` event to keep browser fetch-streaming layers
from closing the response body as idle. Real events reset the
timer, so on busy streams the heartbeat fires rarely. The UI client
short-circuits the heartbeat type before signature verification and
never dispatches it to handlers — see
[`docs/ARCHITECTURE.md` § 15](ARCHITECTURE.md#15-transport-security-model-gateway-boundary)
for the wire-cost projection and the security rationale of leaving
the heartbeat unsigned.
### 7.3 Backend → gateway control
Backend hosts a single gRPC service `Push.SubscribePush`, consumed
@@ -845,6 +845,18 @@ control-канал backend → gateway, который производит эт
with the current server time, so the client can calibrate its
local clock without a separate request.
While the stream is open, gateway tracks a silence timer; if no
real event has arrived within `GATEWAY_PUSH_HEARTBEAT_INTERVAL`
(default `15s`, `0s` disables), gateway sends an unsigned
`gateway.heartbeat` event so that browser fetch-streaming layers
do not close the response body as idle. Real events reset the
timer, so on busy streams the heartbeat fires rarely. The UI client
short-circuits the heartbeat type before signature verification and
never passes it on to handlers; see
[`docs/ARCHITECTURE.md` § 15](ARCHITECTURE.md#15-transport-security-model-gateway-boundary)
for the traffic estimate and the rationale for leaving the
heartbeat unsigned.
### 7.3 Backend → gateway control
Backend hosts a single gRPC service `Push.SubscribePush`,