14b65389ef
Browser fetch-streaming layers close response bodies they consider
idle after roughly 15-30 s without incoming bytes. Safari is the
most aggressive, but the symptom matters everywhere: a quiet
SubscribeEvents stream (lobby, between turns, mailbox empty) gets
torn down by the browser, the EventStream singleton reconnects with
backoff, and any push event that fires inside the reconnect window
is lost because `push.Hub` queues are not persisted across
subscription closes. The user-visible failure mode is the
intermittent "Fetch API cannot load … due to access control checks"
console error (a misleading WebKit symptom — CORS headers are
actually present) plus missed turn-ready / mail-received toasts.
Server-side fix: a silence-based heartbeat at the
`authenticatedPushStreamService` wrapper layer. After the signed
`gateway.server_time` bootstrap event, gateway wraps the bound
stream with `heartbeatingStream`. Every tail Send (fan-out, future
variants) resets the silence timer; when the timer elapses, a
goroutine emits `gateway.heartbeat` with only `EventType` set —
everything else stays at proto3 defaults, so the wire frame is
~45 bytes amortised. A `sendMu` serialises the heartbeat goroutine
with tail Sends because grpc-go does not allow concurrent SendMsg
calls on the same stream, and Send is a thin wrapper over SendMsg.
The heartbeat is intentionally UNSIGNED: heartbeats carry no
payload and dispatch to no handler on the client, so an injected
heartbeat causes no user-visible state change. TLS still
protects the wire and real events keep the signed envelope
unchanged. Documented in `docs/ARCHITECTURE.md` § 15 alongside the
per-scale bandwidth projection (100…100 000 clients × 15…60 s).
Config: new `GATEWAY_PUSH_HEARTBEAT_INTERVAL` (default `15s`,
`0s` disables). Telemetry: new
`gateway.push.heartbeats_sent{outcome}` counter so operators can
budget bandwidth and read a sudden rise in `outcome=error` as a sign
that the upstream connection is failing before the gateway can flush.
Client (`ui/frontend/src/api/events.svelte.ts`): early `continue`
on `event.eventType === "gateway.heartbeat"` before `verifyEvent`,
`verifyPayloadHash`, or dispatch, because the empty signature would
otherwise trip SignatureError and force a reconnect. A leading heartbeat still flips
`connectionStatus` to `connected` and resets backoff, because
receiving one is proof the stream is healthy.
Tests:
- `push_heartbeat_test.go`: unit tests for the wrapper — zero
interval returns nil, heartbeat fires after silence, real Send
resets the timer, Stop / context-cancel halt the goroutine,
Send errors propagate.
- `server_test.go`: integration tests through the full gateway
pipeline — heartbeat fires after the configured silence window,
zero interval keeps the stream silent.
- `config_test.go`: default applied, env-override parsed,
negative value rejected.
- `events.test.ts`: heartbeat skipped before verification + not
dispatched to handlers; leading heartbeat still flips
`connectionStatus` to `connected`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
164 lines
5.1 KiB
Go
package grpcapi

import (
	"context"
	"sync"
	"time"

	"galaxy/gateway/internal/telemetry"
	gatewayv1 "galaxy/gateway/proto/galaxy/gateway/v1"

	"go.opentelemetry.io/otel/attribute"
	"google.golang.org/grpc"
)

// heartbeatingStream wraps a server-streaming response so the inner
// stream stays alive across browser fetch-streaming idle timeouts.
// Every call to Send (a real event from a tail service) resets a
// silence timer; when the timer fires, Run emits an unsigned
// `gateway.heartbeat` event on its own. Send and the heartbeat
// goroutine serialise on the same mutex because grpc.ServerStream.Send
// is documented as not goroutine-safe.
//
// Wire-cost budgeting: each heartbeat is one GatewayEvent with only
// EventType populated (~17 bytes + protobuf tag), framed by Connect
// (~5 bytes) and HTTP/2 plus TLS overhead (~50 bytes). At the
// 15-second default a fully-idle stream costs ~840 KB/day per client;
// see `docs/ARCHITECTURE.md` for the per-scale projection.
type heartbeatingStream struct {
	grpc.ServerStreamingServer[gatewayv1.GatewayEvent]

	interval time.Duration
	metrics  *telemetry.Runtime

	sendMu sync.Mutex
	timer  *time.Timer

	stopOnce sync.Once
	done     chan struct{}
}

// newHeartbeatingStream wraps inner with a silence-based heartbeat
// emitter. A non-positive interval returns nil so the caller can skip
// the wrapping entirely; non-nil returns must have `Stop()` called once
// the stream lifecycle ends.
func newHeartbeatingStream(
	inner grpc.ServerStreamingServer[gatewayv1.GatewayEvent],
	interval time.Duration,
	metrics *telemetry.Runtime,
) *heartbeatingStream {
	if interval <= 0 {
		return nil
	}

	return &heartbeatingStream{
		ServerStreamingServer: inner,
		interval:              interval,
		metrics:               metrics,
		timer:                 time.NewTimer(interval),
		done:                  make(chan struct{}),
	}
}

// Send forwards event to the inner stream and resets the silence timer
// so the heartbeat goroutine waits a fresh interval before firing
// again. A Send that succeeds means the transport just delivered real
// bytes; the silence window restarts from "now".
func (s *heartbeatingStream) Send(event *gatewayv1.GatewayEvent) error {
	s.sendMu.Lock()
	defer s.sendMu.Unlock()
	if err := s.ServerStreamingServer.Send(event); err != nil {
		return err
	}
	s.resetTimerLocked()

	return nil
}

// Run blocks until ctx is canceled or Stop is called, emitting one
// `gateway.heartbeat` event whenever the silence timer fires. Intended
// to run in its own goroutine alongside the tail service that owns the
// stream. A Send failure from the heartbeat path is recorded in
// telemetry and returned to the caller; production wiring discards it
// because the tail service will see the same transport failure on its
// next Send and propagate the real error to the gateway frame
// observability layer.
func (s *heartbeatingStream) Run(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-s.done:
			return nil
		case <-s.timer.C:
			err := s.sendHeartbeat()
			if err != nil {
				return err
			}
		}
	}
}

// Stop halts the heartbeat goroutine and drains the silence timer.
// Safe to call multiple times; subsequent calls are no-ops.
func (s *heartbeatingStream) Stop() {
	s.stopOnce.Do(func() {
		close(s.done)
		if !s.timer.Stop() {
			select {
			case <-s.timer.C:
			default:
			}
		}
	})
}

// sendHeartbeat emits one heartbeat event, records the outcome in
// telemetry, and re-arms the silence timer. The outcome attribute
// makes a sudden bump of `error` easy to spot in dashboards; it
// usually means the upstream connection is failing before the gateway
// can flush, while a steady `sent` rate is the normal idle baseline
// the deployment operator budgets bandwidth against.
func (s *heartbeatingStream) sendHeartbeat() error {
	s.sendMu.Lock()
	defer s.sendMu.Unlock()

	err := s.ServerStreamingServer.Send(buildHeartbeatEvent())
	outcome := attribute.String("outcome", "sent")
	if err != nil {
		outcome = attribute.String("outcome", "error")
	}
	s.metrics.RecordPushHeartbeat(context.Background(), outcome)
	if err != nil {
		return err
	}
	s.resetTimerLocked()

	return nil
}

// resetTimerLocked re-arms the silence timer. Caller must hold sendMu.
// The drain follows the canonical pattern from the time.Timer
// docstring: Stop may report `false` either because the timer already
// fired or because nothing was queued, so the non-blocking drain
// handles both states without deadlocking when the channel was already
// emptied by Run.
func (s *heartbeatingStream) resetTimerLocked() {
	if !s.timer.Stop() {
		select {
		case <-s.timer.C:
		default:
		}
	}
	s.timer.Reset(s.interval)
}

// buildHeartbeatEvent returns the minimal `gateway.heartbeat`
// GatewayEvent emitted into the push stream. Every field except
// EventType is left at its proto3 default so the wire frame stays as
// small as Connect framing allows. See `gatewayHeartbeatEventType` for
// the security rationale of leaving the event unsigned.
func buildHeartbeatEvent() *gatewayv1.GatewayEvent {
	return &gatewayv1.GatewayEvent{EventType: gatewayHeartbeatEventType}
}