feat: edge gateway service

2026-04-02 19:18:42 +02:00
parent 8cde99936c
commit 436c97a38b
95 changed files with 20504 additions and 57 deletions
@@ -1,5 +1,46 @@
 # Edge Gateway

+## Run and Dependencies
+
+`cmd/gateway` starts with built-in listener defaults, but it still requires:
+
+- one reachable Redis deployment for session lookup, replay reservations, and
+  both internal event streams;
+- one configured session event stream via `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
+- one configured client event stream via `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
+- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by
+  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.
+
+Required startup environment variables:
+
+- `GATEWAY_SESSION_CACHE_REDIS_ADDR`
+- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`
+- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`
+- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`
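
A startup check for the four required variables above could look like the sketch below. The helper name `missingEnv` is hypothetical, and treating an empty value as missing is an assumption; the real `cmd/gateway` validation is not shown in this change.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the startup variables named in this section.
var requiredEnv = []string{
	"GATEWAY_SESSION_CACHE_REDIS_ADDR",
	"GATEWAY_SESSION_EVENTS_REDIS_STREAM",
	"GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// missingEnv returns the required variables that are unset or empty.
// lookup has the same shape as os.LookupEnv so it can be faked in tests.
// Assumption: an empty value counts as missing.
func missingEnv(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range requiredEnv {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing required variables:", missing)
		return
	}
	fmt.Println("all required variables are set")
}
```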
+
+Optional integrations:
+
+- `GATEWAY_ADMIN_HTTP_ADDR` enables the private `/metrics` listener;
+- an injected `AuthServiceClient` enables real public auth handling;
+- injected downstream routes are required for successful `ExecuteCommand`.
+
+Operational caveats:
+
+- public auth routes stay mounted and return `503 service_unavailable` until an
+  auth adapter is wired;
+- authenticated gRPC starts without downstream routes, but `ExecuteCommand`
+  returns gRPC `UNIMPLEMENTED` until routing is configured.
+
+Additional module docs:
+
+- [Public REST contract](openapi.yaml)
+- [Documentation index](docs/README.md)
+- [Runtime and components](docs/runtime.md)
+- [Request and push flows](docs/flows.md)
+- [Operator runbook](docs/runbook.md)
+- [Configuration and contract examples](docs/examples.md)
+- [Example `.env`](.env.example)
+
 ## Purpose

 `Edge Gateway` is the only public ingress for Galaxy Plus clients.
@@ -40,29 +81,97 @@ The gateway exposes two external transport classes.

 | Transport | Audience | Authentication | Payload format | Primary use |
 | --- | --- | --- | --- | --- |
-| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Public auth commands, health checks, browser/bootstrap traffic |
+| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
 | gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |

 ### Public REST Surface

 The public REST surface is used for commands that must work before a device
 session exists and for browser-originated traffic that may share the same edge.
+It covers the probe endpoints, public auth routes, and coarse public
+anti-abuse.

-Stable public endpoints:
+Currently implemented public endpoints:

-- `POST /api/v1/public/auth/send-email-code`
-- `POST /api/v1/public/auth/confirm-email-code`
 - `GET /healthz`
 - `GET /readyz`
+- `POST /api/v1/public/auth/send-email-code`
+- `POST /api/v1/public/auth/confirm-email-code`
+
+The implemented REST contract is documented in [`openapi.yaml`](openapi.yaml).
+The listener address is configured by `GATEWAY_PUBLIC_HTTP_ADDR`.
+The public REST listener read budgets are configured by:
+
+- `GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
+- `GATEWAY_PUBLIC_HTTP_READ_TIMEOUT` with default `10s`;
+- `GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT` with default `1m`.
+
+The public auth JSON contract uses a challenge-token flow:
+
+- `send-email-code` accepts `email` and returns `challenge_id`;
+- `confirm-email-code` accepts `challenge_id`, `code`, and
+  `client_public_key`, then returns `device_session_id`.
+
+`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
+key for the device session being created.
+
+These routes remain unauthenticated and delegate only through an injected
+`AuthServiceClient`.
+The default wiring used by `cmd/gateway` keeps the routes mounted and returns
+`503 service_unavailable` until a concrete upstream auth adapter is supplied.
+Public auth adapter calls are wrapped in
+`GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT`, which defaults to `3s`.
+When that timeout expires, the gateway preserves the public REST contract and
+returns `503 service_unavailable`.
+When an injected auth adapter returns `*AuthServiceError`, the gateway projects
+that client-safe `4xx/5xx` status, `code`, and `message` back to the caller
+after normalizing blank or invalid fields. Unexpected non-`AuthServiceError`
+adapter failures fail closed as `500 internal_error`.
+
+Public anti-abuse is process-local and in-memory.
+Per-IP buckets are derived only from the TCP peer `RemoteAddr`.
+Forwarded proxy headers such as `X-Forwarded-For` and `Forwarded` are
+intentionally ignored.
+Oversized public REST bodies are rejected with `413 request_too_large`.
+Rate-limited requests are rejected with `429 rate_limited` and a
+`Retry-After` header.

 In addition to the fixed endpoints above, the gateway may front browser
 bootstrap or asset traffic through a pluggable public handler or proxy.
 That traffic belongs to dedicated public route classes and must not share rate
 limit buckets or abuse counters with the public auth API.

+### Operational Admin Surface
+
+The gateway may expose one private operational HTTP listener used for metrics.
+
+The admin listener is disabled by default and is enabled only when
+`GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
+When enabled, it serves:
+
+- `GET /metrics`
+
+The admin listener read budgets are configured by:
+
+- `GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
+- `GATEWAY_ADMIN_HTTP_READ_TIMEOUT` with default `10s`;
+- `GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT` with default `1m`.
+
+`/metrics` is intentionally not mounted on the public REST ingress.
+It is also intentionally excluded from [`openapi.yaml`](openapi.yaml), because
+that specification covers only the public REST ingress.
+The endpoint exposes metrics in the Prometheus text exposition format described
+in the official Prometheus documentation:
+<https://prometheus.io/docs/instrumenting/exposition_formats/>.
+
 ### Authenticated gRPC Surface

 All authenticated client requests use HTTP/2 and gRPC.
+The listener address is configured by `GATEWAY_AUTHENTICATED_GRPC_ADDR`.
+Inbound authenticated gRPC connection setup is bounded by
+`GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT`, which defaults to `5s`.
+The accepted client timestamp skew is configured by
+`GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW` and defaults to `5m`.

 The public gRPC service exposes two methods:

@@ -72,10 +181,133 @@ The public gRPC service exposes two methods:
 `ExecuteCommand` is a generic unary RPC.
 The gateway routes the request downstream by `message_type` after transport
 verification succeeds.
+Downstream unary execution is bounded by
+`GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT`, which defaults to `5s`.
+When that timeout expires, the gateway preserves the authenticated gRPC
+contract and returns gRPC `UNAVAILABLE` with message
+`downstream service is unavailable`.

 `SubscribeEvents` is an authenticated server-streaming RPC.
 It binds the stream to `user_id` and `device_session_id` and starts by sending
-a service event that includes the current server time in milliseconds.
+a signed service event that includes the current server time in milliseconds.
+
+The v1 protobuf contract lives in
+`proto/galaxy/gateway/v1/edge_gateway.proto` under package
+`galaxy.gateway.v1` and service `EdgeGateway`.
+Generated Go bindings are committed under `proto/galaxy/gateway/v1/` and are
+regenerated with:
+
+```bash
+buf generate
+```
+
+The gateway validates the request envelope, looks up the device session in
+`SessionCache`, verifies `payload_hash` and the client Ed25519 signature,
+checks timestamp freshness, reserves the replay key, and applies the
+authenticated rate limits and the authenticated policy hook before any
+routing or push step runs.
+
+Malformed envelopes are rejected with gRPC `INVALID_ARGUMENT`.
+Requests with a non-empty but unsupported `protocol_version` are rejected with
+gRPC `FAILED_PRECONDITION`.
+The supported request `protocol_version` literal is `v1`.
+Requests with an unknown `device_session_id` are rejected with gRPC
+`UNAUTHENTICATED`.
+Requests for revoked sessions are rejected with gRPC `FAILED_PRECONDITION`.
+SessionCache backend failures, including Redis lookup or record-decode
+failures, fail closed with gRPC `UNAVAILABLE`.
+Requests with a `payload_hash` that is not a 32-byte SHA-256 digest or does
+not match `payload_bytes` are rejected with gRPC `INVALID_ARGUMENT`.
+Requests with an invalid client signature or a signature created by a
+different key are rejected with gRPC `UNAUTHENTICATED` and message
+`invalid request signature`.
+Requests with malformed cached `client_public_key` material fail closed as
+gRPC `UNAVAILABLE`.
+Requests with a `timestamp_ms` outside the symmetric freshness window around
+current server time are rejected with gRPC `FAILED_PRECONDITION` and message
+`request timestamp is outside the freshness window`.
+Requests that reuse the same `request_id` for the same `device_session_id`
+inside the active replay window are rejected with gRPC
+`FAILED_PRECONDITION` and message `request replay detected`.
+ReplayStore backend failures fail closed with gRPC `UNAVAILABLE` and message
+`replay store is unavailable`.
+
+Authenticated rate limits are enforced independently by transport peer IP,
+authenticated `device_session_id`, authenticated `user_id`, and authenticated
+message class. The gateway uses the full verified `message_type` literal as the
+stable v1 message-class key because the transport does not yet define a
+coarser authenticated class taxonomy. The peer IP is derived only from the
+gRPC transport peer address; if it is missing or cannot be parsed, the
+request falls back to the stable `unknown` IP bucket.
+Requests that exceed any authenticated rate-limit bucket are rejected with
+gRPC `RESOURCE_EXHAUSTED` and message
+`authenticated request rate limit exceeded`.
+The authenticated edge policy hook runs after those rate limits and defaults
+to allow-all until a concrete policy evaluator is wired into the process.
+
+`ExecuteCommand` builds an internal authenticated command context,
+resolves one exact-match downstream route by the full verified `message_type`
+literal, executes the downstream unary client, and signs the response before
+it is returned to the caller. When no exact downstream route is registered,
+`ExecuteCommand` is rejected with gRPC `UNIMPLEMENTED` and message
+`message_type is not routed`. Downstream availability failures are rejected
+with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.
+Unexpected downstream route-resolution or execution failures are rejected with
+gRPC `INTERNAL`. Successful unary responses preserve the original
+`request_id`, carry a SHA-256 `payload_hash` of the returned `payload_bytes`,
+and are signed with the configured server Ed25519 response signer.
+The default `cmd/gateway` wiring currently installs an empty static
+downstream router, so verified `ExecuteCommand` requests still return gRPC
+`UNIMPLEMENTED` until concrete downstream routes are injected.
+
+`SubscribeEvents` applies the full authenticated ingress pipeline, binds
+the stream to the verified `user_id` and `device_session_id`, sends one
+signed `gateway.server_time` bootstrap event whose FlatBuffers payload carries
+`server_time_ms`, registers the active stream in the in-memory `PushHub`, and
+then forwards signed client-facing events consumed from the configured client
+event Redis stream. User-targeted events fan out to every active stream for
+that user. Session-targeted events fan out only to streams whose
+`user_id` and `device_session_id` both match the event target. Each active
+stream uses a bounded in-memory queue; when that queue overflows, only the
+affected stream is closed with gRPC `RESOURCE_EXHAUSTED` and message
+`push stream overflowed`. When the session lifecycle stream reports that the
+same `device_session_id` was revoked, every active `SubscribeEvents` stream
+bound to that exact session is closed with gRPC `FAILED_PRECONDITION` and
+message `device session is revoked`. During gateway shutdown, the in-memory
+push hub is closed before gRPC graceful stop, and every active
+`SubscribeEvents` stream is terminated with gRPC `UNAVAILABLE` and message
+`gateway is shutting down`.
+
+Authenticated anti-abuse budgets are configured by the
+`GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_*` environment variables.
+
+Current authenticated gRPC defaults:
+
+- per-IP: `120 requests / minute`, `burst=40`;
+- per-session: `60 requests / minute`, `burst=20`;
+- per-user: `120 requests / minute`, `burst=40`;
+- per-message-class: `60 requests / minute`, `burst=20`.
+
+Authenticated anti-abuse configuration surface:
+
+- per-IP:
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS` default
+  `120`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW` default `1m`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST` default `40`;
+- per-session:
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS` default
+  `60`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW` default
+  `1m`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST` default
+  `20`;
+- per-user:
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS` default
+  `120`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW` default `1m`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST` default `40`;
+- per-message-class:
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS`
+  default `60`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW`
+  default `1m`,
+  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST`
+  default `20`.

 ## Envelope and Payload Model

@@ -86,10 +318,25 @@ The authenticated transport uses a split contract:
 - signatures are computed over canonical envelope fields and a hash of raw
  FlatBuffers bytes.

-The gateway treats `payload_bytes` as opaque business data.
+The gateway treats authenticated request `payload_bytes` as opaque business
+data.
 It verifies integrity and forwards verified bytes downstream without rewriting
 them.

+The request envelope version literal is `v1`.
+`payload_hash` is the raw 32-byte SHA-256 digest of `payload_bytes`.
+`ExecuteCommand` hashes the raw FlatBuffers payload bytes exactly as sent,
+while `SubscribeEvents` with an empty payload still requires
+`sha256([]byte{})` rather than a special-case value.
+The v1 request signature scheme is Ed25519.
+`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
+key registered during `confirm-email-code`.
+`signature` carries the raw 64-byte Ed25519 signature computed over the
+canonical request signing input.
+
+The v1 stream bootstrap payload uses the shared FlatBuffers schema
+`pkg/schema/fbs/gateway.fbs` with root table `gateway.ServerTimeEvent`.
+
 ### ExecuteCommandRequest

 Required fields:
@@ -119,6 +366,22 @@ Required fields:
 - `payload_hash`
 - `signature`

+The v1 unary response signature scheme is Ed25519 with response
+domain marker `galaxy-response-v1`.
+The response signing input uses the same canonical binary encoding shape as
+the request signer:
+
+- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
+  followed by raw bytes;
+- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
+- the signed field order is `galaxy-response-v1`, `protocol_version`,
+  `request_id`, `timestamp_ms`, `result_code`, `payload_hash`.
+
+`cmd/gateway` loads the unary response signer from
+`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must point to a PKCS#8
+PEM-encoded Ed25519 private key. Startup fails when the file is absent,
+unreadable, not strict PEM, not PKCS#8, or not Ed25519.
+
 ### SubscribeEventsRequest

 The stream open request reuses the authenticated request model.
@@ -158,6 +421,33 @@ Optional fields:
 - `request_id`
 - `trace_id`

+The v1 stream-event signature scheme is Ed25519 with event domain
+marker `galaxy-event-v1`.
+The event signing input uses the same canonical binary encoding shape as the
+request and unary response signers:
+
+- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
+  followed by raw bytes;
+- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
+- the signed field order is `galaxy-event-v1`, `event_type`, `event_id`,
+  `timestamp_ms`, `request_id`, `trace_id`, `payload_hash`.
+
+The bootstrap event uses:
+
+- `event_type = "gateway.server_time"`;
+- `event_id = request_id` from the opening `SubscribeEvents` request;
+- `payload_bytes` encoded as FlatBuffers `gateway.ServerTimeEvent` with
+  `server_time_ms`;
+- the same loaded Ed25519 signer configured by
+  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.
+
+Client-facing fan-out events are sourced from the internal client
+event stream. Internal publishers provide the event target and business
+payload only: `user_id`, optional `device_session_id`, `event_type`,
+`event_id`, `payload_bytes`, and optional `request_id` / `trace_id`. The
+gateway derives `timestamp_ms`, recomputes `payload_hash`, signs the event,
+and only then forwards it to the matching `SubscribeEvents` streams.
+
 ## Verification and Routing Pipeline

 The gateway applies the same strict verification order for authenticated gRPC
@@ -178,6 +468,38 @@ ingress.
 No downstream business service should receive a request that has not passed
 this full verification pipeline.

+`ExecuteCommand` enforces steps 1 through 11 and
+signs the successful unary response afterward. `SubscribeEvents` enforces
+steps 1 through 9, binds the verified stream identity, sends the initial
+signed server-time bootstrap event, and then keeps the stream open for push
+delivery.
+Malformed envelopes fail with gRPC `INVALID_ARGUMENT`.
+Unsupported non-empty `protocol_version` values fail with gRPC
+`FAILED_PRECONDITION`.
+Unknown sessions fail with gRPC `UNAUTHENTICATED`.
+Revoked sessions fail with gRPC `FAILED_PRECONDITION`.
+SessionCache backend failures fail with gRPC `UNAVAILABLE`.
+`payload_hash` values that are not raw 32-byte SHA-256 digests fail with gRPC
+`INVALID_ARGUMENT` and message `payload_hash must be a 32-byte SHA-256 digest`.
+`payload_hash` values that do not match `payload_bytes` fail with gRPC
+`INVALID_ARGUMENT` and message `payload_hash does not match payload_bytes`.
+Invalid request signatures fail with gRPC `UNAUTHENTICATED` and message
+`invalid request signature`.
+Malformed cached `client_public_key` values fail closed with gRPC
+`UNAVAILABLE` and message `session cache is unavailable`.
+Requests with a `timestamp_ms` outside the accepted freshness window fail with
+gRPC `FAILED_PRECONDITION` and message
+`request timestamp is outside the freshness window`.
+Requests that reuse the same `request_id` for the same `device_session_id`
+inside the active replay window fail with gRPC `FAILED_PRECONDITION` and
+message `request replay detected`.
+ReplayStore backend failures fail with gRPC `UNAVAILABLE` and message
+`replay store is unavailable`.
+Unrouted exact-match `message_type` values fail with gRPC `UNIMPLEMENTED` and
+message `message_type is not routed`.
+Downstream availability failures fail with gRPC `UNAVAILABLE` and message
+`downstream service is unavailable`.
+
 ## Internal Authenticated Contract

 Downstream services should receive an internal authenticated command rather than
@@ -206,7 +528,7 @@ Expected session fields available to the gateway:

 - `device_session_id`
 - `user_id`
-- client public key
+- base64-encoded raw 32-byte Ed25519 client public key
 - session status
 - revoke metadata
 - optional client metadata
@@ -217,12 +539,189 @@ Expected session fields available to the gateway:

 - session existence checks;
 - `device_session_id -> user_id`;
-- access to the client public key used for signature verification;
+- access to the base64-encoded raw Ed25519 client public key used for
+  signature verification;
 - revoked versus active status checks.

 Cache updates are event-driven.
 TTL is allowed only as a safety net and must not replace invalidation events.

+The gateway keeps a process-local in-memory snapshot
+cache in front of the Redis fallback backend. Authenticated requests read the
+local snapshot first. A local miss performs one bounded Redis lookup and seeds
+the local snapshot so later requests for the same session avoid another Redis
+round-trip unless a later session event changes the cached state.
+
+The local snapshot cache intentionally has no TTL and no size-based
+eviction policy. Session lifecycle events are the authoritative mechanism for
+keeping the hot path current, while Redis fallback remains the safety net for
+cold misses and process restarts.
+
+The Redis fallback implementation uses `go-redis/v9`.
+`cmd/gateway` requires the Redis fallback backend during startup, issues a
+bounded `PING`, and refuses to start when Redis is misconfigured or
+unavailable.
+
+Required environment variable:
+
+- `GATEWAY_SESSION_CACHE_REDIS_ADDR`
+
+Optional environment variables:
+
+- `GATEWAY_SESSION_CACHE_REDIS_USERNAME`
+- `GATEWAY_SESSION_CACHE_REDIS_PASSWORD`
+- `GATEWAY_SESSION_CACHE_REDIS_DB` with default `0`
+- `GATEWAY_SESSION_CACHE_REDIS_KEY_PREFIX` with default `gateway:session:`
+- `GATEWAY_SESSION_CACHE_REDIS_LOOKUP_TIMEOUT` with default `250ms`
+- `GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED` with default `false`
+
+The Redis key format is:
+
+- `<key_prefix><device_session_id>`
+
+The Redis value is one strict JSON object:
+
+- `device_session_id`
+- `user_id`
+- `client_public_key`
+- `status`
+- optional `revoked_at_ms`
+
+`client_public_key` stores the standard base64-encoded raw 32-byte Ed25519
+public key registered for the device session.
+
+Malformed JSON, missing required fields, unsupported `status`, or a
+`device_session_id` mismatch between the Redis value and the lookup key are
+treated as SessionCache backend failures rather than as valid session states.
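
A strict decoder for the record above could look like this sketch. The status literals `active` and `revoked`, the rejection of unknown fields, and the type and function names are assumptions made for illustration; only the field names come from this section.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

type sessionRecord struct {
	DeviceSessionID string `json:"device_session_id"`
	UserID          string `json:"user_id"`
	ClientPublicKey string `json:"client_public_key"`
	Status          string `json:"status"`
	RevokedAtMS     *int64 `json:"revoked_at_ms,omitempty"`
}

// decodeSessionRecord returns an error for anything that should be treated
// as a backend failure rather than as a valid session state.
func decodeSessionRecord(lookupID string, raw []byte) (sessionRecord, error) {
	var rec sessionRecord
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // assumption: "strict" rejects unknown keys
	if err := dec.Decode(&rec); err != nil {
		return sessionRecord{}, fmt.Errorf("malformed session record: %w", err)
	}
	if _, err := dec.Token(); err != io.EOF {
		return sessionRecord{}, errors.New("trailing data after session record")
	}
	if rec.DeviceSessionID == "" || rec.UserID == "" || rec.ClientPublicKey == "" {
		return sessionRecord{}, errors.New("missing required field")
	}
	if rec.Status != "active" && rec.Status != "revoked" { // assumed literals
		return sessionRecord{}, errors.New("unsupported status")
	}
	if rec.DeviceSessionID != lookupID {
		return sessionRecord{}, errors.New("device_session_id does not match lookup key")
	}
	return rec, nil
}

func main() {
	raw := []byte(`{"device_session_id":"d1","user_id":"u1","client_public_key":"a2V5","status":"active"}`)
	rec, err := decodeSessionRecord("d1", raw)
	fmt.Println(rec.UserID, rec.Status, err)
}
```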
+
+### Session Event Stream
+
+The gateway keeps the process-local session snapshot cache synchronized from one
+Redis Stream consumed through `go-redis/v9`.
+
+`cmd/gateway` requires the session event stream configuration during startup,
+issues a bounded `PING` against the same Redis deployment used for
+`SessionCache`, and refuses to start when that Redis backend is unavailable.
+
+Required environment variable:
+
+- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`
+
+Optional environment variable:
+
+- `GATEWAY_SESSION_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`
+
+The subscriber reuses the same Redis address, ACL credentials, logical
+database, timeout, and TLS settings configured for `SessionCache`.
+
+Each gateway replica keeps its own in-memory last-seen stream ID and consumes
+the stream with plain `XREAD`, not a shared consumer group.
+On startup the replica resolves the current stream tail and begins from that
+point, which preserves the same fresh-process semantics as Redis `$` while
+avoiding a race before the first blocking read.
+
+The session event payload is one strict full snapshot with these
+fields:
+
+- `device_session_id`
+- `user_id`
+- `client_public_key`
+- `status`
+- optional `revoked_at_ms`
+
+Valid active and revoked snapshots upsert or replace the local session state.
+Later stream entries win.
+Malformed events are skipped without stopping the subscriber; when
+`device_session_id` can still be extracted, the gateway evicts the local
+snapshot for that session so it cannot continue using stale state.
+
+Session event publishers must keep the stream bounded by using
+`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
+The gateway intentionally does not trim the stream from the consumer side,
+because consumer-side trimming could drop updates that another gateway replica
+has not read yet.
+
+### Client Event Stream
+
+The gateway delivers client-facing push events from one dedicated Redis Stream
+consumed through `go-redis/v9`.
+
+`cmd/gateway` requires the client event stream configuration during startup,
+issues a bounded `PING` against the same Redis deployment used for
+`SessionCache`, and refuses to start when that Redis backend is unavailable.
+
+Required environment variable:
+
+- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`
+
+Optional environment variable:
+
+- `GATEWAY_CLIENT_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`
+
+The subscriber reuses the same Redis address, ACL credentials, logical
+database, timeout, and TLS settings configured for `SessionCache`.
+
+Each gateway replica keeps its own in-memory last-seen stream ID and consumes
+the stream with plain `XREAD`, not a shared consumer group.
+On startup the replica resolves the current stream tail and begins from that
+point, which preserves the same fresh-process semantics as Redis `$` while
+avoiding a race before the first blocking read.
+
+The client event payload is one strict target-plus-payload entry with
+these fields:
+
+- `user_id`
+- optional `device_session_id`
+- `event_type`
+- `event_id`
+- `payload_bytes`
+- optional `request_id`
+- optional `trace_id`
+
+`payload_bytes` carries the raw binary-safe business payload bytes for the
+outbound client event.
+When `device_session_id` is absent or blank, the gateway fans the event out to
+every active stream for `user_id`.
+When `device_session_id` is present, the gateway fans the event out only to
+active streams whose `user_id` and `device_session_id` both match.
+Malformed client event entries are skipped without stopping the subscriber or
+delivering partial data to clients.
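
The targeting rule above fits in one predicate. In this sketch the type names are hypothetical, and "blank" is read as empty or whitespace-only, which is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// streamBinding is the verified identity of one active SubscribeEvents stream.
type streamBinding struct {
	UserID          string
	DeviceSessionID string
}

// eventTarget is the target part of one client event entry.
type eventTarget struct {
	UserID          string
	DeviceSessionID string // optional
}

// matches reports whether an event should be delivered to the stream.
func matches(s streamBinding, t eventTarget) bool {
	if s.UserID != t.UserID {
		return false
	}
	if strings.TrimSpace(t.DeviceSessionID) == "" {
		return true // user-targeted: every stream of that user
	}
	return s.DeviceSessionID == t.DeviceSessionID
}

func main() {
	s := streamBinding{UserID: "u1", DeviceSessionID: "d1"}
	fmt.Println(matches(s, eventTarget{UserID: "u1"}))
	fmt.Println(matches(s, eventTarget{UserID: "u1", DeviceSessionID: "d2"}))
}
```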
+
+Client event publishers must keep the stream bounded by using
+`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
+The gateway intentionally does not trim the stream from the consumer side,
+because consumer-side trimming could drop updates that another gateway replica
+has not read yet.
+
+### Replay Store
+
+`ReplayStore` provides the hot-path anti-replay reservation for:
+
+- duplicate detection by `device_session_id + request_id`;
+- bounded replay protection for the authenticated freshness window.
+
+The ReplayStore uses Redis through `go-redis/v9`.
+`cmd/gateway` requires the ReplayStore backend during startup, issues a
+bounded `PING`, and refuses to start when Redis is misconfigured or
+unavailable.
+
+The ReplayStore reuses the same Redis deployment settings as `SessionCache`
+and adds two replay-specific environment variables:
+
+- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
+- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`
+
+Replay keys use this format:
+
+- `<key_prefix><base64url(device_session_id)>:<base64url(request_id)>`
+
+For each accepted request, the replay reservation TTL is computed as:
+
+- `timestamp_ms + freshness_window - now`
+
+The TTL is clamped to a minimum positive duration so requests accepted exactly
+on the freshness boundary still reserve their replay key.
+
 ### Revocation Behavior

 When a device session is revoked:
@@ -231,7 +730,9 @@ When a device session is revoked:
 2. it publishes a session update or revoke event;
 3. the gateway invalidates or updates `SessionCache`;
 4. new unary gRPC requests for that session are rejected;
-5. active `SubscribeEvents` streams for that session are closed.
+5. active `SubscribeEvents` streams for that exact `device_session_id` are
+   closed with gRPC `FAILED_PRECONDITION` and message
+   `device session is revoked`.

 ## Public Anti-Abuse Model

@@ -245,9 +746,15 @@ The gateway uses these public route classes:
 - `browser_asset`
 - `public_misc`

+Any classifier result outside this fixed set is normalized to `public_misc`
+before the class is stored in request context or used for policy derivation.
+The canonical base bucket namespace for public REST policy is
+`public_rest/class=<class>`.
+
 ### Public Auth

-`public_auth` includes `send-email-code` and `confirm-email-code`.
+`public_auth` is the stable route class for `send-email-code` and
+`confirm-email-code`.
 This class uses stricter limits and abuse scoring because it directly touches
 account and session creation flows.

@@ -259,6 +766,36 @@ Controls include:
 - malformed request counters;
 - elevated logging and security telemetry for repeated failures.

+Current defaults:
+
+- per-IP: `30 requests / minute`, `burst=10`;
+- `send-email-code` identity buckets: `3 requests / 10 minutes`, `burst=1`,
+  keyed by normalized `email`;
+- `confirm-email-code` identity buckets: `6 requests / 10 minutes`,
+  `burst=2`, keyed by normalized `challenge_id`;
+- maximum request body size: `8192` bytes;
+- only `POST` is accepted for public auth routes.
+
+Configuration surface:
+
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES` default `8192`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS` default
+  `30`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW` default `1m`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST` default `10`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
+  default `3`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
+  default `10m`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
+  default `1`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
+  default `6`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
+  default `10m`;
+- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
+  default `2`.
+
 ### Browser Bootstrap and Asset Traffic

 `browser_bootstrap` and `browser_asset` use separate coarse-grained budgets.
@@ -275,6 +812,40 @@ This traffic is still constrained by:

 The gateway must not merge these buckets or counters with `public_auth`.

+Current defaults:
+
+- `browser_bootstrap`: `60 requests / minute`, `burst=20`, `GET` and `HEAD`
+  only, and no request body;
+- `browser_asset`: `300 requests / minute`, `burst=80`, `GET` and `HEAD`
+  only, and no request body;
+- `public_misc`: `30 requests / minute`, `burst=10`, and no request body.
+
+Configuration surface:
+
+- `browser_bootstrap`:
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES` default
+  `0`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS`
+  default `60`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW` default
+  `1m`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST` default
+  `20`;
+- `browser_asset`:
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES` default `0`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS` default
+  `300`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW` default
+  `1m`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST` default
+  `80`;
+- `public_misc`:
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES` default `0`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS` default
+  `30`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW` default `1m`,
+  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST` default `10`.
+
 ## Push Delivery Model

 The v1 push channel is a gRPC server stream.
@@ -285,15 +856,34 @@ Expected stream behavior:
 1. the client opens `SubscribeEvents`;
 2. the gateway applies the full authenticated ingress verification pipeline;
 3. the stream is bound to `user_id` and `device_session_id`;
-4. the first service event includes `server_time_ms`;
-5. client-facing events from internal pub/sub are fanned out to matching active
-   streams;
-6. revoke events close affected streams.
+4. the first signed service event is `gateway.server_time` and its
+   FlatBuffers payload includes `server_time_ms`;
+5. after that bootstrap event, the stream is registered in `PushHub` and
+   remains open until client cancellation, server shutdown, queue overflow,
+   session revoke for the same `device_session_id`, or a later send failure;
+6. internal pub/sub may target all active streams for one `user_id` or only
+   one `device_session_id` within that user;
+7. the current per-stream in-memory queue capacity is `64` events and
+   overflow closes only the affected stream;
+8. session revoke closes only streams bound to the same exact
+   `device_session_id` and returns gRPC `FAILED_PRECONDITION` with message
+   `device session is revoked`.
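
The targeting rules in steps 6 and 8 can be sketched as a small matching function. This is an illustration under assumptions: the `binding` and `target` types are hypothetical, and representing a user-wide event by an empty `DeviceSessionID` is a convention chosen here, not something the README specifies.

```go
package main

import "fmt"

// binding is the identity a stream is bound to after verification.
type binding struct {
	UserID          string
	DeviceSessionID string
}

// target describes who an internal event is addressed to.
// Assumption: an empty DeviceSessionID means "all streams of this user".
type target struct {
	UserID          string
	DeviceSessionID string
}

// matches reports whether an event with this target reaches the stream.
func (t target) matches(b binding) bool {
	if t.UserID != b.UserID {
		return false
	}
	return t.DeviceSessionID == "" || t.DeviceSessionID == b.DeviceSessionID
}

// selectStreams returns the active streams an event should be fanned out to.
func selectStreams(active []binding, t target) []binding {
	var out []binding
	for _, b := range active {
		if t.matches(b) {
			out = append(out, b)
		}
	}
	return out
}

// streamsToRevoke returns only streams bound to the exact device session.
func streamsToRevoke(active []binding, deviceSessionID string) []binding {
	var out []binding
	for _, b := range active {
		if b.DeviceSessionID == deviceSessionID {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	active := []binding{
		{UserID: "u1", DeviceSessionID: "d1"},
		{UserID: "u1", DeviceSessionID: "d2"},
		{UserID: "u2", DeviceSessionID: "d3"},
	}
	fmt.Println(len(selectStreams(active, target{UserID: "u1"})))                        // 2
	fmt.Println(len(selectStreams(active, target{UserID: "u1", DeviceSessionID: "d2"}))) // 1
	fmt.Println(len(streamsToRevoke(active, "d1")))                                      // 1
}
```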
+
+## Lifecycle and Shutdown
+
+Gateway process shutdown is coordinated across the public REST listener,
+authenticated gRPC listener, optional admin listener, internal Redis
+subscribers, and telemetry runtime.
+
+`GATEWAY_SHUTDOWN_TIMEOUT` configures the per-component graceful shutdown
+budget and defaults to `5s`.
+During authenticated gRPC shutdown, the in-memory `PushHub` closes active
+streams before gRPC graceful stop, so active `SubscribeEvents` calls terminate
+with gRPC `UNAVAILABLE` and message `gateway is shutting down`.
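
A per-component shutdown budget can be expressed with one `context.WithTimeout` per component. The sketch below assumes components are stopped one after another and that each receives a fresh budget; the README does not say whether the real coordinator runs them sequentially or concurrently, and `component`, `shutdownAll`, `quick`, and `stuck` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// component is a hypothetical unit with a context-aware shutdown hook.
type component struct {
	name     string
	shutdown func(ctx context.Context) error
}

// shutdownAll gives every component its own budget, mirroring the
// "per-component" wording of GATEWAY_SHUTDOWN_TIMEOUT.
func shutdownAll(budget time.Duration, comps []component) map[string]error {
	results := make(map[string]error, len(comps))
	for _, c := range comps {
		ctx, cancel := context.WithTimeout(context.Background(), budget)
		results[c.name] = c.shutdown(ctx)
		cancel()
	}
	return results
}

// timedOut reports whether a component exhausted its budget.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// quick simulates a component that stops immediately.
func quick(name string) component {
	return component{name: name, shutdown: func(context.Context) error { return nil }}
}

// stuck simulates a component that never finishes on its own.
func stuck(name string) component {
	return component{name: name, shutdown: func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}}
}

func main() {
	// The documented default is 5s; a short budget keeps the demo fast.
	results := shutdownAll(50*time.Millisecond, []component{
		quick("admin_http"),
		stuck("authenticated_grpc"),
	})
	fmt.Println(timedOut(results["admin_http"]))         // false
	fmt.Println(timedOut(results["authenticated_grpc"])) // true
}
```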

 ## Recommended Package Layout

-The initial package layout should keep transport, policy, and downstream
-adapters separate:
+The package layout keeps transport, policy, and downstream adapters separate:

 - `cmd/gateway`
 - `internal/app`
@@ -317,11 +907,17 @@ The gateway should be built around explicit consumer-side interfaces.

 Provides cached session lookup by `device_session_id`.
 Returns enough data to verify signatures and identify the authenticated user.
+The current production implementation is a process-local read-through cache in
+front of a Redis fallback adapter that uses strict JSON records under a
+configurable key prefix.

 ### ReplayStore

 Tracks recently seen `request_id` values per device session and rejects replayed
 requests inside the accepted freshness window.
+The current production adapter is Redis-backed, uses a dedicated configurable
+key prefix, and reserves keys with a TTL derived from
+`timestamp_ms + freshness_window - now`.
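
The TTL formula can be written out directly. This sketch works in milliseconds to match `timestamp_ms`; the function name `replayTTLMs` and the `30s` window used in `main` are assumptions, and treating a non-positive TTL as "nothing to reserve" is an inference from the formula rather than documented behavior.

```go
package main

import "fmt"

// replayTTLMs computes timestamp_ms + freshness_window - now in milliseconds.
// ok is false when the result is not positive, meaning the request is already
// outside the freshness window and a reservation would expire immediately.
func replayTTLMs(timestampMs, freshnessWindowMs, nowMs int64) (ttlMs int64, ok bool) {
	ttlMs = timestampMs + freshnessWindowMs - nowMs
	return ttlMs, ttlMs > 0
}

func main() {
	const timestampMs = 1_700_000_000_000 // request timestamp_ms
	const windowMs = 30_000               // assumed freshness window of 30s

	// The request is processed 10s after it was stamped.
	ttl, ok := replayTTLMs(timestampMs, windowMs, timestampMs+10_000)
	fmt.Println(ttl, ok) // 20000 true

	// The request arrives after the window has fully elapsed.
	ttl, ok = replayTTLMs(timestampMs, windowMs, timestampMs+45_000)
	fmt.Println(ttl, ok) // -15000 false
}
```

Deriving the TTL from the request timestamp rather than from the arrival time means a reservation never outlives the window in which the same `request_id` could still be accepted.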

 ### RateLimiter

@@ -333,24 +929,44 @@ Applies independent policies for:
 - authenticated gRPC requests by user;
 - authenticated gRPC requests by message class.

+The current rate limiter is process-local and in-memory.
+Public REST keys stay under the `public_rest/...` namespace, while
+authenticated gRPC keys stay under `authenticated_grpc/...`, so both traffic
+surfaces keep independent buckets even when they share the same limiter
+backend.
+
 ### PublicTrafficClassifier

 Maps incoming public REST requests to one of the public route classes so that
 limits and anti-abuse counters remain isolated.
+The gateway normalizes any unsupported or empty classifier output to
+`public_misc`, and public policy code derives the base bucket namespace from
+the normalized class as `public_rest/class=<class>`.
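
The normalization and namespace rule can be sketched as two small functions. The class names come from this README; the function names `normalizeClass` and `bucketNamespace` are hypothetical, and the exact set of supported classes in the real code is an assumption.

```go
package main

import "fmt"

// Route classes named in this README.
const (
	classPublicAuth       = "public_auth"
	classBrowserBootstrap = "browser_bootstrap"
	classBrowserAsset     = "browser_asset"
	classPublicMisc       = "public_misc"
)

// normalizeClass maps empty or unsupported classifier output to public_misc.
func normalizeClass(raw string) string {
	switch raw {
	case classPublicAuth, classBrowserBootstrap, classBrowserAsset, classPublicMisc:
		return raw
	default:
		return classPublicMisc
	}
}

// bucketNamespace derives the base rate-limit namespace for a class.
func bucketNamespace(raw string) string {
	return "public_rest/class=" + normalizeClass(raw)
}

func main() {
	fmt.Println(bucketNamespace("browser_asset")) // public_rest/class=browser_asset
	fmt.Println(bucketNamespace(""))              // public_rest/class=public_misc
	fmt.Println(bucketNamespace("websocket"))     // public_rest/class=public_misc
}
```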

 ### AuthServiceClient

 Handles public auth commands and session-related updates exchanged with the
 Auth / Session Service.
+The gateway contract is:
+
+- `SendEmailCode(email) -> challenge_id`
+- `ConfirmEmailCode(challenge_id, code, client_public_key) -> device_session_id`
+
+When no concrete implementation is wired, the gateway keeps the public routes
+available and returns a stable `503 service_unavailable` response instead of
+failing process startup.
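
One way to read this contract in Go is an interface plus a placeholder implementation used when nothing is wired. The sketch below is an assumption about shape only: the parameter types (for example `[]byte` for `client_public_key`), the `unavailableAuthClient` stub, the `handleSendEmailCode` helper, and every status other than `503 service_unavailable` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// AuthServiceClient mirrors the two documented operations.
type AuthServiceClient interface {
	SendEmailCode(ctx context.Context, email string) (challengeID string, err error)
	ConfirmEmailCode(ctx context.Context, challengeID, code string, clientPublicKey []byte) (deviceSessionID string, err error)
}

var errAuthUnavailable = errors.New("auth service client is not wired")

// unavailableAuthClient is a hypothetical stand-in for a missing adapter.
type unavailableAuthClient struct{}

func (unavailableAuthClient) SendEmailCode(context.Context, string) (string, error) {
	return "", errAuthUnavailable
}

func (unavailableAuthClient) ConfirmEmailCode(context.Context, string, string, []byte) (string, error) {
	return "", errAuthUnavailable
}

// handleSendEmailCode shows how a route could stay mounted and still answer
// with the stable 503 code. Other return values are illustrative only.
func handleSendEmailCode(client AuthServiceClient, email string) (status int, code string) {
	if _, err := client.SendEmailCode(context.Background(), email); err != nil {
		if errors.Is(err, errAuthUnavailable) {
			return http.StatusServiceUnavailable, "service_unavailable"
		}
		return http.StatusInternalServerError, "internal_error" // hypothetical
	}
	return http.StatusOK, "ok" // hypothetical
}

func main() {
	status, code := handleSendEmailCode(unavailableAuthClient{}, "user@example.com")
	fmt.Println(status, code) // 503 service_unavailable
}
```

Wiring a stub instead of a nil client keeps the handler path identical in both modes, so the process can start without the adapter.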

 ### DownstreamRouter

-Resolves the target downstream service or adapter by `message_type`.
+Resolves the target downstream service or adapter by the full exact-match
+`message_type` literal.

 ### DownstreamClient

 Executes a verified authenticated command against a downstream internal service
-and returns response payload bytes plus a stable result code.
+and returns response payload bytes plus a stable opaque result code.
+An empty or whitespace-only result code is treated as an internal downstream
+contract violation.
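
Exact-match routing and the result-code check are both small enough to sketch. The `router` type, the `fleet.move` message type, and the `fleet-service` target are hypothetical examples; only the exact-match rule and the empty or whitespace-only check come from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errContractViolation = errors.New("downstream returned an empty result code")

// router maps a full message_type literal to a downstream target name.
type router map[string]string

// resolve performs an exact match: no prefixes, wildcards, or case folding.
func (r router) resolve(messageType string) (string, bool) {
	target, ok := r[messageType]
	return target, ok
}

// validateResultCode rejects empty or whitespace-only result codes.
func validateResultCode(code string) error {
	if strings.TrimSpace(code) == "" {
		return errContractViolation
	}
	return nil
}

func main() {
	r := router{"fleet.move": "fleet-service"} // hypothetical route

	_, ok := r.resolve("fleet.move")
	fmt.Println(ok) // true
	_, ok = r.resolve("fleet.Move")
	fmt.Println(ok) // false: the literal must match exactly

	fmt.Println(validateResultCode("ok") == nil)  // true
	fmt.Println(validateResultCode("   ") == nil) // false
}
```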

 ### EventSubscriber

@@ -360,15 +976,25 @@ Subscribes to internal pub/sub topics used for:
 - revocations;
 - client-facing event delivery.

+The implementation consumes two Redis Streams with replica-safe plain
+`XREAD`: one strict full-session snapshot stream for the process-local session
+cache and one client-facing event stream for live push fan-out.
+
 ### PushHub

 Tracks active `SubscribeEvents` streams, binds them to authenticated identities,
 and delivers events to the correct connections.
+The implementation uses one bounded in-memory queue per stream with a
+default capacity of `64` events; overflowing one queue closes only that stream
+and leaves the remaining streams active.
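
The overflow rule maps naturally onto a buffered channel with a non-blocking send. The following is a minimal sketch, not the gateway's `PushHub`: the `hub` type, its method names, and keying streams by a plain string ID are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const defaultQueueCapacity = 64 // documented default per-stream capacity

var (
	errQueueOverflow = errors.New("push queue overflow")
	errUnknownStream = errors.New("unknown stream")
)

// hub is a hypothetical registry with one bounded queue per stream.
type hub struct {
	mu      sync.Mutex
	streams map[string]chan []byte
}

func newHub() *hub {
	return &hub{streams: make(map[string]chan []byte)}
}

// register creates the bounded queue that a stream's sender loop would drain.
func (h *hub) register(streamID string, capacity int) <-chan []byte {
	h.mu.Lock()
	defer h.mu.Unlock()
	q := make(chan []byte, capacity)
	h.streams[streamID] = q
	return q
}

// publish enqueues without blocking. A full queue closes and removes only
// that stream; every other stream stays registered.
func (h *hub) publish(streamID string, event []byte) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	q, ok := h.streams[streamID]
	if !ok {
		return errUnknownStream
	}
	select {
	case q <- event:
		return nil
	default:
		close(q)
		delete(h.streams, streamID)
		return errQueueOverflow
	}
}

// active returns the number of registered streams.
func (h *hub) active() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.streams)
}

func main() {
	h := newHub()
	h.register("slow", defaultQueueCapacity)
	h.register("healthy", defaultQueueCapacity)

	// Nothing drains "slow", so the 65th event overflows its queue.
	var err error
	for i := 0; i <= defaultQueueCapacity && err == nil; i++ {
		err = h.publish("slow", []byte("event"))
	}
	fmt.Println(err, h.active())                       // push queue overflow 1
	fmt.Println(h.publish("healthy", []byte("event"))) // <nil>
}
```

Closing the slow stream instead of blocking the publisher keeps one stalled client from delaying fan-out to everyone else.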

 ### ResponseSigner

 Signs unary responses and stream events so clients can verify server-originated
 messages.
+The implementation uses one Ed25519 signer loaded from
+`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must reference a PKCS#8
+PEM-encoded private key.
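
Parsing a PKCS#8 PEM-encoded Ed25519 key needs only the Go standard library. The sketch below uses hypothetical helper names (`parseSignerKey`, `generateSignerPEM`) and generates a throwaway key in memory so it runs on its own; the real service would presumably read the bytes from the configured path, for example with `os.ReadFile`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// parseSignerKey decodes one PEM block and parses it as a PKCS#8 Ed25519 key.
func parseSignerKey(pemBytes []byte) (ed25519.PrivateKey, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	parsed, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parse PKCS#8 key: %w", err)
	}
	key, ok := parsed.(ed25519.PrivateKey)
	if !ok {
		return nil, errors.New("private key is not Ed25519")
	}
	return key, nil
}

// generateSignerPEM creates a throwaway key in the expected encoding.
// It exists only to make this example self-contained.
func generateSignerPEM() ([]byte, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	der, err := x509.MarshalPKCS8PrivateKey(priv)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der}), nil
}

func main() {
	pemBytes, err := generateSignerPEM()
	if err != nil {
		panic(err)
	}
	key, err := parseSignerKey(pemBytes)
	if err != nil {
		panic(err)
	}

	// Sign some bytes and verify them with the matching public key.
	msg := []byte("response bytes")
	sig := ed25519.Sign(key, msg)
	pub := key.Public().(ed25519.PublicKey)
	fmt.Println(ed25519.Verify(pub, msg, sig)) // true
}
```

The type assertion matters because `x509.ParsePKCS8PrivateKey` also accepts RSA and ECDSA keys, which this signer should reject.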

 ### Clock

@@ -382,6 +1008,7 @@ internal implementation details.
 Minimum error categories:

 - malformed request;
+- request too large;
 - unsupported protocol;
 - unknown session;
 - revoked session;
@@ -389,7 +1016,10 @@ Minimum error categories:
 - stale request;
 - replay detected;
 - rate limited;
+- policy denied;
 - downstream unavailable;
+- backend unavailable;
+- gateway shutting down;
 - internal error.

 Observability requirements:
@@ -400,6 +1030,51 @@ Observability requirements:
 - metrics keyed by route class, message type, result code, and reject reason;
 - no logging of secrets, raw private material, or raw signatures.

+The service uses:
+
+- `go.uber.org/zap` for structured JSON logs;
+- `otelgin` for the public REST listener;
+- `otelgrpc` for the authenticated gRPC listener;
+- OpenTelemetry metrics exported through Prometheus on the optional admin
+  `/metrics` listener.
+
+Current custom metric families:
+
+- `gateway.public_http.requests`
+- `gateway.public_http.duration`
+- `gateway.authenticated_grpc.requests`
+- `gateway.authenticated_grpc.duration`
+- `gateway.push.active_streams`
+- `gateway.push.stream_closures`
+- `gateway.internal_event_drops`
+
+The process-wide log level is configured by `GATEWAY_LOG_LEVEL` and
+defaults to `info`.
+The default OpenTelemetry resource uses `service.name=galaxy-edge-gateway`
+when `OTEL_SERVICE_NAME` is unset.
+If `OTEL_TRACES_EXPORTER` is unset or set to `none`, the gateway keeps the
+tracing runtime enabled but installs no external trace exporter.
+If `OTEL_TRACES_EXPORTER=otlp`, the gateway uses the standard
+`OTEL_EXPORTER_OTLP_*` environment variables to configure the OTLP trace
+exporter protocol and endpoint.
+The protocol selection specifically honors
+`OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` first and falls back to
+`OTEL_EXPORTER_OTLP_PROTOCOL` when the trace-specific variable is unset.
+Supported values are `http/protobuf` and `grpc`; when both variables are
+unset, the gateway defaults to `http/protobuf`.
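
The protocol precedence can be sketched as a single lookup function. The name `otlpTracesProtocol` is hypothetical, an empty value is treated as unset, and returning an error for unsupported values is an assumption, since the text above does not say how the gateway reacts to them.

```go
package main

import (
	"fmt"
	"os"
)

// otlpTracesProtocol resolves the OTLP trace protocol. The trace-specific
// variable wins, the generic one is the fallback, and http/protobuf is the
// default when neither is set.
func otlpTracesProtocol(getenv func(string) string) (string, error) {
	value := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if value == "" {
		value = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	switch value {
	case "":
		return "http/protobuf", nil
	case "http/protobuf", "grpc":
		return value, nil
	default:
		// Assumption: unsupported values are reported as an error.
		return "", fmt.Errorf("unsupported OTLP protocol %q", value)
	}
}

func main() {
	protocol, err := otlpTracesProtocol(os.Getenv)
	fmt.Println(protocol, err)
}
```

Passing the lookup function as a parameter keeps the precedence logic testable without mutating the process environment.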
+
+Structured logs intentionally omit:
+
+- public auth e-mail addresses, login codes, and challenge IDs;
+- client public keys;
+- raw payload bytes and payload hashes;
+- raw request or response signatures;
+- response-signer private key material and Redis credentials.
+
+Malformed internal session and client-event stream entries are never dropped
+silently: the gateway logs each drop and increments
+`gateway.internal_event_drops`.
+
 ## Non-Goals

 The gateway is not a business authorization layer and must not grow into a