# Edge Gateway

## Run and Dependencies

`cmd/gateway` starts with built-in listener defaults, but it still requires:

- one reachable Redis deployment for session lookup, replay reservations, and both internal event streams;
- one configured session event stream via `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
- one configured client event stream via `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Required startup environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`
- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`

Optional integrations:

- `GATEWAY_ADMIN_HTTP_ADDR` enables the private `/metrics` listener;
- `GATEWAY_AUTH_SERVICE_BASE_URL` enables real public auth handling through Auth / Session Service public HTTP;
- `GATEWAY_USER_SERVICE_BASE_URL` enables direct authenticated self-service routing to User Service internal HTTP;
- injected downstream routes are required for successful `ExecuteCommand`.

Operational caveats:

- public auth routes stay mounted and return `503 service_unavailable` until an auth service base URL is configured;
- authenticated gRPC starts without downstream routes, but `ExecuteCommand` returns gRPC `UNIMPLEMENTED` until routing is configured.

Additional module docs:

- [Public REST contract](openapi.yaml)
- [Documentation index](docs/README.md)
- [Runtime and components](docs/runtime.md)
- [Request and push flows](docs/flows.md)
- [Operator runbook](docs/runbook.md)
- [Configuration and contract examples](docs/examples.md)
- [Example `.env`](.env.example)

## Purpose

`Edge Gateway` is the only public ingress for Galaxy Plus clients.
It terminates the external transport and security boundary, enforces edge policies, and routes verified requests to internal services.

The gateway does not implement domain-specific business logic.
Business validation, authorization, ownership checks, and state transitions remain inside downstream services.

## Trust Boundary

The gateway sits between untrusted external clients and trusted internal services.

The gateway is responsible for:

- parsing external transport requests;
- classifying public REST traffic;
- authenticating protected gRPC traffic;
- loading session state from cache;
- verifying request freshness and anti-replay constraints;
- applying edge rate limits and anti-abuse policy;
- building an authenticated internal command context;
- routing verified commands to internal services;
- maintaining authenticated push delivery connections.

The gateway is not responsible for:

- deciding whether a user is allowed to execute a business action;
- validating domain invariants;
- storing the source-of-truth session record;
- implementing business idempotency.

## Transport Matrix

The gateway exposes two external transport classes.

| Transport | Audience | Authentication | Payload format | Primary use |
| --- | --- | --- | --- | --- |
| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |

### Public REST Surface

The public REST surface is used for commands that must work before a device session exists and for browser-originated traffic that may share the same edge.
It covers the probe endpoints, public auth routes, and coarse public anti-abuse.

Currently implemented public endpoints:

- `GET /healthz`
- `GET /readyz`
- `POST /api/v1/public/auth/send-email-code`
- `POST /api/v1/public/auth/confirm-email-code`

The implemented REST contract is documented in [`openapi.yaml`](openapi.yaml).

The listener address is configured by `GATEWAY_PUBLIC_HTTP_ADDR`.
The public REST listener read budgets are configured by:

- `GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_PUBLIC_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT` with default `1m`.

The public auth JSON contract uses a challenge-token flow:

- `send-email-code` accepts `email` and returns `challenge_id`;
- `confirm-email-code` accepts `challenge_id`, `code`, `client_public_key`, and `time_zone`, then returns `device_session_id`.

The JSON body for `send-email-code` remains unchanged, but the gateway may also consume the standard `Accept-Language` header on that route.
The gateway resolves the first supported BCP 47 language tag, falls back to `en` when needed, and forwards that derived preferred-language candidate to `Auth / Session Service` for localized auth mail and possible first-user creation.

`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public key for the device session being created.
`time_zone` is the client-selected IANA time zone name forwarded unchanged to `Auth / Session Service`.
The current create-path source of truth for `preferred_language` is the language candidate derived from public `Accept-Language`, with fallback to `en`.
The public `confirm-email-code` DTO itself remains unchanged.

These routes remain unauthenticated and delegate only through an injected `AuthServiceClient`.
The default wiring used by `cmd/gateway` keeps the routes mounted and returns `503 service_unavailable` until a concrete upstream auth adapter is supplied.

Public auth adapter calls are wrapped in `GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT`, which defaults to `3s`.
When that timeout expires, the gateway preserves the public REST contract and returns `503 service_unavailable`.

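
As an illustration of the `confirm-email-code` body described above, the following client-side sketch generates a device key pair and encodes the public key as standard base64. The Go type and function names are hypothetical and are not part of the gateway's code; only the JSON field names come from this README.

```go
package main

import (
	"crypto/ed25519"
	"encoding/base64"
	"encoding/json"
)

// confirmEmailCodeRequest carries the four JSON fields this README lists for
// confirm-email-code. The Go type name is hypothetical.
type confirmEmailCodeRequest struct {
	ChallengeID     string `json:"challenge_id"`
	Code            string `json:"code"`
	ClientPublicKey string `json:"client_public_key"`
	TimeZone        string `json:"time_zone"`
}

// newDeviceKey generates a fresh Ed25519 key pair for a new device session.
// Passing nil makes the standard library read from crypto/rand.
func newDeviceKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil)
}

// encodeClientPublicKey returns the standard (padded) base64 form of the raw
// 32-byte Ed25519 public key.
func encodeClientPublicKey(pub ed25519.PublicKey) string {
	return base64.StdEncoding.EncodeToString(pub)
}

// buildConfirmBody marshals the JSON request body for confirm-email-code.
func buildConfirmBody(challengeID, code, timeZone string, pub ed25519.PublicKey) ([]byte, error) {
	return json.Marshal(confirmEmailCodeRequest{
		ChallengeID:     challengeID,
		Code:            code,
		ClientPublicKey: encodeClientPublicKey(pub),
		TimeZone:        timeZone,
	})
}
```

The private key stays on the device; it is what later signs authenticated gRPC requests for the returned `device_session_id`.
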
When an injected auth adapter returns `*AuthServiceError`, the gateway projects that client-safe `4xx/5xx` status, `code`, and `message` back to the caller after normalizing blank or invalid fields.
Unexpected non-`AuthServiceError` adapter failures fail closed as `500 internal_error`.

Public anti-abuse is process-local and in-memory.
Per-IP buckets are derived only from the TCP peer `RemoteAddr`.
Forwarded proxy headers such as `X-Forwarded-For` and `Forwarded` are intentionally ignored.
Oversized public REST bodies are rejected with `413 request_too_large`.
Rate-limited requests are rejected with `429 rate_limited` and a `Retry-After` header.

In addition to the fixed endpoints above, the gateway may front browser bootstrap or asset traffic through a pluggable public handler or proxy.
That traffic belongs to dedicated public route classes and must not share rate limit buckets or abuse counters with the public auth API.

### Operational Admin Surface

The gateway may expose one private operational HTTP listener used for metrics.
The admin listener is disabled by default and is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.

When enabled, it serves:

- `GET /metrics`

The admin listener read budgets are configured by:

- `GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_ADMIN_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT` with default `1m`.

`/metrics` is intentionally not mounted on the public REST ingress.
It is also intentionally excluded from [`openapi.yaml`](openapi.yaml), because that specification covers only the public REST ingress.
The endpoint exposes metrics in the Prometheus text exposition format described in the official Prometheus documentation.

### Authenticated gRPC Surface

All authenticated client requests use HTTP/2 and gRPC.

The listener address is configured by `GATEWAY_AUTHENTICATED_GRPC_ADDR`.
Inbound authenticated gRPC connection setup is bounded by `GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT`, which defaults to `5s`.
The accepted client timestamp skew is configured by `GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW` and defaults to `5m`.

The public gRPC service exposes two methods:

- `ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)`
- `SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)`

`ExecuteCommand` is a generic unary RPC.
The gateway routes the request downstream by `message_type` after transport verification succeeds.
Downstream unary execution is bounded by `GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT`, which defaults to `5s`.
When that timeout expires, the gateway preserves the authenticated gRPC contract and returns gRPC `UNAVAILABLE` with message `downstream service is unavailable`.

`SubscribeEvents` is an authenticated server-streaming RPC.
It binds the stream to `user_id` and `device_session_id` and starts by sending a signed service event that includes the current server time in milliseconds.

The v1 protobuf contract lives in `proto/galaxy/gateway/v1/edge_gateway.proto` under package `galaxy.gateway.v1` and service `EdgeGateway`.
Generated Go bindings are committed under `proto/galaxy/gateway/v1/` and are regenerated with:

```bash
buf generate
```

The gateway validates the request envelope, device-session cache lookup, `payload_hash`, the client Ed25519 signature, timestamp freshness, replay reservation, authenticated rate limits, and the authenticated policy hook before any later routing or push step runs.

Malformed envelopes are rejected with gRPC `INVALID_ARGUMENT`.
Requests with a non-empty but unsupported `protocol_version` are rejected with gRPC `FAILED_PRECONDITION`.
The supported request `protocol_version` literal is `v1`.
Requests with an unknown `device_session_id` are rejected with gRPC `UNAUTHENTICATED`.
Requests for revoked sessions are rejected with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures, including Redis lookup or record-decode failures, are rejected with gRPC `UNAVAILABLE`.
Requests with a `payload_hash` that is not a 32-byte SHA-256 digest or does not match `payload_bytes` are rejected with gRPC `INVALID_ARGUMENT`.
Requests with an invalid client signature or a signature created by a different key are rejected with gRPC `UNAUTHENTICATED` and message `invalid request signature`.
Requests with malformed cached `client_public_key` material fail closed as gRPC `UNAVAILABLE`.
Requests with a `timestamp_ms` outside the symmetric freshness window around current server time are rejected with gRPC `FAILED_PRECONDITION` and message `request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id` inside the active replay window are rejected with gRPC `FAILED_PRECONDITION` and message `request replay detected`.
ReplayStore backend failures fail closed with gRPC `UNAVAILABLE` and message `replay store is unavailable`.

Authenticated rate limits are enforced independently by transport peer IP, authenticated `device_session_id`, authenticated `user_id`, and authenticated message class.
The gateway uses the full verified `message_type` literal as the stable v1 message-class key because the transport does not yet define a coarser authenticated class taxonomy.
The peer IP is derived only from the gRPC transport peer address; if it is missing or cannot be parsed, the request falls back to the stable `unknown` IP bucket.
Requests that exceed any authenticated rate-limit bucket are rejected with gRPC `RESOURCE_EXHAUSTED` and message `authenticated request rate limit exceeded`.

The authenticated edge policy hook runs after those rate limits and defaults to allow-all until a concrete policy evaluator is wired into the process.

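
The four independent authenticated buckets can be pictured with a small keyed token bucket. This is a minimal sketch, not the gateway's limiter: it assumes `burst` is the bucket capacity and `requests / window` is the refill rate, it takes time as plain milliseconds so it stays deterministic, and every type and function name is hypothetical.

```go
package main

import (
	"net/netip"
	"sync"
)

// bucket holds the remaining tokens for one key.
type bucket struct {
	tokens float64
	lastMs int64
}

// limiter is a keyed token bucket. Assumption: capacity equals burst and the
// refill rate equals requests per window.
type limiter struct {
	mu      sync.Mutex
	perMs   float64
	burst   float64
	buckets map[string]*bucket
}

func newLimiter(requests int, windowMs int64, burst int) *limiter {
	return &limiter{
		perMs:   float64(requests) / float64(windowMs),
		burst:   float64(burst),
		buckets: make(map[string]*bucket),
	}
}

// allow refills the bucket for key up to nowMs and spends one token if possible.
func (l *limiter) allow(key string, nowMs int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[key]
	if !ok {
		b = &bucket{tokens: l.burst, lastMs: nowMs}
		l.buckets[key] = b
	}
	if elapsed := nowMs - b.lastMs; elapsed > 0 {
		b.tokens += float64(elapsed) * l.perMs
		if b.tokens > l.burst {
			b.tokens = l.burst
		}
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// peerIPBucket derives the per-IP key from the transport peer address and
// falls back to the stable "unknown" bucket when it cannot be parsed.
func peerIPBucket(peerAddr string) string {
	ap, err := netip.ParseAddrPort(peerAddr)
	if err != nil {
		return "unknown"
	}
	return ap.Addr().String()
}

// authLimits groups the four independent dimensions named in this README.
type authLimits struct{ ip, session, user, class *limiter }

// defaultAuthLimits uses the documented defaults with a one-minute window.
func defaultAuthLimits() *authLimits {
	return &authLimits{
		ip:      newLimiter(120, 60_000, 40),
		session: newLimiter(60, 60_000, 20),
		user:    newLimiter(120, 60_000, 40),
		class:   newLimiter(60, 60_000, 20),
	}
}

// allow charges every dimension and accepts only when all of them accept.
// A rejection here would map to gRPC RESOURCE_EXHAUSTED at the edge.
func (a *authLimits) allow(peerAddr, sessionID, userID, messageType string, nowMs int64) bool {
	okIP := a.ip.allow(peerIPBucket(peerAddr), nowMs)
	okSession := a.session.allow(sessionID, nowMs)
	okUser := a.user.allow(userID, nowMs)
	okClass := a.class.allow(messageType, nowMs)
	return okIP && okSession && okUser && okClass
}
```

Whether a request rejected by one bucket should still consume tokens in the others is a design choice; the sketch charges all four for simplicity.
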
`ExecuteCommand` builds an internal authenticated command context, resolves one exact-match downstream route by the full verified `message_type` literal, executes the downstream unary client, and signs the response before it is returned to the caller.
When no exact downstream route is registered, `ExecuteCommand` is rejected with gRPC `UNIMPLEMENTED` and message `message_type is not routed`.
Downstream availability failures are rejected with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.
Unexpected downstream route-resolution or execution failures are rejected with gRPC `INTERNAL`.
Successful unary responses preserve the original `request_id`, carry a SHA-256 `payload_hash` of the returned `payload_bytes`, and are signed with the configured server Ed25519 response signer.

The default `cmd/gateway` wiring currently installs an empty static downstream router, so verified `ExecuteCommand` requests still return gRPC `UNIMPLEMENTED` until concrete downstream routes are injected.

`SubscribeEvents` applies the full authenticated ingress pipeline, binds the stream to the verified `user_id` and `device_session_id`, sends one signed `gateway.server_time` bootstrap event whose FlatBuffers payload carries `server_time_ms`, registers the active stream in the in-memory `PushHub`, and then forwards signed client-facing events consumed from the configured client event Redis stream.
User-targeted events fan out to every active stream for that user.
Session-targeted events fan out only to streams whose `user_id` and `device_session_id` both match the event target.
Each active stream uses a bounded in-memory queue; when that queue overflows, only the affected stream is closed with gRPC `RESOURCE_EXHAUSTED` and message `push stream overflowed`.

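
The overflow rule above can be sketched with a buffered channel and a non-blocking send. This is an illustrative stand-in for the in-memory `PushHub`, not its implementation; the type names, the per-user map, and the helper functions are all assumptions.

```go
package main

import (
	"errors"
	"sync"
)

// errPushStreamOverflowed mirrors the documented stream-close message.
var errPushStreamOverflowed = errors.New("push stream overflowed")

// pushStream is one active subscription with a bounded in-memory queue.
type pushStream struct {
	mu    sync.Mutex
	queue chan []byte
	err   error // set once when the stream is closed
}

func newPushStream(capacity int) *pushStream {
	return &pushStream{queue: make(chan []byte, capacity)}
}

// enqueue never blocks. A full queue closes only this stream, which would
// surface to the client as gRPC RESOURCE_EXHAUSTED.
func (s *pushStream) enqueue(event []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.err != nil {
		return s.err
	}
	select {
	case s.queue <- event:
		return nil
	default:
		s.err = errPushStreamOverflowed
		close(s.queue)
		return s.err
	}
}

// pushHub is a hypothetical registry of active streams keyed by user_id.
type pushHub struct {
	mu      sync.Mutex
	streams map[string][]*pushStream
}

func newPushHub() *pushHub {
	return &pushHub{streams: make(map[string][]*pushStream)}
}

func (h *pushHub) register(userID string, s *pushStream) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.streams[userID] = append(h.streams[userID], s)
}

// publishToUser fans an event out to every stream of one user, drops streams
// that overflowed, and returns how many streams were closed.
func (h *pushHub) publishToUser(userID string, event []byte) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	alive := h.streams[userID][:0]
	closed := 0
	for _, s := range h.streams[userID] {
		if err := s.enqueue(event); err != nil {
			closed++
			continue
		}
		alive = append(alive, s)
	}
	h.streams[userID] = alive
	return closed
}
```

Because the send is non-blocking, one slow consumer cannot stall fan-out to the other streams of the same user.
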
When the session lifecycle stream reports that the same `device_session_id` was revoked, every active `SubscribeEvents` stream bound to that exact session is closed with gRPC `FAILED_PRECONDITION` and message `device session is revoked`.
During gateway shutdown, the in-memory push hub is closed before gRPC graceful stop, and every active `SubscribeEvents` stream is terminated with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

Authenticated anti-abuse budgets are configured by the `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_*` environment variables.

Current authenticated gRPC defaults:

- per-IP: `120 requests / minute`, `burst=40`;
- per-session: `60 requests / minute`, `burst=20`;
- per-user: `120 requests / minute`, `burst=40`;
- per-message-class: `60 requests / minute`, `burst=20`.

Authenticated anti-abuse configuration surface:

- per-IP: `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS` default `120`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST` default `40`;
- per-session: `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS` default `60`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST` default `20`;
- per-user: `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS` default `120`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST` default `40`;
- per-message-class: `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS` default `60`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST` default `20`.

## Envelope and Payload Model

The authenticated transport uses a split contract:

- gRPC control messages are protobuf-based;
- business payload bytes are FlatBuffers;
- signatures are computed over canonical envelope fields and a hash of raw FlatBuffers bytes.

The gateway verifies authenticated payload bytes before any downstream call.
Most downstream routes may still treat those bytes as opaque, but the gateway is also allowed to transcode verified FlatBuffers payloads into trusted downstream REST/JSON calls when the concrete downstream contract requires it.

The current direct `Gateway -> User` self-service boundary uses that pattern:

- external message types:
  - `user.account.get`
  - `user.profile.update`
  - `user.settings.update`
- external payloads and responses:
  - FlatBuffers
- internal downstream transport:
  - strict REST/JSON to User Service
- business error projection:
  - gateway `result_code`
  - FlatBuffers error payload mirroring User Service `code` and `message`

The request envelope version literal is `v1`.

`payload_hash` is the raw 32-byte SHA-256 digest of `payload_bytes`.
`ExecuteCommand` hashes the raw FlatBuffers payload bytes exactly as sent, while `SubscribeEvents` with an empty payload still requires `sha256([]byte{})` rather than a special-case value.

The v1 request signature scheme is Ed25519.
`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public key registered during `confirm-email-code`.
`signature` carries the raw 64-byte Ed25519 signature computed over the canonical request signing input.

The v1 stream bootstrap payload uses the shared FlatBuffers schema `pkg/schema/fbs/gateway.fbs` with root table `gateway.ServerTimeEvent`.

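
The `payload_hash` and key-material rules above translate into a few standard-library calls. The sketch below is an assumption about how such checks could look, with hypothetical function names; it takes the canonical request signing input as an opaque byte slice because its field order is not spelled out in this section.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"errors"
)

// payloadHash returns the raw 32-byte SHA-256 digest of payloadBytes.
// An empty or nil payload still hashes to sha256 of zero bytes.
func payloadHash(payloadBytes []byte) []byte {
	sum := sha256.Sum256(payloadBytes)
	return sum[:]
}

// checkPayloadHash validates the digest length first and then compares it
// with the recomputed digest in constant time.
func checkPayloadHash(payloadBytes, claimed []byte) error {
	if len(claimed) != sha256.Size {
		return errors.New("payload_hash must be a 32-byte SHA-256 digest")
	}
	if subtle.ConstantTimeCompare(payloadHash(payloadBytes), claimed) != 1 {
		return errors.New("payload_hash does not match payload_bytes")
	}
	return nil
}

// decodeClientPublicKey parses the standard base64 form of the raw 32-byte
// Ed25519 public key stored for a device session.
func decodeClientPublicKey(encoded string) (ed25519.PublicKey, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	if len(raw) != ed25519.PublicKeySize {
		return nil, errors.New("client_public_key must decode to 32 bytes")
	}
	return ed25519.PublicKey(raw), nil
}

// verifyRequestSignature checks a raw 64-byte Ed25519 signature over the
// canonical request signing input, which the caller must build.
func verifyRequestSignature(pub ed25519.PublicKey, signingInput, signature []byte) bool {
	if len(signature) != ed25519.SignatureSize {
		return false
	}
	return ed25519.Verify(pub, signingInput, signature)
}
```

Checking lengths before calling `ed25519.Verify` matters because the standard library panics on a public key of the wrong size.
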
### ExecuteCommandRequest

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `trace_id`

### ExecuteCommandResponse

Required fields:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_bytes`
- `payload_hash`
- `signature`

The v1 unary response signature scheme is Ed25519 with response domain marker `galaxy-response-v1`.
The response signing input uses the same canonical binary encoding shape as the request signer:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))` followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-response-v1`, `protocol_version`, `request_id`, `timestamp_ms`, `result_code`, `payload_hash`.

`cmd/gateway` loads the unary response signer from `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must point to a PKCS#8 PEM-encoded Ed25519 private key.
Startup fails when the file is absent, unreadable, not strict PEM, not PKCS#8, or not Ed25519.

### SubscribeEventsRequest

The stream open request reuses the authenticated request model.
It contains the same authentication fields as the unary request and either an empty payload or a minimal connect payload.

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash`
- `signature`

Optional fields:

- `payload_bytes`
- `trace_id`

### GatewayEvent

Every stream event is a client-facing signed server message.

Required fields:

- `event_type`
- `event_id`
- `timestamp_ms`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `request_id`
- `trace_id`

The v1 stream-event signature scheme is Ed25519 with event domain marker `galaxy-event-v1`.
The event signing input uses the same canonical binary encoding shape as the request and unary response signers:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))` followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-event-v1`, `event_type`, `event_id`, `timestamp_ms`, `request_id`, `trace_id`, `payload_hash`.

The bootstrap event uses:

- `event_type = "gateway.server_time"`;
- `event_id = request_id` from the opening `SubscribeEvents` request;
- `payload_bytes` encoded as FlatBuffers `gateway.ServerTimeEvent` with `server_time_ms`;
- the same loaded Ed25519 signer configured by `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Client-facing fan-out events are sourced from the internal client event stream.
Internal publishers provide the event target and business payload only: `user_id`, optional `device_session_id`, `event_type`, `event_id`, `payload_bytes`, and optional `request_id` / `trace_id`.
The gateway derives `timestamp_ms`, recomputes `payload_hash`, signs the event, and only then forwards it to the matching `SubscribeEvents` streams.

## Verification and Routing Pipeline

The gateway applies one strict verification order to all authenticated gRPC ingress.

1. Parse the control envelope and validate required fields.
2. Check whether `protocol_version` is supported.
3. Resolve `device_session_id` through `SessionCache`.
4. Reject unknown or revoked sessions.
5. Verify that `payload_hash` matches raw `payload_bytes`.
6. Verify the client signature using the public key from session cache.
7. Verify that `timestamp_ms` is inside the accepted freshness window.
8. Verify anti-replay by checking `device_session_id + request_id`.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by `message_type`.

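
Steps 7 and 8 can be sketched as a symmetric window check followed by a keyed reservation. The TTL follows the formula given in the Replay Store section (`timestamp_ms + freshness_window - now`, clamped to a minimum positive duration). The in-memory map is only a stand-in for the Redis-backed `ReplayStore`, and the key layout, the 1 ms clamp, and all names here are assumptions.

```go
package main

import "sync"

// minReplayTTLMs is a hypothetical clamp; the README only requires a
// minimum positive duration.
const minReplayTTLMs int64 = 1

// isFresh reports whether timestampMs lies inside the symmetric freshness
// window around nowMs. The boundary itself is treated as accepted.
func isFresh(timestampMs, nowMs, windowMs int64) bool {
	diff := nowMs - timestampMs
	if diff < 0 {
		diff = -diff
	}
	return diff <= windowMs
}

// replayTTLMs computes timestamp_ms + freshness_window - now and clamps the
// result so boundary requests still reserve their replay key.
func replayTTLMs(timestampMs, nowMs, windowMs int64) int64 {
	ttl := timestampMs + windowMs - nowMs
	if ttl < minReplayTTLMs {
		return minReplayTTLMs
	}
	return ttl
}

// replayStore is an in-memory stand-in for the Redis-backed ReplayStore.
type replayStore struct {
	mu          sync.Mutex
	expiresAtMs map[string]int64
}

func newReplayStore() *replayStore {
	return &replayStore{expiresAtMs: make(map[string]int64)}
}

// reserve returns true when device_session_id + request_id was not reserved
// yet, or when the earlier reservation has already expired.
func (s *replayStore) reserve(deviceSessionID, requestID string, ttlMs, nowMs int64) bool {
	key := deviceSessionID + ":" + requestID // illustrative key layout only
	s.mu.Lock()
	defer s.mu.Unlock()
	if exp, ok := s.expiresAtMs[key]; ok && exp > nowMs {
		return false // would map to "request replay detected"
	}
	s.expiresAtMs[key] = nowMs + ttlMs
	return true
}
```

Tying the TTL to the request timestamp keeps each key alive only as long as the same request could still pass the freshness check.
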

No downstream business service should receive a request that has not passed this full verification pipeline.

`ExecuteCommand` enforces steps 1 through 11 and signs the successful unary response afterward.
`SubscribeEvents` enforces steps 1 through 9, binds the verified stream identity, sends the initial signed server-time bootstrap event, and then keeps the stream open for push delivery.

Malformed envelopes fail with gRPC `INVALID_ARGUMENT`.
Unsupported non-empty `protocol_version` values fail with gRPC `FAILED_PRECONDITION`.
Unknown sessions fail with gRPC `UNAUTHENTICATED`.
Revoked sessions fail with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures fail with gRPC `UNAVAILABLE`.
`payload_hash` values that are not raw 32-byte SHA-256 digests fail with gRPC `INVALID_ARGUMENT` and message `payload_hash must be a 32-byte SHA-256 digest`.
`payload_hash` values that do not match `payload_bytes` fail with gRPC `INVALID_ARGUMENT` and message `payload_hash does not match payload_bytes`.
Invalid request signatures fail with gRPC `UNAUTHENTICATED` and message `invalid request signature`.
Malformed cached `client_public_key` values fail closed with gRPC `UNAVAILABLE` and message `session cache is unavailable`.
Requests with a `timestamp_ms` outside the accepted freshness window fail with gRPC `FAILED_PRECONDITION` and message `request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id` inside the active replay window fail with gRPC `FAILED_PRECONDITION` and message `request replay detected`.
ReplayStore backend failures fail with gRPC `UNAVAILABLE` and message `replay store is unavailable`.
Unrouted exact-match `message_type` values fail with gRPC `UNIMPLEMENTED` and message `message_type is not routed`.
Downstream availability failures fail with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.

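
For the response signing step that follows a successful `ExecuteCommand`, the canonical encoding described under `ExecuteCommandResponse` can be sketched as below. Treat it as one reading of that description, not the gateway's signer: it assumes the domain marker is length-prefixed like any other string field, that `result_code` is a string, and that `timestamp_ms` is handled as an unsigned 64-bit value. All function names are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"encoding/binary"
)

// responseDomainMarker is the documented v1 response domain marker.
const responseDomainMarker = "galaxy-response-v1"

// appendField writes uvarint(len(field)) followed by the raw field bytes.
func appendField(dst, field []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(field)))
	return append(dst, field...)
}

// responseSigningInput encodes the signed fields in the documented order:
// marker, protocol_version, request_id, timestamp_ms, result_code, payload_hash.
func responseSigningInput(protocolVersion, requestID string, timestampMs uint64, resultCode string, payloadHash []byte) []byte {
	var buf []byte
	buf = appendField(buf, []byte(responseDomainMarker)) // assumption: marker is length-prefixed
	buf = appendField(buf, []byte(protocolVersion))
	buf = appendField(buf, []byte(requestID))
	buf = binary.BigEndian.AppendUint64(buf, timestampMs) // fixed 8 bytes, no length prefix
	buf = appendField(buf, []byte(resultCode))
	buf = appendField(buf, payloadHash)
	return buf
}

// newDemoSigner generates a throwaway key pair for experiments; the gateway
// itself loads its key from a PKCS#8 PEM file.
func newDemoSigner() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(nil)
}

// signResponse returns the raw 64-byte Ed25519 signature over input.
func signResponse(priv ed25519.PrivateKey, input []byte) []byte {
	return ed25519.Sign(priv, input)
}

// verifyResponse is what a client would run with the gateway's public key.
func verifyResponse(pub ed25519.PublicKey, input, signature []byte) bool {
	return ed25519.Verify(pub, input, signature)
}
```

Length-prefixing every variable-length field keeps the encoding unambiguous, so two different field tuples cannot serialize to the same signed bytes.
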
## Internal Authenticated Contract

Downstream services should receive an internal authenticated command rather than raw external gRPC transport data.

The minimum authenticated context is:

- `user_id`
- `device_session_id`
- `message_type`
- verified `payload_bytes`
- `request_id`
- optional `trace_id`
- optional client metadata needed for logs and tracing

Downstream services may trust that the gateway has already performed transport authentication, freshness verification, and anti-replay checks.
They must still perform business authorization and domain validation.

## Session Model

The Auth / Session Service is the source of truth for device session state.
The gateway is designed to authenticate the hot path from cache.

Expected session fields available to the gateway:

- `device_session_id`
- `user_id`
- base64-encoded raw 32-byte Ed25519 client public key
- session status
- revoke metadata
- optional client metadata

### Session Cache

`SessionCache` provides the fast path for:

- session existence checks;
- `device_session_id -> user_id`;
- access to the base64-encoded raw Ed25519 client public key used for signature verification;
- revoked versus active status checks.

Cache updates are event-driven.
TTL is allowed only as a safety net and must not replace invalidation events.

The gateway keeps a process-local in-memory snapshot cache in front of the Redis fallback backend.
Authenticated requests read the local snapshot first.
A local miss performs one bounded Redis lookup and seeds the local snapshot so later requests for the same session avoid another Redis round-trip unless a later session event changes the cached state.

The local snapshot cache intentionally has no TTL and no size-based eviction policy.
Session lifecycle events are the authoritative mechanism for keeping the hot path current, while Redis fallback remains the safety net for cold misses and process restarts.

The Redis fallback implementation uses `go-redis/v9`.
`cmd/gateway` requires the Redis fallback backend during startup, issues a bounded `PING`, and refuses to start when Redis is misconfigured or unavailable.

Required environment variable:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`

Optional environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_USERNAME`
- `GATEWAY_SESSION_CACHE_REDIS_PASSWORD`
- `GATEWAY_SESSION_CACHE_REDIS_DB` with default `0`
- `GATEWAY_SESSION_CACHE_REDIS_KEY_PREFIX` with default `gateway:session:`
- `GATEWAY_SESSION_CACHE_REDIS_LOOKUP_TIMEOUT` with default `250ms`
- `GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED` with default `false`

The Redis key format is:

- ``

The Redis value is one strict JSON object:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

`client_public_key` stores the standard base64-encoded raw 32-byte Ed25519 public key registered for the device session.
Malformed JSON, missing required fields, unsupported `status`, or a `device_session_id` mismatch between the Redis value and the lookup key are treated as SessionCache backend failures rather than as valid session states.

### Session Event Stream

The gateway keeps the process-local session snapshot cache synchronized from one Redis Stream consumed through `go-redis/v9`.
`cmd/gateway` requires the session event stream configuration during startup, issues a bounded `PING` against the same Redis deployment used for `SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical database, timeout, and TLS settings configured for `SessionCache`.
Each gateway replica keeps its own in-memory last-seen stream ID and consumes the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that point, which preserves the same fresh-process semantics as Redis `$` while avoiding a race before the first blocking read.

The session event payload is one strict full snapshot with these fields:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

Valid active and revoked snapshots upsert or replace the local session state.
Later stream entries win.
Malformed events are skipped without stopping the subscriber; when `device_session_id` can still be extracted, the gateway evicts the local snapshot for that session so it cannot continue using stale state.

Session event publishers must keep the stream bounded by using `XADD ... MAXLEN ~ ` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side, because consumer-side trimming could drop updates that another gateway replica has not read yet.

### Client Event Stream

The gateway delivers client-facing push events from one dedicated Redis Stream consumed through `go-redis/v9`.
`cmd/gateway` requires the client event stream configuration during startup, issues a bounded `PING` against the same Redis deployment used for `SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical database, timeout, and TLS settings configured for `SessionCache`.
Each gateway replica keeps its own in-memory last-seen stream ID and consumes the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that point, which preserves the same fresh-process semantics as Redis `$` while avoiding a race before the first blocking read.

The client event payload is one strict target-plus-payload entry with these fields:

- `user_id`
- optional `device_session_id`
- `event_type`
- `event_id`
- `payload_bytes`
- optional `request_id`
- optional `trace_id`

`payload_bytes` carries the raw binary-safe business payload bytes for the outbound client event.
When `device_session_id` is absent or blank, the gateway fans the event out to every active stream for `user_id`.
When `device_session_id` is present, the gateway fans the event out only to active streams whose `user_id` and `device_session_id` both match.
Malformed client event entries are skipped without stopping the subscriber or delivering partial data to clients.

Client event publishers must keep the stream bounded by using `XADD ... MAXLEN ~ ` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side, because consumer-side trimming could drop updates that another gateway replica has not read yet.

### Replay Store

`ReplayStore` provides the hot-path anti-replay reservation for:

- duplicate detection by `device_session_id + request_id`;
- bounded replay protection for the authenticated freshness window.

The ReplayStore uses Redis through `go-redis/v9`.
`cmd/gateway` requires the ReplayStore backend during startup, issues a bounded `PING`, and refuses to start when Redis is misconfigured or unavailable.

The ReplayStore reuses the same Redis deployment settings as `SessionCache` and adds two replay-specific environment variables:

- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`

Replay keys use this format:

- `:`

For each accepted request, the replay reservation TTL is computed as:

- `timestamp_ms + freshness_window - now`

The TTL is clamped to a minimum positive duration so requests accepted exactly on the freshness boundary still reserve their replay key.

### Revocation Behavior

When a device session is revoked:

1. the Auth / Session Service updates the source of truth;
2. it publishes a session update or revoke event;
3. the gateway invalidates or updates `SessionCache`;
4. new unary gRPC requests for that session are rejected;
5. active `SubscribeEvents` streams for that exact `device_session_id` are closed with gRPC `FAILED_PRECONDITION` and message `device session is revoked`.

## Public Anti-Abuse Model

The public REST layer must distinguish between public auth operations and browser-originated traffic that may burst during a normal first page load.

The gateway uses these public route classes:

- `public_auth`
- `browser_bootstrap`
- `browser_asset`
- `public_misc`

Any classifier result outside this fixed set is normalized to `public_misc` before the class is stored in request context or used for policy derivation.
The canonical base bucket namespace for public REST policy is `public_rest/class=`.

### Public Auth

`public_auth` is the stable route class for `send-email-code` and `confirm-email-code`.
This class uses stricter limits and abuse scoring because it directly touches account and session creation flows.

Controls include:

- per-IP and per-identity rate limits;
- request body size limits;
- method allow-lists;
- malformed request counters;
- elevated logging and security telemetry for repeated failures.

Current defaults:

- per-IP: `30 requests / minute`, `burst=10`;
- `send-email-code` identity buckets: `3 requests / 10 minutes`, `burst=1`, keyed by normalized `email`;
- `confirm-email-code` identity buckets: `6 requests / 10 minutes`, `burst=2`, keyed by normalized `challenge_id`;
- maximum request body size: `8192` bytes;
- only `POST` is accepted for public auth routes.

Configuration surface:

- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES` default `8192`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS` default `30`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW` default `1m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST` default `10`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS` default `3`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW` default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST` default `1`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS` default `6`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW` default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST` default `2`.

### Browser Bootstrap and Asset Traffic

`browser_bootstrap` and `browser_asset` use separate coarse-grained budgets. They may exhibit bursty behavior during the first load and therefore must not be treated as hostile based on burst pattern alone.

This traffic is still constrained by:

- dedicated rate limits;
- method allow-lists;
- body size limits where request bodies are expected;
- protocol and path validation;
- independent abuse telemetry.

The gateway must not merge these buckets or counters with `public_auth`.

Current defaults:

- `browser_bootstrap`: `60 requests / minute`, `burst=20`, `GET` and `HEAD` only, and no request body;
- `browser_asset`: `300 requests / minute`, `burst=80`, `GET` and `HEAD` only, and no request body;
- `public_misc`: `30 requests / minute`, `burst=10`, and no request body.
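The per-class defaults and the `public_misc` normalization rule can be summarized in a small lookup. This is a sketch, not the gateway's actual types: `classPolicy`, `defaultPolicies`, `normalizeClass`, and `bucketNamespace` are hypothetical names, and appending the class name after `public_rest/class=` is an assumption about how the namespace is completed.

```go
package main

import (
	"fmt"
	"time"
)

// classPolicy is a hypothetical container for one route class budget.
type classPolicy struct {
	Requests     int
	Window       time.Duration
	Burst        int
	MaxBodyBytes int64
}

// defaultPolicies mirrors the documented per-class defaults
// (per-IP values for public_auth).
var defaultPolicies = map[string]classPolicy{
	"public_auth":       {Requests: 30, Window: time.Minute, Burst: 10, MaxBodyBytes: 8192},
	"browser_bootstrap": {Requests: 60, Window: time.Minute, Burst: 20, MaxBodyBytes: 0},
	"browser_asset":     {Requests: 300, Window: time.Minute, Burst: 80, MaxBodyBytes: 0},
	"public_misc":       {Requests: 30, Window: time.Minute, Burst: 10, MaxBodyBytes: 0},
}

// normalizeClass maps any classifier output outside the fixed set,
// including the empty string, to public_misc.
func normalizeClass(class string) string {
	if _, ok := defaultPolicies[class]; ok {
		return class
	}
	return "public_misc"
}

// bucketNamespace derives the base bucket namespace from the normalized
// class. Assumption: the class name is appended to the documented prefix.
func bucketNamespace(class string) string {
	return "public_rest/class=" + normalizeClass(class)
}

func main() {
	fmt.Println(bucketNamespace("browser_asset")) // public_rest/class=browser_asset
	fmt.Println(bucketNamespace("something_new")) // public_rest/class=public_misc
	fmt.Println(defaultPolicies[normalizeClass("")].Burst)
}
```

Normalizing before the policy lookup means an unexpected classifier result can never select a missing or zero-valued policy.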
Configuration surface:

- `browser_bootstrap`: `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES` default `0`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS` default `60`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST` default `20`;
- `browser_asset`: `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES` default `0`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS` default `300`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST` default `80`;
- `public_misc`: `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES` default `0`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS` default `30`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW` default `1m`, `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST` default `10`.

## Push Delivery Model

The v1 push channel is a gRPC server stream. Long-polling is intentionally out of scope for the first version.

Expected stream behavior:

1. the client opens `SubscribeEvents`;
2. the gateway applies the full authenticated ingress verification pipeline;
3. the stream is bound to `user_id` and `device_session_id`;
4. the first signed service event is `gateway.server_time` and its FlatBuffers payload includes `server_time_ms`;
5. after that bootstrap event, the stream is registered in `PushHub` and remains open until client cancellation, server shutdown, queue overflow, session revoke for the same `device_session_id`, or a later send failure;
6. internal pub/sub may target all active streams for one `user_id` or only one `device_session_id` within that user;
7. the current per-stream in-memory queue capacity is `64` events and overflow closes only the affected stream;
8. session revoke closes only streams bound to the same exact `device_session_id` and returns gRPC `FAILED_PRECONDITION` with message `device session is revoked`.

## Lifecycle and Shutdown

Gateway process shutdown is coordinated across the public REST listener, authenticated gRPC listener, optional admin listener, internal Redis subscribers, and telemetry runtime.

`GATEWAY_SHUTDOWN_TIMEOUT` configures the per-component graceful shutdown budget and defaults to `5s`.

During authenticated gRPC shutdown, the in-memory `PushHub` closes active streams before gRPC graceful stop, so active `SubscribeEvents` calls terminate with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

## Recommended Package Layout

The package layout keeps transport, policy, and downstream adapters separate:

- `cmd/gateway`
- `internal/app`
- `internal/config`
- `internal/restapi`
- `internal/grpcapi`
- `internal/authn`
- `internal/session`
- `internal/replay`
- `internal/ratelimit`
- `internal/downstream`
- `internal/push`
- `internal/events`
- `internal/clock`

## Key Interfaces

The gateway should be built around explicit consumer-side interfaces.

### SessionCache

Provides cached session lookup by `device_session_id`. Returns enough data to verify signatures and identify the authenticated user.

The current production implementation is a process-local read-through cache in front of a Redis fallback adapter that uses strict JSON records under a configurable key prefix.

### ReplayStore

Tracks recently seen `request_id` values per device session and rejects replayed requests inside the accepted freshness window.

The current production adapter is Redis-backed, uses a dedicated configurable key prefix, and reserves keys with a TTL derived from `timestamp_ms + freshness_window - now`.
### RateLimiter

Applies independent policies for:

- public REST route classes;
- authenticated gRPC requests by IP;
- authenticated gRPC requests by session;
- authenticated gRPC requests by user;
- authenticated gRPC requests by message class.

The current rate limiter is process-local and in-memory. Public REST keys stay under the `public_rest/...` namespace, while authenticated gRPC keys stay under `authenticated_grpc/...`, so both traffic surfaces keep independent buckets even when they share the same limiter backend.

### PublicTrafficClassifier

Maps incoming public REST requests to one of the public route classes so that limits and anti-abuse counters remain isolated.

The gateway normalizes any unsupported or empty classifier output to `public_misc`, and public policy code derives the base bucket namespace from the normalized class as `public_rest/class=`.

### AuthServiceClient

Handles public auth commands and session-related updates exchanged with the Auth / Session Service.

The gateway contract is:

- `SendEmailCode(email) -> challenge_id`
- `ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id`

When no concrete implementation is wired, the gateway keeps the public routes available and returns a stable `503 service_unavailable` response instead of failing process startup.

### DownstreamRouter

Resolves the target downstream service or adapter by the full exact-match `message_type` literal.

The default `cmd/gateway` wiring keeps the reserved `user.*` self-service message types mounted even when `GATEWAY_USER_SERVICE_BASE_URL` is unset. In that configuration they fail closed as dependency-unavailable instead of falling through to a generic route miss.

### DownstreamClient

Executes a verified authenticated command against a downstream internal service and returns response payload bytes plus a stable opaque result code. An empty or whitespace-only result code is treated as an internal downstream contract violation.
Downstream clients may be pure pass-through adapters or gateway-owned transcoding adapters. The current User Service adapter decodes authenticated FlatBuffers payloads, calls the trusted internal REST API, and re-encodes the result into FlatBuffers before the signed gateway response is emitted.

### EventSubscriber

Subscribes to internal pub/sub topics used for:

- session cache updates;
- revocations;
- client-facing event delivery.

The implementation consumes two Redis Streams with replica-safe plain `XREAD`: one strict full-session snapshot stream for the process-local session cache and one client-facing event stream for live push fan-out.

### PushHub

Tracks active `SubscribeEvents` streams, binds them to authenticated identities, and delivers events to the correct connections.

The implementation uses one bounded in-memory queue per stream with a default capacity of `64` events; overflowing one queue closes only that stream and leaves the remaining streams active.

### ResponseSigner

Signs unary responses and stream events so clients can verify server-originated messages.

The implementation uses one Ed25519 signer loaded from `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must reference a PKCS#8 PEM-encoded private key.

### Clock

Provides current server time and supports consistent freshness-window checks.

## Error Model and Observability

The gateway should expose stable edge-level error classes instead of leaking internal implementation details.

Minimum error categories:

- malformed request;
- request too large;
- unsupported protocol;
- unknown session;
- revoked session;
- invalid signature;
- stale request;
- replay detected;
- rate limited;
- policy denied;
- downstream unavailable;
- backend unavailable;
- gateway shutting down;
- internal error.
Observability requirements:

- stable correlation identifiers, including `request_id` and optional `trace_id`;
- structured logs;
- security audit events for rejects and abuse signals;
- metrics keyed by route class, message type, result code, and reject reason;
- no logging of secrets, raw private material, or raw signatures.

The service uses:

- `go.uber.org/zap` for structured JSON logs;
- `otelgin` for the public REST listener;
- `otelgrpc` for the authenticated gRPC listener;
- OpenTelemetry metrics exported through Prometheus on the optional admin `/metrics` listener.

Current custom metric families:

- `gateway.public_http.requests`
- `gateway.public_http.duration`
- `gateway.authenticated_grpc.requests`
- `gateway.authenticated_grpc.duration`
- `gateway.push.active_streams`
- `gateway.push.stream_closures`
- `gateway.internal_event_drops`

The process-wide log level is configured by `GATEWAY_LOG_LEVEL` and defaults to `info`. The default OpenTelemetry resource uses `service.name=galaxy-edge-gateway` when `OTEL_SERVICE_NAME` is unset.

If `OTEL_TRACES_EXPORTER` is unset or set to `none`, the gateway keeps tracing runtime enabled but installs no external trace exporter. If `OTEL_TRACES_EXPORTER=otlp`, the gateway uses the standard `OTEL_EXPORTER_OTLP_*` environment variables to configure the OTLP trace exporter protocol and endpoint. The protocol selection specifically honors `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` first and falls back to `OTEL_EXPORTER_OTLP_PROTOCOL` when the trace-specific variable is unset. Supported values are `http/protobuf` and `grpc`; when both variables are unset, the gateway defaults to `http/protobuf`.

Structured logs intentionally omit:

- public auth e-mail addresses, login codes, and challenge IDs;
- client public keys;
- raw payload bytes and payload hashes;
- raw request or response signatures;
- response-signer private key material and Redis credentials.
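The OTLP trace protocol precedence described earlier in this section can be sketched as a small resolver. The function name `otlpTraceProtocol` is hypothetical. Treating an empty value the same as an unset variable, and returning an error for unsupported values, are both assumptions, because the document does not say how either case is handled.

```go
package main

import (
	"fmt"
	"os"
)

// otlpTraceProtocol resolves the OTLP trace exporter protocol. The
// trace-specific variable wins, then the generic one, then the documented
// default of http/protobuf. getenv is injected so the logic can be tested.
func otlpTraceProtocol(getenv func(string) string) (string, error) {
	protocol := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if protocol == "" {
		protocol = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	if protocol == "" {
		protocol = "http/protobuf"
	}
	switch protocol {
	case "http/protobuf", "grpc":
		return protocol, nil
	default:
		// Assumption: unsupported values are rejected rather than ignored.
		return "", fmt.Errorf("unsupported OTLP trace protocol %q", protocol)
	}
}

func main() {
	protocol, err := otlpTraceProtocol(os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("OTLP trace protocol:", protocol)
}
```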
Malformed internal session and client-event stream entries are no longer silently dropped: the gateway logs the drop and increments `gateway.internal_event_drops`.

## Non-Goals

The gateway is not a business authorization layer and must not grow into a domain coordinator.

The gateway must not:

- implement business ownership checks;
- validate domain state transitions;
- replace the Auth / Session Service as the session source of truth;
- degrade into a synchronous pass-through that reloads session state for every authenticated request.