# Edge Gateway

## Run and Dependencies

`cmd/gateway` starts with built-in listener defaults, but it still requires:

- one reachable Redis deployment for session lookup, replay reservations, and
  both internal event streams;
- one configured session event stream via `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
- one configured client event stream via `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Required startup environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`
- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`

Optional integrations:

- `GATEWAY_ADMIN_HTTP_ADDR` enables the private `/metrics` listener;
- an injected `AuthServiceClient` enables real public auth handling;
- injected downstream routes are required for successful `ExecuteCommand`.

Operational caveats:

- public auth routes stay mounted and return `503 service_unavailable` until an
  auth adapter is wired;
- authenticated gRPC starts without downstream routes, but `ExecuteCommand`
  returns gRPC `UNIMPLEMENTED` until routing is configured.

Additional module docs:

- [Public REST contract](openapi.yaml)
- [Documentation index](docs/README.md)
- [Runtime and components](docs/runtime.md)
- [Request and push flows](docs/flows.md)
- [Operator runbook](docs/runbook.md)
- [Configuration and contract examples](docs/examples.md)
- [Example `.env`](.env.example)

## Purpose

`Edge Gateway` is the only public ingress for Galaxy Plus clients.
It terminates the external transport and security boundary, enforces edge
policies, and routes verified requests to internal services.

The gateway does not implement domain-specific business logic.
Business validation, authorization, ownership checks, and state transitions
remain inside downstream services.

## Trust Boundary

The gateway sits between untrusted external clients and trusted internal
services.

The gateway is responsible for:

- parsing external transport requests;
- classifying public REST traffic;
- authenticating protected gRPC traffic;
- loading session state from cache;
- verifying request freshness and anti-replay constraints;
- applying edge rate limits and anti-abuse policy;
- building an authenticated internal command context;
- routing verified commands to internal services;
- maintaining authenticated push delivery connections.

The gateway is not responsible for:

- deciding whether a user is allowed to execute a business action;
- validating domain invariants;
- storing the source-of-truth session record;
- implementing business idempotency.

## Transport Matrix

The gateway exposes two external transport classes.

| Transport | Audience | Authentication | Payload format | Primary use |
| --- | --- | --- | --- | --- |
| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |

### Public REST Surface

The public REST surface is used for commands that must work before a device
session exists and for browser-originated traffic that may share the same edge.
It covers the probe endpoints, public auth routes, and coarse public
anti-abuse.

Currently implemented public endpoints:

- `GET /healthz`
- `GET /readyz`
- `POST /api/v1/public/auth/send-email-code`
- `POST /api/v1/public/auth/confirm-email-code`

The implemented REST contract is documented in [`openapi.yaml`](openapi.yaml).
The listener address is configured by `GATEWAY_PUBLIC_HTTP_ADDR`.
The public REST listener read and idle timeouts are configured by:

- `GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_PUBLIC_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT` with default `1m`.

The public auth JSON contract uses a challenge-token flow:

- `send-email-code` accepts `email` and returns `challenge_id`;
- `confirm-email-code` accepts `challenge_id`, `code`,
  `client_public_key`, and `time_zone`, then returns
  `device_session_id`.

`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key for the device session being created.
`time_zone` is the client-selected IANA time zone name forwarded unchanged to
`Auth / Session Service`.
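As a rough illustration of the `client_public_key` encoding, the following Go sketch produces and validates the standard base64 form of a raw 32-byte Ed25519 public key. The helper names are hypothetical and are not taken from the gateway code.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
)

// encodeClientPublicKey returns the standard (padded) base64 encoding of a raw
// 32-byte Ed25519 public key. Hypothetical helper for illustration only.
func encodeClientPublicKey(pub []byte) (string, error) {
	if len(pub) != ed25519.PublicKeySize {
		return "", errors.New("public key must be exactly 32 bytes")
	}
	return base64.StdEncoding.EncodeToString(pub), nil
}

// decodeClientPublicKey reverses encodeClientPublicKey and rejects any value
// that does not decode to exactly 32 bytes.
func decodeClientPublicKey(encoded string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("decode client_public_key: %w", err)
	}
	if len(raw) != ed25519.PublicKeySize {
		return nil, errors.New("client_public_key must decode to 32 bytes")
	}
	return raw, nil
}

func main() {
	// Generate a throwaway key pair and print the value a client would send.
	pub, _, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	encoded, err := encodeClientPublicKey(pub)
	if err != nil {
		panic(err)
	}
	fmt.Println(encoded)
}
```

A 32-byte key always encodes to 44 base64 characters, which gives clients a cheap sanity check before calling `confirm-email-code`.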

These routes remain unauthenticated and delegate only through an injected
`AuthServiceClient`.
The default wiring used by `cmd/gateway` keeps the routes mounted and returns
`503 service_unavailable` until a concrete upstream auth adapter is supplied.
Public auth adapter calls are wrapped in
`GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT`, which defaults to `3s`.
When that timeout expires, the gateway preserves the public REST contract and
returns `503 service_unavailable`.
When an injected auth adapter returns `*AuthServiceError`, the gateway projects
that client-safe `4xx/5xx` status, `code`, and `message` back to the caller
after normalizing blank or invalid fields. Unexpected non-`AuthServiceError`
adapter failures fail closed as `500 internal_error`.

Public anti-abuse is process-local and in-memory.
Per-IP buckets are derived only from the TCP peer `RemoteAddr`.
Forwarded proxy headers such as `X-Forwarded-For` and `Forwarded` are
intentionally ignored.
Oversized public REST bodies are rejected with `413 request_too_large`.
Rate-limited requests are rejected with `429 rate_limited` and a
`Retry-After` header.
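The per-IP bucket derivation described above can be sketched as follows. This is a minimal example, not the gateway's implementation: `peerIPBucket` is a hypothetical name, and the `unknown` fallback is borrowed from the authenticated gRPC description as an assumption, since this section does not say how the public REST layer handles an unparseable `RemoteAddr`.

```go
package main

import (
	"fmt"
	"net"
)

// unknownPeerBucket is an assumed fallback key for peers whose address
// cannot be parsed.
const unknownPeerBucket = "unknown"

// peerIPBucket derives a rate-limit key from the TCP peer address only.
// It takes the value of http.Request.RemoteAddr and never looks at
// X-Forwarded-For or Forwarded headers.
func peerIPBucket(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return unknownPeerBucket
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return unknownPeerBucket
	}
	// ip.String() gives one canonical text form per address.
	return ip.String()
}

func main() {
	fmt.Println(peerIPBucket("203.0.113.7:54321"))
	fmt.Println(peerIPBucket("[2001:db8::1]:443"))
	fmt.Println(peerIPBucket("not-an-address"))
}
```

Because forwarded headers are ignored, a deployment behind a reverse proxy would see every client under the proxy's address, so the limits only separate clients when the gateway terminates client connections directly.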

In addition to the fixed endpoints above, the gateway may front browser
bootstrap or asset traffic through a pluggable public handler or proxy.
That traffic belongs to dedicated public route classes and must not share rate
limit buckets or abuse counters with the public auth API.

### Operational Admin Surface

The gateway may expose one private operational HTTP listener used for metrics.

The admin listener is disabled by default and is enabled only when
`GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
When enabled, it serves:

- `GET /metrics`

The admin listener read and idle timeouts are configured by:

- `GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_ADMIN_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT` with default `1m`.

`/metrics` is intentionally not mounted on the public REST ingress.
It is also intentionally excluded from [`openapi.yaml`](openapi.yaml), because
that specification covers only the public REST ingress.
The endpoint exposes metrics in the Prometheus text exposition format described
in the official Prometheus documentation:
<https://prometheus.io/docs/instrumenting/exposition_formats/>.

### Authenticated gRPC Surface

All authenticated client requests use HTTP/2 and gRPC.
The listener address is configured by `GATEWAY_AUTHENTICATED_GRPC_ADDR`.
Inbound authenticated gRPC connection setup is bounded by
`GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT`, which defaults to `5s`.
The accepted client timestamp skew is configured by
`GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW` and defaults to `5m`.

The authenticated gRPC service exposes two methods:

- `ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)`
- `SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)`

`ExecuteCommand` is a generic unary RPC.
The gateway routes the request downstream by `message_type` after transport
verification succeeds.
Downstream unary execution is bounded by
`GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT`, which defaults to `5s`.
When that timeout expires, the gateway preserves the authenticated gRPC
contract and returns gRPC `UNAVAILABLE` with message
`downstream service is unavailable`.

`SubscribeEvents` is an authenticated server-streaming RPC.
It binds the stream to `user_id` and `device_session_id` and starts by sending
a signed service event that includes the current server time in milliseconds.

The v1 protobuf contract lives in
`proto/galaxy/gateway/v1/edge_gateway.proto` under package
`galaxy.gateway.v1` and service `EdgeGateway`.
Generated Go bindings are committed under `proto/galaxy/gateway/v1/` and are
regenerated with:

```bash
buf generate
```

Before any routing or push step runs, the gateway validates the request
envelope, resolves the device session from cache, and checks `payload_hash`,
the client Ed25519 signature, timestamp freshness, the replay reservation,
authenticated rate limits, and the authenticated policy hook.
The supported request `protocol_version` literal is `v1`.

Verification failures map to gRPC status codes as follows:

- malformed envelopes are rejected with `INVALID_ARGUMENT`;
- a non-empty but unsupported `protocol_version` is rejected with
  `FAILED_PRECONDITION`;
- an unknown `device_session_id` is rejected with `UNAUTHENTICATED`;
- a revoked session is rejected with `FAILED_PRECONDITION`;
- SessionCache backend failures, including Redis lookup or record-decode
  failures, are rejected with `UNAVAILABLE`;
- a `payload_hash` that is not a 32-byte SHA-256 digest or does not match
  `payload_bytes` is rejected with `INVALID_ARGUMENT`;
- an invalid client signature, or a signature created by a different key, is
  rejected with `UNAUTHENTICATED` and message `invalid request signature`;
- malformed cached `client_public_key` material fails closed as `UNAVAILABLE`;
- a `timestamp_ms` outside the symmetric freshness window around current
  server time is rejected with `FAILED_PRECONDITION` and message
  `request timestamp is outside the freshness window`;
- a `request_id` reused for the same `device_session_id` inside the active
  replay window is rejected with `FAILED_PRECONDITION` and message
  `request replay detected`;
- ReplayStore backend failures fail closed with `UNAVAILABLE` and message
  `replay store is unavailable`.

Authenticated rate limits are enforced independently by transport peer IP,
authenticated `device_session_id`, authenticated `user_id`, and authenticated
message class. The gateway uses the full verified `message_type` literal as the
stable v1 message-class key because the transport does not yet define a
coarser authenticated class taxonomy. The peer IP is derived only from the
gRPC transport peer address; if it is missing or cannot be parsed, the
request falls back to the stable `unknown` IP bucket.
Requests that exceed any authenticated rate-limit bucket are rejected with
gRPC `RESOURCE_EXHAUSTED` and message
`authenticated request rate limit exceeded`.
The authenticated edge policy hook runs after those rate limits and defaults
to allow-all until a concrete policy evaluator is wired into the process.

`ExecuteCommand` builds an internal authenticated command context,
resolves one exact-match downstream route by the full verified `message_type`
literal, executes the downstream unary client, and signs the response before
it is returned to the caller. When no exact downstream route is registered,
`ExecuteCommand` is rejected with gRPC `UNIMPLEMENTED` and message
`message_type is not routed`. Downstream availability failures are rejected
with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.
Unexpected downstream route-resolution or execution failures are rejected with
gRPC `INTERNAL`. Successful unary responses preserve the original
`request_id`, carry a SHA-256 `payload_hash` of the returned `payload_bytes`,
and are signed with the configured server Ed25519 response signer.
The default `cmd/gateway` wiring currently installs an empty static
downstream router, so verified `ExecuteCommand` requests still return gRPC
`UNIMPLEMENTED` until concrete downstream routes are injected.

`SubscribeEvents` applies the full authenticated ingress pipeline, binds
the stream to the verified `user_id` and `device_session_id`, sends one
signed `gateway.server_time` bootstrap event whose FlatBuffers payload carries
`server_time_ms`, registers the active stream in the in-memory `PushHub`, and
then forwards signed client-facing events consumed from the configured client
event Redis stream. User-targeted events fan out to every active stream for
that user. Session-targeted events fan out only to streams whose
`user_id` and `device_session_id` both match the event target. Each active
stream uses a bounded in-memory queue; when that queue overflows, only the
affected stream is closed with gRPC `RESOURCE_EXHAUSTED` and message
`push stream overflowed`. When the session lifecycle stream reports that the
same `device_session_id` was revoked, every active `SubscribeEvents` stream
bound to that exact session is closed with gRPC `FAILED_PRECONDITION` and
message `device session is revoked`. During gateway shutdown, the in-memory
push hub is closed before gRPC graceful stop, and every active
`SubscribeEvents` stream is terminated with gRPC `UNAVAILABLE` and message
`gateway is shutting down`.

Authenticated anti-abuse budgets are configured by the
`GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_*` environment variables.

Current authenticated gRPC defaults:

- per-IP: `120 requests / minute`, `burst=40`;
- per-session: `60 requests / minute`, `burst=20`;
- per-user: `120 requests / minute`, `burst=40`;
- per-message-class: `60 requests / minute`, `burst=20`.

Authenticated anti-abuse configuration surface:

- per-IP:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST` default `40`;
- per-session:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS` default
  `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST` default
  `20`;
- per-user:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST` default `40`;
- per-message-class:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW`
  default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST`
  default `20`.
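This README does not state which limiter algorithm backs the `REQUESTS`, `WINDOW`, and `BURST` settings. One plausible reading is a token bucket that refills at `requests / window` and holds at most `burst` tokens. The sketch below illustrates that reading only; the type and its methods are hypothetical, and time is passed in as milliseconds to keep the example deterministic.

```go
package main

import "fmt"

// tokenBucket is an assumed model of one rate-limit bucket: it refills at
// requests/window and never holds more than burst tokens.
type tokenBucket struct {
	refillPerMs float64 // tokens added per elapsed millisecond
	capacity    float64 // maximum stored tokens (the burst setting)
	tokens      float64 // tokens currently available
	lastMs      int64   // time of the last refill, in milliseconds
}

// newTokenBucket starts with a full bucket so an idle caller may burst.
func newTokenBucket(requests int, windowMs int64, burst int, nowMs int64) *tokenBucket {
	return &tokenBucket{
		refillPerMs: float64(requests) / float64(windowMs),
		capacity:    float64(burst),
		tokens:      float64(burst),
		lastMs:      nowMs,
	}
}

// allow refills the bucket for the elapsed time and spends one token if
// at least one is available.
func (b *tokenBucket) allow(nowMs int64) bool {
	if elapsed := nowMs - b.lastMs; elapsed > 0 {
		b.tokens += float64(elapsed) * b.refillPerMs
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	// Per-session defaults from the list above: 60 requests / minute, burst=20.
	b := newTokenBucket(60, 60_000, 20, 0)
	allowed := 0
	for i := 0; i < 25; i++ {
		if b.allow(0) {
			allowed++
		}
	}
	fmt.Println("allowed in the initial burst:", allowed)
}
```

Under this model a session that has been idle can send 20 requests back to back and then sustains roughly one request per second.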

## Envelope and Payload Model

The authenticated transport uses a split contract:

- gRPC control messages are protobuf-based;
- business payload bytes are FlatBuffers;
- signatures are computed over canonical envelope fields and a hash of raw
  FlatBuffers bytes.

The gateway treats authenticated request `payload_bytes` as opaque business
data.
It verifies integrity and forwards verified bytes downstream without rewriting
them.

The request envelope version literal is `v1`.
`payload_hash` is the raw 32-byte SHA-256 digest of `payload_bytes`.
`ExecuteCommand` hashes the raw FlatBuffers payload bytes exactly as sent,
while `SubscribeEvents` with an empty payload still requires
`sha256([]byte{})` rather than a special-case value.
The v1 request signature scheme is Ed25519.
`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key registered during `confirm-email-code`.
`signature` carries the raw 64-byte Ed25519 signature computed over the
canonical request signing input.
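The `payload_hash` rule can be sketched in Go as follows. The function names are hypothetical; the two error strings are the messages this README documents for the corresponding `INVALID_ARGUMENT` failures.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

// payloadHash returns the raw 32-byte SHA-256 digest of payload.
// A nil or empty payload hashes like any other input.
func payloadHash(payload []byte) []byte {
	sum := sha256.Sum256(payload)
	return sum[:]
}

// payloadHashHex is a display helper; the wire format carries raw bytes.
func payloadHashHex(payload []byte) string {
	return hex.EncodeToString(payloadHash(payload))
}

// verifyPayloadHash checks the digest length first, then compares the
// digest against the payload.
func verifyPayloadHash(payload, hash []byte) error {
	if len(hash) != sha256.Size {
		return errors.New("payload_hash must be a 32-byte SHA-256 digest")
	}
	want := sha256.Sum256(payload)
	if subtle.ConstantTimeCompare(want[:], hash) != 1 {
		return errors.New("payload_hash does not match payload_bytes")
	}
	return nil
}

func main() {
	// Digest that a SubscribeEvents request with an empty payload must carry.
	fmt.Println(payloadHashHex(nil))

	payload := []byte("example FlatBuffers bytes")
	fmt.Println(verifyPayloadHash(payload, payloadHash(payload)))
}
```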

The v1 stream bootstrap payload uses the shared FlatBuffers schema
`pkg/schema/fbs/gateway.fbs` with root table `gateway.ServerTimeEvent`.

### ExecuteCommandRequest

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `trace_id`

### ExecuteCommandResponse

Required fields:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_bytes`
- `payload_hash`
- `signature`

The v1 unary response signature scheme is Ed25519 with response
domain marker `galaxy-response-v1`.
The response signing input uses the same canonical binary encoding shape as
the request signer:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-response-v1`, `protocol_version`,
  `request_id`, `timestamp_ms`, `result_code`, `payload_hash`.
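The encoding rules above can be sketched in Go as shown below. This is an illustration under stated assumptions rather than the gateway's signer: it assumes the domain marker is length-prefixed like any other string field, that `result_code` is a string, and that `timestamp_ms` is passed as an unsigned value. All function names are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const responseDomainMarker = "galaxy-response-v1"

// appendLenPrefixed appends uvarint(len(field)) followed by the raw bytes.
func appendLenPrefixed(dst, field []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(field)))
	return append(dst, field...)
}

// responseSigningInput builds the canonical bytes in the documented order:
// domain marker, protocol_version, request_id, timestamp_ms, result_code,
// payload_hash.
func responseSigningInput(protocolVersion, requestID string, timestampMs uint64,
	resultCode string, payloadHash []byte) []byte {
	buf := make([]byte, 0, 128)
	buf = appendLenPrefixed(buf, []byte(responseDomainMarker)) // assumed length-prefixed
	buf = appendLenPrefixed(buf, []byte(protocolVersion))
	buf = appendLenPrefixed(buf, []byte(requestID))
	buf = binary.BigEndian.AppendUint64(buf, timestampMs) // fixed 8 bytes, big-endian
	buf = appendLenPrefixed(buf, []byte(resultCode))      // assumed to be a string
	buf = appendLenPrefixed(buf, payloadHash)
	return buf
}

func main() {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	hash := sha256.Sum256([]byte("example payload"))
	input := responseSigningInput("v1", "req-1", 1700000000000, "ok", hash[:])

	// The signer signs the canonical bytes; a client rebuilds the same
	// bytes from the response fields and verifies the signature.
	sig := ed25519.Sign(priv, input)
	fmt.Println("signature valid:", ed25519.Verify(pub, input, sig))
}
```

Length prefixes keep the encoding unambiguous: without them, moving bytes between two adjacent string fields would produce the same signing input.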

`cmd/gateway` loads the unary response signer from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must point to a PKCS#8
PEM-encoded Ed25519 private key. Startup fails when the file is absent,
unreadable, not strict PEM, not PKCS#8, or not Ed25519.
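A minimal Go sketch of the key-loading checks, using only the standard library, might look like this. The function names are hypothetical, and the README does not define "strict PEM"; the sketch interprets it narrowly as one `PRIVATE KEY` block with no trailing data, which is an assumption. In a real process the bytes would come from reading the file at `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// parseEd25519PKCS8PEM accepts exactly one PKCS#8 PEM block that holds an
// Ed25519 private key and rejects everything else.
func parseEd25519PKCS8PEM(data []byte) (ed25519.PrivateKey, error) {
	block, rest := pem.Decode(data)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	if block.Type != "PRIVATE KEY" {
		return nil, fmt.Errorf("unexpected PEM type %q, want PKCS#8 PRIVATE KEY", block.Type)
	}
	if len(bytes.TrimSpace(rest)) != 0 {
		return nil, errors.New("unexpected data after PEM block") // assumed strictness rule
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parse PKCS#8 key: %w", err)
	}
	priv, ok := key.(ed25519.PrivateKey)
	if !ok {
		return nil, errors.New("private key is not Ed25519")
	}
	return priv, nil
}

// generateEd25519PKCS8PEM creates a throwaway key in the expected format,
// so the example runs without a key file on disk.
func generateEd25519PKCS8PEM() ([]byte, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	der, err := x509.MarshalPKCS8PrivateKey(priv)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der}), nil
}

func main() {
	pemBytes, err := generateEd25519PKCS8PEM()
	if err != nil {
		panic(err)
	}
	priv, err := parseEd25519PKCS8PEM(pemBytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("loaded Ed25519 private key bytes:", len(priv))
}
```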

### SubscribeEventsRequest

The stream open request reuses the authenticated request model.
It contains the same authentication fields as the unary request and either an
empty payload or a minimal connect payload.

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash`
- `signature`

Optional fields:

- `payload_bytes`
- `trace_id`

### GatewayEvent

Every stream event is a client-facing signed server message.

Required fields:

- `event_type`
- `event_id`
- `timestamp_ms`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `request_id`
- `trace_id`

The v1 stream-event signature scheme is Ed25519 with event domain
marker `galaxy-event-v1`.
The event signing input uses the same canonical binary encoding shape as the
request and unary response signers:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-event-v1`, `event_type`, `event_id`,
  `timestamp_ms`, `request_id`, `trace_id`, `payload_hash`.

The bootstrap event uses:

- `event_type = "gateway.server_time"`;
- `event_id = request_id` from the opening `SubscribeEvents` request;
- `payload_bytes` encoded as FlatBuffers `gateway.ServerTimeEvent` with
  `server_time_ms`;
- the same loaded Ed25519 signer configured by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Client-facing fan-out events are sourced from the internal client
event stream. Internal publishers provide the event target and business
payload only: `user_id`, optional `device_session_id`, `event_type`,
`event_id`, `payload_bytes`, and optional `request_id` / `trace_id`. The
gateway derives `timestamp_ms`, recomputes `payload_hash`, signs the event,
and only then forwards it to the matching `SubscribeEvents` streams.

## Verification and Routing Pipeline

The gateway applies the same strict verification order to every authenticated
gRPC request.

1. Parse the control envelope and validate required fields.
2. Check whether `protocol_version` is supported.
3. Resolve `device_session_id` through `SessionCache`.
4. Reject unknown or revoked sessions.
5. Verify that `payload_hash` matches raw `payload_bytes`.
6. Verify the client signature using the public key from session cache.
7. Verify that `timestamp_ms` is inside the accepted freshness window.
8. Verify anti-replay by checking `device_session_id + request_id`.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by `message_type`.

No downstream business service should receive a request that has not passed
this full verification pipeline.

`ExecuteCommand` enforces steps 1 through 11 and signs the successful unary
response afterward. `SubscribeEvents` enforces steps 1 through 9, binds the
verified stream identity, sends the initial signed server-time bootstrap
event, and then keeps the stream open for push delivery.

Pipeline failures map to gRPC status codes as follows:

- malformed envelopes fail with `INVALID_ARGUMENT`;
- unsupported non-empty `protocol_version` values fail with
  `FAILED_PRECONDITION`;
- unknown sessions fail with `UNAUTHENTICATED`;
- revoked sessions fail with `FAILED_PRECONDITION`;
- SessionCache backend failures fail with `UNAVAILABLE`;
- `payload_hash` values that are not raw 32-byte SHA-256 digests fail with
  `INVALID_ARGUMENT` and message
  `payload_hash must be a 32-byte SHA-256 digest`;
- `payload_hash` values that do not match `payload_bytes` fail with
  `INVALID_ARGUMENT` and message `payload_hash does not match payload_bytes`;
- invalid request signatures fail with `UNAUTHENTICATED` and message
  `invalid request signature`;
- malformed cached `client_public_key` values fail closed with `UNAVAILABLE`
  and message `session cache is unavailable`;
- requests with a `timestamp_ms` outside the accepted freshness window fail
  with `FAILED_PRECONDITION` and message
  `request timestamp is outside the freshness window`;
- requests that reuse the same `request_id` for the same `device_session_id`
  inside the active replay window fail with `FAILED_PRECONDITION` and message
  `request replay detected`;
- ReplayStore backend failures fail with `UNAVAILABLE` and message
  `replay store is unavailable`;
- unrouted exact-match `message_type` values fail with `UNIMPLEMENTED` and
  message `message_type is not routed`;
- downstream availability failures fail with `UNAVAILABLE` and message
  `downstream service is unavailable`.
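The freshness rule in step 7 can be sketched as a symmetric comparison around server time. The function name is hypothetical, and the sketch treats a timestamp exactly on the window boundary as accepted, which matches the Replay Store note about requests accepted on the freshness boundary.

```go
package main

import "fmt"

// withinFreshnessWindow reports whether timestampMs lies inside the
// symmetric window [nowMs-windowMs, nowMs+windowMs]. All values are
// milliseconds; the boundary itself counts as fresh.
func withinFreshnessWindow(timestampMs, nowMs, windowMs int64) bool {
	delta := nowMs - timestampMs
	if delta < 0 {
		delta = -delta // a client clock ahead of the server is treated the same way
	}
	return delta <= windowMs
}

func main() {
	const windowMs = 5 * 60 * 1000 // default 5m freshness window
	now := int64(1_700_000_000_000)

	fmt.Println(withinFreshnessWindow(now-60_000, now, windowMs))  // one minute old
	fmt.Println(withinFreshnessWindow(now+400_000, now, windowMs)) // too far in the future
}
```

A request that fails this check would be rejected with `FAILED_PRECONDITION` before any replay reservation is attempted, following the step order above.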

## Internal Authenticated Contract

Downstream services should receive an internal authenticated command rather than
raw external gRPC transport data.

The minimum authenticated context is:

- `user_id`
- `device_session_id`
- `message_type`
- verified `payload_bytes`
- `request_id`
- optional `trace_id`
- optional client metadata needed for logs and tracing

Downstream services may trust that the gateway has already performed transport
authentication, freshness verification, and anti-replay checks.
They must still perform business authorization and domain validation.

## Session Model

The Auth / Session Service is the source of truth for device session state.
The gateway is designed to authenticate the hot path from cache.

Expected session fields available to the gateway:

- `device_session_id`
- `user_id`
- base64-encoded raw 32-byte Ed25519 client public key
- session status
- revoke metadata
- optional client metadata

### Session Cache

`SessionCache` provides the fast path for:

- session existence checks;
- `device_session_id -> user_id`;
- access to the base64-encoded raw Ed25519 client public key used for
  signature verification;
- revoked versus active status checks.

Cache updates are event-driven.
TTL is allowed only as a safety net and must not replace invalidation events.

The gateway keeps a process-local in-memory snapshot
cache in front of the Redis fallback backend. Authenticated requests read the
local snapshot first. A local miss performs one bounded Redis lookup and seeds
the local snapshot so later requests for the same session avoid another Redis
round-trip unless a later session event changes the cached state.

The local snapshot cache intentionally has no TTL and no size-based
eviction policy. Session lifecycle events are the authoritative mechanism for
keeping the hot path current, while Redis fallback remains the safety net for
cold misses and process restarts.

The Redis fallback implementation uses `go-redis/v9`.
`cmd/gateway` requires the Redis fallback backend during startup, issues a
bounded `PING`, and refuses to start when Redis is misconfigured or
unavailable.

Required environment variable:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`

Optional environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_USERNAME`
- `GATEWAY_SESSION_CACHE_REDIS_PASSWORD`
- `GATEWAY_SESSION_CACHE_REDIS_DB` with default `0`
- `GATEWAY_SESSION_CACHE_REDIS_KEY_PREFIX` with default `gateway:session:`
- `GATEWAY_SESSION_CACHE_REDIS_LOOKUP_TIMEOUT` with default `250ms`
- `GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED` with default `false`

The Redis key format is:

- `<key_prefix><device_session_id>`

The Redis value is one strict JSON object:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

`client_public_key` stores the standard base64-encoded raw 32-byte Ed25519
public key registered for the device session.

Malformed JSON, missing required fields, unsupported `status`, or a
`device_session_id` mismatch between the Redis value and the lookup key are
treated as SessionCache backend failures rather than as valid session states.
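The strict decoding rules might be sketched in Go as follows. The struct, the function name, and the `active` / `revoked` status literals are assumptions for illustration. Rejecting unknown fields and trailing data is one interpretation of "strict" that the README does not spell out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// sessionRecord mirrors the documented Redis value fields.
type sessionRecord struct {
	DeviceSessionID string `json:"device_session_id"`
	UserID          string `json:"user_id"`
	ClientPublicKey string `json:"client_public_key"`
	Status          string `json:"status"`
	RevokedAtMs     *int64 `json:"revoked_at_ms,omitempty"`
}

// decodeSessionRecord returns an error for every case the README treats as
// a backend failure instead of returning a partially valid record.
func decodeSessionRecord(raw []byte, lookupID string) (sessionRecord, error) {
	var rec sessionRecord
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // assumed part of "strict"
	if err := dec.Decode(&rec); err != nil {
		return sessionRecord{}, fmt.Errorf("decode session record: %w", err)
	}
	if _, err := dec.Token(); err != io.EOF {
		return sessionRecord{}, errors.New("session record has trailing data")
	}
	switch {
	case rec.DeviceSessionID == "" || rec.UserID == "" || rec.ClientPublicKey == "":
		return sessionRecord{}, errors.New("session record is missing a required field")
	case rec.Status != "active" && rec.Status != "revoked": // assumed literals
		return sessionRecord{}, fmt.Errorf("unsupported session status %q", rec.Status)
	case rec.DeviceSessionID != lookupID:
		return sessionRecord{}, errors.New("device_session_id does not match lookup key")
	}
	return rec, nil
}

func main() {
	raw := []byte(`{"device_session_id":"s1","user_id":"u1",` +
		`"client_public_key":"AAAA","status":"active"}`)
	rec, err := decodeSessionRecord(raw, "s1")
	fmt.Println(rec.UserID, err)
}
```

The sketch does not validate `client_public_key` itself; per the authenticated gRPC description, malformed key material fails closed later, when the signature is verified.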

### Session Event Stream

The gateway keeps the process-local session snapshot cache synchronized from one
Redis Stream consumed through `go-redis/v9`.

`cmd/gateway` requires the session event stream configuration during startup,
issues a bounded `PING` against the same Redis deployment used for
`SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical
database, timeout, and TLS settings configured for `SessionCache`.

Each gateway replica keeps its own in-memory last-seen stream ID and consumes
the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that
point, which preserves the same fresh-process semantics as Redis `$` while
avoiding a race before the first blocking read.

The session event payload is one strict full snapshot with these
fields:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

Valid active and revoked snapshots upsert or replace the local session state.
Later stream entries win.
Malformed events are skipped without stopping the subscriber; when
`device_session_id` can still be extracted, the gateway evicts the local
snapshot for that session so it cannot continue using stale state.

Session event publishers must keep the stream bounded by using
`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side,
because consumer-side trimming could drop updates that another gateway replica
has not read yet.

### Client Event Stream

The gateway delivers client-facing push events from one dedicated Redis Stream
consumed through `go-redis/v9`.

`cmd/gateway` requires the client event stream configuration during startup,
issues a bounded `PING` against the same Redis deployment used for
`SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical
database, timeout, and TLS settings configured for `SessionCache`.

Each gateway replica keeps its own in-memory last-seen stream ID and consumes
the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that
point, which preserves the same fresh-process semantics as Redis `$` while
avoiding a race before the first blocking read.

The client event payload is one strict target-plus-payload entry with
these fields:

- `user_id`
- optional `device_session_id`
- `event_type`
- `event_id`
- `payload_bytes`
- optional `request_id`
- optional `trace_id`

`payload_bytes` carries the raw binary-safe business payload bytes for the
outbound client event.
When `device_session_id` is absent or blank, the gateway fans the event out to
every active stream for `user_id`.
When `device_session_id` is present, the gateway fans the event out only to
active streams whose `user_id` and `device_session_id` both match.
Malformed client event entries are skipped without stopping the subscriber or
delivering partial data to clients.
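The targeting rule can be expressed as a small predicate. The types below are hypothetical stand-ins for the gateway's internal event and stream types, and treating a whitespace-only `device_session_id` as blank is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// eventTarget holds the routing fields of one client event entry.
type eventTarget struct {
	UserID          string
	DeviceSessionID string // empty or blank means "all sessions of the user"
}

// streamIdentity is the verified identity bound to one SubscribeEvents stream.
type streamIdentity struct {
	UserID          string
	DeviceSessionID string
}

// matchesTarget reports whether an event should be delivered to a stream.
func matchesTarget(ev eventTarget, s streamIdentity) bool {
	if ev.UserID != s.UserID {
		return false
	}
	if strings.TrimSpace(ev.DeviceSessionID) == "" {
		return true // user-targeted: every active stream of the user
	}
	return ev.DeviceSessionID == s.DeviceSessionID // session-targeted
}

func main() {
	phone := streamIdentity{UserID: "u1", DeviceSessionID: "s-phone"}
	laptop := streamIdentity{UserID: "u1", DeviceSessionID: "s-laptop"}

	userWide := eventTarget{UserID: "u1"}
	phoneOnly := eventTarget{UserID: "u1", DeviceSessionID: "s-phone"}

	fmt.Println(matchesTarget(userWide, phone), matchesTarget(userWide, laptop))
	fmt.Println(matchesTarget(phoneOnly, phone), matchesTarget(phoneOnly, laptop))
}
```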

Client event publishers must keep the stream bounded by using
`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side,
because consumer-side trimming could drop updates that another gateway replica
has not read yet.

### Replay Store

`ReplayStore` provides the hot-path anti-replay reservation for:

- duplicate detection by `device_session_id + request_id`;
- bounded replay protection for the authenticated freshness window.

The ReplayStore uses Redis through `go-redis/v9`.
`cmd/gateway` requires the ReplayStore backend during startup, issues a
bounded `PING`, and refuses to start when Redis is misconfigured or
unavailable.

The ReplayStore reuses the same Redis deployment settings as `SessionCache`
and adds two replay-specific environment variables:

- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`

Replay keys use this format:

- `<key_prefix><base64url(device_session_id)>:<base64url(request_id)>`

For each accepted request, the replay reservation TTL is computed as:

- `timestamp_ms + freshness_window - now`

The TTL is clamped to a minimum positive duration so requests accepted exactly
on the freshness boundary still reserve their replay key.
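The key format and TTL rule can be sketched as follows. The function names are hypothetical. The unpadded base64url variant and the 1 ms minimum TTL are assumptions, since the README specifies neither the padding nor the exact clamp value.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// minReplayTTLMs is an assumed lower bound for the reservation TTL.
const minReplayTTLMs = 1

// replayKey encodes both identifiers with base64url so that a ':' inside
// either value cannot be confused with the separator.
func replayKey(prefix, deviceSessionID, requestID string) string {
	enc := base64.RawURLEncoding // unpadded variant, assumed
	return prefix +
		enc.EncodeToString([]byte(deviceSessionID)) + ":" +
		enc.EncodeToString([]byte(requestID))
}

// replayTTLMs computes timestamp_ms + freshness_window - now in
// milliseconds and clamps the result to a positive minimum.
func replayTTLMs(timestampMs, freshnessWindowMs, nowMs int64) int64 {
	ttl := timestampMs + freshnessWindowMs - nowMs
	if ttl < minReplayTTLMs {
		ttl = minReplayTTLMs
	}
	return ttl
}

func main() {
	fmt.Println(replayKey("gateway:replay:", "s1", "r1"))

	const windowMs = 5 * 60 * 1000
	now := int64(1_700_000_000_000)
	// A request stamped 10 seconds ago stays reserved until it can no
	// longer pass the freshness check.
	fmt.Println(replayTTLMs(now-10_000, windowMs, now))
}
```

Tying the TTL to the freshness window means a reservation expires only once the same `timestamp_ms` would be rejected as stale anyway, so the key does not need to be kept any longer.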

### Revocation Behavior

When a device session is revoked:

1. the Auth / Session Service updates the source of truth;
2. it publishes a session update or revoke event;
3. the gateway invalidates or updates `SessionCache`;
4. new unary gRPC requests for that session are rejected;
5. active `SubscribeEvents` streams for that exact `device_session_id` are
   closed with gRPC `FAILED_PRECONDITION` and message
   `device session is revoked`.

## Public Anti-Abuse Model

The public REST layer must distinguish between public auth operations and
browser-originated traffic that may burst during a normal first page load.

The gateway uses these public route classes:

- `public_auth`
- `browser_bootstrap`
- `browser_asset`
- `public_misc`

Any classifier result outside this fixed set is normalized to `public_misc`
before the class is stored in request context or used for policy derivation.
The canonical base bucket namespace for public REST policy is
`public_rest/class=<class>`.
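The normalization and bucket namespace rules can be sketched as shown below; the function names are hypothetical.

```go
package main

import "fmt"

// normalizeRouteClass maps any classifier result outside the fixed set of
// public route classes to public_misc.
func normalizeRouteClass(class string) string {
	switch class {
	case "public_auth", "browser_bootstrap", "browser_asset", "public_misc":
		return class
	default:
		return "public_misc"
	}
}

// baseBucket returns the canonical base bucket namespace for a class.
func baseBucket(class string) string {
	return "public_rest/class=" + normalizeRouteClass(class)
}

func main() {
	fmt.Println(baseBucket("public_auth"))
	fmt.Println(baseBucket("something_else"))
}
```

Normalizing before the class is stored keeps the set of bucket names fixed, so a misbehaving classifier cannot create unbounded rate-limit keys.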

### Public Auth

`public_auth` is the stable route class for `send-email-code` and
`confirm-email-code`.
This class uses stricter limits and abuse scoring because it directly touches
account and session creation flows.

Controls include:

- per-IP and per-identity rate limits;
- request body size limits;
- method allow-lists;
- malformed request counters;
- elevated logging and security telemetry for repeated failures.

Current defaults:

- per-IP: `30 requests / minute`, `burst=10`;
- `send-email-code` identity buckets: `3 requests / 10 minutes`, `burst=1`,
  keyed by normalized `email`;
- `confirm-email-code` identity buckets: `6 requests / 10 minutes`,
  `burst=2`, keyed by normalized `challenge_id`;
- maximum request body size: `8192` bytes;
- only `POST` is accepted for public auth routes.

Configuration surface:

- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES` default `8192`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS` default
  `30`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW` default `1m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST` default `10`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `3`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `1`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `6`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `2`.

### Browser Bootstrap and Asset Traffic

`browser_bootstrap` and `browser_asset` use separate coarse-grained budgets.
They may exhibit bursty behavior during the first load and therefore must not
be treated as hostile based on burst pattern alone.

This traffic is still constrained by:

- dedicated rate limits;
- method allow-lists;
- body size limits where request bodies are expected;
- protocol and path validation;
- independent abuse telemetry.

The gateway must not merge these buckets or counters with `public_auth`.

Current defaults for these classes and the `public_misc` fallback class:

- `browser_bootstrap`: `60 requests / minute`, `burst=20`, `GET` and `HEAD`
  only, and no request body;
- `browser_asset`: `300 requests / minute`, `burst=80`, `GET` and `HEAD`
  only, and no request body;
- `public_misc`: `30 requests / minute`, `burst=10`, and no request body.

Configuration surface:

- `browser_bootstrap`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES` default
  `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST` default
  `20`;
- `browser_asset`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS` default
  `300`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST` default
  `80`;
- `public_misc`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS` default
  `30`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST` default `10`.

## Push Delivery Model

The v1 push channel is a gRPC server stream.
Long-polling is intentionally out of scope for the first version.

Expected stream behavior:

1. the client opens `SubscribeEvents`;
2. the gateway applies the full authenticated ingress verification pipeline;
3. the stream is bound to `user_id` and `device_session_id`;
4. the first signed service event is `gateway.server_time` and its
   FlatBuffers payload includes `server_time_ms`;
5. after that bootstrap event, the stream is registered in `PushHub` and
   remains open until client cancellation, server shutdown, queue overflow,
   session revoke for the same `device_session_id`, or a later send failure;
6. internal pub/sub may target all active streams for one `user_id` or only
   one `device_session_id` within that user;
7. the current per-stream in-memory queue capacity is `64` events and
   overflow closes only the affected stream;
8. session revoke closes only streams bound to the same exact
   `device_session_id` and returns gRPC `FAILED_PRECONDITION` with message
   `device session is revoked`.
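Steps 6 and 8 describe two selection rules: event delivery may address a whole user or one device session, while revocation always addresses one exact device session. The sketch below illustrates both rules under stated assumptions. The types `streamKey` and `target` are hypothetical, and using an empty `DeviceSessionID` to mean "all streams of the user" is an assumed convention, not the gateway's documented wire format.

```go
package main

import "fmt"

// streamKey (hypothetical) identifies the binding of one active stream.
type streamKey struct {
	UserID          string
	DeviceSessionID string
}

// target (hypothetical) addresses event delivery. An empty DeviceSessionID
// means every active stream of UserID (assumed convention).
type target struct {
	UserID          string
	DeviceSessionID string
}

func (t target) matches(k streamKey) bool {
	if t.UserID != k.UserID {
		return false
	}
	return t.DeviceSessionID == "" || t.DeviceSessionID == k.DeviceSessionID
}

// selectStreams returns the streams that should receive an event.
func selectStreams(active []streamKey, t target) []streamKey {
	var out []streamKey
	for _, k := range active {
		if t.matches(k) {
			out = append(out, k)
		}
	}
	return out
}

// streamsToRevoke returns only streams bound to the exact device session.
func streamsToRevoke(active []streamKey, deviceSessionID string) []streamKey {
	var out []streamKey
	for _, k := range active {
		if k.DeviceSessionID == deviceSessionID {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	active := []streamKey{{"u1", "d1"}, {"u1", "d2"}, {"u2", "d3"}}
	fmt.Println(len(selectStreams(active, target{UserID: "u1"})))                        // 2
	fmt.Println(len(selectStreams(active, target{UserID: "u1", DeviceSessionID: "d2"}))) // 1
	fmt.Println(len(streamsToRevoke(active, "d1")))                                      // 1
}
```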

## Lifecycle and Shutdown

Gateway process shutdown is coordinated across the public REST listener,
authenticated gRPC listener, optional admin listener, internal Redis
subscribers, and telemetry runtime.

`GATEWAY_SHUTDOWN_TIMEOUT` configures the per-component graceful shutdown
budget and defaults to `5s`.
During authenticated gRPC shutdown, the in-memory `PushHub` closes active
streams before gRPC graceful stop, so active `SubscribeEvents` calls terminate
with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

## Recommended Package Layout

The package layout keeps transport, policy, and downstream adapters separate:

- `cmd/gateway`
- `internal/app`
- `internal/config`
- `internal/restapi`
- `internal/grpcapi`
- `internal/authn`
- `internal/session`
- `internal/replay`
- `internal/ratelimit`
- `internal/downstream`
- `internal/push`
- `internal/events`
- `internal/clock`

## Key Interfaces

The gateway should be built around explicit consumer-side interfaces.

### SessionCache

Provides cached session lookup by `device_session_id`.
Returns enough data to verify signatures and identify the authenticated user.
The current production implementation is a process-local read-through cache in
front of a Redis fallback adapter that uses strict JSON records under a
configurable key prefix.

### ReplayStore

Tracks recently seen `request_id` values per device session and rejects replayed
requests inside the accepted freshness window.
The current production adapter is Redis-backed, uses a dedicated configurable
key prefix, and reserves keys with a TTL derived from
`timestamp_ms + freshness_window - now`.

### RateLimiter

Applies independent policies for:

- public REST route classes;
- authenticated gRPC requests by IP;
- authenticated gRPC requests by session;
- authenticated gRPC requests by user;
- authenticated gRPC requests by message class.

The current rate limiter is process-local and in-memory.
Public REST keys stay under the `public_rest/...` namespace, while
authenticated gRPC keys stay under `authenticated_grpc/...`, so both traffic
surfaces keep independent buckets even when they share the same limiter
backend.

### PublicTrafficClassifier

Maps incoming public REST requests to one of the public route classes so that
limits and anti-abuse counters remain isolated.
The gateway normalizes any unsupported or empty classifier output to
`public_misc`, and public policy code derives the base bucket namespace from
the normalized class as `public_rest/class=<class>`.

### AuthServiceClient

Handles public auth commands and session-related updates exchanged with the
Auth / Session Service.
The gateway contract is:

- `SendEmailCode(email) -> challenge_id`
- `ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id`

When no concrete implementation is wired, the gateway keeps the public routes
available and returns a stable `503 service_unavailable` response instead of
failing process startup.

### DownstreamRouter

Resolves the target downstream service or adapter by the full exact-match
`message_type` literal.
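Exact-match resolution amounts to a plain map lookup with no prefix, wildcard, or case folding. The sketch below assumes a hypothetical `exactRouter` type and made-up `message_type` values; neither comes from the gateway source.

```go
package main

import "fmt"

// exactRouter (hypothetical) maps a full message_type literal to a route name.
type exactRouter struct {
	routes map[string]string
}

// resolve returns the route only when messageType matches a key exactly.
func (r exactRouter) resolve(messageType string) (string, bool) {
	route, ok := r.routes[messageType]
	return route, ok
}

func main() {
	// "fleet.move" and "fleet-service" are illustrative placeholders.
	r := exactRouter{routes: map[string]string{"fleet.move": "fleet-service"}}

	fmt.Println(r.resolve("fleet.move"))     // fleet-service true
	fmt.Println(r.resolve("fleet.move.v2"))  // "" false: no prefix matching
	fmt.Println(r.resolve("Fleet.Move"))     // "" false: no case folding
}
```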

### DownstreamClient

Executes a verified authenticated command against a downstream internal service
and returns response payload bytes plus a stable opaque result code.
An empty or whitespace-only result code is treated as an internal downstream
contract violation.

### EventSubscriber

Subscribes to internal pub/sub topics used for:

- session cache updates;
- revocations;
- client-facing event delivery.

The implementation consumes two Redis Streams with replica-safe plain
`XREAD`: one strict full-session snapshot stream for the process-local session
cache and one client-facing event stream for live push fan-out.

### PushHub

Tracks active `SubscribeEvents` streams, binds them to authenticated identities,
and delivers events to the correct connections.
The implementation uses one bounded in-memory queue per stream with a
default capacity of `64` events; overflowing one queue closes only that stream
and leaves the remaining streams active.
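The overflow rule can be illustrated with a buffered channel and a non-blocking send. This is a minimal sketch, not the `PushHub` implementation: the `stream` type and `enqueue` method are hypothetical, and the real hub also has to report a close reason to the gRPC handler.

```go
package main

import (
	"fmt"
	"sync"
)

const defaultQueueCapacity = 64

// stream (hypothetical) owns one bounded event queue.
type stream struct {
	mu     sync.Mutex
	queue  chan []byte
	closed bool
}

func newStream(capacity int) *stream {
	return &stream{queue: make(chan []byte, capacity)}
}

// enqueue never blocks. When the queue is full it closes this stream only
// and reports false; other streams are unaffected.
func (s *stream) enqueue(event []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return false
	}
	select {
	case s.queue <- event:
		return true
	default:
		s.closed = true
		close(s.queue)
		return false
	}
}

func main() {
	slow, healthy := newStream(defaultQueueCapacity), newStream(defaultQueueCapacity)
	for i := 0; i < defaultQueueCapacity; i++ {
		slow.enqueue([]byte("event")) // nobody drains this queue
	}
	fmt.Println(slow.enqueue([]byte("overflow"))) // false: the slow stream is closed
	fmt.Println(healthy.enqueue([]byte("event"))) // true: the other stream stays active
}
```

A non-blocking send keeps one slow consumer from stalling fan-out to every other stream.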

### ResponseSigner

Signs unary responses and stream events so clients can verify server-originated
messages.
The implementation uses one Ed25519 signer loaded from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must reference a PKCS#8
PEM-encoded private key.

### Clock

Provides current server time and supports consistent freshness-window checks.

## Error Model and Observability

The gateway should expose stable edge-level error classes instead of leaking
internal implementation details.

Minimum error categories:

- malformed request;
- request too large;
- unsupported protocol;
- unknown session;
- revoked session;
- invalid signature;
- stale request;
- replay detected;
- rate limited;
- policy denied;
- downstream unavailable;
- backend unavailable;
- gateway shutting down;
- internal error.

Observability requirements:

- stable correlation identifiers, including `request_id` and optional `trace_id`;
- structured logs;
- security audit events for rejects and abuse signals;
- metrics keyed by route class, message type, result code, and reject reason;
- no logging of secrets, raw private material, or raw signatures.

The service uses:

- `go.uber.org/zap` for structured JSON logs;
- `otelgin` for the public REST listener;
- `otelgrpc` for the authenticated gRPC listener;
- OpenTelemetry metrics exported through Prometheus on the optional admin
  `/metrics` listener.

Current custom metric families:

- `gateway.public_http.requests`
- `gateway.public_http.duration`
- `gateway.authenticated_grpc.requests`
- `gateway.authenticated_grpc.duration`
- `gateway.push.active_streams`
- `gateway.push.stream_closures`
- `gateway.internal_event_drops`

The process-wide log level is configured by `GATEWAY_LOG_LEVEL` and
defaults to `info`.
The default OpenTelemetry resource uses `service.name=galaxy-edge-gateway`
when `OTEL_SERVICE_NAME` is unset.
If `OTEL_TRACES_EXPORTER` is unset or set to `none`, the gateway keeps the
tracing runtime enabled but installs no external trace exporter.
If `OTEL_TRACES_EXPORTER=otlp`, the gateway uses the standard
`OTEL_EXPORTER_OTLP_*` environment variables to configure the OTLP trace
exporter protocol and endpoint.
The protocol selection specifically honors
`OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` first and falls back to
`OTEL_EXPORTER_OTLP_PROTOCOL` when the trace-specific variable is unset.
Supported values are `http/protobuf` and `grpc`; when both variables are
unset, the gateway defaults to `http/protobuf`.
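The protocol precedence can be expressed as a short lookup chain. In this sketch the function name `tracesProtocol` is hypothetical, and returning an error for an unsupported value is an assumption; this document does not say how the gateway reacts to other values.

```go
package main

import (
	"fmt"
	"os"
)

// tracesProtocol (hypothetical name) resolves the OTLP trace protocol:
// the trace-specific variable wins, then the generic one, then the default.
func tracesProtocol(getenv func(string) string) (string, error) {
	proto := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if proto == "" {
		proto = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	switch proto {
	case "":
		return "http/protobuf", nil
	case "http/protobuf", "grpc":
		return proto, nil
	default:
		// Assumed behavior: reject anything outside the supported set.
		return "", fmt.Errorf("unsupported OTLP protocol %q", proto)
	}
}

func main() {
	proto, err := tracesProtocol(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("OTLP trace protocol:", proto)
}
```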

Structured logs intentionally omit:

- public auth e-mail addresses, login codes, and challenge IDs;
- client public keys;
- raw payload bytes and payload hashes;
- raw request or response signatures;
- response-signer private key material and Redis credentials.

Malformed entries on the internal session and client-event streams are not
silently dropped: the gateway logs each drop and increments
`gateway.internal_event_drops`.

## Non-Goals

The gateway is not a business authorization layer and must not grow into a
domain coordinator.

The gateway must not:

- implement business ownership checks;
- validate domain state transitions;
- replace the Auth / Session Service as the session source of truth;
- degrade into a synchronous pass-through that reloads session state for every
  authenticated request.