# Edge Gateway

## Run and Dependencies

`cmd/gateway` starts with built-in listener defaults, but it still requires:

- one reachable Redis deployment for session lookup, replay reservations, and
  both internal event streams;
- one configured session event stream via `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
- one configured client event stream via `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Required startup environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`
- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`

Optional integrations:

- `GATEWAY_ADMIN_HTTP_ADDR` enables the private `/metrics` listener;
- an injected `AuthServiceClient` enables real public auth handling;
- injected downstream routes are required for successful `ExecuteCommand`.

Operational caveats:

- public auth routes stay mounted and return `503 service_unavailable` until an
  auth adapter is wired;
- authenticated gRPC starts without downstream routes, but `ExecuteCommand`
  returns gRPC `UNIMPLEMENTED` until routing is configured.

Additional module docs:

- [Public REST contract](openapi.yaml)
- [Documentation index](docs/README.md)
- [Runtime and components](docs/runtime.md)
- [Request and push flows](docs/flows.md)
- [Operator runbook](docs/runbook.md)
- [Configuration and contract examples](docs/examples.md)
- [Example `.env`](.env.example)

## Purpose

`Edge Gateway` is the only public ingress for Galaxy Plus clients.
It terminates the external transport and security boundary, enforces edge
policies, and routes verified requests to internal services.

The gateway does not implement domain-specific business logic.
Business validation, authorization, ownership checks, and state transitions
remain inside downstream services.

## Trust Boundary

The gateway sits between untrusted external clients and trusted internal
services.

The gateway is responsible for:

- parsing external transport requests;
- classifying public REST traffic;
- authenticating protected gRPC traffic;
- loading session state from cache;
- verifying request freshness and anti-replay constraints;
- applying edge rate limits and anti-abuse policy;
- building an authenticated internal command context;
- routing verified commands to internal services;
- maintaining authenticated push delivery connections.

The gateway is not responsible for:

- deciding whether a user is allowed to execute a business action;
- validating domain invariants;
- storing the source-of-truth session record;
- implementing business idempotency.

## Transport Matrix

The gateway exposes two external transport classes.

| Transport | Audience | Authentication | Payload format | Primary use |
| --- | --- | --- | --- | --- |
| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |

### Public REST Surface

The public REST surface is used for commands that must work before a device
session exists and for browser-originated traffic that may share the same edge.
It covers the probe endpoints, public auth routes, and coarse public
anti-abuse.

Currently implemented public endpoints:

- `GET /healthz`
- `GET /readyz`
- `POST /api/v1/public/auth/send-email-code`
- `POST /api/v1/public/auth/confirm-email-code`

The implemented REST contract is documented in [`openapi.yaml`](openapi.yaml).
The listener address is configured by `GATEWAY_PUBLIC_HTTP_ADDR`.
The public REST listener read budgets are configured by:

- `GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_PUBLIC_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT` with default `1m`.

The public auth JSON contract uses a challenge-token flow:

- `send-email-code` accepts `email` and returns `challenge_id`;
- `confirm-email-code` accepts `challenge_id`, `code`,
  `client_public_key`, and `time_zone`, then returns
  `device_session_id`.

`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key for the device session being created.
`time_zone` is the client-selected IANA time zone name forwarded unchanged to
`Auth / Session Service`.

These routes remain unauthenticated and delegate only through an injected
`AuthServiceClient`.
The default wiring used by `cmd/gateway` keeps the routes mounted and returns
`503 service_unavailable` until a concrete upstream auth adapter is supplied.
Public auth adapter calls are wrapped in
`GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT`, which defaults to `3s`.
When that timeout expires, the gateway preserves the public REST contract and
returns `503 service_unavailable`.
When an injected auth adapter returns `*AuthServiceError`, the gateway projects
that client-safe `4xx/5xx` status, `code`, and `message` back to the caller
after normalizing blank or invalid fields. Unexpected non-`AuthServiceError`
adapter failures fail closed as `500 internal_error`.

Public anti-abuse is process-local and in-memory.
Per-IP buckets are derived only from the TCP peer `RemoteAddr`.
Forwarded proxy headers such as `X-Forwarded-For` and `Forwarded` are
intentionally ignored.
Oversized public REST bodies are rejected with `413 request_too_large`.
Rate-limited requests are rejected with `429 rate_limited` and a
`Retry-After` header.

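
As a rough illustration of the peer-address rule, the Go sketch below derives a
per-IP bucket key from `RemoteAddr` alone and never consults forwarding
headers. The function name `peerIPBucket` is hypothetical, and the `unknown`
fallback for unparsable addresses is an assumption borrowed from the
authenticated gRPC behavior described later; the public REST code may handle
that case differently.

```go
package main

import "net"

// peerIPBucket returns the bucket key for a TCP peer address such as
// "203.0.113.7:51234". Forwarded headers are deliberately not an input.
// The "unknown" fallback is an assumption, not documented REST behavior.
func peerIPBucket(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return "unknown"
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return "unknown"
	}
	// ip.String() yields one canonical text form per address.
	return ip.String()
}
```

Keying on the canonical text form keeps differently written spellings of the
same IPv6 address in one bucket.
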
In addition to the fixed endpoints above, the gateway may front browser
bootstrap or asset traffic through a pluggable public handler or proxy.
That traffic belongs to dedicated public route classes and must not share rate
limit buckets or abuse counters with the public auth API.

### Operational Admin Surface

The gateway may expose one private operational HTTP listener used for metrics.

The admin listener is disabled by default and is enabled only when
`GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
When enabled, it serves:

- `GET /metrics`

The admin listener read budgets are configured by:

- `GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_ADMIN_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT` with default `1m`.

`/metrics` is intentionally not mounted on the public REST ingress.
It is also intentionally excluded from [`openapi.yaml`](openapi.yaml), because
that specification covers only the public REST ingress.
The endpoint exposes metrics in the Prometheus text exposition format described
in the official Prometheus documentation:
<https://prometheus.io/docs/instrumenting/exposition_formats/>.

### Authenticated gRPC Surface

All authenticated client requests use HTTP/2 and gRPC.
The listener address is configured by `GATEWAY_AUTHENTICATED_GRPC_ADDR`.
Inbound authenticated gRPC connection setup is bounded by
`GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT`, which defaults to `5s`.
The accepted client timestamp skew is configured by
`GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW` and defaults to `5m`.

The public gRPC service exposes two methods:

- `ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)`
- `SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)`

`ExecuteCommand` is a generic unary RPC.
The gateway routes the request downstream by `message_type` after transport
verification succeeds.
Downstream unary execution is bounded by
`GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT`, which defaults to `5s`.
When that timeout expires, the gateway preserves the authenticated gRPC
contract and returns gRPC `UNAVAILABLE` with message
`downstream service is unavailable`.

`SubscribeEvents` is an authenticated server-streaming RPC.
It binds the stream to `user_id` and `device_session_id` and starts by sending
a signed service event that includes the current server time in milliseconds.

The v1 protobuf contract lives in
`proto/galaxy/gateway/v1/edge_gateway.proto` under package
`galaxy.gateway.v1` and service `EdgeGateway`.
Generated Go bindings are committed under `proto/galaxy/gateway/v1/` and are
regenerated with:

```bash
buf generate
```

Before any routing or push step runs, the gateway validates the request
envelope, resolves the device session from cache, and checks `payload_hash`,
the client Ed25519 signature, timestamp freshness, the replay reservation, the
authenticated rate limits, and the authenticated policy hook.
Malformed envelopes are rejected with gRPC `INVALID_ARGUMENT`.
Requests with a non-empty but unsupported `protocol_version` are rejected with
gRPC `FAILED_PRECONDITION`.
The supported request `protocol_version` literal is `v1`.
Requests with an unknown `device_session_id` are rejected with gRPC
`UNAUTHENTICATED`.
Requests for revoked sessions are rejected with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures, including Redis lookup or record-decode
failures, are rejected with gRPC `UNAVAILABLE`.
Requests with a `payload_hash` that is not a 32-byte SHA-256 digest or does
not match `payload_bytes` are rejected with gRPC `INVALID_ARGUMENT`.
Requests with an invalid client signature or a signature created by a
different key are rejected with gRPC `UNAUTHENTICATED` and message
`invalid request signature`.
Requests with malformed cached `client_public_key` material fail closed as
gRPC `UNAVAILABLE`.
Requests with a `timestamp_ms` outside the symmetric freshness window around
current server time are rejected with gRPC `FAILED_PRECONDITION` and message
`request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id`
inside the active replay window are rejected with gRPC
`FAILED_PRECONDITION` and message `request replay detected`.
ReplayStore backend failures fail closed with gRPC `UNAVAILABLE` and message
`replay store is unavailable`.

Authenticated rate limits are enforced independently by transport peer IP,
authenticated `device_session_id`, authenticated `user_id`, and authenticated
message class. The gateway uses the full verified `message_type` literal as the
stable v1 message-class key because the transport does not yet define a
coarser authenticated class taxonomy. The peer IP is derived only from the
gRPC transport peer address; if it is missing or cannot be parsed, the
request falls back to the stable `unknown` IP bucket.
Requests that exceed any authenticated rate-limit bucket are rejected with
gRPC `RESOURCE_EXHAUSTED` and message
`authenticated request rate limit exceeded`.
The authenticated edge policy hook runs after those rate limits and defaults
to allow-all until a concrete policy evaluator is wired into the process.

`ExecuteCommand` builds an internal authenticated command context,
resolves one exact-match downstream route by the full verified `message_type`
literal, executes the downstream unary client, and signs the response before
it is returned to the caller. When no exact downstream route is registered,
`ExecuteCommand` is rejected with gRPC `UNIMPLEMENTED` and message
`message_type is not routed`. Downstream availability failures are rejected
with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.
Unexpected downstream route-resolution or execution failures are rejected with
gRPC `INTERNAL`. Successful unary responses preserve the original
`request_id`, carry a SHA-256 `payload_hash` of the returned `payload_bytes`,
and are signed with the configured server Ed25519 response signer.
The default `cmd/gateway` wiring currently installs an empty static
downstream router, so verified `ExecuteCommand` requests still return gRPC
`UNIMPLEMENTED` until concrete downstream routes are injected.

`SubscribeEvents` applies the full authenticated ingress pipeline, binds
the stream to the verified `user_id` and `device_session_id`, sends one
signed `gateway.server_time` bootstrap event whose FlatBuffers payload carries
`server_time_ms`, registers the active stream in the in-memory `PushHub`, and
then forwards signed client-facing events consumed from the configured client
event Redis stream. User-targeted events fan out to every active stream for
that user. Session-targeted events fan out only to streams whose
`user_id` and `device_session_id` both match the event target. Each active
stream uses a bounded in-memory queue; when that queue overflows, only the
affected stream is closed with gRPC `RESOURCE_EXHAUSTED` and message
`push stream overflowed`. When the session lifecycle stream reports that a
`device_session_id` was revoked, every active `SubscribeEvents` stream
bound to that exact session is closed with gRPC `FAILED_PRECONDITION` and
message `device session is revoked`. During gateway shutdown, the in-memory
push hub is closed before gRPC graceful stop, and every active
`SubscribeEvents` stream is terminated with gRPC `UNAVAILABLE` and message
`gateway is shutting down`.

Authenticated anti-abuse budgets are configured by the
`GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_*` environment variables.

Current authenticated gRPC defaults:

- per-IP: `120 requests / minute`, `burst=40`;
- per-session: `60 requests / minute`, `burst=20`;
- per-user: `120 requests / minute`, `burst=40`;
- per-message-class: `60 requests / minute`, `burst=20`.

Authenticated anti-abuse configuration surface:

- per-IP:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST` default `40`;
- per-session:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS` default
  `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST` default
  `20`;
- per-user:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST` default `40`;
- per-message-class:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW`
  default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST`
  default `20`.

## Envelope and Payload Model

The authenticated transport uses a split contract:

- gRPC control messages are protobuf-based;
- business payload bytes are FlatBuffers;
- signatures are computed over canonical envelope fields and a hash of raw
  FlatBuffers bytes.

The gateway treats authenticated request `payload_bytes` as opaque business
data.
It verifies integrity and forwards verified bytes downstream without rewriting
them.

The request envelope version literal is `v1`.
`payload_hash` is the raw 32-byte SHA-256 digest of `payload_bytes`.
`ExecuteCommand` hashes the raw FlatBuffers payload bytes exactly as sent,
while `SubscribeEvents` with an empty payload still requires
`sha256([]byte{})` rather than a special-case value.
The v1 request signature scheme is Ed25519.
`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key registered during `confirm-email-code`.
`signature` carries the raw 64-byte Ed25519 signature computed over the
canonical request signing input.

The v1 stream bootstrap payload uses the shared FlatBuffers schema
`pkg/schema/fbs/gateway.fbs` with root table `gateway.ServerTimeEvent`.

### ExecuteCommandRequest

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `trace_id`

### ExecuteCommandResponse

Required fields:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_bytes`
- `payload_hash`
- `signature`

The v1 unary response signature scheme is Ed25519 with response
domain marker `galaxy-response-v1`.
The response signing input uses the same canonical binary encoding shape as
the request signer:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-response-v1`, `protocol_version`,
  `request_id`, `timestamp_ms`, `result_code`, `payload_hash`.

`cmd/gateway` loads the unary response signer from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must point to a PKCS#8
PEM-encoded Ed25519 private key. Startup fails when the file is absent,
unreadable, not strict PEM, not PKCS#8, or not Ed25519.

### SubscribeEventsRequest

The stream open request reuses the authenticated request model.
It contains the same authentication fields as the unary request and either an
empty payload or a minimal connect payload.

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash`
- `signature`

Optional fields:

- `payload_bytes`
- `trace_id`

### GatewayEvent

Every stream event is a client-facing signed server message.

Required fields:

- `event_type`
- `event_id`
- `timestamp_ms`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `request_id`
- `trace_id`

The v1 stream-event signature scheme is Ed25519 with event domain
marker `galaxy-event-v1`.
The event signing input uses the same canonical binary encoding shape as the
request and unary response signers:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-event-v1`, `event_type`, `event_id`,
  `timestamp_ms`, `request_id`, `trace_id`, `payload_hash`.

The bootstrap event uses:

- `event_type = "gateway.server_time"`;
- `event_id = request_id` from the opening `SubscribeEvents` request;
- `payload_bytes` encoded as FlatBuffers `gateway.ServerTimeEvent` with
  `server_time_ms`;
- the same loaded Ed25519 signer configured by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Client-facing fan-out events are sourced from the internal client
event stream. Internal publishers provide the event target and business
payload only: `user_id`, optional `device_session_id`, `event_type`,
`event_id`, `payload_bytes`, and optional `request_id` / `trace_id`. The
gateway derives `timestamp_ms`, recomputes `payload_hash`, signs the event,
and only then forwards it to the matching `SubscribeEvents` streams.

## Verification and Routing Pipeline

The gateway applies the same strict verification order to all authenticated
gRPC ingress.

1. Parse the control envelope and validate required fields.
2. Check whether `protocol_version` is supported.
3. Resolve `device_session_id` through `SessionCache`.
4. Reject unknown or revoked sessions.
5. Verify that `payload_hash` matches raw `payload_bytes`.
6. Verify the client signature using the public key from session cache.
7. Verify that `timestamp_ms` is inside the accepted freshness window.
8. Verify anti-replay by checking `device_session_id + request_id`.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by `message_type`.

No downstream business service should receive a request that has not passed
this full verification pipeline.

`ExecuteCommand` enforces steps 1 through 11 and
signs the successful unary response afterward. `SubscribeEvents` enforces
steps 1 through 9, binds the verified stream identity, sends the initial
signed server-time bootstrap event, and then keeps the stream open for push
delivery.

Malformed envelopes fail with gRPC `INVALID_ARGUMENT`.
Unsupported non-empty `protocol_version` values fail with gRPC
`FAILED_PRECONDITION`.
Unknown sessions fail with gRPC `UNAUTHENTICATED`.
Revoked sessions fail with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures fail with gRPC `UNAVAILABLE`.
`payload_hash` values that are not raw 32-byte SHA-256 digests fail with gRPC
`INVALID_ARGUMENT` and message `payload_hash must be a 32-byte SHA-256 digest`.
`payload_hash` values that do not match `payload_bytes` fail with gRPC
`INVALID_ARGUMENT` and message `payload_hash does not match payload_bytes`.
Invalid request signatures fail with gRPC `UNAUTHENTICATED` and message
`invalid request signature`.
Malformed cached `client_public_key` values fail closed with gRPC
`UNAVAILABLE` and message `session cache is unavailable`.
Requests with a `timestamp_ms` outside the accepted freshness window fail with
gRPC `FAILED_PRECONDITION` and message
`request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id`
inside the active replay window fail with gRPC `FAILED_PRECONDITION` and
message `request replay detected`.
ReplayStore backend failures fail with gRPC `UNAVAILABLE` and message
`replay store is unavailable`.
Unrouted exact-match `message_type` values fail with gRPC `UNIMPLEMENTED` and
message `message_type is not routed`.
Downstream availability failures fail with gRPC `UNAVAILABLE` and message
`downstream service is unavailable`.

## Internal Authenticated Contract

Downstream services should receive an internal authenticated command rather than
raw external gRPC transport data.

The minimum authenticated context is:

- `user_id`
- `device_session_id`
- `message_type`
- verified `payload_bytes`
- `request_id`
- optional `trace_id`
- optional client metadata needed for logs and tracing

Downstream services may trust that the gateway has already performed transport
authentication, freshness verification, and anti-replay checks.
They must still perform business authorization and domain validation.

## Session Model

The Auth / Session Service is the source of truth for device session state.
The gateway is designed to authenticate the hot path from cache.

Expected session fields available to the gateway:

- `device_session_id`
- `user_id`
- base64-encoded raw 32-byte Ed25519 client public key
- session status
- revoke metadata
- optional client metadata

### Session Cache

`SessionCache` provides the fast path for:

- session existence checks;
- `device_session_id -> user_id`;
- access to the base64-encoded raw Ed25519 client public key used for
  signature verification;
- revoked versus active status checks.

Cache updates are event-driven.
TTL is allowed only as a safety net and must not replace invalidation events.

The gateway keeps a process-local in-memory snapshot
cache in front of the Redis fallback backend. Authenticated requests read the
local snapshot first. A local miss performs one bounded Redis lookup and seeds
the local snapshot so later requests for the same session avoid another Redis
round-trip unless a later session event changes the cached state.

The local snapshot cache intentionally has no TTL and no size-based
eviction policy. Session lifecycle events are the authoritative mechanism for
keeping the hot path current, while Redis fallback remains the safety net for
cold misses and process restarts.

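
The read-through behavior described above can be sketched with a map guarded
by a mutex. All names here (`snapshotCache`, `fallbackLookup`, `apply`,
`evict`) are hypothetical, and the fallback is modeled as a plain function so
the sketch stays independent of any Redis client.

```go
package main

import "sync"

// sessionSnapshot is a reduced stand-in for the cached session record.
type sessionSnapshot struct {
	UserID  string
	Revoked bool
}

// fallbackLookup models the bounded Redis lookup used on a local miss.
type fallbackLookup func(deviceSessionID string) (snap sessionSnapshot, found bool, err error)

type snapshotCache struct {
	mu       sync.RWMutex
	entries  map[string]sessionSnapshot // no TTL, no size-based eviction
	fallback fallbackLookup
}

func newSnapshotCache(fallback fallbackLookup) *snapshotCache {
	return &snapshotCache{entries: make(map[string]sessionSnapshot), fallback: fallback}
}

// get reads the local snapshot first and seeds it from the fallback on a miss.
func (c *snapshotCache) get(id string) (sessionSnapshot, bool, error) {
	c.mu.RLock()
	snap, ok := c.entries[id]
	c.mu.RUnlock()
	if ok {
		return snap, true, nil
	}
	snap, found, err := c.fallback(id)
	if err != nil || !found {
		return sessionSnapshot{}, false, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// A session event may have landed during the lookup; prefer it.
	if newer, ok := c.entries[id]; ok {
		return newer, true, nil
	}
	c.entries[id] = snap
	return snap, true, nil
}

// apply upserts state delivered by a session lifecycle event.
func (c *snapshotCache) apply(id string, snap sessionSnapshot) {
	c.mu.Lock()
	c.entries[id] = snap
	c.mu.Unlock()
}

// evict drops a local entry so the next read goes back to the fallback.
func (c *snapshotCache) evict(id string) {
	c.mu.Lock()
	delete(c.entries, id)
	c.mu.Unlock()
}
```

Preferring an entry written by a concurrent event over the value just fetched
keeps a slow fallback read from overwriting a newer revocation.
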
The Redis fallback implementation uses `go-redis/v9`.
`cmd/gateway` requires the Redis fallback backend during startup, issues a
bounded `PING`, and refuses to start when Redis is misconfigured or
unavailable.

Required environment variable:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR`

Optional environment variables:

- `GATEWAY_SESSION_CACHE_REDIS_USERNAME`
- `GATEWAY_SESSION_CACHE_REDIS_PASSWORD`
- `GATEWAY_SESSION_CACHE_REDIS_DB` with default `0`
- `GATEWAY_SESSION_CACHE_REDIS_KEY_PREFIX` with default `gateway:session:`
- `GATEWAY_SESSION_CACHE_REDIS_LOOKUP_TIMEOUT` with default `250ms`
- `GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED` with default `false`

The Redis key format is:

- `<key_prefix><device_session_id>`

The Redis value is one strict JSON object:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

`client_public_key` stores the standard base64-encoded raw 32-byte Ed25519
public key registered for the device session.

Malformed JSON, missing required fields, unsupported `status`, or a
`device_session_id` mismatch between the Redis value and the lookup key are
treated as SessionCache backend failures rather than as valid session states.

### Session Event Stream

The gateway keeps the process-local session snapshot cache synchronized from one
Redis Stream consumed through `go-redis/v9`.

`cmd/gateway` requires the session event stream configuration during startup,
issues a bounded `PING` against the same Redis deployment used for
`SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_SESSION_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical
database, timeout, and TLS settings configured for `SessionCache`.

Each gateway replica keeps its own in-memory last-seen stream ID and consumes
the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that
point, which preserves the same fresh-process semantics as Redis `$` while
avoiding a race before the first blocking read.

The session event payload is one strict full snapshot with these
fields:

- `device_session_id`
- `user_id`
- `client_public_key`
- `status`
- optional `revoked_at_ms`

Valid active and revoked snapshots upsert or replace the local session state.
Later stream entries win.
Malformed events are skipped without stopping the subscriber; when
`device_session_id` can still be extracted, the gateway evicts the local
snapshot for that session so it cannot continue using stale state.

Session event publishers must keep the stream bounded by using
`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side,
because consumer-side trimming could drop updates that another gateway replica
has not read yet.

### Client Event Stream

The gateway delivers client-facing push events from one dedicated Redis Stream
consumed through `go-redis/v9`.

`cmd/gateway` requires the client event stream configuration during startup,
issues a bounded `PING` against the same Redis deployment used for
`SessionCache`, and refuses to start when that Redis backend is unavailable.

Required environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`

Optional environment variable:

- `GATEWAY_CLIENT_EVENTS_REDIS_READ_BLOCK_TIMEOUT` with default `1s`

The subscriber reuses the same Redis address, ACL credentials, logical
database, timeout, and TLS settings configured for `SessionCache`.

Each gateway replica keeps its own in-memory last-seen stream ID and consumes
the stream with plain `XREAD`, not a shared consumer group.
On startup the replica resolves the current stream tail and begins from that
point, which preserves the same fresh-process semantics as Redis `$` while
avoiding a race before the first blocking read.

The client event payload is one strict target-plus-payload entry with
these fields:

- `user_id`
- optional `device_session_id`
- `event_type`
- `event_id`
- `payload_bytes`
- optional `request_id`
- optional `trace_id`

`payload_bytes` carries the raw binary-safe business payload bytes for the
outbound client event.
When `device_session_id` is absent or blank, the gateway fans the event out to
every active stream for `user_id`.
When `device_session_id` is present, the gateway fans the event out only to
active streams whose `user_id` and `device_session_id` both match.
Malformed client event entries are skipped without stopping the subscriber or
delivering partial data to clients.

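
The targeting rule reduces to a small predicate. The Go sketch below uses
hypothetical types `eventTarget` and `streamIdentity`; treating a
whitespace-only `device_session_id` as blank is an assumption about what
"blank" means here.

```go
package main

import "strings"

// eventTarget is the addressing part of a client event entry.
type eventTarget struct {
	UserID          string
	DeviceSessionID string // empty or blank means "all sessions of the user"
}

// streamIdentity is the verified identity bound to one SubscribeEvents stream.
type streamIdentity struct {
	UserID          string
	DeviceSessionID string
}

// shouldDeliver reports whether an event with target t goes to stream s.
func shouldDeliver(t eventTarget, s streamIdentity) bool {
	if t.UserID != s.UserID {
		return false
	}
	if strings.TrimSpace(t.DeviceSessionID) == "" {
		return true // user-targeted: every active stream for the user
	}
	return t.DeviceSessionID == s.DeviceSessionID
}
```
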
Client event publishers must keep the stream bounded by using
`XADD ... MAXLEN ~ <limit>` or an equivalent retention policy.
The gateway intentionally does not trim the stream from the consumer side,
because consumer-side trimming could drop updates that another gateway replica
has not read yet.

### Replay Store

`ReplayStore` provides the hot-path anti-replay reservation for:

- duplicate detection by `device_session_id + request_id`;
- bounded replay protection for the authenticated freshness window.

The ReplayStore uses Redis through `go-redis/v9`.
`cmd/gateway` requires the ReplayStore backend during startup, issues a
bounded `PING`, and refuses to start when Redis is misconfigured or
unavailable.

The ReplayStore reuses the same Redis deployment settings as `SessionCache`
and adds two replay-specific environment variables:

- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`

Replay keys use this format:

- `<key_prefix><base64url(device_session_id)>:<base64url(request_id)>`

For each accepted request, the replay reservation TTL is computed as:

- `timestamp_ms + freshness_window - now`

The TTL is clamped to a minimum positive duration so requests accepted exactly
on the freshness boundary still reserve their replay key.

### Revocation Behavior

When a device session is revoked:

1. the Auth / Session Service updates the source of truth;
2. it publishes a session update or revoke event;
3. the gateway invalidates or updates `SessionCache`;
4. new unary gRPC requests for that session are rejected;
5. active `SubscribeEvents` streams for that exact `device_session_id` are
   closed with gRPC `FAILED_PRECONDITION` and message
   `device session is revoked`.

## Public Anti-Abuse Model

The public REST layer must distinguish between public auth operations and
browser-originated traffic that may burst during a normal first page load.

The gateway uses these public route classes:

- `public_auth`
- `browser_bootstrap`
- `browser_asset`
- `public_misc`

Any classifier result outside this fixed set is normalized to `public_misc`
before the class is stored in request context or used for policy derivation.
The canonical base bucket namespace for public REST policy is
`public_rest/class=<class>`.

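The normalization and namespace rule can be sketched as two small functions.
The function and constant names are hypothetical; only the four class
literals and the `public_rest/class=<class>` format come from the text above.

```go
package main

import "fmt"

const (
	classPublicAuth       = "public_auth"
	classBrowserBootstrap = "browser_bootstrap"
	classBrowserAsset     = "browser_asset"
	classPublicMisc       = "public_misc"
)

// normalizeClass maps any classifier output outside the fixed set,
// including the empty string, to public_misc.
func normalizeClass(class string) string {
	switch class {
	case classPublicAuth, classBrowserBootstrap, classBrowserAsset, classPublicMisc:
		return class
	default:
		return classPublicMisc
	}
}

// baseBucket derives the base bucket namespace from the normalized class.
func baseBucket(class string) string {
	return "public_rest/class=" + normalizeClass(class)
}

func main() {
	fmt.Println(baseBucket("public_auth"))
	fmt.Println(baseBucket("something_else"))
	fmt.Println(baseBucket(""))
}
```

Normalizing before the class is stored means a misbehaving classifier cannot
create unbounded bucket namespaces.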
### Public Auth

`public_auth` is the stable route class for `send-email-code` and
`confirm-email-code`.
This class uses stricter limits and abuse scoring because it directly touches
account and session creation flows.

Controls include:

- per-IP and per-identity rate limits;
- request body size limits;
- method allow-lists;
- malformed request counters;
- elevated logging and security telemetry for repeated failures.

Current defaults:

- per-IP: `30 requests / minute`, `burst=10`;
- `send-email-code` identity buckets: `3 requests / 10 minutes`, `burst=1`,
  keyed by normalized `email`;
- `confirm-email-code` identity buckets: `6 requests / 10 minutes`,
  `burst=2`, keyed by normalized `challenge_id`;
- maximum request body size: `8192` bytes;
- only `POST` is accepted for public auth routes.

Configuration surface:

- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES` default `8192`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS` default
  `30`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW` default `1m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST` default `10`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `3`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `1`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `6`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `2`.

### Browser Bootstrap and Asset Traffic

`browser_bootstrap` and `browser_asset` use separate coarse-grained budgets.
They may exhibit bursty behavior during the first load and therefore must not
be treated as hostile based on burst pattern alone.

This traffic is still constrained by:

- dedicated rate limits;
- method allow-lists;
- body size limits where request bodies are expected;
- protocol and path validation;
- independent abuse telemetry.

The gateway must not merge these buckets or counters with `public_auth`.

Current defaults:

- `browser_bootstrap`: `60 requests / minute`, `burst=20`, `GET` and `HEAD`
  only, and no request body;
- `browser_asset`: `300 requests / minute`, `burst=80`, `GET` and `HEAD`
  only, and no request body;
- `public_misc`: `30 requests / minute`, `burst=10`, and no request body.

Configuration surface:

- `browser_bootstrap`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES` default
  `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST` default
  `20`;
- `browser_asset`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS` default
  `300`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST` default
  `80`;
- `public_misc`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS` default
  `30`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST` default `10`.

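The method allow-list and body size defaults for all four classes can be
sketched as a per-class policy table. This is an illustration only:
`routePolicy`, `policies`, and `checkRequest` are hypothetical names, and the
`405` and `413` status codes are assumptions rather than part of the contract
described here. A real handler would also cap the body reader itself, since a
declared content length can be missing or wrong.

```go
package main

import (
	"fmt"
	"net/http"
)

// routePolicy holds the method and body constraints for one route class.
type routePolicy struct {
	methods      map[string]bool // nil: no method allow-list stated for the class
	maxBodyBytes int64
}

var getHead = map[string]bool{http.MethodGet: true, http.MethodHead: true}

// policies mirrors the documented defaults per route class.
var policies = map[string]routePolicy{
	"public_auth":       {methods: map[string]bool{http.MethodPost: true}, maxBodyBytes: 8192},
	"browser_bootstrap": {methods: getHead, maxBodyBytes: 0},
	"browser_asset":     {methods: getHead, maxBodyBytes: 0},
	"public_misc":       {maxBodyBytes: 0},
}

// checkRequest returns the HTTP status a request would receive from the
// method and body checks alone. Status codes are assumed, not documented.
func checkRequest(class, method string, contentLength int64) int {
	p, ok := policies[class]
	if !ok {
		p = policies["public_misc"]
	}
	if p.methods != nil && !p.methods[method] {
		return http.StatusMethodNotAllowed
	}
	if contentLength > p.maxBodyBytes {
		return http.StatusRequestEntityTooLarge
	}
	return http.StatusOK
}

func main() {
	fmt.Println(checkRequest("browser_asset", http.MethodGet, 0))
	fmt.Println(checkRequest("browser_asset", http.MethodPost, 0))
	fmt.Println(checkRequest("public_auth", http.MethodPost, 8193))
}
```
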
## Push Delivery Model

The v1 push channel is a gRPC server stream.
Long-polling is intentionally out of scope for the first version.

Expected stream behavior:

1. the client opens `SubscribeEvents`;
2. the gateway applies the full authenticated ingress verification pipeline;
3. the stream is bound to `user_id` and `device_session_id`;
4. the first signed service event is `gateway.server_time` and its
   FlatBuffers payload includes `server_time_ms`;
5. after that bootstrap event, the stream is registered in `PushHub` and
   remains open until client cancellation, server shutdown, queue overflow,
   session revoke for the same `device_session_id`, or a later send failure;
6. internal pub/sub may target all active streams for one `user_id` or only
   one `device_session_id` within that user;
7. the current per-stream in-memory queue capacity is `64` events and
   overflow closes only the affected stream;
8. session revoke closes only streams bound to the same exact
   `device_session_id` and returns gRPC `FAILED_PRECONDITION` with message
   `device session is revoked`.

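The queue overflow rule in step 7 can be sketched with a buffered channel and
a non-blocking send. The `pushStream` type and its methods are hypothetical,
and the sketch is single-goroutine for clarity; a real hub would need
synchronization around the closed flag.

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueOverflow = errors.New("push queue overflow")

// pushStream models one SubscribeEvents stream with a bounded event queue.
type pushStream struct {
	queue    chan []byte
	closed   bool
	closeErr error
}

func newPushStream(capacity int) *pushStream {
	return &pushStream{queue: make(chan []byte, capacity)}
}

// enqueue never blocks the publisher: a full queue closes this stream only.
func (s *pushStream) enqueue(event []byte) bool {
	if s.closed {
		return false
	}
	select {
	case s.queue <- event:
		return true
	default:
		s.closed = true
		s.closeErr = errQueueOverflow
		close(s.queue)
		return false
	}
}

func main() {
	slow := newPushStream(64) // consumer never reads
	fast := newPushStream(64) // consumer drains after every event
	for i := 0; i < 65; i++ {
		slow.enqueue([]byte("event"))
		fast.enqueue([]byte("event"))
		<-fast.queue
	}
	fmt.Println(slow.closed, fast.closed) // prints "true false"
}
```

Because the send is non-blocking, one slow client cannot stall fan-out to the
other streams, which matches the "overflow closes only the affected stream"
behavior.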
## Lifecycle and Shutdown

Gateway process shutdown is coordinated across the public REST listener,
authenticated gRPC listener, optional admin listener, internal Redis
subscribers, and telemetry runtime.

`GATEWAY_SHUTDOWN_TIMEOUT` configures the per-component graceful shutdown
budget and defaults to `5s`.
During authenticated gRPC shutdown, the in-memory `PushHub` closes active
streams before gRPC graceful stop, so active `SubscribeEvents` calls terminate
with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

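A per-component shutdown budget can be sketched by giving every component its
own deadline-bound context. This is a sketch under assumptions: the
`component` type, the helper constructors, and the sequential ordering are
hypothetical, and the demo uses a short timeout instead of the `5s` default so
it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// demoTimeout stands in for GATEWAY_SHUTDOWN_TIMEOUT in this sketch.
const demoTimeout = 20 * time.Millisecond

// component is a hypothetical shutdown unit such as a listener or subscriber.
type component struct {
	name     string
	shutdown func(ctx context.Context) error
}

// promptComponent stops immediately.
func promptComponent(name string) component {
	return component{name: name, shutdown: func(context.Context) error { return nil }}
}

// blockingComponent never finishes on its own and only returns at the deadline.
func blockingComponent(name string) component {
	return component{name: name, shutdown: func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}}
}

// shutdownAll gives each component a fresh budget of timeout, so one slow
// component cannot consume the budget of the next. Sequential order is an
// assumption of this sketch.
func shutdownAll(components []component, timeout time.Duration) map[string]error {
	results := make(map[string]error, len(components))
	for _, c := range components {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		results[c.name] = c.shutdown(ctx)
		cancel()
	}
	return results
}

func main() {
	components := []component{
		promptComponent("public-rest"),
		blockingComponent("authenticated-grpc"),
		promptComponent("telemetry"),
	}
	results := shutdownAll(components, demoTimeout)
	for _, c := range components {
		timedOut := errors.Is(results[c.name], context.DeadlineExceeded)
		fmt.Printf("%s timed_out=%v\n", c.name, timedOut)
	}
}
```
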
## Recommended Package Layout

The package layout keeps transport, policy, and downstream adapters separate:

- `cmd/gateway`
- `internal/app`
- `internal/config`
- `internal/restapi`
- `internal/grpcapi`
- `internal/authn`
- `internal/session`
- `internal/replay`
- `internal/ratelimit`
- `internal/downstream`
- `internal/push`
- `internal/events`
- `internal/clock`

## Key Interfaces

The gateway should be built around explicit consumer-side interfaces.

### SessionCache

Provides cached session lookup by `device_session_id`.
Returns enough data to verify signatures and identify the authenticated user.
The current production implementation is a process-local read-through cache in
front of a Redis fallback adapter that uses strict JSON records under a
configurable key prefix.

### ReplayStore

Tracks recently seen `request_id` values per device session and rejects replayed
requests inside the accepted freshness window.
The current production adapter is Redis-backed, uses a dedicated configurable
key prefix, and reserves keys with a TTL derived from
`timestamp_ms + freshness_window - now`.

### RateLimiter

Applies independent policies for:

- public REST route classes;
- authenticated gRPC requests by IP;
- authenticated gRPC requests by session;
- authenticated gRPC requests by user;
- authenticated gRPC requests by message class.

The current rate limiter is process-local and in-memory.
Public REST keys stay under the `public_rest/...` namespace, while
authenticated gRPC keys stay under `authenticated_grpc/...`, so both traffic
surfaces keep independent buckets even when they share the same limiter
backend.

### PublicTrafficClassifier

Maps incoming public REST requests to one of the public route classes so that
limits and anti-abuse counters remain isolated.
The gateway normalizes any unsupported or empty classifier output to
`public_misc`, and public policy code derives the base bucket namespace from
the normalized class as `public_rest/class=<class>`.

### AuthServiceClient

Handles public auth commands and session-related updates exchanged with the
Auth / Session Service.
The gateway contract is:

- `SendEmailCode(email) -> challenge_id`
- `ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id`

When no concrete implementation is wired, the gateway keeps the public routes
available and returns a stable `503 service_unavailable` response instead of
failing process startup.

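The contract and the unwired fallback could look roughly like the following
Go interface and stub. The method names follow the contract above; everything
else is an assumption for illustration, including the `context.Context`
parameter, the plain `string` parameter types, the `ErrAuthUnavailable`
sentinel, and the `internal_error` code used for other failures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// AuthServiceClient is a sketch of the consumer-side interface.
type AuthServiceClient interface {
	SendEmailCode(ctx context.Context, email string) (challengeID string, err error)
	ConfirmEmailCode(ctx context.Context, challengeID, code, clientPublicKey, timeZone string) (deviceSessionID string, err error)
}

// ErrAuthUnavailable is a hypothetical sentinel for "no adapter wired".
var ErrAuthUnavailable = errors.New("auth service unavailable")

// unavailableAuthClient is the fallback used when nothing is injected.
type unavailableAuthClient struct{}

func (unavailableAuthClient) SendEmailCode(context.Context, string) (string, error) {
	return "", ErrAuthUnavailable
}

func (unavailableAuthClient) ConfirmEmailCode(context.Context, string, string, string, string) (string, error) {
	return "", ErrAuthUnavailable
}

// statusFor maps a client error to an HTTP status and a stable error code.
func statusFor(err error) (int, string) {
	switch {
	case err == nil:
		return http.StatusOK, ""
	case errors.Is(err, ErrAuthUnavailable):
		return http.StatusServiceUnavailable, "service_unavailable"
	default:
		return http.StatusInternalServerError, "internal_error" // assumed code
	}
}

// demoSend calls SendEmailCode and reports the status the route would return.
func demoSend(client AuthServiceClient) (int, string) {
	_, err := client.SendEmailCode(context.Background(), "user@example.com")
	return statusFor(err)
}

func main() {
	var client AuthServiceClient = unavailableAuthClient{}
	fmt.Println(demoSend(client)) // prints "503 service_unavailable"
}
```

Wiring a stub rather than a nil client keeps the route handlers free of nil
checks and lets the process start without the Auth / Session Service.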
### DownstreamRouter

Resolves the target downstream service or adapter by the full exact-match
`message_type` literal.

### DownstreamClient

Executes a verified authenticated command against a downstream internal service
and returns response payload bytes plus a stable opaque result code.
An empty or whitespace-only result code is treated as an internal downstream
contract violation.

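Exact-match routing and the result code check can be sketched together. The
`router` and `downstreamClient` types, the error values, and the message type
literals are hypothetical; only the exact-match lookup and the rejection of
blank result codes come from the descriptions above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var (
	errNoRoute           = errors.New("no downstream route for message_type")
	errContractViolation = errors.New("downstream returned a blank result code")
)

// downstreamClient executes one verified command and returns payload bytes
// plus an opaque result code.
type downstreamClient func(payload []byte) (response []byte, resultCode string, err error)

// router resolves clients by the full message_type literal.
type router map[string]downstreamClient

func (r router) execute(messageType string, payload []byte) ([]byte, string, error) {
	client, ok := r[messageType] // exact match: no prefixes, no wildcards
	if !ok {
		return nil, "", errNoRoute
	}
	resp, code, err := client(payload)
	if err != nil {
		return nil, "", err
	}
	if strings.TrimSpace(code) == "" {
		return nil, "", errContractViolation
	}
	return resp, code, nil
}

func main() {
	r := router{
		"game.move.v1": func(p []byte) ([]byte, string, error) { return p, "ok", nil },
		"game.chat.v1": func(p []byte) ([]byte, string, error) { return p, "  ", nil },
	}
	_, code, err := r.execute("game.move.v1", []byte("payload"))
	fmt.Println(code, err)
	_, _, err = r.execute("game.move", []byte("payload")) // prefix only, not a match
	fmt.Println(err)
	_, _, err = r.execute("game.chat.v1", []byte("payload"))
	fmt.Println(err)
}
```
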
### EventSubscriber

Subscribes to internal pub/sub topics used for:

- session cache updates;
- revocations;
- client-facing event delivery.

The implementation consumes two Redis Streams with replica-safe plain
`XREAD`: one strict full-session snapshot stream for the process-local session
cache and one client-facing event stream for live push fan-out.

### PushHub

Tracks active `SubscribeEvents` streams, binds them to authenticated identities,
and delivers events to the correct connections.
The implementation uses one bounded in-memory queue per stream with a
default capacity of `64` events; overflowing one queue closes only that stream
and leaves the remaining streams active.

### ResponseSigner

Signs unary responses and stream events so clients can verify server-originated
messages.
The implementation uses one Ed25519 signer loaded from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must reference a PKCS#8
PEM-encoded private key.

### Clock

Provides current server time and supports consistent freshness-window checks.

## Error Model and Observability

The gateway should expose stable edge-level error classes instead of leaking
internal implementation details.

Minimum error categories:

- malformed request;
- request too large;
- unsupported protocol;
- unknown session;
- revoked session;
- invalid signature;
- stale request;
- replay detected;
- rate limited;
- policy denied;
- downstream unavailable;
- backend unavailable;
- gateway shutting down;
- internal error.

Observability requirements:

- stable correlation identifiers, including `request_id` and optional `trace_id`;
- structured logs;
- security audit events for rejects and abuse signals;
- metrics keyed by route class, message type, result code, and reject reason;
- no logging of secrets, raw private material, or raw signatures.

The service uses:

- `go.uber.org/zap` for structured JSON logs;
- `otelgin` for the public REST listener;
- `otelgrpc` for the authenticated gRPC listener;
- OpenTelemetry metrics exported through Prometheus on the optional admin
  `/metrics` listener.

Current custom metric families:

- `gateway.public_http.requests`
- `gateway.public_http.duration`
- `gateway.authenticated_grpc.requests`
- `gateway.authenticated_grpc.duration`
- `gateway.push.active_streams`
- `gateway.push.stream_closures`
- `gateway.internal_event_drops`

The process-wide log level is configured by `GATEWAY_LOG_LEVEL` and
defaults to `info`.
The default OpenTelemetry resource uses `service.name=galaxy-edge-gateway`
when `OTEL_SERVICE_NAME` is unset.
If `OTEL_TRACES_EXPORTER` is unset or set to `none`, the gateway keeps tracing
runtime enabled but installs no external trace exporter.
If `OTEL_TRACES_EXPORTER=otlp`, the gateway uses the standard
`OTEL_EXPORTER_OTLP_*` environment variables to configure the OTLP trace
exporter protocol and endpoint.
The protocol selection specifically honors
`OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` first and falls back to
`OTEL_EXPORTER_OTLP_PROTOCOL` when the trace-specific variable is unset.
Supported values are `http/protobuf` and `grpc`; when both variables are
unset, the gateway defaults to `http/protobuf`.

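The protocol selection order can be sketched as a small resolver over an
environment lookup function. `otlpTracesProtocol` is a hypothetical name;
treating an empty value the same as an unset variable and returning an error
for unsupported values are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// otlpTracesProtocol resolves the OTLP trace exporter protocol:
// trace-specific variable first, then the generic one, then http/protobuf.
func otlpTracesProtocol(getenv func(string) string) (string, error) {
	proto := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if proto == "" {
		proto = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	if proto == "" {
		return "http/protobuf", nil
	}
	switch proto {
	case "http/protobuf", "grpc":
		return proto, nil
	default:
		return "", fmt.Errorf("unsupported OTLP traces protocol %q", proto)
	}
}

func main() {
	proto, err := otlpTracesProtocol(os.Getenv)
	fmt.Println(proto, err)
}
```

Passing the lookup function instead of calling `os.Getenv` directly keeps the
resolver easy to test with a plain map.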
Structured logs intentionally omit:

- public auth e-mail addresses, login codes, and challenge IDs;
- client public keys;
- raw payload bytes and payload hashes;
- raw request or response signatures;
- response-signer private key material and Redis credentials.

Malformed internal session and client-event stream entries are no longer
silently dropped: the gateway logs the drop and increments
`gateway.internal_event_drops`.

## Non-Goals

The gateway is not a business authorization layer and must not grow into a
domain coordinator.

The gateway must not:

- implement business ownership checks;
- validate domain state transitions;
- replace the Auth / Session Service as the session source of truth;
- degrade into a synchronous pass-through that reloads session state for every
  authenticated request.