# Edge Gateway

## Run and Dependencies

`cmd/gateway` starts with built-in listener defaults, but it still requires:

- one reachable Redis deployment used exclusively for anti-replay
  reservations (no session projection, no event streams);
- one reachable `backend` instance hosting the consolidated REST surface
  (`/api/v1/{public,user,internal}/*`) and the `Push.SubscribePush` gRPC
  listener;
- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Required startup environment variables:

- `GATEWAY_REDIS_MASTER_ADDR`
- `GATEWAY_REDIS_PASSWORD`
- `GATEWAY_BACKEND_HTTP_URL`
- `GATEWAY_BACKEND_GRPC_PUSH_URL`
- `GATEWAY_BACKEND_GATEWAY_CLIENT_ID`
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`
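
A preflight check for the required variables listed above could look like the following sketch. The helper name `missingEnv` and the rule that an empty value counts as missing are assumptions made for illustration; the real startup validation in `cmd/gateway` may differ.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv mirrors the required startup variables listed above.
var requiredEnv = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_BACKEND_HTTP_URL",
	"GATEWAY_BACKEND_GRPC_PUSH_URL",
	"GATEWAY_BACKEND_GATEWAY_CLIENT_ID",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// missingEnv returns every name in required whose value is empty according
// to lookup. The lookup function is injected so the check can be tested
// without touching the process environment.
// Assumption: an empty value is treated the same as an unset variable.
func missingEnv(required []string, lookup func(string) string) []string {
	var missing []string
	for _, name := range required {
		if lookup(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.Getenv); len(missing) > 0 {
		fmt.Fprintf(os.Stderr, "missing required variables: %v\n", missing)
		os.Exit(1)
	}
	fmt.Println("all required variables are set")
}
```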

Optional integrations:

- `GATEWAY_ADMIN_HTTP_ADDR` enables the private `/metrics` listener;
- `GATEWAY_BACKEND_HTTP_TIMEOUT`, `GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF`,
  and `GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF` tune the backend client.

Operational caveats:

- gateway issues one synchronous `/api/v1/internal/sessions/{id}` lookup per
  authenticated request because there is no process-local cache; backend
  keeps the source-of-truth record;
- the gRPC `SubscribePush` consumer reconnects with exponential backoff and
  jitter on every backend restart and resumes from the last cursor it
  observed.

Additional module docs:

- [Public REST contract](openapi.yaml)
- [Documentation index](docs/README.md)
- [Runtime and components](docs/runtime.md)
- [Request and push flows](docs/flows.md)
- [Operator runbook](docs/runbook.md)
- [Configuration and contract examples](docs/examples.md)
- [Example `.env`](.env.example)

## Purpose

`Edge Gateway` is the only public ingress for Galaxy Plus clients.
It terminates the external transport and security boundary, enforces edge
policies, and routes verified requests to internal services.

The gateway does not implement domain-specific business logic.
Business validation, authorization, ownership checks, and state transitions
remain inside downstream services.

## Trust Boundary

The gateway sits between untrusted external clients and trusted internal
services.

The gateway is responsible for:

- parsing external transport requests;
- classifying public REST traffic;
- authenticating protected gRPC traffic;
- loading session state from cache;
- verifying request freshness and anti-replay constraints;
- applying edge rate limits and anti-abuse policy;
- building an authenticated internal command context;
- routing verified commands to internal services;
- maintaining authenticated push delivery connections.

The gateway is not responsible for:

- deciding whether a user is allowed to execute a business action;
- validating domain invariants;
- storing the source-of-truth session record;
- implementing business idempotency.

## Transport Matrix

The gateway exposes two external transport classes.

| Transport | Audience | Authentication | Payload format | Primary use |
| --- | --- | --- | --- | --- |
| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |

### Public REST Surface

The public REST surface is used for commands that must work before a device
session exists and for browser-originated traffic that may share the same edge.
It covers the probe endpoints, public auth routes, and coarse public
anti-abuse.

Currently implemented public endpoints:

- `GET /healthz`
- `GET /readyz`
- `POST /api/v1/public/auth/send-email-code`
- `POST /api/v1/public/auth/confirm-email-code`

The implemented REST contract is documented in [`openapi.yaml`](openapi.yaml).
The listener address is configured by `GATEWAY_PUBLIC_HTTP_ADDR`.
The public REST listener read budgets are configured by:

- `GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_PUBLIC_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT` with default `1m`.

The public auth JSON contract uses a challenge-token flow:

- `send-email-code` accepts `email` and returns `challenge_id`;
- `confirm-email-code` accepts `challenge_id`, `code`,
  `client_public_key`, and `time_zone`, then returns
  `device_session_id`.

The JSON body for `send-email-code` remains unchanged, but gateway may also
consume the standard `Accept-Language` header on that route. Gateway resolves
the first supported BCP 47 language tag, falls back to `en` when needed, and
forwards that derived preferred-language candidate to
`Auth / Session Service` for localized auth mail and possible first-user
creation. The public JSON DTO itself remains unchanged.

`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key for the device session being created.
`time_zone` is the client-selected IANA time zone name forwarded unchanged to
`Auth / Session Service`.

The current create-path source of truth for `preferred_language` is the
language candidate derived from public `Accept-Language`, with fallback to
`en`. The public `confirm-email-code` DTO itself remains unchanged.
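
The `Accept-Language` resolution described above can be sketched as follows. The supported-language set is hypothetical, since this document does not list it. Matching on the primary subtag and ignoring `q` weights are simplifying assumptions, not a statement about the gateway's actual matcher.

```go
package main

import (
	"fmt"
	"strings"
)

// supported is a hypothetical set of languages; the real list is not given
// in this document.
var supported = map[string]bool{"en": true, "de": true, "fr": true}

// preferredLanguage returns the first supported language named in an
// Accept-Language header value, in header order, or "en" when none matches.
// Simplifications: q-values are ignored and only the primary subtag is
// compared ("de-AT" is treated as "de").
func preferredLanguage(header string) string {
	for _, part := range strings.Split(header, ",") {
		tag, _, _ := strings.Cut(part, ";") // drop ";q=..." parameters
		tag = strings.ToLower(strings.TrimSpace(tag))
		primary, _, _ := strings.Cut(tag, "-")
		if supported[primary] {
			return primary
		}
	}
	return "en"
}

func main() {
	fmt.Println(preferredLanguage("de-AT,de;q=0.9,en;q=0.8"))
	fmt.Println(preferredLanguage("ja, fr;q=0.5"))
	fmt.Println(preferredLanguage(""))
}
```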

These routes remain unauthenticated and delegate only through an injected
`AuthServiceClient`.
The default wiring used by `cmd/gateway` keeps the routes mounted and returns
`503 service_unavailable` until a concrete upstream auth adapter is supplied.
Public auth adapter calls are wrapped in
`GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT`, which defaults to `3s`.
When that timeout expires, the gateway preserves the public REST contract and
returns `503 service_unavailable`.
When an injected auth adapter returns `*AuthServiceError`, the gateway projects
that client-safe `4xx/5xx` status, `code`, and `message` back to the caller
after normalizing blank or invalid fields. Unexpected non-`AuthServiceError`
adapter failures fail closed as `500 internal_error`.

Public anti-abuse is process-local and in-memory.
Per-IP buckets are derived only from the TCP peer `RemoteAddr`.
Forwarded proxy headers such as `X-Forwarded-For` and `Forwarded` are
intentionally ignored.
Oversized public REST bodies are rejected with `413 request_too_large`.
Rate-limited requests are rejected with `429 rate_limited` and a
`Retry-After` header.
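
As an illustration of the per-IP keying rule above, the sketch below derives the bucket key from `RemoteAddr` alone and never reads forwarded headers. The function names are hypothetical, and the `unknown` fallback for unparsable peer addresses is borrowed from the authenticated gRPC description; whether the public REST path uses the same fallback is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

// peerIPFromAddr extracts the IP part of a "host:port" peer address.
// Assumption: unparsable addresses map to a shared "unknown" bucket.
func peerIPFromAddr(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return "unknown"
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return "unknown"
	}
	return ip.String()
}

// publicIPBucket keys the per-IP bucket on the TCP peer only.
// X-Forwarded-For and Forwarded are deliberately not consulted.
func publicIPBucket(r *http.Request) string {
	return peerIPFromAddr(r.RemoteAddr)
}

func main() {
	r := &http.Request{RemoteAddr: "203.0.113.7:51234", Header: http.Header{}}
	r.Header.Set("X-Forwarded-For", "198.51.100.1") // ignored on purpose
	fmt.Println(publicIPBucket(r))
}
```

Keying on the TCP peer means that a deployment behind a reverse proxy sees the proxy address in every bucket, which is the trade-off of refusing to trust client-supplied headers.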

In addition to the fixed endpoints above, the gateway may front browser
bootstrap or asset traffic through a pluggable public handler or proxy.
That traffic belongs to dedicated public route classes and must not share rate
limit buckets or abuse counters with the public auth API.

### Operational Admin Surface

The gateway may expose one private operational HTTP listener used for metrics.

The admin listener is disabled by default and is enabled only when
`GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
When enabled, it serves:

- `GET /metrics`

The admin listener read budgets are configured by:

- `GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT` with default `2s`;
- `GATEWAY_ADMIN_HTTP_READ_TIMEOUT` with default `10s`;
- `GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT` with default `1m`.

`/metrics` is intentionally not mounted on the public REST ingress.
It is also intentionally excluded from [`openapi.yaml`](openapi.yaml), because
that specification covers only the public REST ingress.
The endpoint exposes metrics in the Prometheus text exposition format described
in the official Prometheus documentation:
<https://prometheus.io/docs/instrumenting/exposition_formats/>.

### Authenticated gRPC Surface

All authenticated client requests use HTTP/2 and gRPC.
The listener address is configured by `GATEWAY_AUTHENTICATED_GRPC_ADDR`.
Inbound authenticated gRPC connection setup is bounded by
`GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT`, which defaults to `5s`.
The accepted client timestamp skew is configured by
`GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW` and defaults to `5m`.

The public gRPC service exposes two methods:

- `ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)`
- `SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)`

`ExecuteCommand` is a generic unary RPC.
The gateway routes the request downstream by `message_type` after transport
verification succeeds.
Downstream unary execution is bounded by
`GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT`, which defaults to `5s`.
When that timeout expires, the gateway preserves the authenticated gRPC
contract and returns gRPC `UNAVAILABLE` with message
`downstream service is unavailable`.

`SubscribeEvents` is an authenticated server-streaming RPC.
It binds the stream to `user_id` and `device_session_id` and starts by sending
a signed service event that includes the current server time in milliseconds.

The v1 protobuf contract lives in
`proto/galaxy/gateway/v1/edge_gateway.proto` under package
`galaxy.gateway.v1` and service `EdgeGateway`.
Generated Go bindings are committed under `proto/galaxy/gateway/v1/` and are
regenerated with:

```bash
buf generate
```

Before any later routing or push step runs, the gateway validates the request
envelope, resolves the device session through the session cache, verifies
`payload_hash` and the client Ed25519 signature, checks timestamp freshness,
reserves the replay key, and applies the authenticated rate limits and the
authenticated policy hook.

Malformed envelopes are rejected with gRPC `INVALID_ARGUMENT`.
Requests with a non-empty but unsupported `protocol_version` are rejected with
gRPC `FAILED_PRECONDITION`.
The supported request `protocol_version` literal is `v1`.
Requests with an unknown `device_session_id` are rejected with gRPC
`UNAUTHENTICATED`.
Requests for revoked sessions are rejected with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures, including Redis lookup or record-decode
failures, fail closed with gRPC `UNAVAILABLE`.
Requests with a `payload_hash` that is not a 32-byte SHA-256 digest or does
not match `payload_bytes` are rejected with gRPC `INVALID_ARGUMENT`.
Requests with an invalid client signature or a signature created by a
different key are rejected with gRPC `UNAUTHENTICATED` and message
`invalid request signature`.
Requests with malformed cached `client_public_key` material fail closed as
gRPC `UNAVAILABLE`.
Requests with a `timestamp_ms` outside the symmetric freshness window around
current server time are rejected with gRPC `FAILED_PRECONDITION` and message
`request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id`
inside the active replay window are rejected with gRPC
`FAILED_PRECONDITION` and message `request replay detected`.
ReplayStore backend failures fail closed with gRPC `UNAVAILABLE` and message
`replay store is unavailable`.

Authenticated rate limits are enforced independently by transport peer IP,
authenticated `device_session_id`, authenticated `user_id`, and authenticated
message class. The gateway uses the full verified `message_type` literal as the
stable v1 message-class key because the transport does not yet define a
coarser authenticated class taxonomy. The peer IP is derived only from the
gRPC transport peer address; if it is missing or cannot be parsed, the
request falls back to the stable `unknown` IP bucket.
Requests that exceed any authenticated rate-limit bucket are rejected with
gRPC `RESOURCE_EXHAUSTED` and message
`authenticated request rate limit exceeded`.
The authenticated edge policy hook runs after those rate limits and defaults
to allow-all until a concrete policy evaluator is wired into the process.

`ExecuteCommand` builds an internal authenticated command context,
resolves one exact-match downstream route by the full verified `message_type`
literal, executes the downstream unary client, and signs the response before
it is returned to the caller. When no exact downstream route is registered,
`ExecuteCommand` is rejected with gRPC `UNIMPLEMENTED` and message
`message_type is not routed`. Downstream availability failures are rejected
with gRPC `UNAVAILABLE` and message `downstream service is unavailable`.
Unexpected downstream route-resolution or execution failures are rejected with
gRPC `INTERNAL`. Successful unary responses preserve the original
`request_id`, carry a SHA-256 `payload_hash` of the returned `payload_bytes`,
and are signed with the configured server Ed25519 response signer.
The default `cmd/gateway` wiring currently installs an empty static
downstream router, so verified `ExecuteCommand` requests still return gRPC
`UNIMPLEMENTED` until concrete downstream routes are injected.

`SubscribeEvents` applies the full authenticated ingress pipeline, binds
the stream to the verified `user_id` and `device_session_id`, sends one
signed `gateway.server_time` bootstrap event whose FlatBuffers payload carries
`server_time_ms`, registers the active stream in the in-memory `PushHub`, and
then forwards signed client-facing events consumed from the configured client
event Redis stream. User-targeted events fan out to every active stream for
that user. Session-targeted events fan out only to streams whose
`user_id` and `device_session_id` both match the event target.

Each active stream uses a bounded in-memory queue; when that queue overflows,
only the affected stream is closed with gRPC `RESOURCE_EXHAUSTED` and message
`push stream overflowed`. When the session lifecycle stream reports that the
same `device_session_id` was revoked, every active `SubscribeEvents` stream
bound to that exact session is closed with gRPC `FAILED_PRECONDITION` and
message `device session is revoked`. During gateway shutdown, the in-memory
push hub is closed before gRPC graceful stop, and every active
`SubscribeEvents` stream is terminated with gRPC `UNAVAILABLE` and message
`gateway is shutting down`.
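
The bounded per-stream queue described above can be sketched with a buffered channel and a non-blocking send. The type and method names (`pushStream`, `offer`) are hypothetical, and the real `PushHub` may be structured differently. The sketch only shows how an overflow can be detected for one stream without blocking delivery to the others.

```go
package main

import "fmt"

// pushStream is a hypothetical stand-in for one active SubscribeEvents stream.
type pushStream struct {
	queue chan []byte // bounded queue of signed events awaiting delivery
}

func newPushStream(capacity int) *pushStream {
	return &pushStream{queue: make(chan []byte, capacity)}
}

// offer enqueues an event without blocking. It returns false when the queue
// is full; the caller would then close only this stream with
// RESOURCE_EXHAUSTED and leave every other stream untouched.
func (s *pushStream) offer(event []byte) bool {
	select {
	case s.queue <- event:
		return true
	default:
		return false
	}
}

func main() {
	s := newPushStream(2)
	for i := 0; i < 3; i++ {
		fmt.Println(s.offer([]byte{byte(i)}))
	}
}
```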

Authenticated anti-abuse budgets are configured by the
`GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_*` environment variables.

Current authenticated gRPC defaults:

- per-IP: `120 requests / minute`, `burst=40`;
- per-session: `60 requests / minute`, `burst=20`;
- per-user: `120 requests / minute`, `burst=40`;
- per-message-class: `60 requests / minute`, `burst=20`.
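
One common way to read a budget such as `60 requests / minute` with `burst=20` is a token bucket that holds at most `burst` tokens and refills at `requests / window`. The sketch below uses that interpretation, which is an assumption; this document does not state which limiter algorithm the gateway uses, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// tokenBucket is a minimal token-bucket limiter with an injected clock.
type tokenBucket struct {
	capacity   float64 // maximum tokens, equal to the configured burst
	refillRate float64 // tokens added per second
	tokens     float64
	last       time.Time
}

func newTokenBucket(requests int, window time.Duration, burst int, now time.Time) *tokenBucket {
	return &tokenBucket{
		capacity:   float64(burst),
		refillRate: float64(requests) / window.Seconds(),
		tokens:     float64(burst),
		last:       now,
	}
}

// allow refills the bucket for the elapsed time and consumes one token.
func (b *tokenBucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.refillRate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// allowedInBurst counts how many of n back-to-back requests pass.
func allowedInBurst(requests, windowSeconds, burst, n int) int {
	now := time.Unix(0, 0)
	b := newTokenBucket(requests, time.Duration(windowSeconds)*time.Second, burst, now)
	allowed := 0
	for i := 0; i < n; i++ {
		if b.allow(now) {
			allowed++
		}
	}
	return allowed
}

func main() {
	// Per-session defaults: 60 requests / minute, burst=20.
	fmt.Println(allowedInBurst(60, 60, 20, 25))
}
```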

Authenticated anti-abuse configuration surface:

- per-IP:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST` default `40`;
- per-session:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS` default
  `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST` default
  `20`;
- per-user:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS` default
  `120`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST` default `40`;
- per-message-class:
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW`
  default `1m`,
  `GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST`
  default `20`.

## Envelope and Payload Model

The authenticated transport uses a split contract:

- gRPC control messages are protobuf-based;
- business payload bytes are FlatBuffers;
- signatures are computed over canonical envelope fields and a hash of raw
  FlatBuffers bytes.

The gateway verifies authenticated payload bytes before any downstream call.
Most downstream routes may still treat those bytes as opaque, but the gateway
is also allowed to transcode verified FlatBuffers payloads into trusted
downstream REST/JSON calls when the concrete downstream contract requires it.

The current direct `Gateway -> User` self-service boundary uses that pattern:

- external message types:
  - `user.account.get`
  - `user.profile.update`
  - `user.settings.update`
- external payloads and responses:
  - FlatBuffers
- internal downstream transport:
  - strict REST/JSON to User Service
- business error projection:
  - gateway `result_code`
  - FlatBuffers error payload mirroring User Service `code` and `message`

The request envelope version literal is `v1`.
`payload_hash` is the raw 32-byte SHA-256 digest of `payload_bytes`.
`ExecuteCommand` hashes the raw FlatBuffers payload bytes exactly as sent,
while `SubscribeEvents` with an empty payload still requires
`sha256([]byte{})` rather than a special-case value.
The v1 request signature scheme is Ed25519.
`client_public_key` is the standard base64-encoded raw 32-byte Ed25519 public
key registered during `confirm-email-code`.
`signature` carries the raw 64-byte Ed25519 signature computed over the
canonical request signing input.
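
The `payload_hash` rule above, including the empty-payload case, can be illustrated with the standard library. The function names are hypothetical. The two error messages are the ones this document lists for the corresponding `INVALID_ARGUMENT` rejections.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

var (
	errHashLength   = errors.New("payload_hash must be a 32-byte SHA-256 digest")
	errHashMismatch = errors.New("payload_hash does not match payload_bytes")
)

// hashPayload returns the raw 32-byte digest a client would send.
// An empty or nil payload hashes like any other input; there is no
// special-case value.
func hashPayload(payloadBytes []byte) []byte {
	sum := sha256.Sum256(payloadBytes)
	return sum[:]
}

// checkPayloadHash verifies the digest length first and the digest value
// second, so each failure maps to its own message.
func checkPayloadHash(payloadBytes, payloadHash []byte) error {
	if len(payloadHash) != sha256.Size {
		return errHashLength
	}
	if !bytes.Equal(hashPayload(payloadBytes), payloadHash) {
		return errHashMismatch
	}
	return nil
}

func main() {
	empty := hashPayload(nil) // SubscribeEvents opened with no payload
	fmt.Printf("%x\n", empty)
	fmt.Println(checkPayloadHash(nil, empty))
	fmt.Println(checkPayloadHash([]byte("x"), empty))
}
```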

The v1 stream bootstrap payload uses the shared FlatBuffers schema
`pkg/schema/fbs/gateway.fbs` with root table `gateway.ServerTimeEvent`.

### ExecuteCommandRequest

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `trace_id`

### ExecuteCommandResponse

Required fields:

- `protocol_version`
- `request_id`
- `timestamp_ms`
- `result_code`
- `payload_bytes`
- `payload_hash`
- `signature`

The v1 unary response signature scheme is Ed25519 with response
domain marker `galaxy-response-v1`.
The response signing input uses the same canonical binary encoding shape as
the request signer:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-response-v1`, `protocol_version`,
  `request_id`, `timestamp_ms`, `result_code`, `payload_hash`.
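
A minimal sketch of the response signing input, following the encoding rules listed above. Two points are assumptions rather than facts from this document: that the domain marker is length-prefixed like any other string field, and that `result_code` is a string. The function names are hypothetical and the key is a throwaway generated for the demo.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const responseDomain = "galaxy-response-v1"

// appendField writes uvarint(len(field)) followed by the raw bytes.
func appendField(dst, field []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(field)))
	return append(dst, field...)
}

// responseSigningInput concatenates the signed fields in the documented
// order. Assumptions: the domain marker is encoded as a length-prefixed
// string and result_code is a string.
func responseSigningInput(protocolVersion, requestID string, timestampMS uint64, resultCode string, payloadHash []byte) []byte {
	var buf []byte
	buf = appendField(buf, []byte(responseDomain))
	buf = appendField(buf, []byte(protocolVersion))
	buf = appendField(buf, []byte(requestID))
	buf = binary.BigEndian.AppendUint64(buf, timestampMS) // fixed 8 bytes
	buf = appendField(buf, []byte(resultCode))
	buf = appendField(buf, payloadHash)
	return buf
}

func main() {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	hash := sha256.Sum256([]byte("payload"))
	input := responseSigningInput("v1", "req-1", 1700000000000, "ok", hash[:])
	sig := ed25519.Sign(priv, input)
	fmt.Println(len(sig), ed25519.Verify(pub, input, sig))
}
```

Length-prefixing every variable-width field keeps the encoding unambiguous: two different field tuples cannot produce the same byte string by shifting bytes between adjacent fields.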

`cmd/gateway` loads the unary response signer from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must point to a PKCS#8
PEM-encoded Ed25519 private key. Startup fails when the file is absent,
unreadable, not strict PEM, not PKCS#8, or not Ed25519.
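
The key-format checks above can be approximated with the standard library as shown below. What "strict PEM" means precisely is not defined in this document; the sketch interprets it as exactly one `PRIVATE KEY` block with no trailing data, which is an assumption. The function names are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
)

// parseEd25519PKCS8 decodes one PEM block and requires a PKCS#8 Ed25519 key.
// Assumption: "strict" means a single "PRIVATE KEY" block and nothing after it.
func parseEd25519PKCS8(pemBytes []byte) (ed25519.PrivateKey, error) {
	block, rest := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	if block.Type != "PRIVATE KEY" {
		return nil, fmt.Errorf("unexpected PEM type %q", block.Type)
	}
	if len(bytes.TrimSpace(rest)) != 0 {
		return nil, errors.New("trailing data after PEM block")
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parse PKCS#8: %w", err)
	}
	priv, ok := key.(ed25519.PrivateKey)
	if !ok {
		return nil, errors.New("key is not Ed25519")
	}
	return priv, nil
}

// selfCheck round-trips a throwaway key through PKCS#8 PEM encoding.
func selfCheck() error {
	_, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		return err
	}
	der, err := x509.MarshalPKCS8PrivateKey(priv)
	if err != nil {
		return err
	}
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der})
	parsed, err := parseEd25519PKCS8(pemBytes)
	if err != nil {
		return err
	}
	if !parsed.Equal(priv) {
		return errors.New("parsed key differs from original")
	}
	return nil
}

func main() {
	fmt.Println(selfCheck())
}
```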

### SubscribeEventsRequest

The stream open request reuses the authenticated request model.
It contains the same authentication fields as the unary request and either an
empty payload or a minimal connect payload.

Required fields:

- `protocol_version`
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash`
- `signature`

Optional fields:

- `payload_bytes`
- `trace_id`

### GatewayEvent

Every stream event is a client-facing signed server message.

Required fields:

- `event_type`
- `event_id`
- `timestamp_ms`
- `payload_bytes`
- `payload_hash`
- `signature`

Optional fields:

- `request_id`
- `trace_id`

The v1 stream-event signature scheme is Ed25519 with event domain
marker `galaxy-event-v1`.
The event signing input uses the same canonical binary encoding shape as the
request and unary response signers:

- each `string` and `bytes` field is encoded as `uvarint(len(field_bytes))`
  followed by raw bytes;
- `timestamp_ms` is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is `galaxy-event-v1`, `event_type`, `event_id`,
  `timestamp_ms`, `request_id`, `trace_id`, `payload_hash`.

The bootstrap event uses:

- `event_type = "gateway.server_time"`;
- `event_id = request_id` from the opening `SubscribeEvents` request;
- `payload_bytes` encoded as FlatBuffers `gateway.ServerTimeEvent` with
  `server_time_ms`;
- the same loaded Ed25519 signer configured by
  `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`.

Client-facing fan-out events are sourced from the internal client
event stream. Internal publishers provide the event target and business
payload only: `user_id`, optional `device_session_id`, `event_type`,
`event_id`, `payload_bytes`, and optional `request_id` / `trace_id`. The
gateway derives `timestamp_ms`, recomputes `payload_hash`, signs the event,
and only then forwards it to the matching `SubscribeEvents` streams.

Notification-owned user-facing payloads are expected to use
`pkg/schema/fbs/notification.fbs`. The initial notification event vocabulary
in v1 is exactly:

- `game.turn.ready`
- `game.finished`
- `lobby.application.submitted`
- `lobby.membership.approved`
- `lobby.membership.rejected`
- `lobby.membership.blocked`
- `lobby.invite.created`
- `lobby.invite.redeemed`
- `lobby.race_name.registration_eligible`
- `lobby.race_name.registered`

`lobby.application.submitted` is published toward `Gateway` only for the
private-game owner flow. The public-game variant is email-only.
The real `Notification Service -> Gateway` integration suite verifies this
user-targeted fan-out path and asserts that notification-owned push events do
not include `device_session_id`, so Gateway delivers them to every active
stream for the target user. Auth-code email does not use this push path and
continues to bypass `Notification Service`.

## Verification and Routing Pipeline

The gateway applies the same strict verification order to all authenticated
gRPC ingress.

1. Parse the control envelope and validate required fields.
2. Check whether `protocol_version` is supported.
3. Resolve `device_session_id` through `SessionCache`.
4. Reject unknown or revoked sessions.
5. Verify that `payload_hash` matches raw `payload_bytes`.
6. Verify the client signature using the public key from session cache.
7. Verify that `timestamp_ms` is inside the accepted freshness window.
8. Verify anti-replay by checking `device_session_id + request_id`.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by `message_type`.

No downstream business service should receive a request that has not passed
this full verification pipeline.
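
The fail-fast ordering above can be sketched as a list of named stages that stops at the first error, so later stages (including routing) never run for a rejected request. The `stage` type and `runStages` helper are hypothetical. The freshness check treats the window as symmetric and the boundary as inclusive; the inclusive boundary is an inference from the replay TTL description and should be read as an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

var errStale = errors.New("request timestamp is outside the freshness window")

// checkFreshnessMS accepts timestamps within +/- windowMS of nowMS.
// Assumption: a request exactly on the boundary is accepted.
func checkFreshnessMS(timestampMS, nowMS, windowMS int64) error {
	delta := nowMS - timestampMS
	if delta < 0 {
		delta = -delta
	}
	if delta > windowMS {
		return errStale
	}
	return nil
}

// stage is one named verification step.
type stage struct {
	name  string
	check func() error
}

// runStages executes the checks in order and returns the name and error of
// the first failing stage; later stages are not executed.
func runStages(stages []stage) (string, error) {
	for _, s := range stages {
		if err := s.check(); err != nil {
			return s.name, err
		}
	}
	return "", nil
}

func main() {
	const windowMS = 5 * 60 * 1000 // default freshness window of 5m
	nowMS := int64(1_700_000_000_000)
	stale := nowMS - windowMS - 1
	routed := false
	name, err := runStages([]stage{
		{"payload_hash", func() error { return nil }},
		{"freshness", func() error { return checkFreshnessMS(stale, nowMS, windowMS) }},
		{"route", func() error { routed = true; return nil }},
	})
	fmt.Println(name, err, routed)
}
```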

`ExecuteCommand` enforces steps 1 through 11 and
signs the successful unary response afterward. `SubscribeEvents` enforces
steps 1 through 9, binds the verified stream identity, sends the initial
signed server-time bootstrap event, and then keeps the stream open for push
delivery.

Malformed envelopes fail with gRPC `INVALID_ARGUMENT`.
Unsupported non-empty `protocol_version` values fail with gRPC
`FAILED_PRECONDITION`.
Unknown sessions fail with gRPC `UNAUTHENTICATED`.
Revoked sessions fail with gRPC `FAILED_PRECONDITION`.
SessionCache backend failures fail with gRPC `UNAVAILABLE`.
`payload_hash` values that are not raw 32-byte SHA-256 digests fail with gRPC
`INVALID_ARGUMENT` and message `payload_hash must be a 32-byte SHA-256 digest`.
`payload_hash` values that do not match `payload_bytes` fail with gRPC
`INVALID_ARGUMENT` and message `payload_hash does not match payload_bytes`.
Invalid request signatures fail with gRPC `UNAUTHENTICATED` and message
`invalid request signature`.
Malformed cached `client_public_key` values fail closed with gRPC
`UNAVAILABLE` and message `session cache is unavailable`.
Requests with a `timestamp_ms` outside the accepted freshness window fail with
gRPC `FAILED_PRECONDITION` and message
`request timestamp is outside the freshness window`.
Requests that reuse the same `request_id` for the same `device_session_id`
inside the active replay window fail with gRPC `FAILED_PRECONDITION` and
message `request replay detected`.
ReplayStore backend failures fail with gRPC `UNAVAILABLE` and message
`replay store is unavailable`.
Unrouted exact-match `message_type` values fail with gRPC `UNIMPLEMENTED` and
message `message_type is not routed`.
Downstream availability failures fail with gRPC `UNAVAILABLE` and message
`downstream service is unavailable`.

## Internal Authenticated Contract

Downstream services should receive an internal authenticated command rather than
raw external gRPC transport data.

The minimum authenticated context is:

- `user_id`
- `device_session_id`
- `message_type`
- verified `payload_bytes`
- `request_id`
- optional `trace_id`
- optional client metadata needed for logs and tracing

Downstream services may trust that the gateway has already performed transport
authentication, freshness verification, and anti-replay checks.
They must still perform business authorization and domain validation.

## Session Model

The Auth / Session Service is the source of truth for device session state.
The gateway is designed to authenticate the hot path from cache.

Expected session fields available to the gateway:

- `device_session_id`
- `user_id`
- base64-encoded raw 32-byte Ed25519 client public key
- session status
- revoke metadata
- optional client metadata

### Session Cache

`SessionCache` provides the fast path for:

- session existence checks;
- `device_session_id -> user_id`;
- access to the base64-encoded raw Ed25519 client public key used for
  signature verification;
- revoked versus active status checks.

Cache updates are event-driven.
TTL is allowed only as a safety net and must not replace invalidation events.

The gateway keeps a process-local in-memory snapshot
cache in front of the Redis fallback backend. Authenticated requests read the
local snapshot first. A local miss performs one bounded Redis lookup and seeds
the local snapshot so later requests for the same session avoid another Redis
round-trip unless a later session event changes the cached state.

The local snapshot cache intentionally has no TTL and no size-based
eviction policy. Session lifecycle events are the authoritative mechanism for
keeping the hot path current, while Redis fallback remains the safety net for
cold misses and process restarts.

The Redis fallback implementation uses `go-redis/v9`. `cmd/gateway` opens one
shared `*redis.Client` via `pkg/redisconn` (instrumented with OpenTelemetry
tracing and metrics), issues a single bounded `PING` on startup, and refuses
to start when Redis is misconfigured or unavailable. The session cache,
replay store, session-events subscriber, and client-events subscriber all
use that shared client. See `docs/redis-config.md` for the rationale behind
the shape and the project-wide rules in
`ARCHITECTURE.md §Persistence Backends`.

Required Redis connection variables:

- `GATEWAY_REDIS_MASTER_ADDR`
- `GATEWAY_REDIS_PASSWORD`

Optional Redis connection variables:

- `GATEWAY_REDIS_REPLICA_ADDRS` (comma-separated; reserved for future
  read-routing and currently unused)
- `GATEWAY_REDIS_DB` with default `0`
- `GATEWAY_REDIS_OPERATION_TIMEOUT` with default `250ms`

> Removed: `GATEWAY_SESSION_CACHE_REDIS_ADDR`,
> `GATEWAY_SESSION_CACHE_REDIS_USERNAME`,
> `GATEWAY_SESSION_CACHE_REDIS_PASSWORD`,
> `GATEWAY_SESSION_CACHE_REDIS_DB`,
> `GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED`. `pkg/redisconn.LoadFromEnv`
> rejects the deprecated `GATEWAY_REDIS_TLS_ENABLED` and
> `GATEWAY_REDIS_USERNAME` variables at startup.

Per-subsystem Redis behavior variables (namespace, timeouts):

- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`

Gateway no longer keeps a session cache projection or the two Redis
Streams (`session_events`, `client_events`). Session lookup is a
synchronous REST call to backend, and inbound client / session events
arrive through the gRPC `Push.SubscribePush` consumer (see the
**Backend Client** section below). Redis is therefore used only by
the Replay Store.

### Backend Client

`backendclient` is the single gateway → backend adapter:

- `RESTClient` calls `/api/v1/internal/sessions/{id}` synchronously per
  authenticated request, forwards public auth requests
  (`/api/v1/public/auth/*`), and forwards authenticated user / lobby
  commands (`/api/v1/user/*`) with the verified `X-User-Id` header.
- `PushClient` consumes `Push.SubscribePush` and reconnects with
  exponential backoff plus jitter, replaying the last cursor on every
  reconnect.

Required startup variables:

- `GATEWAY_BACKEND_HTTP_URL`: absolute base URL for the backend HTTP
  listener;
- `GATEWAY_BACKEND_GRPC_PUSH_URL`: `host:port` of the backend
  `Push.SubscribePush` listener;
- `GATEWAY_BACKEND_GATEWAY_CLIENT_ID`: durable identity presented to
  backend so reconnects replace the previous subscription.

Optional tuning:

- `GATEWAY_BACKEND_HTTP_TIMEOUT` with default `5s`;
- `GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF` with default `250ms`;
- `GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF` with default `30s`.
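
The reconnect delays implied by the defaults above can be sketched as capped exponential backoff with jitter. The doubling factor and the "full jitter" strategy (a uniform draw between zero and the current ceiling) are assumptions; this document states only that backoff is exponential with jitter, a `250ms` base, and a `30s` cap. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseBackoff = 250 * time.Millisecond // GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF default
	maxBackoff  = 30 * time.Second       // GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF default
)

// backoffCeiling returns min(base * 2^attempt, max) for attempt >= 0.
// The loop stops doubling once the cap is reached, so large attempt
// numbers cannot overflow.
func backoffCeiling(attempt int) time.Duration {
	d := baseBackoff
	for i := 0; i < attempt && d < maxBackoff; i++ {
		d *= 2
	}
	if d > maxBackoff {
		d = maxBackoff
	}
	return d
}

// nextDelay applies full jitter: a uniform delay in [0, ceiling].
func nextDelay(attempt int) time.Duration {
	ceiling := backoffCeiling(attempt)
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	for attempt := 0; attempt <= 8; attempt++ {
		fmt.Println(attempt, backoffCeiling(attempt), nextDelay(attempt))
	}
}
```

Jitter spreads reconnect attempts from many gateway instances over time, which avoids a synchronized reconnect burst when the backend restarts.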

### Replay Store

`ReplayStore` provides the hot-path anti-replay reservation for:

- duplicate detection by `device_session_id + request_id`;
- bounded replay protection for the authenticated freshness window.

The ReplayStore uses Redis through `go-redis/v9`.
`cmd/gateway` requires the ReplayStore backend during startup, issues a
bounded `PING`, and refuses to start when Redis is misconfigured or
unavailable.

The ReplayStore reuses the same Redis deployment settings as `SessionCache`
and adds two replay-specific environment variables:

- `GATEWAY_REPLAY_REDIS_KEY_PREFIX` with default `gateway:replay:`
- `GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT` with default `250ms`

Replay keys use this format:

- `<key_prefix><base64url(device_session_id)>:<base64url(request_id)>`

For each accepted request, the replay reservation TTL is computed as:

- `timestamp_ms + freshness_window - now`

The TTL is clamped to a minimum positive duration so requests accepted exactly
on the freshness boundary still reserve their replay key.
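
The key format and TTL rule above can be sketched as follows. Unpadded base64url (`base64.RawURLEncoding`) and the `1ms` minimum TTL are assumptions, because this document specifies neither the padding variant nor the clamp value. The function names are hypothetical, and the sketch only computes the key and TTL; it does not talk to Redis.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// minReplayTTLMS is a hypothetical floor; the document says only that the
// TTL is clamped to a minimum positive duration.
const minReplayTTLMS = 1

// replayKey builds <key_prefix><base64url(device_session_id)>:<base64url(request_id)>.
// Assumption: base64url is used without padding.
func replayKey(prefix, deviceSessionID, requestID string) string {
	enc := base64.RawURLEncoding
	return prefix +
		enc.EncodeToString([]byte(deviceSessionID)) + ":" +
		enc.EncodeToString([]byte(requestID))
}

// replayTTLMS computes timestamp_ms + freshness_window - now in
// milliseconds and clamps the result to the minimum positive TTL.
func replayTTLMS(timestampMS, freshnessWindowMS, nowMS int64) int64 {
	ttl := timestampMS + freshnessWindowMS - nowMS
	if ttl < minReplayTTLMS {
		ttl = minReplayTTLMS
	}
	return ttl
}

func main() {
	fmt.Println(replayKey("gateway:replay:", "s1", "r1"))
	// A request stamped 1s ago with a 5m window stays reserved for 299s.
	fmt.Println(replayTTLMS(1_000, 300_000, 2_000))
	// A request exactly on the boundary still gets a positive TTL.
	fmt.Println(replayTTLMS(1_000, 300_000, 301_000))
}
```

Encoding both identifiers keeps the `:` separator unambiguous even when an identifier itself contains a colon.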

### Revocation Behavior

When a device session is revoked:

1. the Auth / Session Service updates the source of truth;
2. it publishes a session update or revoke event;
3. the gateway invalidates or updates `SessionCache`;
4. new unary gRPC requests for that session are rejected;
5. active `SubscribeEvents` streams for that exact `device_session_id` are
   closed with gRPC `FAILED_PRECONDITION` and message
   `device session is revoked`.

## Public Anti-Abuse Model

The public REST layer must distinguish between public auth operations and
browser-originated traffic that may burst during a normal first page load.

The gateway uses these public route classes:

- `public_auth`
- `browser_bootstrap`
- `browser_asset`
- `public_misc`

Any classifier result outside this fixed set is normalized to `public_misc`
before the class is stored in request context or used for policy derivation.
The canonical base bucket namespace for public REST policy is
`public_rest/class=<class>`.
|
|
|
|
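The normalization rule and the bucket namespace can be sketched in a few lines. The type and function names (`RouteClass`, `normalizeClass`, `baseBucket`) are hypothetical; only the class literals and the `public_rest/class=<class>` format come from this document.

```go
package main

// RouteClass is a hypothetical type for the public route class.
type RouteClass string

const (
	ClassPublicAuth       RouteClass = "public_auth"
	ClassBrowserBootstrap RouteClass = "browser_bootstrap"
	ClassBrowserAsset     RouteClass = "browser_asset"
	ClassPublicMisc       RouteClass = "public_misc"
)

// normalizeClass maps anything outside the fixed set, including the empty
// string, to public_misc.
func normalizeClass(c RouteClass) RouteClass {
	switch c {
	case ClassPublicAuth, ClassBrowserBootstrap, ClassBrowserAsset, ClassPublicMisc:
		return c
	default:
		return ClassPublicMisc
	}
}

// baseBucket derives the base bucket namespace from the normalized class.
func baseBucket(c RouteClass) string {
	return "public_rest/class=" + string(normalizeClass(c))
}
```

Normalizing before deriving the bucket means a misbehaving classifier cannot create an unbounded number of bucket namespaces.
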
### Public Auth

`public_auth` is the stable route class for `send-email-code` and
`confirm-email-code`.
This class uses stricter limits and abuse scoring because it directly touches
account and session creation flows.

Controls include:

- per-IP and per-identity rate limits;
- request body size limits;
- method allow-lists;
- malformed request counters;
- elevated logging and security telemetry for repeated failures.

Current defaults:

- per-IP: `30 requests / minute`, `burst=10`;
- `send-email-code` identity buckets: `3 requests / 10 minutes`, `burst=1`,
  keyed by normalized `email`;
- `confirm-email-code` identity buckets: `6 requests / 10 minutes`,
  `burst=2`, keyed by normalized `challenge_id`;
- maximum request body size: `8192` bytes;
- only `POST` is accepted for public auth routes.

Configuration surface:

- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES` default `8192`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS` default
  `30`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW` default `1m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST` default `10`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `3`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `1`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS`
  default `6`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW`
  default `10m`;
- `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST`
  default `2`.

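One way to read the `requests` / `window` / `burst` triples above is as a token bucket: the bucket holds at most `burst` tokens and refills at `requests` per `window`. This document does not say which algorithm the limiter uses, so the sketch below is an assumption, and `tokenBucket` is a hypothetical, non-concurrent type. Times are Unix milliseconds so the behavior is deterministic.

```go
package main

import "time"

// tokenBucket is a minimal single-goroutine token bucket.
type tokenBucket struct {
	requests float64 // tokens added per window
	windowMS float64 // window length in milliseconds (must be > 0)
	burst    float64 // bucket capacity
	tokens   float64 // currently available tokens
	lastMS   int64   // time of the last refill
}

// newTokenBucket starts with a full bucket at nowMS.
func newTokenBucket(requests int, window time.Duration, burst int, nowMS int64) *tokenBucket {
	return &tokenBucket{
		requests: float64(requests),
		windowMS: float64(window.Milliseconds()),
		burst:    float64(burst),
		tokens:   float64(burst),
		lastMS:   nowMS,
	}
}

// allow refills for the elapsed time, then spends one token if available.
func (b *tokenBucket) allow(nowMS int64) bool {
	if nowMS > b.lastMS {
		b.tokens += float64(nowMS-b.lastMS) * b.requests / b.windowMS
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.lastMS = nowMS
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

Under this reading, the per-IP default (`30 requests / minute`, `burst=10`) admits ten back-to-back requests and then roughly one more every two seconds.
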
### Browser Bootstrap and Asset Traffic

`browser_bootstrap` and `browser_asset` use separate coarse-grained budgets.
They may exhibit bursty behavior during the first load and therefore must not
be treated as hostile based on burst pattern alone.

This traffic is still constrained by:

- dedicated rate limits;
- method allow-lists;
- body size limits where request bodies are expected;
- protocol and path validation;
- independent abuse telemetry.

The gateway must not merge these buckets or counters with `public_auth`.

Current defaults:

- `browser_bootstrap`: `60 requests / minute`, `burst=20`, `GET` and `HEAD`
  only, and no request body;
- `browser_asset`: `300 requests / minute`, `burst=80`, `GET` and `HEAD`
  only, and no request body;
- `public_misc`: `30 requests / minute`, `burst=10`, and no request body.

Configuration surface:

- `browser_bootstrap`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES` default
  `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS`
  default `60`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST` default
  `20`;
- `browser_asset`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS` default
  `300`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW` default
  `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST` default
  `80`;
- `public_misc`:
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES` default `0`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS` default
  `30`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW` default `1m`,
  `GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST` default `10`.

## Push Delivery Model

The v1 push channel is a gRPC server stream.
Long-polling is intentionally out of scope for the first version.

Expected stream behavior:

1. the client opens `SubscribeEvents`;
2. the gateway applies the full authenticated ingress verification pipeline;
3. the stream is bound to `user_id` and `device_session_id`;
4. the first signed service event is `gateway.server_time` and its
   FlatBuffers payload includes `server_time_ms`;
5. after that bootstrap event, the stream is registered in `PushHub` and
   remains open until client cancellation, server shutdown, queue overflow,
   session revoke for the same `device_session_id`, or a later send failure;
6. internal pub/sub may target all active streams for one `user_id` or only
   one `device_session_id` within that user;
7. the current per-stream in-memory queue capacity is `64` events and
   overflow closes only the affected stream;
8. session revoke closes only streams bound to the same exact
   `device_session_id` and returns gRPC `FAILED_PRECONDITION` with message
   `device session is revoked`.

## Lifecycle and Shutdown

Gateway process shutdown is coordinated across the public REST listener,
authenticated gRPC listener, optional admin listener, internal Redis
subscribers, and telemetry runtime.

`GATEWAY_SHUTDOWN_TIMEOUT` configures the per-component graceful shutdown
budget and defaults to `5s`.
During authenticated gRPC shutdown, the in-memory `PushHub` closes active
streams before gRPC graceful stop, so active `SubscribeEvents` calls terminate
with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

## Recommended Package Layout

The package layout keeps transport, policy, and downstream adapters separate:

- `cmd/gateway`
- `internal/app`
- `internal/config`
- `internal/restapi`
- `internal/grpcapi`
- `authn` *(public: canonical request/response/event signing input shared with external clients and the integration test suite)*
- `internal/session`
- `internal/replay`
- `internal/ratelimit`
- `internal/downstream`
- `internal/push`
- `internal/events`
- `internal/clock`

## Key Interfaces

The gateway should be built around explicit consumer-side interfaces.

### SessionCache

Provides cached session lookup by `device_session_id`.
Returns enough data to verify signatures and identify the authenticated user.
The current production implementation is a process-local read-through cache in
front of a Redis fallback adapter that uses strict JSON records under a
configurable key prefix.

### ReplayStore

Tracks recently seen `request_id` values per device session and rejects replayed
requests inside the accepted freshness window.
The current production adapter is Redis-backed, uses a dedicated configurable
key prefix, and reserves keys with a TTL derived from
`timestamp_ms + freshness_window - now`.

### RateLimiter

Applies independent policies for:

- public REST route classes;
- authenticated gRPC requests by IP;
- authenticated gRPC requests by session;
- authenticated gRPC requests by user;
- authenticated gRPC requests by message class.

The current rate limiter is process-local and in-memory.
Public REST keys stay under the `public_rest/...` namespace, while
authenticated gRPC keys stay under `authenticated_grpc/...`, so both traffic
surfaces keep independent buckets even when they share the same limiter
backend.

### PublicTrafficClassifier

Maps incoming public REST requests to one of the public route classes so that
limits and anti-abuse counters remain isolated.
The gateway normalizes any unsupported or empty classifier output to
`public_misc`, and public policy code derives the base bucket namespace from
the normalized class as `public_rest/class=<class>`.

### AuthServiceClient

Handles public auth commands and session-related updates exchanged with the
Auth / Session Service.
The gateway contract is:

- `SendEmailCode(email) -> challenge_id`
- `ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id`

When no concrete implementation is wired, the gateway keeps the public routes
available and returns a stable `503 service_unavailable` response instead of
failing process startup.

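The contract and the "no implementation wired" fallback could look roughly like this in Go. The parameter types (for example `[]byte` for `client_public_key`), the sentinel `ErrServiceUnavailable`, and the helpers `orUnavailable` and `httpStatus` are assumptions made for this sketch and do not describe the gateway's real signatures.

```go
package main

import (
	"context"
	"errors"
	"net/http"
)

// ErrServiceUnavailable is a hypothetical sentinel for a missing dependency.
var ErrServiceUnavailable = errors.New("service_unavailable")

// AuthServiceClient mirrors the two public auth commands of the contract.
type AuthServiceClient interface {
	SendEmailCode(ctx context.Context, email string) (challengeID string, err error)
	ConfirmEmailCode(ctx context.Context, challengeID, code string,
		clientPublicKey []byte, timeZone string) (deviceSessionID string, err error)
}

// unavailableAuthClient keeps routes mounted but fails every call.
type unavailableAuthClient struct{}

func (unavailableAuthClient) SendEmailCode(context.Context, string) (string, error) {
	return "", ErrServiceUnavailable
}

func (unavailableAuthClient) ConfirmEmailCode(context.Context, string, string, []byte, string) (string, error) {
	return "", ErrServiceUnavailable
}

// orUnavailable substitutes the stub when no implementation is wired, so
// startup succeeds and callers receive a stable error instead.
func orUnavailable(c AuthServiceClient) AuthServiceClient {
	if c == nil {
		return unavailableAuthClient{}
	}
	return c
}

// httpStatus maps the sentinel to 503; other errors fall back to 500.
func httpStatus(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrServiceUnavailable):
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}
```
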
### DownstreamRouter

Resolves the target downstream service or adapter by the full exact-match
`message_type` literal.

The default `cmd/gateway` wiring resolves the reserved `user.*` and
`lobby.*` self-service message types through `backendclient.UserRoutes`
and `backendclient.LobbyRoutes`. When `GATEWAY_BACKEND_HTTP_URL` is
unset these routes stay mounted and fail closed as
dependency-unavailable instead of falling through to a generic route
miss.

### DownstreamClient

Executes a verified authenticated command against a downstream internal service
and returns response payload bytes plus a stable opaque result code.
An empty or whitespace-only result code is treated as an internal downstream
contract violation.

Downstream clients may be pure pass-through adapters or gateway-owned
transcoding adapters. The `backendclient` adapter decodes
authenticated FlatBuffers payloads, calls backend's `/api/v1/user/*`
REST surface with `X-User-Id`, and re-encodes the JSON result into
FlatBuffers before the signed gateway response is emitted.

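The routing and result-code rules from the two sections above can be sketched together. Everything named here (`Router`, `Handler`, `Execute`, the error sentinels, and the message type `user.profile.get`) is hypothetical; the sketch only demonstrates exact-match lookup, a mounted route that fails closed, and rejection of blank result codes.

```go
package main

import (
	"errors"
	"strings"
)

var (
	ErrRouteNotFound         = errors.New("route not found")
	ErrDependencyUnavailable = errors.New("dependency unavailable")
	ErrContractViolation     = errors.New("downstream contract violation")
)

// Handler executes one verified command and returns payload plus result code.
type Handler func(payload []byte) (response []byte, resultCode string, err error)

// Router resolves handlers by the full message_type literal only.
type Router struct {
	routes map[string]Handler
}

// Resolve performs an exact-match lookup; prefixes never match.
func (r Router) Resolve(messageType string) (Handler, error) {
	h, ok := r.routes[messageType]
	if !ok {
		return nil, ErrRouteNotFound
	}
	return h, nil
}

// unavailable is mounted when the backing dependency is not configured, so
// the route fails closed instead of looking like a route miss.
func unavailable([]byte) ([]byte, string, error) {
	return nil, "", ErrDependencyUnavailable
}

// Execute runs the handler and rejects empty or whitespace-only result codes.
func Execute(h Handler, payload []byte) ([]byte, string, error) {
	resp, code, err := h(payload)
	if err != nil {
		return nil, "", err
	}
	if strings.TrimSpace(code) == "" {
		return nil, "", ErrContractViolation
	}
	return resp, code, nil
}
```
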
### EventSubscriber

Subscribes to internal pub/sub topics used for:

- session cache updates;
- revocations;
- client-facing event delivery.

The implementation consumes two Redis Streams with replica-safe plain
`XREAD`: one strict full-session snapshot stream for the process-local session
cache and one client-facing event stream for live push fan-out.

### PushHub

Tracks active `SubscribeEvents` streams, binds them to authenticated identities,
and delivers events to the correct connections.
The implementation uses one bounded in-memory queue per stream with a
default capacity of `64` events; overflowing one queue closes only that stream
and leaves the remaining streams active.

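The overflow rule can be illustrated with buffered channels. This is a simplified sketch, not the actual `PushHub`: the `Hub` and `Event` types are hypothetical, streams are keyed by an opaque stream ID rather than by `user_id` and `device_session_id`, and closing the channel stands in for terminating the gRPC stream.

```go
package main

import "sync"

// Event is a hypothetical stand-in for a signed client-facing event.
type Event struct{ Type string }

// Hub tracks one bounded queue per registered stream.
type Hub struct {
	mu       sync.Mutex
	capacity int
	streams  map[string]chan Event // keyed by a hypothetical stream ID
}

// NewHub creates a hub whose queues hold up to capacity events each.
func NewHub(capacity int) *Hub {
	return &Hub{capacity: capacity, streams: make(map[string]chan Event)}
}

// Register creates the queue for a stream and returns its receive side.
func (h *Hub) Register(streamID string) <-chan Event {
	h.mu.Lock()
	defer h.mu.Unlock()
	q := make(chan Event, h.capacity)
	h.streams[streamID] = q
	return q
}

// Publish enqueues without blocking. When the queue is full it closes and
// unregisters only that stream and reports false; other streams are untouched.
func (h *Hub) Publish(streamID string, ev Event) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	q, ok := h.streams[streamID]
	if !ok {
		return false
	}
	select {
	case q <- ev:
		return true
	default:
		close(q)
		delete(h.streams, streamID)
		return false
	}
}
```

A non-blocking send keeps one slow consumer from stalling fan-out to every other stream, which is the property the bounded queue is meant to protect.
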
### ResponseSigner

Signs unary responses and stream events so clients can verify server-originated
messages.
The implementation uses one Ed25519 signer loaded from
`GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`, which must reference a PKCS#8
PEM-encoded private key.

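Parsing such a key with the Go standard library can be sketched as follows. The function names `parseSignerKey` and `signResponse` are hypothetical, and the sketch signs raw bytes directly; the canonical signing input defined by the `authn` package is not reproduced here.

```go
package main

import (
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"errors"
)

// parseSignerKey decodes a PKCS#8 PEM block and requires an Ed25519 key.
func parseSignerKey(pemBytes []byte) (ed25519.PrivateKey, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil || block.Type != "PRIVATE KEY" {
		return nil, errors.New(`expected a PKCS#8 "PRIVATE KEY" PEM block`)
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, err
	}
	// ParsePKCS8PrivateKey returns ed25519.PrivateKey by value for Ed25519.
	priv, ok := key.(ed25519.PrivateKey)
	if !ok {
		return nil, errors.New("private key is not Ed25519")
	}
	return priv, nil
}

// signResponse signs already-canonicalized bytes with the loaded key.
func signResponse(key ed25519.PrivateKey, canonical []byte) []byte {
	return ed25519.Sign(key, canonical)
}
```

In a real process the PEM bytes would come from reading the file at `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH`; rejecting non-Ed25519 keys at load time surfaces misconfiguration at startup rather than on the first signed response.
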
### Clock

Provides current server time and supports consistent freshness-window checks.

## Error Model and Observability

The gateway should expose stable edge-level error classes instead of leaking
internal implementation details.

Minimum error categories:

- malformed request;
- request too large;
- unsupported protocol;
- unknown session;
- revoked session;
- invalid signature;
- stale request;
- replay detected;
- rate limited;
- policy denied;
- downstream unavailable;
- backend unavailable;
- gateway shutting down;
- internal error.

Observability requirements:

- stable correlation identifiers, including `request_id` and optional `trace_id`;
- structured logs;
- security audit events for rejects and abuse signals;
- metrics keyed by route class, message type, result code, and reject reason;
- no logging of secrets, raw private material, or raw signatures.

The service uses:

- `go.uber.org/zap` for structured JSON logs;
- `otelgin` for the public REST listener;
- `otelgrpc` for the authenticated gRPC listener;
- OpenTelemetry metrics exported through Prometheus on the optional admin
  `/metrics` listener.

Current custom metric families:

- `gateway.public_http.requests`
- `gateway.public_http.duration`
- `gateway.authenticated_grpc.requests`
- `gateway.authenticated_grpc.duration`
- `gateway.push.active_streams`
- `gateway.push.stream_closures`
- `gateway.internal_event_drops`

The process-wide log level is configured by `GATEWAY_LOG_LEVEL` and
defaults to `info`.
The default OpenTelemetry resource uses `service.name=galaxy-edge-gateway`
when `OTEL_SERVICE_NAME` is unset.
If `OTEL_TRACES_EXPORTER` is unset or set to `none`, the gateway keeps tracing
runtime enabled but installs no external trace exporter.
If `OTEL_TRACES_EXPORTER=otlp`, the gateway uses the standard
`OTEL_EXPORTER_OTLP_*` environment variables to configure the OTLP trace
exporter protocol and endpoint.
The protocol selection specifically honors
`OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` first and falls back to
`OTEL_EXPORTER_OTLP_PROTOCOL` when the trace-specific variable is unset.
Supported values are `http/protobuf` and `grpc`; when both variables are
unset, the gateway defaults to `http/protobuf`.

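The protocol precedence described above reduces to a small lookup. In this sketch the function name `otlpTracesProtocol` is hypothetical, an empty value is treated the same as an unset variable, and returning an error for unsupported values is an assumption, since this document does not say how the gateway reacts to them.

```go
package main

import "fmt"

// otlpTracesProtocol resolves the OTLP trace exporter protocol. The getenv
// parameter allows os.Getenv in production and a map lookup in tests.
func otlpTracesProtocol(getenv func(string) string) (string, error) {
	// The trace-specific variable wins over the generic one.
	p := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if p == "" {
		p = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	switch p {
	case "":
		return "http/protobuf", nil // default when both are unset
	case "http/protobuf", "grpc":
		return p, nil
	default:
		// Assumption: unsupported values are rejected rather than ignored.
		return "", fmt.Errorf("unsupported OTLP traces protocol %q", p)
	}
}
```
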
Structured logs intentionally omit:

- public auth e-mail addresses, login codes, and challenge IDs;
- client public keys;
- raw payload bytes and payload hashes;
- raw request or response signatures;
- response-signer private key material and Redis credentials.

Malformed internal session and client-event stream entries are no longer
silently dropped: the gateway logs the drop and increments
`gateway.internal_event_drops`.

## Non-Goals

The gateway is not a business authorization layer and must not grow into a
domain coordinator.

The gateway must not:

- implement business ownership checks;
- validate domain state transitions;
- replace the Auth / Session Service as the session source of truth;
- degrade into a synchronous pass-through that reloads session state for every
  authenticated request.