Edge Gateway
Run and Dependencies
cmd/gateway starts with built-in listener defaults, but it still requires:
- one reachable Redis deployment used exclusively for anti-replay reservations (no session projection, no event streams);
- one reachable backend instance hosting the consolidated REST surface (/api/v1/{public,user,internal}/*) and the Push.SubscribePush gRPC listener;
- one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH.
Required startup environment variables:
- GATEWAY_REDIS_MASTER_ADDR
- GATEWAY_REDIS_PASSWORD
- GATEWAY_BACKEND_HTTP_URL
- GATEWAY_BACKEND_GRPC_PUSH_URL
- GATEWAY_BACKEND_GATEWAY_CLIENT_ID
- GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH
Optional integrations:
- GATEWAY_ADMIN_HTTP_ADDR enables the private /metrics listener;
- GATEWAY_BACKEND_HTTP_TIMEOUT, GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF, and GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF tune the backend client.
Operational caveats:
- the gateway issues one synchronous /api/v1/internal/sessions/{id} lookup per authenticated request; there is no process-local cache, and backend keeps the source-of-truth record;
- the gRPC SubscribePush consumer reconnects with exponential backoff and jitter on every backend restart and resumes from the last cursor it observed.
Additional module docs:
- Public REST contract
- Documentation index
- Runtime and components
- Request and push flows
- Operator runbook
- Configuration and contract examples
- Example .env
Purpose
Edge Gateway is the only public ingress for Galaxy Plus clients.
It terminates the external transport and security boundary, enforces edge
policies, and routes verified requests to internal services.
The gateway does not implement domain-specific business logic. Business validation, authorization, ownership checks, and state transitions remain inside downstream services.
Trust Boundary
The gateway sits between untrusted external clients and trusted internal services.
The gateway is responsible for:
- parsing external transport requests;
- classifying public REST traffic;
- authenticating protected gRPC traffic;
- loading session state from cache;
- verifying request freshness and anti-replay constraints;
- applying edge rate limits and anti-abuse policy;
- building an authenticated internal command context;
- routing verified commands to internal services;
- maintaining authenticated push delivery connections.
The gateway is not responsible for:
- deciding whether a user is allowed to execute a business action;
- validating domain invariants;
- storing the source-of-truth session record;
- implementing business idempotency.
Transport Matrix
The gateway exposes two external transport classes.
| Transport | Audience | Authentication | Payload format | Primary use |
|---|---|---|---|---|
| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Health checks, public auth commands, and browser/bootstrap traffic |
| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |
Public REST Surface
The public REST surface is used for commands that must work before a device session exists and for browser-originated traffic that may share the same edge. It covers the probe endpoints, public auth routes, and coarse public anti-abuse.
Currently implemented public endpoints:
- GET /healthz
- GET /readyz
- POST /api/v1/public/auth/send-email-code
- POST /api/v1/public/auth/confirm-email-code
The implemented REST contract is documented in openapi.yaml.
The listener address is configured by GATEWAY_PUBLIC_HTTP_ADDR.
The public REST listener read budgets are configured by:
- GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT with default 2s;
- GATEWAY_PUBLIC_HTTP_READ_TIMEOUT with default 10s;
- GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT with default 1m.
The public auth JSON contract uses a challenge-token flow:
- send-email-code accepts email and returns challenge_id;
- confirm-email-code accepts challenge_id, code, client_public_key, and time_zone, then returns device_session_id.
The send-email-code JSON body carries only email, but the gateway may also
consume the standard Accept-Language header on that route. The gateway resolves
the first supported BCP 47 language tag, falls back to en when none is
supported, and forwards that derived preferred-language candidate to
Auth / Session Service for localized auth mail and possible first-user
creation. The header does not add any field to the public JSON DTO.
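This language resolution can be sketched with the standard library alone. The sketch honours q-values, keeps header order among equal weights, and matches a regional tag such as de-AT to its primary subtag. The supported-language set, the q-value handling, and the subtag fallback are all assumptions. The document specifies only "first supported tag, else en".

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// supportedLanguages is a hypothetical set; the real list is not documented here.
var supportedLanguages = map[string]bool{"en": true, "de": true, "fr": true}

// preferredLanguage returns the first supported tag from an Accept-Language
// header, or "en" when nothing matches.
func preferredLanguage(header string) string {
	type candidate struct {
		tag string
		q   float64
	}
	var cands []candidate
	for _, part := range strings.Split(header, ",") {
		fields := strings.Split(strings.TrimSpace(part), ";")
		tag := strings.ToLower(strings.TrimSpace(fields[0]))
		if tag == "" || tag == "*" {
			continue
		}
		q := 1.0
		for _, p := range fields[1:] {
			p = strings.TrimSpace(p)
			if strings.HasPrefix(p, "q=") {
				if v, err := strconv.ParseFloat(p[2:], 64); err == nil {
					q = v
				}
			}
		}
		if q > 0 { // q=0 means "not acceptable"
			cands = append(cands, candidate{tag, q})
		}
	}
	// Stable sort keeps header order among equal weights.
	sort.SliceStable(cands, func(i, j int) bool { return cands[i].q > cands[j].q })
	for _, c := range cands {
		if supportedLanguages[c.tag] {
			return c.tag
		}
		// Assumed fallback: match on the primary subtag ("de-at" -> "de").
		if base, _, _ := strings.Cut(c.tag, "-"); supportedLanguages[base] {
			return base
		}
	}
	return "en"
}
```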
client_public_key is the standard base64-encoded raw 32-byte Ed25519 public
key for the device session being created.
time_zone is the client-selected IANA time zone name forwarded unchanged to
Auth / Session Service.
On the create path, the source of truth for preferred_language is currently
the language candidate derived from the public Accept-Language header, with
fallback to en; the public confirm-email-code DTO carries no preferred_language
field.
These routes remain unauthenticated and delegate only through an injected
AuthServiceClient.
The default wiring used by cmd/gateway keeps the routes mounted and returns
503 service_unavailable until a concrete upstream auth adapter is supplied.
Public auth adapter calls are wrapped in
GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT, which defaults to 3s.
When that timeout expires, the gateway preserves the public REST contract and
returns 503 service_unavailable.
When an injected auth adapter returns *AuthServiceError, the gateway projects
that client-safe 4xx/5xx status, code, and message back to the caller
after normalizing blank or invalid fields. Unexpected non-AuthServiceError
adapter failures fail closed as 500 internal_error.
Public anti-abuse is process-local and in-memory.
Per-IP buckets are derived only from the TCP peer RemoteAddr.
Forwarded proxy headers such as X-Forwarded-For and Forwarded are
intentionally ignored.
Oversized public REST bodies are rejected with 413 request_too_large.
Rate-limited requests are rejected with 429 rate_limited and a
Retry-After header.
In addition to the fixed endpoints above, the gateway may front browser bootstrap or asset traffic through a pluggable public handler or proxy. That traffic belongs to dedicated public route classes and must not share rate limit buckets or abuse counters with the public auth API.
Operational Admin Surface
The gateway may expose one private operational HTTP listener used for metrics.
The admin listener is disabled by default and is enabled only when
GATEWAY_ADMIN_HTTP_ADDR is non-empty.
When enabled, it serves:
GET /metrics
The admin listener read budgets are configured by:
- GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT with default 2s;
- GATEWAY_ADMIN_HTTP_READ_TIMEOUT with default 10s;
- GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT with default 1m.
/metrics is intentionally not mounted on the public REST ingress.
It is also intentionally excluded from openapi.yaml, because
that specification covers only the public REST ingress.
The endpoint exposes metrics in the Prometheus text exposition format described
in the official Prometheus documentation:
https://prometheus.io/docs/instrumenting/exposition_formats/.
Authenticated gRPC Surface
All authenticated client requests use HTTP/2 and gRPC.
The listener address is configured by GATEWAY_AUTHENTICATED_GRPC_ADDR.
Inbound authenticated gRPC connection setup is bounded by
GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT, which defaults to 5s.
The accepted client timestamp skew is configured by
GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW and defaults to 5m.
The public gRPC service exposes two methods:
- ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)
- SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)
ExecuteCommand is a generic unary RPC.
The gateway routes the request downstream by message_type after transport
verification succeeds.
Downstream unary execution is bounded by
GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT, which defaults to 5s.
When that timeout expires, the gateway preserves the authenticated gRPC
contract and returns gRPC UNAVAILABLE with message
downstream service is unavailable.
SubscribeEvents is an authenticated server-streaming RPC.
It binds the stream to user_id and device_session_id and starts by sending
a signed service event that includes the current server time in milliseconds.
The v1 protobuf contract lives in
proto/galaxy/gateway/v1/edge_gateway.proto under package
galaxy.gateway.v1 and service EdgeGateway.
Generated Go bindings are committed under proto/galaxy/gateway/v1/ and are
regenerated with:
buf generate
The gateway validates the request envelope, device-session
cache lookup, payload_hash, the client Ed25519 signature, timestamp
freshness, replay reservation, authenticated rate limits, and the
authenticated policy hook before any later routing or push step runs.
Malformed envelopes are rejected with gRPC INVALID_ARGUMENT.
Requests with a non-empty but unsupported protocol_version are rejected with
gRPC FAILED_PRECONDITION.
The supported request protocol_version literal is v1.
Requests with an unknown device_session_id are rejected with gRPC
UNAUTHENTICATED.
Requests for revoked sessions are rejected with gRPC FAILED_PRECONDITION.
SessionCache backend failures, including Redis lookup or record-decode
failures, are rejected with gRPC UNAVAILABLE.
Requests with a payload_hash that is not a 32-byte SHA-256 digest or does
not match payload_bytes are rejected with gRPC INVALID_ARGUMENT.
Requests with an invalid client signature or a signature created by a
different key are rejected with gRPC UNAUTHENTICATED and message
invalid request signature.
Requests with malformed cached client_public_key material fail closed as
gRPC UNAVAILABLE.
Requests with a timestamp_ms outside the symmetric freshness window around
current server time are rejected with gRPC FAILED_PRECONDITION and message
request timestamp is outside the freshness window.
Requests that reuse the same request_id for the same device_session_id
inside the active replay window are rejected with gRPC
FAILED_PRECONDITION and message request replay detected.
ReplayStore backend failures fail closed with gRPC UNAVAILABLE and message
replay store is unavailable.
Authenticated rate limits are enforced independently by transport peer IP,
authenticated device_session_id, authenticated user_id, and authenticated
message class. The gateway uses the full verified message_type literal as the
stable v1 message-class key because the transport does not yet define a
coarser authenticated class taxonomy. The peer IP is derived only from the
gRPC transport peer address; if it is missing or cannot be parsed, the
request falls back to the stable unknown IP bucket.
Requests that exceed any authenticated rate-limit bucket are rejected with
gRPC RESOURCE_EXHAUSTED and message
authenticated request rate limit exceeded.
The authenticated edge policy hook runs after those rate limits and defaults
to allow-all until a concrete policy evaluator is wired into the process.
ExecuteCommand builds an internal authenticated command context,
resolves one exact-match downstream route by the full verified message_type
literal, executes the downstream unary client, and signs the response before
it is returned to the caller. When no exact downstream route is registered,
ExecuteCommand is rejected with gRPC UNIMPLEMENTED and message
message_type is not routed. Downstream availability failures are rejected
with gRPC UNAVAILABLE and message downstream service is unavailable.
Unexpected downstream route-resolution or execution failures are rejected with
gRPC INTERNAL. Successful unary responses preserve the original
request_id, carry a SHA-256 payload_hash of the returned payload_bytes,
and are signed with the configured server Ed25519 response signer.
The default cmd/gateway wiring currently installs an empty static
downstream router, so verified ExecuteCommand requests still return gRPC
UNIMPLEMENTED until concrete downstream routes are injected.
SubscribeEvents applies the full authenticated ingress pipeline, binds
the stream to the verified user_id and device_session_id, sends one
signed gateway.server_time bootstrap event whose FlatBuffers payload carries
server_time_ms, registers the active stream in the in-memory PushHub, and
then forwards signed client-facing events consumed from the configured client
event Redis stream. User-targeted events fan out to every active stream for
that user. Session-targeted events fan out only to streams whose
user_id and device_session_id both match the event target. Each active
stream uses a bounded in-memory queue; when that queue overflows, only the
affected stream is closed with gRPC RESOURCE_EXHAUSTED and message
push stream overflowed. When the session lifecycle stream reports that the
same device_session_id was revoked, every active SubscribeEvents stream
bound to that exact session is closed with gRPC FAILED_PRECONDITION and
message device session is revoked. During gateway shutdown, the in-memory
push hub is closed before gRPC graceful stop, and every active
SubscribeEvents stream is terminated with gRPC UNAVAILABLE and message
gateway is shutting down.
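The fan-out, overflow and revoke rules can be sketched with buffered channels guarded by one mutex. Hub, stream, and Event are hypothetical names. The real PushHub also signs events, reports gRPC status on close, and removes finished streams; this sketch does none of those things. The documented per-stream capacity corresponds to NewHub(64).

```go
package main

import "sync"

// Event is a hypothetical, already-signed client-facing event.
type Event struct {
	UserID          string
	DeviceSessionID string // empty means user-targeted
	EventType       string
}

type stream struct {
	userID    string
	sessionID string
	queue     chan Event
	closed    bool
	reason    string // message the stream would be closed with
}

// Hub keeps one bounded queue per active stream.
type Hub struct {
	mu       sync.Mutex
	capacity int
	streams  []*stream
}

func NewHub(capacity int) *Hub { return &Hub{capacity: capacity} }

func (h *Hub) Register(userID, sessionID string) *stream {
	h.mu.Lock()
	defer h.mu.Unlock()
	s := &stream{userID: userID, sessionID: sessionID, queue: make(chan Event, h.capacity)}
	h.streams = append(h.streams, s)
	return s
}

// closeLocked closes one stream once; the caller holds h.mu.
func (h *Hub) closeLocked(s *stream, reason string) {
	if !s.closed {
		s.closed, s.reason = true, reason
		close(s.queue)
	}
}

// Publish: user-targeted events reach every stream of that user; session-
// targeted events need user_id and device_session_id to match. A full queue
// closes only the affected stream.
func (h *Hub) Publish(ev Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, s := range h.streams {
		if s.closed || s.userID != ev.UserID {
			continue
		}
		if ev.DeviceSessionID != "" && s.sessionID != ev.DeviceSessionID {
			continue
		}
		select {
		case s.queue <- ev:
		default:
			h.closeLocked(s, "push stream overflowed")
		}
	}
}

// RevokeSession closes every stream bound to that exact session.
func (h *Hub) RevokeSession(sessionID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, s := range h.streams {
		if s.sessionID == sessionID {
			h.closeLocked(s, "device session is revoked")
		}
	}
}
```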
Authenticated anti-abuse budgets are configured by the
GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_* environment variables.
Current authenticated gRPC defaults:
- per-IP: 120 requests / minute, burst=40;
- per-session: 60 requests / minute, burst=20;
- per-user: 120 requests / minute, burst=40;
- per-message-class: 60 requests / minute, burst=20.
Authenticated anti-abuse configuration surface:
- per-IP: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS default 120, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST default 40;
- per-session: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS default 60, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST default 20;
- per-user: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS default 120, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST default 40;
- per-message-class: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS default 60, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST default 20.
Envelope and Payload Model
The authenticated transport uses a split contract:
- gRPC control messages are protobuf-based;
- business payload bytes are FlatBuffers;
- signatures are computed over canonical envelope fields and a hash of raw FlatBuffers bytes.
The gateway verifies authenticated payload bytes before any downstream call. Most downstream routes may still treat those bytes as opaque, but the gateway is also allowed to transcode verified FlatBuffers payloads into trusted downstream REST/JSON calls when the concrete downstream contract requires it.
The current direct Gateway -> User self-service boundary uses that pattern:
- external message types: user.account.get, user.profile.update, user.settings.update;
- external payloads and responses: FlatBuffers;
- internal downstream transport: strict REST/JSON to User Service;
- business error projection:
  - gateway result_code;
  - FlatBuffers error payload mirroring User Service code and message.
The request envelope version literal is v1.
payload_hash is the raw 32-byte SHA-256 digest of payload_bytes.
ExecuteCommand hashes the raw FlatBuffers payload bytes exactly as sent,
while SubscribeEvents with an empty payload still requires
sha256([]byte{}) rather than a special-case value.
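Because the empty payload is hashed like any other payload, a client that opens SubscribeEvents without payload_bytes sends the well-known SHA-256 digest of zero bytes. The following is a minimal sketch; the helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// emptyPayloadHash is the raw 32-byte SHA-256 of an empty byte slice.
func emptyPayloadHash() []byte {
	sum := sha256.Sum256([]byte{})
	return sum[:]
}

// emptyPayloadHashHex renders the digest for logs only; the wire field stays raw bytes.
func emptyPayloadHashHex() string { return hex.EncodeToString(emptyPayloadHash()) }
```

The hex form is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.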
The v1 request signature scheme is Ed25519.
client_public_key is the standard base64-encoded raw 32-byte Ed25519 public
key registered during confirm-email-code.
signature carries the raw 64-byte Ed25519 signature computed over the
canonical request signing input.
The v1 stream bootstrap payload uses the shared FlatBuffers schema
pkg/schema/fbs/gateway.fbs with root table gateway.ServerTimeEvent.
ExecuteCommandRequest
Required fields:
- protocol_version
- device_session_id
- message_type
- timestamp_ms
- request_id
- payload_bytes
- payload_hash
- signature
Optional fields:
trace_id
ExecuteCommandResponse
Required fields:
- protocol_version
- request_id
- timestamp_ms
- result_code
- payload_bytes
- payload_hash
- signature
The v1 unary response signature scheme is Ed25519 with response
domain marker galaxy-response-v1.
The response signing input uses the same canonical binary encoding shape as
the request signer:
- each string and bytes field is encoded as uvarint(len(field_bytes)) followed by the raw bytes;
- timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is galaxy-response-v1, protocol_version, request_id, timestamp_ms, result_code, payload_hash.
cmd/gateway loads the unary response signer from
GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH, which must point to a PKCS#8
PEM-encoded Ed25519 private key. Startup fails when the file is absent,
unreadable, not strict PEM, not PKCS#8, or not Ed25519.
SubscribeEventsRequest
The stream open request reuses the authenticated request model. It contains the same authentication fields as the unary request and either an empty payload or a minimal connect payload.
Required fields:
- protocol_version
- device_session_id
- message_type
- timestamp_ms
- request_id
- payload_hash
- signature
Optional fields:
- payload_bytes
- trace_id
GatewayEvent
Every stream event is a client-facing signed server message.
Required fields:
- event_type
- event_id
- timestamp_ms
- payload_bytes
- payload_hash
- signature
Optional fields:
- request_id
- trace_id
The v1 stream-event signature scheme is Ed25519 with event domain
marker galaxy-event-v1.
The event signing input uses the same canonical binary encoding shape as the
request and unary response signers:
- each string and bytes field is encoded as uvarint(len(field_bytes)) followed by the raw bytes;
- timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
- the signed field order is galaxy-event-v1, event_type, event_id, timestamp_ms, request_id, trace_id, payload_hash.
The bootstrap event uses:
- event_type = "gateway.server_time";
- event_id = request_id from the opening SubscribeEvents request;
- payload_bytes encoded as FlatBuffers gateway.ServerTimeEvent with server_time_ms;
- the same loaded Ed25519 signer configured by GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH.
Client-facing fan-out events are sourced from the internal client
event stream. Internal publishers provide the event target and business
payload only: user_id, optional device_session_id, event_type,
event_id, payload_bytes, and optional request_id / trace_id. The
gateway derives timestamp_ms, recomputes payload_hash, signs the event,
and only then forwards it to the matching SubscribeEvents streams.
Notification-owned user-facing payloads are expected to use
pkg/schema/fbs/notification.fbs. The initial notification event vocabulary
in v1 is exactly:
- game.turn.ready
- game.finished
- lobby.application.submitted
- lobby.membership.approved
- lobby.membership.rejected
- lobby.membership.blocked
- lobby.invite.created
- lobby.invite.redeemed
- lobby.race_name.registration_eligible
- lobby.race_name.registered
lobby.application.submitted is published toward Gateway only for the
private-game owner flow. The public-game variant is email-only.
The real Notification Service -> Gateway integration suite verifies this
user-targeted fan-out path and asserts that notification-owned push events do
not include device_session_id, so Gateway delivers them to every active
stream for the target user. Auth-code email does not use this push path and
continues to bypass Notification Service.
Verification and Routing Pipeline
The gateway applies the following strict verification order to all authenticated gRPC ingress:
1. Parse the control envelope and validate required fields.
2. Check whether protocol_version is supported.
3. Resolve device_session_id through SessionCache.
4. Reject unknown or revoked sessions.
5. Verify that payload_hash matches raw payload_bytes.
6. Verify the client signature using the public key from session cache.
7. Verify that timestamp_ms is inside the accepted freshness window.
8. Verify anti-replay by checking device_session_id + request_id.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by message_type.
No downstream business service should receive a request that has not passed this full verification pipeline.
ExecuteCommand enforces steps 1 through 11 and
signs the successful unary response afterward. SubscribeEvents enforces
steps 1 through 9, binds the verified stream identity, sends the initial
signed server-time bootstrap event, and then keeps the stream open for push
delivery.
Malformed envelopes fail with gRPC INVALID_ARGUMENT.
Unsupported non-empty protocol_version values fail with gRPC
FAILED_PRECONDITION.
Unknown sessions fail with gRPC UNAUTHENTICATED.
Revoked sessions fail with gRPC FAILED_PRECONDITION.
SessionCache backend failures fail with gRPC UNAVAILABLE.
payload_hash values that are not raw 32-byte SHA-256 digests fail with gRPC
INVALID_ARGUMENT and message payload_hash must be a 32-byte SHA-256 digest.
payload_hash values that do not match payload_bytes fail with gRPC
INVALID_ARGUMENT and message payload_hash does not match payload_bytes.
Invalid request signatures fail with gRPC UNAUTHENTICATED and message
invalid request signature.
Malformed cached client_public_key values fail closed with gRPC
UNAVAILABLE and message session cache is unavailable.
Requests with a timestamp_ms outside the accepted freshness window fail with
gRPC FAILED_PRECONDITION and message
request timestamp is outside the freshness window.
Requests that reuse the same request_id for the same device_session_id
inside the active replay window fail with gRPC FAILED_PRECONDITION and
message request replay detected.
ReplayStore backend failures fail with gRPC UNAVAILABLE and message
replay store is unavailable.
Unrouted exact-match message_type values fail with gRPC UNIMPLEMENTED and
message message_type is not routed.
Downstream availability failures fail with gRPC UNAVAILABLE and message
downstream service is unavailable.
Internal Authenticated Contract
Downstream services should receive an internal authenticated command rather than raw external gRPC transport data.
The minimum authenticated context is:
- user_id
- device_session_id
- message_type
- verified payload_bytes
- request_id
- optional trace_id
- optional client metadata needed for logs and tracing
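One possible Go shape for that minimum context is shown below, with validation limited to the required identity fields. AuthenticatedCommand and Validate are hypothetical; the document does not define the internal type.

```go
package main

import "errors"

// AuthenticatedCommand mirrors the minimum authenticated context.
type AuthenticatedCommand struct {
	UserID          string
	DeviceSessionID string
	MessageType     string
	PayloadBytes    []byte // already verified against payload_hash
	RequestID       string
	TraceID         string            // optional
	ClientMetadata  map[string]string // optional, for logs and tracing only
}

// Validate checks identity fields only; business authorization and domain
// validation stay with the downstream service.
func (c AuthenticatedCommand) Validate() error {
	switch {
	case c.UserID == "":
		return errors.New("user_id is required")
	case c.DeviceSessionID == "":
		return errors.New("device_session_id is required")
	case c.MessageType == "":
		return errors.New("message_type is required")
	case c.RequestID == "":
		return errors.New("request_id is required")
	}
	return nil
}
```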
Downstream services may trust that the gateway has already performed transport authentication, freshness verification, and anti-replay checks. They must still perform business authorization and domain validation.
Session Model
The Auth / Session Service is the source of truth for device session state. The gateway is designed to authenticate the hot path from cache.
Expected session fields available to the gateway:
- device_session_id
- user_id
- base64-encoded raw 32-byte Ed25519 client public key
- session status
- revoke metadata
- optional client metadata
Session Cache
SessionCache provides the fast path for:
- session existence checks;
- device_session_id -> user_id;
- access to the base64-encoded raw Ed25519 client public key used for signature verification;
- revoked versus active status checks.
Cache updates are event-driven. TTL is allowed only as a safety net and must not replace invalidation events.
The gateway keeps a process-local in-memory snapshot cache in front of the Redis fallback backend. Authenticated requests read the local snapshot first. A local miss performs one bounded Redis lookup and seeds the local snapshot so later requests for the same session avoid another Redis round-trip unless a later session event changes the cached state.
The local snapshot cache intentionally has no TTL and no size-based eviction policy. Session lifecycle events are the authoritative mechanism for keeping the hot path current, while Redis fallback remains the safety net for cold misses and process restarts.
The Redis fallback implementation uses go-redis/v9. cmd/gateway opens one
shared *redis.Client via pkg/redisconn (instrumented with OpenTelemetry
tracing and metrics), issues a single bounded PING on startup, and refuses
to start when Redis is misconfigured or unavailable. The session cache,
replay store, session-events subscriber, and client-events subscriber all
use that shared client. See docs/redis-config.md for the rationale behind
the shape and the project-wide rules in
ARCHITECTURE.md §Persistence Backends.
Required Redis connection variables:
- GATEWAY_REDIS_MASTER_ADDR
- GATEWAY_REDIS_PASSWORD
Optional Redis connection variables:
- GATEWAY_REDIS_REPLICA_ADDRS (comma-separated; reserved for future read-routing, currently unused)
- GATEWAY_REDIS_DB with default 0
- GATEWAY_REDIS_OPERATION_TIMEOUT with default 250ms
Removed:
- GATEWAY_SESSION_CACHE_REDIS_ADDR
- GATEWAY_SESSION_CACHE_REDIS_USERNAME
- GATEWAY_SESSION_CACHE_REDIS_PASSWORD
- GATEWAY_SESSION_CACHE_REDIS_DB
- GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED
pkg/redisconn.LoadFromEnv rejects the deprecated GATEWAY_REDIS_TLS_ENABLED and GATEWAY_REDIS_USERNAME variables at startup.
Per-subsystem Redis behavior variables (namespace, timeouts):
- GATEWAY_REPLAY_REDIS_KEY_PREFIX with default gateway:replay:
- GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT with default 250ms
Gateway no longer keeps a session cache projection or the two Redis
Streams (session_events, client_events). Session lookup is a
synchronous REST call to backend, and inbound client / session events
arrive through the gRPC Push.SubscribePush consumer (see the
Backend Client section below). Redis is therefore used only by
the Replay Store.
Backend Client
backendclient is the single gateway → backend adapter:
- RESTClient calls /api/v1/internal/sessions/{id} synchronously per authenticated request, and forwards public auth (/api/v1/public/auth/*) and authenticated user / lobby commands (/api/v1/user/*) with the verified X-User-Id header;
- PushClient consumes Push.SubscribePush and reconnects with exponential backoff plus jitter, replaying the last cursor on every reconnect.
Required startup variables:
- GATEWAY_BACKEND_HTTP_URL: absolute base URL for the backend HTTP listener;
- GATEWAY_BACKEND_GRPC_PUSH_URL: host:port of the backend Push.SubscribePush listener;
- GATEWAY_BACKEND_GATEWAY_CLIENT_ID: durable identity presented to backend so that reconnects replace the previous subscription.
Optional tuning:
- GATEWAY_BACKEND_HTTP_TIMEOUT with default 5s;
- GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF with default 250ms;
- GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF with default 30s.
Replay Store
ReplayStore provides the hot-path anti-replay reservation for:
- duplicate detection by device_session_id + request_id;
- bounded replay protection for the authenticated freshness window.
The ReplayStore uses Redis through go-redis/v9.
cmd/gateway requires the ReplayStore backend during startup, issues a
bounded PING, and refuses to start when Redis is misconfigured or
unavailable.
The ReplayStore reuses the shared GATEWAY_REDIS_* connection settings
and adds two replay-specific environment variables:
- GATEWAY_REPLAY_REDIS_KEY_PREFIX with default gateway:replay:
- GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT with default 250ms
Replay keys use this format:
<key_prefix><base64url(device_session_id)>:<base64url(request_id)>
For each accepted request, the replay reservation TTL is computed as:
timestamp_ms + freshness_window - now
The TTL is clamped to a minimum positive duration so requests accepted exactly on the freshness boundary still reserve their replay key.
Revocation Behavior
When a device session is revoked:
- the Auth / Session Service updates the source of truth;
- it publishes a session update or revoke event;
- the gateway invalidates or updates SessionCache;
- new unary gRPC requests for that session are rejected;
- active SubscribeEvents streams for that exact device_session_id are closed with gRPC FAILED_PRECONDITION and message device session is revoked.
Public Anti-Abuse Model
The public REST layer must distinguish between public auth operations and browser-originated traffic that may burst during a normal first page load.
The gateway uses these public route classes:
- public_auth
- browser_bootstrap
- browser_asset
- public_misc
Any classifier result outside this fixed set is normalized to public_misc
before the class is stored in request context or used for policy derivation.
The canonical base bucket namespace for public REST policy is
public_rest/class=<class>.
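Class normalization and the base bucket namespace can be sketched in a few lines. The constant and function names are hypothetical.

```go
package main

// The fixed set of public route classes.
const (
	classPublicAuth       = "public_auth"
	classBrowserBootstrap = "browser_bootstrap"
	classBrowserAsset     = "browser_asset"
	classPublicMisc       = "public_misc"
)

// normalizeClass maps any classifier output outside the fixed set to public_misc.
func normalizeClass(c string) string {
	switch c {
	case classPublicAuth, classBrowserBootstrap, classBrowserAsset:
		return c
	default:
		return classPublicMisc
	}
}

// baseBucket returns the canonical namespace public_rest/class=<class>.
func baseBucket(classifierResult string) string {
	return "public_rest/class=" + normalizeClass(classifierResult)
}
```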
Public Auth
public_auth is the stable route class for send-email-code and
confirm-email-code.
This class uses stricter limits and abuse scoring because it directly touches
account and session creation flows.
Controls include:
- per-IP and per-identity rate limits;
- request body size limits;
- method allow-lists;
- malformed request counters;
- elevated logging and security telemetry for repeated failures.
Current defaults:
- per-IP: 30 requests / minute, burst=10;
- send-email-code identity buckets: 3 requests / 10 minutes, burst=1, keyed by normalized email;
- confirm-email-code identity buckets: 6 requests / 10 minutes, burst=2, keyed by normalized challenge_id;
- maximum request body size: 8192 bytes;
- only POST is accepted for public auth routes.
Configuration surface:
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES default 8192;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS default 30;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW default 1m;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST default 10;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS default 3;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW default 10m;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST default 1;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS default 6;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW default 10m;
- GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST default 2.
Browser Bootstrap and Asset Traffic
browser_bootstrap and browser_asset use separate coarse-grained budgets.
They may exhibit bursty behavior during the first load and therefore must not
be treated as hostile based on burst pattern alone.
This traffic is still constrained by:
- dedicated rate limits;
- method allow-lists;
- body size limits where request bodies are expected;
- protocol and path validation;
- independent abuse telemetry.
The gateway must not merge these buckets or counters with public_auth.
Current defaults:
- browser_bootstrap: 60 requests / minute, burst=20, GET and HEAD only, and no request body;
- browser_asset: 300 requests / minute, burst=80, GET and HEAD only, and no request body;
- public_misc: 30 requests / minute, burst=10, and no request body.
Configuration surface:
- browser_bootstrap: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS default 60, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST default 20;
- browser_asset: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS default 300, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST default 80;
- public_misc: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS default 30, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST default 10.
Push Delivery Model
The v1 push channel is a gRPC server stream. Long-polling is intentionally out of scope for the first version.
Expected stream behavior:
- the client opens SubscribeEvents;
- the gateway applies the full authenticated ingress verification pipeline;
- the stream is bound to user_id and device_session_id;
- the first signed service event is gateway.server_time, and its FlatBuffers payload includes server_time_ms;
- after that bootstrap event, the stream is registered in PushHub and remains open until client cancellation, server shutdown, queue overflow, session revoke for the same device_session_id, or a later send failure;
- internal pub/sub may target all active streams for one user_id, or only one device_session_id within that user;
- the current per-stream in-memory queue capacity is 64 events, and overflow closes only the affected stream;
- session revoke closes only streams bound to the exact same device_session_id and returns gRPC FAILED_PRECONDITION with the message "device session is revoked".
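A small, hypothetical selection function can illustrate these targeting rules. It assumes each registered stream is described by a streamBinding record, a name invented here. A user-wide event reaches every stream of that user_id. A device-scoped event reaches only the matching device_session_id. A revoke selects only streams bound to that exact device_session_id.

```go
package main

import "fmt"

// streamBinding is a hypothetical record of the identity a stream is bound to.
type streamBinding struct {
	StreamID        int
	UserID          string
	DeviceSessionID string
}

// target selects streams for an internal event. An empty deviceSessionID fans
// out to every stream of the user; otherwise only that device's streams match.
func target(streams []streamBinding, userID, deviceSessionID string) []int {
	var ids []int
	for _, s := range streams {
		if s.UserID != userID {
			continue
		}
		if deviceSessionID != "" && s.DeviceSessionID != deviceSessionID {
			continue
		}
		ids = append(ids, s.StreamID)
	}
	return ids
}

// revokeTargets returns the streams a session revoke closes: only those bound
// to the exact same device_session_id, leaving the user's other sessions open.
func revokeTargets(streams []streamBinding, deviceSessionID string) []int {
	var ids []int
	for _, s := range streams {
		if s.DeviceSessionID == deviceSessionID {
			ids = append(ids, s.StreamID)
		}
	}
	return ids
}

func main() {
	streams := []streamBinding{{1, "u1", "d1"}, {2, "u1", "d2"}, {3, "u2", "d3"}}
	fmt.Println(target(streams, "u1", ""))    // [1 2]
	fmt.Println(target(streams, "u1", "d2"))  // [2]
	fmt.Println(revokeTargets(streams, "d1")) // [1]
}
```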
Lifecycle and Shutdown
Gateway process shutdown is coordinated across the public REST listener, authenticated gRPC listener, optional admin listener, internal Redis subscribers, and telemetry runtime.
GATEWAY_SHUTDOWN_TIMEOUT configures the per-component graceful shutdown
budget and defaults to 5s.
During authenticated gRPC shutdown, the in-memory PushHub closes active
streams before gRPC graceful stop, so active SubscribeEvents calls terminate
with gRPC UNAVAILABLE and message gateway is shutting down.
Recommended Package Layout
The package layout keeps transport, policy, and downstream adapters separate:
- cmd/gateway
- internal/app
- internal/config
- internal/restapi
- internal/grpcapi
- authn (public package: canonical request, response, and event signing input shared with external clients and the integration test suite)
- internal/session
- internal/replay
- internal/ratelimit
- internal/downstream
- internal/push
- internal/events
- internal/clock
Key Interfaces
The gateway should be built around explicit consumer-side interfaces.
SessionCache
Provides cached session lookup by device_session_id.
Returns enough data to verify signatures and identify the authenticated user.
The current production implementation is a process-local read-through cache in
front of a Redis fallback adapter that uses strict JSON records under a
configurable key prefix.
ReplayStore
Tracks recently seen request_id values per device session and rejects replayed
requests inside the accepted freshness window.
The current production adapter is Redis-backed, uses a dedicated configurable
key prefix, and reserves keys with a TTL derived from
timestamp_ms + freshness_window - now.
RateLimiter
Applies independent policies for:
- public REST route classes;
- authenticated gRPC requests by IP;
- authenticated gRPC requests by session;
- authenticated gRPC requests by user;
- authenticated gRPC requests by message class.
The current rate limiter is process-local and in-memory.
Public REST keys stay under the public_rest/... namespace, while
authenticated gRPC keys stay under authenticated_grpc/..., so both traffic
surfaces keep independent buckets even when they share the same limiter
backend.
PublicTrafficClassifier
Maps incoming public REST requests to one of the public route classes so that
limits and anti-abuse counters remain isolated.
The gateway normalizes any unsupported or empty classifier output to
public_misc, and public policy code derives the base bucket namespace from
the normalized class as public_rest/class=<class>.
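The normalization rule is small enough to sketch directly. The set of known classes below includes only the classes named in this document and may be incomplete. The names normalizeClass and bucketNamespace are illustrative.

```go
package main

import "fmt"

const classPublicMisc = "public_misc"

// knownClasses lists the public route classes named in this document.
var knownClasses = map[string]bool{
	"public_auth":       true,
	"browser_bootstrap": true,
	"browser_asset":     true,
	classPublicMisc:     true,
}

// normalizeClass maps empty or unsupported classifier output to public_misc.
func normalizeClass(raw string) string {
	if knownClasses[raw] {
		return raw
	}
	return classPublicMisc
}

// bucketNamespace derives the base bucket namespace from the normalized class.
func bucketNamespace(raw string) string {
	return "public_rest/class=" + normalizeClass(raw)
}

func main() {
	fmt.Println(bucketNamespace("browser_asset")) // public_rest/class=browser_asset
	fmt.Println(bucketNamespace(""))              // public_rest/class=public_misc
}
```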
AuthServiceClient
Handles public auth commands and session-related updates exchanged with the Auth / Session Service. The gateway contract is:
- SendEmailCode(email) -> challenge_id
- ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id
When no concrete implementation is wired, the gateway keeps the public routes
available and returns a stable 503 service_unavailable response instead of
failing process startup.
DownstreamRouter
Resolves the target downstream service or adapter by the full exact-match
message_type literal.
The default cmd/gateway wiring resolves the reserved user.* and
lobby.* self-service message types through backendclient.UserRoutes
and backendclient.LobbyRoutes. When GATEWAY_BACKEND_HTTP_URL is
unset these routes stay mounted and fail closed as
dependency-unavailable instead of falling through to a generic route
miss.
DownstreamClient
Executes a verified authenticated command against a downstream internal service and returns response payload bytes plus a stable opaque result code. An empty or whitespace-only result code is treated as an internal downstream contract violation.
Downstream clients may be pure pass-through adapters or gateway-owned
transcoding adapters. The backendclient adapter decodes
authenticated FlatBuffers payloads, calls backend's /api/v1/user/*
REST surface with X-User-Id, and re-encodes the JSON result into
FlatBuffers before the signed gateway response is emitted.
EventSubscriber
Subscribes to internal pub/sub topics used for:
- session cache updates;
- revocations;
- client-facing event delivery.
The implementation consumes two Redis Streams with replica-safe plain
XREAD: one strict full-session snapshot stream for the process-local session
cache and one client-facing event stream for live push fan-out.
PushHub
Tracks active SubscribeEvents streams, binds them to authenticated identities,
and delivers events to the correct connections.
The implementation uses one bounded in-memory queue per stream with a
default capacity of 64 events; overflowing one queue closes only that stream
and leaves the remaining streams active.
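This minimal sketch shows the non-blocking, bounded enqueue. It makes two assumptions: streams are keyed by a string ID, and closing a stream's channel is how its send loop learns to end the RPC. A full queue closes and unregisters only the overflowing stream.

```go
package main

import (
	"fmt"
	"sync"
)

const defaultQueueCapacity = 64

type stream struct {
	queue chan []byte // drained by the stream's send loop (not shown)
}

type pushHub struct {
	mu      sync.Mutex
	streams map[string]*stream
}

func newPushHub() *pushHub { return &pushHub{streams: map[string]*stream{}} }

func (h *pushHub) Register(id string) *stream {
	h.mu.Lock()
	defer h.mu.Unlock()
	s := &stream{queue: make(chan []byte, defaultQueueCapacity)}
	h.streams[id] = s
	return s
}

// Publish enqueues without blocking. On overflow it closes and unregisters
// only the affected stream; other streams are untouched.
func (h *pushHub) Publish(id string, event []byte) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	s, ok := h.streams[id]
	if !ok {
		return false
	}
	select {
	case s.queue <- event:
		return true
	default:
		close(s.queue) // the send loop observes the close and ends the RPC
		delete(h.streams, id)
		return false
	}
}

func main() {
	h := newPushHub()
	h.Register("s1")
	h.Register("s2")
	for i := 0; i <= defaultQueueCapacity; i++ { // the 65th event overflows s1
		h.Publish("s1", []byte("evt"))
	}
	fmt.Println(h.Publish("s1", []byte("evt")), h.Publish("s2", []byte("evt"))) // false true
}
```

Publishers never block on a slow client, which keeps one stalled connection from delaying fan-out to everyone else.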
ResponseSigner
Signs unary responses and stream events so clients can verify server-originated
messages.
The implementation uses one Ed25519 signer loaded from
GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH, which must reference a PKCS#8
PEM-encoded private key.
Clock
Provides current server time and supports consistent freshness-window checks.
Error Model and Observability
The gateway should expose stable edge-level error classes instead of leaking internal implementation details.
Minimum error categories:
- malformed request;
- request too large;
- unsupported protocol;
- unknown session;
- revoked session;
- invalid signature;
- stale request;
- replay detected;
- rate limited;
- policy denied;
- downstream unavailable;
- backend unavailable;
- gateway shutting down;
- internal error.
Observability requirements:
- stable correlation identifiers, including request_id and optional trace_id;
- structured logs;
- security audit events for rejects and abuse signals;
- metrics keyed by route class, message type, result code, and reject reason;
- no logging of secrets, raw private material, or raw signatures.
The service uses:
- go.uber.org/zap for structured JSON logs;
- otelgin for the public REST listener;
- otelgrpc for the authenticated gRPC listener;
- OpenTelemetry metrics exported through Prometheus on the optional admin /metrics listener.
Current custom metric families:
- gateway.public_http.requests
- gateway.public_http.duration
- gateway.authenticated_grpc.requests
- gateway.authenticated_grpc.duration
- gateway.push.active_streams
- gateway.push.stream_closures
- gateway.internal_event_drops
The process-wide log level is configured by GATEWAY_LOG_LEVEL and
defaults to info.
The default OpenTelemetry resource uses service.name=galaxy-edge-gateway
when OTEL_SERVICE_NAME is unset.
If OTEL_TRACES_EXPORTER is unset or set to none, the gateway keeps tracing
runtime enabled but installs no external trace exporter.
If OTEL_TRACES_EXPORTER=otlp, the gateway uses the standard
OTEL_EXPORTER_OTLP_* environment variables to configure the OTLP trace
exporter protocol and endpoint.
The protocol selection specifically honors
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL first and falls back to
OTEL_EXPORTER_OTLP_PROTOCOL when the trace-specific variable is unset.
Supported values are http/protobuf and grpc; when both variables are
unset, the gateway defaults to http/protobuf.
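The protocol precedence described above fits in one function. Two behaviors in it are assumptions: an empty value is treated as unset, and an unsupported value returns an error. The lookup parameter lets callers pass os.LookupEnv or a test map.

```go
package main

import (
	"fmt"
	"os"
)

// traceProtocol prefers OTEL_EXPORTER_OTLP_TRACES_PROTOCOL, falls back to
// OTEL_EXPORTER_OTLP_PROTOCOL, and defaults to http/protobuf.
func traceProtocol(lookup func(string) (string, bool)) (string, error) {
	p, ok := lookup("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if !ok || p == "" {
		p, ok = lookup("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	if !ok || p == "" {
		return "http/protobuf", nil
	}
	switch p {
	case "http/protobuf", "grpc":
		return p, nil
	default:
		return "", fmt.Errorf("unsupported OTLP trace protocol %q", p)
	}
}

func main() {
	p, err := traceProtocol(os.LookupEnv)
	fmt.Println(p, err)
}
```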
Structured logs intentionally omit:
- public auth e-mail addresses, login codes, and challenge IDs;
- client public keys;
- raw payload bytes and payload hashes;
- raw request or response signatures;
- response-signer private key material and Redis credentials.
Malformed internal session and client-event stream entries are not silently
dropped: the gateway logs each drop and increments
gateway.internal_event_drops.
Non-Goals
The gateway is not a business authorization layer and must not grow into a domain coordinator.
The gateway must not:
- implement business ownership checks;
- validate domain state transitions;
- replace the Auth / Session Service as the session source of truth;
- degrade into a synchronous pass-through that reloads session state for every authenticated request.