Ilia Denisov f446c6a2ac feat: backend service

2026-05-06 10:14:55 +03:00

Edge Gateway

Run and Dependencies

cmd/gateway starts with built-in listener defaults, but it still requires:

one reachable Redis deployment used exclusively for anti-replay reservations (no session projection, no event streams);
one reachable backend instance hosting the consolidated REST surface (/api/v1/{public,user,internal}/*) and the Push.SubscribePush gRPC listener;
one PKCS#8 PEM-encoded Ed25519 response-signer key referenced by GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH.

Required startup environment variables:

GATEWAY_REDIS_MASTER_ADDR
GATEWAY_REDIS_PASSWORD
GATEWAY_BACKEND_HTTP_URL
GATEWAY_BACKEND_GRPC_PUSH_URL
GATEWAY_BACKEND_GATEWAY_CLIENT_ID
GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH

Optional integrations:

GATEWAY_ADMIN_HTTP_ADDR enables the private /metrics listener;
GATEWAY_BACKEND_HTTP_TIMEOUT, GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF, GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF tune the backend client.

Operational caveats:

gateway issues one synchronous /api/v1/internal/sessions/{id} lookup per authenticated request — there is no process-local cache; backend keeps the source-of-truth record;
the gRPC SubscribePush consumer reconnects with exponential backoff and jitter on every backend restart and resumes from the last cursor it observed.

Additional module docs:

Purpose

Edge Gateway is the only public ingress for Galaxy Plus clients. It terminates the external transport and security boundary, enforces edge policies, and routes verified requests to internal services.

The gateway does not implement domain-specific business logic. Business validation, authorization, ownership checks, and state transitions remain inside downstream services.

Trust Boundary

The gateway sits between untrusted external clients and trusted internal services.

The gateway is responsible for:

parsing external transport requests;
classifying public REST traffic;
authenticating protected gRPC traffic;
loading session state from cache;
verifying request freshness and anti-replay constraints;
applying edge rate limits and anti-abuse policy;
building an authenticated internal command context;
routing verified commands to internal services;
maintaining authenticated push delivery connections.

The gateway is not responsible for:

deciding whether a user is allowed to execute a business action;
validating domain invariants;
storing the source-of-truth session record;
implementing business idempotency.

Transport Matrix

The gateway exposes two external transport classes.

Transport	Audience	Authentication	Payload format	Primary use
REST/JSON	Public, unauthenticated traffic	No device session auth	JSON	Health checks, public auth commands, and browser/bootstrap traffic
gRPC over HTTP/2	Authenticated clients only	Required	FlatBuffers payload inside protobuf control envelope	Verified commands and push delivery

Public REST Surface

The public REST surface is used for commands that must work before a device session exists and for browser-originated traffic that may share the same edge. It covers the probe endpoints, public auth routes, and coarse public anti-abuse.

Currently implemented public endpoints:

GET /healthz
GET /readyz
POST /api/v1/public/auth/send-email-code
POST /api/v1/public/auth/confirm-email-code

The implemented REST contract is documented in openapi.yaml. The listener address is configured by GATEWAY_PUBLIC_HTTP_ADDR. The public REST listener read budgets are configured by:

GATEWAY_PUBLIC_HTTP_READ_HEADER_TIMEOUT with default 2s;
GATEWAY_PUBLIC_HTTP_READ_TIMEOUT with default 10s;
GATEWAY_PUBLIC_HTTP_IDLE_TIMEOUT with default 1m.

The public auth JSON contract uses a challenge-token flow:

send-email-code accepts email and returns challenge_id;
confirm-email-code accepts challenge_id, code, client_public_key, and time_zone, then returns device_session_id.

The JSON body for send-email-code remains unchanged, but the gateway may also consume the standard Accept-Language header on that route. The gateway resolves the first supported BCP 47 language tag, falls back to en when no supported tag is present, and forwards that derived preferred-language candidate to Auth / Session Service for localized auth mail and possible first-user creation. This derived candidate is the current create-path source of truth for preferred_language.

For confirm-email-code, client_public_key is the standard base64-encoded raw 32-byte Ed25519 public key for the device session being created, and time_zone is the client-selected IANA time zone name forwarded unchanged to Auth / Session Service. The public confirm-email-code DTO itself remains unchanged.
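
The Accept-Language resolution described above can be sketched as follows. This is a minimal illustration, not the gateway's implementation: the supported set, the function name preferredLanguage, and the choice to ignore ";q=" weights and match on the primary subtag only are all assumptions made for the example.

```go
package main

import "strings"

// supported is a hypothetical set of languages the deployment can localize.
var supported = map[string]bool{"en": true, "ru": true}

// preferredLanguage returns the first supported primary language subtag found
// in an Accept-Language header value, scanning in header order, and falls
// back to "en" when nothing matches. Quality weights are ignored here.
func preferredLanguage(header string) string {
	for _, part := range strings.Split(header, ",") {
		tag, _, _ := strings.Cut(part, ";") // drop ";q=..." parameters
		tag = strings.ToLower(strings.TrimSpace(tag))
		primary, _, _ := strings.Cut(tag, "-") // "ru-RU" -> "ru"
		if supported[primary] {
			return primary
		}
	}
	return "en"
}
```

With the assumed set above, a header of "fr-CA, ru-RU;q=0.9" resolves to ru, while an empty or fully unsupported header resolves to en.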

These routes remain unauthenticated and delegate only through an injected AuthServiceClient. The default wiring used by cmd/gateway keeps the routes mounted and returns 503 service_unavailable until a concrete upstream auth adapter is supplied. Public auth adapter calls are wrapped in GATEWAY_PUBLIC_AUTH_UPSTREAM_TIMEOUT, which defaults to 3s. When that timeout expires, the gateway preserves the public REST contract and returns 503 service_unavailable. When an injected auth adapter returns *AuthServiceError, the gateway projects that client-safe 4xx/5xx status, code, and message back to the caller after normalizing blank or invalid fields. Unexpected non-AuthServiceError adapter failures fail closed as 500 internal_error.

Public anti-abuse is process-local and in-memory. Per-IP buckets are derived only from the TCP peer RemoteAddr. Forwarded proxy headers such as X-Forwarded-For and Forwarded are intentionally ignored. Oversized public REST bodies are rejected with 413 request_too_large. Rate-limited requests are rejected with 429 rate_limited and a Retry-After header.
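
As an illustration of deriving the per-IP bucket only from the TCP peer address, the sketch below takes the value of http.Request.RemoteAddr and never looks at X-Forwarded-For or Forwarded. The function name peerIPKey and the "unknown" fallback key for unparsable addresses are assumptions for this example.

```go
package main

import "net"

// peerIPKey derives a rate-limit bucket key from a "host:port" peer address
// such as http.Request.RemoteAddr. Proxy headers are deliberately not
// consulted. Unparsable input maps to a shared "unknown" bucket (assumed).
func peerIPKey(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return "unknown"
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return "unknown"
	}
	return ip.String()
}
```

A handler would call peerIPKey(r.RemoteAddr) before consulting the in-memory limiter, so a client cannot move itself into another bucket by forging a header.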

In addition to the fixed endpoints above, the gateway may front browser bootstrap or asset traffic through a pluggable public handler or proxy. That traffic belongs to dedicated public route classes and must not share rate limit buckets or abuse counters with the public auth API.

Operational Admin Surface

The gateway may expose one private operational HTTP listener used for metrics.

The admin listener is disabled by default and is enabled only when GATEWAY_ADMIN_HTTP_ADDR is non-empty. When enabled, it serves:

GET /metrics

The admin listener read budgets are configured by:

GATEWAY_ADMIN_HTTP_READ_HEADER_TIMEOUT with default 2s;
GATEWAY_ADMIN_HTTP_READ_TIMEOUT with default 10s;
GATEWAY_ADMIN_HTTP_IDLE_TIMEOUT with default 1m.

/metrics is intentionally not mounted on the public REST ingress. It is also intentionally excluded from openapi.yaml, because that specification covers only the public REST ingress. The endpoint exposes metrics in the Prometheus text exposition format described in the official Prometheus documentation: https://prometheus.io/docs/instrumenting/exposition_formats/.

Authenticated gRPC Surface

All authenticated client requests use HTTP/2 and gRPC. The listener address is configured by GATEWAY_AUTHENTICATED_GRPC_ADDR. Inbound authenticated gRPC connection setup is bounded by GATEWAY_AUTHENTICATED_GRPC_CONNECTION_TIMEOUT, which defaults to 5s. The accepted client timestamp skew is configured by GATEWAY_AUTHENTICATED_GRPC_FRESHNESS_WINDOW and defaults to 5m.

The public gRPC service exposes two methods:

ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)
SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)

ExecuteCommand is a generic unary RPC. The gateway routes the request downstream by message_type after transport verification succeeds. Downstream unary execution is bounded by GATEWAY_AUTHENTICATED_DOWNSTREAM_TIMEOUT, which defaults to 5s. When that timeout expires, the gateway preserves the authenticated gRPC contract and returns gRPC UNAVAILABLE with message downstream service is unavailable.

SubscribeEvents is an authenticated server-streaming RPC. It binds the stream to user_id and device_session_id and starts by sending a signed service event that includes the current server time in milliseconds.

The v1 protobuf contract lives in proto/galaxy/gateway/v1/edge_gateway.proto under package galaxy.gateway.v1 and service EdgeGateway. Generated Go bindings are committed under proto/galaxy/gateway/v1/ and are regenerated with:

buf generate

The gateway validates the request envelope, device-session cache lookup, payload_hash, the client Ed25519 signature, timestamp freshness, replay reservation, authenticated rate limits, and the authenticated policy hook before any later routing or push step runs.

Malformed envelopes are rejected with gRPC INVALID_ARGUMENT. Requests with a non-empty but unsupported protocol_version are rejected with gRPC FAILED_PRECONDITION. The supported request protocol_version literal is v1. Requests with an unknown device_session_id are rejected with gRPC UNAUTHENTICATED. Requests for revoked sessions are rejected with gRPC FAILED_PRECONDITION. SessionCache backend failures, including Redis lookup or record-decode failures, are rejected with gRPC UNAVAILABLE. Requests with a payload_hash that is not a 32-byte SHA-256 digest or does not match payload_bytes are rejected with gRPC INVALID_ARGUMENT. Requests with an invalid client signature or a signature created by a different key are rejected with gRPC UNAUTHENTICATED and message invalid request signature. Requests with malformed cached client_public_key material fail closed as gRPC UNAVAILABLE.

Requests with a timestamp_ms outside the symmetric freshness window around current server time are rejected with gRPC FAILED_PRECONDITION and message request timestamp is outside the freshness window. Requests that reuse the same request_id for the same device_session_id inside the active replay window are rejected with gRPC FAILED_PRECONDITION and message request replay detected. ReplayStore backend failures fail closed with gRPC UNAVAILABLE and message replay store is unavailable.

Authenticated rate limits are enforced independently by transport peer IP, authenticated device_session_id, authenticated user_id, and authenticated message class. The gateway uses the full verified message_type literal as the stable v1 message-class key because the transport does not yet define a coarser authenticated class taxonomy. The peer IP is derived only from the gRPC transport peer address; if it is missing or cannot be parsed, the request falls back to the stable unknown IP bucket. Requests that exceed any authenticated rate-limit bucket are rejected with gRPC RESOURCE_EXHAUSTED and message authenticated request rate limit exceeded. The authenticated edge policy hook runs after those rate limits and defaults to allow-all until a concrete policy evaluator is wired into the process.

ExecuteCommand builds an internal authenticated command context, resolves one exact-match downstream route by the full verified message_type literal, executes the downstream unary client, and signs the response before it is returned to the caller. When no exact downstream route is registered, ExecuteCommand is rejected with gRPC UNIMPLEMENTED and message message_type is not routed. Downstream availability failures are rejected with gRPC UNAVAILABLE and message downstream service is unavailable. Unexpected downstream route-resolution or execution failures are rejected with gRPC INTERNAL. Successful unary responses preserve the original request_id, carry a SHA-256 payload_hash of the returned payload_bytes, and are signed with the configured server Ed25519 response signer. The default cmd/gateway wiring currently installs an empty static downstream router, so verified ExecuteCommand requests still return gRPC UNIMPLEMENTED until concrete downstream routes are injected.

SubscribeEvents applies the full authenticated ingress pipeline, binds the stream to the verified user_id and device_session_id, sends one signed gateway.server_time bootstrap event whose FlatBuffers payload carries server_time_ms, registers the active stream in the in-memory PushHub, and then forwards signed client-facing events consumed from the configured client event Redis stream. User-targeted events fan out to every active stream for that user. Session-targeted events fan out only to streams whose user_id and device_session_id both match the event target. Each active stream uses a bounded in-memory queue; when that queue overflows, only the affected stream is closed with gRPC RESOURCE_EXHAUSTED and message push stream overflowed. When the session lifecycle stream reports that the same device_session_id was revoked, every active SubscribeEvents stream bound to that exact session is closed with gRPC FAILED_PRECONDITION and message device session is revoked. During gateway shutdown, the in-memory push hub is closed before gRPC graceful stop, and every active SubscribeEvents stream is terminated with gRPC UNAVAILABLE and message gateway is shutting down.

Authenticated anti-abuse budgets are configured by the GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_* environment variables.

Current authenticated gRPC defaults:

per-IP: 120 requests / minute, burst=40;
per-session: 60 requests / minute, burst=20;
per-user: 120 requests / minute, burst=40;
per-message-class: 60 requests / minute, burst=20.

Authenticated anti-abuse configuration surface:

per-IP: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_REQUESTS default 120, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_IP_RATE_LIMIT_BURST default 40;
per-session: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_REQUESTS default 60, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_SESSION_RATE_LIMIT_BURST default 20;
per-user: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_REQUESTS default 120, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_USER_RATE_LIMIT_BURST default 40;
per-message-class: GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_REQUESTS default 60, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_WINDOW default 1m, GATEWAY_AUTHENTICATED_GRPC_ANTI_ABUSE_MESSAGE_CLASS_RATE_LIMIT_BURST default 20.
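
To make the requests / window / burst triple concrete, here is a minimal token-bucket sketch. The README does not state which limiting algorithm the gateway uses, so the token-bucket interpretation (refill at requests per window, capacity equal to burst) and all names below are assumptions. The sketch is not safe for concurrent use; a real limiter would guard each bucket with a mutex.

```go
package main

import "time"

// tokenBucket is a hypothetical single-key limiter: tokens refill at
// requests/window and the bucket never holds more than burst tokens.
type tokenBucket struct {
	refillPerMs float64
	burst       float64
	tokens      float64
	lastMs      int64
}

// newTokenBucket starts with a full bucket at time nowMs (Unix milliseconds).
// window is expected to be at least one millisecond.
func newTokenBucket(requests int, window time.Duration, burst int, nowMs int64) *tokenBucket {
	return &tokenBucket{
		refillPerMs: float64(requests) / float64(window.Milliseconds()),
		burst:       float64(burst),
		tokens:      float64(burst),
		lastMs:      nowMs,
	}
}

// allow refills the bucket for the elapsed time and consumes one token when
// at least one is available.
func (b *tokenBucket) allow(nowMs int64) bool {
	if nowMs > b.lastMs {
		b.tokens += float64(nowMs-b.lastMs) * b.refillPerMs
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

Under this reading, the per-session default (60 requests / minute, burst=20) admits 20 back-to-back requests and then roughly one more per second.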

Envelope and Payload Model

The authenticated transport uses a split contract:

gRPC control messages are protobuf-based;
business payload bytes are FlatBuffers;
signatures are computed over canonical envelope fields and a hash of raw FlatBuffers bytes.

The gateway verifies authenticated payload bytes before any downstream call. Most downstream routes may still treat those bytes as opaque, but the gateway is also allowed to transcode verified FlatBuffers payloads into trusted downstream REST/JSON calls when the concrete downstream contract requires it.

The current direct Gateway -> User self-service boundary uses that pattern:

external message types:
- user.account.get
- user.profile.update
- user.settings.update
external payloads and responses:
- FlatBuffers
internal downstream transport:
- strict REST/JSON to User Service
business error projection:
- gateway result_code
- FlatBuffers error payload mirroring User Service code and message

The request envelope version literal is v1. payload_hash is the raw 32-byte SHA-256 digest of payload_bytes. ExecuteCommand hashes the raw FlatBuffers payload bytes exactly as sent, while SubscribeEvents with an empty payload still requires sha256([]byte{}) rather than a special-case value. The v1 request signature scheme is Ed25519. client_public_key is the standard base64-encoded raw 32-byte Ed25519 public key registered during confirm-email-code. signature carries the raw 64-byte Ed25519 signature computed over the canonical request signing input.
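
The payload_hash rules above can be sketched with the Go standard library. The function names are illustrative; the two error messages are the ones this document lists for the corresponding INVALID_ARGUMENT rejections. The constant-time comparison is a defensive choice of this sketch, not something the README prescribes.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"errors"
)

var (
	errHashLength   = errors.New("payload_hash must be a 32-byte SHA-256 digest")
	errHashMismatch = errors.New("payload_hash does not match payload_bytes")
)

// payloadHash returns the raw 32-byte SHA-256 digest of payloadBytes.
// An empty or nil payload hashes to sha256([]byte{}); there is no special case.
func payloadHash(payloadBytes []byte) []byte {
	sum := sha256.Sum256(payloadBytes)
	return sum[:]
}

// verifyPayloadHash checks the digest length first and then compares the
// digest of the bytes exactly as received.
func verifyPayloadHash(payloadBytes, hash []byte) error {
	if len(hash) != sha256.Size {
		return errHashLength
	}
	if subtle.ConstantTimeCompare(payloadHash(payloadBytes), hash) != 1 {
		return errHashMismatch
	}
	return nil
}
```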

The v1 stream bootstrap payload uses the shared FlatBuffers schema pkg/schema/fbs/gateway.fbs with root table gateway.ServerTimeEvent.

ExecuteCommandRequest

Required fields:

protocol_version
device_session_id
message_type
timestamp_ms
request_id
payload_bytes
payload_hash
signature

Optional fields:

trace_id

ExecuteCommandResponse

Required fields:

protocol_version
request_id
timestamp_ms
result_code
payload_bytes
payload_hash
signature

The v1 unary response signature scheme is Ed25519 with response domain marker galaxy-response-v1. The response signing input uses the same canonical binary encoding shape as the request signer:

each string and bytes field is encoded as uvarint(len(field_bytes)) followed by raw bytes;
timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
the signed field order is galaxy-response-v1, protocol_version, request_id, timestamp_ms, result_code, payload_hash.
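
The canonical encoding above can be sketched as follows. Two points are assumptions of this sketch rather than facts from this README: the domain marker is length-prefixed like any other string field, and result_code is carried as a string. The canonical source remains the public authn package mentioned in the package layout.

```go
package main

import (
	"crypto/ed25519"
	"encoding/binary"
)

const responseDomain = "galaxy-response-v1"

// appendField writes uvarint(len(field)) followed by the raw bytes.
func appendField(dst, field []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(field)))
	return append(dst, field...)
}

// responseSigningInput concatenates the signed fields in the documented
// order: domain marker, protocol_version, request_id, timestamp_ms,
// result_code, payload_hash.
func responseSigningInput(protocolVersion, requestID string, timestampMs uint64, resultCode string, payloadHash []byte) []byte {
	var buf []byte
	buf = appendField(buf, []byte(responseDomain))
	buf = appendField(buf, []byte(protocolVersion))
	buf = appendField(buf, []byte(requestID))
	buf = binary.BigEndian.AppendUint64(buf, timestampMs) // fixed 8 bytes, no length prefix
	buf = appendField(buf, []byte(resultCode))
	buf = appendField(buf, payloadHash)
	return buf
}

// verifyResponse shows the client-side check against the server public key.
func verifyResponse(serverKey ed25519.PublicKey, input, signature []byte) bool {
	return len(serverKey) == ed25519.PublicKeySize && ed25519.Verify(serverKey, input, signature)
}
```

Because every variable-length field carries its own length prefix, two different field tuples cannot produce the same signing input by shifting bytes across a field boundary.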

cmd/gateway loads the unary response signer from GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH, which must point to a PKCS#8 PEM-encoded Ed25519 private key. Startup fails when the file is absent, unreadable, not strict PEM, not PKCS#8, or not Ed25519.
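
A loader with those failure modes could look like the sketch below. The function names are hypothetical, and "strict PEM" is interpreted here as a single PRIVATE KEY block with nothing but whitespace after it; the actual strictness rules of cmd/gateway may differ.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// parseResponseSignerPEM decodes one PKCS#8 PEM block and requires the
// contained key to be Ed25519.
func parseResponseSignerPEM(data []byte) (ed25519.PrivateKey, error) {
	block, rest := pem.Decode(data)
	if block == nil {
		return nil, errors.New("response signer: no PEM block found")
	}
	if block.Type != "PRIVATE KEY" {
		return nil, fmt.Errorf("response signer: unexpected PEM type %q", block.Type)
	}
	if len(bytes.TrimSpace(rest)) != 0 {
		return nil, errors.New("response signer: trailing data after PEM block")
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("response signer: not PKCS#8: %w", err)
	}
	signer, ok := key.(ed25519.PrivateKey)
	if !ok {
		return nil, errors.New("response signer: key is not Ed25519")
	}
	return signer, nil
}

// loadResponseSigner reads the file named by
// GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH and parses it.
func loadResponseSigner(path string) (ed25519.PrivateKey, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("response signer: %w", err)
	}
	return parseResponseSignerPEM(data)
}
```

A compatible key file can be produced with OpenSSL via `openssl genpkey -algorithm ed25519 -out signer.pem`.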

SubscribeEventsRequest

The stream open request reuses the authenticated request model. It contains the same authentication fields as the unary request and either an empty payload or a minimal connect payload.

Required fields:

protocol_version
device_session_id
message_type
timestamp_ms
request_id
payload_hash
signature

Optional fields:

payload_bytes
trace_id

GatewayEvent

Every stream event is a client-facing signed server message.

Required fields:

event_type
event_id
timestamp_ms
payload_bytes
payload_hash
signature

Optional fields:

request_id
trace_id

The v1 stream-event signature scheme is Ed25519 with event domain marker galaxy-event-v1. The event signing input uses the same canonical binary encoding shape as the request and unary response signers:

each string and bytes field is encoded as uvarint(len(field_bytes)) followed by raw bytes;
timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
the signed field order is galaxy-event-v1, event_type, event_id, timestamp_ms, request_id, trace_id, payload_hash.

The bootstrap event uses:

event_type = "gateway.server_time";
event_id = request_id from the opening SubscribeEvents request;
payload_bytes encoded as FlatBuffers gateway.ServerTimeEvent with server_time_ms;
the same loaded Ed25519 signer configured by GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH.

Client-facing fan-out events are sourced from the internal client event stream. Internal publishers provide the event target and business payload only: user_id, optional device_session_id, event_type, event_id, payload_bytes, and optional request_id / trace_id. The gateway derives timestamp_ms, recomputes payload_hash, signs the event, and only then forwards it to the matching SubscribeEvents streams.

Notification-owned user-facing payloads are expected to use pkg/schema/fbs/notification.fbs. The initial notification event vocabulary in v1 is exactly:

game.turn.ready
game.finished
lobby.application.submitted
lobby.membership.approved
lobby.membership.rejected
lobby.membership.blocked
lobby.invite.created
lobby.invite.redeemed
lobby.race_name.registration_eligible
lobby.race_name.registered

lobby.application.submitted is published toward Gateway only for the private-game owner flow. The public-game variant is email-only. The real Notification Service -> Gateway integration suite verifies this user-targeted fan-out path and asserts that notification-owned push events do not include device_session_id, so Gateway delivers them to every active stream for the target user. Auth-code email does not use this push path and continues to bypass Notification Service.

Verification and Routing Pipeline

The gateway applies the same strict verification order to all authenticated gRPC ingress.

1. Parse the control envelope and validate required fields.
2. Check whether protocol_version is supported.
3. Resolve device_session_id through SessionCache.
4. Reject unknown or revoked sessions.
5. Verify that payload_hash matches raw payload_bytes.
6. Verify the client signature using the public key from session cache.
7. Verify that timestamp_ms is inside the accepted freshness window.
8. Verify anti-replay by checking device_session_id + request_id.
9. Apply authenticated rate limit and edge policy checks.
10. Build the authenticated internal command context.
11. Route the command downstream by message_type.

No downstream business service should receive a request that has not passed this full verification pipeline.

ExecuteCommand enforces steps 1 through 11 and signs the successful unary response afterward. SubscribeEvents enforces steps 1 through 9, binds the verified stream identity, sends the initial signed server-time bootstrap event, and then keeps the stream open for push delivery.

Malformed envelopes fail with gRPC INVALID_ARGUMENT. Unsupported non-empty protocol_version values fail with gRPC FAILED_PRECONDITION. Unknown sessions fail with gRPC UNAUTHENTICATED. Revoked sessions fail with gRPC FAILED_PRECONDITION. SessionCache backend failures fail with gRPC UNAVAILABLE.

payload_hash values that are not raw 32-byte SHA-256 digests fail with gRPC INVALID_ARGUMENT and message payload_hash must be a 32-byte SHA-256 digest. payload_hash values that do not match payload_bytes fail with gRPC INVALID_ARGUMENT and message payload_hash does not match payload_bytes. Invalid request signatures fail with gRPC UNAUTHENTICATED and message invalid request signature. Malformed cached client_public_key values fail closed with gRPC UNAVAILABLE and message session cache is unavailable.

Requests with a timestamp_ms outside the accepted freshness window fail with gRPC FAILED_PRECONDITION and message request timestamp is outside the freshness window. Requests that reuse the same request_id for the same device_session_id inside the active replay window fail with gRPC FAILED_PRECONDITION and message request replay detected. ReplayStore backend failures fail with gRPC UNAVAILABLE and message replay store is unavailable.

Unrouted exact-match message_type values fail with gRPC UNIMPLEMENTED and message message_type is not routed. Downstream availability failures fail with gRPC UNAVAILABLE and message downstream service is unavailable.

Internal Authenticated Contract

Downstream services should receive an internal authenticated command rather than raw external gRPC transport data.

The minimum authenticated context is:

user_id
device_session_id
message_type
verified payload_bytes
request_id
optional trace_id
optional client metadata needed for logs and tracing

Downstream services may trust that the gateway has already performed transport authentication, freshness verification, and anti-replay checks. They must still perform business authorization and domain validation.

Session Model

The Auth / Session Service is the source of truth for device session state. The gateway is designed to authenticate the hot path from cache.

Expected session fields available to the gateway:

device_session_id
user_id
base64-encoded raw 32-byte Ed25519 client public key
session status
revoke metadata
optional client metadata

Session Cache

SessionCache provides the fast path for:

session existence checks;
device_session_id -> user_id;
access to the base64-encoded raw Ed25519 client public key used for signature verification;
revoked versus active status checks.

Cache updates are event-driven. TTL is allowed only as a safety net and must not replace invalidation events.

The gateway keeps a process-local in-memory snapshot cache in front of the Redis fallback backend. Authenticated requests read the local snapshot first. A local miss performs one bounded Redis lookup and seeds the local snapshot so later requests for the same session avoid another Redis round-trip unless a later session event changes the cached state.

The local snapshot cache intentionally has no TTL and no size-based eviction policy. Session lifecycle events are the authoritative mechanism for keeping the hot path current, while Redis fallback remains the safety net for cold misses and process restarts.

The Redis fallback implementation uses go-redis/v9. cmd/gateway opens one shared *redis.Client via pkg/redisconn (instrumented with OpenTelemetry tracing and metrics), issues a single bounded PING on startup, and refuses to start when Redis is misconfigured or unavailable. The session cache, replay store, session-events subscriber, and client-events subscriber all use that shared client. See docs/redis-config.md for the rationale behind the shape and the project-wide rules in ARCHITECTURE.md §Persistence Backends.

Required Redis connection variables:

GATEWAY_REDIS_MASTER_ADDR
GATEWAY_REDIS_PASSWORD

Optional Redis connection variables:

GATEWAY_REDIS_REPLICA_ADDRS (comma-separated; reserved for future read-routing — currently unused)
GATEWAY_REDIS_DB with default 0
GATEWAY_REDIS_OPERATION_TIMEOUT with default 250ms

Removed: GATEWAY_SESSION_CACHE_REDIS_ADDR, GATEWAY_SESSION_CACHE_REDIS_USERNAME, GATEWAY_SESSION_CACHE_REDIS_PASSWORD, GATEWAY_SESSION_CACHE_REDIS_DB, GATEWAY_SESSION_CACHE_REDIS_TLS_ENABLED. pkg/redisconn.LoadFromEnv rejects the deprecated GATEWAY_REDIS_TLS_ENABLED and GATEWAY_REDIS_USERNAME variables at startup.

Per-subsystem Redis behavior variables (namespace, timeouts):

GATEWAY_REPLAY_REDIS_KEY_PREFIX with default gateway:replay:
GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT with default 250ms

Gateway no longer keeps a session cache projection or the two Redis Streams (session_events, client_events). Session lookup is a synchronous REST call to backend, and inbound client / session events arrive through the gRPC Push.SubscribePush consumer (see the Backend Client section below). Redis is therefore used only by the Replay Store.

Backend Client

backendclient is the single gateway → backend adapter:

RESTClient calls /api/v1/internal/sessions/{id} synchronously per authenticated request, forwards public auth (/api/v1/public/auth/*) and authenticated user / lobby commands (/api/v1/user/*) with the verified X-User-Id header.
PushClient consumes Push.SubscribePush and reconnects with exponential backoff plus jitter, replaying the last cursor on every reconnect.

Required startup variables:

GATEWAY_BACKEND_HTTP_URL — absolute base URL for the backend HTTP listener;
GATEWAY_BACKEND_GRPC_PUSH_URL — host:port of the backend Push.SubscribePush listener;
GATEWAY_BACKEND_GATEWAY_CLIENT_ID — durable identity presented to backend so reconnects replace the previous subscription.

Optional tuning:

GATEWAY_BACKEND_HTTP_TIMEOUT with default 5s;
GATEWAY_BACKEND_PUSH_RECONNECT_BASE_BACKOFF with default 250ms;
GATEWAY_BACKEND_PUSH_RECONNECT_MAX_BACKOFF with default 30s.
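
The reconnect schedule implied by the base and max backoff settings can be sketched like this. The README only says "exponential backoff plus jitter"; the doubling factor and the full-jitter strategy (a uniform delay between zero and the current ceiling) are assumptions, as are the function names.

```go
package main

import (
	"math/rand"
	"time"
)

// backoffCeiling returns min(max, base * 2^attempt) without overflowing.
func backoffCeiling(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= max/2 {
			return max // doubling again would reach or exceed the cap
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// nextDelay picks a uniformly random delay in [0, ceiling] ("full jitter"),
// so that many gateways reconnecting after one backend restart spread out.
func nextDelay(attempt int, base, max time.Duration, rng *rand.Rand) time.Duration {
	ceiling := backoffCeiling(attempt, base, max)
	if ceiling <= 0 {
		return 0
	}
	return time.Duration(rng.Int63n(int64(ceiling) + 1))
}
```

With the defaults (250ms base, 30s max) the ceiling grows 250ms, 500ms, 1s, 2s and so on, and stays at 30s from the eighth consecutive failure onward. A caller would reset the attempt counter once a subscription is established again.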

Replay Store

ReplayStore provides the hot-path anti-replay reservation for:

duplicate detection by device_session_id + request_id;
bounded replay protection for the authenticated freshness window.

The ReplayStore uses Redis through go-redis/v9. cmd/gateway requires the ReplayStore backend during startup, issues a bounded PING, and refuses to start when Redis is misconfigured or unavailable.

The ReplayStore reuses the same Redis deployment settings as SessionCache and adds two replay-specific environment variables:

GATEWAY_REPLAY_REDIS_KEY_PREFIX with default gateway:replay:
GATEWAY_REPLAY_REDIS_RESERVE_TIMEOUT with default 250ms

Replay keys use this format:

<key_prefix><base64url(device_session_id)>:<base64url(request_id)>

For each accepted request, the replay reservation TTL is computed as:

timestamp_ms + freshness_window - now

The TTL is clamped to a minimum positive duration so requests accepted exactly on the freshness boundary still reserve their replay key.
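
The key format, the symmetric freshness check, and the TTL formula can be combined into one small sketch. The unpadded base64url variant, the 1ms clamp value, and the function names are assumptions; the README specifies only "base64url" and "a minimum positive duration".

```go
package main

import (
	"encoding/base64"
	"time"
)

// minReplayTTL is a hypothetical clamp so a zero or negative TTL never
// reaches Redis.
const minReplayTTL = time.Millisecond

// replayKey builds <key_prefix><base64url(device_session_id)>:<base64url(request_id)>.
func replayKey(prefix, deviceSessionID, requestID string) string {
	enc := base64.RawURLEncoding // unpadded variant (assumed)
	return prefix + enc.EncodeToString([]byte(deviceSessionID)) + ":" + enc.EncodeToString([]byte(requestID))
}

// isFresh accepts timestamps within +/- window of the server clock,
// including the exact boundary.
func isFresh(timestampMs, nowMs int64, window time.Duration) bool {
	delta := nowMs - timestampMs
	if delta < 0 {
		delta = -delta
	}
	return delta <= window.Milliseconds()
}

// replayTTL computes timestamp_ms + freshness_window - now, clamped to a
// minimum positive duration.
func replayTTL(timestampMs, nowMs int64, window time.Duration) time.Duration {
	ttl := time.Duration(timestampMs-nowMs)*time.Millisecond + window
	if ttl < minReplayTTL {
		return minReplayTTL
	}
	return ttl
}
```

The reservation itself would then be a single atomic set-if-absent with that TTL (in Redis, SET with the NX and PX options): success means the request_id is new for the session, while an existing key means the request is a replay. A key only needs to outlive the moment at which the same timestamp would itself stop being fresh, which is why the TTL shrinks as the request ages.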

Revocation Behavior

When a device session is revoked:

the Auth / Session Service updates the source of truth;
it publishes a session update or revoke event;
the gateway invalidates or updates SessionCache;
new unary gRPC requests for that session are rejected;
active SubscribeEvents streams for that exact device_session_id are closed with gRPC FAILED_PRECONDITION and message device session is revoked.

Public Anti-Abuse Model

The public REST layer must distinguish between public auth operations and browser-originated traffic that may burst during a normal first page load.

The gateway uses these public route classes:

public_auth
browser_bootstrap
browser_asset
public_misc

Any classifier result outside this fixed set is normalized to public_misc before the class is stored in request context or used for policy derivation. The canonical base bucket namespace for public REST policy is public_rest/class=<class>.
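
The normalization rule and bucket namespace are small enough to show directly. The function names are illustrative; the class literals and the public_rest/class=<class> format come from this section.

```go
package main

// normalizePublicClass maps any classifier output outside the fixed set to
// public_misc.
func normalizePublicClass(class string) string {
	switch class {
	case "public_auth", "browser_bootstrap", "browser_asset", "public_misc":
		return class
	default:
		return "public_misc"
	}
}

// baseBucket returns the canonical base bucket namespace for a route class.
func baseBucket(class string) string {
	return "public_rest/class=" + normalizePublicClass(class)
}
```

Normalizing before the class is stored keeps the set of bucket namespaces bounded even if a pluggable classifier returns unexpected values.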

Public Auth

public_auth is the stable route class for send-email-code and confirm-email-code. This class uses stricter limits and abuse scoring because it directly touches account and session creation flows.

Controls include:

per-IP and per-identity rate limits;
request body size limits;
method allow-lists;
malformed request counters;
elevated logging and security telemetry for repeated failures.

Current defaults:

per-IP: 30 requests / minute, burst=10;
send-email-code identity buckets: 3 requests / 10 minutes, burst=1, keyed by normalized email;
confirm-email-code identity buckets: 6 requests / 10 minutes, burst=2, keyed by normalized challenge_id;
maximum request body size: 8192 bytes;
only POST is accepted for public auth routes.

Configuration surface:

GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_MAX_BODY_BYTES default 8192;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_REQUESTS default 30;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_WINDOW default 1m;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_AUTH_RATE_LIMIT_BURST default 10;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS default 3;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW default 10m;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_SEND_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST default 1;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_REQUESTS default 6;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_WINDOW default 10m;
GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_CONFIRM_EMAIL_CODE_IDENTITY_RATE_LIMIT_BURST default 2.

Browser Bootstrap and Asset Traffic

browser_bootstrap and browser_asset use separate coarse-grained budgets. They may exhibit bursty behavior during the first load and therefore must not be treated as hostile based on burst pattern alone.

This traffic is still constrained by:

dedicated rate limits;
method allow-lists;
body size limits where request bodies are expected;
protocol and path validation;
independent abuse telemetry.

The gateway must not merge these buckets or counters with public_auth.

Current defaults:

browser_bootstrap: 60 requests / minute, burst=20, GET and HEAD only, and no request body;
browser_asset: 300 requests / minute, burst=80, GET and HEAD only, and no request body;
public_misc: 30 requests / minute, burst=10, and no request body.

Configuration surface:

browser_bootstrap: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_REQUESTS default 60, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_BOOTSTRAP_RATE_LIMIT_BURST default 20;
browser_asset: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_REQUESTS default 300, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_BROWSER_ASSET_RATE_LIMIT_BURST default 80;
public_misc: GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_MAX_BODY_BYTES default 0, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_REQUESTS default 30, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_WINDOW default 1m, GATEWAY_PUBLIC_HTTP_ANTI_ABUSE_PUBLIC_MISC_RATE_LIMIT_BURST default 10.

Push Delivery Model

The v1 push channel is a gRPC server stream. Long-polling is intentionally out of scope for the first version.

Expected stream behavior:

the client opens SubscribeEvents;
the gateway applies the full authenticated ingress verification pipeline;
the stream is bound to user_id and device_session_id;
the first signed service event is gateway.server_time and its FlatBuffers payload includes server_time_ms;
after that bootstrap event, the stream is registered in PushHub and remains open until client cancellation, server shutdown, queue overflow, session revoke for the same device_session_id, or a later send failure;
internal pub/sub may target all active streams for one user_id or only one device_session_id within that user;
the current per-stream in-memory queue capacity is 64 events and overflow closes only the affected stream;
session revoke closes only streams bound to the same exact device_session_id and returns gRPC FAILED_PRECONDITION with message device session is revoked.

Lifecycle and Shutdown

Gateway process shutdown is coordinated across the public REST listener, authenticated gRPC listener, optional admin listener, internal Redis subscribers, and telemetry runtime.

GATEWAY_SHUTDOWN_TIMEOUT configures the per-component graceful shutdown budget and defaults to 5s. During authenticated gRPC shutdown, the in-memory PushHub closes active streams before gRPC graceful stop, so active SubscribeEvents calls terminate with gRPC UNAVAILABLE and message gateway is shutting down.

Recommended Package Layout

The package layout keeps transport, policy, and downstream adapters separate:

cmd/gateway
internal/app
internal/config
internal/restapi
internal/grpcapi
authn (public — canonical request/response/event signing input shared with external clients and the integration test suite)
internal/session
internal/replay
internal/ratelimit
internal/downstream
internal/push
internal/events
internal/clock

Key Interfaces

The gateway should be built around explicit consumer-side interfaces.

SessionCache

Provides cached session lookup by device_session_id. Returns enough data to verify signatures and identify the authenticated user. The current production implementation is a process-local read-through cache in front of a Redis fallback adapter that uses strict JSON records under a configurable key prefix.

ReplayStore

Tracks recently seen request_id values per device session and rejects replayed requests inside the accepted freshness window. The current production adapter is Redis-backed, uses a dedicated configurable key prefix, and reserves keys with a TTL derived from timestamp_ms + freshness_window - now.

RateLimiter

Applies independent policies for:

public REST route classes;
authenticated gRPC requests by IP;
authenticated gRPC requests by session;
authenticated gRPC requests by user;
authenticated gRPC requests by message class.

The current rate limiter is process-local and in-memory. Public REST keys stay under the public_rest/... namespace, while authenticated gRPC keys stay under authenticated_grpc/..., so both traffic surfaces keep independent buckets even when they share the same limiter backend.

PublicTrafficClassifier

Maps incoming public REST requests to one of the public route classes so that limits and anti-abuse counters remain isolated. The gateway normalizes any unsupported or empty classifier output to public_misc, and public policy code derives the base bucket namespace from the normalized class as public_rest/class=<class>.
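
The normalization rule and namespace derivation can be sketched like this. The function names are hypothetical, and the set of four recognized classes is taken from the classes named in this document; the real classifier may recognize others.

```go
package main

import "fmt"

// Route classes named in this document.
const (
	ClassPublicAuth       = "public_auth"
	ClassBrowserBootstrap = "browser_bootstrap"
	ClassBrowserAsset     = "browser_asset"
	ClassPublicMisc       = "public_misc"
)

// NormalizeClass maps unsupported or empty classifier output to public_misc.
func NormalizeClass(class string) string {
	switch class {
	case ClassPublicAuth, ClassBrowserBootstrap, ClassBrowserAsset, ClassPublicMisc:
		return class
	default:
		return ClassPublicMisc
	}
}

// BucketNamespace derives the base bucket namespace from the normalized class.
func BucketNamespace(class string) string {
	return "public_rest/class=" + NormalizeClass(class)
}

func main() {
	for _, class := range []string{"browser_asset", "", "something_else"} {
		fmt.Printf("%q -> %s\n", class, BucketNamespace(class))
	}
}
```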

AuthServiceClient

Handles public auth commands and session-related updates exchanged with the Auth / Session Service. The gateway contract is:

SendEmailCode(email) -> challenge_id
ConfirmEmailCode(challenge_id, code, client_public_key, time_zone) -> device_session_id

When no concrete implementation is wired, the gateway keeps the public routes available and returns a stable 503 service_unavailable response instead of failing process startup.

DownstreamRouter

Resolves the target downstream service or adapter by the full exact-match message_type literal.

The default cmd/gateway wiring resolves the reserved user.* and lobby.* self-service message types through backendclient.UserRoutes and backendclient.LobbyRoutes. When GATEWAY_BACKEND_HTTP_URL is unset these routes stay mounted and fail closed as dependency-unavailable instead of falling through to a generic route miss.

DownstreamClient

Executes a verified authenticated command against a downstream internal service and returns response payload bytes plus a stable opaque result code. An empty or whitespace-only result code is treated as an internal downstream contract violation.

Downstream clients may be pure pass-through adapters or gateway-owned transcoding adapters. The backendclient adapter decodes authenticated FlatBuffers payloads, calls backend's /api/v1/user/* REST surface with X-User-Id, and re-encodes the JSON result into FlatBuffers before the signed gateway response is emitted.
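
The result-code rule can be enforced with a single check before the response is signed. The Result type and error name below are hypothetical, shown only to make the "empty or whitespace-only" condition concrete.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrDownstreamContract is a hypothetical internal error for a missing result code.
var ErrDownstreamContract = errors.New("downstream contract violation: empty result code")

// Result is what a downstream client returns for one verified command.
type Result struct {
	Payload []byte // response payload bytes
	Code    string // stable opaque result code
}

// CheckResult rejects empty or whitespace-only result codes.
func CheckResult(r Result) error {
	if strings.TrimSpace(r.Code) == "" {
		return ErrDownstreamContract
	}
	return nil
}

func main() {
	fmt.Println(CheckResult(Result{Code: "ok"}))
	fmt.Println(CheckResult(Result{Code: " \t"}))
}
```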

EventSubscriber

Subscribes to internal pub/sub topics used for:

session cache updates;
revocations;
client-facing event delivery.

The implementation consumes two Redis Streams with replica-safe plain XREAD: one strict full-session snapshot stream for the process-local session cache and one client-facing event stream for live push fan-out.

PushHub

Tracks active SubscribeEvents streams, binds them to authenticated identities, and delivers events to the correct connections. The implementation uses one bounded in-memory queue per stream with a default capacity of 64 events; overflowing one queue closes only that stream and leaves the remaining streams active.
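
The bounded-queue behavior can be sketched with buffered channels. This is a simplified illustration, not the gateway's implementation: the Hub, Stream, and Event types are hypothetical, and using an empty device session ID to mean "all streams for this user" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a placeholder for one client-facing event.
type Event struct{ Type string }

// Stream is one active subscription bound to a user and device session.
type Stream struct {
	UserID          string
	DeviceSessionID string
	queue           chan Event
}

// Events exposes the queue; the channel is closed when the hub drops the stream.
func (s *Stream) Events() <-chan Event { return s.queue }

// Hub tracks active streams, each with its own bounded queue.
type Hub struct {
	mu       sync.Mutex
	capacity int
	streams  map[*Stream]struct{}
}

func NewHub(capacity int) *Hub {
	return &Hub{capacity: capacity, streams: make(map[*Stream]struct{})}
}

func (h *Hub) Register(userID, deviceSessionID string) *Stream {
	s := &Stream{UserID: userID, DeviceSessionID: deviceSessionID, queue: make(chan Event, h.capacity)}
	h.mu.Lock()
	defer h.mu.Unlock()
	h.streams[s] = struct{}{}
	return s
}

// Publish targets every stream of userID, or only one device session when
// deviceSessionID is non-empty. A full queue closes only that stream.
func (h *Hub) Publish(userID, deviceSessionID string, ev Event) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for s := range h.streams {
		if s.UserID != userID || (deviceSessionID != "" && s.DeviceSessionID != deviceSessionID) {
			continue
		}
		select {
		case s.queue <- ev:
		default:
			close(s.queue) // overflow: drop this stream, keep the others
			delete(h.streams, s)
		}
	}
}

// Active returns the number of registered streams.
func (h *Hub) Active() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.streams)
}

func main() {
	hub := NewHub(64)
	slow := hub.Register("user-1", "session-a")
	hub.Register("user-1", "session-b")
	for i := 0; i < 65; i++ { // nobody drains session-a, so event 65 overflows
		hub.Publish("user-1", "session-a", Event{Type: "example.event"})
	}
	fmt.Println("queued on closed stream:", len(slow.Events()), "active streams:", hub.Active())
}
```

A non-blocking send keeps one slow consumer from stalling delivery to every other stream, which is the reason overflow closes the stream rather than applying backpressure.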

ResponseSigner

Signs unary responses and stream events so clients can verify server-originated messages. The implementation uses one Ed25519 signer loaded from GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH, which must reference a PKCS#8 PEM-encoded private key.

Clock

Provides current server time and supports consistent freshness-window checks.

Error Model and Observability

The gateway should expose stable edge-level error classes instead of leaking internal implementation details.

Minimum error categories:

malformed request;
request too large;
unsupported protocol;
unknown session;
revoked session;
invalid signature;
stale request;
replay detected;
rate limited;
policy denied;
downstream unavailable;
backend unavailable;
gateway shutting down;
internal error.

Observability requirements:

stable correlation identifiers, including request_id and optional trace_id;
structured logs;
security audit events for rejects and abuse signals;
metrics keyed by route class, message type, result code, and reject reason;
no logging of secrets, raw private material, or raw signatures.

The service uses:

go.uber.org/zap for structured JSON logs;
otelgin for the public REST listener;
otelgrpc for the authenticated gRPC listener;
OpenTelemetry metrics exported through Prometheus on the optional admin /metrics listener.

Current custom metric families:

gateway.public_http.requests
gateway.public_http.duration
gateway.authenticated_grpc.requests
gateway.authenticated_grpc.duration
gateway.push.active_streams
gateway.push.stream_closures
gateway.internal_event_drops

The process-wide log level is configured by GATEWAY_LOG_LEVEL and defaults to info. The default OpenTelemetry resource uses service.name=galaxy-edge-gateway when OTEL_SERVICE_NAME is unset. If OTEL_TRACES_EXPORTER is unset or set to none, the gateway keeps tracing runtime enabled but installs no external trace exporter. If OTEL_TRACES_EXPORTER=otlp, the gateway uses the standard OTEL_EXPORTER_OTLP_* environment variables to configure the OTLP trace exporter protocol and endpoint. The protocol selection specifically honors OTEL_EXPORTER_OTLP_TRACES_PROTOCOL first and falls back to OTEL_EXPORTER_OTLP_PROTOCOL when the trace-specific variable is unset. Supported values are http/protobuf and grpc; when both variables are unset, the gateway defaults to http/protobuf.
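
The protocol precedence described above can be sketched as a small resolver. The function name TraceProtocol is hypothetical, and returning an error for an unsupported value is an assumption; the text above only lists the supported values.

```go
package main

import (
	"fmt"
	"os"
)

// TraceProtocol resolves the OTLP trace exporter protocol:
// trace-specific variable first, then the generic one, then http/protobuf.
func TraceProtocol(getenv func(string) string) (string, error) {
	value := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if value == "" {
		value = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	if value == "" {
		value = "http/protobuf"
	}
	switch value {
	case "http/protobuf", "grpc":
		return value, nil
	default:
		// Assumption: unsupported values are rejected rather than ignored.
		return "", fmt.Errorf("unsupported OTLP trace protocol %q", value)
	}
}

func main() {
	protocol, err := TraceProtocol(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("OTLP trace protocol:", protocol)
}
```

Passing the lookup function as a parameter keeps the resolver testable without mutating the process environment.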

Structured logs intentionally omit:

public auth e-mail addresses, login codes, and challenge IDs;
client public keys;
raw payload bytes and payload hashes;
raw request or response signatures;
response-signer private key material and Redis credentials.

Malformed internal session and client-event stream entries are never dropped silently: the gateway logs each drop and increments gateway.internal_event_drops.

Non-Goals

The gateway is not a business authorization layer and must not grow into a domain coordinator.

The gateway must not:

implement business ownership checks;
validate domain state transitions;
replace the Auth / Session Service as the session source of truth;
degrade into a synchronous pass-through that reloads session state for every authenticated request.
