diff --git a/gateway/PLAN.md b/gateway/PLAN.md
index 2698a55..3731c97 100644
--- a/gateway/PLAN.md
+++ b/gateway/PLAN.md
@@ -1,9 +1,493 @@
-# Implementation plan for Edge Gateway service
+# Edge Gateway Implementation Plan

-## [x] First step
+## Summary

-Step description.
+This plan breaks implementation into small, reviewable phases.
+Each phase has a single primary goal, clear deliverables, explicit dependencies,
+acceptance criteria, and focused tests.

-## [ ] Second step
+The intended v1 architecture is:

-Step Description.
+- unauthenticated public ingress over REST/JSON;
+- authenticated ingress over gRPC on HTTP/2;
+- FlatBuffers payloads for authenticated business commands;
+- protobuf-based gRPC control envelopes;
+- authenticated server-streaming push through gRPC;
+- separate public traffic classes and isolated anti-abuse counters.
+
+## Assumptions and Defaults
+
+- `message_type` is the stable downstream routing key.
+- `protocol_version` covers transport and envelope compatibility, not business
+  payload schema compatibility.
+- FlatBuffers are used for business payload bytes only.
+- Browser bootstrap and asset traffic are within gateway scope, even when backed
+  by a pluggable proxy or handler.
+- Long-polling is out of scope for v1.
+
+## Phase 1. Module Skeleton
+
+Goal: create the runnable gateway process skeleton.
+
+Artifacts:
+
+- `cmd/gateway`
+- `internal/app`
+- base configuration types
+- startup and shutdown wiring
+
+Dependencies: none.
+
+Acceptance criteria:
+
+- the process starts with config;
+- the process shuts down cleanly on signal;
+- lifecycle wiring is testable.
+
+Targeted tests:
+
+- startup with valid config;
+- shutdown without leaked goroutines.
+
+## Phase 2. Public REST Server
+
+Goal: add the unauthenticated HTTP server shell.
+
+Artifacts:
+
+- public REST listener
+- `GET /healthz`
+- `GET /readyz`
+- base error serialization
+- request classification hook
+
+Dependencies: Phase 1.
+
+Acceptance criteria:
+
+- health endpoints respond deterministically;
+- public requests are classified at least into `public_auth` and `browser_*`.
+
+Targeted tests:
+
+- health endpoint responses;
+- request classification smoke tests.
+
+## Phase 3. Public Auth REST Handlers
+
+Goal: expose unauthenticated auth commands through REST/JSON.
+
+Artifacts:
+
+- `POST /api/v1/public/auth/send-email-code`
+- `POST /api/v1/public/auth/confirm-email-code`
+- request and response DTOs
+- adapter calls into `AuthServiceClient`
+
+Dependencies: Phase 2.
+
+Acceptance criteria:
+
+- no session authentication is required for these routes;
+- handlers delegate only through the auth service adapter.
+
+Targeted tests:
+
+- success and validation errors for both routes;
+- no session lookup on public auth paths.
+
+## Phase 4. Public Traffic Classification
+
+Goal: isolate public traffic into stable anti-abuse classes.
+
+Artifacts:
+
+- `PublicTrafficClassifier`
+- classes `public_auth`, `browser_bootstrap`, `browser_asset`, `public_misc`
+- isolated rate-limit bucket keys
+
+Dependencies: Phase 2.
+
+Acceptance criteria:
+
+- browser traffic does not share buckets with public auth;
+- auth counters remain unaffected by asset bursts.
+
+Targeted tests:
+
+- per-class routing tests;
+- bucket isolation tests.
+
+## Phase 5. Public REST Anti-Abuse
+
+Goal: add coarse protection to unauthenticated REST traffic.
+
+Artifacts:
+
+- body size limits
+- method allow-lists
+- malformed request counters
+- per-class rate-limit thresholds
+
+Dependencies: Phase 4.
+
+Acceptance criteria:
+
+- first-load browser bursts are not marked hostile because of burst pattern
+  alone;
+- malformed or oversized requests are rejected predictably.
+
+Targeted tests:
+
+- bootstrap burst stays outside auth abuse counters;
+- invalid methods and oversized bodies are rejected.
+
+## Phase 6. gRPC Server and Public Contracts
+
+Goal: bring up authenticated transport over gRPC and HTTP/2.
+
+Artifacts:
+
+- gRPC listener
+- protobuf service definitions
+- `ExecuteCommand`
+- `SubscribeEvents`
+
+Dependencies: Phase 1.
+
+Acceptance criteria:
+
+- unary and server-streaming RPCs are reachable;
+- the server runs only over HTTP/2.
+
+Targeted tests:
+
+- unary transport smoke test;
+- stream transport smoke test.
+
+## Phase 7. Envelope Parsing and Protocol Gate
+
+Goal: validate the gRPC control envelope before security checks continue.
+
+Artifacts:
+
+- envelope parser
+- required-field validation
+- protocol version gate
+
+Dependencies: Phase 6.
+
+Acceptance criteria:
+
+- unsupported or malformed envelopes are rejected before routing.
+
+Targeted tests:
+
+- missing field rejection;
+- unsupported `protocol_version` rejection.
+
+## Phase 8. Session Cache Lookup
+
+Goal: resolve authenticated identity from cache.
+
+Artifacts:
+
+- `SessionCache`
+- session lookup pipeline
+- revoked versus active session handling
+
+Dependencies: Phase 7.
+
+Acceptance criteria:
+
+- unknown and revoked sessions are blocked before signature verification.
+
+Targeted tests:
+
+- cache hit with active session;
+- cache miss reject;
+- revoked session reject.
+
+## Phase 9. Payload Hash and Signing Input
+
+Goal: verify payload integrity before signature verification.
+
+Artifacts:
+
+- `payload_hash` verification
+- canonical signing input builder
+
+Dependencies: Phase 8.
+
+Acceptance criteria:
+
+- changing payload bytes or envelope fields breaks the signing input.
+
+Targeted tests:
+
+- payload hash mismatch reject;
+- canonical bytes differ when signed fields change.
+
+## Phase 10. Client Signature Verification
+
+Goal: authenticate the request origin using the session public key.
+
+Artifacts:
+
+- signature verifier
+- deterministic auth reject mapping
+
+Dependencies: Phase 9.
+
+Acceptance criteria:
+
+- wrong key and invalid signature produce stable rejects.
+
+Targeted tests:
+
+- success case with valid signature;
+- bad signature reject;
+- wrong-key reject.
+
+## Phase 11. Freshness and Anti-Replay
+
+Goal: enforce transport freshness and replay protection.
+
+Artifacts:
+
+- timestamp freshness window
+- `ReplayStore`
+- replay reservation and rejection logic
+
+Dependencies: Phase 10.
+
+Acceptance criteria:
+
+- stale requests and duplicate `request_id` values are rejected.
+
+Targeted tests:
+
+- stale timestamp reject;
+- replay reject for same session and request ID;
+- distinct sessions do not collide.
+
+## Phase 12. Authenticated Rate Limits and Policy
+
+Goal: apply edge policy after transport authenticity is established.
+
+Artifacts:
+
+- rate-limit keys for IP, session, user, and message class
+- authenticated policy evaluation hook
+
+Dependencies: Phase 11.
+
+Acceptance criteria:
+
+- authenticated buckets are independent from public REST buckets.
+
+Targeted tests:
+
+- per-dimension throttling;
+- bucket isolation from public traffic.
+
+## Phase 13. Internal Authenticated Command and Routing
+
+Goal: forward only verified context to downstream services.
+
+Artifacts:
+
+- `AuthenticatedCommand`
+- `DownstreamRouter`
+- `DownstreamClient`
+
+Dependencies: Phase 12.
+
+Acceptance criteria:
+
+- downstream services receive verified context only;
+- raw transport details do not leak as authoritative input.
+
+Targeted tests:
+
+- route selection by `message_type`;
+- downstream receives the expected authenticated context.
+
+## Phase 14. Signed Unary Responses
+
+Goal: return verifiable server responses to authenticated clients.
+
+Artifacts:
+
+- response envelope builder
+- payload hash generation
+- `ResponseSigner`
+
+Dependencies: Phase 13.
+
+Acceptance criteria:
+
+- unary responses always carry the original `request_id`, `payload_hash`, and
+  server signature.
+
+Targeted tests:
+
+- response correlation test;
+- server signature generation test.
+
+## Phase 15. Session Update and Revocation Events
+
+Goal: keep gateway session state current without synchronous hot-path lookups.
+
+Artifacts:
+
+- `EventSubscriber`
+- session update handlers
+- session revoke handlers
+
+Dependencies: Phase 8.
+
+Acceptance criteria:
+
+- session updates change gateway behavior without per-request sync calls to the
+  auth service.
+
+Targeted tests:
+
+- cache update from event;
+- revocation event invalidates cached session.
+
+## Phase 16. Authenticated Push Stream
+
+Goal: open a verified server-streaming channel for client-facing delivery.
+
+Artifacts:
+
+- `SubscribeEvents` handler
+- stream binding to `user_id` and `device_session_id`
+- initial server time event
+
+Dependencies: Phase 15.
+
+Acceptance criteria:
+
+- the stream opens only after the full auth pipeline succeeds.
+
+Targeted tests:
+
+- authorized stream open;
+- rejected stream open for invalid session;
+- first event contains server time.
+
+## Phase 17. Event Fan-Out
+
+Goal: deliver client-facing events from internal pub/sub to active streams.
+
+Artifacts:
+
+- `PushHub`
+- event fan-out logic
+- user and session targeting rules
+
+Dependencies: Phase 16.
+
+Acceptance criteria:
+
+- events are delivered to the correct active streams only.
+
+Targeted tests:
+
+- single-session delivery;
+- multi-device delivery for one user;
+- unrelated sessions do not receive the event.
+
+## Phase 18. Revocation-Driven Stream Teardown
+
+Goal: terminate active delivery channels when a session is revoked.
+
+Artifacts:
+
+- stream teardown on revoke
+- connection cleanup logic
+
+Dependencies: Phase 17.
+
+Acceptance criteria:
+
+- revocation blocks new unary requests and closes active streams for the same
+  session.
+
+Targeted tests:
+
+- revoke closes active stream;
+- revoked session cannot reopen the stream.
+
+## Phase 19. Observability and Shutdown Hardening
+
+Goal: make the service operable in production.
+
+Artifacts:
+
+- structured logs
+- metrics
+- trace propagation
+- timeout budgets
+- graceful shutdown for unary and streaming traffic
+
+Dependencies: Phase 18.
+
+Acceptance criteria:
+
+- shutdown is deterministic;
+- logs and metrics expose stable edge outcomes without leaking secrets.
+
+Targeted tests:
+
+- shutdown closes listeners and active streams;
+- secret and signature values are not logged.
+
+## Phase 20. Acceptance Pass
+
+Goal: reconcile implementation, documentation, and regression coverage.
+
+Artifacts:
+
+- updated README and PLAN
+- final protocol and interface review
+- focused regression test run
+
+Dependencies: Phases 1 through 19.
+
+Acceptance criteria:
+
+- implementation matches documented contracts and ordering guarantees;
+- docs describe the actual gateway behavior.
+
+Targeted tests:
+
+- run focused package tests for gateway packages;
+- rerun cross-cutting regression scenarios.
+
+## Cross-Cutting Regression Scenarios
+
+- `send_email_code` and `confirm_email_code` are available without session auth
+  and are still limited by public auth policy.
+- Public browser bootstrap and asset bursts do not increase auth abuse counters
+  and are not rejected as hostile because of intensity alone.
+- Any gRPC command without a valid session is rejected before routing.
+- Unknown and revoked sessions are handled predictably and consistently where
+  policy requires identical behavior.
+- Signature verification fails when `payload_bytes`, `payload_hash`,
+  `message_type`, `request_id`, or the signing key changes.
+- `payload_hash` is verified before downstream execution.
+- Requests outside the freshness window are rejected.
+- Reused `request_id` values are rejected within the session replay window.
+- Public REST and authenticated gRPC traffic use independent buckets and
+  independent abuse telemetry.
+- Downstream services receive `AuthenticatedCommand`, not raw REST or gRPC
+  transport requests.
+- Unary responses preserve `request_id` correlation and are server-signed.
+- Streaming connections open only after the auth pipeline and close on revoke.
+- Session cache updates from events change gateway behavior without synchronous
+  auth-service lookups per request.
+- Graceful shutdown terminates unary and streaming traffic cleanly.
diff --git a/gateway/README.md b/gateway/README.md
index d8e1167..dddb9af 100644
--- a/gateway/README.md
+++ b/gateway/README.md
@@ -1 +1,414 @@
 # Edge Gateway
+
+## Purpose
+
+`Edge Gateway` is the only public ingress for Galaxy Plus clients.
+It terminates the external transport and security boundary, enforces edge
+policies, and routes verified requests to internal services.
+
+The gateway does not implement domain-specific business logic.
+Business validation, authorization, ownership checks, and state transitions
+remain inside downstream services.
+
+## Trust Boundary
+
+The gateway sits between untrusted external clients and trusted internal
+services.
+
+The gateway is responsible for:
+
+- parsing external transport requests;
+- classifying public REST traffic;
+- authenticating protected gRPC traffic;
+- loading session state from cache;
+- verifying request freshness and anti-replay constraints;
+- applying edge rate limits and anti-abuse policy;
+- building an authenticated internal command context;
+- routing verified commands to internal services;
+- maintaining authenticated push delivery connections.
+
+The gateway is not responsible for:
+
+- deciding whether a user is allowed to execute a business action;
+- validating domain invariants;
+- storing the source-of-truth session record;
+- implementing business idempotency.
+
+## Transport Matrix
+
+The gateway exposes two external transport classes.
+
+| Transport | Audience | Authentication | Payload format | Primary use |
+| --- | --- | --- | --- | --- |
+| REST/JSON | Public, unauthenticated traffic | No device session auth | JSON | Public auth commands, health checks, browser/bootstrap traffic |
+| gRPC over HTTP/2 | Authenticated clients only | Required | FlatBuffers payload inside protobuf control envelope | Verified commands and push delivery |
+
+### Public REST Surface
+
+The public REST surface is used for commands that must work before a device
+session exists and for browser-originated traffic that may share the same edge.
+
+Stable public endpoints:
+
+- `POST /api/v1/public/auth/send-email-code`
+- `POST /api/v1/public/auth/confirm-email-code`
+- `GET /healthz`
+- `GET /readyz`
+
+In addition to the fixed endpoints above, the gateway may front browser
+bootstrap or asset traffic through a pluggable public handler or proxy.
+That traffic belongs to dedicated public route classes and must not share rate
+limit buckets or abuse counters with the public auth API.
+
+### Authenticated gRPC Surface
+
+All authenticated client requests use HTTP/2 and gRPC.
+
+The public gRPC service exposes two methods:
+
+- `ExecuteCommand(ExecuteCommandRequest) returns (ExecuteCommandResponse)`
+- `SubscribeEvents(SubscribeEventsRequest) returns (stream GatewayEvent)`
+
+`ExecuteCommand` is a generic unary RPC.
+The gateway routes the request downstream by `message_type` after transport
+verification succeeds.
+
+`SubscribeEvents` is an authenticated server-streaming RPC.
+It binds the stream to `user_id` and `device_session_id` and starts by sending
+a service event that includes the current server time in milliseconds.
+
+## Envelope and Payload Model
+
+The authenticated transport uses a split contract:
+
+- gRPC control messages are protobuf-based;
+- business payload bytes are FlatBuffers;
+- signatures are computed over canonical envelope fields and a hash of raw
+  FlatBuffers bytes.
+
+The gateway treats `payload_bytes` as opaque business data.
+It verifies integrity and forwards verified bytes downstream without rewriting
+them.
+
+### ExecuteCommandRequest
+
+Required fields:
+
+- `protocol_version`
+- `device_session_id`
+- `message_type`
+- `timestamp_ms`
+- `request_id`
+- `payload_bytes`
+- `payload_hash`
+- `signature`
+
+Optional fields:
+
+- `trace_id`
+
+### ExecuteCommandResponse
+
+Required fields:
+
+- `protocol_version`
+- `request_id`
+- `timestamp_ms`
+- `result_code`
+- `payload_bytes`
+- `payload_hash`
+- `signature`
+
+### SubscribeEventsRequest
+
+The stream open request reuses the authenticated request model.
+It contains the same authentication fields as the unary request and either an
+empty payload or a minimal connect payload.
+
+Required fields:
+
+- `protocol_version`
+- `device_session_id`
+- `message_type`
+- `timestamp_ms`
+- `request_id`
+- `payload_hash`
+- `signature`
+
+Optional fields:
+
+- `payload_bytes`
+- `trace_id`
+
+### GatewayEvent
+
+Every stream event is a client-facing signed server message.
+
+Required fields:
+
+- `event_type`
+- `event_id`
+- `timestamp_ms`
+- `payload_bytes`
+- `payload_hash`
+- `signature`
+
+Optional fields:
+
+- `request_id`
+- `trace_id`
+
+## Verification and Routing Pipeline
+
+The gateway applies the same strict verification order for authenticated gRPC
+ingress.
+
+1. Parse the control envelope and validate required fields.
+2. Check whether `protocol_version` is supported.
+3. Resolve `device_session_id` through `SessionCache`.
+4. Reject unknown or revoked sessions.
+5. Verify that `payload_hash` matches raw `payload_bytes`.
+6. Verify the client signature using the public key from session cache.
+7. Verify that `timestamp_ms` is inside the accepted freshness window.
+8. Verify anti-replay by checking `device_session_id + request_id`.
+9. Apply authenticated rate limit and edge policy checks.
+10. Build the authenticated internal command context.
+11. Route the command downstream by `message_type`.
+
+No downstream business service should receive a request that has not passed
+this full verification pipeline.
+
+## Internal Authenticated Contract
+
+Downstream services should receive an internal authenticated command rather than
+raw external gRPC transport data.
+
+The minimum authenticated context is:
+
+- `user_id`
+- `device_session_id`
+- `message_type`
+- verified `payload_bytes`
+- `request_id`
+- optional `trace_id`
+- optional client metadata needed for logs and tracing
+
+Downstream services may trust that the gateway has already performed transport
+authentication, freshness verification, and anti-replay checks.
+They must still perform business authorization and domain validation.
+
+## Session Model
+
+The Auth / Session Service is the source of truth for device session state.
+The gateway is designed to authenticate the hot path from cache.
+
+Expected session fields available to the gateway:
+
+- `device_session_id`
+- `user_id`
+- client public key
+- session status
+- revoke metadata
+- optional client metadata
+
+### Session Cache
+
+`SessionCache` provides the fast path for:
+
+- session existence checks;
+- `device_session_id -> user_id`;
+- access to the client public key used for signature verification;
+- revoked versus active status checks.
+
+Cache updates are event-driven.
+TTL is allowed only as a safety net and must not replace invalidation events.
+
+### Revocation Behavior
+
+When a device session is revoked:
+
+1. the Auth / Session Service updates the source of truth;
+2. it publishes a session update or revoke event;
+3. the gateway invalidates or updates `SessionCache`;
+4. new unary gRPC requests for that session are rejected;
+5. active `SubscribeEvents` streams for that session are closed.
+
+## Public Anti-Abuse Model
+
+The public REST layer must distinguish between public auth operations and
+browser-originated traffic that may burst during a normal first page load.
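
One way to picture that distinction is a small path-based classifier whose output feeds per-class rate-limit bucket keys. The sketch below uses the class names from PLAN.md Phase 4 (`public_auth`, `browser_bootstrap`, `browser_asset`, `public_misc`) and the documented `/api/v1/public/auth/` prefix. The `/assets/` prefix, the `/` and `/index.html` bootstrap paths, the fallback of health checks into `public_misc`, and the `class:ip` key format are assumptions, not documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// RouteClass names one isolated public traffic class.
type RouteClass string

const (
	PublicAuth       RouteClass = "public_auth"
	BrowserBootstrap RouteClass = "browser_bootstrap"
	BrowserAsset     RouteClass = "browser_asset"
	PublicMisc       RouteClass = "public_misc"
)

// Classify maps a request path to a route class.
// Only the public auth prefix comes from the plan; the rest is hypothetical.
func Classify(path string) RouteClass {
	switch {
	case strings.HasPrefix(path, "/api/v1/public/auth/"):
		return PublicAuth
	case strings.HasPrefix(path, "/assets/"):
		return BrowserAsset
	case path == "/" || path == "/index.html":
		return BrowserBootstrap
	default:
		return PublicMisc
	}
}

// BucketKey scopes a rate-limit bucket to one class and one client IP,
// so counters of different classes can never share a bucket.
func BucketKey(class RouteClass, clientIP string) string {
	return string(class) + ":" + clientIP
}

func main() {
	paths := []string{
		"/api/v1/public/auth/send-email-code",
		"/assets/app.js",
		"/",
		"/healthz",
	}
	for _, p := range paths {
		fmt.Println(p, "->", BucketKey(Classify(p), "203.0.113.7"))
	}
}
```

Because the class is part of the key, an asset burst from one address increments only the `browser_asset` bucket for that address and leaves the `public_auth` counters untouched.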
+
+The gateway uses these public route classes:
+
+- `public_auth`
+- `browser_bootstrap`
+- `browser_asset`
+- `public_misc`
+
+### Public Auth
+
+`public_auth` includes `send-email-code` and `confirm-email-code`.
+This class uses stricter limits and abuse scoring because it directly touches
+account and session creation flows.
+
+Controls include:
+
+- per-IP and per-identity rate limits;
+- request body size limits;
+- method allow-lists;
+- malformed request counters;
+- elevated logging and security telemetry for repeated failures.
+
+### Browser Bootstrap and Asset Traffic
+
+`browser_bootstrap` and `browser_asset` use separate coarse-grained budgets.
+They may exhibit bursty behavior during the first load and therefore must not
+be treated as hostile based on burst pattern alone.
+
+This traffic is still constrained by:
+
+- dedicated rate limits;
+- method allow-lists;
+- body size limits where request bodies are expected;
+- protocol and path validation;
+- independent abuse telemetry.
+
+The gateway must not merge these buckets or counters with `public_auth`.
+
+## Push Delivery Model
+
+The v1 push channel is a gRPC server stream.
+Long-polling is intentionally out of scope for the first version.
+
+Expected stream behavior:
+
+1. the client opens `SubscribeEvents`;
+2. the gateway applies the full authenticated ingress verification pipeline;
+3. the stream is bound to `user_id` and `device_session_id`;
+4. the first service event includes `server_time_ms`;
+5. client-facing events from internal pub/sub are fanned out to matching active
+   streams;
+6. revoke events close affected streams.
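
The stream behavior listed above can be sketched with an in-memory hub that binds one channel per device session, queues a server-time event first, fans events out by user, and closes the channel on revoke. This is only an illustration in Go: the `Event` shape, the method names, the one-stream-per-session rule, the buffer size of 16, and the drop-when-full policy are all assumptions, and the gRPC stream itself is replaced by a plain channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified stand-in for a signed GatewayEvent.
type Event struct {
	Type         string
	ServerTimeMS int64
}

type stream struct {
	userID string
	ch     chan Event
}

// PushHub tracks active streams keyed by device_session_id.
type PushHub struct {
	mu      sync.Mutex
	streams map[string]*stream
}

func NewPushHub() *PushHub {
	return &PushHub{streams: make(map[string]*stream)}
}

// Bind registers a stream for an already authenticated session and queues
// the initial server-time event. An existing stream for the same session
// is closed and replaced (assumed policy).
func (h *PushHub) Bind(userID, sessionID string, nowMS int64) <-chan Event {
	h.mu.Lock()
	defer h.mu.Unlock()
	if old, ok := h.streams[sessionID]; ok {
		close(old.ch)
	}
	s := &stream{userID: userID, ch: make(chan Event, 16)}
	s.ch <- Event{Type: "server_time", ServerTimeMS: nowMS}
	h.streams[sessionID] = s
	return s.ch
}

// PublishToUser delivers ev to every active stream of the user and returns
// how many streams accepted it. A full buffer drops the event (assumed policy).
func (h *PushHub) PublishToUser(userID string, ev Event) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for _, s := range h.streams {
		if s.userID != userID {
			continue
		}
		select {
		case s.ch <- ev:
			delivered++
		default:
		}
	}
	return delivered
}

// Revoke closes and removes the stream bound to a revoked session.
func (h *PushHub) Revoke(sessionID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if s, ok := h.streams[sessionID]; ok {
		close(s.ch)
		delete(h.streams, sessionID)
	}
}

func main() {
	hub := NewPushHub()
	phone := hub.Bind("user-1", "session-phone", 1700000000000)
	hub.Bind("user-1", "session-laptop", 1700000000000)
	fmt.Println("first event:", (<-phone).Type)
	fmt.Println("delivered:", hub.PublishToUser("user-1", Event{Type: "example"}))
	hub.Revoke("session-phone")
	for ev := range phone { // drains queued events, then ends when closed
		fmt.Println("drained:", ev.Type)
	}
	fmt.Println("delivered after revoke:", hub.PublishToUser("user-1", Event{Type: "example"}))
}
```

Sending and closing both happen while holding the mutex, which avoids a send on a closed channel when a revoke races with fan-out.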
+
+## Recommended Package Layout
+
+The initial package layout should keep transport, policy, and downstream
+adapters separate:
+
+- `cmd/gateway`
+- `internal/app`
+- `internal/config`
+- `internal/restapi`
+- `internal/grpcapi`
+- `internal/authn`
+- `internal/session`
+- `internal/replay`
+- `internal/ratelimit`
+- `internal/downstream`
+- `internal/push`
+- `internal/events`
+- `internal/clock`
+
+## Key Interfaces
+
+The gateway should be built around explicit consumer-side interfaces.
+
+### SessionCache
+
+Provides cached session lookup by `device_session_id`.
+Returns enough data to verify signatures and identify the authenticated user.
+
+### ReplayStore
+
+Tracks recently seen `request_id` values per device session and rejects replayed
+requests inside the accepted freshness window.
+
+### RateLimiter
+
+Applies independent policies for:
+
+- public REST route classes;
+- authenticated gRPC requests by IP;
+- authenticated gRPC requests by session;
+- authenticated gRPC requests by user;
+- authenticated gRPC requests by message class.
+
+### PublicTrafficClassifier
+
+Maps incoming public REST requests to one of the public route classes so that
+limits and anti-abuse counters remain isolated.
+
+### AuthServiceClient
+
+Handles public auth commands and session-related updates exchanged with the
+Auth / Session Service.
+
+### DownstreamRouter
+
+Resolves the target downstream service or adapter by `message_type`.
+
+### DownstreamClient
+
+Executes a verified authenticated command against a downstream internal service
+and returns response payload bytes plus a stable result code.
+
+### EventSubscriber
+
+Subscribes to internal pub/sub topics used for:
+
+- session cache updates;
+- revocations;
+- client-facing event delivery.
+
+### PushHub
+
+Tracks active `SubscribeEvents` streams, binds them to authenticated identities,
+and delivers events to the correct connections.
+
+### ResponseSigner
+
+Signs unary responses and stream events so clients can verify server-originated
+messages.
+
+### Clock
+
+Provides current server time and supports consistent freshness-window checks.
+
+## Error Model and Observability
+
+The gateway should expose stable edge-level error classes instead of leaking
+internal implementation details.
+
+Minimum error categories:
+
+- malformed request;
+- unsupported protocol;
+- unknown session;
+- revoked session;
+- invalid signature;
+- stale request;
+- replay detected;
+- rate limited;
+- downstream unavailable;
+- internal error.
+
+Observability requirements:
+
+- stable correlation identifiers, including `request_id` and optional `trace_id`;
+- structured logs;
+- security audit events for rejects and abuse signals;
+- metrics keyed by route class, message type, result code, and reject reason;
+- no logging of secrets, raw private material, or raw signatures.
+
+## Non-Goals
+
+The gateway is not a business authorization layer and must not grow into a
+domain coordinator.
+
+The gateway must not:
+
+- implement business ownership checks;
+- validate domain state transitions;
+- replace the Auth / Session Service as the session source of truth;
+- degrade into a synchronous pass-through that reloads session state for every
+  authenticated request.