feat: edge gateway service
@@ -0,0 +1,20 @@
# Edge Gateway Docs

This directory keeps service-local documentation that is too detailed for the
root architecture documents and too diagram-heavy for the module README.

Sections:

- [Runtime and components](runtime.md)
- [Public auth, command, and push flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
- [Example `.env`](../.env.example)

Primary references:

- [`../README.md`](../README.md) for service scope, contracts, configuration,
  and operational behavior
- [`../openapi.yaml`](../openapi.yaml) for the public REST contract
- [`../../README.md`](../../README.md) for workspace-level architecture
- [`../../SECURITY.md`](../../SECURITY.md) for the transport security model
@@ -0,0 +1,179 @@
# Configuration And Contract Examples

The examples below are illustrative. Values such as signatures, payload hashes,
and FlatBuffers payload bytes are placeholders unless explicitly stated
otherwise.

## Example `.env`

The repository also includes a ready-to-copy sample file:

- [`../.env.example`](../.env.example)

The sample keeps all secrets blank and shows only the settings needed to boot
the process and expose the main listeners.

## Public Auth HTTP Examples

Start an e-mail challenge:

```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/send-email-code \
  -H 'Content-Type: application/json' \
  -d '{"email":"pilot@example.com"}'
```

Example response:

```json
{
  "challenge_id": "challenge-123"
}
```

Confirm the challenge and register the device public key:

```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/confirm-email-code \
  -H 'Content-Type: application/json' \
  -d '{
    "challenge_id": "challenge-123",
    "code": "123456",
    "client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no="
  }'
```

Example response:

```json
{
  "device_session_id": "device-session-123"
}
```

## Authenticated gRPC Envelope Examples

The authenticated transport is gRPC/protobuf, not JSON over HTTP. The examples
below use protobuf-style JSON only to make the logical envelope readable.
`bytes` fields are shown as base64 strings and 64-bit integers such as
`timestampMs` as decimal strings, matching the standard protobuf JSON mapping.

Example `ExecuteCommandRequest`:

```json
{
  "protocolVersion": "v1",
  "deviceSessionId": "device-session-123",
  "messageType": "fleet.move",
  "timestampMs": "1775121600000",
  "requestId": "request-123",
  "payloadBytes": "RkxBVEJVRkZFUlNfUEFZTE9BRA==",
  "payloadHash": "5fY6Q8V9mK8x2B7v6v0V0m0i1rQ2QF0rQ8V1Yt1r8Ys=",
  "signature": "3o4v8f3h0Y6I0x1bS7zY+8m0bV1Lk4D3yq8J2n8F1rD7yK9v8M1Q0w2s4a6f8d0Q0m3L6y8R1t5w7x9z0a2cA==",
  "traceId": "trace-123"
}
```
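
To make the base64 mapping concrete, the following Python sketch decodes the placeholder `payloadBytes` from the request above and recomputes a digest over it. It assumes `payloadHash` is a SHA-256 digest of the raw payload bytes; the 32-byte length of the example hash is consistent with that, but this document does not name the algorithm. The helper `payload_hash_b64` is hypothetical.

```python
import base64
import hashlib


def payload_hash_b64(payload: bytes) -> str:
    """Return the base64 SHA-256 digest of payload (assumed hash algorithm)."""
    return base64.b64encode(hashlib.sha256(payload).digest()).decode("ascii")


# Placeholder values copied from the ExecuteCommandRequest example.
payload = base64.b64decode("RkxBVEJVRkZFUlNfUEFZTE9BRA==", validate=True)
timestamp_ms = int("1775121600000")  # decimal string in JSON, integer on the wire

print(payload)                    # the placeholder is plain ASCII, not a real FlatBuffer
print(payload_hash_b64(payload))  # the example payloadHash is a placeholder, so no match is expected
print(timestamp_ms)
```

Decoding with `validate=True` rejects characters outside the base64 alphabet instead of silently skipping them, which is usually the safer choice for signed fields.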

Example `ExecuteCommandResponse`:

```json
{
  "protocolVersion": "v1",
  "requestId": "request-123",
  "timestampMs": "1775121600123",
  "resultCode": "ok",
  "payloadBytes": "RkxBVEJVRkZFUlNfUkVTUE9OU0U=",
  "payloadHash": "wL4n8H1aR2x3M4b5C6d7E8f9G0h1J2k3L4m5N6o7P8Q=",
  "signature": "2Xb7l9m0n1p2q3r4s5t6u7v8w9x0y1z2A3B4C5D6E7F8G9H0J1K2L3M4N5O6P7Q8R9S0T1U2V3W4X5Y6Z7a8b9cQ=="
}
```

Example bootstrap `GatewayEvent` sent after `SubscribeEvents` opens:

```json
{
  "eventType": "gateway.server_time",
  "eventId": "request-123",
  "timestampMs": "1775121600456",
  "payloadBytes": "RkxBVEJVRkZFUlNfU0VSVkVSX1RJTUU=",
  "payloadHash": "2b1U3m4N5p6Q7r8S9t0U1v2W3x4Y5z6A7b8C9d0E1f2=",
  "signature": "4Nf8k2p6s0w4y8A2d6g0j4m8p2t6w0z4C8F2I6L0O4R8U2X6a0d4g8j2m6p0s4v8yA2d6g0j4m8p2t6w0z4C8F2I6A==",
  "requestId": "request-123",
  "traceId": "trace-123"
}
```

## Redis Examples

### Session Cache Record

Example Redis key and JSON value used by the fallback session cache:

```text
gateway:session:device-session-123
```

```json
{
  "device_session_id": "device-session-123",
  "user_id": "user-123",
  "client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no=",
  "status": "active"
}
```

### Session Event Stream Entry

Example session snapshot entry:

```bash
redis-cli XADD gateway:session-events '*' \
  device_session_id device-session-123 \
  user_id user-123 \
  client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
  status active
```

Revocation entry:

```bash
redis-cli XADD gateway:session-events '*' \
  device_session_id device-session-123 \
  user_id user-123 \
  client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
  status revoked \
  revoked_at_ms 1775121700000
```

### Client Event Stream Entry

User-wide event:

```bash
redis-cli XADD gateway:client-events '*' \
  user_id user-123 \
  event_type fleet.updated \
  event_id event-123 \
  payload_bytes payload-v1
```

Session-targeted event with correlation:

```bash
redis-cli XADD gateway:client-events '*' \
  user_id user-123 \
  device_session_id device-session-123 \
  event_type fleet.updated \
  event_id event-124 \
  payload_bytes payload-v2 \
  request_id request-123 \
  trace_id trace-123
```

Notes:

- `payload_bytes` in Redis Stream entries must be binary-safe payload data;
- the gateway derives `timestamp_ms`, recomputes `payload_hash`, and signs the
  outgoing event at delivery time;
- each gateway replica consumes streams with plain `XREAD`, so publishers must
  keep retention bounded with `MAXLEN`.
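
The delivery-time behavior described in the notes can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's implementation: the function name `build_outgoing_event`, the injected `sign` callable, the SHA-256 hash, and the choice to sign the digest are all hypothetical.

```python
import base64
import hashlib
import time
from typing import Callable, Dict, Optional


def build_outgoing_event(
    entry: Dict[str, bytes],
    sign: Callable[[bytes], bytes],
    now_ms: Optional[int] = None,
) -> Dict[str, str]:
    """Turn one client-event stream entry into a GatewayEvent-like dict."""
    payload = entry["payload_bytes"]            # raw bytes, never re-encoded as text
    digest = hashlib.sha256(payload).digest()   # assumed hash algorithm
    ts = now_ms if now_ms is not None else int(time.time() * 1000)
    event = {
        "eventType": entry["event_type"].decode(),
        "eventId": entry["event_id"].decode(),
        "timestampMs": str(ts),                 # derived at delivery, not by the publisher
        "payloadBytes": base64.b64encode(payload).decode(),
        "payloadHash": base64.b64encode(digest).decode(),
        "signature": base64.b64encode(sign(digest)).decode(),  # signing input is an assumption
    }
    # Correlation fields are optional in the stream entry.
    for src, dst in (("request_id", "requestId"), ("trace_id", "traceId")):
        if src in entry:
            event[dst] = entry[src].decode()
    return event
```

The routing fields `user_id` and `device_session_id` are deliberately left out of the result, since the `GatewayEvent` example above does not carry them.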
@@ -0,0 +1,86 @@
# Request and Push Flows

## Public Auth Flow

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Limiter as Public anti-abuse
    participant Auth as AuthServiceClient

    Client->>Gateway: POST /api/v1/public/auth/send-email-code
    Gateway->>Limiter: classify + rate-limit + body checks
    Limiter-->>Gateway: allowed
    Gateway->>Auth: SendEmailCode(email)
    Auth-->>Gateway: challenge_id
    Gateway-->>Client: 200 {challenge_id}

    Client->>Gateway: POST /api/v1/public/auth/confirm-email-code
    Gateway->>Limiter: classify + rate-limit + body checks
    Limiter-->>Gateway: allowed
    Gateway->>Auth: ConfirmEmailCode(challenge_id, code, client_public_key)
    Auth-->>Gateway: device_session_id
    Gateway-->>Client: 200 {device_session_id}
```

## Authenticated ExecuteCommand Flow

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Cache as SessionCache
    participant Replay as ReplayStore
    participant Policy as Rate limit / policy
    participant Downstream

    Client->>Gateway: ExecuteCommand(envelope, payload_bytes, signature)
    Gateway->>Gateway: validate envelope + protocol_version
    Gateway->>Cache: lookup(device_session_id)
    Cache-->>Gateway: session record
    Gateway->>Gateway: verify payload_hash
    Gateway->>Gateway: verify Ed25519 signature
    Gateway->>Gateway: verify freshness window
    Gateway->>Replay: reserve(device_session_id, request_id, ttl)
    Replay-->>Gateway: accepted
    Gateway->>Policy: apply IP/session/user/message_type budgets
    Policy-->>Gateway: allowed
    Gateway->>Downstream: verified authenticated command
    Downstream-->>Gateway: result_code + payload_bytes
    Gateway->>Gateway: hash payload + sign response
    Gateway-->>Client: ExecuteCommandResponse + signature
```
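
The ordering of the gateway-side checks in this diagram can be sketched in Python. Everything here is a simplified stand-in: the `Request` dataclass, `InMemoryReplayStore`, the rejection strings, the 30-second freshness window, and the SHA-256 payload hash are assumptions, and the Ed25519 check is passed in as a callable rather than implemented.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

FRESHNESS_WINDOW_MS = 30_000             # hypothetical; the real window comes from configuration
REPLAY_TTL_MS = 2 * FRESHNESS_WINDOW_MS  # hypothetical; long enough to outlive the window


@dataclass
class Request:
    device_session_id: str
    request_id: str
    timestamp_ms: int
    payload_bytes: bytes
    payload_hash: bytes
    signature: bytes


class InMemoryReplayStore:
    """Stand-in for the Redis-backed replay store in the diagram."""

    def __init__(self) -> None:
        self._expires_at: Dict[Tuple[str, str], int] = {}

    def reserve(self, session_id: str, request_id: str, ttl_ms: int, now_ms: int) -> bool:
        key = (session_id, request_id)
        if self._expires_at.get(key, 0) > now_ms:
            return False  # still reserved: this request_id was already seen
        self._expires_at[key] = now_ms + ttl_ms
        return True


def verify_request(
    req: Request,
    session_status: str,
    verify_signature: Callable[[Request], bool],
    replay: InMemoryReplayStore,
    now_ms: int,
) -> str:
    """Run the checks in diagram order and return "ok" or a rejection reason."""
    if session_status != "active":
        return "session_not_active"
    digest = hashlib.sha256(req.payload_bytes).digest()  # assumed hash algorithm
    if not hmac.compare_digest(digest, req.payload_hash):
        return "payload_hash_mismatch"
    if not verify_signature(req):  # Ed25519 check, injected by the caller
        return "bad_signature"
    if abs(now_ms - req.timestamp_ms) > FRESHNESS_WINDOW_MS:
        return "stale_request"
    if not replay.reserve(req.device_session_id, req.request_id, REPLAY_TTL_MS, now_ms):
        return "replayed_request"
    return "ok"
```

Reserving the `request_id` only after the hash, signature, and freshness checks pass means an unauthenticated caller cannot fill the replay store with junk keys, which is one plausible reason for the order shown in the diagram.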

## SubscribeEvents Lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Cache as SessionCache
    participant Replay as ReplayStore
    participant Hub as PushHub
    participant Stream as Client event stream
    participant Sess as Session event stream

    Client->>Gateway: SubscribeEvents(envelope, signature)
    Gateway->>Gateway: validate envelope + verify request
    Gateway->>Cache: lookup(device_session_id)
    Cache-->>Gateway: session record
    Gateway->>Replay: reserve(device_session_id, request_id, ttl)
    Replay-->>Gateway: accepted
    Gateway->>Client: gateway.server_time event
    Gateway->>Hub: register(user_id, device_session_id)

    Stream-->>Gateway: client-facing event for user_id / device_session_id
    Gateway->>Hub: publish signed event
    Hub-->>Client: matching event delivery

    Sess-->>Gateway: revoked session snapshot
    Gateway->>Hub: revoke(device_session_id)
    Hub-->>Client: stream closes with FAILED_PRECONDITION

    Note over Gateway,Hub: During shutdown the gateway closes PushHub before gRPC graceful stop.
    Hub-->>Client: stream closes with UNAVAILABLE
```
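
The hub behavior in this lifecycle (registration, user-wide versus session-targeted delivery, revoke, and shutdown) can be sketched as a small in-memory model. The class shapes, the one-subscription-per-session rule, and the queue capacity are assumptions for illustration; the status codes and messages are the ones listed in the operator runbook.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class Subscription:
    def __init__(self, user_id: str, device_session_id: str, capacity: int) -> None:
        self.user_id = user_id
        self.device_session_id = device_session_id
        self.capacity = capacity
        self.queue: Deque[dict] = deque()
        self.closed_with: Optional[Tuple[str, str]] = None  # (gRPC status, message)

    def close(self, status: str, message: str) -> None:
        if self.closed_with is None:  # the first terminal status wins
            self.closed_with = (status, message)


class PushHub:
    def __init__(self, capacity: int = 64) -> None:  # capacity is a hypothetical default
        self._subs: Dict[str, Subscription] = {}
        self._capacity = capacity

    def register(self, user_id: str, device_session_id: str) -> Subscription:
        sub = Subscription(user_id, device_session_id, self._capacity)
        self._subs[device_session_id] = sub  # assumption: one stream per device session
        return sub

    def publish(self, event: dict, user_id: str, device_session_id: Optional[str] = None) -> None:
        for sub in list(self._subs.values()):
            if sub.closed_with is not None or sub.user_id != user_id:
                continue
            if device_session_id is not None and sub.device_session_id != device_session_id:
                continue  # session-targeted event for a different session
            if len(sub.queue) >= sub.capacity:
                sub.close("RESOURCE_EXHAUSTED", "push stream overflowed")
                continue  # overflow is local to this one stream
            sub.queue.append(event)

    def revoke(self, device_session_id: str) -> None:
        sub = self._subs.pop(device_session_id, None)
        if sub is not None:
            sub.close("FAILED_PRECONDITION", "device session is revoked")

    def close(self) -> None:
        for sub in self._subs.values():
            sub.close("UNAVAILABLE", "gateway is shutting down")
        self._subs.clear()
```

A bounded per-subscription queue keeps one slow consumer from holding memory for the whole process, at the cost of closing that consumer's stream when it falls behind.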
@@ -0,0 +1,143 @@
# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.

## Startup Checks

Before starting the process, confirm:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR` points to the Redis deployment used for
  session lookup and both internal event streams.
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and
  `GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys or
  the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8
  PEM-encoded Ed25519 private key.
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
  environment.

At startup the process performs bounded `PING` checks for:

- the Redis-backed session cache adapter;
- the replay store;
- the session event subscriber;
- the client event subscriber.

Startup fails fast if any of those checks fail or if the signer key cannot be
loaded.

Expected listener state after a healthy start:

- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
  `GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.

Known startup caveats:

- public auth routes stay mounted without an upstream adapter and return
  `503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
  returns gRPC `UNIMPLEMENTED` until downstream routes are injected.

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
  traffic;
- `GET /metrics` is available only on the optional admin listener.

`/readyz` is process-local. It does not confirm:

- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.

For a practical readiness check in production:

1. confirm the process emitted startup logs for the public and authenticated
   listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.
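
Steps 2 to 4 of this checklist can be scripted. The Python sketch below uses only the standard library and assumes that a healthy probe answers with HTTP 200, which this runbook does not state explicitly; the function names and the base URLs in the usage note are hypothetical. Steps 1 and 5 (logs and configuration) still need a manual or separate check.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status for url, or None when the listener is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the listener answered, but with an error status
    except OSError:      # includes URLError, timeouts, and refused connections
        return None


def probe_gateway(
    public_base: str,
    admin_base: Optional[str] = None,
    fetch: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, bool]:
    """Check /healthz and /readyz, plus /metrics when an admin listener is configured."""
    results = {
        "healthz": fetch(f"{public_base}/healthz") == 200,
        "readyz": fetch(f"{public_base}/readyz") == 200,
    }
    if admin_base:
        results["metrics"] = fetch(f"{admin_base}/metrics") == 200
    return results
```

For example, `probe_gateway("http://127.0.0.1:8080")` covers a local process on the default public address. A passing result still says nothing about Redis, the auth upstream, or downstream services, for the reasons listed above.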

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and
  message `gateway is shutting down`.

During planned restarts:

1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams
   remain open unexpectedly.

## Revoke And Push Failure Triage

### Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

1. verify that the auth/session side published a session snapshot with the
   same `device_session_id` and `status=revoked`;
2. verify that the event was written to
   `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.

Expected gateway behavior after the revoke snapshot is consumed:

- new authenticated requests for that `device_session_id` fail with gRPC
  `FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
  with the same status.
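
Steps 4 and 5 of the triage list are easier to reason about with a model of how snapshots are applied. The Python sketch below assumes last-write-wins semantics per `device_session_id` and assumes that entries missing a required field are dropped; both are inferred from the checklist rather than confirmed, and the class and field tuple are hypothetical.

```python
from typing import Dict, Optional

# Fields present in the session snapshot examples; treated here as mandatory (assumption).
REQUIRED_FIELDS = ("device_session_id", "user_id", "client_public_key", "status")


class SessionSnapshotCache:
    def __init__(self) -> None:
        self._by_session: Dict[str, Dict[str, str]] = {}

    def apply(self, entry: Dict[str, str]) -> bool:
        """Apply one stream entry; return False if it was dropped as malformed."""
        if any(not entry.get(field) for field in REQUIRED_FIELDS):
            return False
        if entry["status"] not in ("active", "revoked"):
            return False
        self._by_session[entry["device_session_id"]] = dict(entry)  # last write wins
        return True

    def status(self, device_session_id: str) -> Optional[str]:
        record = self._by_session.get(device_session_id)
        return record["status"] if record else None
```

Under these assumptions, a revoke snapshot with a missing `client_public_key` never takes effect, and an `active` snapshot published after the revoke silently re-enables the session, which are the two failure modes the checklist asks operators to rule out.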

### Push Events Are Not Delivered

If a client reports missing push events:

1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap
   event;
3. confirm the gateway consumed the expected entry from
   `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match
   the intended target;
5. confirm the event payload fields are well-formed and not dropped as
   malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or
   overflow.

### Stream Closed Unexpectedly

Use the terminal gRPC status first:

- `FAILED_PRECONDITION` with `device session is revoked` means the session was
  revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client was not
  consuming that stream fast enough and its in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
  client or network side.

For overflow incidents:

- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
@@ -0,0 +1,59 @@
# Runtime and Components

The diagram below focuses on the deployed `galaxy/gateway` process and its
runtime dependencies.

```mermaid
flowchart LR
    subgraph Clients
        Public["Public REST clients"]
        Authd["Authenticated gRPC clients"]
    end

    subgraph Gateway["Edge Gateway process"]
        PublicHTTP["Public HTTP listener\n/healthz /readyz /api/v1/public/auth/*"]
        AuthGRPC["Authenticated gRPC listener\nExecuteCommand / SubscribeEvents"]
        AdminHTTP["Optional admin HTTP listener\n/metrics"]
        SessionSnap["In-memory session snapshot cache"]
        Replay["Replay reservation client"]
        PushHub["PushHub"]
        SessSub["Session event subscriber"]
        ClientSub["Client event subscriber"]
        Telemetry["Logs, traces, metrics"]
    end

    Public --> PublicHTTP
    Authd --> AuthGRPC
    AuthGRPC --> SessionSnap
    AuthGRPC --> Replay
    AuthGRPC --> PushHub
    SessSub --> SessionSnap
    SessSub --> PushHub
    ClientSub --> PushHub
    PublicHTTP --> Telemetry
    AuthGRPC --> Telemetry
    AdminHTTP --> Telemetry

    Redis["Redis\nsession records + replay keys + streams"]
    AuthSvc["Auth / Session Service"]
    Downstream["Downstream business services"]
    Metrics["Prometheus / OTLP collectors"]

    PublicHTTP -. public auth adapter .-> AuthSvc
    SessionSnap --> Redis
    Replay --> Redis
    SessSub --> Redis
    ClientSub --> Redis
    AuthGRPC --> Downstream
    Telemetry --> Metrics
```

Notes:

- `cmd/gateway` refuses startup when Redis connectivity or the response signer
  is misconfigured.
- The admin listener is optional and serves only Prometheus text metrics.
- Public auth routing stays available without an upstream adapter, but returns
  `503 service_unavailable`.
- Authenticated gRPC starts with an empty static router; `ExecuteCommand`
  remains `UNIMPLEMENTED` until downstream routes are injected.