feat: edge gateway service

Ilia Denisov
2026-04-02 19:18:42 +02:00
committed by GitHub
parent 8cde99936c
commit 436c97a38b
95 changed files with 20504 additions and 57 deletions
@@ -0,0 +1,20 @@
# Edge Gateway Docs
This directory keeps service-local documentation that is too detailed for the
root architecture documents and too diagram-heavy for the module README.
Sections:
- [Runtime and components](runtime.md)
- [Public auth, command, and push flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
- [Example `.env`](../.env.example)
Primary references:
- [`../README.md`](../README.md) for service scope, contracts, configuration,
and operational behavior
- [`../openapi.yaml`](../openapi.yaml) for the public REST contract
- [`../../README.md`](../../README.md) for workspace-level architecture
- [`../../SECURITY.md`](../../SECURITY.md) for the transport security model
@@ -0,0 +1,179 @@
# Configuration And Contract Examples
The examples below are illustrative. Values such as signatures, payload hashes,
and FlatBuffers payload bytes are placeholders unless explicitly stated
otherwise.
## Example `.env`
The repository also includes a ready-to-copy sample file:
- [`../.env.example`](../.env.example)
The sample keeps all secrets blank and shows only the settings needed to boot
the process and expose the main listeners.
## Public Auth HTTP Examples
Start an e-mail challenge:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/send-email-code \
-H 'Content-Type: application/json' \
-d '{"email":"pilot@example.com"}'
```
Example response:
```json
{
"challenge_id": "challenge-123"
}
```
Confirm the challenge and register the device public key:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/public/auth/confirm-email-code \
-H 'Content-Type: application/json' \
-d '{
"challenge_id": "challenge-123",
"code": "123456",
"client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no="
}'
```
Example response:
```json
{
"device_session_id": "device-session-123"
}
```
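A client can sanity-check the device key before calling the confirm endpoint. The sketch below is illustrative only: it assumes `client_public_key` is a raw 32-byte Ed25519 public key in standard base64 (the examples are consistent with that, but the encoding is not stated here), and `build_confirm_request` is a hypothetical helper name. It builds the request without sending it.

```python
import base64
import json
import urllib.request

# Local address taken from the curl examples above.
BASE_URL = "http://127.0.0.1:8080/api/v1/public/auth"


def build_confirm_request(challenge_id, code, client_public_key):
    """Build (but do not send) a confirm-email-code request."""
    # Assumption: the key is a raw 32-byte Ed25519 public key, base64-encoded.
    raw = base64.b64decode(client_public_key, validate=True)
    if len(raw) != 32:
        raise ValueError(f"expected a 32-byte public key, got {len(raw)} bytes")
    body = json.dumps({
        "challenge_id": challenge_id,
        "code": code,
        "client_public_key": client_public_key,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/confirm-email-code",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_confirm_request(
    "challenge-123",
    "123456",
    "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no=",
)
# urllib.request.urlopen(req) would send it to a running gateway.
```

Rejecting a malformed key on the client avoids spending a challenge attempt on a request that cannot succeed.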
## Authenticated gRPC Envelope Examples
The authenticated transport is gRPC/protobuf, not JSON over HTTP. The examples
below use protobuf-style JSON only to make the logical envelope readable.
`bytes` fields are shown as base64 strings, matching the standard protobuf JSON
mapping.
Example `ExecuteCommandRequest`:
```json
{
"protocolVersion": "v1",
"deviceSessionId": "device-session-123",
"messageType": "fleet.move",
"timestampMs": "1775121600000",
"requestId": "request-123",
"payloadBytes": "RkxBVEJVRkZFUlNfUEFZTE9BRA==",
"payloadHash": "5fY6Q8V9mK8x2B7v6v0V0m0i1rQ2QF0rQ8V1Yt1r8Ys=",
"signature": "3o4v8f3h0Y6I0x1bS7zY+8m0bV1Lk4D3yq8J2n8F1rD7yK9v8M1Q0w2s4a6f8d0Q0m3L6y8R1t5w7x9z0a2cA==",
"traceId": "trace-123"
}
```
Example `ExecuteCommandResponse`:
```json
{
"protocolVersion": "v1",
"requestId": "request-123",
"timestampMs": "1775121600123",
"resultCode": "ok",
"payloadBytes": "RkxBVEJVRkZFUlNfUkVTUE9OU0U=",
"payloadHash": "wL4n8H1aR2x3M4b5C6d7E8f9G0h1J2k3L4m5N6o7P8Q=",
"signature": "2Xb7l9m0n1p2q3r4s5t6u7v8w9x0y1z2A3B4C5D6E7F8G9H0J1K2L3M4N5O6P7Q8R9S0T1U2V3W4X5Y6Z7a8b9cQ=="
}
```
Example bootstrap `GatewayEvent` sent after `SubscribeEvents` opens:
```json
{
"eventType": "gateway.server_time",
"eventId": "request-123",
"timestampMs": "1775121600456",
"payloadBytes": "RkxBVEJVRkZFUlNfU0VSVkVSX1RJTUU=",
"payloadHash": "2b1U3m4N5p6Q7r8S9t0U1v2W3x4Y5z6A7b8C9d0E1f2=",
"signature": "4Nf8k2p6s0w4y8A2d6g0j4m8p2t6w0z4C8F2I6L0O4R8U2X6a0d4g8j2m6p0s4v8yA2d6g0j4m8p2t6w0z4C8F2I6A==",
"requestId": "request-123",
"traceId": "trace-123"
}
```
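The JSON rendering used above can be reproduced with a few lines of Python. This is a sketch for reading the examples, not the gateway's wire format: the hash algorithm is assumed to be SHA-256 (the 32-byte `payloadHash` placeholders fit it, but the algorithm is not stated here), the signature is a zero-filled 64-byte placeholder, and `build_execute_command_json` is a hypothetical helper.

```python
import base64
import hashlib
import json


def b64(data):
    """Encode bytes the way the protobuf JSON mapping renders `bytes` fields."""
    return base64.b64encode(data).decode("ascii")


def build_execute_command_json(payload, timestamp_ms, signature):
    return {
        "protocolVersion": "v1",
        "deviceSessionId": "device-session-123",
        "messageType": "fleet.move",
        # 64-bit integers are rendered as decimal strings in protobuf JSON.
        "timestampMs": str(timestamp_ms),
        "requestId": "request-123",
        "payloadBytes": b64(payload),
        # Assumption: payload_hash is SHA-256 over the raw payload bytes.
        "payloadHash": b64(hashlib.sha256(payload).digest()),
        "signature": b64(signature),
        "traceId": "trace-123",
    }


envelope = build_execute_command_json(
    payload=b"FLATBUFFERS_PAYLOAD",   # placeholder, not a real FlatBuffers buffer
    timestamp_ms=1775121600000,
    signature=bytes(64),              # placeholder; Ed25519 signatures are 64 bytes
)
print(json.dumps(envelope, indent=2))
```

The `payloadBytes` value matches the request example above because that example encodes the same placeholder string. The hash and signature will differ, since the documented values are placeholders.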
## Redis Examples
### Session Cache Record
Example Redis key and JSON value used by the fallback session cache:
```text
gateway:session:device-session-123
```
```json
{
"device_session_id": "device-session-123",
"user_id": "user-123",
"client_public_key": "11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no=",
"status": "active"
}
```
### Session Event Stream Entry
Example session snapshot entry:
```bash
redis-cli XADD gateway:session-events '*' \
device_session_id device-session-123 \
user_id user-123 \
client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
status active
```
Revocation entry:
```bash
redis-cli XADD gateway:session-events '*' \
device_session_id device-session-123 \
user_id user-123 \
client_public_key 11qYAYdk8v3K6Yw8QK6ZlQ2nP4Wm8Cq5g1H0K8vT9no= \
status revoked \
revoked_at_ms 1775121700000
```
### Client Event Stream Entry
User-wide event:
```bash
redis-cli XADD gateway:client-events '*' \
user_id user-123 \
event_type fleet.updated \
event_id event-123 \
payload_bytes payload-v1
```
Session-targeted event with correlation:
```bash
redis-cli XADD gateway:client-events '*' \
user_id user-123 \
device_session_id device-session-123 \
event_type fleet.updated \
event_id event-124 \
payload_bytes payload-v2 \
request_id request-123 \
trace_id trace-123
```
Notes:
- `payload_bytes` in Redis Stream entries must be binary-safe payload data;
- the gateway derives `timestamp_ms`, recomputes `payload_hash`, and signs the
outgoing event at delivery time;
- each gateway replica consumes streams with plain `XREAD`, without consumer
groups, so nothing on the gateway side acknowledges or trims entries;
publishers must keep retention bounded with `MAXLEN`.
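As a rough illustration of the last note, a publisher can attach an approximate `MAXLEN` cap to every `XADD`. The helper below only assembles `redis-cli` arguments. The function name and the cap of 10000 entries are hypothetical, and a real publisher carrying binary payloads would use a binary-safe client library rather than shell arguments.

```python
import shlex
from typing import Dict, List


def build_xadd_command(stream: str, fields: Dict[str, str], maxlen: int) -> List[str]:
    """Assemble an XADD call with approximate trimming (MAXLEN ~ N)."""
    args = ["redis-cli", "XADD", stream, "MAXLEN", "~", str(maxlen), "*"]
    for name, value in fields.items():
        args += [name, value]
    return args


cmd = build_xadd_command(
    "gateway:client-events",
    {
        "user_id": "user-123",
        "event_type": "fleet.updated",
        "event_id": "event-123",
        "payload_bytes": "payload-v1",
    },
    maxlen=10000,  # hypothetical retention cap
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

The `~` modifier lets Redis trim lazily at macro-node boundaries, which is cheaper than exact trimming and normally sufficient for bounding retention.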
@@ -0,0 +1,86 @@
# Request and Push Flows
## Public Auth Flow
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Limiter as Public anti-abuse
participant Auth as AuthServiceClient
Client->>Gateway: POST /api/v1/public/auth/send-email-code
Gateway->>Limiter: classify + rate-limit + body checks
Limiter-->>Gateway: allowed
Gateway->>Auth: SendEmailCode(email)
Auth-->>Gateway: challenge_id
Gateway-->>Client: 200 {challenge_id}
Client->>Gateway: POST /api/v1/public/auth/confirm-email-code
Gateway->>Limiter: classify + rate-limit + body checks
Limiter-->>Gateway: allowed
Gateway->>Auth: ConfirmEmailCode(challenge_id, code, client_public_key)
Auth-->>Gateway: device_session_id
Gateway-->>Client: 200 {device_session_id}
```
## Authenticated ExecuteCommand Flow
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Cache as SessionCache
participant Replay as ReplayStore
participant Policy as Rate limit / policy
participant Downstream
Client->>Gateway: ExecuteCommand(envelope, payload_bytes, signature)
Gateway->>Gateway: validate envelope + protocol_version
Gateway->>Cache: lookup(device_session_id)
Cache-->>Gateway: session record
Gateway->>Gateway: verify payload_hash
Gateway->>Gateway: verify Ed25519 signature
Gateway->>Gateway: verify freshness window
Gateway->>Replay: reserve(device_session_id, request_id, ttl)
Replay-->>Gateway: accepted
Gateway->>Policy: apply IP/session/user/message_type budgets
Policy-->>Gateway: allowed
Gateway->>Downstream: verified authenticated command
Downstream-->>Gateway: result_code + payload_bytes
Gateway->>Gateway: hash payload + sign response
Gateway-->>Client: ExecuteCommandResponse + signature
```
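The ordering of the gateway-side checks can be sketched in a few lines. This is a simplified model, not the service code: it assumes SHA-256 for `payload_hash`, uses hypothetical window and TTL values, replaces the Redis-backed replay store with an in-memory dictionary, and leaves out the Ed25519 signature check because the Python standard library has no Ed25519 implementation. The returned labels are made up for the example.

```python
import base64
import hashlib
import hmac

FRESHNESS_WINDOW_MS = 5 * 60 * 1000   # hypothetical freshness window
REPLAY_TTL_MS = 10 * 60 * 1000        # hypothetical replay reservation TTL


class ReplayStore:
    """In-memory stand-in for the replay store."""

    def __init__(self):
        self._expiry = {}  # (device_session_id, request_id) -> expiry in ms

    def reserve(self, device_session_id, request_id, now_ms, ttl_ms):
        key = (device_session_id, request_id)
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now_ms:
            return False  # already reserved and not yet expired
        self._expiry[key] = now_ms + ttl_ms
        return True


def check_request(envelope, payload, replay, now_ms):
    """Run the hash, freshness, and replay checks in the order of the diagram."""
    claimed = base64.b64decode(envelope["payload_hash"])
    actual = hashlib.sha256(payload).digest()  # assumption: SHA-256
    if not hmac.compare_digest(claimed, actual):
        return "payload_hash_mismatch"
    # The Ed25519 signature check sits here in the real flow (omitted).
    if abs(now_ms - envelope["timestamp_ms"]) > FRESHNESS_WINDOW_MS:
        return "stale_request"
    if not replay.reserve(envelope["device_session_id"], envelope["request_id"],
                          now_ms, REPLAY_TTL_MS):
        return "replayed_request"
    return "accepted"
```

Running the cheap, stateless checks before the replay reservation means a tampered or stale request never consumes a replay slot.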
## SubscribeEvents Lifecycle
```mermaid
sequenceDiagram
participant Client
participant Gateway
participant Cache as SessionCache
participant Replay as ReplayStore
participant Hub as PushHub
participant Stream as Client event stream
participant Sess as Session event stream
Client->>Gateway: SubscribeEvents(envelope, signature)
Gateway->>Gateway: validate envelope + verify request
Gateway->>Cache: lookup(device_session_id)
Cache-->>Gateway: session record
Gateway->>Replay: reserve(device_session_id, request_id, ttl)
Replay-->>Gateway: accepted
Gateway->>Client: gateway.server_time event
Gateway->>Hub: register(user_id, device_session_id)
Stream-->>Gateway: client-facing event for user_id / device_session_id
Gateway->>Hub: publish signed event
Hub-->>Client: matching event delivery
Sess-->>Gateway: revoked session snapshot
Gateway->>Hub: revoke(device_session_id)
Hub-->>Client: stream closes with FAILED_PRECONDITION
Note over Gateway,Hub: During shutdown the gateway closes PushHub before gRPC graceful stop.
Hub-->>Client: stream closes with UNAVAILABLE
```
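The stream lifecycle above maps onto a small in-memory model. The toy `PushHub` below is an assumption-heavy sketch: it keeps one subscriber per `device_session_id`, uses a tiny hypothetical queue limit so overflow is easy to trigger, and records only the terminal gRPC status name instead of closing a real stream.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

QUEUE_LIMIT = 2  # hypothetical per-stream queue size


@dataclass
class Subscriber:
    user_id: str
    device_session_id: str
    queue: deque = field(default_factory=deque)
    closed_with: Optional[str] = None  # terminal gRPC status name


class PushHub:
    def __init__(self):
        self._subs = {}  # device_session_id -> Subscriber

    def register(self, user_id, device_session_id):
        sub = Subscriber(user_id, device_session_id)
        self._subs[device_session_id] = sub
        return sub

    def _close(self, sub, status):
        sub.closed_with = status
        self._subs.pop(sub.device_session_id, None)

    def publish(self, user_id, event, device_session_id=None):
        """Deliver to every stream of the user, or to one session if targeted."""
        for sub in list(self._subs.values()):
            if sub.user_id != user_id:
                continue
            if device_session_id is not None and sub.device_session_id != device_session_id:
                continue
            if len(sub.queue) >= QUEUE_LIMIT:
                self._close(sub, "RESOURCE_EXHAUSTED")  # slow consumer
                continue
            sub.queue.append(event)

    def revoke(self, device_session_id):
        sub = self._subs.get(device_session_id)
        if sub is not None:
            self._close(sub, "FAILED_PRECONDITION")

    def shutdown(self):
        for sub in list(self._subs.values()):
            self._close(sub, "UNAVAILABLE")
```

Overflow closes only the slow stream, which matches the runbook's advice to treat overflow as stream-local rather than as a global push outage.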
@@ -0,0 +1,143 @@
# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.
## Startup Checks
Before starting the process, confirm:
- `GATEWAY_SESSION_CACHE_REDIS_ADDR` points to the Redis deployment used for
session lookup and both internal event streams.
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and
`GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys or
the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8
PEM-encoded Ed25519 private key.
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
environment.
At startup the process performs bounded `PING` checks for:
- the Redis-backed session cache adapter;
- the replay store;
- the session event subscriber;
- the client event subscriber.
Startup fails fast if any of those checks fail or if the signer key cannot be
loaded.
Expected listener state after a healthy start:
- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
`GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
Known startup caveats:
- public auth routes stay mounted without an upstream adapter and return
`503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
returns gRPC `UNIMPLEMENTED` until downstream routes are injected.
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
traffic;
- `GET /metrics` is available only on the optional admin listener.
`/readyz` is process-local. It does not confirm:
- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.
For a practical readiness check in production:
1. confirm the process emitted startup logs for the public and authenticated
listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify that the configured Redis address and stream names match the
expected deployment.
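The HTTP part of that checklist is easy to script. The sketch below assumes a passing probe answers `200` (the expected status codes are not stated here) and uses hypothetical helper names. The `fetch` parameter is injectable so the logic can be exercised without a running gateway.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or 0 when the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a non-2xx status
    except OSError:
        return 0         # connection refused, timeout, DNS failure, ...


def check_gateway(public_base, admin_base=None, fetch=http_status):
    """Probe /healthz and /readyz, plus /metrics when admin HTTP is enabled."""
    results = {
        "healthz": fetch(f"{public_base}/healthz") == 200,  # assumption: 200 = pass
        "readyz": fetch(f"{public_base}/readyz") == 200,
    }
    if admin_base:
        results["metrics"] = fetch(f"{admin_base}/metrics") == 200
    return results


# Example against a local gateway; the admin address is hypothetical:
# print(check_gateway("http://127.0.0.1:8080", admin_base="http://127.0.0.1:9100"))
```

A green result here still says nothing about Redis health after startup or downstream reachability, for the reasons listed above.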
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by
`GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and
message `gateway is shutting down`.
During planned restarts:
1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams
remain open unexpectedly.
## Revoke And Push Failure Triage
### Revocation Does Not Take Effect
If a revoked session still sends traffic or keeps an active stream:
1. verify that the auth/session side published a session snapshot with the
same `device_session_id` and `status=revoked`;
2. verify that the event was written to
`GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.
Expected gateway behavior after the revoke snapshot is consumed:
- new authenticated requests for that `device_session_id` fail with gRPC
`FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
with the same status.
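Steps 4 and 5 of the checklist can be rehearsed offline against entries copied from the session event stream. The sketch below rests on assumptions that should be checked against the service: the required field set is taken from the stream examples, `revoked_at_ms` is treated as mandatory for revoked entries, malformed entries are skipped, and the last well-formed snapshot wins.

```python
REQUIRED_FIELDS = ("device_session_id", "user_id", "client_public_key", "status")
KNOWN_STATUSES = ("active", "revoked")


def validate_snapshot(entry):
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    status = entry.get("status")
    if status and status not in KNOWN_STATUSES:
        problems.append(f"unknown status {status!r}")
    # Assumption: revoked entries must carry a numeric revoked_at_ms.
    if status == "revoked" and not str(entry.get("revoked_at_ms", "")).isdigit():
        problems.append("revoked entry without numeric revoked_at_ms")
    return problems


def final_status(entries, device_session_id):
    """Replay entries in stream order; the last well-formed snapshot wins."""
    status = None
    for entry in entries:
        if entry.get("device_session_id") != device_session_id:
            continue
        if validate_snapshot(entry):
            continue  # assumption: malformed entries are ignored
        status = entry["status"]
    return status
```

If `final_status` returns `active` for a session that should be revoked, either the revoke entry was malformed or a later active snapshot overwrote it.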
### Push Events Are Not Delivered
If a client reports missing push events:
1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap
event;
3. confirm the gateway consumed the expected entry from
`GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match
the intended target;
5. confirm the event payload fields are well-formed and not dropped as
malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or
overflow.
### Stream Closed Unexpectedly
Use the terminal gRPC status first:
- `FAILED_PRECONDITION` with `device session is revoked` means the session was
revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client stopped
consuming that stream fast enough and its in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
client or network side.
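For client logs or support tooling, the status list above reduces to a small lookup. This is an illustrative helper with a hypothetical name. It matches on the exact status and message pairs listed above and falls back to client or network investigation for anything else.

```python
TRIAGE = {
    ("FAILED_PRECONDITION", "device session is revoked"):
        "session was revoked",
    ("RESOURCE_EXHAUSTED", "push stream overflowed"):
        "client consumed too slowly; stream-local overflow",
    ("UNAVAILABLE", "gateway is shutting down"):
        "normal process shutdown; reconnect",
}


def triage(status, message):
    """Map a terminal gRPC status and message to a first triage hint."""
    return TRIAGE.get((status, message), "investigate client or network side")


print(triage("UNAVAILABLE", "gateway is shutting down"))
```

Matching on the message as well as the status keeps an unrelated `UNAVAILABLE` (for example a transport failure) from being mistaken for a planned shutdown.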
For overflow incidents:
- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
@@ -0,0 +1,59 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/gateway` process and its
runtime dependencies.
```mermaid
flowchart LR
subgraph Clients
Public["Public REST clients"]
Authd["Authenticated gRPC clients"]
end
subgraph Gateway["Edge Gateway process"]
PublicHTTP["Public HTTP listener\n/healthz /readyz /api/v1/public/auth/*"]
AuthGRPC["Authenticated gRPC listener\nExecuteCommand / SubscribeEvents"]
AdminHTTP["Optional admin HTTP listener\n/metrics"]
SessionSnap["In-memory session snapshot cache"]
Replay["Replay reservation client"]
PushHub["PushHub"]
SessSub["Session event subscriber"]
ClientSub["Client event subscriber"]
Telemetry["Logs, traces, metrics"]
end
Public --> PublicHTTP
Authd --> AuthGRPC
AuthGRPC --> SessionSnap
AuthGRPC --> Replay
AuthGRPC --> PushHub
SessSub --> SessionSnap
SessSub --> PushHub
ClientSub --> PushHub
PublicHTTP --> Telemetry
AuthGRPC --> Telemetry
AdminHTTP --> Telemetry
Redis["Redis\nsession records + replay keys + streams"]
AuthSvc["Auth / Session Service"]
Downstream["Downstream business services"]
Metrics["Prometheus / OTLP collectors"]
PublicHTTP -. public auth adapter .-> AuthSvc
SessionSnap --> Redis
Replay --> Redis
SessSub --> Redis
ClientSub --> Redis
AuthGRPC --> Downstream
Telemetry --> Metrics
```
Notes:
- `cmd/gateway` refuses startup when Redis connectivity or the response signer
is misconfigured.
- The admin listener is optional and serves only Prometheus text metrics.
- Public auth routing stays available without an upstream adapter, but returns
`503 service_unavailable`.
- Authenticated gRPC starts with an empty static router; `ExecuteCommand`
remains `UNIMPLEMENTED` until downstream routes are injected.