feat: edge gateway service
# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.

## Startup Checks

Before starting the process, confirm:

- `GATEWAY_SESSION_CACHE_REDIS_ADDR` points to the Redis deployment used for
  session lookup and both internal event streams.
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and
  `GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys or
  the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8
  PEM-encoded Ed25519 private key.
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
  environment.

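The pre-start checklist above can be partly automated. The following is a minimal sketch, not the gateway's own validation: the function name `preflight` is hypothetical, and it only checks that the four variables named above are set and that the key file is readable and starts with the unencrypted PKCS#8 PEM label. It does not parse the key, so it cannot confirm the key is Ed25519, and it does not check Redis ACL, DB, TLS, or key-prefix settings.

```python
from pathlib import Path

# Variable names come from the runbook; everything else is illustrative.
REQUIRED_VARS = (
    "GATEWAY_SESSION_CACHE_REDIS_ADDR",
    "GATEWAY_SESSION_EVENTS_REDIS_STREAM",
    "GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
    "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
)
KEY_PATH_VAR = "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH"
# Unencrypted PKCS#8 PEM files begin with this label.
PKCS8_HEADER = "-----BEGIN PRIVATE KEY-----"


def preflight(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    key_path = env.get(KEY_PATH_VAR)
    if key_path:
        try:
            text = Path(key_path).read_text(encoding="ascii")
            first_line = text.lstrip().splitlines()[0]
        except (OSError, UnicodeDecodeError, IndexError) as exc:
            problems.append(f"{KEY_PATH_VAR}: cannot read key file ({exc!r})")
        else:
            if first_line.strip() != PKCS8_HEADER:
                problems.append(
                    f"{KEY_PATH_VAR}: file does not start with a PKCS#8 PEM header"
                )
    return problems
```

Calling `preflight(dict(os.environ))` in a deploy script and refusing to start on a non-empty result catches missing configuration before the process reaches its own startup checks.
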
At startup the process performs bounded `PING` checks for:

- the Redis-backed session cache adapter;
- the replay store;
- the session event subscriber;
- the client event subscriber.

Startup fails fast if any of those checks fails or if the signer key cannot be
loaded.

Expected listener state after a healthy start:

- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
  `GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.

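To make the listener rules above concrete, here is a small sketch that resolves the expected bind addresses from an environment mapping. The function name `effective_listeners` is hypothetical, and treating an empty value the same as an unset one for the two defaulted variables is an assumption, not something the runbook states.

```python
def effective_listeners(env):
    """Map each listener to its expected bind address, or None when disabled."""
    return {
        # Assumption: an empty value falls back to the default, like an unset one.
        "public_http": env.get("GATEWAY_PUBLIC_HTTP_ADDR") or ":8080",
        "authenticated_grpc": env.get("GATEWAY_AUTHENTICATED_GRPC_ADDR") or ":9090",
        # Admin HTTP has no default: it is enabled only when the value is non-empty.
        "admin_http": env.get("GATEWAY_ADMIN_HTTP_ADDR") or None,
    }
```

Comparing this result with the addresses reported in the startup logs is a quick way to spot a listener that came up on an unexpected port.
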
Known startup caveats:

- when no auth upstream adapter is configured, public auth routes stay mounted
  and return `503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
  returns gRPC `UNIMPLEMENTED` until downstream routes are injected.

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
  traffic;
- `GET /metrics` is available only on the optional admin listener.

`/readyz` is process-local. It does not confirm:

- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.

For a practical readiness check in production:

1. confirm the process emitted startup logs for the public and authenticated
   listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.

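Steps 2 to 4 of the checklist above lend themselves to a scripted probe. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes a passing probe answers with HTTP 200, which the runbook does not state explicitly. Steps 1 and 5 (logs and config) still need a manual or separate check.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def probe_gateway(public_base, admin_base=None, fetch=http_status):
    """Probe the gateway endpoints and return {path: passed}."""
    # Assumption: a healthy probe returns HTTP 200.
    results = {
        "/healthz": fetch(public_base + "/healthz") == 200,
        "/readyz": fetch(public_base + "/readyz") == 200,
    }
    if admin_base:  # the admin listener is optional
        results["/metrics"] = fetch(admin_base + "/metrics") == 200
    return results
```

For example, `probe_gateway("http://localhost:8080")` probes only the public listener, while passing a second base URL adds the `/metrics` scrape. The `fetch` parameter exists so the probe can be exercised without a live gateway.
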
## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and
  message `gateway is shutting down`.

During planned restarts:

1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams
   remain open unexpectedly.

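For a gateway launched directly as a child process (for example in a local or staging harness), steps 1 and 4 of the restart procedure can be sketched as below. The helper name `stop_and_wait` is hypothetical, and `budget_seconds` is an operator-chosen wait, not a value read from the gateway: because `GATEWAY_SHUTDOWN_TIMEOUT` is a per-component budget, the total shutdown time may legitimately exceed a single timeout. Deployments managed by a supervisor or orchestrator would use that tool's stop mechanism instead.

```python
import signal
import subprocess


def stop_and_wait(proc, budget_seconds):
    """Send SIGTERM to a subprocess.Popen and wait for it to exit.

    Returns the exit code, or None if the process was still running when
    the budget ran out (the case the runbook says to investigate).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result is the signal to look at component-stop logs and at any `SubscribeEvents` streams that are still open.
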
## Revoke And Push Failure Triage

### Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

1. verify that the auth/session side published a session snapshot with the
   same `device_session_id` and `status=revoked`;
2. verify that the event was written to
   `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.

Expected gateway behavior after the revoke snapshot is consumed:

- new authenticated requests for that `device_session_id` fail with gRPC
  `FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
  with the same status.

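Steps 1 and 5 of the revocation checklist come down to inspecting the ordered snapshots for one session. The sketch below assumes the stream entries have already been read and decoded into dictionaries, oldest first, with `device_session_id` and `status` fields; the actual entry encoding is not described in this runbook, and both function names are hypothetical.

```python
def session_statuses(entries, device_session_id):
    """Return the statuses published for one session, in stream order."""
    return [
        entry.get("status")
        for entry in entries
        if entry.get("device_session_id") == device_session_id
    ]


def revoke_was_overwritten(entries, device_session_id):
    """True when a revoked snapshot exists but is not the latest one."""
    statuses = session_statuses(entries, device_session_id)
    return "revoked" in statuses and statuses[-1] != "revoked"
```

An empty list from `session_statuses` points at steps 1 to 3 (the revoke was never published, or the gateway reads a different stream), while a `True` result from `revoke_was_overwritten` points at the publisher emitting a later active snapshot.
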
### Push Events Are Not Delivered

If a client reports missing push events:

1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap
   event;
3. confirm the gateway consumed the expected entry from
   `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match
   the intended target;
5. confirm the event payload fields are well-formed and not dropped as
   malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or
   overflow.

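Step 4 of the push checklist can be expressed as a small predicate. This is a sketch under stated assumptions: the entry is already decoded into a dictionary, and an absent or empty `device_session_id` is taken to mean the event addresses every session of the user. The runbook only says the field is optional, so confirm that fan-out rule against the gateway before relying on it. The function name is hypothetical.

```python
def entry_targets_stream(entry, user_id, device_session_id):
    """True when a client event entry is addressed to the given stream."""
    if entry.get("user_id") != user_id:
        return False
    target_session = entry.get("device_session_id")
    # Assumption: no device_session_id means "all sessions of this user".
    return not target_session or target_session == device_session_id
```

Running the affected client's `user_id` and `device_session_id` through this check against the suspect stream entry quickly separates a mis-addressed event from a delivery problem.
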
### Stream Closed Unexpectedly

Use the terminal gRPC status first:

- `FAILED_PRECONDITION` with `device session is revoked` means the session was
  revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client on that
  stream did not consume events fast enough and the stream's in-memory queue
  overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
  client or network side.

For overflow incidents:

- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.

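The terminal statuses listed in this section can be kept as a lookup table in client tooling or log triage scripts. The status names and messages below are taken from this runbook; the table name, the `triage` helper, and the wording of the "next step" hints are illustrative assumptions.

```python
# (gRPC status name, message) -> (cause, where to look next).
TERMINAL_STATUSES = {
    ("FAILED_PRECONDITION", "device session is revoked"): (
        "session revoked",
        "auth/session side",
    ),
    ("RESOURCE_EXHAUSTED", "push stream overflowed"): (
        "stream-local queue overflow",
        "client receive behavior and reconnect logic",
    ),
    ("UNAVAILABLE", "gateway is shutting down"): (
        "normal gateway shutdown",
        "restart or deploy logs",
    ),
}


def triage(status, message):
    """Return (cause, next step) for a terminal stream status."""
    # Anything not in the table is treated as a client or transport issue.
    return TERMINAL_STATUSES.get(
        (status, message), ("unrecognized", "client or network side")
    )
```

Matching on both the status and the message keeps an unrelated `UNAVAILABLE` (for example a transport failure) from being misread as a planned shutdown.
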