Ilia Denisov 118f7c17a2 phase 4: connectrpc on the gateway authenticated edge
Replace the native-gRPC server bootstrap with a single
`connectrpc.com/connect` HTTP/h2c listener. Connect-Go natively
serves Connect, gRPC, and gRPC-Web on the same port, so browsers can
now reach the authenticated surface without giving up the gRPC
framing native and desktop clients may use later. The decorator
stack (envelope → session → payload-hash → signature →
freshness/replay → rate-limit → routing/push) is reused unchanged
behind a small Connect → gRPC adapter and a `grpc.ServerStream`
shim around `*connect.ServerStream`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 11:49:28 +02:00


# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.
## Startup Checks
Before starting the process, confirm:
- `GATEWAY_REDIS_MASTER_ADDR` and `GATEWAY_REDIS_PASSWORD` point to the
Redis deployment used for anti-replay reservations. Optional read
replicas may be listed in `GATEWAY_REDIS_REPLICA_ADDRS` (currently
unused; reserved for future read-routing).
- `GATEWAY_BACKEND_HTTP_URL`, `GATEWAY_BACKEND_GRPC_PUSH_URL`, and
`GATEWAY_BACKEND_GATEWAY_CLIENT_ID` describe the consolidated backend
service: the gateway forwards every public auth and authenticated
user/lobby request to it and opens its gRPC push subscription against it.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable
PKCS#8 PEM-encoded Ed25519 private key.
- the configured Redis DB and key-prefix settings match the target
environment. Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic
is password-protected and TLS is disabled by policy; the deprecated
`GATEWAY_REDIS_TLS_ENABLED` and `GATEWAY_REDIS_USERNAME` variables are
no longer accepted and cause a hard fail at startup.
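The checks above can be approximated by a small preflight run before the process starts. This is a minimal sketch, not the gateway's own validation: the helper names `missingVars`, `deprecatedVars`, and `checkSignerKey` are hypothetical, and treating every listed variable as mandatory is an assumption. Only the variable names and the key format come from this runbook.

```go
package main

import (
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// requiredVars lists the variables this runbook asks operators to confirm.
// Treating all of them as mandatory is an assumption of this sketch.
var requiredVars = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_BACKEND_HTTP_URL",
	"GATEWAY_BACKEND_GRPC_PUSH_URL",
	"GATEWAY_BACKEND_GATEWAY_CLIENT_ID",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// deprecated lists the variables the runbook says are no longer accepted.
var deprecated = []string{"GATEWAY_REDIS_TLS_ENABLED", "GATEWAY_REDIS_USERNAME"}

// missingVars returns the required names whose value is empty or absent in env.
func missingVars(env map[string]string) []string {
	var out []string
	for _, name := range requiredVars {
		if env[name] == "" {
			out = append(out, name)
		}
	}
	return out
}

// deprecatedVars returns the rejected names that are present in env,
// even when their value is empty.
func deprecatedVars(env map[string]string) []string {
	var out []string
	for _, name := range deprecated {
		if _, ok := env[name]; ok {
			out = append(out, name)
		}
	}
	return out
}

// checkSignerKey verifies that pemBytes holds a PKCS#8 PEM-encoded
// Ed25519 private key.
func checkSignerKey(pemBytes []byte) error {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return errors.New("no PEM block found")
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return fmt.Errorf("not a PKCS#8 private key: %w", err)
	}
	if _, ok := key.(ed25519.PrivateKey); !ok {
		return fmt.Errorf("unexpected key type %T, want Ed25519", key)
	}
	return nil
}

func main() {
	// Collect only the variables this sketch knows about.
	env := map[string]string{}
	for _, name := range append(append([]string{}, requiredVars...), deprecated...) {
		if v, ok := os.LookupEnv(name); ok {
			env[name] = v
		}
	}
	fmt.Println("missing:", missingVars(env))
	fmt.Println("deprecated but set:", deprecatedVars(env))
	if path := env["GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH"]; path != "" {
		data, err := os.ReadFile(path)
		if err == nil {
			err = checkSignerKey(data)
		}
		fmt.Println("signer key error:", err)
	}
}
```

The sketch only reports findings. A deployment wrapper would likely turn any non-empty finding into a non-zero exit code.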
At startup the process opens one shared `*redis.Client` (instrumented
via OpenTelemetry tracing and metrics) and performs one bounded `PING`
for the replay store. It also dials backend's gRPC push listener and
opens one `Push.SubscribePush` stream that reconnects with capped
exponential backoff on failure.
Startup fails fast if the Redis ping fails, the backend URL is
malformed, or the signer key cannot be loaded.
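To make "capped exponential backoff" concrete when reading reconnect logs, the delay schedule can be sketched as below. The base and cap values are illustrative assumptions, the helper name `backoffDelay` is hypothetical, and the gateway may also add jitter, which this sketch omits.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before reconnect attempt n (0-based):
// base doubled n times, never exceeding max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		// Stop doubling once the next step would reach or pass the cap;
		// this also avoids overflow for large attempt counts.
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Illustrative values only; the real base and cap are not stated here.
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 30*time.Second))
	}
}
```

With a schedule like this, a long backend outage shows up as reconnect attempts that space out and then settle at a fixed interval, rather than as a tight retry loop.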
Expected listener state after a healthy start:
- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
`GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
Known startup caveats:
- public auth routes stay mounted without an upstream adapter and return
`503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
returns gRPC `UNIMPLEMENTED` until downstream routes are injected.
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
traffic;
- `GET /metrics` is available only on the optional admin listener.
`/readyz` is process-local. It does not confirm:
- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.
For a practical readiness check in production:
1. confirm the process emitted startup logs for the public and authenticated
listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.
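Steps 2 and 3 above can be scripted. The following is a minimal sketch that assumes both probes answer `200 OK` when healthy (the runbook does not state the status codes) and that the gateway listens on the default `:8080` on the local host. The helper names `probe` and `verdict` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// probe issues GET base+path and reports whether the status was 200.
// Treating 200 as the healthy status is an assumption of this sketch.
func probe(client *http.Client, base, path string) (bool, error) {
	resp, err := client.Get(base + path)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// verdict summarises the two public probes. Even the best outcome is
// process-local: it says nothing about Redis, the backend, or push fan-out.
func verdict(healthzOK, readyzOK bool) string {
	switch {
	case !healthzOK:
		return "public listener not answering"
	case !readyzOK:
		return "alive but not ready"
	default:
		return "ready (process-local only)"
	}
}

func main() {
	base := "http://localhost:8080" // assumed default public address
	if len(os.Args) > 1 {
		base = os.Args[1]
	}
	client := &http.Client{Timeout: 3 * time.Second}

	healthz, err := probe(client, base, "/healthz")
	if err != nil {
		fmt.Println("healthz:", err)
	}
	readyz, err := probe(client, base, "/readyz")
	if err != nil {
		fmt.Println("readyz:", err)
	}
	fmt.Println(verdict(healthz, readyz))
}
```

Because `/readyz` is process-local, a passing result from this sketch should still be paired with the log and config checks in steps 1 and 5.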
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- the per-component shutdown budget is controlled by
`GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before HTTP graceful stop;
- active `SubscribeEvents` streams terminate with `UNAVAILABLE` and
message `gateway is shutting down`.
During planned restarts:
1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams
remain open unexpectedly.
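The idea of a per-component shutdown budget can be sketched as follows. This illustrates the concept only and is not the gateway's shutdown code: the `component` type and the `sleeper` and `stopAll` helpers are hypothetical, and stopping components one after another is an assumption. The ordering in `main` mirrors the runbook's statement that the `PushHub` closes before HTTP graceful stop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// component is a hypothetical stand-in for a gateway subsystem.
type component struct {
	name string
	stop func(ctx context.Context) error
}

// sleeper builds a component whose stop takes d unless ctx expires first.
func sleeper(name string, d time.Duration) component {
	return component{name: name, stop: func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}}
}

// stopAll stops components in order, giving each its own budget, and
// returns the names of those that failed or ran out of time.
func stopAll(budget time.Duration, comps []component) []string {
	var late []string
	for _, c := range comps {
		ctx, cancel := context.WithTimeout(context.Background(), budget)
		err := c.stop(ctx)
		cancel()
		if err != nil {
			late = append(late, c.name)
		}
	}
	return late
}

func main() {
	// In a real process this would run after SIGINT or SIGTERM arrives.
	late := stopAll(100*time.Millisecond, []component{
		sleeper("push hub", 0),
		sleeper("http listener", time.Second),
	})
	fmt.Println("exceeded budget:", late)
}
```

Under this model the worst-case total shutdown time is the budget multiplied by the number of components, which is worth keeping in mind when setting orchestrator kill timeouts.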
## Revoke And Push Failure Triage
### Revocation Does Not Take Effect
If a revoked session still sends traffic or keeps an active stream:
1. verify that backend recorded the revocation (the
`/api/v1/internal/sessions/{id}` lookup must return `status=revoked`
for that device session);
2. verify that backend emitted the corresponding `session_invalidation`
frame on `Push.SubscribePush` and that the gateway logs a
matching subscription closure;
3. verify the gateway is connected to the same backend instance via
`GATEWAY_BACKEND_HTTP_URL` / `GATEWAY_BACKEND_GRPC_PUSH_URL`;
4. confirm the next authenticated request from that session is rejected.
Expected gateway behavior after the revoke snapshot is consumed:
- new authenticated requests for that `device_session_id` fail with gRPC
`FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
with the same status.
### Push Events Are Not Delivered
If a client reports missing push events:
1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time`
bootstrap event;
3. confirm the gateway consumed the expected `pushv1.PushEvent` from
backend (look for `push_dispatcher` log lines or
`grpc_push_events_total` increments on the backend side);
4. verify `user_id` and optional `device_session_id` on the
`ClientEvent` match the intended target;
5. confirm the event payload fields are well-formed and not dropped as
malformed;
6. check whether the stream was closed earlier because of revoke,
shutdown, or overflow.
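Step 4 is the check most often misread, so a sketch of the targeting rule may help. It assumes that an empty `device_session_id` means "every stream of that user" and that a non-empty one restricts delivery to that single session. The runbook only says the field is optional, so this interpretation, the struct shapes, and the `delivers` helper are all assumptions.

```go
package main

import "fmt"

// clientEvent carries only the targeting fields named in the runbook.
type clientEvent struct {
	UserID          string
	DeviceSessionID string // optional; empty is assumed to mean "all sessions"
}

// subscriber describes one open SubscribeEvents stream.
type subscriber struct {
	UserID          string
	DeviceSessionID string
}

// delivers reports whether ev should reach sub under the assumed rule.
func delivers(ev clientEvent, sub subscriber) bool {
	if ev.UserID != sub.UserID {
		return false
	}
	return ev.DeviceSessionID == "" || ev.DeviceSessionID == sub.DeviceSessionID
}

func main() {
	phone := subscriber{UserID: "user-1", DeviceSessionID: "phone"}
	laptop := subscriber{UserID: "user-1", DeviceSessionID: "laptop"}

	broadcast := clientEvent{UserID: "user-1"}
	targeted := clientEvent{UserID: "user-1", DeviceSessionID: "phone"}

	fmt.Println(delivers(broadcast, phone), delivers(broadcast, laptop))
	fmt.Println(delivers(targeted, phone), delivers(targeted, laptop))
}
```

A "missing" event is frequently a correctly targeted event that named a different device session than the one the reporter was watching.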
### Stream Closed Unexpectedly
Use the terminal gRPC status first:
- `FAILED_PRECONDITION` with `device session is revoked` means the session was
revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client on that
stream did not consume events fast enough and its in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
client or network side.
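The mapping above is easy to encode in client-side or log-processing tooling. A minimal sketch follows, with status codes passed as their canonical names; the `diagnose` helper and its return strings are hypothetical, while the code and message pairs come from the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// diagnose maps a terminal gRPC status (canonical code name plus message)
// to the interpretation given in this runbook.
func diagnose(code, message string) string {
	switch {
	case code == "FAILED_PRECONDITION" && strings.Contains(message, "device session is revoked"):
		return "session was revoked"
	case code == "RESOURCE_EXHAUSTED" && strings.Contains(message, "push stream overflowed"):
		return "stream-local overflow: client consumed too slowly"
	case code == "UNAVAILABLE" && strings.Contains(message, "gateway is shutting down"):
		return "normal gateway shutdown: reconnect"
	default:
		// Cancellations and transport errors land here.
		return "not a gateway-initiated close: check client and network"
	}
}

func main() {
	fmt.Println(diagnose("RESOURCE_EXHAUSTED", "push stream overflowed"))
	fmt.Println(diagnose("CANCELLED", "context canceled"))
}
```

Matching on both code and message matters: `UNAVAILABLE` on its own is also what many clients report for plain transport failures.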
For overflow incidents:
- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
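Why overflow is stream-local can be seen in a sketch of a bounded per-stream queue with a non-blocking enqueue. This shows one common way to build such a queue and is not taken from the gateway source: the `stream` type, its capacity, and the `offer` method are hypothetical, and the sketch assumes a single producer goroutine.

```go
package main

import "fmt"

// stream is a hypothetical per-subscriber queue with a fixed capacity.
type stream struct {
	queue      chan string
	overflowed bool
}

func newStream(capacity int) *stream {
	return &stream{queue: make(chan string, capacity)}
}

// offer enqueues ev without blocking. If the queue is full, the stream is
// marked overflowed and closed; other streams are unaffected.
// Not safe for concurrent producers.
func (s *stream) offer(ev string) bool {
	if s.overflowed {
		return false
	}
	select {
	case s.queue <- ev:
		return true
	default:
		s.overflowed = true
		close(s.queue)
		return false
	}
}

func main() {
	fast := newStream(2)
	slow := newStream(2)

	for i := 0; i < 5; i++ {
		ev := fmt.Sprintf("event-%d", i)
		fast.offer(ev)
		<-fast.queue // the fast client drains every event immediately
		slow.offer(ev) // the slow client never reads
	}
	fmt.Println("fast overflowed:", fast.overflowed)
	fmt.Println("slow overflowed:", slow.overflowed)
}
```

The producer never blocks on the slow consumer, which is why one stalled client ends with `RESOURCE_EXHAUSTED` on its own stream while every other subscriber keeps receiving events.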