# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and push or revoke incidents.

## Startup Checks

Before starting the process, confirm:

- `GATEWAY_REDIS_MASTER_ADDR` and `GATEWAY_REDIS_PASSWORD` point to the Redis deployment used for session lookup, replay reservations, session-events consumption, and client-events fan-out. Optional read replicas may be listed in `GATEWAY_REDIS_REPLICA_ADDRS` (currently unused; reserved for future read-routing).
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and `GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys or the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8 PEM-encoded Ed25519 private key.
- the configured Redis DB and key-prefix settings match the target environment.

Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic is password-protected and TLS is disabled by policy; the deprecated `GATEWAY_REDIS_TLS_ENABLED` and `GATEWAY_REDIS_USERNAME` variables are no longer accepted and cause a hard fail at startup.

At startup the process opens one shared `*redis.Client` (instrumented via OpenTelemetry tracing and metrics) and performs one bounded `PING`. The session cache, replay store, session-events subscriber, and client-events subscriber all use that client. Startup fails fast if the ping fails or if the signer key cannot be loaded.

Expected listener state after a healthy start:

- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on `GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.

Known startup caveats:

- public auth routes stay mounted without an upstream adapter and return `503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand` returns gRPC `UNIMPLEMENTED` until downstream routes are injected.

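
The pre-start configuration checks in this section can be partly automated. The following is a minimal sketch, not the gateway's own validation: `checkEnv` and `checkSignerKey` are hypothetical helper names, treating all five listed variables as mandatory is an assumption, and so is flagging a rejected variable whenever it is present, even with an empty value. It uses only the Go standard library.

```go
package main

import (
	"crypto/ed25519"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// requiredVars lists the settings the startup checklist asks operators to
// confirm. Treating every one of them as mandatory is an assumption.
var requiredVars = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_SESSION_EVENTS_REDIS_STREAM",
	"GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// rejectedVars lists the deprecated settings that make startup fail.
var rejectedVars = []string{
	"GATEWAY_REDIS_TLS_ENABLED",
	"GATEWAY_REDIS_USERNAME",
}

// checkEnv returns one finding per required variable that is missing or
// empty, and one per rejected variable that is present.
func checkEnv(env map[string]string) []string {
	var findings []string
	for _, name := range requiredVars {
		if env[name] == "" {
			findings = append(findings, "missing: "+name)
		}
	}
	for _, name := range rejectedVars {
		if _, ok := env[name]; ok {
			findings = append(findings, "no longer accepted: "+name)
		}
	}
	return findings
}

// checkSignerKey reports whether pemBytes holds a PKCS#8 PEM-encoded
// Ed25519 private key.
func checkSignerKey(pemBytes []byte) error {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return errors.New("signer key: no PEM block found")
	}
	key, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return fmt.Errorf("signer key: not PKCS#8: %w", err)
	}
	if _, ok := key.(ed25519.PrivateKey); !ok {
		return fmt.Errorf("signer key: want Ed25519, got %T", key)
	}
	return nil
}

func main() {
	// Collect only the variables this sketch knows about.
	env := map[string]string{}
	for _, name := range append(append([]string{}, requiredVars...), rejectedVars...) {
		if v, ok := os.LookupEnv(name); ok {
			env[name] = v
		}
	}
	findings := checkEnv(env)
	if path := env["GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH"]; path != "" {
		pemBytes, err := os.ReadFile(path)
		if err != nil {
			findings = append(findings, "signer key unreadable: "+err.Error())
		} else if err := checkSignerKey(pemBytes); err != nil {
			findings = append(findings, err.Error())
		}
	}
	for _, f := range findings {
		fmt.Println(f)
	}
	if len(findings) == 0 {
		fmt.Println("preflight ok")
	}
}
```

Run it in the same environment the gateway will start in. The sketch does not contact Redis, so it complements the bounded `PING` the process performs at startup rather than replacing it.
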
## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP traffic;
- `GET /metrics` is available only on the optional admin listener.

`/readyz` is process-local. It does not confirm:

- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.

For a practical readiness check in production:

1. confirm the process emitted startup logs for the public and authenticated listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by `GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and message `gateway is shutting down`.

During planned restarts:

1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams remain open unexpectedly.

## Revoke And Push Failure Triage

### Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

1. verify that the auth/session side published a session snapshot with the same `device_session_id` and `status=revoked`;
2. verify that the event was written to `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.

Expected gateway behavior after the revoke snapshot is consumed:

- new authenticated requests for that `device_session_id` fail with gRPC `FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close with the same status.

### Push Events Are Not Delivered

If a client reports missing push events:

1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap event;
3. confirm the gateway consumed the expected entry from `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match the intended target;
5. confirm the event payload fields are well-formed and not dropped as malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or overflow.

### Stream Closed Unexpectedly

Use the terminal gRPC status first:

- `FAILED_PRECONDITION` with `device session is revoked` means the session was revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means that stream was not consumed fast enough and its in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the client or network side.

For overflow incidents:

- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
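
The status-first triage for closed streams can be captured as a small lookup, for example in a log-analysis helper. This is a sketch under assumptions: `triage` is a hypothetical function, the status code is passed as its canonical name string so the example needs no gRPC dependency, and routing every other status to the client or network side is this sketch's reading of the last bullet rather than documented gateway behavior.

```go
package main

import "fmt"

// triage maps the terminal gRPC status of a SubscribeEvents stream to the
// cause named in the runbook. Both the code name and the message must match.
func triage(code, message string) string {
	switch {
	case code == "FAILED_PRECONDITION" && message == "device session is revoked":
		return "session revoked"
	case code == "RESOURCE_EXHAUSTED" && message == "push stream overflowed":
		return "stream-local overflow: inspect client receive behavior and reconnect logic"
	case code == "UNAVAILABLE" && message == "gateway is shutting down":
		return "normal process shutdown: expect the client to reconnect"
	default:
		// Assumption: any other status is treated as client-side
		// cancellation or a transport error.
		return "investigate on the client or network side"
	}
}

func main() {
	fmt.Println(triage("RESOURCE_EXHAUSTED", "push stream overflowed"))
	fmt.Println(triage("CANCELLED", "context canceled"))
}
```

Matching on the message as well as the code keeps an `UNAVAILABLE` raised by the transport from being mistaken for a planned gateway shutdown.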