Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and push or revoke incidents.
Startup Checks
Before starting the process, confirm:
- GATEWAY_REDIS_MASTER_ADDR and GATEWAY_REDIS_PASSWORD point to the Redis deployment used for anti-replay reservations. Optional read replicas may be listed in GATEWAY_REDIS_REPLICA_ADDRS (currently unused; reserved for future read-routing).
- GATEWAY_BACKEND_HTTP_URL, GATEWAY_BACKEND_GRPC_PUSH_URL, and GATEWAY_BACKEND_GATEWAY_CLIENT_ID describe the consolidated backend service to which the gateway forwards every public auth and authenticated user/lobby request, and the gRPC push subscription it opens.
- GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH points to a readable PKCS#8 PEM-encoded Ed25519 private key.
- the configured Redis DB and key-prefix settings match the target environment.

Per ARCHITECTURE.md §Persistence Backends, Redis traffic is password-protected and TLS is disabled by policy; the deprecated GATEWAY_REDIS_TLS_ENABLED and GATEWAY_REDIS_USERNAME variables are no longer accepted and cause a hard fail at startup.
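The checklist above can be turned into a small preflight check that runs before the process starts. The following is a minimal sketch, not the gateway's own validation code: the helper name checkEnv is hypothetical, the set of variables treated as required is inferred from the list above, and it assumes that merely setting a deprecated variable (even to an empty value) is enough to be rejected.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
)

// required is an assumed reading of the checklist: variables that must be non-empty.
var required = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_BACKEND_HTTP_URL",
	"GATEWAY_BACKEND_GRPC_PUSH_URL",
	"GATEWAY_BACKEND_GATEWAY_CLIENT_ID",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// deprecated lists the variables the runbook says cause a hard fail at startup.
var deprecated = []string{"GATEWAY_REDIS_TLS_ENABLED", "GATEWAY_REDIS_USERNAME"}

// checkEnv takes a lookup function so it can be exercised without
// touching the real process environment.
func checkEnv(lookup func(string) (string, bool)) error {
	var errs []error
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			errs = append(errs, fmt.Errorf("%s is required", name))
		}
	}
	for _, name := range deprecated {
		if _, ok := lookup(name); ok {
			errs = append(errs, fmt.Errorf("%s is deprecated and must be unset", name))
		}
	}
	// Mirror the "backend URL is malformed" failure with a basic shape check.
	if raw, ok := lookup("GATEWAY_BACKEND_HTTP_URL"); ok && raw != "" {
		if u, err := url.Parse(raw); err != nil || u.Scheme == "" || u.Host == "" {
			errs = append(errs, fmt.Errorf("GATEWAY_BACKEND_HTTP_URL is not an absolute URL: %q", raw))
		}
	}
	return errors.Join(errs...) // nil when no problems were collected
}

func main() {
	if err := checkEnv(os.LookupEnv); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("environment is consistent with the startup checklist")
}
```

Collecting every problem before returning lets an operator fix the whole environment in one pass instead of restarting once per missing variable.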
At startup the process opens one shared *redis.Client (instrumented
via OpenTelemetry tracing and metrics) and performs one bounded PING
for the replay store. It also dials backend's gRPC push listener and
opens one Push.SubscribePush stream that reconnects with capped
exponential backoff on failure.
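To make "capped exponential backoff" concrete, here is a minimal sketch of how such a delay schedule is typically computed. The function name backoffDelay and the 500 ms base and 30 s cap are illustrative assumptions; the runbook does not state the gateway's actual values or whether it adds jitter.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base * 2^attempt, never exceeding max.
// The early return avoids overflowing time.Duration on large attempt counts.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Assumed values, for illustration only.
	base, max := 500*time.Millisecond, 30*time.Second
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("reconnect attempt %d waits %s\n", attempt, backoffDelay(attempt, base, max))
	}
}
```

The practical consequence for operators is that a backend outage produces reconnect attempts that thin out quickly and then settle at a fixed interval, so a steady trickle of reconnect logs is expected rather than a flood.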
Startup fails fast if the Redis ping fails, the backend URL is malformed, or the signer key cannot be loaded.
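When the signer key is the cause of a failed start, it helps to verify the file independently of the gateway. The sketch below uses only the Go standard library to check that PEM bytes hold a PKCS#8 Ed25519 private key. The function names are hypothetical, the gateway's own loader may report different errors, and generateSignerKeyPEM exists only so the example runs without a key file on disk.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"encoding/pem"
	"errors"
	"fmt"
	"os"
)

// parseSignerKey checks that pemBytes contain a PKCS#8 Ed25519 private key.
func parseSignerKey(pemBytes []byte) (ed25519.PrivateKey, error) {
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	if block.Type != "PRIVATE KEY" {
		return nil, fmt.Errorf("unexpected PEM type %q, want PKCS#8 \"PRIVATE KEY\"", block.Type)
	}
	parsed, err := x509.ParsePKCS8PrivateKey(block.Bytes)
	if err != nil {
		return nil, fmt.Errorf("parse PKCS#8: %w", err)
	}
	key, ok := parsed.(ed25519.PrivateKey)
	if !ok {
		return nil, fmt.Errorf("key type %T is not Ed25519", parsed)
	}
	return key, nil
}

// generateSignerKeyPEM creates a throwaway key so the example is self-contained.
func generateSignerKeyPEM() ([]byte, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	der, err := x509.MarshalPKCS8PrivateKey(priv)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der}), nil
}

func main() {
	pemBytes, err := generateSignerKeyPEM()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if _, err := parseSignerKey(pemBytes); err != nil {
		fmt.Fprintln(os.Stderr, "signer key rejected:", err)
		os.Exit(1)
	}
	fmt.Println("signer key is a valid PKCS#8 Ed25519 private key")
}
```

To check a real deployment, replace the generated bytes with the result of os.ReadFile on the path in GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH.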
Expected listener state after a healthy start:
- public HTTP is enabled on GATEWAY_PUBLIC_HTTP_ADDR or its default :8080;
- authenticated gRPC is enabled on GATEWAY_AUTHENTICATED_GRPC_ADDR or its default :9090;
- admin HTTP is enabled only when GATEWAY_ADMIN_HTTP_ADDR is non-empty.
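The listener rules above can be summarized as a small resolution function. This is a sketch under the assumption that an empty value is treated the same as an unset one; the type and function names are hypothetical and do not come from the gateway's code.

```go
package main

import (
	"fmt"
	"os"
)

// listenerPlan records which addresses the process is expected to bind.
type listenerPlan struct {
	PublicHTTP   string
	AuthGRPC     string
	AdminHTTP    string
	AdminEnabled bool
}

// planListeners applies the documented defaults (:8080 and :9090) and
// enables the admin listener only when its address is non-empty.
func planListeners(getenv func(string) string) listenerPlan {
	orDefault := func(name, def string) string {
		if v := getenv(name); v != "" {
			return v
		}
		return def
	}
	admin := getenv("GATEWAY_ADMIN_HTTP_ADDR")
	return listenerPlan{
		PublicHTTP:   orDefault("GATEWAY_PUBLIC_HTTP_ADDR", ":8080"),
		AuthGRPC:     orDefault("GATEWAY_AUTHENTICATED_GRPC_ADDR", ":9090"),
		AdminHTTP:    admin,
		AdminEnabled: admin != "",
	}
}

func main() {
	fmt.Printf("%+v\n", planListeners(os.Getenv))
}
```

Running this with the same environment as the gateway gives the set of addresses to compare against the listener startup logs.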
Known startup caveats:
- public auth routes stay mounted without an upstream adapter and return 503 service_unavailable;
- authenticated gRPC starts with an empty static router, so ExecuteCommand returns gRPC UNIMPLEMENTED until downstream routes are injected.
Readiness
Use the probes according to what they actually guarantee:
- GET /healthz confirms that the public HTTP listener is alive;
- GET /readyz confirms that the current process is ready to serve public HTTP traffic;
- GET /metrics is available only on the optional admin listener.
/readyz is process-local. It does not confirm:
- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.
For a practical readiness check in production:
- confirm the process emitted startup logs for the public and authenticated listeners;
- check GET /healthz;
- check GET /readyz;
- if admin HTTP is enabled, scrape GET /metrics;
- verify the expected Redis deployment and stream names from config.
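The two probe steps in the list above can be scripted. The sketch below assumes both probes answer 200 OK when healthy, which the runbook implies but does not state, and defaults to http://localhost:8080 as a stand-in for the public listener. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// checkProbes calls get for each path in order and fails on the first
// transport error or non-200 status. Taking get as a function keeps the
// check independent of a live gateway.
func checkProbes(get func(path string) (int, error), paths []string) error {
	for _, p := range paths {
		status, err := get(p)
		if err != nil {
			return fmt.Errorf("GET %s: %w", p, err)
		}
		if status != http.StatusOK {
			return fmt.Errorf("GET %s: unexpected status %d", p, status)
		}
	}
	return nil
}

// httpGetter builds a getter that issues real HTTP requests with a bounded timeout.
func httpGetter(baseURL string, timeout time.Duration) func(string) (int, error) {
	client := &http.Client{Timeout: timeout}
	return func(path string) (int, error) {
		resp, err := client.Get(baseURL + path)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	base := "http://localhost:8080" // assumed address of the public listener
	if len(os.Args) > 1 {
		base = os.Args[1]
	}
	get := httpGetter(base, 2*time.Second)
	if err := checkProbes(get, []string{"/healthz", "/readyz"}); err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		os.Exit(1)
	}
	fmt.Println("healthz and readyz both returned 200")
}
```

A passing run still says nothing about Redis, the backend, or push fan-out, for the reasons listed under what /readyz does not confirm.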
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by GATEWAY_SHUTDOWN_TIMEOUT;
- internal subscribers are stopped as part of application shutdown;
- the in-memory PushHub is closed before gRPC graceful stop;
- active SubscribeEvents streams terminate with gRPC UNAVAILABLE and message "gateway is shutting down".
During planned restarts:
- send SIGTERM;
- wait for listener shutdown and component-stop logs;
- expect connected clients to reconnect after the gateway closes the stream;
- investigate only if shutdown exceeds GATEWAY_SHUTDOWN_TIMEOUT or streams remain open unexpectedly.
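The idea of a per-component shutdown budget can be illustrated with a short sketch. It is a model of the behavior described above, not the gateway's shutdown code: the component type, the stand-in stop functions, and the 200 ms budget are assumptions, and signal handling is omitted so the example terminates on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// component pairs a name with a stop function that should honor ctx.
type component struct {
	name string
	stop func(ctx context.Context) error
}

// stopWithBudget gives one component at most budget to stop.
func stopWithBudget(c component, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	done := make(chan error, 1) // buffered so a late stop does not leak a blocked send
	go func() { done <- c.stop(ctx) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("%s: shutdown exceeded %s: %w", c.name, budget, ctx.Err())
	}
}

// shutdown stops components in the given order; each gets its own budget.
func shutdown(components []component, budget time.Duration) []error {
	var errs []error
	for _, c := range components {
		if err := stopWithBudget(c, budget); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

// stopImmediately and stopNever are stand-ins for real components.
func stopImmediately(context.Context) error { return nil }
func stopNever(ctx context.Context) error   { <-ctx.Done(); return ctx.Err() }

func main() {
	budget := 200 * time.Millisecond // stand-in for GATEWAY_SHUTDOWN_TIMEOUT
	// Order follows the runbook: the push hub closes before the gRPC graceful stop.
	order := []component{
		{name: "push hub", stop: stopImmediately},
		{name: "grpc server", stop: stopNever},
	}
	for _, err := range shutdown(order, budget) {
		fmt.Println("shutdown problem:", err)
	}
}
```

Because the budget applies to each component separately, total shutdown time can exceed a single GATEWAY_SHUTDOWN_TIMEOUT when several components are slow, which is worth keeping in mind when sizing orchestrator termination grace periods.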
Revoke And Push Failure Triage
Revocation Does Not Take Effect
If a revoked session still sends traffic or keeps an active stream:
- verify that backend recorded the revocation (the /api/v1/internal/sessions/{id} lookup must return status=revoked for that device session);
- verify that backend emitted the corresponding session_invalidation frame on Push.SubscribePush and that the gateway logs a matching subscription closure;
- verify the gateway is connected to the same backend instance via GATEWAY_BACKEND_HTTP_URL / GATEWAY_BACKEND_GRPC_PUSH_URL;
- confirm the next authenticated request from that session is rejected.
Expected gateway behavior after the revoke snapshot is consumed:
- new authenticated requests for that device_session_id fail with gRPC FAILED_PRECONDITION;
- active SubscribeEvents streams for that exact device_session_id close with the same status.
Push Events Are Not Delivered
If a client reports missing push events:
- confirm that the client successfully opened SubscribeEvents;
- confirm the stream received the initial gateway.server_time bootstrap event;
- confirm the gateway consumed the expected pushv1.PushEvent from backend (look for push_dispatcher log lines or grpc_push_events_total increments on the backend side);
- verify user_id and optional device_session_id on the ClientEvent match the intended target;
- confirm the event payload fields are well-formed and not dropped as malformed;
- check whether the stream was closed earlier because of revoke, shutdown, or overflow.
Stream Closed Unexpectedly
Use the terminal gRPC status first:
- FAILED_PRECONDITION with "device session is revoked" means the session was revoked;
- RESOURCE_EXHAUSTED with "push stream overflowed" means the client on that stream did not consume events fast enough and its in-memory queue overflowed;
- UNAVAILABLE with "gateway is shutting down" means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the client or network side.
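The status table above maps directly onto a small triage helper, for example in a log-analysis tool. The sketch below works on the status code name and message as plain strings so that it needs no gRPC dependency; the type and function names are hypothetical, and substring matching on the message is an assumption about how the text appears in logs.

```go
package main

import (
	"fmt"
	"strings"
)

// closeCause is a coarse triage bucket for a terminated SubscribeEvents stream.
type closeCause string

const (
	causeRevoked         closeCause = "session revoked"
	causeOverflow        closeCause = "stream-local queue overflow"
	causeShutdown        closeCause = "normal gateway shutdown"
	causeClientOrNetwork closeCause = "investigate client or network"
)

// classifyClose maps a terminal gRPC status to the cause documented in the runbook.
// Anything it does not recognize falls through to the client/network bucket.
func classifyClose(code, message string) closeCause {
	switch {
	case code == "FAILED_PRECONDITION" && strings.Contains(message, "device session is revoked"):
		return causeRevoked
	case code == "RESOURCE_EXHAUSTED" && strings.Contains(message, "push stream overflowed"):
		return causeOverflow
	case code == "UNAVAILABLE" && strings.Contains(message, "gateway is shutting down"):
		return causeShutdown
	default:
		return causeClientOrNetwork
	}
}

func main() {
	fmt.Println(classifyClose("RESOURCE_EXHAUSTED", "push stream overflowed"))
	fmt.Println(classifyClose("UNAVAILABLE", "connection reset by peer"))
}
```

Matching on both the code and the message matters because UNAVAILABLE is also what clients commonly see for transport failures, which belong to a different triage path than a planned shutdown.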
For overflow incidents:
- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
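Why overflow is stream-local can be seen in a simplified model of per-stream bounded queues. This sketch is an illustration only and not the PushHub implementation: the stream type, its capacity, and the close-on-overflow policy are assumptions chosen to mirror the behavior described above, and the code is not safe for concurrent use.

```go
package main

import "fmt"

// stream models one subscriber with its own bounded in-memory queue.
type stream struct {
	id     string
	queue  chan string
	closed bool
}

func newStream(id string, capacity int) *stream {
	return &stream{id: id, queue: make(chan string, capacity)}
}

// offer enqueues without blocking. If the queue is full the stream is
// closed, which models a RESOURCE_EXHAUSTED termination for that stream only.
func (s *stream) offer(event string) bool {
	if s.closed {
		return false
	}
	select {
	case s.queue <- event:
		return true
	default:
		s.closed = true
		close(s.queue)
		return false
	}
}

// fanOut delivers event to every open stream and returns the ids of
// streams that overflowed on this event.
func fanOut(streams []*stream, event string) []string {
	var overflowed []string
	for _, s := range streams {
		if s.closed {
			continue
		}
		if !s.offer(event) {
			overflowed = append(overflowed, s.id)
		}
	}
	return overflowed
}

func main() {
	slow := newStream("slow-client", 1) // never drained in this example
	fast := newStream("fast-client", 4)
	streams := []*stream{slow, fast}
	for _, ev := range []string{"e1", "e2", "e3"} {
		if ids := fanOut(streams, ev); len(ids) > 0 {
			fmt.Printf("event %s overflowed: %v\n", ev, ids)
		}
	}
	fmt.Println("fast-client queued events:", len(fast.queue))
}
```

The non-blocking send is the key design point: a slow consumer loses its own stream instead of stalling delivery to everyone else, which is why a single overflow report does not indicate a global push outage.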