Operator Runbook
This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and push or revoke incidents.
Startup Checks
Before starting the process, confirm:
- GATEWAY_SESSION_CACHE_REDIS_ADDR points to the Redis deployment used for session lookup and both internal event streams;
- GATEWAY_SESSION_EVENTS_REDIS_STREAM and GATEWAY_CLIENT_EVENTS_REDIS_STREAM reference existing Redis Stream keys or the names publishers will use;
- GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH points to a readable PKCS#8 PEM-encoded Ed25519 private key;
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target environment.
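Part of the checklist above can be scripted before launch. The following is a minimal sketch, not part of the gateway: the function name preflight_problems is hypothetical, it assumes all four variables must be set explicitly, and it only checks that the key file is readable and starts with the unencrypted PKCS#8 PEM label. It does not verify that the key is Ed25519 or that the Redis ACL, DB, TLS, and key-prefix settings match the environment.

```python
import os

# Variables named in the startup checklist; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_VARS = (
    "GATEWAY_SESSION_CACHE_REDIS_ADDR",
    "GATEWAY_SESSION_EVENTS_REDIS_STREAM",
    "GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
    "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
)
KEY_PATH_VAR = "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH"
PKCS8_LABEL = "-----BEGIN PRIVATE KEY-----"


def preflight_problems(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    key_path = env.get(KEY_PATH_VAR)
    if key_path:
        try:
            with open(key_path, "r", encoding="ascii") as handle:
                first_line = handle.readline().strip()
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"cannot read signer key at {key_path}: {exc}")
        else:
            if first_line != PKCS8_LABEL:
                problems.append("signer key does not start with the PKCS#8 PEM label")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print("PREFLIGHT:", problem)
```

Run it in the same environment the gateway will start in, so that the variables and file permissions it sees are the ones the process will see.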
At startup the process performs bounded PING checks for:
- the Redis-backed session cache adapter;
- the replay store;
- the session event subscriber;
- the client event subscriber.
Startup fails fast if any of those checks fail or if the signer key cannot be loaded.
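To reproduce a bounded PING by hand when startup fails, something like the sketch below can help. It is an illustration under assumptions, not the gateway's own check: it speaks plaintext Redis protocol over TCP, sends no AUTH, and the function name bounded_ping is hypothetical. A deployment that requires TLS or ACL credentials will report False here even when Redis is healthy.

```python
import socket


def bounded_ping(host, port, timeout_s=2.0):
    """Send an inline Redis PING and return True only on a +PONG reply.

    The timeout bounds both the connect and the read, so a hung server
    fails the check instead of blocking indefinitely.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return reply.startswith(b"+PONG")
```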
Expected listener state after a healthy start:
- public HTTP is enabled on GATEWAY_PUBLIC_HTTP_ADDR or its default :8080;
- authenticated gRPC is enabled on GATEWAY_AUTHENTICATED_GRPC_ADDR or its default :9090;
- admin HTTP is enabled only when GATEWAY_ADMIN_HTTP_ADDR is non-empty.
Known startup caveats:
- public auth routes stay mounted without an upstream adapter and return 503 service_unavailable;
- authenticated gRPC starts with an empty static router, so ExecuteCommand returns gRPC UNIMPLEMENTED until downstream routes are injected.
Readiness
Use the probes according to what they actually guarantee:
- GET /healthz confirms that the public HTTP listener is alive;
- GET /readyz confirms that the current process is ready to serve public HTTP traffic;
- GET /metrics is available only on the optional admin listener.
/readyz is process-local. It does not confirm:
- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.
For a practical readiness check in production:
- confirm the process emitted startup logs for the public and authenticated listeners;
- check GET /healthz;
- check GET /readyz;
- if admin HTTP is enabled, scrape GET /metrics;
- verify the expected Redis deployment and stream names from config.
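The HTTP part of this checklist can be sketched as a small probe script. The function names and base URLs are assumptions for illustration, and treating HTTP 200 as the success status is also an assumption, since the probes' exact response codes are not specified here. Because /readyz is process-local, a clean report still says nothing about Redis, upstream, or push fan-out health.

```python
import urllib.error
import urllib.request


def probe(base_url, path, timeout_s=2.0):
    """Return the HTTP status for GET base_url+path, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def readiness_report(public_url, admin_url=None):
    """Probe the public endpoints, plus /metrics when an admin URL is given."""
    report = {
        "/healthz": probe(public_url, "/healthz"),
        "/readyz": probe(public_url, "/readyz"),
    }
    if admin_url:
        report["/metrics"] = probe(admin_url, "/metrics")
    return report
```

For a gateway on the default public address, readiness_report("http://127.0.0.1:8080") returns a mapping from each probed path to its status code, with None marking an unreachable listener.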
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- the per-component shutdown budget is controlled by GATEWAY_SHUTDOWN_TIMEOUT;
- internal subscribers are stopped as part of application shutdown;
- the in-memory PushHub is closed before gRPC graceful stop;
- active SubscribeEvents streams terminate with gRPC UNAVAILABLE and message "gateway is shutting down".
During planned restarts:
- send SIGTERM;
- wait for listener shutdown and component-stop logs;
- expect connected clients to reconnect after the gateway closes the stream;
- investigate only if shutdown exceeds GATEWAY_SHUTDOWN_TIMEOUT or streams remain open unexpectedly.
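Where the gateway is launched as a child process of a wrapper script, the restart steps above might look like the sketch below. This is an assumption-laden illustration: graceful_stop is a hypothetical helper, the budget is passed in seconds because the format of GATEWAY_SHUTDOWN_TIMEOUT is not described here, and since that timeout is per component, the total budget you allow should leave headroom above it. Under a service manager such as systemd, use its own stop command instead.

```python
import signal
import subprocess


def graceful_stop(proc, budget_s):
    """Send SIGTERM and wait up to budget_s seconds for the process to exit.

    Returns the exit code, or None if the process is still running when the
    budget expires, which is the case worth investigating.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=budget_s)
    except subprocess.TimeoutExpired:
        return None
```

A None result means shutdown exceeded the allowed budget; check the component-stop logs before escalating to a forced kill.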
Revoke And Push Failure Triage
Revocation Does Not Take Effect
If a revoked session still sends traffic or keeps an active stream:
- verify that the auth/session side published a session snapshot with the same device_session_id and status=revoked;
- verify that the event was written to GATEWAY_SESSION_EVENTS_REDIS_STREAM;
- verify the gateway is connected to the same Redis address, DB, and stream;
- confirm the snapshot fields are complete and well-formed;
- check that a later active snapshot did not overwrite the revoked one.
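The last check, whether a later active snapshot overwrote the revoked one, is easy to get wrong by eye when the stream is busy. The sketch below works on entries you have already read from the session events stream (for example with XRANGE). It assumes device_session_id and status are plain top-level fields of each entry, which may not match the real snapshot encoding, and the function names are hypothetical.

```python
def stream_id_key(entry_id):
    """Turn a Redis Stream ID such as '1700000000000-3' into a sortable tuple."""
    millis, _, seq = entry_id.partition("-")
    return int(millis), int(seq or 0)


def latest_snapshot_status(entries, device_session_id):
    """Return the status of the newest snapshot for one session, or None.

    `entries` is a list of (entry_id, fields) pairs with decoded string
    keys and values, in any order.
    """
    matching = [
        (stream_id_key(entry_id), fields.get("status"))
        for entry_id, fields in entries
        if fields.get("device_session_id") == device_session_id
    ]
    if not matching:
        return None
    # Stream IDs order entries by (milliseconds, sequence number).
    return max(matching, key=lambda item: item[0])[1]
```

If this returns anything other than revoked for the affected session, the revoke snapshot was either never written or was superseded by a later entry.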
Expected gateway behavior after the revoke snapshot is consumed:
- new authenticated requests for that device_session_id fail with gRPC FAILED_PRECONDITION;
- active SubscribeEvents streams for that exact device_session_id close with the same status.
Push Events Are Not Delivered
If a client reports missing push events:
- confirm that the client successfully opened SubscribeEvents;
- confirm the stream received the initial gateway.server_time bootstrap event;
- confirm the gateway consumed the expected entry from GATEWAY_CLIENT_EVENTS_REDIS_STREAM;
- verify user_id and optional device_session_id in the stream entry match the intended target;
- confirm the event payload fields are well-formed and not dropped as malformed;
- check whether the stream was closed earlier because of revoke, shutdown, or overflow.
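The target-matching check can be expressed as a small predicate over a decoded stream entry. This sketch rests on an assumption that is not confirmed here: that an entry without device_session_id addresses every stream of the user, while an entry with one addresses only that session. The function name entry_targets is hypothetical.

```python
def entry_targets(fields, user_id, device_session_id):
    """Return True if a client-event entry is addressed to the given stream.

    `fields` is the decoded field mapping of one stream entry.
    """
    if fields.get("user_id") != user_id:
        return False
    target_session = fields.get("device_session_id")
    # Assumed semantics: a missing or empty session ID means "all sessions".
    return not target_session or target_session == device_session_id
```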
Stream Closed Unexpectedly
Use the terminal gRPC status first:
- FAILED_PRECONDITION with "device session is revoked" means the session was revoked;
- RESOURCE_EXHAUSTED with "push stream overflowed" means that stream stopped consuming fast enough and its in-memory queue overflowed;
- UNAVAILABLE with "gateway is shutting down" means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the client or network side.
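For client logs or support tooling, the mapping above can be encoded as a lookup table. This is a minimal sketch: the status names and messages come from the list above, while the function name and the wording of the returned diagnoses are illustrative.

```python
# (gRPC status name, status message) -> short diagnosis
TERMINAL_STATUSES = {
    ("FAILED_PRECONDITION", "device session is revoked"): "session revoked",
    ("RESOURCE_EXHAUSTED", "push stream overflowed"): "stream overflow (slow consumer)",
    ("UNAVAILABLE", "gateway is shutting down"): "normal gateway shutdown",
}


def diagnose_stream_close(code, message):
    """Map a terminal gRPC status to a diagnosis from the runbook."""
    return TERMINAL_STATUSES.get(
        (code, message),
        "not a documented gateway close; check client or network side",
    )
```

Matching on both the code and the message matters because UNAVAILABLE in particular is also what many gRPC clients report for plain transport failures.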
For overflow incidents:
- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.