# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.

## Startup Checks

Before starting the process, confirm:

- `GATEWAY_REDIS_MASTER_ADDR` and `GATEWAY_REDIS_PASSWORD` point to the Redis
  deployment used for session lookup, replay reservations, session-events
  consumption, and client-events fan-out. Optional read replicas may be
  listed in `GATEWAY_REDIS_REPLICA_ADDRS` (currently unused; reserved for
  future read-routing).
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and
  `GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys
  or the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8
  PEM-encoded Ed25519 private key.
- the configured Redis DB and key-prefix settings match the target
  environment. Per `ARCHITECTURE.md §Persistence Backends`, Redis traffic is
  password-protected and TLS is disabled by policy; the deprecated
  `GATEWAY_REDIS_TLS_ENABLED` and `GATEWAY_REDIS_USERNAME` variables are no
  longer accepted and cause a hard fail at startup.

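The environment checks above can be scripted before the process is started. The following is a minimal operator-side sketch, not the gateway's own validation. The variable names come from this runbook; treating the three entries in `required` as mandatory, and treating any presence of a deprecated variable (even an empty one) as a failure, are assumptions of the sketch. `checkEnv` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
)

// required lists variables this sketch ASSUMES must be non-empty.
var required = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// deprecated lists variables the runbook says are no longer accepted.
var deprecated = []string{
	"GATEWAY_REDIS_TLS_ENABLED",
	"GATEWAY_REDIS_USERNAME",
}

// checkEnv returns one human-readable problem per violated rule.
// lookup has the same shape as os.LookupEnv so it can be faked in tests.
func checkEnv(lookup func(string) (string, bool)) []string {
	var problems []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || v == "" {
			problems = append(problems, name+" is not set")
		}
	}
	for _, name := range deprecated {
		if _, ok := lookup(name); ok {
			problems = append(problems, name+" is deprecated and must be unset")
		}
	}
	return problems
}

func main() {
	problems := checkEnv(os.LookupEnv)
	if len(problems) == 0 {
		fmt.Println("preflight: environment looks consistent")
		return
	}
	for _, p := range problems {
		fmt.Println("preflight:", p)
	}
}
```

The sketch only inspects the environment. It does not confirm that the Redis DB, key prefix, or stream names match the target deployment, which still needs a manual comparison against config.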

At startup the process opens one shared `*redis.Client` (instrumented via
OpenTelemetry tracing and metrics) and performs one bounded `PING`. The
session cache, replay store, session-events subscriber, and client-events
subscriber all use that client.

Startup fails fast if the ping fails or if the signer key cannot be loaded.


Expected listener state after a healthy start:

- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
  `GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.

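The listener rules above amount to two defaults and one optional address. As a rough sketch of how an operator tool could compute the expected listener set from the environment: the variable names and defaults come from this runbook, while the `listeners` type and `resolveListeners` function are hypothetical, and treating an empty value the same as an unset one is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// listeners holds the addresses an operator should expect to be bound.
type listeners struct {
	PublicHTTP string
	AuthGRPC   string
	AdminHTTP  string // empty means the admin listener is disabled
}

// resolveListeners applies the documented defaults.
// getenv has the same shape as os.Getenv so it can be faked in tests.
func resolveListeners(getenv func(string) string) listeners {
	l := listeners{
		PublicHTTP: getenv("GATEWAY_PUBLIC_HTTP_ADDR"),
		AuthGRPC:   getenv("GATEWAY_AUTHENTICATED_GRPC_ADDR"),
		AdminHTTP:  getenv("GATEWAY_ADMIN_HTTP_ADDR"),
	}
	if l.PublicHTTP == "" {
		l.PublicHTTP = ":8080"
	}
	if l.AuthGRPC == "" {
		l.AuthGRPC = ":9090"
	}
	return l
}

func main() {
	l := resolveListeners(os.Getenv)
	fmt.Println("public HTTP:", l.PublicHTTP)
	fmt.Println("authenticated gRPC:", l.AuthGRPC)
	if l.AdminHTTP == "" {
		fmt.Println("admin HTTP: disabled")
	} else {
		fmt.Println("admin HTTP:", l.AdminHTTP)
	}
}
```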

Known startup caveats:

- public auth routes stay mounted without an upstream adapter and return
  `503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
  returns gRPC `UNIMPLEMENTED` until downstream routes are injected.

## Readiness

Use the probes according to what they actually guarantee:

- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
  traffic;
- `GET /metrics` is available only on the optional admin listener.

`/readyz` is process-local. It does not confirm:

- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.

For a practical readiness check in production:

1. confirm the process emitted startup logs for the public and authenticated
   listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.

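Steps 2 and 3 of the list above are easy to automate. The sketch below probes the public listener in the documented order and stops at the first failure. It assumes a passing probe answers with HTTP 200, which this runbook does not state explicitly, and it hard-codes `http://127.0.0.1:8080` as a stand-in for the public address. `checkProbes` and `httpGetter` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// httpGetter returns a function that fetches base+path and reports the status code.
func httpGetter(client *http.Client, base string) func(string) (int, error) {
	return func(path string) (int, error) {
		resp, err := client.Get(base + path)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		return resp.StatusCode, nil
	}
}

// checkProbes calls get for each path in order and returns the first failure.
// ASSUMPTION: any status other than 200 counts as a failed probe.
func checkProbes(get func(string) (int, error), paths ...string) error {
	for _, p := range paths {
		code, err := get(p)
		if err != nil {
			return fmt.Errorf("GET %s: %w", p, err)
		}
		if code != http.StatusOK {
			return fmt.Errorf("GET %s: unexpected status %d", p, code)
		}
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	public := httpGetter(client, "http://127.0.0.1:8080") // placeholder address
	if err := checkProbes(public, "/healthz", "/readyz"); err != nil {
		fmt.Println("public probes failed:", err)
		return
	}
	fmt.Println("public probes OK")
}
```

A passing result here says nothing about Redis, the auth upstream, or downstream services, since `/readyz` is process-local. When the admin listener is enabled, the same `checkProbes` call can be pointed at its base URL with `/metrics`.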

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- the per-component shutdown budget is controlled by
  `GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and
  message `gateway is shutting down`.


During planned restarts:

1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if shutdown exceeds `GATEWAY_SHUTDOWN_TIMEOUT` or streams
   remain open unexpectedly.

## Revoke And Push Failure Triage

### Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

1. verify that the auth/session side published a session snapshot with the
   same `device_session_id` and `status=revoked`;
2. verify that the event was written to
   `GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.

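Step 5 is the easiest to get wrong by eye when the stream is busy. Given snapshots already exported from the session-events stream in oldest-first order, the check reduces to "is there a non-revoked snapshot after the last revoked one for this session". The sketch below shows that logic only. The `snapshot` struct is a hypothetical in-memory shape carrying the two fields this runbook names (`device_session_id`, `status`); how entries are read from Redis is out of scope.

```go
package main

import "fmt"

// snapshot mirrors the two fields the runbook names; the struct itself is hypothetical.
type snapshot struct {
	DeviceSessionID string
	Status          string
}

// overwrittenAfterRevoke reports whether, for the given session, a revoked
// snapshot is followed by any later snapshot whose status is not "revoked".
// entries must be ordered oldest first.
func overwrittenAfterRevoke(entries []snapshot, id string) bool {
	revoked := false
	for _, e := range entries {
		if e.DeviceSessionID != id {
			continue
		}
		if e.Status == "revoked" {
			revoked = true
		} else if revoked {
			return true
		}
	}
	return false
}

func main() {
	// Sample data for illustration only.
	entries := []snapshot{
		{"sess-1", "active"},
		{"sess-1", "revoked"},
		{"sess-2", "active"},
		{"sess-1", "active"},
	}
	fmt.Println(overwrittenAfterRevoke(entries, "sess-1"))
	fmt.Println(overwrittenAfterRevoke(entries, "sess-2"))
}
```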

Expected gateway behavior after the revoke snapshot is consumed:

- new authenticated requests for that `device_session_id` fail with gRPC
  `FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
  with the same status.

### Push Events Are Not Delivered

If a client reports missing push events:

1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap
   event;
3. confirm the gateway consumed the expected entry from
   `GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match
   the intended target;
5. confirm the event payload fields are well-formed and not dropped as
   malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or
   overflow.

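For step 4, the matching rule can be written down explicitly so it is applied the same way in every incident. The sketch below assumes that an entry with an empty `device_session_id` is addressed to every session of the user, and that a non-empty one is addressed to that session only. This runbook describes the field as optional but does not spell out the semantics, so confirm that assumption against the gateway's documentation. `clientEvent` and `targets` are hypothetical names.

```go
package main

import "fmt"

// clientEvent carries the two routing fields the runbook names.
type clientEvent struct {
	UserID          string
	DeviceSessionID string // optional; empty is ASSUMED to mean "all sessions of the user"
}

// targets reports whether the entry is addressed to the given subscriber.
func targets(e clientEvent, userID, deviceSessionID string) bool {
	if e.UserID != userID {
		return false
	}
	return e.DeviceSessionID == "" || e.DeviceSessionID == deviceSessionID
}

func main() {
	// Sample data for illustration only.
	entry := clientEvent{UserID: "user-42", DeviceSessionID: "sess-1"}
	fmt.Println(targets(entry, "user-42", "sess-1"))
	fmt.Println(targets(entry, "user-42", "sess-2"))
}
```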

### Stream Closed Unexpectedly

Use the terminal gRPC status first:

- `FAILED_PRECONDITION` with `device session is revoked` means the session was
  revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client on that
  stream did not consume events fast enough and its in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
  client or network side.

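The status table above can be encoded as a small lookup for use in log tooling or client diagnostics. This sketch takes the status code by name and the message as plain strings so that it has no gRPC dependency. The code and message pairs are the ones listed in this runbook; the `diagnose` function and its return strings are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// diagnose maps a terminal gRPC status (code name plus message) to the
// triage hint from the runbook. Unknown combinations fall through to the
// client/network hint.
func diagnose(code, message string) string {
	switch {
	case code == "FAILED_PRECONDITION" && strings.Contains(message, "device session is revoked"):
		return "session was revoked"
	case code == "RESOURCE_EXHAUSTED" && strings.Contains(message, "push stream overflowed"):
		return "stream-local overflow: client consumed too slowly"
	case code == "UNAVAILABLE" && strings.Contains(message, "gateway is shutting down"):
		return "normal gateway shutdown: client should reconnect"
	default:
		return "investigate on the client or network side"
	}
}

func main() {
	fmt.Println(diagnose("RESOURCE_EXHAUSTED", "push stream overflowed"))
	fmt.Println(diagnose("CANCELLED", "context canceled"))
}
```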

For overflow incidents:

- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.