# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
readiness, shutdown, and push or revoke incidents.
## Startup Checks
Before starting the process, confirm:
- `GATEWAY_SESSION_CACHE_REDIS_ADDR` points to the Redis deployment used for
session lookup and both internal event streams.
- `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and
`GATEWAY_CLIENT_EVENTS_REDIS_STREAM` reference existing Redis Stream keys or
the names publishers will use.
- `GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH` points to a readable PKCS#8
PEM-encoded Ed25519 private key.
- the configured Redis ACL, DB, TLS, and key-prefix settings match the target
environment.
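The file and variable checks above can be scripted before launch. The following is a minimal sketch, not the gateway's own validation: it assumes all four variables must be set explicitly, and it only inspects the PEM label, so it cannot confirm that the key is actually Ed25519. The name `preflight` is hypothetical.

```python
# Variables named in the checklist above; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_VARS = (
    "GATEWAY_SESSION_CACHE_REDIS_ADDR",
    "GATEWAY_SESSION_EVENTS_REDIS_STREAM",
    "GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
    "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
)

# PEM label used by unencrypted PKCS#8 private keys (RFC 7468).
PKCS8_HEADER = "-----BEGIN PRIVATE KEY-----"


def preflight(env):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    key_path = env.get("GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH")
    if key_path:
        try:
            with open(key_path, "r", encoding="ascii") as fh:
                first_line = fh.readline().strip()
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"signer key is not readable: {exc}")
        else:
            if first_line != PKCS8_HEADER:
                problems.append("signer key does not look like PKCS#8 PEM")
    return problems
```

Calling `preflight(os.environ)` from the deployment environment returns an empty list when nothing obvious is wrong. Redis ACL, DB, TLS, and key-prefix settings still need to be compared against the target environment by hand.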
At startup the process performs bounded `PING` checks for:
- the Redis-backed session cache adapter;
- the replay store;
- the session event subscriber;
- the client event subscriber.

Startup fails fast if any of those checks fails or if the signer key cannot be
loaded.
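To reproduce a bounded `PING` by hand when startup fails, something like the sketch below may help. It is not the gateway's implementation: it assumes a plaintext connection without `AUTH`, so it will report failure against TLS-only or ACL-protected deployments. The function name and the two-second default are hypothetical.

```python
import socket


def redis_ping(host, port, timeout_s=2.0):
    """Send a RESP-encoded PING and report whether the server answered +PONG.

    The timeout bounds both the connect and the read, so the check cannot
    hang. Assumes no TLS and no AUTH (an assumption of this sketch).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.settimeout(timeout_s)
            conn.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return reply.startswith(b"+PONG")
```

A `False` result from the host that runs the gateway points at addressing, network, or auth settings rather than at the gateway itself.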
Expected listener state after a healthy start:
- public HTTP is enabled on `GATEWAY_PUBLIC_HTTP_ADDR` or its default `:8080`;
- authenticated gRPC is enabled on
`GATEWAY_AUTHENTICATED_GRPC_ADDR` or its default `:9090`;
- admin HTTP is enabled only when `GATEWAY_ADMIN_HTTP_ADDR` is non-empty.
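The listener rules above reduce to a small lookup. This sketch assumes that an empty value is treated the same as an unset one for the public and gRPC listeners, which the runbook does not state explicitly. `expected_listeners` is a hypothetical helper.

```python
def expected_listeners(env):
    """Map listener name to the address it should bind, or None if disabled."""
    return {
        "public_http": env.get("GATEWAY_PUBLIC_HTTP_ADDR") or ":8080",
        "authenticated_grpc": env.get("GATEWAY_AUTHENTICATED_GRPC_ADDR") or ":9090",
        # Admin HTTP has no default: an empty or unset value disables it.
        "admin_http": env.get("GATEWAY_ADMIN_HTTP_ADDR") or None,
    }
```

Comparing this mapping with the listener addresses in the startup logs is a quick way to spot a misconfigured or unexpectedly disabled listener.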
Known startup caveats:
- when no upstream adapter is configured, public auth routes stay mounted and
return `503 service_unavailable`;
- authenticated gRPC starts with an empty static router, so `ExecuteCommand`
returns gRPC `UNIMPLEMENTED` until downstream routes are injected.
## Readiness
Use the probes according to what they actually guarantee:
- `GET /healthz` confirms that the public HTTP listener is alive;
- `GET /readyz` confirms that the current process is ready to serve public HTTP
traffic;
- `GET /metrics` is available only on the optional admin listener.

`/readyz` is process-local. It does not confirm:
- downstream business-service reachability;
- auth upstream adapter reachability;
- Redis health after startup;
- push fan-out health.

For a practical readiness check in production:
1. confirm the process emitted startup logs for the public and authenticated
listeners;
2. check `GET /healthz`;
3. check `GET /readyz`;
4. if admin HTTP is enabled, scrape `GET /metrics`;
5. verify the expected Redis deployment and stream names from config.
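Steps 2 to 4 of the checklist can be automated with the standard library. The sketch below only reports the HTTP status of each probe. It does not assume which status a healthy gateway returns, and the helper names are hypothetical.

```python
import http.client


def probe(host, port, path, timeout_s=2.0):
    """Return the HTTP status for GET path, or None if the listener is unreachable."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()


def readiness_report(public, admin=None):
    """Probe the public listener and, if given, the admin listener.

    `public` and `admin` are (host, port) pairs; pass admin=None when
    admin HTTP is disabled.
    """
    report = {path: probe(*public, path) for path in ("/healthz", "/readyz")}
    if admin is not None:
        report["/metrics"] = probe(*admin, "/metrics")
    return report
```

For example, `readiness_report(("127.0.0.1", 8080))` covers a gateway on the default public address with admin HTTP disabled. Because `/readyz` is process-local, a clean report still says nothing about Redis or downstream health.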
## Shutdown
The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:
- the per-component shutdown budget is controlled by
`GATEWAY_SHUTDOWN_TIMEOUT`;
- internal subscribers are stopped as part of application shutdown;
- the in-memory `PushHub` is closed before gRPC graceful stop;
- active `SubscribeEvents` streams terminate with gRPC `UNAVAILABLE` and
message `gateway is shutting down`.

During planned restarts:
1. send `SIGTERM`;
2. wait for listener shutdown and component-stop logs;
3. expect connected clients to reconnect after the gateway closes the stream;
4. investigate only if a component takes longer than
`GATEWAY_SHUTDOWN_TIMEOUT` to stop or streams remain open unexpectedly.
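A supervisor script that launched the gateway itself could drive a planned restart as sketched below. This assumes the caller holds the `subprocess.Popen` handle; `budget_s` is the operator's own wait limit, and because `GATEWAY_SHUTDOWN_TIMEOUT` applies per component, a sensible limit is likely larger than that single value. `stop_gracefully` is a hypothetical helper.

```python
import signal
import subprocess


def stop_gracefully(proc, budget_s):
    """Send SIGTERM to a subprocess.Popen and wait up to budget_s seconds.

    Returns the exit code, or None if the process is still running, which
    is the cue to investigate rather than to kill blindly.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=budget_s)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result corresponds to step 4 above: check the component-stop logs to see which component did not finish inside its budget.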
## Revoke And Push Failure Triage
### Revocation Does Not Take Effect
If a revoked session still sends traffic or keeps an active stream:
1. verify that the auth/session side published a session snapshot with the
same `device_session_id` and `status=revoked`;
2. verify that the event was written to
`GATEWAY_SESSION_EVENTS_REDIS_STREAM`;
3. verify the gateway is connected to the same Redis address, DB, and stream;
4. confirm the snapshot fields are complete and well-formed;
5. check that a later active snapshot did not overwrite the revoked one.

Expected gateway behavior after the revoke snapshot is consumed:
- new authenticated requests for that `device_session_id` fail with gRPC
`FAILED_PRECONDITION`;
- active `SubscribeEvents` streams for that exact `device_session_id` close
with the same status.
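Step 5 of the revocation checklist is easy to get wrong by eye when a stream holds many snapshots. The sketch below replays entries that have already been read from `GATEWAY_SESSION_EVENTS_REDIS_STREAM` and reports the last status seen for one session. It assumes each entry is a flat field mapping in stream order and that `active` is the literal status value; only `device_session_id` and `status=revoked` come from this runbook.

```python
def effective_status(entries, device_session_id):
    """Return the status of the last snapshot for a session, or None.

    `entries` is an iterable of field dicts in stream order (oldest first).
    """
    status = None
    for fields in entries:
        if fields.get("device_session_id") == device_session_id:
            status = fields.get("status")
    return status


# Hypothetical data: the third entry overwrites the revoke.
entries = [
    {"device_session_id": "ds-1", "status": "active"},
    {"device_session_id": "ds-1", "status": "revoked"},
    {"device_session_id": "ds-1", "status": "active"},
]
print(effective_status(entries, "ds-1"))
```

If the result is anything other than `revoked`, the problem lies with the publisher's ordering rather than with the gateway's consumption.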
### Push Events Are Not Delivered
If a client reports missing push events:
1. confirm that the client successfully opened `SubscribeEvents`;
2. confirm the stream received the initial `gateway.server_time` bootstrap
event;
3. confirm the gateway consumed the expected entry from
`GATEWAY_CLIENT_EVENTS_REDIS_STREAM`;
4. verify `user_id` and optional `device_session_id` in the stream entry match
the intended target;
5. confirm the event payload is well-formed and that the gateway did not drop
the entry as malformed;
6. check whether the stream was closed earlier because of revoke, shutdown, or
overflow.
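The targeting check in step 4 can be expressed as a predicate over one stream entry. This is a sketch under an assumption the runbook does not spell out: an entry without `device_session_id` is taken to address every session of the user. `targets_stream` is a hypothetical name.

```python
def targets_stream(entry, user_id, device_session_id):
    """Return True if a client-event entry is addressed to the given stream.

    An entry without device_session_id is assumed to target every session
    of the user; with one, it targets only that session.
    """
    if entry.get("user_id") != user_id:
        return False
    wanted = entry.get("device_session_id")
    return not wanted or wanted == device_session_id
```

Running this over the entries consumed around the reported time separates "the event was never addressed to this client" from "the event was addressed correctly but not delivered".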
### Stream Closed Unexpectedly
Use the terminal gRPC status first:
- `FAILED_PRECONDITION` with `device session is revoked` means the session was
revoked;
- `RESOURCE_EXHAUSTED` with `push stream overflowed` means the client on that
stream was not consuming fast enough and the stream's in-memory queue overflowed;
- `UNAVAILABLE` with `gateway is shutting down` means normal process shutdown;
- client-side cancellation or transport errors should be investigated on the
client or network side.

For overflow incidents:
- treat the issue as stream-local, not a global push outage;
- inspect client receive behavior and reconnect logic;
- look at push metrics and logs around the affected user/session.
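For log tooling or client-side diagnostics, the terminal statuses listed in this section can be kept in a small lookup table. The sketch below uses only the status and message pairs named above; the advice strings and the fallback wording are illustrative, and any pair not in the table is treated as unrecognized rather than guessed at.

```python
# (gRPC status name, message) pairs taken from this runbook.
TRIAGE = {
    ("FAILED_PRECONDITION", "device session is revoked"):
        "session was revoked",
    ("RESOURCE_EXHAUSTED", "push stream overflowed"):
        "stream-local overflow: inspect client receive behavior and reconnect logic",
    ("UNAVAILABLE", "gateway is shutting down"):
        "normal gateway shutdown: client should reconnect",
}

FALLBACK = "unrecognized close: investigate on the client or network side"


def triage(status, message):
    """Map a terminal gRPC status and message to a first triage step."""
    return TRIAGE.get((status, message), FALLBACK)
```

Matching on both the status and the message keeps a generic `UNAVAILABLE` from a transport failure from being mistaken for a planned shutdown.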