galaxy-game/gateway/docs/runbook.md
2026-04-02 19:18:42 +02:00

Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and push or revoke incidents.

Startup Checks

Before starting the process, confirm:

  • GATEWAY_SESSION_CACHE_REDIS_ADDR points to the Redis deployment used for session lookup and both internal event streams.
  • GATEWAY_SESSION_EVENTS_REDIS_STREAM and GATEWAY_CLIENT_EVENTS_REDIS_STREAM reference existing Redis Stream keys or the names publishers will use.
  • GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH points to a readable PKCS#8 PEM-encoded Ed25519 private key.
  • the configured Redis ACL, DB, TLS, and key-prefix settings match the target environment.
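
The pre-start checks above can be partly automated. The following is a minimal sketch, not the gateway's own validation: the function name `preflight` is hypothetical, treating all four variables as required is an assumption, and the key check only looks at the PEM label. It does not prove the key is Ed25519, and it does not touch Redis.

```python
from typing import Mapping

# Variable names come from this runbook; treating all of them as required
# is an assumption of this sketch.
REQUIRED_VARS = (
    "GATEWAY_SESSION_CACHE_REDIS_ADDR",
    "GATEWAY_SESSION_EVENTS_REDIS_STREAM",
    "GATEWAY_CLIENT_EVENTS_REDIS_STREAM",
    "GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
)

# Unencrypted PKCS#8 keys use this PEM label regardless of key algorithm.
PKCS8_HEADER = "-----BEGIN PRIVATE KEY-----"


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = [
        f"{name} is not set"
        for name in REQUIRED_VARS
        if not env.get(name, "").strip()
    ]

    key_path = env.get("GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH", "").strip()
    if key_path:
        try:
            with open(key_path, "r", encoding="ascii") as handle:
                first_line = handle.readline().strip()
        except (OSError, UnicodeDecodeError) as exc:
            problems.append(f"signer key is not readable: {exc}")
        else:
            if first_line != PKCS8_HEADER:
                problems.append("signer key does not start with a PKCS#8 PEM header")
    return problems
```

Calling `preflight(os.environ)` in the environment the gateway will run in prints nothing to fix when the list is empty. The gateway's own startup check remains the authority on whether the key actually loads.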

At startup the process performs bounded PING checks for:

  • the Redis-backed session cache adapter;
  • the replay store;
  • the session event subscriber;
  • the client event subscriber.

Startup fails fast if any of those checks fail or if the signer key cannot be loaded.
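
To reproduce a bounded PING from the operator side, a sketch along these lines may help. It is not the gateway's adapter code: `redis_ping` is a hypothetical helper that sends a plain inline `PING` over TCP, so it assumes a deployment without TLS and without required authentication. A server that requires auth answers with an error, which this sketch reports as a failure.

```python
import socket


def redis_ping(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Send an inline PING and report whether the server answered +PONG.

    Each socket operation (connect, send, receive) is bounded by timeout_s.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.settimeout(timeout_s)
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return reply.startswith(b"+PONG")
```

A `False` result against the address in GATEWAY_SESSION_CACHE_REDIS_ADDR suggests the gateway's own startup checks would fail for the same reason.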

Expected listener state after a healthy start:

  • public HTTP is enabled on GATEWAY_PUBLIC_HTTP_ADDR or its default :8080;
  • authenticated gRPC is enabled on GATEWAY_AUTHENTICATED_GRPC_ADDR or its default :9090;
  • admin HTTP is enabled only when GATEWAY_ADMIN_HTTP_ADDR is non-empty.
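
The listener rules above can be expressed as a small lookup, which is handy when comparing config against observed ports. This is a sketch: `expected_listeners` is a hypothetical helper, and treating an empty value the same as an unset one for the two defaults is an assumption.

```python
from typing import Mapping, Optional

# Defaults as stated in this runbook.
DEFAULT_PUBLIC_HTTP_ADDR = ":8080"
DEFAULT_AUTHENTICATED_GRPC_ADDR = ":9090"


def expected_listeners(env: Mapping[str, str]) -> dict[str, Optional[str]]:
    """Map each listener to the address it should bind, or None if disabled."""
    return {
        "public_http": env.get("GATEWAY_PUBLIC_HTTP_ADDR") or DEFAULT_PUBLIC_HTTP_ADDR,
        "authenticated_grpc": env.get("GATEWAY_AUTHENTICATED_GRPC_ADDR")
        or DEFAULT_AUTHENTICATED_GRPC_ADDR,
        # Admin HTTP has no default: it is enabled only when the address is non-empty.
        "admin_http": env.get("GATEWAY_ADMIN_HTTP_ADDR") or None,
    }
```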

Known startup caveats:

  • public auth routes stay mounted without an upstream adapter and return 503 service_unavailable;
  • authenticated gRPC starts with an empty static router, so ExecuteCommand returns gRPC UNIMPLEMENTED until downstream routes are injected.

Readiness

Use the probes according to what they actually guarantee:

  • GET /healthz confirms that the public HTTP listener is alive;
  • GET /readyz confirms that the current process is ready to serve public HTTP traffic;
  • GET /metrics is available only on the optional admin listener.

/readyz is process-local. It does not confirm:

  • downstream business-service reachability;
  • auth upstream adapter reachability;
  • Redis health after startup;
  • push fan-out health.

For a practical readiness check in production:

  1. confirm the process emitted startup logs for the public and authenticated listeners;
  2. check GET /healthz;
  3. check GET /readyz;
  4. if admin HTTP is enabled, scrape GET /metrics;
  5. verify the expected Redis deployment and stream names from config.
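
Steps 2 to 4 can be scripted. The sketch below uses only the Python standard library; `probe` and `readiness_report` are hypothetical names, and the runbook does not state which status code each endpoint returns on success, so the script reports raw status codes and leaves interpretation to the operator.

```python
import urllib.error
import urllib.request

# Ignore proxy environment variables so the probe talks to the gateway directly.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(base_url: str, path: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of GET base_url + path, or 0 if no response arrived."""
    url = base_url.rstrip("/") + path
    try:
        with _DIRECT.open(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses still carry a status code worth reporting.
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def readiness_report(public_url: str, admin_url: str = "") -> dict[str, int]:
    """Probe /healthz and /readyz, plus /metrics when an admin URL is given."""
    report = {path: probe(public_url, path) for path in ("/healthz", "/readyz")}
    if admin_url:
        report["/metrics"] = probe(admin_url, "/metrics")
    return report
```

For example, `readiness_report("http://127.0.0.1:8080")` covers the public listener only. Because /readyz is process-local, a clean report still says nothing about Redis or downstream services.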

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • the per-component shutdown budget is controlled by GATEWAY_SHUTDOWN_TIMEOUT;
  • internal subscribers are stopped as part of application shutdown;
  • the in-memory PushHub is closed before gRPC graceful stop;
  • active SubscribeEvents streams terminate with gRPC UNAVAILABLE and the message "gateway is shutting down".

During planned restarts:

  1. send SIGTERM;
  2. wait for listener shutdown and component-stop logs;
  3. expect connected clients to reconnect after the gateway closes the stream;
  4. investigate only if shutdown exceeds GATEWAY_SHUTDOWN_TIMEOUT or streams remain open unexpectedly.
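
Steps 1, 2, and 4 amount to "send SIGTERM, then wait with a deadline". A minimal sketch, assuming the gateway was launched as a child process via `subprocess.Popen` on a POSIX host (under systemd or a container orchestrator the supervisor does this instead). `terminate_and_wait` is a hypothetical helper, and `timeout_s` is the operator's overall limit: since GATEWAY_SHUTDOWN_TIMEOUT is a per-component budget, the total may reasonably be chosen larger than that value.

```python
import subprocess


def terminate_and_wait(proc: subprocess.Popen, timeout_s: float) -> bool:
    """Send SIGTERM and report whether the process exited within timeout_s."""
    proc.terminate()  # delivers SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Shutdown is taking longer than expected: time to investigate.
        return False
    return True
```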

Revoke And Push Failure Triage

Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

  1. verify that the auth/session side published a session snapshot with the same device_session_id and status=revoked;
  2. verify that the event was written to GATEWAY_SESSION_EVENTS_REDIS_STREAM;
  3. verify the gateway is connected to the same Redis address, DB, and stream;
  4. confirm the snapshot fields are complete and well-formed;
  5. check that a later active snapshot did not overwrite the revoked one.
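
Step 5 is easy to get wrong by eye when a stream holds many entries. The sketch below assumes the session snapshots have already been exported in stream order (for example from an `XRANGE` dump) into dictionaries with the `device_session_id` and `status` fields named above, and that the last snapshot for a session wins. Both function names are hypothetical.

```python
from typing import Iterable, Mapping, Optional


def effective_status(
    snapshots: Iterable[Mapping[str, str]], device_session_id: str
) -> Optional[str]:
    """Return the status of the last snapshot for the session, or None if absent."""
    status = None
    for snap in snapshots:
        if snap.get("device_session_id") == device_session_id:
            status = snap.get("status")
    return status


def revoke_was_overwritten(
    snapshots: Iterable[Mapping[str, str]], device_session_id: str
) -> bool:
    """True when a revoked snapshot is followed by a later non-revoked one."""
    seen_revoked = False
    overwritten = False
    for snap in snapshots:
        if snap.get("device_session_id") != device_session_id:
            continue
        if snap.get("status") == "revoked":
            seen_revoked = True
            overwritten = False  # a fresh revoke supersedes earlier overwrites
        elif seen_revoked:
            overwritten = True
    return overwritten
```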

Expected gateway behavior after the revoke snapshot is consumed:

  • new authenticated requests for that device_session_id fail with gRPC FAILED_PRECONDITION;
  • active SubscribeEvents streams for that exact device_session_id close with the same status.

Push Events Are Not Delivered

If a client reports missing push events:

  1. confirm that the client successfully opened SubscribeEvents;
  2. confirm the stream received the initial gateway.server_time bootstrap event;
  3. confirm the gateway consumed the expected entry from GATEWAY_CLIENT_EVENTS_REDIS_STREAM;
  4. verify user_id and optional device_session_id in the stream entry match the intended target;
  5. confirm the event payload fields are well-formed and not dropped as malformed;
  6. check whether the stream was closed earlier because of revoke, shutdown, or overflow.
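
Step 4 can be checked mechanically once the stream entry is in hand. This sketch assumes that an entry without `device_session_id` addresses every stream of that user, while an entry with one addresses only that session; this reading of "optional" is an assumption and should be confirmed against the gateway's push contract. `entry_targets_stream` is a hypothetical helper.

```python
from typing import Mapping


def entry_targets_stream(
    entry: Mapping[str, str], user_id: str, device_session_id: str
) -> bool:
    """Report whether a client-event entry addresses the given open stream."""
    if entry.get("user_id") != user_id:
        return False
    target_session = entry.get("device_session_id") or ""
    # Assumption: a missing or empty device_session_id means "all sessions of the user".
    return target_session == "" or target_session == device_session_id
```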

Stream Closed Unexpectedly

Use the terminal gRPC status first:

  • FAILED_PRECONDITION with "device session is revoked" means the session was revoked;
  • RESOURCE_EXHAUSTED with "push stream overflowed" means the client on that stream did not consume events fast enough and the stream's in-memory queue overflowed;
  • UNAVAILABLE with "gateway is shutting down" means normal process shutdown;
  • client-side cancellation or transport errors should be investigated on the client or network side.
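
The mapping above fits in a lookup table, which can be useful in client-side log tooling. A sketch with a hypothetical `classify_termination` helper; status codes are passed as their names, and the messages are the ones listed in this runbook.

```python
# (status code name, status message) -> cause, as listed in this runbook.
KNOWN_TERMINATIONS = {
    ("FAILED_PRECONDITION", "device session is revoked"): "session revoked",
    ("RESOURCE_EXHAUSTED", "push stream overflowed"): "stream-local queue overflow",
    ("UNAVAILABLE", "gateway is shutting down"): "normal gateway shutdown",
}

FALLBACK = "investigate on the client or network side"


def classify_termination(code: str, message: str) -> str:
    """Translate a terminal gRPC status into the cause an operator should act on."""
    key = (code.strip().upper(), message.strip())
    return KNOWN_TERMINATIONS.get(key, FALLBACK)
```

Matching on both code and message is deliberate: an UNAVAILABLE status with a different message is more likely a transport problem than a gateway shutdown.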

For overflow incidents:

  • treat the issue as stream-local, not a global push outage;
  • inspect client receive behavior and reconnect logic;
  • look at push metrics and logs around the affected user/session.
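
Why overflow is stream-local can be seen in a toy model. The sketch below is not the gateway's PushHub; `StreamQueue` and `fan_out` are hypothetical, and the only behavior taken from this runbook is that each stream has its own in-memory queue and that overflow closes that stream with RESOURCE_EXHAUSTED and "push stream overflowed".

```python
from collections import deque
from typing import Optional


class StreamQueue:
    """Bounded per-stream queue that closes itself on overflow."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: deque = deque()
        self.closed_with: Optional[tuple[str, str]] = None

    def offer(self, event: str) -> bool:
        """Enqueue an event; on overflow, close this stream only."""
        if self.closed_with is not None:
            return False
        if len(self.pending) >= self.capacity:
            self.closed_with = ("RESOURCE_EXHAUSTED", "push stream overflowed")
            self.pending.clear()
            return False
        self.pending.append(event)
        return True

    def drain(self) -> list[str]:
        """Simulate the client receiving everything currently queued."""
        items = list(self.pending)
        self.pending.clear()
        return items


def fan_out(queues: dict[str, StreamQueue], event: str) -> None:
    """Deliver one event to every stream; one full queue does not affect the others."""
    for queue in queues.values():
        queue.offer(event)
```

In this model a client that never drains loses its own stream after `capacity` events, while a client that keeps up is unaffected, which is the reason to inspect the affected client first.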