Operator Runbook

This runbook covers the checks that matter most during startup, steady-state readiness, shutdown, and push or revoke incidents.

Startup Checks

Before starting the process, confirm:

  • GATEWAY_REDIS_MASTER_ADDR and GATEWAY_REDIS_PASSWORD point to the Redis deployment used for anti-replay reservations. Optional read replicas may be listed in GATEWAY_REDIS_REPLICA_ADDRS (currently unused; reserved for future read-routing).
  • GATEWAY_BACKEND_HTTP_URL, GATEWAY_BACKEND_GRPC_PUSH_URL, and GATEWAY_BACKEND_GATEWAY_CLIENT_ID describe the consolidated backend service. The gateway forwards every public auth request and every authenticated user/lobby request to that service, and opens its gRPC push subscription against it.
  • GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH points to a readable PKCS#8 PEM-encoded Ed25519 private key.
  • the configured Redis DB and key-prefix settings match the target environment. Per ARCHITECTURE.md §Persistence Backends, Redis traffic is password-protected and TLS is disabled by policy; the deprecated GATEWAY_REDIS_TLS_ENABLED and GATEWAY_REDIS_USERNAME variables are no longer accepted and cause a hard fail at startup.
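
The configuration checks above can be scripted before the process is launched. The following is a minimal sketch, not the gateway's own validation code: the preflight function and its lookup parameter are hypothetical names, and the sketch only checks presence of the variables named in this runbook, not their values.

```go
package main

// requiredVars lists the settings this runbook asks operators to confirm.
var requiredVars = []string{
	"GATEWAY_REDIS_MASTER_ADDR",
	"GATEWAY_REDIS_PASSWORD",
	"GATEWAY_BACKEND_HTTP_URL",
	"GATEWAY_BACKEND_GRPC_PUSH_URL",
	"GATEWAY_BACKEND_GATEWAY_CLIENT_ID",
	"GATEWAY_RESPONSE_SIGNER_PRIVATE_KEY_PEM_PATH",
}

// deprecatedVars lists settings the runbook says are rejected at startup.
var deprecatedVars = []string{
	"GATEWAY_REDIS_TLS_ENABLED",
	"GATEWAY_REDIS_USERNAME",
}

// preflight (hypothetical helper) returns one message per problem found.
// lookup has the same shape as os.LookupEnv so a fake environment can be injected.
func preflight(lookup func(string) (string, bool)) []string {
	var problems []string
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || v == "" {
			problems = append(problems, name+" is not set")
		}
	}
	for _, name := range deprecatedVars {
		if _, ok := lookup(name); ok {
			problems = append(problems, name+" is deprecated and must be unset")
		}
	}
	return problems
}
```

Calling preflight(os.LookupEnv) from a small wrapper and exiting non-zero when the returned slice is non-empty gives a quick gate in a deploy script.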

At startup the process opens one shared *redis.Client (instrumented via OpenTelemetry tracing and metrics) and performs one bounded PING for the replay store. It also dials backend's gRPC push listener and opens one Push.SubscribePush stream that reconnects with capped exponential backoff on failure.

Startup fails fast if the Redis ping fails, the backend URL is malformed, or the signer key cannot be loaded.

Expected listener state after a healthy start:

  • public HTTP is enabled on GATEWAY_PUBLIC_HTTP_ADDR or its default :8080;
  • authenticated gRPC is enabled on GATEWAY_AUTHENTICATED_GRPC_ADDR or its default :9090;
  • admin HTTP is enabled only when GATEWAY_ADMIN_HTTP_ADDR is non-empty.
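
The listener rules above reduce to a small resolution step. This is a sketch under one assumption that the runbook does not confirm: an empty value for the public or authenticated address is treated the same as an unset one. The listeners type and resolveListeners function are hypothetical names.

```go
package main

// listeners holds the resolved listen addresses.
// An empty Admin value means the admin listener is disabled.
type listeners struct {
	PublicHTTP        string
	AuthenticatedGRPC string
	Admin             string
}

// resolveListeners applies the documented defaults. getenv has the same
// shape as os.Getenv so a fake environment can be injected.
func resolveListeners(getenv func(string) string) listeners {
	pick := func(name, def string) string {
		if v := getenv(name); v != "" {
			return v
		}
		return def // assumption: empty is treated as unset
	}
	return listeners{
		PublicHTTP:        pick("GATEWAY_PUBLIC_HTTP_ADDR", ":8080"),
		AuthenticatedGRPC: pick("GATEWAY_AUTHENTICATED_GRPC_ADDR", ":9090"),
		Admin:             getenv("GATEWAY_ADMIN_HTTP_ADDR"), // no default
	}
}
```

Comparing the output of resolveListeners(os.Getenv) with the addresses in the startup logs is a quick way to spot an environment that differs from what the operator expected.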

Known startup caveats:

  • when no upstream adapter is configured, public auth routes stay mounted and return 503 service_unavailable;
  • authenticated gRPC starts with an empty static router, so ExecuteCommand returns gRPC UNIMPLEMENTED until downstream routes are injected.

Readiness

Use the probes according to what they actually guarantee:

  • GET /healthz confirms that the public HTTP listener is alive;
  • GET /readyz confirms that the current process is ready to serve public HTTP traffic;
  • GET /metrics is available only on the optional admin listener.

/readyz is process-local. It does not confirm:

  • downstream business-service reachability;
  • auth upstream adapter reachability;
  • Redis health after startup;
  • push fan-out health.

For a practical readiness check in production:

  1. confirm the process emitted startup logs for the public and authenticated listeners;
  2. check GET /healthz;
  3. check GET /readyz;
  4. if admin HTTP is enabled, scrape GET /metrics;
  5. verify the expected Redis deployment and stream names from config.

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • the per-component shutdown budget is controlled by GATEWAY_SHUTDOWN_TIMEOUT;
  • internal subscribers are stopped as part of application shutdown;
  • the in-memory PushHub is closed before gRPC graceful stop;
  • active SubscribeEvents streams terminate with gRPC UNAVAILABLE and the message "gateway is shutting down".

During planned restarts:

  1. send SIGTERM;
  2. wait for listener shutdown and component-stop logs;
  3. expect connected clients to reconnect after the gateway closes the stream;
  4. investigate only if shutdown exceeds GATEWAY_SHUTDOWN_TIMEOUT or streams remain open unexpectedly.

Revoke And Push Failure Triage

Revocation Does Not Take Effect

If a revoked session still sends traffic or keeps an active stream:

  1. verify that backend recorded the revocation (the /api/v1/internal/sessions/{id} lookup must return status=revoked for that device session);
  2. verify that backend emitted the corresponding session_invalidation frame on Push.SubscribePush and that the gateway logs a matching subscription closure;
  3. verify the gateway is connected to the same backend instance via GATEWAY_BACKEND_HTTP_URL / GATEWAY_BACKEND_GRPC_PUSH_URL;
  4. confirm the next authenticated request from that session is rejected.

Expected gateway behavior after the revoke snapshot is consumed:

  • new authenticated requests for that device_session_id fail with gRPC FAILED_PRECONDITION;
  • active SubscribeEvents streams for that exact device_session_id close with the same status.

Push Events Are Not Delivered

If a client reports missing push events:

  1. confirm that the client successfully opened SubscribeEvents;
  2. confirm the stream received the initial gateway.server_time bootstrap event;
  3. confirm the gateway consumed the expected pushv1.PushEvent from backend (look for push_dispatcher log lines or grpc_push_events_total increments on the backend side);
  4. verify user_id and optional device_session_id on the ClientEvent match the intended target;
  5. confirm the event payload fields are well-formed and not dropped as malformed;
  6. check whether the stream was closed earlier because of revoke, shutdown, or overflow.

Stream Closed Unexpectedly

Use the terminal gRPC status first:

  • FAILED_PRECONDITION with "device session is revoked" means the session was revoked;
  • RESOURCE_EXHAUSTED with "push stream overflowed" means that stream did not consume events fast enough and its in-memory queue overflowed;
  • UNAVAILABLE with "gateway is shutting down" means normal process shutdown;
  • client-side cancellation or transport errors should be investigated on the client or network side.
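
The mapping above is easy to encode in client-side tooling or log analysis. This sketch takes the status code as its canonical upper-case name, as written in this runbook; some client libraries print codes in another form (grpc-go prints FailedPrecondition, for example), so names may need normalising first. The diagnose function and its return strings are hypothetical.

```go
package main

import "strings"

// diagnose maps a terminal gRPC status (canonical code name plus
// message) to the interpretation given in this runbook.
func diagnose(code, message string) string {
	switch {
	case code == "FAILED_PRECONDITION" && strings.Contains(message, "device session is revoked"):
		return "session revoked"
	case code == "RESOURCE_EXHAUSTED" && strings.Contains(message, "push stream overflowed"):
		return "stream-local overflow"
	case code == "UNAVAILABLE" && strings.Contains(message, "gateway is shutting down"):
		return "normal gateway shutdown"
	default:
		// Cancellations and transport errors land here.
		return "investigate client or network side"
	}
}
```

Matching on both the code and the message avoids misreading an UNAVAILABLE that comes from a load balancer or a broken connection as a planned gateway shutdown.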

For overflow incidents:

  • treat the issue as stream-local, not a global push outage;
  • inspect client receive behavior and reconnect logic;
  • look at push metrics and logs around the affected user/session.
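
The reason an overflow is stream-local can be seen in a small model of a bounded per-subscriber queue. This is an illustrative sketch, not the PushHub implementation: the stream type, its capacity, and the offer method are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "errors"

// errOverflow mirrors the status message clients see on overflow.
var errOverflow = errors.New("push stream overflowed")

// stream models one subscriber's bounded in-memory queue.
type stream struct {
	queue  chan string
	closed bool
}

func newStream(capacity int) *stream {
	return &stream{queue: make(chan string, capacity)}
}

// offer enqueues an event without blocking. If the queue is full, only
// this stream is closed; other streams are unaffected.
func (s *stream) offer(event string) error {
	if s.closed {
		return errOverflow
	}
	select {
	case s.queue <- event:
		return nil
	default:
		s.closed = true
		close(s.queue)
		return errOverflow
	}
}
```

A publisher that never blocks on a slow subscriber keeps delivery to every other stream unaffected, which is why one client's RESOURCE_EXHAUSTED does not by itself indicate a global push outage.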