galaxy-game/notification/docs/runbook.md
2026-04-22 08:49:45 +02:00

Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common Notification Service incidents.

Startup Checks

Before starting the process, confirm:

  • NOTIFICATION_REDIS_ADDR points to the Redis deployment that stores notification records, routes, idempotency reservations, malformed intents, dead letters, stream offsets, and route schedules
  • Redis ACL, DB, TLS, and timeout settings match the target environment
  • NOTIFICATION_USER_SERVICE_BASE_URL points to the trusted internal User Service
  • NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM matches the stream consumed by Gateway
  • NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM matches the stream consumed by Mail Service
  • administrator email variables are populated for notification types that should notify administrators
  • OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
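
Part of this checklist can be automated before launch. The following is a minimal sketch, assuming the four variables named above are mandatory and must be non-empty; the service's real validation rules may differ, and `missing_settings` is a hypothetical helper, not part of the service.

```python
import os

# Variable names come from the checklist above; treating all four as
# mandatory is an assumption made for this sketch.
REQUIRED_SETTINGS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

An empty result from `missing_settings()` only shows that the variables are present; it does not prove that they point at the right deployment.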

At startup the process performs a bounded Redis PING. Startup fails fast if configuration validation or Redis connectivity fails.
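
An operator can approximate the same bounded check from the target host. This sketch uses only the Python standard library and sends an inline `PING` over a plain TCP connection. It assumes no TLS and no authentication, so against a deployment with ACLs or TLS it reports failure even when Redis is healthy. The function names and the default timeout are illustrative.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """True when the reply is the RESP simple string +PONG."""
    return reply.strip() == b"+PONG"


def bounded_ping(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Send one inline PING; each socket step waits at most timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"PING\r\n")
            return is_pong(conn.recv(64))
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```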

Known startup caveats:

  • there is no operator API
  • there is no /metrics route
  • traces and metrics are exported only through configured OpenTelemetry exporters
  • readiness is process-local: once startup succeeds, it reflects only the state of this process, so dependency health must be verified separately

Steady-State Verification

Practical readiness verification:

  1. confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
  2. request GET /readyz on NOTIFICATION_INTERNAL_HTTP_ADDR
  3. verify Redis connectivity and OpenTelemetry exporter health out of band
  4. publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream
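
Step 2 can be scripted. The sketch below assumes that NOTIFICATION_INTERNAL_HTTP_ADDR has the form `host:port` or `:port`, that the listener speaks plain HTTP, and that an HTTP 200 response means ready; none of these details are confirmed by this runbook.

```python
import urllib.error
import urllib.request


def readyz_url(internal_http_addr: str) -> str:
    """Build the probe URL; a bare ':port' address is probed on localhost."""
    host, _, port = internal_http_addr.rpartition(":")
    return f"http://{host or '127.0.0.1'}:{port}/readyz"


def is_ready(internal_http_addr: str, timeout_s: float = 2.0) -> bool:
    """Return True only for an HTTP 200 answer within timeout_s."""
    url = readyz_url(internal_http_addr)
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers, refused connections, and timeouts all count as not ready.
        return False
```

Because readiness is process-local, a `True` result here does not replace step 3.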

Expected steady-state signals:

  • notification.route_schedule.depth remains bounded
  • notification.route_schedule.oldest_age_ms stays close to the delays of the active retry ladder
  • notification.intent_stream.oldest_unprocessed_age_ms remains near zero when producers are healthy
  • notification.route.dead_letters changes rarely
  • malformed-intent logs appear only for bad producer input
  • logs include notification_type, producer, audience_kind, and correlation identifiers where present
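
The metric expectations above can be turned into a simple comparison between two snapshots. This is a sketch under stated assumptions: the snapshots are plain dictionaries exported by whatever collector is in use, the first three metrics are gauges, notification.route.dead_letters is a monotonically increasing counter, and every limit is chosen by the operator rather than defined by the service.

```python
GAUGES = (
    "notification.route_schedule.depth",
    "notification.route_schedule.oldest_age_ms",
    "notification.intent_stream.oldest_unprocessed_age_ms",
)
DEAD_LETTERS = "notification.route.dead_letters"


def steady_state_warnings(previous, current, limits):
    """Compare two metric snapshots (name -> number) against operator limits."""
    warnings = []
    for name in GAUGES:
        if current.get(name, 0) > limits[name]:
            warnings.append(f"{name}={current[name]} exceeds {limits[name]}")
    # Assumed counter: only growth between snapshots matters.
    delta = current.get(DEAD_LETTERS, 0) - previous.get(DEAD_LETTERS, 0)
    if delta > limits[DEAD_LETTERS]:
        warnings.append(f"{DEAD_LETTERS} grew by {delta}")
    return warnings
```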

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • coordinated shutdown is bounded by NOTIFICATION_SHUTDOWN_TIMEOUT
  • the private probe listener is stopped before process resources are closed
  • route publishers and the intent consumer stop through context cancellation
  • Redis clients are closed after the app stops
  • OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

  1. send SIGTERM
  2. wait for listener and worker shutdown logs
  3. restart the process with the same Redis, stream, and downstream settings
  4. repeat steady-state verification
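
Steps 1 and 2 can be wrapped in a small supervisor helper. This sketch assumes the process was launched through Python's `subprocess` module and that the caller passes the configured NOTIFICATION_SHUTDOWN_TIMEOUT in seconds; the extra grace period is an arbitrary illustrative value.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, shutdown_timeout_s: float,
                    grace_s: float = 5.0) -> bool:
    """Send SIGTERM and wait; return False if the process outlives the bound."""
    proc.send_signal(signal.SIGTERM)
    try:
        # Wait slightly longer than the service's own shutdown bound.
        proc.wait(timeout=shutdown_timeout_s + grace_s)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A `False` result means the process is still running after the bound and needs manual inspection before it is restarted.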

Incident Triage

Intent Stream Lag Grows

Symptoms:

  • notification.intent_stream.oldest_unprocessed_age_ms increases
  • no matching route records appear for new stream entries
  • consumer logs stop after a specific stream entry

Checks:

  1. inspect the next unprocessed notification:intents entry
  2. confirm User Service is reachable from Notification Service
  3. if the entry is user-targeted, verify every recipient_user_id exists
  4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

  • malformed input is recorded and the offset advances
  • temporary User Service failure stops progress before offset advancement
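
The two rules above explain why lag can grow behind a single entry. The following is an illustrative model of that behavior, not the service's implementation; the exception classes, the `handle` callback, and the `record_malformed` callback are all hypothetical names.

```python
class MalformedIntentError(Exception):
    """Hypothetical marker for input that the producer must fix."""


class TemporaryDependencyError(Exception):
    """Hypothetical marker for a transient User Service failure."""


def consume(entries, handle, record_malformed):
    """Process (stream_id, payload) pairs; return the last committed stream ID."""
    committed = None
    for stream_id, payload in entries:
        try:
            handle(payload)
        except MalformedIntentError as exc:
            # Bad input is recorded and the offset still advances.
            record_malformed(stream_id, str(exc))
        except TemporaryDependencyError:
            # Stop before advancing so the same entry is retried later.
            break
        committed = stream_id
    return committed
```

In this model a malformed entry never blocks the stream, while a dependency outage holds the offset at the entry before the failure. That matches the symptom of consumer logs stopping after a specific stream entry.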

Route Schedule Backlog Grows

Symptoms:

  • notification.route_schedule.depth rises steadily
  • notification.route_schedule.oldest_age_ms increases
  • routes remain in the pending or failed state

Checks:

  1. confirm push and email publisher startup logs are present
  2. confirm Redis latency and connectivity
  3. verify route IDs match the expected push: or email: prefixes
  4. confirm the downstream stream names match Gateway and Mail Service
  5. inspect route last_error_classification

Dead-Letter Spikes

Symptoms:

  • notification.route.dead_letters increases rapidly
  • route records show repeated payload_encoding_failed, gateway_stream_publish_failed, or mail_stream_publish_failed

Checks:

  1. inspect the dead-letter entry and owning route
  2. verify payload fields still match the notification catalog
  3. confirm downstream Redis stream writes are accepted
  4. compare failures across channels to isolate Gateway-specific or Mail-specific issues
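
To compare failures across channels, the classifications seen on dead-lettered routes can be tallied and paired with the matching check. The classification names below come from the symptoms above; the suggested next steps are this sketch's reading of the checks, not behavior of the service.

```python
# Hypothetical mapping from classification to the most relevant check.
NEXT_CHECK = {
    "payload_encoding_failed": "compare payload fields with the notification catalog",
    "gateway_stream_publish_failed": "verify writes to the Gateway stream",
    "mail_stream_publish_failed": "verify writes to the Mail Service stream",
}


def triage(classifications):
    """Return (classification, count, next check) tuples, most frequent first."""
    counts = {}
    for name in classifications:
        counts[name] = counts.get(name, 0) + 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (name, n, NEXT_CHECK.get(name, "inspect the dead-letter entry"))
        for name, n in ordered
    ]
```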

Recovery:

  1. correct the downstream dependency or payload problem
  2. publish a new compatible intent with a new producer-owned idempotency_key
  3. keep the old dead-letter record untouched as audit history
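
Step 2 amounts to copying the original intent and replacing only its idempotency key. This sketch assumes the intent is available as a dictionary with an `idempotency_key` field and uses a random UUID as the new key. The key format is owned by the producer, so the UUID is only a placeholder choice.

```python
import copy
import uuid


def replacement_intent(original: dict) -> dict:
    """Copy an intent and give it a fresh idempotency_key; the input is not modified."""
    intent = copy.deepcopy(original)
    intent["idempotency_key"] = str(uuid.uuid4())  # assumed key format
    return intent
```

Leaving the original dictionary unmodified mirrors step 3: the failed attempt stays intact as audit history.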

Missing Administrator Mail

Symptoms:

  • administrator notification type is accepted
  • no email command reaches mail:delivery_commands
  • route is skipped with recipient config:<notification_type>

Checks:

  1. inspect the type-specific administrator email environment variable
  2. confirm addresses are normalized single email addresses without display names
  3. restart the process after configuration changes
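
Check 2 can be approximated with the standard library. This sketch treats "normalized" as meaning no display name, no surrounding whitespace, and exactly one address per value. That interpretation is an assumption, and the service's own validation may be stricter.

```python
from email.utils import parseaddr


def is_plain_address(value: str) -> bool:
    """True for a bare 'local@domain' string with no display name or padding."""
    display_name, address = parseaddr(value)
    if display_name or address != value:
        return False
    local, sep, domain = address.partition("@")
    return bool(local and sep and domain) and "@" not in domain
```

For example, `is_plain_address("Ops Team <ops@example.com>")` returns `False` because of the display name, while `is_plain_address("ops@example.com")` returns `True`.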

Expected behavior:

  • empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible

Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside Notification Service.

Checks:

  1. inspect Auth / Session Service -> Mail Service logs and delivery records
  2. confirm notification:intents remains unused for auth-code delivery
  3. do not replay auth-code mail through Notification Service