Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common Notification Service incidents.

Startup Checks

Before starting the process, confirm:

  • NOTIFICATION_REDIS_MASTER_ADDR points to the Redis master deployment that hosts the inbound notification:intents stream, the persisted consumer offset, the outbound gateway:client-events and mail:delivery_commands streams, and the temporary route_leases:* keys
  • NOTIFICATION_REDIS_PASSWORD matches the Redis connection password (mandatory; the deprecated NOTIFICATION_REDIS_USERNAME / NOTIFICATION_REDIS_TLS_ENABLED env vars are rejected at startup)
  • NOTIFICATION_POSTGRES_PRIMARY_DSN points to the PostgreSQL primary hosting the notification schema; the role must own records, routes, dead_letters, and malformed_intents
  • NOTIFICATION_USER_SERVICE_BASE_URL points to the trusted internal User Service
  • NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM matches the stream consumed by Gateway
  • NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM matches the stream consumed by Mail Service
  • administrator email variables are populated for notification types that should notify administrators
  • retention knobs (NOTIFICATION_RECORD_RETENTION, NOTIFICATION_MALFORMED_INTENT_RETENTION, NOTIFICATION_CLEANUP_INTERVAL) are sized for the expected operator history window
  • OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
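
The checklist above can be partly automated before the process is launched. The following is a minimal sketch, not the service's own validation: the helper name preflight and the choice to treat every listed variable as required are assumptions (some variables may have defaults in the real configuration). Only the variable names and the rejection of the two deprecated variables come from this runbook.

```python
# Hypothetical preflight helper; the service runs its own validation at startup.
# Assumption: every variable below must be set explicitly (some may have defaults).
CHECKED = (
    "NOTIFICATION_REDIS_MASTER_ADDR",
    "NOTIFICATION_REDIS_PASSWORD",
    "NOTIFICATION_POSTGRES_PRIMARY_DSN",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)

# Named in this runbook as rejected at startup.
DEPRECATED = (
    "NOTIFICATION_REDIS_USERNAME",
    "NOTIFICATION_REDIS_TLS_ENABLED",
)


def preflight(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    for name in CHECKED:
        if not env.get(name, "").strip():
            problems.append(f"{name} is unset or empty")
    for name in DEPRECATED:
        if name in env:
            problems.append(f"{name} is deprecated and rejected at startup")
    return problems
```

Calling preflight(dict(os.environ)) in a deployment script and refusing to start on a non-empty result gives the same fail-fast outcome slightly earlier, with every problem listed at once.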

At startup the process performs a bounded Redis PING, opens the PostgreSQL pool, runs the embedded goose migrations, and only then starts the internal HTTP probe. Startup fails fast if configuration validation, Redis connectivity, PostgreSQL connectivity, or migration application fails.
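
The ordering matters when reading startup logs: a failure in an earlier step means later steps never ran. A minimal sketch of that fail-fast sequencing, assuming each step is a plain callable (run_startup and the step names are illustrative, not the service's actual code):

```python
def run_startup(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns the names of the steps that completed. Hypothetical helper that
    only illustrates the ordering described in this runbook.
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Fail fast: later steps (such as the HTTP probe) are never started.
            raise RuntimeError(f"startup failed at {name}: {exc}") from exc
        completed.append(name)
    return completed
```

With steps ordered as Redis PING, PostgreSQL pool, migrations, HTTP probe, a migration failure leaves the probe unstarted, so a missing /readyz listener is itself a sign that an earlier step failed.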

Known startup caveats:

  • there is no operator API
  • there is no /metrics route
  • traces and metrics are exported only through configured OpenTelemetry exporters
  • readiness is process-local after successful startup; dependency health must be verified out of band

Steady-State Verification

Practical readiness verification:

  1. confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
  2. request GET /readyz on NOTIFICATION_INTERNAL_HTTP_ADDR
  3. verify Redis connectivity and OpenTelemetry exporter health out of band
  4. publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream
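
Step 2 can be scripted. The sketch below uses only the Python standard library and rests on assumptions: that NOTIFICATION_INTERNAL_HTTP_ADDR is a host:port listen address, that a bare ":port" or wildcard host is reachable on loopback, and that a 200 response means ready. The function names are illustrative.

```python
import urllib.error
import urllib.request


def readyz_url(addr):
    """Build the probe URL from a host:port listen address.

    Assumption: an empty or wildcard host means the probe is reachable
    on loopback from the machine running this check.
    """
    host, _, port = addr.rpartition(":")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/readyz"


def is_ready(addr, timeout=2.0):
    """Return True only if GET /readyz answers with HTTP 200."""
    try:
        with urllib.request.urlopen(readyz_url(addr), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

A passing probe does not replace steps 3 and 4, which still have to be checked separately.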

Expected steady-state signals:

  • notification.route_schedule.depth remains bounded
  • notification.route_schedule.oldest_age_ms stays close to the delays in the active retry ladder
  • notification.intent_stream.oldest_unprocessed_age_ms remains near zero when producers are healthy
  • notification.route.dead_letters changes rarely
  • malformed-intent logs appear only for bad producer input
  • logs include notification_type, producer, audience_kind, and correlation identifiers where present
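
These signals can be turned into a simple comparison of two metric snapshots. The following is a sketch under stated assumptions: the limits are operator-chosen values (none are defined in this runbook), notification.route.dead_letters is treated as a monotonically increasing counter, and the function name is illustrative.

```python
DEPTH = "notification.route_schedule.depth"
ROUTE_AGE = "notification.route_schedule.oldest_age_ms"
INTENT_AGE = "notification.intent_stream.oldest_unprocessed_age_ms"
DEAD_LETTERS = "notification.route.dead_letters"


def steady_state_warnings(prev, curr, limits):
    """Compare two metric snapshots against operator-chosen limits.

    prev and curr map metric names to numeric samples; limits maps the same
    names to the highest acceptable value (or growth, for dead letters).
    """
    warnings = []
    for metric in (DEPTH, ROUTE_AGE, INTENT_AGE):
        if curr[metric] > limits[metric]:
            warnings.append(f"{metric}={curr[metric]} exceeds {limits[metric]}")
    # Assumption: dead_letters is a counter, so only its growth is meaningful.
    growth = curr[DEAD_LETTERS] - prev[DEAD_LETTERS]
    if growth > limits[DEAD_LETTERS]:
        warnings.append(f"{DEAD_LETTERS} grew by {growth} between samples")
    return warnings
```

A reasonable limit for notification.route_schedule.oldest_age_ms depends on the configured retry ladder, so it is left to the operator rather than hard-coded.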

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • coordinated shutdown is bounded by NOTIFICATION_SHUTDOWN_TIMEOUT
  • the private probe listener is stopped before process resources are closed
  • route publishers and the intent consumer stop through context cancellation
  • Redis clients are closed after the app stops
  • OpenTelemetry providers are flushed during runtime cleanup
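
The effect of the timeout can be illustrated with a small model: steps run in order until the deadline passes, and anything left over is skipped. This is a sketch only. The function name, the injectable clock, and the exact relative order of steps are assumptions, and the real process is not implemented this way.

```python
import time


def bounded_shutdown(steps, timeout_s, clock=time.monotonic):
    """Run (name, callable) shutdown steps in order until the deadline.

    Returns (completed, skipped) step names. Hypothetical model of a
    shutdown bounded by NOTIFICATION_SHUTDOWN_TIMEOUT.
    """
    deadline = clock() + timeout_s
    completed, skipped = [], []
    for name, step in steps:
        if clock() >= deadline:
            # Out of time: remaining cleanup is abandoned.
            skipped.append(name)
            continue
        step()
        completed.append(name)
    return completed, skipped
```

In this model a timeout that is too short shows up as missing shutdown log lines for the later steps, which is what to look for when a restart appears to lose buffered telemetry.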

During a planned restart:

  1. send SIGTERM
  2. wait for listener and worker shutdown logs
  3. restart the process with the same Redis, stream, and downstream settings
  4. repeat steady-state verification

Incident Triage

Intent Stream Lag Grows

Symptoms:

  • notification.intent_stream.oldest_unprocessed_age_ms increases
  • no matching route records appear for new stream entries
  • consumer logs stop after a specific stream entry

Checks:

  1. inspect the next unprocessed notification:intents entry
  2. confirm User Service is reachable from Notification Service
  3. if the entry is user-targeted, verify every recipient_user_id exists
  4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

  • malformed input is recorded and the offset advances
  • temporary User Service failure stops progress before offset advancement
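
The two behaviors explain why lag grows behind one specific entry. A minimal sketch of the offset rule, assuming two hypothetical exception types and an in-memory list of entries (none of these names come from the service):

```python
class MalformedIntent(Exception):
    """Hypothetical: the entry is bad producer input and can never succeed."""


class UserServiceUnavailable(Exception):
    """Hypothetical: a temporary User Service failure."""


def consume(entries, handle, record_malformed, offset=None):
    """Process (entry_id, fields) pairs in order and return the new offset.

    Malformed input is recorded and skipped; a temporary User Service
    failure stops the loop before the offset moves past that entry.
    """
    for entry_id, fields in entries:
        try:
            handle(fields)
        except MalformedIntent as exc:
            record_malformed(entry_id, str(exc))
        except UserServiceUnavailable:
            break  # offset stays at the last fully handled entry
        offset = entry_id
    return offset
```

If the returned offset stops just before an entry while later entries keep arriving, the entry right after the offset is the one to inspect in check 1.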

Route Schedule Backlog Grows

Symptoms:

  • notification.route_schedule.depth rises steadily
  • notification.route_schedule.oldest_age_ms increases
  • routes remain in the pending or failed state

Checks:

  1. confirm push and email publisher startup logs are present
  2. confirm Redis latency and connectivity
  3. verify route IDs match the expected push: or email: prefixes
  4. confirm the downstream stream names match Gateway and Mail Service
  5. inspect route last_error_classification
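
Checks 3 and 5 lend themselves to a quick summary over exported route records. The sketch below assumes each record is a dict with route_id and status keys (both hypothetical field names); last_error_classification and the push: and email: prefixes are taken from this runbook.

```python
from collections import Counter

EXPECTED_PREFIXES = ("push:", "email:")


def summarize_backlog(routes):
    """Return (unexpected_route_ids, error_counts) for a list of route dicts.

    error_counts covers only routes still in the pending or failed state.
    """
    unexpected = [
        r["route_id"] for r in routes
        if not r["route_id"].startswith(EXPECTED_PREFIXES)
    ]
    error_counts = Counter(
        r.get("last_error_classification") or "none"
        for r in routes
        if r["status"] in ("pending", "failed")
    )
    return unexpected, error_counts
```

A backlog dominated by one classification usually points at a single downstream dependency, while a spread across classifications suggests Redis itself.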

Dead-Letter Spikes

Symptoms:

  • notification.route.dead_letters increases rapidly
  • route records show repeated payload_encoding_failed, gateway_stream_publish_failed, or mail_stream_publish_failed

Checks:

  1. inspect the dead-letter entry and owning route
  2. verify payload fields still match the notification catalog
  3. confirm downstream Redis stream writes are accepted
  4. compare failures across channels to isolate Gateway-specific or Mail-specific issues

Recovery:

  1. correct the downstream dependency or payload problem
  2. publish a new compatible intent with a new producer-owned idempotency_key
  3. keep the old dead-letter record untouched as audit history
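
Recovery step 2 amounts to re-sending the same intent under a fresh key. A minimal sketch, assuming the intent is available as a dict; only the idempotency_key field name comes from this runbook, and the replay- prefix with a UUID is an illustrative key format, since the key is producer-owned and the producer decides its shape.

```python
import copy
import uuid


def replacement_intent(original, new_key=None):
    """Copy an intent and give it a new producer-owned idempotency_key.

    The original dict is left untouched, mirroring the rule that the old
    dead-letter record stays as audit history.
    """
    intent = copy.deepcopy(original)
    key = new_key or f"replay-{uuid.uuid4()}"  # assumption: any unique string works
    if key == original.get("idempotency_key"):
        raise ValueError("replacement intent must use a new idempotency_key")
    intent["idempotency_key"] = key
    return intent
```

Reusing the old key would defeat the purpose of the replay, which is why the sketch refuses it.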

Missing Administrator Mail

Symptoms:

  • administrator notification type is accepted
  • no email command reaches mail:delivery_commands
  • route is skipped with recipient config:<notification_type>

Checks:

  1. inspect the type-specific administrator email environment variable
  2. confirm addresses are normalized single email addresses without display names
  3. restart the process after configuration changes

Expected behavior:

  • empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible
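
Check 2 can be approximated with the standard library. The sketch below treats "normalized" as meaning no surrounding whitespace, no display name, and exactly one address, which is an assumption about the service's rules; the function name is illustrative, and how multiple addresses are separated in the environment variable is not covered here.

```python
from email.utils import parseaddr


def is_plain_address(value):
    """Return True for a bare single address such as ops@example.com.

    Rejects display names ("Ops <ops@example.com>"), address lists,
    angle-bracket forms, and values with surrounding whitespace.
    """
    name, addr = parseaddr(value)
    return name == "" and addr == value and "@" in addr
```

This is a format check only and does not confirm that the mailbox exists or that the service accepts the value.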

Auth-Code Mail Appears Missing

Auth-code mail is intentionally handled outside Notification Service.

Checks:

  1. inspect logs and delivery records on the Auth / Session Service -> Mail Service path
  2. confirm notification:intents remains unused for auth-code delivery
  3. do not replay auth-code mail through Notification Service