# Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common `Notification Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `NOTIFICATION_REDIS_ADDR` points to the Redis deployment that stores notification records, routes, idempotency reservations, malformed intents, dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal `User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by `Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by `Mail Service`
- administrator email variables are populated for notification types that should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process

At startup the process performs a bounded Redis `PING`. Startup fails fast if configuration validation or Redis connectivity fails.

Known startup caveats:

- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry exporters
- readiness is process-local after successful startup

## Steady-State Verification

Practical readiness verification:

1. confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream

Expected steady-state signals:

- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays near the active retry ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and correlation identifiers where present

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification

## Incident Triage

### Intent Stream Lag Grows

Symptoms:

- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry

Checks:

1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement

### Route Schedule Backlog Grows

Symptoms:

- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`

Checks:

1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`

### Dead-Letter Spikes

Symptoms:

- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`, `gateway_stream_publish_failed`, or `mail_stream_publish_failed`

Checks:

1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or Mail-specific issues

Recovery:

1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned `idempotency_key`
3. keep the old dead-letter record untouched as audit history

### Missing Administrator Mail

Symptoms:

- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:`

Checks:

1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display names
3. restart the process after configuration changes

Expected behavior:

- empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible

### Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside `Notification Service`.

Checks:

1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`
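
## Sketch: Readiness Probe

The `GET /readyz` step under Steady-State Verification could be scripted roughly as follows. This is a minimal sketch, not part of the service: the helper names `readyz_url` and `is_ready` are hypothetical, and it assumes that `NOTIFICATION_INTERNAL_HTTP_ADDR` holds a `host:port` or `:port` value, that the listener speaks plain HTTP, and that readiness is signalled by a 2xx status. None of these are confirmed by this runbook.

```python
import urllib.error
import urllib.request


def readyz_url(internal_addr: str) -> str:
    """Build the probe URL from a listen address.

    Assumption: the address is "host:port" or ":port"; a bare ":port"
    is probed on 127.0.0.1.
    """
    addr = internal_addr.strip()
    if addr.startswith(":"):
        addr = "127.0.0.1" + addr
    return f"http://{addr}/readyz"


def is_ready(internal_addr: str, timeout_s: float = 2.0) -> bool:
    """Return True when GET /readyz answers with a 2xx status.

    Assumption: any 2xx means ready. urlopen raises HTTPError for
    non-2xx responses, which is treated as "not ready" here, as are
    connection failures and timeouts.
    """
    url = readyz_url(internal_addr)
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False
```

A caller might run `is_ready(os.environ["NOTIFICATION_INTERNAL_HTTP_ADDR"])` after confirming the startup logs. Because readiness is process-local, a passing probe does not replace the out-of-band Redis and OpenTelemetry exporter checks.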
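
## Sketch: Mapping Metric Changes to Triage Sections

The symptoms listed under Incident Triage can be approximated by comparing two snapshots of the documented metrics. The sketch below is illustrative only: the `triage` function, the snapshot dictionaries, and the `dead_letter_spike` threshold are assumptions introduced here, and how the snapshots are collected from the OpenTelemetry backend is left out. Only the metric names and section titles come from this runbook.

```python
from typing import Dict, List

# Metric names as listed in this runbook.
INTENT_LAG = "notification.intent_stream.oldest_unprocessed_age_ms"
SCHEDULE_DEPTH = "notification.route_schedule.depth"
SCHEDULE_AGE = "notification.route_schedule.oldest_age_ms"
DEAD_LETTERS = "notification.route.dead_letters"


def triage(
    previous: Dict[str, float],
    current: Dict[str, float],
    dead_letter_spike: float = 10.0,  # hypothetical threshold, tune per environment
) -> List[str]:
    """Return the runbook sections suggested by growth between two snapshots."""

    def delta(name: str) -> float:
        # Missing metrics are treated as 0 in both snapshots.
        return current.get(name, 0.0) - previous.get(name, 0.0)

    sections: List[str] = []
    if delta(INTENT_LAG) > 0:
        sections.append("Intent Stream Lag Grows")
    # Backlog growth is flagged only when depth and oldest age rise together.
    if delta(SCHEDULE_DEPTH) > 0 and delta(SCHEDULE_AGE) > 0:
        sections.append("Route Schedule Backlog Grows")
    if delta(DEAD_LETTERS) >= dead_letter_spike:
        sections.append("Dead-Letter Spikes")
    return sections
```

A single pair of snapshots can be noisy, so the result is a hint about which section to read first, not a diagnosis. Comparing several consecutive samples before acting would be a reasonable refinement.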