# Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common `Notification Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `NOTIFICATION_REDIS_MASTER_ADDR` points to the Redis master deployment that hosts the inbound `notification:intents` stream, the persisted consumer offset, the outbound `gateway:client-events` and `mail:delivery_commands` streams, and the temporary `route_leases:*` keys
- `NOTIFICATION_REDIS_PASSWORD` matches the connection password (mandatory; the deprecated `NOTIFICATION_REDIS_USERNAME` / `NOTIFICATION_REDIS_TLS_ENABLED` env vars are rejected at startup)
- `NOTIFICATION_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary hosting the `notification` schema; the role must own `records`, `routes`, `dead_letters`, and `malformed_intents`
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal `User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by `Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by `Mail Service`
- administrator email variables are populated for notification types that should notify administrators
- retention knobs (`NOTIFICATION_RECORD_RETENTION`, `NOTIFICATION_MALFORMED_INTENT_RETENTION`, `NOTIFICATION_CLEANUP_INTERVAL`) are sized for the expected operator history window
- OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process

At startup the process performs a bounded Redis `PING`, opens the PostgreSQL pool, runs the embedded goose migrations, and only then starts the internal HTTP probe. Startup fails fast if configuration validation, Redis connectivity, PostgreSQL connectivity, or migration application fails.
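Part of the checklist above can be automated before the process is launched. The following is a minimal sketch, not the service's own validation logic. The helper name `preflight_problems` is hypothetical. Treating all six connection and stream variables as required is an assumption, and so is flagging a deprecated variable whenever it is present, even if it is empty.

```python
from typing import List, Mapping

# Variables named in the startup checklist. Treating every one of them as
# required is an assumption of this sketch; the service may default some.
REQUIRED_VARS = (
    "NOTIFICATION_REDIS_MASTER_ADDR",
    "NOTIFICATION_REDIS_PASSWORD",
    "NOTIFICATION_POSTGRES_PRIMARY_DSN",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)

# Deprecated variables that the runbook says are rejected at startup.
DEPRECATED_VARS = (
    "NOTIFICATION_REDIS_USERNAME",
    "NOTIFICATION_REDIS_TLS_ENABLED",
)


def preflight_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found in env (empty list means OK)."""
    problems = []
    # Required variables must be present and non-blank.
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"missing or empty: {name}")
    # Assumption: mere presence of a deprecated variable is a problem.
    for name in DEPRECATED_VARS:
        if name in env:
            problems.append(f"deprecated variable is set: {name}")
    return problems
```

Calling `preflight_problems(os.environ)` from a deploy script and aborting on a non-empty result surfaces configuration gaps earlier than the process's own fail-fast startup would. It does not check Redis or PostgreSQL connectivity, which the process verifies itself.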
Known startup caveats:

- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry exporters
- readiness is process-local after successful startup

## Steady-State Verification

Practical readiness verification:

1. confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream

Expected steady-state signals:

- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays near the active retry ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and correlation identifiers where present

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification

## Incident Triage

### Intent Stream Lag Grows

Symptoms:

- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry

Checks:

1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement

### Route Schedule Backlog Grows

Symptoms:

- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`

Checks:

1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`

### Dead-Letter Spikes

Symptoms:

- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`, `gateway_stream_publish_failed`, or `mail_stream_publish_failed`

Checks:

1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or Mail-specific issues

Recovery:

1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned `idempotency_key`
3. keep the old dead-letter record untouched as audit history

### Missing Administrator Mail

Symptoms:

- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:`

Checks:

1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display names
3. restart the process after configuration changes

Expected behavior:

- empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible

### Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside `Notification Service`.

Checks:

1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`