Operator Runbook
This runbook covers startup, steady-state verification, shutdown, and common
Notification Service incidents.
Startup Checks
Before starting the process, confirm:
- NOTIFICATION_REDIS_MASTER_ADDR points to the Redis master deployment that hosts the inbound notification:intents stream, the persisted consumer offset, the outbound gateway:client-events and mail:delivery_commands streams, and the temporary route_leases:* keys
- NOTIFICATION_REDIS_PASSWORD matches the connection password (mandatory; the deprecated NOTIFICATION_REDIS_USERNAME / NOTIFICATION_REDIS_TLS_ENABLED env vars are rejected at startup)
- NOTIFICATION_POSTGRES_PRIMARY_DSN points to the PostgreSQL primary hosting the notification schema; the role must own records, routes, dead_letters, and malformed_intents
- NOTIFICATION_USER_SERVICE_BASE_URL points to the trusted internal User Service
- NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM matches the stream consumed by Gateway
- NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM matches the stream consumed by Mail Service
- administrator email variables are populated for notification types that should notify administrators
- retention knobs (NOTIFICATION_RECORD_RETENTION, NOTIFICATION_MALFORMED_INTENT_RETENTION, NOTIFICATION_CLEANUP_INTERVAL) are sized for the expected operator history window
- OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
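The environment part of this checklist can be pre-checked before the process is started. The following is a minimal sketch, not the service's own validation: the variable names come from the list above, but treating every one of them as required (the stream variables may have defaults) and the function name preflight_problems are assumptions made for illustration.

```python
import os
from typing import Mapping

# Names come from the startup checklist. Treating all of them as required
# is an assumption of this sketch, not a statement about the service.
REQUIRED_VARS = (
    "NOTIFICATION_REDIS_MASTER_ADDR",
    "NOTIFICATION_REDIS_PASSWORD",
    "NOTIFICATION_POSTGRES_PRIMARY_DSN",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)

# The checklist states that these are rejected at startup.
DEPRECATED_VARS = (
    "NOTIFICATION_REDIS_USERNAME",
    "NOTIFICATION_REDIS_TLS_ENABLED",
)


def preflight_problems(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the given environment."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"missing or empty: {name}")
    for name in DEPRECATED_VARS:
        if name in env:
            problems.append(f"deprecated, rejected at startup: {name}")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print(problem)
```

An empty result only means the variables are present; it does not verify that the addresses, credentials, or stream names are correct.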
At startup the process performs a bounded Redis PING, opens the
PostgreSQL pool, runs the embedded goose migrations, and only then starts
the internal HTTP probe. Startup fails fast if configuration validation,
Redis connectivity, PostgreSQL connectivity, or migration application
fails.
Known startup caveats:
- there is no operator API
- there is no /metrics route
- traces and metrics are exported only through configured OpenTelemetry exporters
- readiness is process-local after successful startup
Steady-State Verification
Practical readiness verification:
- confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
- request GET /readyz on NOTIFICATION_INTERNAL_HTTP_ADDR
- verify Redis connectivity and OpenTelemetry exporter health out of band
- publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream
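The /readyz request can be scripted for repeated checks. This is a rough sketch under stated assumptions: that NOTIFICATION_INTERNAL_HTTP_ADDR has the form host:port or :port, that the probe speaks plain HTTP, and that a ready process answers with status 200. None of these are confirmed by this runbook, and the helper names are hypothetical.

```python
from urllib.error import URLError
from urllib.request import urlopen


def readyz_url(addr: str) -> str:
    """Build the probe URL from a 'host:port' or ':port' listen address.

    The address format is an assumption; an empty host falls back to
    the loopback interface.
    """
    host, _, port = addr.rpartition(":")
    return f"http://{host or '127.0.0.1'}:{port}/readyz"


def is_ready(addr: str, timeout_s: float = 2.0, opener=urlopen) -> bool:
    """Return True when GET /readyz answers 200 within the timeout.

    `opener` is injectable so the function can be exercised without a
    running process.
    """
    try:
        with opener(readyz_url(addr), timeout=timeout_s) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Covers connection refusal, timeouts, and non-2xx HTTP errors.
        return False
```

For example, is_ready(":8081") would probe http://127.0.0.1:8081/readyz, where 8081 is a placeholder port. Because readiness is process-local, a True result says nothing about Redis or exporter health.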
Expected steady-state signals:
- notification.route_schedule.depth remains bounded
- notification.route_schedule.oldest_age_ms stays near the active retry ladder
- notification.intent_stream.oldest_unprocessed_age_ms remains near zero when producers are healthy
- notification.route.dead_letters changes rarely
- malformed-intent logs appear only for bad producer input
- logs include notification_type, producer, audience_kind, and correlation identifiers where present
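The metric expectations above can be turned into a simple comparison of two snapshots taken from the metrics backend. The sketch below is illustrative only: the metric names come from this runbook, but the Limits type, every threshold, and the assumption that notification.route.dead_letters is a cumulative counter are hypothetical and need tuning per deployment.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Limits:
    # All limits are placeholders chosen by the operator.
    max_depth: float
    max_route_age_ms: float
    max_intent_age_ms: float
    max_dead_letter_delta: float


def steady_state_warnings(
    previous: Mapping[str, float],
    current: Mapping[str, float],
    limits: Limits,
) -> list[str]:
    """Compare two metric snapshots against operator-chosen limits."""
    warnings = []
    if current["notification.route_schedule.depth"] > limits.max_depth:
        warnings.append("route schedule depth above limit")
    if current["notification.route_schedule.oldest_age_ms"] > limits.max_route_age_ms:
        warnings.append("oldest scheduled route is older than allowed")
    if (
        current["notification.intent_stream.oldest_unprocessed_age_ms"]
        > limits.max_intent_age_ms
    ):
        warnings.append("intent stream lag above limit")
    # Assumes a cumulative counter, so growth is the difference of snapshots.
    delta = (
        current["notification.route.dead_letters"]
        - previous["notification.route.dead_letters"]
    )
    if delta > limits.max_dead_letter_delta:
        warnings.append("dead letters growing faster than expected")
    return warnings
```

A reasonable starting point for max_route_age_ms is the longest delay in the active retry ladder plus some slack, since the runbook expects the oldest age to stay near that ladder.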
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- coordinated shutdown is bounded by NOTIFICATION_SHUTDOWN_TIMEOUT
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
- send SIGTERM
- wait for listener and worker shutdown logs
- restart the process with the same Redis, stream, and downstream settings
- repeat steady-state verification
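Where the process is launched by a script rather than a supervisor, the first two restart steps can be sketched as below. This assumes the operator holds a subprocess.Popen handle for the service; the helper name is hypothetical, and the wait should be somewhat longer than NOTIFICATION_SHUTDOWN_TIMEOUT so that the bounded shutdown can finish.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout_s: float) -> bool:
    """Send SIGTERM and wait for exit.

    Returns True if the process exited within timeout_s, False if it is
    still running and needs operator attention.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A False result is worth investigating before forcing termination, because the shutdown logs usually show which listener or worker did not stop.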
Incident Triage
Intent Stream Lag Grows
Symptoms:
- notification.intent_stream.oldest_unprocessed_age_ms increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry
Checks:
- inspect the next unprocessed notification:intents entry
- confirm User Service is reachable from Notification Service
- if the entry is user-targeted, verify every recipient_user_id exists
- inspect malformed-intent records for nearby stream IDs
Expected behavior:
- malformed input is recorded and the offset advances
- temporary User Service failure stops progress before offset advancement
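The two expected behaviors explain why lag can grow behind a single entry. The loop below is a conceptual sketch of that offset rule, not the service's actual consumer; the exception types and callback names are hypothetical.

```python
from typing import Callable, Iterable


class MalformedIntent(Exception):
    """Raised by the handler when the producer input cannot be used."""


class UserServiceUnavailable(Exception):
    """Raised by the handler on a temporary User Service failure."""


def consume(
    entries: Iterable[str],
    handle: Callable[[str], None],
    record_malformed: Callable[[str, str], None],
) -> int:
    """Process entries in order and return how many offsets were advanced."""
    advanced = 0
    for entry in entries:
        try:
            handle(entry)
        except MalformedIntent as exc:
            # Bad input is recorded, then the offset advances past it.
            record_malformed(entry, str(exc))
        except UserServiceUnavailable:
            # Temporary dependency failure: stop before advancing, so the
            # same entry is retried and everything behind it waits.
            break
        advanced += 1
    return advanced
```

Under this rule a malformed entry never blocks the stream, whereas an unreachable User Service pins the consumer on one entry, which matches the symptom of consumer logs stopping after a specific stream entry.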
Route Schedule Backlog Grows
Symptoms:
- notification.route_schedule.depth rises steadily
- notification.route_schedule.oldest_age_ms increases
- routes remain in pending or failed
Checks:
- confirm push and email publisher startup logs are present
- confirm Redis latency and connectivity
- verify route IDs match the expected push: or email: prefixes
- confirm the downstream stream names match Gateway and Mail Service
- inspect route last_error_classification
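When many routes are stuck, the prefix check is easier to run over an exported list of route IDs. A minimal sketch, assuming the IDs are available as plain strings; the helper name and the sample IDs in the usage note are made up for illustration.

```python
from typing import Iterable

# Prefixes named in the checks above.
EXPECTED_PREFIXES = ("push:", "email:")


def unexpected_route_ids(route_ids: Iterable[str]) -> list[str]:
    """Return route IDs that start with neither expected prefix."""
    return [rid for rid in route_ids if not rid.startswith(EXPECTED_PREFIXES)]
```

For instance, unexpected_route_ids(["push:abc", "email:def", "sms:ghi"]) returns ["sms:ghi"]. Any non-empty result suggests routes that neither publisher is expected to pick up.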
Dead-Letter Spikes
Symptoms:
- notification.route.dead_letters increases rapidly
- route records show repeated payload_encoding_failed, gateway_stream_publish_failed, or mail_stream_publish_failed
Checks:
- inspect the dead-letter entry and owning route
- verify payload fields still match the notification catalog
- confirm downstream Redis stream writes are accepted
- compare failures across channels to isolate Gateway-specific or Mail-specific issues
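Comparing failures across channels can be done by tallying the classifications found on dead-lettered routes. The sketch below uses the three classifications listed in the symptoms; the mapping from classification to suspected component is an interpretation made for this example, not documented behavior.

```python
from collections import Counter
from typing import Iterable

# Interpretation of each classification is an assumption of this sketch.
SUSPECT_BY_CLASSIFICATION = {
    "payload_encoding_failed": "payload",
    "gateway_stream_publish_failed": "gateway",
    "mail_stream_publish_failed": "mail",
}


def tally_suspects(classifications: Iterable[str]) -> Counter:
    """Count dead-letter classifications per suspected component."""
    return Counter(
        SUSPECT_BY_CLASSIFICATION.get(name, "unknown") for name in classifications
    )
```

A tally dominated by one of "gateway" or "mail" points at that downstream stream, whereas failures spread across both channels make a shared cause such as Redis more likely.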
Recovery:
- correct the downstream dependency or payload problem
- publish a new compatible intent with a new producer-owned idempotency_key
- keep the old dead-letter record untouched as audit history
Missing Administrator Mail
Symptoms:
- administrator notification type is accepted
- no email command reaches mail:delivery_commands
- route is skipped with recipient config:<notification_type>
Checks:
- inspect the type-specific administrator email environment variable
- confirm addresses are normalized single email addresses without display names
- restart the process after configuration changes
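The address check can be approximated with the standard library. This is a sketch only: the exact normalization the service applies is not described here, so the function merely flags values that carry a display name, surrounding whitespace, or more than one address. The function name is hypothetical.

```python
from email.utils import parseaddr


def is_plain_address(value: str) -> bool:
    """Return True for a bare single address such as 'ops@example.com'.

    Values with a display name, extra whitespace, or several addresses
    are rejected because parsing them does not round-trip to the input.
    """
    name, addr = parseaddr(value)
    return name == "" and addr == value and "@" in addr
```

For instance, is_plain_address("Ops Team <ops@example.com>") is False because of the display name. Passing this check does not prove that the service accepts the value.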
Expected behavior:
- empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible
Auth-Code Mail Appears Missing
Auth-code mail is intentionally outside Notification Service.
Checks:
- inspect Auth / Session Service -> Mail Service logs and delivery records
- confirm notification:intents remains unused for auth-code delivery
- do not replay auth-code mail through Notification Service