Operator Runbook
This runbook covers startup, steady-state verification, shutdown, and common
Notification Service incidents.
Startup Checks
Before starting the process, confirm:
- NOTIFICATION_REDIS_ADDR points to the Redis deployment that stores notification records, routes, idempotency reservations, malformed intents, dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- NOTIFICATION_USER_SERVICE_BASE_URL points to the trusted internal User Service
- NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM matches the stream consumed by Gateway
- NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM matches the stream consumed by Mail Service
- administrator email variables are populated for notification types that should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
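Part of this checklist can be automated before launch. The following is a minimal sketch that only checks that the four variables named above are non-empty. The helper name missing_settings is hypothetical, and the administrator email and OpenTelemetry variables are left out because their names are not listed in this runbook.

```python
from typing import Mapping

# Variables named in the startup checklist. Nothing else is assumed.
REQUIRED_SETTINGS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

Calling missing_settings(os.environ) in the launch environment returns an empty list when all four variables are populated. It does not confirm that the values point at the right deployment.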
At startup the process performs a bounded Redis PING. Startup fails fast if
configuration validation or Redis connectivity fails.
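To reproduce the connectivity check from an operator host, a bounded PING can be approximated with the standard library. This is a sketch, not the service's own implementation: the function name redis_ping and the default timeout are assumptions, and it speaks plain unauthenticated RESP, so it reports failure against a deployment that requires ACL credentials or TLS.

```python
import socket


def redis_ping(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Send an unauthenticated RESP PING and report whether +PONG came back.

    The timeout applies to the connect, send, and receive steps individually.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
    return reply.startswith(b"+PONG")
```

For deployments with ACL or TLS enabled, a full Redis client configured with the same settings as the service is the more faithful check.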
Known startup caveats:
- there is no operator API
- there is no /metrics route
- traces and metrics are exported only through configured OpenTelemetry exporters
- readiness is process-local after successful startup
Steady-State Verification
Practical readiness verification:
- confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
- request GET /readyz on NOTIFICATION_INTERNAL_HTTP_ADDR
- verify Redis connectivity and OpenTelemetry exporter health out of band
- publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream
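The readiness request can be scripted. The sketch below assumes the internal listener speaks plain HTTP and that a 200 status means ready; neither is stated in this runbook, and the function name is_ready is hypothetical. The host and port are passed separately because the exact format of NOTIFICATION_INTERNAL_HTTP_ADDR is not specified here.

```python
import http.client


def is_ready(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True when GET /readyz answers with HTTP 200 within the timeout."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", "/readyz")
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, or a malformed response.
        return False
    finally:
        conn.close()
```

Because readiness is process-local, a passing probe says nothing about Redis or exporter health; those still need the out-of-band checks listed above.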
Expected steady-state signals:
- notification.route_schedule.depth remains bounded
- notification.route_schedule.oldest_age_ms stays near the active retry ladder
- notification.intent_stream.oldest_unprocessed_age_ms remains near zero when producers are healthy
- notification.route.dead_letters changes rarely
- malformed-intent logs appear only for bad producer input
- logs include notification_type, producer, audience_kind, and correlation identifiers where present
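One way to turn the gauge expectations into an automated check is to compare a metrics snapshot against alert limits. The limits below are hypothetical placeholders, not values from this service, and should be tuned to the deployed retry ladder. The dead-letter counter is omitted because "changes rarely" is a rate, which needs two snapshots to evaluate.

```python
from typing import Mapping

# Hypothetical alert limits; replace with values that fit the environment.
THRESHOLDS = {
    "notification.route_schedule.depth": 1_000,
    "notification.route_schedule.oldest_age_ms": 300_000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def breached_signals(snapshot: Mapping[str, float]) -> list[str]:
    """Return metric names whose current value exceeds its limit.

    Metrics missing from the snapshot are skipped rather than treated as zero.
    """
    return [
        name
        for name, limit in THRESHOLDS.items()
        if name in snapshot and snapshot[name] > limit
    ]
```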
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- coordinated shutdown is bounded by NOTIFICATION_SHUTDOWN_TIMEOUT
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
- send SIGTERM
- wait for listener and worker shutdown logs
- restart the process with the same Redis, stream, and downstream settings
- repeat steady-state verification
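Where the process is launched directly rather than through a supervisor, the first two restart steps can be sketched as follows. This assumes a POSIX host and a subprocess.Popen handle for the service; the helper name stop_gracefully is hypothetical. A wait somewhat longer than NOTIFICATION_SHUTDOWN_TIMEOUT gives the coordinated shutdown room to finish.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout_s: float) -> int:
    """Send SIGTERM and wait up to timeout_s for the process to exit.

    Raises subprocess.TimeoutExpired if the process is still running, which
    leaves the escalation decision to the operator.
    """
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout_s)
```

The helper deliberately does not escalate to SIGKILL, since a forced kill would skip the Redis client close and the OpenTelemetry flush described above.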
Incident Triage
Intent Stream Lag Grows
Symptoms:
- notification.intent_stream.oldest_unprocessed_age_ms increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry
Checks:
- inspect the next unprocessed notification:intents entry
- confirm User Service is reachable from Notification Service
- if the entry is user-targeted, verify every recipient_user_id exists
- inspect malformed-intent records for nearby stream IDs
Expected behavior:
- malformed input is recorded and the offset advances
- temporary User Service failure stops progress before offset advancement
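The two behaviors above explain why lag from a dependency outage looks different from lag caused by bad input. The following is an illustrative model of those offset rules, not the service's code; the exception classes, the consume function, and the in-memory malformed list are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional


class MalformedIntent(Exception):
    """Producer input that can never be processed."""


class TemporaryDependencyError(Exception):
    """A dependency such as User Service is temporarily unavailable."""


def consume(
    entries: Iterable[tuple[str, dict]],
    handle: Callable[[dict], None],
    malformed: list[str],
) -> Optional[str]:
    """Process entries in order; return the last stream ID whose offset advanced."""
    offset = None
    for stream_id, intent in entries:
        try:
            handle(intent)
        except MalformedIntent:
            malformed.append(stream_id)  # record it, then move past it
        except TemporaryDependencyError:
            break  # stop before advancing so this entry is retried later
        offset = stream_id
    return offset
```

In this model a malformed entry never blocks the stream, while a dependency outage pins the offset at the last good entry. That matches the symptom of consumer logs stopping after a specific stream entry.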
Route Schedule Backlog Grows
Symptoms:
- notification.route_schedule.depth rises steadily
- notification.route_schedule.oldest_age_ms increases
- routes remain in pending or failed
Checks:
- confirm push and email publisher startup logs are present
- confirm Redis latency and connectivity
- verify route IDs match the expected push: or email: prefixes
- confirm the downstream stream names match Gateway and Mail Service
- inspect route last_error_classification
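When many route IDs need checking at once, the prefix check can be scripted. This sketch assumes only the two prefixes named above; the part of the ID after the prefix is not described in this runbook and is not inspected. The function names are hypothetical.

```python
from typing import Iterable


def route_channel(route_id: str) -> str:
    """Map a route ID to its channel, or 'unknown' for an unexpected prefix."""
    if route_id.startswith("push:"):
        return "push"
    if route_id.startswith("email:"):
        return "email"
    return "unknown"


def unexpected_route_ids(route_ids: Iterable[str]) -> list[str]:
    """Return the route IDs that carry neither expected prefix."""
    return [rid for rid in route_ids if route_channel(rid) == "unknown"]
```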
Dead-Letter Spikes
Symptoms:
- notification.route.dead_letters increases rapidly
- route records show repeated payload_encoding_failed, gateway_stream_publish_failed, or mail_stream_publish_failed
Checks:
- inspect the dead-letter entry and owning route
- verify payload fields still match the notification catalog
- confirm downstream Redis stream writes are accepted
- compare failures across channels to isolate Gateway-specific or Mail-specific issues
Recovery:
- correct the downstream dependency or payload problem
- publish a new compatible intent with a new producer-owned idempotency_key
- keep the old dead-letter record untouched as audit history
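The re-publication step amounts to copying the intent and changing only its key. In the sketch below the intent is modeled as a plain dict, which is an assumption about its shape; only the idempotency_key field name comes from this runbook. The random UUID is a placeholder, since the key is producer-owned and the producer's own key scheme should take precedence.

```python
import copy
import uuid


def reissue_intent(original: dict) -> dict:
    """Copy an intent and give the copy a fresh idempotency_key.

    The original dict is left unmodified.
    """
    replacement = copy.deepcopy(original)
    replacement["idempotency_key"] = str(uuid.uuid4())  # placeholder key scheme
    return replacement
```

Reusing the old key would collide with the existing idempotency reservation, which is why the replacement needs a new one.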
Missing Administrator Mail
Symptoms:
- administrator notification type is accepted
- no email command reaches mail:delivery_commands
- route is skipped with recipient config:<notification_type>
Checks:
- inspect the type-specific administrator email environment variable
- confirm addresses are normalized single email addresses without display names
- restart the process after configuration changes
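A quick pre-check for the address format can be done with the standard library. This sketch rejects display names, angle brackets, surrounding whitespace, and multiple addresses in one value. It is an approximation: the service's exact normalization rules are not described in this runbook, and the function name is_bare_address is hypothetical.

```python
from email.utils import parseaddr


def is_bare_address(value: str) -> bool:
    """Return True for a single address with no display name or angle brackets."""
    name, addr = parseaddr(value)
    # The parsed address must equal the input exactly, so any extra
    # decoration or a second address makes the check fail.
    return name == "" and addr == value and addr.count("@") == 1
```

For example, is_bare_address("ops@example.com") passes, while is_bare_address("Ops Team <ops@example.com>") does not. How the environment variable separates multiple addresses is not covered here, so each entry should be split out before calling the helper.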
Expected behavior:
- empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible
Auth-Code Mail Appears Missing
Auth-code mail is intentionally outside Notification Service.
Checks:
- inspect Auth / Session Service -> Mail Service logs and delivery records
- confirm notification:intents remains unused for auth-code delivery
- do not replay auth-code mail through Notification Service