feat: notification service

Ilia Denisov
2026-04-22 08:49:45 +02:00
committed by GitHub
parent 5b7593e6f6
commit 32dc29359a
135 changed files with 21828 additions and 130 deletions
@@ -0,0 +1,167 @@
# Operator Runbook
This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.
## Startup Checks
Before starting the process, confirm:
- `NOTIFICATION_REDIS_ADDR` points to the Redis deployment that stores
notification records, routes, idempotency reservations, malformed intents,
dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
`User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
`Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
`Mail Service`
- administrator email variables are populated for notification types that
should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces
or metrics are expected outside the process
At startup the process performs a bounded Redis `PING`. Startup fails fast if
configuration validation or Redis connectivity fails.
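The environment part of this checklist can be rehearsed before the process is started. The following is a minimal sketch, not the service's own validation: the helper name `missing_settings` is hypothetical, and treating each listed variable as strictly required is an assumption. It does not test Redis connectivity; the process's own bounded `PING` covers that.

```python
# Settings named in the startup checklist. Whether the service treats each
# one as strictly required is an assumption made for this sketch.
CHECKED_SETTINGS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_settings(env):
    """Return the checked settings that are unset or blank in env."""
    return [name for name in CHECKED_SETTINGS if not env.get(name, "").strip()]
```

Calling `missing_settings(os.environ)` in the target environment (after `import os`) returns an empty list when every checked variable is populated.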
Known startup caveats:
- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
exporters
- readiness is process-local after successful startup, so external
dependencies must be verified out of band
## Steady-State Verification
Practical readiness verification:
1. confirm startup logs for the internal HTTP listener, intent consumer, push
publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
and verify route publication in the downstream stream
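Step 2 can be scripted. The sketch below rests on assumptions: that `NOTIFICATION_INTERNAL_HTTP_ADDR` has the form `host:port` or `:port`, that the listener speaks plain HTTP, and that a ready process answers `/readyz` with status 200. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def readyz_url(listen_addr):
    """Build the probe URL from a listen address.

    Assumption: the address looks like "host:port" or ":port". An empty or
    wildcard host is probed through loopback.
    """
    host, _, port = listen_addr.rpartition(":")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/readyz"


def is_ready(url, timeout_s=2.0):
    """Return True when the probe answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

A passing probe only confirms process-local readiness; steps 3 and 4 are still needed to cover Redis, the exporters, and downstream publication.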
Expected steady-state signals:
- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays close to the delays of the
active retry ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
correlation identifiers where present
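If the metrics are sampled by a script or dashboard job, the gauge signals above can be compared against operator-chosen limits. This is only a sketch: the limit values are placeholders, not recommendations from this runbook, and how the samples are collected from the OpenTelemetry backend is left out.

```python
# Placeholder limits; choose real values from the environment's retry
# ladder and normal traffic. These numbers are assumptions.
EXAMPLE_LIMITS = {
    "notification.route_schedule.depth": 1000,
    "notification.route_schedule.oldest_age_ms": 60_000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def steady_state_findings(sample, limits):
    """Return one message per metric that is missing or above its limit."""
    findings = []
    for name, limit in sorted(limits.items()):
        value = sample.get(name)
        if value is None:
            findings.append(f"{name}: no sample")
        elif value > limit:
            findings.append(f"{name}: {value} exceeds {limit}")
    return findings
```

An empty result means every sampled gauge is within its limit. `notification.route.dead_letters` is a counter and is better judged by its rate of change, which this sketch does not cover.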
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification
## Incident Triage
### Intent Stream Lag Grows
Symptoms:
- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry
Checks:
1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs
Expected behavior:
- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
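The two rules above explain why lag can grow behind a single entry. The following is an illustrative model of that offset behavior, not the service's implementation; the `Outcome` enum, the `consume` function, and the entry shape are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PROCESSED = "processed"
    MALFORMED = "malformed"
    USER_SERVICE_UNAVAILABLE = "user_service_unavailable"


def consume(entries, handle):
    """Process entries in order and return (offset, malformed_ids).

    handle(entry_id, payload) returns an Outcome. Malformed entries are
    recorded and the offset still advances. A temporary User Service
    failure stops the loop before the offset moves past that entry.
    """
    offset = None
    malformed = []
    for entry_id, payload in entries:
        outcome = handle(entry_id, payload)
        if outcome is Outcome.USER_SERVICE_UNAVAILABLE:
            break  # the same entry is the next one to process later
        if outcome is Outcome.MALFORMED:
            malformed.append(entry_id)
        offset = entry_id
    return offset, malformed
```

In this model a malformed entry never blocks the stream, while an unreachable `User Service` holds the offset at the last entry processed before the failure. That matches the symptom of consumer logs stopping after a specific stream entry.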
### Route Schedule Backlog Grows
Symptoms:
- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`
Checks:
1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`
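Check 3 is easy to automate once route IDs have been listed from Redis. A minimal sketch follows, assuming only the `push:` and `email:` prefixes named above; the helper names and the example ID suffixes are hypothetical.

```python
def route_channel(route_id):
    """Return "push" or "email" for a recognized prefix, else None."""
    for channel in ("push", "email"):
        if route_id.startswith(channel + ":"):
            return channel
    return None


def unexpected_route_ids(route_ids):
    """Return the IDs that carry neither expected prefix."""
    return [rid for rid in route_ids if route_channel(rid) is None]
```

For example, `unexpected_route_ids(["push:abc", "sms:xyz"])` returns `["sms:xyz"]`, which would point at a route that neither publisher picks up.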
### Dead-Letter Spikes
Symptoms:
- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
`gateway_stream_publish_failed`, or `mail_stream_publish_failed`
Checks:
1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
Mail-specific issues
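For check 4, a tally of failures per channel and classification makes a one-sided problem obvious. This sketch assumes route records have already been loaded into dictionaries; the `route_id` key is a hypothetical name, while `last_error_classification` is the field named in this runbook.

```python
from collections import Counter


def failures_by_channel(routes):
    """Count (channel, classification) pairs across route records.

    The channel is taken from the route ID prefix, for example "push" for
    an ID that starts with "push:".
    """
    tally = Counter()
    for route in routes:
        channel = route["route_id"].split(":", 1)[0]
        tally[(channel, route["last_error_classification"])] += 1
    return tally
```

If nearly all counts fall under `("push", "gateway_stream_publish_failed")`, the `Gateway` stream is the likely culprit; a spread of `payload_encoding_failed` across both channels points at the payload instead.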
Recovery:
1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
`idempotency_key`
3. keep the old dead-letter record untouched as audit history
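Recovery step 2 amounts to copying the failed intent and replacing its key. The sketch below assumes the intent is available as a dictionary with an `idempotency_key` field; the other field names and the UUID-based key are placeholders, since the key format belongs to the producer.

```python
import copy
import uuid


def replacement_intent(original, new_key=None):
    """Return a copy of the intent with a fresh idempotency_key.

    The original record is not modified. Reusing the old key is rejected
    because the replacement has to carry a new one.
    """
    intent = copy.deepcopy(original)
    intent["idempotency_key"] = new_key or uuid.uuid4().hex
    if intent["idempotency_key"] == original.get("idempotency_key"):
        raise ValueError("replacement must use a new idempotency_key")
    return intent
```

The replacement should be published by the owning producer through the normal intent path, which leaves the dead-letter record in place for audit.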
### Missing Administrator Mail
Symptoms:
- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`
Checks:
1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
names
3. restart the process after configuration changes
Expected behavior:
- empty administrator lists materialize one skipped synthetic route so the
configuration gap remains durable and visible
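Check 2 can be approximated offline. This is a rough sketch under stated assumptions: the variable is assumed to hold a comma-separated list, and a "normalized" address is taken to mean a bare `local@domain` with no display name, quotes, or embedded whitespace. The service's own parsing rules may be stricter.

```python
def is_plain_address(value):
    """True for a bare local@domain string with no display name."""
    if not value or any(ch in value for ch in ' \t<>,;"'):
        return False
    local, at, domain = value.partition("@")
    return bool(local) and at == "@" and bool(domain) and "@" not in domain


def check_admin_list(raw):
    """Split a comma-separated setting into (valid, rejected) lists.

    Assumption: entries are comma-separated and surrounding whitespace is
    insignificant. An empty setting yields two empty lists.
    """
    items = [item.strip() for item in raw.split(",") if item.strip()]
    valid = [item for item in items if is_plain_address(item)]
    rejected = [item for item in items if not is_plain_address(item)]
    return valid, rejected
```

An empty `valid` list corresponds to the skipped synthetic route described above; any entry in `rejected` is worth correcting before the restart in check 3.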
### Auth-Code Mail Appears Missing
Auth-code mail is intentionally outside `Notification Service`.
Checks:
1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`