feat: notification service
@@ -0,0 +1,167 @@
# Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `NOTIFICATION_REDIS_ADDR` points to the Redis deployment that stores
  notification records, routes, idempotency reservations, malformed intents,
  dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
  `User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
  `Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
  `Mail Service`
- administrator email variables are populated for notification types that
  should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces
  or metrics are expected outside the process

At startup the process performs a bounded Redis `PING`. Startup fails fast if
configuration validation or Redis connectivity fails.
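
The checklist above can be partly automated before the process is launched. The following is a minimal sketch, not part of the service: the variable names come from the checklist, but treating all four as mandatory is an assumption, and the helper name `missing_vars` is hypothetical. Calling `missing_vars(os.environ)` in a deploy script returns the names that still need attention; it does not replace the service's own configuration validation.

```python
import os

# Variable names are taken from the startup checklist. Whether the service
# treats every one of them as strictly required is an assumption of this sketch.
REQUIRED_VARS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_vars(env):
    """Return the checklist variables that are unset or blank in `env`.

    `env` is any mapping of variable name to string value, for example
    `os.environ` or a dict parsed from a deployment manifest.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```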

Known startup caveats:

- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
  exporters
- readiness is process-local after successful startup

## Steady-State Verification

Practical readiness verification:

1. confirm startup logs for the internal HTTP listener, intent consumer, push
   publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
   and verify route publication in the downstream stream
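
Step 2 can be scripted with the Python standard library. This sketch rests on several assumptions that the runbook does not state: that `NOTIFICATION_INTERNAL_HTTP_ADDR` is a `host:port` listen address (possibly with an empty or wildcard host), that the listener speaks plain HTTP, and that `/readyz` answers `200` when ready. The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def readyz_url(listen_addr):
    """Build a probe URL from a listen address such as ":8080".

    Assumption: wildcard or empty hosts are reachable via loopback
    from the machine running this check.
    """
    host, _, port = listen_addr.rpartition(":")
    if host in ("", "0.0.0.0", "[::]"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/readyz"


def is_ready(listen_addr, timeout_s=2.0):
    """Return True only if GET /readyz answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(readyz_url(listen_addr), timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False
```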

Expected steady-state signals:

- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays near the active retry
  ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
  when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
  correlation identifiers where present
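
One way to turn the metric signals above into an alerting aid is to compare two metric snapshots against operator-chosen limits. The sketch below assumes the snapshots are plain dicts keyed by the metric names from this runbook and that `notification.route.dead_letters` is a monotonically increasing counter. The limit names and the function `triage_hints` are hypothetical, and the runbook does not prescribe any threshold values.

```python
DEPTH = "notification.route_schedule.depth"
SCHEDULE_AGE = "notification.route_schedule.oldest_age_ms"
INTENT_AGE = "notification.intent_stream.oldest_unprocessed_age_ms"
DEAD_LETTERS = "notification.route.dead_letters"


def triage_hints(prev, curr, limits):
    """Return the triage section titles suggested by two metric snapshots.

    `prev` and `curr` map metric names to numbers; `limits` holds
    operator-chosen thresholds (hypothetical keys, no defaults implied).
    """
    hints = []
    if curr[INTENT_AGE] > limits["intent_age_ms"]:
        hints.append("Intent Stream Lag Grows")
    if curr[DEPTH] > limits["schedule_depth"] or curr[SCHEDULE_AGE] > limits["schedule_age_ms"]:
        hints.append("Route Schedule Backlog Grows")
    # Dead letters are compared as a delta between snapshots, not as an absolute value.
    if curr[DEAD_LETTERS] - prev[DEAD_LETTERS] > limits["dead_letter_delta"]:
        hints.append("Dead-Letter Spikes")
    return hints
```

The returned titles match the incident sections of this runbook, so an alert message can point the operator at the right checklist.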

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification
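
Steps 1 and 2 of the planned restart can be sketched for the case where the operator script itself launched the service and holds a `subprocess.Popen` handle. That deployment model is an assumption; under systemd, Kubernetes, or another supervisor the signal is sent by the supervisor instead. The helper name `terminate_and_wait` is hypothetical, and the wait timeout should be chosen somewhat larger than `NOTIFICATION_SHUTDOWN_TIMEOUT`.

```python
import signal
import subprocess


def terminate_and_wait(proc, timeout_s):
    """Send SIGTERM to `proc` and wait up to `timeout_s` seconds for exit.

    Returns True if the process exited within the bound, False otherwise.
    The caller decides what to do on False; this sketch never escalates
    to SIGKILL on its own.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A `False` result means the process outlived the wait; checking the shutdown logs before forcing termination keeps the flush of OpenTelemetry providers and the Redis close from being cut short unnecessarily.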

## Incident Triage

### Intent Stream Lag Grows

Symptoms:

- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry

Checks:

1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
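
The two expected behaviors explain why lag grows behind one specific entry. The following in-memory model illustrates the offset rule only; it is not the service's consumer. The stream is a plain list, the offset is a list index, and the exception and function names are hypothetical.

```python
class MalformedIntent(Exception):
    """The entry can never be processed (bad producer input)."""


class TemporaryFailure(Exception):
    """A dependency such as User Service is unavailable right now."""


def consume(entries, offset, handle, malformed_log):
    """Process entries[offset:] in order and return the new offset."""
    while offset < len(entries):
        try:
            handle(entries[offset])
        except MalformedIntent as exc:
            # Recorded for later inspection; the offset still advances below.
            malformed_log.append((offset, str(exc)))
        except TemporaryFailure:
            # Stop before advancing, so the same entry is retried next time.
            break
        offset += 1
    return offset
```

In this model a malformed entry costs one log record and nothing else, while a temporary failure pins the offset, which is what shows up as a growing `oldest_unprocessed_age_ms`.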

### Route Schedule Backlog Grows

Symptoms:

- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`

Checks:

1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`
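
Check 3 is easy to apply to a batch of route IDs exported from Redis. This is a small sketch under assumptions: only the `push:` and `email:` prefixes come from this runbook, the mapping of `push:` to `Gateway` and `email:` to `Mail Service` is inferred from the stream names above, and the function names are hypothetical. Nothing is assumed about what follows the prefix.

```python
# Prefix-to-downstream mapping inferred from this runbook (an assumption).
EXPECTED_PREFIXES = {"push:": "Gateway", "email:": "Mail Service"}


def downstream_for(route_id):
    """Return the downstream a route ID belongs to, or None if the prefix is unknown."""
    for prefix, downstream in EXPECTED_PREFIXES.items():
        if route_id.startswith(prefix):
            return downstream
    return None


def unexpected_route_ids(route_ids):
    """Return the route IDs whose prefix matches neither expected channel."""
    return [route_id for route_id in route_ids if downstream_for(route_id) is None]
```

Any ID returned by `unexpected_route_ids` is worth inspecting before looking at publisher health, since no publisher would be expected to pick it up.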

### Dead-Letter Spikes

Symptoms:

- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
  `gateway_stream_publish_failed`, or `mail_stream_publish_failed`

Checks:

1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
   Mail-specific issues

Recovery:

1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
   `idempotency_key`
3. keep the old dead-letter record untouched as audit history
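
Recovery steps 2 and 3 amount to building a fresh intent rather than editing the failed one. The sketch below assumes the original intent is available as a dict that carries an `idempotency_key` field; the `replay-<uuid>` key format and the function name are hypothetical, and the real key format is whatever the owning producer defines.

```python
import copy
import uuid


def rebuild_intent(original_intent):
    """Return a copy of the intent with a new idempotency key.

    The input dict is never mutated, mirroring the rule that the old
    dead-letter record stays untouched as audit history.
    """
    replay = copy.deepcopy(original_intent)
    # Hypothetical key format; the producer owns the real scheme.
    replay["idempotency_key"] = f"replay-{uuid.uuid4()}"
    return replay
```

Reusing the old key would presumably collide with the existing idempotency reservation, which is why the recovery step asks for a new one.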

### Missing Administrator Mail

Symptoms:

- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`

Checks:

1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
   names
3. restart the process after configuration changes

Expected behavior:

- empty administrator lists materialize one skipped synthetic route so the
  configuration gap remains durable and visible
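
Check 2 can be approximated with `email.utils.parseaddr` from the Python standard library. The sketch interprets "normalized" as trimmed and lowercase, which is an assumption; the service's own validation rules may differ. The function name `is_plain_address` is hypothetical.

```python
from email.utils import parseaddr


def is_plain_address(value):
    """Return True for a single bare address such as "ops@example.com".

    Rejects display names ("Ops <ops@example.com>"), lists of addresses,
    surrounding whitespace, and uppercase characters (assumed normalization).
    """
    name, addr = parseaddr(value)
    return (
        name == ""
        and addr == value
        and value == value.strip().lower()
        and addr.count("@") == 1
    )
```

Running each configured administrator address through this check before a restart catches the most common formatting mistakes, but a passing result does not prove the service will accept the value.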

### Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside `Notification Service`.

Checks:

1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`