# Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.
## Startup Checks

Before starting the process, confirm:

- `NOTIFICATION_REDIS_ADDR` points to the Redis deployment that stores
  notification records, routes, idempotency reservations, malformed intents,
  dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
  `User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
  `Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
  `Mail Service`
- administrator email variables are populated for notification types that
  should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces
  or metrics are expected outside the process

At startup the process performs a bounded Redis `PING`. Startup fails fast if
configuration validation or Redis connectivity fails.
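Because startup fails fast on invalid configuration, it can help to check the environment before launching. The following is a minimal sketch of such a pre-flight check. The variable names come from the checklist above, but the helper `missing_settings` is hypothetical, and treating all four variables as mandatory is an assumption rather than the service's documented validation rule.

```python
# Variable names are taken from the startup checklist. Whether the service
# itself treats each one as mandatory is an assumption of this sketch.
REQUIRED_SETTINGS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_settings(env):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

Calling `missing_settings(os.environ)` in the target shell returns an empty list when every listed variable is present. It does not verify that the values point at the right Redis deployment or streams.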
Known startup caveats:

- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
  exporters
- readiness is process-local after successful startup

## Steady-State Verification

Practical readiness verification:

1. confirm startup logs for the internal HTTP listener, intent consumer, push
   publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
   and verify route publication in the downstream stream
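Step 2 can be scripted with the standard library. This sketch assumes the listener speaks plain HTTP, that `base_url` is built by the operator from `NOTIFICATION_INTERNAL_HTTP_ADDR` (for example `http://127.0.0.1:8080`, a made-up value), and that HTTP 200 means ready. The runbook does not state the exact status code, so adjust the comparison if the service answers differently. The name `readyz_ok` and the `opener` parameter are hypothetical.

```python
import urllib.error
import urllib.request


def readyz_ok(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True when GET <base_url>/readyz answers with HTTP 200.

    `opener` exists only so the probe can be exercised without a live
    listener; by default it is urllib's own urlopen.
    """
    url = base_url.rstrip("/") + "/readyz"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures and non-2xx answers both count as not ready.
        return False
```

Because readiness is process-local, a passing probe says nothing about Redis or exporter health, which is why step 3 remains a separate check.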
Expected steady-state signals:

- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays near the active retry
  ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
  when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
  correlation identifiers where present
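The gauge-style signals above can be turned into a simple threshold check once a metrics snapshot is available from the collector. The sketch below is illustrative only: the metric names are from this runbook, but every numeric limit is a placeholder assumption, and `breached_signals` is a hypothetical helper. The dead-letter counter is left out because "changes rarely" describes a rate, not a level.

```python
# Metric names come from the runbook. Every numeric limit is a placeholder
# assumption and should be replaced with values agreed for the environment.
DEFAULT_LIMITS = {
    "notification.route_schedule.depth": 1_000,
    "notification.route_schedule.oldest_age_ms": 60_000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def breached_signals(snapshot, limits=DEFAULT_LIMITS):
    """Return sorted metric names whose current value exceeds its limit.

    Metrics missing from the snapshot are skipped rather than reported, so
    an exporter gap has to be detected separately.
    """
    return sorted(
        name
        for name, limit in limits.items()
        if name in snapshot and snapshot[name] > limit
    )
```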
## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification
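Steps 1 and 2 of the planned restart can be sketched for a process started through Python's `subprocess` module on a POSIX host. The helper name `stop_gracefully` is hypothetical. The wait should be somewhat longer than `NOTIFICATION_SHUTDOWN_TIMEOUT` so the service can finish its own bounded shutdown; how much margin to add is a local decision.

```python
import signal
import subprocess


def stop_gracefully(process, timeout_seconds):
    """Send SIGTERM to a subprocess.Popen and wait for it to exit.

    Returns True if the process exited within timeout_seconds. On timeout
    the process is left running so the operator can decide how to escalate.
    """
    process.send_signal(signal.SIGTERM)
    try:
        process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return True
```

A `False` result suggests shutdown exceeded its bound, in which case the listener and worker shutdown logs are the first place to look before restarting.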
## Incident Triage

### Intent Stream Lag Grows

Symptoms:

- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry

Checks:

1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
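The two expected behaviors explain why lag tends to pile up behind one specific entry. The toy model below illustrates that reasoning and is not the service's code: entries are plain dicts, "malformed" is reduced to a missing `notification_type` field, `recipient_user_id` is treated as a list, and `TemporaryUserServiceError`, `consume`, and `resolve_users` are hypothetical names.

```python
class TemporaryUserServiceError(Exception):
    """Hypothetical marker for a retryable `User Service` failure."""


def consume(entries, offset, resolve_users, malformed_log):
    """Process entries starting at index `offset` and return the new offset.

    entries: list of dicts standing in for stream entries.
    resolve_users: callable that may raise TemporaryUserServiceError.
    malformed_log: list that receives entries rejected as malformed.
    """
    while offset < len(entries):
        entry = entries[offset]
        if "notification_type" not in entry:
            # Malformed input is recorded and skipped, so the offset advances.
            malformed_log.append(entry)
            offset += 1
            continue
        try:
            resolve_users(entry.get("recipient_user_id", []))
        except TemporaryUserServiceError:
            # A temporary dependency failure stops progress before the
            # offset moves, so the same entry is retried later.
            break
        offset += 1
    return offset
```

In this model a malformed entry never blocks the stream, while an unreachable `User Service` holds the offset at the failing entry. That matches the symptom of consumer logs stopping after one stream ID.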
### Route Schedule Backlog Grows

Symptoms:

- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`

Checks:

1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`
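Checks 3 and 5 lend themselves to a quick pass over exported route records. This sketch assumes the records have already been read out of Redis into dicts. The `route_id` key is a hypothetical field name, while `last_error_classification` and the two prefixes are taken from the checks above.

```python
from collections import Counter

EXPECTED_PREFIXES = ("push:", "email:")


def summarize_routes(routes):
    """Return (ids with an unexpected prefix, Counter of error classes).

    Each route is a dict with a hypothetical `route_id` key and an optional
    `last_error_classification` key.
    """
    bad_ids = [
        route["route_id"]
        for route in routes
        if not route["route_id"].startswith(EXPECTED_PREFIXES)
    ]
    errors = Counter(
        route["last_error_classification"]
        for route in routes
        if route.get("last_error_classification")
    )
    return bad_ids, errors
```

Any ID in the first result points at a prefix mismatch, and `errors.most_common()` shows which classification dominates the backlog.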
### Dead-Letter Spikes

Symptoms:

- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
  `gateway_stream_publish_failed`, or `mail_stream_publish_failed`

Checks:

1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
   Mail-specific issues
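For check 4, one way to compare failures across channels is to map each classification to the side most likely at fault and see which dominates. The classification strings are quoted from the symptoms above. The mapping itself, and the names `LIKELY_OWNER` and `dominant_owner`, are interpretive assumptions rather than documented behavior.

```python
# Classification strings are quoted from the symptoms list. The mapping to a
# likely owner is an interpretation, not documented behavior.
LIKELY_OWNER = {
    "gateway_stream_publish_failed": "gateway",
    "mail_stream_publish_failed": "mail",
    "payload_encoding_failed": "payload",
}


def dominant_owner(classifications):
    """Return the most frequent likely owner, or None for no known input.

    Ties resolve to whichever owner was seen first.
    """
    counts = {}
    for classification in classifications:
        owner = LIKELY_OWNER.get(classification)
        if owner is not None:
            counts[owner] = counts.get(owner, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)
```

A result of `"payload"` points toward check 2 (catalog drift), while `"gateway"` or `"mail"` points toward check 3 on the corresponding downstream stream.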
Recovery:

1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
   `idempotency_key`
3. keep the old dead-letter record untouched as audit history
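Recovery step 2 amounts to re-sending the same intent under a different `idempotency_key` while leaving the original record alone. The sketch below shows only that copy step, assuming the intent is available as a dict. The helper `with_new_idempotency_key` is hypothetical, and the random UUID is a stand-in for whatever key scheme the producer actually owns.

```python
import copy
import uuid


def with_new_idempotency_key(intent, new_key=None):
    """Return a deep copy of intent carrying a fresh `idempotency_key`.

    The original dict is left unchanged. A random UUID is only a stand-in:
    the key is producer-owned, so pass the producer's own value as new_key
    whenever one is defined.
    """
    replacement = copy.deepcopy(intent)
    replacement["idempotency_key"] = new_key or str(uuid.uuid4())
    return replacement
```

Publishing the returned intent is left to the producer's normal path, so the dead-letter entry stays intact as audit history.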
### Missing Administrator Mail

Symptoms:

- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`

Checks:

1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
   names
3. restart the process after configuration changes
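Check 2 can be approximated with the standard library's address parser. This sketch reads "normalized" narrowly, as no display name, no surrounding whitespace, and exactly one address. It does not check letter case or domain validity, and the service's own validation may be stricter. The helper name `is_bare_address` is hypothetical.

```python
from email.utils import parseaddr


def is_bare_address(value):
    """Return True when value is exactly one address with no display name.

    The parsed address must equal the input verbatim, which rejects display
    names, angle brackets, comma-separated lists, and stray whitespace.
    """
    name, address = parseaddr(value)
    return name == "" and address == value and "@" in address
```

Applying this to each entry of the administrator variable shows quickly whether a value such as `Ops Team <ops@example.com>` is the likely reason for a skipped route.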
Expected behavior:

- empty administrator lists materialize one skipped synthetic route so the
  configuration gap remains durable and visible

### Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside `Notification Service`.

Checks:

1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`