# Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `NOTIFICATION_REDIS_MASTER_ADDR` points to the Redis master deployment
  that hosts the inbound `notification:intents` stream, the persisted
  consumer offset, the outbound `gateway:client-events` and
  `mail:delivery_commands` streams, and the temporary `route_leases:*` keys
- `NOTIFICATION_REDIS_PASSWORD` matches the connection password
  (mandatory; the deprecated `NOTIFICATION_REDIS_USERNAME` /
  `NOTIFICATION_REDIS_TLS_ENABLED` env vars are rejected at startup)
- `NOTIFICATION_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
  hosting the `notification` schema; the role must own
  `records`, `routes`, `dead_letters`, and `malformed_intents`
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
  `User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
  `Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
  `Mail Service`
- administrator email variables are populated for notification types that
  should notify administrators
- retention knobs (`NOTIFICATION_RECORD_RETENTION`,
  `NOTIFICATION_MALFORMED_INTENT_RETENTION`,
  `NOTIFICATION_CLEANUP_INTERVAL`) are sized for the expected operator
  history window
- OpenTelemetry exporter settings point at the intended collector when traces
  or metrics are expected outside the process
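
The checklist above can be partly automated before launch. The following is a minimal sketch, not the service's own validation: it assumes that every variable listed in `REQUIRED` must be non-empty (the service may apply defaults for some of them), and the `preflight` helper name is hypothetical.

```python
from typing import Mapping

# Names taken from the checklist above. Treating all of them as required is an
# assumption of this sketch; the service's own validation is authoritative.
REQUIRED = (
    "NOTIFICATION_REDIS_MASTER_ADDR",
    "NOTIFICATION_REDIS_PASSWORD",
    "NOTIFICATION_POSTGRES_PRIMARY_DSN",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)

# Deprecated variables that the runbook says are rejected at startup.
DEPRECATED = (
    "NOTIFICATION_REDIS_USERNAME",
    "NOTIFICATION_REDIS_TLS_ENABLED",
)


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means no findings."""
    problems = []
    for name in REQUIRED:
        # A variable that is unset or only whitespace counts as missing.
        if not env.get(name, "").strip():
            problems.append(f"missing or empty: {name}")
    for name in DEPRECATED:
        # Presence alone is the problem, whatever the value.
        if name in env:
            problems.append(f"deprecated variable is set: {name}")
    return problems
```

Calling `preflight(os.environ)` in a deploy script and aborting on a non-empty result would surface these gaps before the process itself fails fast.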

At startup the process performs a bounded Redis `PING`, opens the
PostgreSQL pool, runs the embedded goose migrations, and only then starts
the internal HTTP probe. Startup fails fast if configuration validation,
Redis connectivity, PostgreSQL connectivity, or migration application
fails.

Known startup caveats:

- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
  exporters
- readiness is process-local after successful startup

## Steady-State Verification

Practical readiness verification:

1. confirm startup logs for the internal HTTP listener, intent consumer, push
   publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
   and verify route publication in the downstream stream
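
Step 2 can be scripted with the Python standard library. This sketch makes several assumptions that the runbook does not state: that `NOTIFICATION_INTERNAL_HTTP_ADDR` has a `host:port` or `:port` form, that the probe speaks plain HTTP, and that readiness is signalled by status `200`. The `readyz_url` and `is_ready` helpers are hypothetical names.

```python
import http.client
import urllib.error
import urllib.request


def readyz_url(addr: str) -> str:
    """Build the probe URL from a listen address such as "host:port" or ":port"."""
    host, _, port = addr.rpartition(":")
    # Assumption: a bare ":port" or wildcard bind address is reachable on loopback.
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/readyz"


def is_ready(addr: str, timeout: float = 2.0) -> bool:
    """Return True when GET /readyz answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(readyz_url(addr), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection failures, timeouts, and non-2xx responses all count as not ready.
        return False
```

Because readiness is process-local, a `True` result here only confirms that the process started; Redis and exporter health still need the out-of-band check in step 3.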

Expected steady-state signals:

- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays near the active retry
  ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
  when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
  correlation identifiers where present
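
The gauge-style signals above lend themselves to a simple ceiling check once samples have been pulled from the metrics backend. The sketch below is illustrative only: the ceiling values are hypothetical placeholders to be tuned against the deployed retry ladder, and `out_of_bounds` is not part of the service.

```python
from typing import Mapping

# Hypothetical alert ceilings for illustration; only the metric names come
# from the runbook. Tune the values to the retry ladder and traffic profile.
CEILINGS = {
    "notification.route_schedule.depth": 1_000,
    "notification.route_schedule.oldest_age_ms": 5 * 60 * 1000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def out_of_bounds(samples: Mapping[str, float]) -> list[str]:
    """Return the sorted names of metrics that are missing or above their ceiling."""
    flagged = []
    for name, ceiling in CEILINGS.items():
        value = samples.get(name)
        # A missing sample is flagged too: silence from the exporter is a signal.
        if value is None or value > ceiling:
            flagged.append(name)
    return sorted(flagged)
```

`notification.route.dead_letters` is left out on purpose, since "changes rarely" describes a rate of change rather than a level.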

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification
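
Steps 1 and 2 of the planned restart can be sketched as a small helper for cases where the process is launched from a Python supervisor script. This is an assumption about the deployment; under systemd or Kubernetes the platform sends the signal instead. The `stop_gracefully` name is hypothetical, and the wait bound should be slightly larger than `NOTIFICATION_SHUTDOWN_TIMEOUT`.

```python
import signal
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float) -> bool:
    """Send SIGTERM and wait up to `timeout` seconds; return True if the process exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # Escalation (for example SIGKILL) is deliberately left to the operator.
        return False
```

A `False` result means the coordinated shutdown exceeded the wait bound, which is worth investigating in the shutdown logs before forcing termination.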

## Incident Triage

### Intent Stream Lag Grows

Symptoms:

- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry

Checks:

1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs

Expected behavior:

- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
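
When working through check 1 by hand, it helps to compare stream IDs numerically rather than as strings. The sketch below relies on the standard Redis stream ID format `<milliseconds>-<sequence>` and assumes the `notification:intents` entries use auto-generated IDs, so the millisecond part approximates the time the entry was added. The helper names are hypothetical, and the service's own lag metric may be computed differently.

```python
from typing import Optional


def parse_stream_id(stream_id: str) -> tuple[int, int]:
    """Split a Redis stream ID "<ms>-<seq>" into integers that compare correctly."""
    ms, _, seq = stream_id.partition("-")
    return int(ms), int(seq or 0)


def next_unprocessed(entry_ids: list[str], offset_id: str) -> Optional[str]:
    """Return the smallest entry ID strictly greater than the persisted offset."""
    offset = parse_stream_id(offset_id)
    later = [e for e in entry_ids if parse_stream_id(e) > offset]
    return min(later, key=parse_stream_id) if later else None


def entry_age_ms(stream_id: str, now_ms: int) -> int:
    """Age of an entry, derived from the millisecond part of its ID."""
    return max(0, now_ms - parse_stream_id(stream_id)[0])
```

If the entry returned by `next_unprocessed` keeps the same ID across several checks while its age grows, the consumer is stuck on that entry, which matches the temporary `User Service` failure case rather than the malformed-input case.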

### Route Schedule Backlog Grows

Symptoms:

- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`

Checks:

1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`

### Dead-Letter Spikes

Symptoms:

- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
  `gateway_stream_publish_failed`, or `mail_stream_publish_failed`

Checks:

1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
   Mail-specific issues

Recovery:

1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
   `idempotency_key`
3. keep the old dead-letter record untouched as audit history
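
The cross-channel comparison in check 4 amounts to a tally over failure classifications. The sketch below assumes the classifications have already been exported as plain strings (for example from the `dead_letters` table, whose exact columns are not described here). The mapping from classification to suspect is an interpretation of the three names listed under Symptoms, not a documented contract.

```python
from collections import Counter
from typing import Iterable

# Interpretation of the classification names listed under Symptoms.
SUSPECT = {
    "gateway_stream_publish_failed": "gateway",
    "mail_stream_publish_failed": "mail",
    "payload_encoding_failed": "payload",
}


def tally_suspects(classifications: Iterable[str]) -> Counter:
    """Count dead letters per suspected cause; unknown classifications go to "other"."""
    return Counter(SUSPECT.get(c, "other") for c in classifications)
```

A tally dominated by one of `gateway` or `mail` points at that downstream stream, while a `payload`-heavy tally suggests drift from the notification catalog.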

### Missing Administrator Mail

Symptoms:

- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`

Checks:

1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
   names
3. restart the process after configuration changes

Expected behavior:

- empty administrator lists materialize one skipped synthetic route so the
  configuration gap remains durable and visible
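
Check 2 can be approximated with the standard library's `email.utils.parseaddr`. This is a rough pre-check under stated assumptions, not the service's validator: it rejects display names, surrounding whitespace, and multi-address strings, but it does not enforce any particular case normalization, and how the environment variable separates multiple addresses is not covered. The `is_plain_address` name is hypothetical.

```python
from email.utils import parseaddr


def is_plain_address(raw: str) -> bool:
    """True for a bare address like "ops@example.com"; False for display names or lists."""
    name, addr = parseaddr(raw)
    # Requiring addr == raw rejects display names, angle brackets, surrounding
    # whitespace, and comma-separated lists in a single comparison.
    return name == "" and addr == raw and addr.count("@") == 1
```

For example, `is_plain_address("Ops Team <ops@example.com>")` is `False`, while the bare `ops@example.com` passes.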

### Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside `Notification Service`.

Checks:

1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`