galaxy-game/notification/docs/runbook.md
2026-04-26 20:34:39 +02:00


# Operator Runbook
This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.
## Startup Checks
Before starting the process, confirm:
- `NOTIFICATION_REDIS_MASTER_ADDR` points to the Redis master deployment
that hosts the inbound `notification:intents` stream, the persisted
consumer offset, the outbound `gateway:client-events` and
`mail:delivery_commands` streams, and the temporary `route_leases:*` keys
- `NOTIFICATION_REDIS_PASSWORD` matches the connection password
(mandatory; the deprecated `NOTIFICATION_REDIS_USERNAME` /
`NOTIFICATION_REDIS_TLS_ENABLED` env vars are rejected at startup)
- `NOTIFICATION_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
hosting the `notification` schema; the role named in the DSN must own
`records`, `routes`, `dead_letters`, and `malformed_intents`
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
`User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
`Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
`Mail Service`
- administrator email variables are populated for notification types that
should notify administrators
- retention knobs (`NOTIFICATION_RECORD_RETENTION`,
`NOTIFICATION_MALFORMED_INTENT_RETENTION`,
`NOTIFICATION_CLEANUP_INTERVAL`) are sized for the expected operator
history window
- OpenTelemetry exporter settings point at the intended collector when traces
or metrics are expected outside the process
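The checklist above can be partly automated before the process is started. The following is a minimal sketch, not part of the service: it treats every variable named above as required (an assumption, since some may have defaults) and flags the two deprecated variables. The function name `preflight` is hypothetical.

```python
import os

# Variables named in the checklist above. Treating all of them as required
# is an assumption; some may have defaults in the real service.
REQUIRED = (
    "NOTIFICATION_REDIS_MASTER_ADDR",
    "NOTIFICATION_REDIS_PASSWORD",
    "NOTIFICATION_POSTGRES_PRIMARY_DSN",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)

# Deprecated variables that the runbook says are rejected at startup.
DEPRECATED = (
    "NOTIFICATION_REDIS_USERNAME",
    "NOTIFICATION_REDIS_TLS_ENABLED",
)


def preflight(env):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for name in REQUIRED:
        if not env.get(name, "").strip():
            problems.append(f"missing or empty: {name}")
    for name in DEPRECATED:
        if name in env:
            problems.append(f"deprecated variable is set: {name}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print(problem)
```

The check only confirms presence. It cannot tell whether an address or DSN points at the intended deployment, so the manual review above still applies.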
At startup the process performs a bounded Redis `PING`, opens the
PostgreSQL pool, runs the embedded goose migrations, and only then starts
the internal HTTP probe. Startup fails fast if configuration validation,
Redis connectivity, PostgreSQL connectivity, or migration application
fails.
Known startup caveats:
- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
exporters
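- readiness is process-local after successful startup; treat `/readyz` as a
process signal rather than as proof that Redis or PostgreSQL are still
reachable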
## Steady-State Verification
Practical readiness verification:
1. confirm startup logs for the internal HTTP listener, intent consumer, push
publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity, PostgreSQL connectivity, and OpenTelemetry
exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
and verify route publication in the downstream stream
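Step 2 can be scripted. This is a minimal sketch under stated assumptions: the probe is served over plain HTTP, `NOTIFICATION_INTERNAL_HTTP_ADDR` is given to the script in `host:port` form with an explicit host, and HTTP 200 means ready. The helper name `probe_readyz` is hypothetical.

```python
import urllib.error
import urllib.request

# An empty ProxyHandler bypasses any proxy configured in the environment,
# which is usually what an internal probe wants.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_readyz(addr, timeout=2.0):
    """Return True when GET /readyz on addr answers with HTTP 200."""
    url = f"http://{addr}/readyz"
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status all count
        # as "not ready".
        return False
```

A call such as `probe_readyz("127.0.0.1:8080")` uses a hypothetical address. A `True` result covers only the process itself, so steps 3 and 4 are still needed.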
Expected steady-state signals:
- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays close to the delays of
the active retry ladder and does not grow without bound
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
correlation identifiers where present
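The gauge-style signals above lend themselves to a simple threshold check in whatever tooling reads the exported metrics. The sketch below assumes the samples have already been fetched into a dictionary keyed by metric name. The limits are placeholders, not recommended values, and `evaluate_signals` is a hypothetical helper.

```python
# Hypothetical limits; choose values that fit the retry ladder and traffic
# of the environment being checked.
EXAMPLE_LIMITS = {
    "notification.route_schedule.depth": 1000,
    "notification.route_schedule.oldest_age_ms": 300_000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def evaluate_signals(samples, limits):
    """Return the sorted names of metrics whose sampled value exceeds its limit."""
    breaches = []
    for name, limit in limits.items():
        value = samples.get(name)
        # A metric that was not sampled is skipped rather than reported.
        if value is not None and value > limit:
            breaches.append(name)
    return sorted(breaches)
```

`notification.route.dead_letters` is left out on purpose: the runbook describes it as changing rarely, which calls for a rate-of-change check rather than a fixed ceiling.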
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, PostgreSQL, stream, and
downstream settings
4. repeat steady-state verification
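Steps 1 and 2 of the planned restart can be sketched for the case where a wrapper script launched the process itself through `subprocess.Popen`. That is an assumption; under a process supervisor, use the supervisor's own stop command instead. The helper name `stop_gracefully` is hypothetical, and the timeout passed in should be somewhat larger than `NOTIFICATION_SHUTDOWN_TIMEOUT`.

```python
import signal
import subprocess


def stop_gracefully(proc, timeout):
    """Send SIGTERM and wait up to timeout seconds.

    Returns the exit code, or None if the process is still running.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The process outlived the grace period; escalation is left to the
        # operator rather than forcing a kill here.
        return None
```

A `None` result means shutdown did not finish in time. Check the listener and worker shutdown logs before deciding whether to escalate.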
## Incident Triage
### Intent Stream Lag Grows
Symptoms:
- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry
Checks:
1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs
Expected behavior:
- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
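The two expected behaviors above differ in whether the consumer offset moves. The sketch below illustrates that rule only and is not the service's implementation. The exception names, the `consume` function, and the in-memory `malformed_log` are all hypothetical.

```python
class MalformedIntent(Exception):
    """The entry can never be processed (hypothetical name)."""


class TemporaryFailure(Exception):
    """A dependency such as User Service is unavailable (hypothetical name)."""


def consume(entries, handle, offset, malformed_log):
    """Process ordered (stream_id, payload) pairs and return the new offset.

    entries holds the unprocessed entries that follow the current offset.
    """
    for stream_id, payload in entries:
        try:
            handle(payload)
        except MalformedIntent as exc:
            # Bad input is recorded and skipped so it cannot block the stream.
            malformed_log.append((stream_id, str(exc)))
        except TemporaryFailure:
            # Stop without advancing; the same entry is retried on the next pass.
            return offset
        offset = stream_id
    return offset
```

This is why growing lag with logs that stop at one specific entry points at a dependency failure: a malformed entry alone would have been recorded and passed over.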
### Route Schedule Backlog Grows
Symptoms:
- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`
Checks:
1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`
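Checks 3 and 5 can be combined into one pass over exported route records. This sketch assumes each record is a dictionary with a `route_id` key (a hypothetical field name) and the `last_error_classification` field named above. `summarize_routes` is a hypothetical helper.

```python
from collections import Counter

# Prefixes named in the checks above.
EXPECTED_PREFIXES = ("push:", "email:")


def summarize_routes(routes):
    """Return (unexpected route IDs, count of routes per error classification)."""
    unexpected = [
        route["route_id"]
        for route in routes
        if not route["route_id"].startswith(EXPECTED_PREFIXES)
    ]
    # Routes without a recorded classification are grouped under "none".
    by_error = Counter(
        route.get("last_error_classification") or "none" for route in routes
    )
    return unexpected, dict(by_error)
```

A single dominant classification usually narrows the backlog to one channel or dependency, while unexpected prefixes suggest the records were not written by the expected publisher.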
### Dead-Letter Spikes
Symptoms:
- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
`gateway_stream_publish_failed`, or `mail_stream_publish_failed`
Checks:
1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
Mail-specific issues
Recovery:
1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
`idempotency_key`
3. keep the old dead-letter record untouched as audit history
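Recovery step 2 amounts to copying the original intent and replacing only its `idempotency_key`. The sketch below assumes the intent is available as a dictionary and that a random UUID is an acceptable key. Both are assumptions: the key format is owned by the producer, and `reissue_intent` is a hypothetical helper.

```python
import copy
import uuid


def reissue_intent(original):
    """Return a copy of the intent with a fresh idempotency_key.

    The original dictionary is left unmodified.
    """
    fresh = copy.deepcopy(original)
    # A random UUID is an assumption; use the producer's own key scheme
    # if it defines one.
    fresh["idempotency_key"] = str(uuid.uuid4())
    return fresh
```

The new key makes the service treat the replacement as a distinct intent, and the old dead-letter record stays in place as audit history.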
### Missing Administrator Mail
Symptoms:
- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`
Checks:
1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
names
3. restart the process after configuration changes
Expected behavior:
- empty administrator lists materialize one skipped synthetic route so the
configuration gap remains durable and visible
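Check 2 can be approximated with the standard library before the process is restarted. The sketch below treats "normalized" as trimmed and lower-case, which is an assumption about the service's rules. `is_plain_address` is a hypothetical helper and does not replace the service's own validation.

```python
from email.utils import parseaddr


def is_plain_address(value):
    """Return True for a single bare address such as ops@example.com."""
    name, addr = parseaddr(value)
    return (
        name == ""                # no display name
        and addr == value         # nothing besides the address itself
        and "@" in addr
        and value == value.strip().lower()  # assumed normalization rule
    )
```

Values such as `Ops <ops@example.com>` or a comma-separated pair are rejected, which matches the requirement for single addresses without display names.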
### Auth-Code Mail Appears Missing
Auth-code mail is intentionally outside `Notification Service`.
Checks:
1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`