Ilia Denisov 32dc29359a feat: notification service

2026-04-22 08:49:45 +02:00

Operator Runbook

This runbook covers startup, steady-state verification, shutdown, and common Notification Service incidents.

Startup Checks

Before starting the process, confirm:

NOTIFICATION_REDIS_ADDR points to the Redis deployment that stores notification records, routes, idempotency reservations, malformed intents, dead letters, stream offsets, and route schedules
Redis ACL, DB, TLS, and timeout settings match the target environment
NOTIFICATION_USER_SERVICE_BASE_URL points to the trusted internal User Service
NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM matches the stream consumed by Gateway
NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM matches the stream consumed by Mail Service
administrator email variables are populated for notification types that should notify administrators
OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
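A preflight script can catch unset variables before the process is started. The following is a minimal sketch, not part of the service: the variable names come from the checklist above, while the helper name `missing_settings` and the choice of which variables count as mandatory are assumptions made for illustration.

```python
# Variable names are taken from the startup checklist. Which of them the
# service itself treats as mandatory is an assumption of this sketch.
REQUIRED_VARS = (
    "NOTIFICATION_REDIS_ADDR",
    "NOTIFICATION_USER_SERVICE_BASE_URL",
    "NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM",
    "NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM",
)


def missing_settings(env):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_settings(os.environ)` in the deployment environment returns an empty list when every listed variable is populated. It does not check that the values point at the right Redis deployment or streams; that still needs a manual comparison.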

At startup the process sends Redis a PING with a bounded timeout. Startup fails fast if configuration validation or the Redis connectivity check fails.

Known startup caveats:

there is no operator API
there is no /metrics route
traces and metrics are exported only through configured OpenTelemetry exporters
readiness is process-local: after a successful startup, /readyz reflects only the process itself, not live dependency health

Steady-State Verification

Practical readiness verification:

confirm startup logs for the internal HTTP listener, intent consumer, push publisher, and email publisher
request GET /readyz on NOTIFICATION_INTERNAL_HTTP_ADDR
verify Redis connectivity and OpenTelemetry exporter health out of band
publish a low-risk compatible test intent in a non-production environment and verify route publication in the downstream stream
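The GET /readyz step can be scripted. The sketch below assumes that NOTIFICATION_INTERNAL_HTTP_ADDR holds a host:port listen address, that an empty or wildcard host can be probed through 127.0.0.1, and that any 2xx response means ready; none of these are confirmed by this runbook. The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def readyz_url(listen_addr):
    """Build the probe URL from a host:port listen address.

    Assumption: an empty or wildcard host (":8080", "0.0.0.0:8080") means
    the probe should target 127.0.0.1 on the same machine.
    """
    host, _, port = listen_addr.rpartition(":")
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"
    return f"http://{host}:{port}/readyz"


def is_ready(listen_addr, timeout=2.0):
    """Return True when GET /readyz answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(readyz_url(listen_addr), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False
```

A passing probe only confirms the process-local readiness state, so the Redis and exporter checks in the list above remain separate steps.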

Expected steady-state signals:

notification.route_schedule.depth remains bounded
notification.route_schedule.oldest_age_ms stays near the active retry ladder
notification.intent_stream.oldest_unprocessed_age_ms remains near zero when producers are healthy
notification.route.dead_letters changes rarely
malformed-intent logs appear only for bad producer input
logs include notification_type, producer, audience_kind, and correlation identifiers where present
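The gauge-style signals above can be compared against alert thresholds once they are exported. This is a sketch under stated assumptions: the metric names come from the list above, but every threshold value is a placeholder to tune per environment, and `breached` is a hypothetical helper rather than anything the service provides.

```python
# Metric names come from the runbook. Every threshold is a placeholder
# chosen for illustration, not a documented limit.
THRESHOLDS = {
    "notification.route_schedule.depth": 1000,
    "notification.route_schedule.oldest_age_ms": 60_000,
    "notification.intent_stream.oldest_unprocessed_age_ms": 5_000,
}


def breached(samples, thresholds=THRESHOLDS):
    """Return {metric: value} for sampled metrics above their threshold."""
    return {
        name: value
        for name, value in samples.items()
        if name in thresholds and value > thresholds[name]
    }
```

notification.route.dead_letters is left out on purpose: the runbook describes it as changing rarely, so a rate-of-change alert probably fits it better than a fixed ceiling.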

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

coordinated shutdown is bounded by NOTIFICATION_SHUTDOWN_TIMEOUT
the private probe listener is stopped before process resources are closed
route publishers and the intent consumer stop through context cancellation
Redis clients are closed after the app stops
OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

send SIGTERM
wait for listener and worker shutdown logs
restart the process with the same Redis, stream, and downstream settings
repeat steady-state verification
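When the process is started from a supervising script, the first two restart steps can be automated. The sketch below assumes the supervisor holds a `subprocess.Popen` handle for the service. The helper name `stop_gracefully` is hypothetical, and the wait budget should be chosen slightly above NOTIFICATION_SHUTDOWN_TIMEOUT.

```python
import signal
import subprocess


def stop_gracefully(proc, timeout):
    """Send SIGTERM and wait up to timeout seconds for the process to exit.

    Returns the exit code, or None if the process is still running, in
    which case the operator decides whether to escalate.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
```

The helper deliberately never sends SIGKILL. A None result means coordinated shutdown did not finish within the budget, and the shutdown logs should be checked before forcing the process down.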

Incident Triage

Intent Stream Lag Grows

Symptoms:

notification.intent_stream.oldest_unprocessed_age_ms increases
no matching route records appear for new stream entries
consumer logs stop after a specific stream entry

Checks:

inspect the next unprocessed notification:intents entry
confirm User Service is reachable from Notification Service
if the entry is user-targeted, verify every recipient_user_id exists
inspect malformed-intent records for nearby stream IDs

Expected behavior:

malformed input is recorded and the offset advances
temporary User Service failure stops progress before offset advancement
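The two rules above explain why lag from a temporary failure looks like a consumer stuck on one entry, while malformed input does not block the stream. The following is a simplified model of that behavior, not the service's actual consumer: the exception classes, the `consume` function, and the entry IDs are all illustrative.

```python
class TemporaryFailure(Exception):
    """Stands in for a transient dependency error, e.g. User Service being down."""


class MalformedIntent(Exception):
    """Stands in for input that can never be processed."""


def consume(entries, handle, malformed_log):
    """Process (entry_id, payload) pairs in order; return the last committed id.

    Malformed entries are recorded and skipped, so the offset advances.
    A temporary failure stops the loop before the offset moves, so the same
    entry is retried on the next run.
    """
    offset = None
    for entry_id, payload in entries:
        try:
            handle(payload)
        except MalformedIntent as exc:
            malformed_log.append((entry_id, str(exc)))
        except TemporaryFailure:
            break
        offset = entry_id
    return offset
```

In this model, an entry that keeps raising `TemporaryFailure` pins the offset just before it, which matches the symptom of consumer logs stopping after a specific stream entry.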

Route Schedule Backlog Grows

Symptoms:

notification.route_schedule.depth rises steadily
notification.route_schedule.oldest_age_ms increases
routes remain in the pending or failed state

Checks:

confirm push and email publisher startup logs are present
confirm Redis latency and connectivity
verify route IDs match the expected push: or email: prefixes
confirm the downstream stream names match Gateway and Mail Service
inspect route last_error_classification
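The prefix check can be run over a batch of route IDs exported from Redis. This sketch relies only on the push: and email: prefixes named above. The helper name and the sample IDs in the usage note are made up for illustration.

```python
EXPECTED_PREFIXES = ("push:", "email:")


def unexpected_route_ids(route_ids):
    """Return the route IDs that start with neither expected prefix."""
    return [rid for rid in route_ids if not rid.startswith(EXPECTED_PREFIXES)]
```

For example, `unexpected_route_ids(["push:a", "email:b", "sms:c"])` returns `["sms:c"]`. Any non-empty result suggests routes that neither publisher is likely to pick up.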

Dead-Letter Spikes

Symptoms:

notification.route.dead_letters increases rapidly
route records show repeated payload_encoding_failed, gateway_stream_publish_failed, or mail_stream_publish_failed

Checks:

inspect the dead-letter entry and owning route
verify payload fields still match the notification catalog
confirm downstream Redis stream writes are accepted
compare failures across channels to isolate Gateway-specific or Mail-specific issues

Recovery:

correct the downstream dependency or payload problem
publish a new compatible intent with a new producer-owned idempotency_key
keep the old dead-letter record untouched as audit history
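The second recovery step can be sketched as follows, assuming the producer keeps the original intent as a dictionary. Only `idempotency_key` is a field name taken from this runbook. The helper name, the other fields in the usage note, and the `reissue-<uuid>` key format are hypothetical, and the real key scheme belongs to the producer.

```python
import copy
import uuid


def reissue_intent(original, new_key=None):
    """Return a copy of an intent that carries a fresh idempotency_key.

    The original dictionary is left untouched, mirroring the rule that the
    old dead-letter record stays as audit history.
    """
    fresh = copy.deepcopy(original)
    # Hypothetical key format; the producer owns the real scheme.
    fresh["idempotency_key"] = new_key or f"reissue-{uuid.uuid4()}"
    if fresh["idempotency_key"] == original.get("idempotency_key"):
        raise ValueError("new idempotency_key must differ from the original")
    return fresh
```

A new key is needed because the service stores idempotency reservations in Redis. Presumably, replaying the intent under the old key would be treated as a duplicate instead of producing new routes.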

Missing Administrator Mail

Symptoms:

an intent with an administrator notification type is accepted
no email command reaches mail:delivery_commands
the route is recorded as skipped with recipient config:<notification_type>

Checks:

inspect the type-specific administrator email environment variable
confirm addresses are normalized single email addresses without display names
restart the process after configuration changes

Expected behavior:

empty administrator lists materialize one skipped synthetic route so the configuration gap remains durable and visible
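The address check above can be approximated with the standard library. This sketch takes "normalized single email address without display name" to mean a bare `local@domain` string with no surrounding whitespace, angle brackets, or display name; the service's own validation may be stricter. Both helper names are hypothetical.

```python
from email.utils import parseaddr


def is_plain_address(value):
    """Return True for a bare local@domain string with nothing around it."""
    name, addr = parseaddr(value)
    local, _, domain = addr.partition("@")
    return (
        name == ""
        and addr == value  # rejects whitespace, angle brackets, comments
        and bool(local)
        and bool(domain)
        and "@" not in domain
    )


def invalid_admin_addresses(addresses):
    """Return the entries that are not plain single addresses."""
    return [item for item in addresses if not is_plain_address(item)]
```

How the environment variable separates multiple addresses is not stated here, so the sketch takes an already split list.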

Auth-Code Mail Appears Missing

Auth-code mail is intentionally outside Notification Service.

Checks:

inspect the logs and delivery records on the Auth / Session Service -> Mail Service path
confirm notification:intents remains unused for auth-code delivery
do not replay auth-code mail through Notification Service
