feat: notification service

Ilia Denisov
2026-04-22 08:49:45 +02:00
committed by GitHub
parent 5b7593e6f6
commit 32dc29359a
135 changed files with 21828 additions and 130 deletions
+25
@@ -0,0 +1,25 @@
# Notification Service Docs
This directory keeps service-local documentation that is more operational or
more example-heavy than [`../README.md`](../README.md).
Sections:
- [Runtime and components](runtime.md)
- [Main flows](flows.md)
- [Operator runbook](runbook.md)
- [Configuration and contract examples](examples.md)
Primary references:
- [`../README.md`](../README.md) for stable service scope, contracts, data
model, Redis layout, and retry policy
- [`../api/intents-asyncapi.yaml`](../api/intents-asyncapi.yaml) for the
producer-to-notification Redis Stream contract
- [`../openapi.yaml`](../openapi.yaml) for the private probe HTTP contract
- [`../../gateway/README.md`](../../gateway/README.md) for client-event fan-out
- [`../../mail/api/delivery-commands-asyncapi.yaml`](../../mail/api/delivery-commands-asyncapi.yaml)
for the trusted async generic mail command contract
- [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for system-level service
boundaries and transport rules
- [`../../TESTING.md`](../../TESTING.md) for the cross-service testing matrix
+145
@@ -0,0 +1,145 @@
# Configuration and Contract Examples
The examples below are illustrative. IDs, timestamps, and stream keys are
placeholders unless explicitly stated otherwise.
## Example Environment
Minimal local runtime:
```dotenv
NOTIFICATION_REDIS_ADDR=127.0.0.1:6379
NOTIFICATION_INTERNAL_HTTP_ADDR=:8092
NOTIFICATION_USER_SERVICE_BASE_URL=http://127.0.0.1:8091
NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM=gateway:client-events
NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM=mail:delivery_commands
NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED=geo-admin@example.com
NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED=ops@example.com
NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START=ops@example.com
NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED=admins@example.com
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
## Probe HTTP Examples
Liveness:
```bash
curl http://127.0.0.1:8092/healthz
```
```json
{
  "status": "ok"
}
```
Readiness:
```bash
curl http://127.0.0.1:8092/readyz
```
```json
{
  "status": "ready"
}
```
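
A deployment check can compare the response body with the documented value instead of relying on the status code alone. The sketch below is illustrative only: the helper name `probe_ok` and the lookup table are assumptions, and only the two JSON bodies come from the examples above.

```python
import json

# Expected "status" values, copied from the probe examples above.
EXPECTED_STATUS = {"/healthz": "ok", "/readyz": "ready"}


def probe_ok(path: str, body: bytes) -> bool:
    """Return True when a probe response body carries the documented status."""
    expected = EXPECTED_STATUS.get(path)
    if expected is None:
        # Unknown route: nothing documented to compare against.
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes.
        return False
    return isinstance(payload, dict) and payload.get("status") == expected
```

The body would typically come from `urllib.request.urlopen("http://127.0.0.1:8092/readyz").read()` against the local listener shown above.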
## User-Targeted Intent Example
```bash
redis-cli XADD notification:intents '*' \
notification_type game.turn.ready \
producer game_master \
audience_kind user \
recipient_user_ids_json '["user-1","user-2"]' \
idempotency_key game-master:game-123:turn-54 \
occurred_at_ms 1775121700000 \
request_id request-123 \
trace_id trace-123 \
payload_json '{"game_id":"game-123","game_name":"Nebula Clash","turn_number":54}'
```
Expected effects:
- `Notification Service` resolves both users through `User Service`
- one `push` route and one `email` route are materialized per user
- `Gateway` receives user-wide client events without `device_session_id`
- `Mail Service` receives template-mode commands with
`template_id=game.turn.ready`
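
The same entry can be assembled programmatically before it is handed to a Redis client. This is a minimal sketch of the wire shape only, assuming a hypothetical helper named `build_user_intent_fields`; Go producers use `galaxy/notificationintent` instead, and treating `request_id` / `trace_id` as optional is inferred from the other examples in this file, which omit them.

```python
import json


def build_user_intent_fields(notification_type, producer, user_ids,
                             idempotency_key, occurred_at_ms, payload,
                             request_id=None, trace_id=None):
    """Return the flat string fields for one XADD to notification:intents."""
    fields = {
        "notification_type": notification_type,
        "producer": producer,
        "audience_kind": "user",
        # Compact JSON matches the style of the redis-cli example above.
        "recipient_user_ids_json": json.dumps(list(user_ids), separators=(",", ":")),
        "idempotency_key": idempotency_key,
        "occurred_at_ms": str(occurred_at_ms),
        "payload_json": json.dumps(payload, separators=(",", ":")),
    }
    # Correlation identifiers are added only when supplied (assumption).
    if request_id:
        fields["request_id"] = request_id
    if trace_id:
        fields["trace_id"] = trace_id
    return fields
```

A client library call such as `XADD notification:intents * <fields>` would then receive the returned mapping unchanged.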
## Administrator Intent Example
```bash
redis-cli XADD notification:intents '*' \
notification_type geo.review_recommended \
producer geoprofile \
audience_kind admin_email \
idempotency_key geoprofile:user-123:review-true:1775121700001 \
occurred_at_ms 1775121700001 \
payload_json '{"user_id":"user-123","user_email":"pilot@example.com","observed_country":"DE","usual_connection_country":"PL","review_reason":"country_mismatch"}'
```
Expected effects:
- `Notification Service` does not call `User Service`
- recipients are read from `NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED`
- only email routes are publishable; push route slots are skipped
## Gateway Client Event Shape
Example stream entry appended by `Notification Service`:
```bash
redis-cli XADD gateway:client-events MAXLEN '~' 1024 '*' \
user_id user-1 \
event_type game.turn.ready \
event_id '1775121700000-0/push:user:user-1' \
payload_bytes '<flatbuffers-bytes>' \
request_id request-123 \
trace_id trace-123
```
`Gateway` derives `timestamp_ms`, computes `payload_hash`, signs the outgoing
event, and delivers it to every active stream for `user-1`.
## Mail Command Shape
Example stream entry appended by `Notification Service`:
```bash
redis-cli XADD mail:delivery_commands '*' \
delivery_id '1775121700000-0/email:user:user-1' \
source notification \
payload_mode template \
idempotency_key 'notification:1775121700000-0/email:user:user-1' \
requested_at_ms 1775121700000 \
request_id request-123 \
trace_id trace-123 \
payload_json '{"to":["pilot@example.com"],"cc":[],"bcc":[],"reply_to":[],"template_id":"game.turn.ready","locale":"en","variables":{"game_id":"game-123","game_name":"Nebula Clash","turn_number":54},"attachments":[]}'
```
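
The identifiers in the example follow a visible pattern: `delivery_id` is a notification identifier joined to the route ID, and `idempotency_key` prefixes that value with `notification:`. The sketch below reproduces the example fields under that reading. The pattern is inferred from the sample values rather than from a stated rule, and the function and parameter names are hypothetical.

```python
import json


def build_mail_command_fields(notification_id, route_id, notification_type,
                              to, locale, variables, requested_at_ms):
    """Return flat string fields for one XADD to mail:delivery_commands."""
    # ID layout inferred from the example values above, not from a spec.
    delivery_id = f"{notification_id}/{route_id}"
    payload = {
        "to": list(to),
        "cc": [],
        "bcc": [],
        "reply_to": [],
        # The example uses the notification type as the template ID.
        "template_id": notification_type,
        "locale": locale,
        "variables": variables,
        "attachments": [],
    }
    return {
        "delivery_id": delivery_id,
        "source": "notification",
        "payload_mode": "template",
        "idempotency_key": f"notification:{delivery_id}",
        "requested_at_ms": str(requested_at_ms),
        "payload_json": json.dumps(payload, separators=(",", ":")),
    }
```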
## Dead-Letter Replay
Replay a dead-lettered route by publishing a new compatible intent with a new
producer-owned `idempotency_key`.
```bash
redis-cli XADD notification:intents '*' \
notification_type game.turn.ready \
producer game_master \
audience_kind user \
recipient_user_ids_json '["user-1"]' \
idempotency_key game-master:game-123:turn-54:manual-replay-1 \
occurred_at_ms 1775121700000 \
payload_json '{"game_id":"game-123","game_name":"Nebula Clash","turn_number":54}'
```
Do not replay by mutating existing `notification_route`,
`notification_dead_letter_entry`, or `notification:route_schedule` records.
Replay only by publishing a new intent as shown above.
+130
@@ -0,0 +1,130 @@
# Main Flows
## Producer -> Notification
```mermaid
sequenceDiagram
participant Producer
participant Stream as Redis Stream notification:intents
participant Consumer as Intent consumer
participant Notify as Notification Service
participant Redis
Producer->>Stream: XADD normalized intent
Consumer->>Stream: XREAD from stored offset
Consumer->>Notify: decode and validate envelope
alt malformed intent
Notify->>Redis: record malformed-intent entry
Consumer->>Redis: save stream offset
else duplicate with same normalized content
Notify->>Redis: load accepted notification
Consumer->>Redis: save stream offset
else idempotency conflict
Notify->>Redis: record malformed-intent entry
Consumer->>Redis: save stream offset
else new valid intent
Notify->>Redis: store notification, routes, and idempotency record
Consumer->>Redis: save stream offset
end
```
Duplicate handling is scoped by `(producer, idempotency_key)`. `request_id` and
`trace_id` are observability-only metadata and do not participate in the
idempotency fingerprint.
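
The three non-malformed branches of the diagram can be sketched as a lookup keyed by `(producer, idempotency_key)`. This is an illustration under stated assumptions: hashing canonical JSON with SHA-256 stands in for whatever normalization the service really applies, an in-memory dict stands in for the Redis idempotency record, and the names `fingerprint` and `classify` are hypothetical.

```python
import hashlib
import json

# Observability-only fields that the text above excludes from the fingerprint.
EXCLUDED_FIELDS = {"request_id", "trace_id"}


def fingerprint(intent: dict) -> str:
    """Hash the intent content that participates in duplicate detection."""
    content = {k: v for k, v in intent.items() if k not in EXCLUDED_FIELDS}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(intent: dict, seen: dict) -> str:
    """Return 'new', 'duplicate', or 'conflict'; record new intents in seen."""
    key = (intent["producer"], intent["idempotency_key"])
    digest = fingerprint(intent)
    if key not in seen:
        seen[key] = digest
        return "new"
    # Same key with identical content is a harmless redelivery; same key
    # with different content is an idempotency conflict.
    return "duplicate" if seen[key] == digest else "conflict"
```

Scoping the key by producer means two producers can reuse the same `idempotency_key` string without colliding.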
## User-Targeted Enrichment
```mermaid
sequenceDiagram
participant Consumer as Intent consumer
participant Notify as Notification Service
participant User as User Service
participant Redis
Consumer->>Notify: accepted user-targeted intent
loop each recipient_user_id
Notify->>User: GET /api/v1/internal/users/{user_id}
alt user exists
User-->>Notify: email + preferred_language
else subject_not_found
Notify->>Redis: record malformed intent recipient_not_found
Consumer->>Redis: save stream offset
else temporary failure
Notify-->>Consumer: service unavailable
Consumer-->>Consumer: stop before stream-offset advance
end
end
Notify->>Redis: persist enriched routes
```
User-targeted routes are enriched before durable route write. The currently
supported resolved locale is exactly `en`; unsupported or empty values fall
back to `en`.
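
The locale fallback amounts to a one-line rule. A minimal sketch, assuming a hypothetical helper `resolve_locale`; trimming and lowercasing the input is an added assumption, not documented behavior.

```python
SUPPORTED_LOCALES = {"en"}
DEFAULT_LOCALE = "en"


def resolve_locale(preferred_language):
    """Map a User Service preferred_language value to a supported locale."""
    # None and empty strings collapse to "", which is never supported.
    value = (preferred_language or "").strip().lower()
    return value if value in SUPPORTED_LOCALES else DEFAULT_LOCALE
```

Keeping the supported set explicit means adding a locale later is a data change rather than a logic change.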
## Notification -> Gateway
```mermaid
sequenceDiagram
participant Push as Push publisher
participant Redis
participant Gateway as Edge Gateway
participant Client
Push->>Redis: load due push route
Push->>Redis: acquire temporary route lease
Push->>Push: encode FlatBuffers notification payload
Push->>Redis: XADD MAXLEN ~ gateway client-event stream
Push->>Redis: mark route published and remove from schedule
Gateway->>Redis: XREAD client-event stream
Gateway->>Gateway: sign outgoing GatewayEvent
Gateway-->>Client: fan out to all active user streams
```
`Notification Service` publishes `user_id`, `event_type`, `event_id`,
`payload_bytes`, and optional `request_id` / `trace_id`. It intentionally omits
`device_session_id`.
## Notification -> Mail
```mermaid
sequenceDiagram
participant Email as Email publisher
participant Redis
participant Mail as Mail Service
Email->>Redis: load due email route
Email->>Redis: acquire temporary route lease
Email->>Email: encode template-mode command
Email->>Redis: XADD mail:delivery_commands
Email->>Redis: mark route published and remove from schedule
Mail->>Redis: XREAD mail:delivery_commands
Mail->>Mail: accept template delivery command
```
Notification-generated mail always uses `source=notification`,
`payload_mode=template`, and `template_id == notification_type`.
Auth-code mail is not part of this flow and remains a direct
`Auth / Session Service -> Mail Service` request.
## Retry and Dead Letter
```mermaid
sequenceDiagram
participant Publisher
participant Redis
participant Downstream as Gateway or Mail Service
Publisher->>Redis: load due route
Publisher->>Redis: acquire temporary route lease
Publisher->>Downstream: append downstream stream entry
alt publication succeeds
Publisher->>Redis: mark published and remove schedule member
else retry budget remains
Publisher->>Redis: mark failed and schedule next attempt
else retry budget exhausted
Publisher->>Redis: mark dead_letter and write dead-letter entry
end
```
`push` and `email` retry independently. A dead-lettered route never rolls back
or invalidates a sibling route that already reached `published`.
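
The three outcomes in the diagram reduce to a small decision function. The sketch below is a hedged illustration: the doubling backoff clamped between a minimum and a maximum is an assumption (the configuration only names `NOTIFICATION_ROUTE_BACKOFF_MIN`, `NOTIFICATION_ROUTE_BACKOFF_MAX`, and the per-channel `*_RETRY_MAX_ATTEMPTS` values), and `RetryPolicy` and `next_route_state` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_attempts: int
    backoff_min_s: float
    backoff_max_s: float


def next_route_state(succeeded: bool, attempts_made: int, policy: RetryPolicy):
    """Return (status, delay_seconds) after one publication attempt.

    attempts_made counts the attempt that just finished, starting at 1.
    """
    if succeeded:
        return ("published", None)
    if attempts_made >= policy.max_attempts:
        # Retry budget exhausted: the route becomes a dead letter.
        return ("dead_letter", None)
    # Assumed ladder: double the minimum delay per attempt, capped at the maximum.
    delay = min(policy.backoff_max_s,
                policy.backoff_min_s * 2 ** (attempts_made - 1))
    return ("failed", delay)
```

Because each route carries its own attempt count, a `push` route and an `email` route for the same notification move through this function independently.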
+167
@@ -0,0 +1,167 @@
# Operator Runbook
This runbook covers startup, steady-state verification, shutdown, and common
`Notification Service` incidents.
## Startup Checks
Before starting the process, confirm:
- `NOTIFICATION_REDIS_ADDR` points to the Redis deployment that stores
notification records, routes, idempotency reservations, malformed intents,
dead letters, stream offsets, and route schedules
- Redis ACL, DB, TLS, and timeout settings match the target environment
- `NOTIFICATION_USER_SERVICE_BASE_URL` points to the trusted internal
`User Service`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` matches the stream consumed by
`Gateway`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` matches the stream consumed by
`Mail Service`
- administrator email variables are populated for notification types that
should notify administrators
- OpenTelemetry exporter settings point at the intended collector when traces
or metrics are expected outside the process
At startup the process performs a bounded Redis `PING`. Startup fails fast if
configuration validation or Redis connectivity fails.
Known startup caveats:
- there is no operator API
- there is no `/metrics` route
- traces and metrics are exported only through configured OpenTelemetry
exporters
- readiness is process-local after successful startup; `/readyz` does not
perform a live Redis ping per request
## Steady-State Verification
Practical readiness verification:
1. confirm startup logs for the internal HTTP listener, intent consumer, push
publisher, and email publisher
2. request `GET /readyz` on `NOTIFICATION_INTERNAL_HTTP_ADDR`
3. verify Redis connectivity and OpenTelemetry exporter health out of band
4. publish a low-risk compatible test intent in a non-production environment
and verify route publication in the downstream stream
Expected steady-state signals:
- `notification.route_schedule.depth` remains bounded
- `notification.route_schedule.oldest_age_ms` stays within the delays of the
active retry ladder
- `notification.intent_stream.oldest_unprocessed_age_ms` remains near zero
when producers are healthy
- `notification.route.dead_letters` changes rarely
- malformed-intent logs appear only for bad producer input
- logs include `notification_type`, `producer`, `audience_kind`, and
correlation identifiers where present
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- coordinated shutdown is bounded by `NOTIFICATION_SHUTDOWN_TIMEOUT`
- the private probe listener is stopped before process resources are closed
- route publishers and the intent consumer stop through context cancellation
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, stream, and downstream settings
4. repeat steady-state verification
## Incident Triage
### Intent Stream Lag Grows
Symptoms:
- `notification.intent_stream.oldest_unprocessed_age_ms` increases
- no matching route records appear for new stream entries
- consumer logs stop after a specific stream entry
Checks:
1. inspect the next unprocessed `notification:intents` entry
2. confirm `User Service` is reachable from `Notification Service`
3. if the entry is user-targeted, verify every `recipient_user_id` exists
4. inspect malformed-intent records for nearby stream IDs
Expected behavior:
- malformed input is recorded and the offset advances
- temporary `User Service` failure stops progress before offset advancement
### Route Schedule Backlog Grows
Symptoms:
- `notification.route_schedule.depth` rises steadily
- `notification.route_schedule.oldest_age_ms` increases
- routes remain in `pending` or `failed`
Checks:
1. confirm push and email publisher startup logs are present
2. confirm Redis latency and connectivity
3. verify route IDs match the expected `push:` or `email:` prefixes
4. confirm the downstream stream names match `Gateway` and `Mail Service`
5. inspect route `last_error_classification`
### Dead-Letter Spikes
Symptoms:
- `notification.route.dead_letters` increases rapidly
- route records show repeated `payload_encoding_failed`,
`gateway_stream_publish_failed`, or `mail_stream_publish_failed`
Checks:
1. inspect the dead-letter entry and owning route
2. verify payload fields still match the notification catalog
3. confirm downstream Redis stream writes are accepted
4. compare failures across channels to isolate Gateway-specific or
Mail-specific issues
Recovery:
1. correct the downstream dependency or payload problem
2. publish a new compatible intent with a new producer-owned
`idempotency_key`
3. keep the old dead-letter record untouched as audit history
### Missing Administrator Mail
Symptoms:
- administrator notification type is accepted
- no email command reaches `mail:delivery_commands`
- route is `skipped` with recipient `config:<notification_type>`
Checks:
1. inspect the type-specific administrator email environment variable
2. confirm addresses are normalized single email addresses without display
names
3. restart the process after configuration changes
Expected behavior:
- empty administrator lists materialize one skipped synthetic route so the
configuration gap remains durable and visible
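
Check 2 can be approximated offline before a restart. The sketch below assumes the variable holds a comma-separated list, which this document does not state, and uses a deliberately simple shape check rather than full address validation; `parse_admin_emails` is a hypothetical helper.

```python
def parse_admin_emails(raw: str) -> list:
    """Split an administrator email variable and reject display names."""
    addresses = []
    for item in raw.split(","):  # comma separator is an assumption
        addr = item.strip()
        if not addr:
            continue
        local, sep, domain = addr.partition("@")
        bare = bool(sep and local and domain and "@" not in domain)
        # Spaces, angle brackets, or quotes indicate a display name form
        # such as 'Ops <ops@example.com>'.
        if not bare or any(ch in addr for ch in ' <>"'):
            raise ValueError(f"not a bare email address: {addr!r}")
        addresses.append(addr)
    return addresses
```

An empty result corresponds to the case described above, where the service materializes one skipped synthetic route.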
### Auth-Code Mail Appears Missing
Auth-code mail is intentionally outside `Notification Service`.
Checks:
1. inspect `Auth / Session Service -> Mail Service` logs and delivery records
2. confirm `notification:intents` remains unused for auth-code delivery
3. do not replay auth-code mail through `Notification Service`
+206
@@ -0,0 +1,206 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/notification` process and
its runtime dependencies.
```mermaid
flowchart LR
subgraph Producers
GM["Game Master"]
Lobby["Game Lobby"]
Geo["Geo Profile Service"]
end
subgraph Notify["Notification Service process"]
Probe["Private probe HTTP listener\n/healthz /readyz"]
Consumer["Notification intent consumer"]
Accept["Intent acceptance service"]
Push["Push route publisher"]
Email["Email route publisher"]
Telemetry["Logs, traces, metrics"]
end
User["User Service"]
Gateway["Edge Gateway\nclient-event stream consumer"]
Mail["Mail Service\ncommand stream consumer"]
Redis["Redis\nstate + streams + schedules"]
GM --> Redis
Lobby --> Redis
Geo --> Redis
Consumer --> Redis
Consumer --> Accept
Accept --> User
Accept --> Redis
Push --> Redis
Email --> Redis
Push --> Gateway
Email --> Mail
Probe --> Telemetry
Consumer --> Telemetry
Push --> Telemetry
Email --> Telemetry
```
## Listener
`notification` exposes exactly one HTTP listener:
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal probe HTTP | `:8092` | Private liveness and readiness probes |
Listener defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`
Probe routes:
- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}`
- `readyz` is process-local after successful startup and does not perform a
live Redis ping per request
Intentional omissions:
- no public listener
- no operator API
- no `/metrics` route
## Startup Wiring
`cmd/notification` loads config, constructs logging, and builds the runtime
through `internal/app.NewRuntime`.
The runtime wires:
- Redis client with startup connectivity check
- `User Service` HTTP client for recipient enrichment
- private probe HTTP server
- plain `XREAD` intent consumer
- `push` route publisher for `Gateway`
- `email` route publisher for `Mail Service`
- Redis-backed accepted-intent, route, idempotency, malformed-intent,
dead-letter, stream-offset, and schedule stores
- OpenTelemetry traces and metrics exporters
Startup fails fast on invalid configuration or unavailable Redis.
## Background Components
### Intent consumer
- reads one stream with plain `XREAD`, default `notification:intents`
- starts from stored offset or `0-0`
- advances offset only after durable acceptance or durable malformed-intent
recording
- stops without offset advancement when `User Service` enrichment has a
temporary failure
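
The offset rule above can be sketched as a loop that stops at the first temporary failure. This is an illustration, not the service's consumer: `process_entries`, the outcome strings, and the in-memory entry list are all hypothetical stand-ins for the `XREAD` loop and the Redis offset store.

```python
def process_entries(entries, handle, start_offset="0-0"):
    """Process (entry_id, fields) pairs in order.

    Returns the last entry ID that is safe to store as the stream offset.
    """
    offset = start_offset
    for entry_id, fields in entries:
        outcome = handle(fields)
        if outcome == "temporary_failure":
            # Leave the offset on the previous entry so this one is read again.
            break
        # Accepted, duplicate, and malformed outcomes are all durable,
        # so the offset may move past this entry.
        offset = entry_id
    return offset
```

Stopping instead of skipping keeps delivery ordered per stream at the cost of head-of-line blocking while `User Service` is unavailable.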
### Acceptance service
- validates the normalized intent envelope
- applies idempotency rules for `(producer, idempotency_key)`
- enriches user-targeted recipients before durable route write
- materializes route slots for `push` and `email`
- stores malformed-intent records for invalid payloads, idempotency conflicts,
and unresolved users
### Push publisher
- scans `notification:route_schedule`
- processes only scheduled route IDs beginning with `push:`
- coordinates replicas with temporary route leases
- publishes Gateway client events with `XADD MAXLEN ~`
- omits `device_session_id` so Gateway fans out to all active streams for the
target user
### Email publisher
- scans `notification:route_schedule`
- processes only scheduled route IDs beginning with `email:`
- coordinates replicas with temporary route leases
- publishes Mail Service generic commands with plain `XADD`
- always uses `payload_mode=template`
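
Both publishers scan the same schedule and claim only their own routes by prefix. A minimal sketch of that selection, assuming the schedule maps route IDs to a due time in milliseconds (the actual Redis data structure is not described here) and using a hypothetical `select_due_routes` helper:

```python
def select_due_routes(schedule: dict, channel: str, now_ms: int) -> list:
    """Return due route IDs for one channel, earliest first.

    schedule maps route_id -> due_at_ms (assumed representation).
    """
    prefix = f"{channel}:"
    due = [
        (due_at_ms, route_id)
        for route_id, due_at_ms in schedule.items()
        if route_id.startswith(prefix) and due_at_ms <= now_ms
    ]
    return [route_id for _, route_id in sorted(due)]
```

Lease acquisition would follow for each returned ID; it is left out because the lease mechanism is not specified in this document.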
## Configuration Groups
Required:
- `NOTIFICATION_REDIS_ADDR`
- `NOTIFICATION_USER_SERVICE_BASE_URL`
Core process config:
- `NOTIFICATION_SHUTDOWN_TIMEOUT`
- `NOTIFICATION_LOG_LEVEL`
Internal HTTP config:
- `NOTIFICATION_INTERNAL_HTTP_ADDR` with default `:8092`
- `NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT` with default `2s`
- `NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT` with default `10s`
- `NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT` with default `1m`
Redis connectivity and streams:
- `NOTIFICATION_REDIS_USERNAME`
- `NOTIFICATION_REDIS_PASSWORD`
- `NOTIFICATION_REDIS_DB`
- `NOTIFICATION_REDIS_TLS_ENABLED`
- `NOTIFICATION_REDIS_OPERATION_TIMEOUT`
- `NOTIFICATION_INTENTS_STREAM`
- `NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM`
- `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`
- `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM`
Retry and retention:
- `NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS`
- `NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS`
- `NOTIFICATION_ROUTE_BACKOFF_MIN`
- `NOTIFICATION_ROUTE_BACKOFF_MAX`
- `NOTIFICATION_ROUTE_LEASE_TTL`
- `NOTIFICATION_DEAD_LETTER_TTL`
- `NOTIFICATION_RECORD_TTL`
- `NOTIFICATION_IDEMPOTENCY_TTL`
User enrichment:
- `NOTIFICATION_USER_SERVICE_TIMEOUT` with default `1s`
Administrator routing:
- `NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED`
- `NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED`
- `NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START`
- `NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED`
Telemetry:
- `OTEL_SERVICE_NAME`
- `OTEL_TRACES_EXPORTER`
- `OTEL_METRICS_EXPORTER`
- `OTEL_EXPORTER_OTLP_PROTOCOL`
- `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
- `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
- `NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED`
- `NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED`
## Runtime Notes
- `Notification Service` does not create or own notification audiences; it
trusts producers to publish concrete user recipients.
- Administrator recipients are type-specific configuration, not a global list.
- A missing user is treated as a producer input defect.
- A temporary `User Service` outage pauses stream progress at the affected
entry. Because the stored offset is not advanced, the entry is processed again
after restart.
- Go producers use `galaxy/notificationintent` to build compatible intents.
- Producers append intents with plain `XADD`. A producer-side publish failure
counts as notification degradation and must not roll back source business
state that is already committed.
- Dead-letter replay is performed by publishing a new compatible intent with a
new `idempotency_key`.