# Notification Service
Canonical references:
- [Service-local docs](docs/README.md)
- [Intent AsyncAPI contract](api/intents-asyncapi.yaml)
- [Probe OpenAPI contract](openapi.yaml)
- [Gateway push model](../gateway/README.md)
- [Mail async command contract](../mail/api/delivery-commands-asyncapi.yaml)
- [Notification FlatBuffers payloads](../pkg/schema/fbs/notification.fbs)
- [System architecture](../ARCHITECTURE.md)
## Purpose
`Notification Service` is the internal asynchronous orchestration layer for
platform notifications.
It accepts normalized notification intents from upstream services, materializes
per-recipient routes, enriches user-targeted routes through `User Service`,
publishes client-facing push events toward `Gateway`, publishes non-auth email
commands toward `Mail Service`, and isolates transient downstream failures with
independent retry budgets per channel.
The service is intentionally not a source of truth for:
- game state
- lobby membership
- invite ownership
- review flags
- notification preferences
- email delivery attempts
## Responsibility Boundaries
`Notification Service` is responsible for:
- consuming normalized notification intents from a dedicated Redis Stream
- validating intent envelopes and rejecting malformed or conflicting duplicates
- persisting durable notification and route state
- resolving user contact data from `User Service` by `user_id`
- selecting locale from `User Service.preferred_language` with `en` fallback
- shaping lightweight push payloads for user-facing events
- publishing template-mode email commands to `Mail Service`
- retrying route publication independently for `push` and `email`
- persisting dead-letter entries for exhausted routes
`Notification Service` is not responsible for:
- computing business audiences from `game_id` or other domain identifiers
- owning administrator identity or administrator user records
- sending auth-code email
- storing per-user notification preferences in v1
- exposing an operator REST API in v1
The key design rule is that upstream producers must publish the concrete
`recipient_user_id` values for user-targeted notification intents. For
administrator-only notification types, recipient email addresses are resolved
from `Notification Service` configuration by `notification_type`. Private-game
invite notifications in v1 remain user-bound by internal `user_id` values and
must not target recipients by raw email address.
## Runtime Surface
The implemented process contains:
- one private internal HTTP probe listener
- process-wide structured logging
- process-wide OpenTelemetry runtime
- one shared `galaxy/notificationintent` producer contract module
- one shared Redis client with startup connectivity check
- one trusted `User Service` HTTP enrichment client
- one plain-`XREAD` notification-intent consumer
- one long-lived `push` route publisher
- one long-lived `email` route publisher
- durable accepted-intent, route, idempotency, malformed-intent, and
stream-offset storage in Redis
- user-targeted route enrichment during intent acceptance before durable write
- client-facing `push` publication toward `Gateway`
- template-mode `email` publication toward `Mail Service`
- durable `push` and `email` retry, dead-letter, and temporary lease
coordination in Redis
- OpenTelemetry counters and observable gauges for intent intake, user
enrichment, route publication, route schedule depth, and intent stream lag
- graceful shutdown on process cancellation
Probe contract:
- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}`
- `readyz` is process-local after successful startup and does not perform a
live Redis ping per request
- there is no `/metrics` route
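
The probe contract above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual listener: the `newProbeMux`, `probeHandler`, and `probe` names, the `405` response for non-GET methods, and the `Content-Type` header are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newProbeMux wires the two probe routes. No /metrics route is registered,
// so that path falls through to the mux's default 404.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", probeHandler(`{"status":"ok"}`))
	// readyz is process-local: it performs no Redis ping per request.
	mux.HandleFunc("/readyz", probeHandler(`{"status":"ready"}`))
	return mux
}

// probeHandler answers GET with a fixed JSON body and rejects other methods
// (the 405 behavior is an assumption, not part of the documented contract).
func probeHandler(body string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}
}

// probe issues one in-process GET request and returns status code and body.
func probe(path string) (int, string) {
	rec := httptest.NewRecorder()
	newProbeMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	for _, p := range []string{"/healthz", "/readyz", "/metrics"} {
		code, body := probe(p)
		fmt.Println(p, code, body)
	}
}
```
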
Runtime behavior:
- the intent consumer reads `notification:intents` with plain `XREAD`
- when no stored stream offset exists, the consumer starts from `0-0`
- the persisted offset advances only after durable acceptance or durable
malformed-intent recording
- user-targeted routes are enriched through `GET /api/v1/internal/users/{user_id}`
before durable route write
- `404 subject_not_found` from `User Service` is recorded under
malformed-intent storage with `failure_code=recipient_not_found`
- temporary `User Service` lookup failures stop the consumer before
stream-offset advance
- due `push` routes are published toward `Gateway` from the shared
`notification:route_schedule`
- due `email` routes are published toward `Mail Service` from the shared
`notification:route_schedule`
- the `push` publisher claims only routes whose `route_id` starts with `push:`
- the `email` publisher claims only routes whose `route_id` starts with `email:`
- replicas coordinate through a temporary Redis lease key
`notification:route_leases:<notification_id>:<route_id>`
- `Gateway` publication uses `XADD MAXLEN ~` with
`NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`
- `event_id` equals `<notification_id>/<route_id>`
- `Mail Service` publication uses plain `XADD` with no stream trimming
- `delivery_id` equals `<notification_id>/<route_id>`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
to every active stream of that user
- Go producers use `galaxy/notificationintent` to construct and publish
compatible intents into `notification:intents`
- producer publication uses plain `XADD` without stream trimming or hidden
helper retries
- a producer-side notification publication failure is notification degradation
and must not roll back the source business state
- metrics are exported only through the configured OpenTelemetry exporters; no
`/metrics` route is exposed
- `notification.route_schedule.depth` and
`notification.route_schedule.oldest_age_ms` are derived from
`notification:route_schedule`
- `notification.intent_stream.oldest_unprocessed_age_ms` is derived from the
persisted intent stream offset and the configured ingress stream
- manual dead-letter replay is performed by publishing a new compatible intent
with a new `idempotency_key`; existing dead-letter records remain audit
history until TTL expiry
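
The offset rules in the list above (advance only after a durable outcome, stop on a temporary `User Service` failure) can be sketched as a pure function. This is a simplified model under stated assumptions: the `consume`, `entry`, `result`, and `demoLookup` names are hypothetical, each entry carries a single recipient, and durable writes are represented by appending to slices.

```go
package main

import "fmt"

// lookupResult models the three User Service outcomes the README distinguishes.
type lookupResult int

const (
	userFound        lookupResult = iota // user record returned
	userNotFound                         // 404 subject_not_found
	temporaryFailure                     // timeout, 5xx, or transport error
)

// entry is a simplified stream entry: its Redis Stream ID and one recipient.
type entry struct {
	ID     string
	UserID string
}

// result summarizes one consumer run.
type result struct {
	Offset    string   // last entry ID whose outcome was durably recorded
	Accepted  []string // entries that became notification records
	Malformed []string // entries recorded under malformed-intent storage
}

// consume walks entries in stream order. The offset moves forward only after
// an entry was accepted or recorded as malformed; a temporary lookup failure
// stops the run so the same entry is read again after restart.
func consume(start string, entries []entry, lookup func(userID string) lookupResult) result {
	res := result{Offset: start}
	for _, e := range entries {
		switch lookup(e.UserID) {
		case userFound:
			res.Accepted = append(res.Accepted, e.ID)
		case userNotFound:
			res.Malformed = append(res.Malformed, e.ID) // failure_code=recipient_not_found
		default:
			return res // stop before advancing the offset
		}
		res.Offset = e.ID
	}
	return res
}

// demoLookup is a stand-in for the User Service client.
func demoLookup(userID string) lookupResult {
	switch userID {
	case "ghost":
		return userNotFound
	case "flaky":
		return temporaryFailure
	}
	return userFound
}

func main() {
	entries := []entry{{"1-0", "alice"}, {"2-0", "ghost"}, {"3-0", "flaky"}, {"4-0", "bob"}}
	fmt.Printf("%+v\n", consume("0-0", entries, demoLookup))
}
```

In this run the offset stays at the second entry: the third entry hits a temporary failure, so neither it nor the fourth entry is processed until the consumer restarts.
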
The target process shape is one internal-only process with:
- one notification-intent consumer
- one `push` route publisher for `Gateway`
- one `email` route publisher for `Mail Service`
Intentional runtime omissions in v1:
- no public ingress
- no dedicated operator REST API
- no direct client delivery
- no direct SMTP integration
## Configuration
Required:
- `NOTIFICATION_REDIS_ADDR`
- `NOTIFICATION_USER_SERVICE_BASE_URL`
Primary configuration groups:
- process and logging:
  - `NOTIFICATION_SHUTDOWN_TIMEOUT`
  - `NOTIFICATION_LOG_LEVEL`
- internal probe HTTP:
  - `NOTIFICATION_INTERNAL_HTTP_ADDR` with default `:8092`
  - `NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT` with default `2s`
  - `NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT` with default `10s`
  - `NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT` with default `1m`
- Redis connectivity:
  - `NOTIFICATION_REDIS_USERNAME`
  - `NOTIFICATION_REDIS_PASSWORD`
  - `NOTIFICATION_REDIS_DB`
  - `NOTIFICATION_REDIS_TLS_ENABLED`
  - `NOTIFICATION_REDIS_OPERATION_TIMEOUT`
- stream names:
  - `NOTIFICATION_INTENTS_STREAM` with default `notification:intents`
  - `NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT` with default `2s`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` with default `gateway:client-events`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN` with default `1024`
  - `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` with default `mail:delivery_commands`
- retry and dead-letter:
  - `NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS` with default `3`
  - `NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS` with default `7`
  - `NOTIFICATION_ROUTE_BACKOFF_MIN` with default `1s`
  - `NOTIFICATION_ROUTE_BACKOFF_MAX` with default `5m`
  - `NOTIFICATION_ROUTE_LEASE_TTL` with default `5s`
  - `NOTIFICATION_DEAD_LETTER_TTL` with default `720h`
  - `NOTIFICATION_RECORD_TTL` with default `720h`
  - `NOTIFICATION_IDEMPOTENCY_TTL` with default `168h`
- `User Service` enrichment:
  - `NOTIFICATION_USER_SERVICE_TIMEOUT` with default `1s`
- administrator routing:
  - `NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED`
  - `NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED`
- OpenTelemetry:
  - standard `OTEL_*` variables
  - `NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED`
  - `NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED`
Each administrator configuration variable stores a comma-separated list of
email addresses for exactly one `notification_type`. v1 does not use one global
admin-recipient list shared across all administrative events.
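
A per-type administrator list could be resolved roughly as follows. This is a sketch under assumptions: the `adminEmailEnv`, `parseAdminEmails`, and `adminRecipients` names are hypothetical, the type-to-variable mapping is inferred from the variable names, and trimming whitespace, skipping empty elements, and dropping duplicates are assumed normalization steps that the README does not specify.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// adminEmailEnv maps each administrator notification_type to its variable.
// The pairing is inferred from the variable names listed above.
var adminEmailEnv = map[string]string{
	"geo.review_recommended":           "NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED",
	"game.generation_failed":           "NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED",
	"lobby.runtime_paused_after_start": "NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START",
	"lobby.application.submitted":      "NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED",
}

// parseAdminEmails splits a comma-separated value, trims each element,
// and drops empty or repeated addresses (assumed normalization).
func parseAdminEmails(raw string) []string {
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(raw, ",") {
		addr := strings.TrimSpace(part)
		if addr == "" || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

// adminRecipients returns the configured list for one notification_type.
// The bool is false when the type has no administrator variable at all.
func adminRecipients(notificationType string, getenv func(string) string) ([]string, bool) {
	name, ok := adminEmailEnv[notificationType]
	if !ok {
		return nil, false
	}
	return parseAdminEmails(getenv(name)), true
}

func main() {
	for t := range adminEmailEnv {
		emails, _ := adminRecipients(t, os.Getenv)
		fmt.Println(t, emails)
	}
}
```
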
## Stable Input Contract
The service accepts intents from one dedicated Redis Stream:
- `notification:intents`
The canonical envelope is defined in
[api/intents-asyncapi.yaml](api/intents-asyncapi.yaml).
Go producers should use the shared `galaxy/notificationintent` module to build
and append compatible stream entries instead of duplicating field names,
payload structs, or validation rules locally.
Required envelope fields:
- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `occurred_at_ms`
- `payload_json`
Optional envelope fields:
- `recipient_user_ids_json`
- `request_id`
- `trace_id`
Rules:
- `audience_kind=user` requires `recipient_user_ids_json` with one or more
unique stable `user_id` values
- `audience_kind=admin_email` forbids `recipient_user_ids_json`
- `recipient_user_ids_json` is normalized as an unordered recipient set, so
duplicate `user_id` values are invalid and element order does not affect
idempotency
- `request_id` and `trace_id` are observability-only metadata and do not
participate in the idempotency fingerprint
- `payload_json` is type-specific, must remain backward-compatible for each
`notification_type`, and is normalized structurally for duplicate detection:
insignificant whitespace and object key order are ignored while array order
remains significant
- a replay with the same `(producer, idempotency_key)` and the same normalized
payload is treated as a successful duplicate
- a replay with the same `(producer, idempotency_key)` but different normalized
content is recorded as a conflicting duplicate under malformed-intent storage
with `failure_code=idempotency_conflict` and must not create new routes
- during user enrichment, a missing `user_id` in `User Service` is recorded
under malformed-intent storage with `failure_code=recipient_not_found`
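
The audience rules above reduce to a small check. The sketch below is illustrative only: the `validateAudience` name is hypothetical, and a `nil` slice stands in for an absent `recipient_user_ids_json` field.

```go
package main

import (
	"errors"
	"fmt"
)

// validateAudience enforces the recipient rules for one envelope.
// A nil slice means recipient_user_ids_json was not present (assumption).
func validateAudience(audienceKind string, recipientUserIDs []string) error {
	switch audienceKind {
	case "user":
		if len(recipientUserIDs) == 0 {
			return errors.New("audience_kind=user requires at least one user_id")
		}
		seen := make(map[string]struct{}, len(recipientUserIDs))
		for _, id := range recipientUserIDs {
			if id == "" {
				return errors.New("empty user_id")
			}
			if _, dup := seen[id]; dup {
				return fmt.Errorf("duplicate user_id %q", id)
			}
			seen[id] = struct{}{}
		}
		return nil
	case "admin_email":
		if recipientUserIDs != nil {
			return errors.New("audience_kind=admin_email forbids recipient_user_ids_json")
		}
		return nil
	default:
		return fmt.Errorf("unknown audience_kind %q", audienceKind)
	}
}

func main() {
	fmt.Println(validateAudience("user", []string{"u1", "u2"}))
	fmt.Println(validateAudience("user", []string{"u1", "u1"}))
	fmt.Println(validateAudience("admin_email", []string{"u1"}))
}
```
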
Malformed stream entries do not create durable notification records. They are
logged, metered, and recorded separately for operator inspection.
Accepted intents use the original Redis Stream `stream_entry_id` as
`notification_id`.
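
One way to obtain a fingerprint with the normalization properties described in the rules above is to re-encode the payload canonically and sort the recipient set before hashing. The README does not specify the actual fingerprint algorithm, so the SHA-256 choice, the hashed field set, and the `canonicalJSON` and `fingerprint` names are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"sort"
	"strings"
)

// canonicalJSON re-encodes a JSON document so that insignificant whitespace
// and object key order disappear while array order is preserved.
// encoding/json writes map keys in sorted order, which provides the key
// normalization; UseNumber keeps numeric literals exactly as written.
func canonicalJSON(raw string) (string, error) {
	if !json.Valid([]byte(raw)) {
		return "", errors.New("payload_json is not a single valid JSON value")
	}
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber()
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// fingerprint hashes the fields that take part in duplicate detection.
// request_id and trace_id are deliberately not inputs.
func fingerprint(notificationType, audienceKind string, recipients []string, payloadJSON string) (string, error) {
	payload, err := canonicalJSON(payloadJSON)
	if err != nil {
		return "", err
	}
	ids := append([]string(nil), recipients...) // copy before sorting
	sort.Strings(ids)                           // recipients form an unordered set
	doc, err := json.Marshal(struct {
		Type       string   `json:"notification_type"`
		Audience   string   `json:"audience_kind"`
		Recipients []string `json:"recipient_user_ids"`
		Payload    string   `json:"payload_json"`
	}{notificationType, audienceKind, ids, payload})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(doc)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := fingerprint("game.turn.ready", "user", []string{"u2", "u1"}, `{"game_id":"g1","turn_number":3}`)
	b, _ := fingerprint("game.turn.ready", "user", []string{"u1", "u2"}, `{ "turn_number": 3, "game_id": "g1" }`)
	fmt.Println(a == b)
}
```

A replay whose fingerprint matches the stored `request_fingerprint` would be a successful duplicate; a mismatch under the same `(producer, idempotency_key)` would be an `idempotency_conflict`.
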
## Notification Catalog
`payload_json` fields are normalized by the producer before publication.
| `notification_type` | Producer | Audience | Channels | Required `payload_json` fields |
| --- | --- | --- | --- | --- |
| `geo.review_recommended` | `Geo Profile Service` (`geoprofile`) | configured admin email list (`audience_kind=admin_email`) | `email` | `user_id`, `user_email`, `observed_country`, `usual_connection_country`, `review_reason` |
| `game.turn.ready` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `turn_number` |
| `game.finished` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `final_turn_number` |
| `game.generation_failed` | `Game Master` (`game_master`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name`, `failure_reason` |
| `lobby.runtime_paused_after_start` | `Game Lobby` (`game_lobby`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name` |
| `lobby.application.submitted` | `Game Lobby` (`game_lobby`) | private owner (`audience_kind=user`) or public admins (`audience_kind=admin_email`) | private: `push+email`, public: `email` | `game_id`, `game_name`, `applicant_user_id`, `applicant_name` |
| `lobby.membership.approved` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.membership.rejected` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.invite.created` | `Game Lobby` (`game_lobby`) | invited user (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `inviter_user_id`, `inviter_name` |
| `lobby.invite.redeemed` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
| `lobby.invite.expired` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
Rules:
- v1 supports exactly the eleven `notification_type` values listed above
- `lobby.application.submitted` keeps one stable `notification_type` and one
stable `payload_json` shape; private games publish `audience_kind=user`
while public games publish `audience_kind=admin_email`
- `lobby.invite.revoked` deliberately produces no notification in v1 and
remains outside the supported catalog
- private-game invite notifications remain user-bound by internal `user_id`
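
The catalog table can be expressed as a lookup keyed by `(notification_type, audience_kind)`. This is an illustrative encoding of the table above, not the service's actual data structure; `catalogKey`, `channelMatrix`, and `channelsFor` are hypothetical names.

```go
package main

import "fmt"

// catalogKey identifies one supported (notification_type, audience_kind) pair.
type catalogKey struct{ Type, Audience string }

// channelMatrix mirrors the Channels column of the catalog table.
// lobby.application.submitted appears twice because its audience varies.
var channelMatrix = map[catalogKey][]string{
	{"geo.review_recommended", "admin_email"}:           {"email"},
	{"game.turn.ready", "user"}:                         {"push", "email"},
	{"game.finished", "user"}:                           {"push", "email"},
	{"game.generation_failed", "admin_email"}:           {"email"},
	{"lobby.runtime_paused_after_start", "admin_email"}: {"email"},
	{"lobby.application.submitted", "user"}:             {"push", "email"},
	{"lobby.application.submitted", "admin_email"}:      {"email"},
	{"lobby.membership.approved", "user"}:               {"push", "email"},
	{"lobby.membership.rejected", "user"}:               {"push", "email"},
	{"lobby.invite.created", "user"}:                    {"push", "email"},
	{"lobby.invite.redeemed", "user"}:                   {"push", "email"},
	{"lobby.invite.expired", "user"}:                    {"email"},
}

// channelsFor returns the publishable channels for one pair; ok is false
// when the pair is outside the supported catalog.
func channelsFor(notificationType, audienceKind string) (channels []string, ok bool) {
	channels, ok = channelMatrix[catalogKey{notificationType, audienceKind}]
	return channels, ok
}

func main() {
	fmt.Println(channelsFor("lobby.application.submitted", "user"))
	fmt.Println(channelsFor("lobby.application.submitted", "admin_email"))
	fmt.Println(channelsFor("lobby.invite.revoked", "user"))
}
```
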
## Recipient Enrichment And Locale Policy
For `audience_kind=user`, `Notification Service` resolves user records through
the trusted `User Service` lookup endpoint:
- `GET /api/v1/internal/users/{user_id}`
The response supplies:
- `email`
- `preferred_language`
Locale rules:
- current implemented support is exactly one locale: `en`
- exact `preferred_language` is used when supported by `Mail Service`
- unsupported, empty, or invalid language values fall back to `en`
- no intermediate locale reduction is used in v1
- the same resolved locale drives both `push` payload localization decisions
and `Mail Service` template selection
- enrichment runs during intent acceptance before durable route write
- `404 subject_not_found` from `User Service` is treated as permanent producer
input error and becomes malformed-intent `recipient_not_found`
- temporary `User Service` failures stop the consumer before stream-offset
advance so the same stream entry is retried after restart
For `audience_kind=admin_email`, `Notification Service` does not consult
`User Service` and instead resolves recipients from type-specific config.
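
The locale rules amount to an exact-match lookup with an `en` fallback. In the sketch below, `resolveLocale` is a hypothetical name and the `de` entry in the supported set exists only to show that no intermediate reduction (such as `de-AT` to `de`) takes place; the README states that `en` is the only locale currently implemented.

```go
package main

import "fmt"

// resolveLocale returns preferred only on an exact match against the
// supported set; every other value, including "", falls back to "en".
func resolveLocale(preferred string, supported map[string]bool) string {
	if supported[preferred] {
		return preferred
	}
	return "en"
}

func main() {
	// Hypothetical supported set used for illustration only.
	supported := map[string]bool{"en": true, "de": true}
	for _, lang := range []string{"de", "de-AT", "", "xx"} {
		fmt.Printf("%q -> %q\n", lang, resolveLocale(lang, supported))
	}
}
```
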
## Push Contract Toward Gateway
Push events are published into the existing `Gateway` client-events stream.
Stable routing rules:
- `event_type` equals `notification_type`
- `event_id` equals `<notification_id>/<route_id>`
- `user_id` is derived from `recipient_ref=user:<user_id>` for user-targeted
routes
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
to every active stream of that user
`Notification Service` appends `Gateway` events with `XADD MAXLEN ~` using
`NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`.
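
The routing rules above can be sketched as a function that derives the stream fields for one push route and the matching `XADD` argument list. The `pushRoute` type, the function names, and the `payload_bytes` field name are assumptions; the authoritative field names live in the `Gateway` contract.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// pushRoute holds the route data needed to publish one push event.
type pushRoute struct {
	NotificationID   string
	RouteID          string
	NotificationType string
	RecipientRef     string
	RequestID        string
	TraceID          string
}

// pushEventFields returns field/value pairs for one Gateway stream entry.
// device_session_id is never set, so Gateway fans out to all user streams.
func pushEventFields(r pushRoute, payload []byte) ([]string, error) {
	userID := strings.TrimPrefix(r.RecipientRef, "user:")
	if userID == r.RecipientRef || userID == "" {
		return nil, errors.New("push routes require recipient_ref=user:<user_id>")
	}
	fields := []string{
		"event_type", r.NotificationType,
		"event_id", r.NotificationID + "/" + r.RouteID,
		"user_id", userID,
		"payload_bytes", string(payload), // hypothetical field name
	}
	if r.RequestID != "" {
		fields = append(fields, "request_id", r.RequestID)
	}
	if r.TraceID != "" {
		fields = append(fields, "trace_id", r.TraceID)
	}
	return fields, nil
}

// xaddArgs builds an approximate-trimming XADD command: MAXLEN ~ <n>.
func xaddArgs(stream string, maxLen int64, fields []string) []string {
	args := []string{"XADD", stream, "MAXLEN", "~", strconv.FormatInt(maxLen, 10), "*"}
	return append(args, fields...)
}

// fieldValue looks up one field in a flat field/value pair list.
func fieldValue(fields []string, name string) (string, bool) {
	for i := 0; i+1 < len(fields); i += 2 {
		if fields[i] == name {
			return fields[i+1], true
		}
	}
	return "", false
}

func main() {
	r := pushRoute{NotificationID: "1-0", RouteID: "push:user:42",
		NotificationType: "game.turn.ready", RecipientRef: "user:42", TraceID: "t-1"}
	fields, err := pushEventFields(r, []byte("fb"))
	if err != nil {
		panic(err)
	}
	fmt.Println(xaddArgs("gateway:client-events", 1024, fields))
}
```
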
User-facing push payloads use
[pkg/schema/fbs/notification.fbs](../pkg/schema/fbs/notification.fbs).
| `notification_type` | FlatBuffers table | Payload fields |
| --- | --- | --- |
| `game.turn.ready` | `notification.GameTurnReadyEvent` | `game_id`, `turn_number` |
| `game.finished` | `notification.GameFinishedEvent` | `game_id`, `final_turn_number` |
| `lobby.application.submitted` | `notification.LobbyApplicationSubmittedEvent` | `game_id`, `applicant_user_id` |
| `lobby.membership.approved` | `notification.LobbyMembershipApprovedEvent` | `game_id` |
| `lobby.membership.rejected` | `notification.LobbyMembershipRejectedEvent` | `game_id` |
| `lobby.invite.created` | `notification.LobbyInviteCreatedEvent` | `game_id`, `inviter_user_id` |
| `lobby.invite.redeemed` | `notification.LobbyInviteRedeemedEvent` | `game_id`, `invitee_user_id` |
Only the seven user-facing push notification types above are represented in
`notification.fbs`.
`geo.review_recommended`, `game.generation_failed`,
`lobby.runtime_paused_after_start`, and `lobby.invite.expired` remain outside
this schema because they are email-only in v1.
Checked-in generated Go bindings for this schema live under
[`../pkg/schema/fbs/notification`](../pkg/schema/fbs/notification).
`notification_type` alone determines the concrete FlatBuffers table.
No extra envelope or FlatBuffers `union` is added in v1.
The push payload must stay lightweight and must not attempt to mirror full game,
lobby, or profile state.
`game_name`, human-readable user names, and other full business-state fields
stay out of the push schema.
Clients react to the notification and then fetch fresh business state through
normal service APIs.
## Email Contract Toward Mail Service
Email routes are published to `Mail Service` through
`mail:delivery_commands` using the existing generic async command contract.
Rules:
- `delivery_id` equals `<notification_id>/<route_id>`
- `source` is always `notification`
- `payload_mode` is always `template`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `payload_json.to` contains exactly one resolved recipient email
- `payload_json.cc`, `payload_json.bcc`, `payload_json.reply_to`, and
`payload_json.attachments` are empty arrays in v1
- `template_id` equals `notification_type`
- `locale` is the resolved language from the enrichment step or `en`
- template variables are passed through from normalized `payload_json`
`Notification Service` appends `Mail Service` commands with plain `XADD` and
does not manage retention or trimming of `mail:delivery_commands`.
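
A template-mode command for one email route could be assembled as sketched below. The identifier rules come from the list above; the placement of `template_id`, `locale`, and a `variables` object inside `payload_json`, as well as the `emailRoute`, `mailPayload`, and `mailCommandFields` names, are assumptions. The `Mail Service` AsyncAPI contract remains authoritative for the real shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// mailPayload is an assumed shape for payload_json in template mode.
type mailPayload struct {
	To          []string          `json:"to"`
	Cc          []string          `json:"cc"`
	Bcc         []string          `json:"bcc"`
	ReplyTo     []string          `json:"reply_to"`
	Attachments []json.RawMessage `json:"attachments"`
	TemplateID  string            `json:"template_id"`
	Locale      string            `json:"locale"`
	Variables   json.RawMessage   `json:"variables"` // hypothetical key
}

// emailRoute holds the route and record data needed for one command.
type emailRoute struct {
	NotificationID, RouteID, NotificationType string
	ResolvedEmail, ResolvedLocale             string
	AcceptedAtMs                              int64
	PayloadJSON                               string // normalized intent payload
	RequestID, TraceID                        string
}

// mailCommandFields derives the stream fields for one email route.
func mailCommandFields(r emailRoute) (map[string]string, error) {
	locale := r.ResolvedLocale
	if locale == "" {
		locale = "en"
	}
	payload, err := json.Marshal(mailPayload{
		To:          []string{r.ResolvedEmail}, // exactly one recipient
		Cc:          []string{},
		Bcc:         []string{},
		ReplyTo:     []string{},
		Attachments: []json.RawMessage{},
		TemplateID:  r.NotificationType,
		Locale:      locale,
		Variables:   json.RawMessage(r.PayloadJSON),
	})
	if err != nil {
		return nil, err
	}
	id := r.NotificationID + "/" + r.RouteID
	fields := map[string]string{
		"delivery_id":     id,
		"source":          "notification",
		"payload_mode":    "template",
		"idempotency_key": "notification:" + id,
		"requested_at_ms": strconv.FormatInt(r.AcceptedAtMs, 10),
		"payload_json":    string(payload),
	}
	if r.RequestID != "" {
		fields["request_id"] = r.RequestID
	}
	if r.TraceID != "" {
		fields["trace_id"] = r.TraceID
	}
	return fields, nil
}

func main() {
	fields, err := mailCommandFields(emailRoute{
		NotificationID: "1-0", RouteID: "email:user:42", NotificationType: "game.turn.ready",
		ResolvedEmail: "a@example.com", AcceptedAtMs: 1700000000000,
		PayloadJSON: `{"game_id":"g1","game_name":"Orion","turn_number":3}`,
	})
	fmt.Println(fields, err)
}
```
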
Auth-code email remains a direct `Auth / Session Service -> Mail Service` flow
and does not pass through `Notification Service`.
Initial notification-owned template assets:
| `notification_type` | `template_id` | Required assets |
| --- | --- | --- |
| `geo.review_recommended` | `geo.review_recommended` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.turn.ready` | `game.turn.ready` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.finished` | `game.finished` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.generation_failed` | `game.generation_failed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.runtime_paused_after_start` | `lobby.runtime_paused_after_start` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.application.submitted` | `lobby.application.submitted` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.approved` | `lobby.membership.approved` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.rejected` | `lobby.membership.rejected` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.created` | `lobby.invite.created` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.redeemed` | `lobby.invite.redeemed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.expired` | `lobby.invite.expired` | `en/subject.tmpl`, `en/text.tmpl` |
`auth.login_code` does not belong to the notification-owned template set.
## Route Model
One accepted intent materializes:
- one `notification_record`
- zero or more `notification_route` entries
Each route represents exactly one `(channel, recipient_ref)` pair.
Stable route statuses:
- `pending`
- `published`
- `failed`
- `dead_letter`
- `skipped`
Rules:
- `pending` means the route is ready for first publish or retry
- `published` means the route was durably handed off to its downstream channel
- `failed` means the last publish attempt failed and a later retry is scheduled
- `dead_letter` means the route exhausted its retry budget
- `skipped` means the route slot was durably materialized but intentionally not
emitted
Materialization rules:
- every derived `recipient_ref` receives one `push` route slot and one `email`
route slot; the only exception is the empty administrator email list case
described later in this list
- a route slot whose channel is outside the notification type channel matrix is
materialized as `skipped`
- `recipient_ref` is `user:<user_id>` for user-targeted routes
- `recipient_ref` is `email:<normalized_address>` for configured administrator
email routes
- when an administrator email list is empty, the service materializes one
synthetic recipient slot `config:<notification_type>` with one skipped
`email` route so the configuration gap remains durable and operator-visible
- `route_id` is mandatory and equals `<channel>:<recipient_ref>`
The service-local aggregate notification status is derived from routes and is
not a separate durable source of truth.
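
The materialization rules can be sketched as a pure function over the derived recipients and the channel set allowed for the notification type. The `route` type and `materialize` function are hypothetical names, and the sketch ignores persistence and scheduling.

```go
package main

import "fmt"

// route is a reduced view of notification_route for illustration.
type route struct {
	RouteID      string
	Channel      string
	RecipientRef string
	Status       string
}

// materialize creates one push slot and one email slot per recipient_ref.
// Slots outside the allowed channel set are stored as skipped. An empty
// administrator list yields a single skipped email route on the synthetic
// recipient config:<notification_type>.
func materialize(notificationType string, recipientRefs []string, allowed map[string]bool, adminAudience bool) []route {
	if adminAudience && len(recipientRefs) == 0 {
		ref := "config:" + notificationType
		return []route{{RouteID: "email:" + ref, Channel: "email", RecipientRef: ref, Status: "skipped"}}
	}
	var routes []route
	for _, ref := range recipientRefs {
		for _, channel := range []string{"push", "email"} {
			status := "pending"
			if !allowed[channel] {
				status = "skipped"
			}
			routes = append(routes, route{
				RouteID:      channel + ":" + ref, // <channel>:<recipient_ref>
				Channel:      channel,
				RecipientRef: ref,
				Status:       status,
			})
		}
	}
	return routes
}

func main() {
	emailOnly := map[string]bool{"email": true}
	fmt.Printf("%+v\n", materialize("lobby.invite.expired", []string{"user:42"}, emailOnly, false))
	fmt.Printf("%+v\n", materialize("game.generation_failed", nil, emailOnly, true))
}
```
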
## Redis Logical Model
Storage rules:
- durable records are stored as strict JSON blobs
- timestamps are stored in Unix milliseconds
- dynamic Redis key segments are base64url-encoded
- `notification:route_schedule` is one shared sorted set for both `push` and
`email`

| Logical artifact | Redis key |
| --- | --- |
| `notification_record` | `notification:records:<notification_id>` |
| `notification_route` | `notification:routes:<notification_id>:<route_id>` |
| temporary route lease | `notification:route_leases:<notification_id>:<route_id>` |
| `notification_idempotency_record` | `notification:idempotency:<producer>:<idempotency_key>` |
| `notification_dead_letter_entry` | `notification:dead_letters:<notification_id>:<route_id>` |
| malformed intent record | `notification:malformed_intents:<stream_entry_id>` |
| stream offset record | `notification:stream_offsets:<stream>` |
| ingress stream | `notification:intents` |
| route schedule sorted set | `notification:route_schedule` |

| Record | Frozen fields |
| --- | --- |
| `notification_record` | `notification_id`, `notification_type`, `producer`, `audience_kind`, normalized `recipient_user_ids`, normalized `payload_json`, `idempotency_key`, `request_fingerprint`, optional `request_id`, optional `trace_id`, `occurred_at_ms`, `accepted_at_ms`, `updated_at_ms` |
| `notification_route` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `status`, `attempt_count`, `max_attempts`, `next_attempt_at_ms`, optional `resolved_email`, optional `resolved_locale`, optional `last_error_classification`, optional `last_error_message`, optional `last_error_at_ms`, `created_at_ms`, `updated_at_ms`, optional `published_at_ms`, optional `dead_lettered_at_ms`, optional `skipped_at_ms` |
| `notification_idempotency_record` | `producer`, `idempotency_key`, `notification_id`, `request_fingerprint`, `created_at_ms`, `expires_at_ms` |
| `notification_dead_letter_entry` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `final_attempt_count`, `max_attempts`, `failure_classification`, `failure_message`, `created_at_ms`, optional `recovery_hint` |
| malformed intent record | `stream_entry_id`, optional `notification_type`, optional `producer`, optional `idempotency_key`, `failure_code`, `failure_message`, `raw_fields_json`, `recorded_at_ms` |
| stream offset record | `stream`, `last_processed_entry_id`, `updated_at_ms` |
`notification_record.recipient_user_ids` stores a normalized array of unique
`user_id` values and is omitted for `audience_kind=admin_email`.
`notification_record.payload_json` stores the canonical normalized JSON string
used for idempotency fingerprinting.
Temporary route lease keys store one opaque worker token and use
`NOTIFICATION_ROUTE_LEASE_TTL`; they are service-local coordination state
rather than durable records.
`notification:route_schedule` stores one member per scheduled route: the score
is `next_attempt_at_ms` and the member is the full Redis route key with encoded
dynamic segments.
Newly accepted publishable routes enter the schedule immediately with
`status=pending` and `next_attempt_at_ms = accepted_at_ms`.
`failed` routes remain scheduled for retry.
`published`, `dead_letter`, and `skipped` routes are absent from the schedule.
Only the current lease holder may finalize one due publication attempt.
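
Key construction with base64url-encoded segments can be sketched as below. The README does not say whether padding is used, so `base64.RawURLEncoding` (no padding) is an assumption, as are the helper names. Because the base64url alphabet contains no `:`, the encoded segments can be split unambiguously, which is what allows a publisher to recover the channel prefix from a schedule member.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// seg encodes one dynamic key segment (unpadded base64url is an assumption).
func seg(s string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(s))
}

// routeKey builds notification:routes:<notification_id>:<route_id>.
// The same string serves as the member in notification:route_schedule.
func routeKey(notificationID, routeID string) string {
	return "notification:routes:" + seg(notificationID) + ":" + seg(routeID)
}

// leaseKey builds notification:route_leases:<notification_id>:<route_id>.
func leaseKey(notificationID, routeID string) string {
	return "notification:route_leases:" + seg(notificationID) + ":" + seg(routeID)
}

// channelOf decodes the route_id segment of a schedule member and returns
// its channel prefix, so a publisher can claim only its own routes.
func channelOf(member string) (string, error) {
	i := strings.LastIndex(member, ":")
	if i < 0 {
		return "", errors.New("member is not a route key")
	}
	raw, err := base64.RawURLEncoding.DecodeString(member[i+1:])
	if err != nil {
		return "", err
	}
	channel, _, ok := strings.Cut(string(raw), ":")
	if !ok {
		return "", errors.New("route_id has no channel prefix")
	}
	return channel, nil
}

func main() {
	member := routeKey("1-0", "push:user:42")
	channel, err := channelOf(member)
	fmt.Println(member, channel, err)
	fmt.Println(leaseKey("1-0", "email:user:42"))
}
```
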
## Retry And Dead-Letter Policy
Retry budgets are channel-specific:
- `push` publication to `Gateway`: `3` attempts total
- `email` publication to `Mail Service`: `7` attempts total
Rules:
- the first publication attempt happens immediately at `accepted_at_ms`
- after failed attempt `N`, the next delay is `clamp(NOTIFICATION_ROUTE_BACKOFF_MIN * 2^(N-1), NOTIFICATION_ROUTE_BACKOFF_MIN, NOTIFICATION_ROUTE_BACKOFF_MAX)`
- no jitter is added to the retry delay
- `push` and `email` routes are retried independently
- the shared schedule is filtered by route prefix so `push` publishers claim
only `push:` routes and `email` publishers claim only `email:` routes
- `push` and `email` replicas coordinate through
`notification:route_leases:<notification_id>:<route_id>` with
`NOTIFICATION_ROUTE_LEASE_TTL`
- `push` publication failures are classified minimally as
`payload_encoding_failed` and `gateway_stream_publish_failed`
- `email` publication failures are classified minimally as
`payload_encoding_failed` and `mail_stream_publish_failed`
- when a route exhausts its retry budget, it transitions to `dead_letter`,
creates `notification_dead_letter_entry`, and is removed from
`notification:route_schedule`
- one exhausted route entering `dead_letter` must not roll back or invalidate a
sibling route that already reached `published`
- service restarts resume from durable route state and persisted stream offsets
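
The delay formula and the retry-budget transition above can be written out directly. The function names are hypothetical, and the sketch assumes `NOTIFICATION_ROUTE_BACKOFF_MIN` does not exceed `NOTIFICATION_ROUTE_BACKOFF_MAX`.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns clamp(minDelay * 2^(n-1), minDelay, maxDelay) for the
// n-th failed attempt. Doubling stops once maxDelay is reached, which also
// avoids integer overflow for large n. No jitter is applied.
func retryDelay(failedAttempt int, minDelay, maxDelay time.Duration) time.Duration {
	d := minDelay
	for i := 1; i < failedAttempt && d < maxDelay; i++ {
		d *= 2
	}
	if d < minDelay {
		d = minDelay
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// statusAfterFailure returns the route status after a failed attempt:
// dead_letter once the budget is exhausted, otherwise failed (still scheduled).
func statusAfterFailure(attemptCount, maxAttempts int) string {
	if attemptCount >= maxAttempts {
		return "dead_letter"
	}
	return "failed"
}

func main() {
	const maxAttempts = 7 // default email budget
	for n := 1; n <= maxAttempts; n++ {
		status := statusAfterFailure(n, maxAttempts)
		if status == "dead_letter" {
			fmt.Printf("attempt %d failed: %s\n", n, status)
			break
		}
		fmt.Printf("attempt %d failed: %s, retry in %s\n",
			n, status, retryDelay(n, time.Second, 5*time.Minute))
	}
}
```

With the default settings, an `email` route that keeps failing waits 1s, 2s, 4s, 8s, 16s, and 32s between its seven attempts, so the `5m` ceiling only matters when the backoff minimum or the attempt budget is raised.
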
Retention rules:
- `notification_record` and `notification_route` use
`NOTIFICATION_RECORD_TTL`
- `notification_idempotency_record` uses `NOTIFICATION_IDEMPOTENCY_TTL`
- `notification_dead_letter_entry` and malformed intent records use
`NOTIFICATION_DEAD_LETTER_TTL`
- stream offset records do not use TTL
## Observability
The service instruments:
- internal probe HTTP requests
- internal probe HTTP listener startup and shutdown events
- structured logs for accepted, duplicate, and rejected notification intents
- structured logs for `push` and `email` route publication, retry, and
dead-letter transitions
- accepted and duplicate intent outcomes
- malformed intents, including idempotency conflicts and unresolved recipients
- user-enrichment lookup outcomes
- route publish attempts, retries, and dead-letter transitions
- current route-schedule depth and oldest scheduled route age
- oldest unprocessed intent stream entry age
Metric names:
- `notification.intent.outcomes`
- `notification.intent.malformed`
- `notification.user_enrichment.attempts`
- `notification.route.publish_attempts`
- `notification.route.retries`
- `notification.route.dead_letters`
- `notification.route_schedule.depth`
- `notification.route_schedule.oldest_age_ms`
- `notification.intent_stream.oldest_unprocessed_age_ms`
Metrics intentionally avoid high-cardinality attributes such as `user_id`,
email address, `notification_id`, `route_id`, and `idempotency_key`.
Metric attributes may include `notification_type`, `producer`,
`audience_kind`, `channel`, `result`, `outcome`, `failure_code`, and
`failure_classification`.
Structured logs for intent intake, duplicate resolution, malformed-intent
recording, route publication, retry scheduling, and dead-letter transitions use
the same field names where the value exists:
- `notification_id`
- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `route_id`
- `channel`
- `request_id`
- `trace_id`
OpenTelemetry trace context is logged as `otel_trace_id` and `otel_span_id`
when the active context carries a valid span.
## Recovery
The supported manual replay path for a dead-lettered notification route is to
publish a new compatible intent to `notification:intents`.
Recovery rules:
- inspect the `notification_dead_letter_entry`, `notification_route`, and
owning `notification_record`
- confirm the downstream dependency or payload problem has been corrected
- publish a new intent with the same semantic `payload_json` and audience
fields, but with a new producer-owned `idempotency_key`
- keep the old `notification_dead_letter_entry` untouched as audit history
until its configured TTL expires
Manual Redis mutation of an existing route record or
`notification:route_schedule` is not a supported replay workflow.
## Verification
Focused service-local coverage verifies:
- configuration loading and validation
- `GET /healthz`
- `GET /readyz`
- absence of `/metrics`
- Redis startup fast-fail behavior
- graceful shutdown of the private probe listener
- valid intent acceptance
- malformed intent rejection
- duplicate and conflicting duplicate handling
- user-targeted route enrichment from `User Service`
- `recipient_not_found` malformed-intent recording for unresolved `user_id`
- temporary `User Service` failure handling without stream-offset advance
- FlatBuffers payload encoding for all seven user-facing `push`
`notification_type` values
- template-mode `Mail Service` command encoding for user and administrator
`email` routes
- due-route loading, lease acquisition, route publication, retry reschedule,
and dead-letter persistence in Redis
- `push` worker success, retry, and duplicate-prevention behavior across
concurrent replicas
- `email` worker success, retry, and duplicate-prevention behavior across
concurrent replicas
- OpenTelemetry metric recording for intent outcomes, malformed intents, user
enrichment, route publication attempts, retries, dead letters, route-schedule
gauges, and intent-stream lag
- Redis-backed route-schedule and intent-stream lag snapshots
- structured log field helper coverage through intake and publisher tests
- intent-consumer restart from `0-0` and from persisted stream offsets
- runtime wiring of the intent consumer and both route publishers
- shared `galaxy/notificationintent` producer constructors, validation, and
Redis Stream publication compatibility
Cross-service coverage verifies:
- `Notification Service -> User Service` enrichment compatibility and failure
handling
- `Notification Service -> Gateway` push compatibility for every user-facing
`notification_type`
- `Notification Service -> Mail Service` template-mode handoff for every
supported email type
- producer compatibility for `Game Master`, `Game Lobby`, and
`Geo Profile Service` through `galaxy/notificationintent`
- explicit regression that auth-code email still bypasses `Notification Service`
- real black-box `Notification Service -> Gateway` push fan-out coverage
- real black-box `Notification Service -> Mail Service` template-mode handoff
coverage
Real producer-boundary suites for `Game Master`, `Game Lobby`, and
`Geo Profile Service` should be added only when those service boundaries exist
in code.