galaxy-game/notification/README.md
2026-04-26 20:34:39 +02:00


# Notification Service
Canonical references:
- [Service-local docs](docs/README.md)
- [Intent AsyncAPI contract](api/intents-asyncapi.yaml)
- [Probe OpenAPI contract](openapi.yaml)
- [Gateway push model](../gateway/README.md)
- [Mail async command contract](../mail/api/delivery-commands-asyncapi.yaml)
- [Notification FlatBuffers payloads](../pkg/schema/fbs/notification.fbs)
- [System architecture](../ARCHITECTURE.md)
## Purpose
`Notification Service` is the internal asynchronous orchestration layer for
platform notifications.
It accepts normalized notification intents from upstream services, materializes
per-recipient routes, enriches user-targeted routes through `User Service`,
publishes client-facing push events toward `Gateway`, publishes non-auth email
commands toward `Mail Service`, and isolates transient downstream failures with
independent retry budgets per channel.
The service is intentionally not a source of truth for:
- game state
- lobby membership
- invite ownership
- review flags
- notification preferences
- email delivery attempts
## Responsibility Boundaries
`Notification Service` is responsible for:
- consuming normalized notification intents from a dedicated Redis Stream
- validating intent envelopes and rejecting malformed or conflicting duplicates
- persisting durable notification and route state
- resolving user contact data from `User Service` by `user_id`
- selecting locale from `User Service.preferred_language` with `en` fallback
- shaping lightweight push payloads for user-facing events
- publishing template-mode email commands to `Mail Service`
- retrying route publication independently for `push` and `email`
- persisting dead-letter entries for exhausted routes
`Notification Service` is not responsible for:
- computing business audiences from `game_id` or other domain identifiers
- owning administrator identity or administrator user records
- sending auth-code email
- storing per-user notification preferences in v1
- exposing an operator REST API in v1
The key design rule is that upstream producers must publish the concrete
`recipient_user_id` values for user-targeted notification intents. For
administrator-only notification types, recipient email addresses are resolved
from `Notification Service` configuration by `notification_type`. Private-game
invite notifications in v1 remain user-bound by internal `user_id` values and
must not target recipients by raw email address.
## Runtime Surface
The implemented process contains:
- one private internal HTTP probe listener
- process-wide structured logging
- process-wide OpenTelemetry runtime
- one shared `galaxy/notificationintent` producer contract module
- one shared Redis client with startup connectivity check
- one trusted `User Service` HTTP enrichment client
- one plain-`XREAD` notification-intent consumer
- one long-lived `push` route publisher
- one long-lived `email` route publisher
- durable accepted-intent, route, idempotency, and malformed-intent storage in
PostgreSQL, plus stream-offset storage in Redis
- user-targeted route enrichment during intent acceptance before durable write
- client-facing `push` publication toward `Gateway`
- template-mode `email` publication toward `Mail Service`
- durable `push` and `email` retry and dead-letter state in PostgreSQL, with
temporary lease coordination in Redis
- OpenTelemetry counters and observable gauges for intent intake, user
enrichment, route publication, route schedule depth, and intent stream lag
- graceful shutdown on process cancellation
Probe contract:
- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}`
- `/readyz` is process-local after successful startup and does not perform a
live Redis ping per request
- there is no `/metrics` route
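
The probe contract above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual wiring: `newProbeMux`, `probeHandler`, and `probe` are hypothetical names, and rejecting non-`GET` methods with `405` is an assumption the contract does not state.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newProbeMux registers only the two probe routes from the contract.
// No /metrics route is registered, so the mux answers it with 404.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", probeHandler(`{"status":"ok"}`))
	mux.HandleFunc("/readyz", probeHandler(`{"status":"ready"}`))
	return mux
}

// probeHandler returns a fixed JSON body for GET requests.
// Answering other methods with 405 is an assumption of this sketch.
func probeHandler(body string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, body)
	}
}

// probe performs an in-memory GET and returns the status code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newProbeMux()
	for _, path := range []string{"/healthz", "/readyz", "/metrics"} {
		code, body := probe(mux, path)
		fmt.Println(path, code, body)
	}
}
```

Because the readiness body is a constant, the handler performs no Redis call per request, which matches the process-local rule above.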
Runtime behavior:
- the intent consumer reads `notification:intents` with plain `XREAD`
- when no stored stream offset exists, the consumer starts from `0-0`
- the persisted offset advances only after durable acceptance or durable
malformed-intent recording
- user-targeted routes are enriched through `GET /api/v1/internal/users/{user_id}`
before durable route write
- `404 subject_not_found` from `User Service` is recorded under
malformed-intent storage with `failure_code=recipient_not_found`
- temporary `User Service` lookup failures stop the consumer before
stream-offset advance
- due `push` routes are published toward `Gateway` from the shared due-route
schedule (the `routes_due_idx` partial index on `next_attempt_at`)
- due `email` routes are published toward `Mail Service` from the same shared
due-route schedule
- the `push` publisher claims only routes whose `route_id` starts with `push:`
- the `email` publisher claims only routes whose `route_id` starts with `email:`
- replicas coordinate through temporary Redis lease
`notification:route_leases:<notification_id>:<route_id>`
- `Gateway` publication uses `XADD MAXLEN ~` with
`NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`
- `event_id` equals `<notification_id>/<route_id>`
- `Mail Service` publication uses plain `XADD` with no stream trimming
- `delivery_id` equals `<notification_id>/<route_id>`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
to every active stream of that user
- Go producers use `galaxy/notificationintent` to construct and publish
compatible intents into `notification:intents`
- producer publication uses plain `XADD` without stream trimming or hidden
helper retries
- a producer-side notification publication failure is notification degradation
and must not roll back the source business state
- metric export uses the configured OpenTelemetry exporters only
- there is still no `/metrics` route
- `notification.route_schedule.depth` and
`notification.route_schedule.oldest_age_ms` are derived from
`notification:route_schedule`
- `notification.intent_stream.oldest_unprocessed_age_ms` is derived from the
persisted intent stream offset and the configured ingress stream
- manual dead-letter replay is performed by publishing a new compatible intent
with a new `idempotency_key`; existing dead-letter records remain audit
history until `NOTIFICATION_RECORD_RETENTION` removes them
The target process shape is one internal-only process with:
- one notification-intent consumer
- one `push` route publisher for `Gateway`
- one `email` route publisher for `Mail Service`
Intentional runtime omissions in v1:
- no public ingress
- no dedicated operator REST API
- no direct client delivery
- no direct SMTP integration
## Configuration
Required:
- `NOTIFICATION_REDIS_MASTER_ADDR`
- `NOTIFICATION_REDIS_PASSWORD`
- `NOTIFICATION_POSTGRES_PRIMARY_DSN`
- `NOTIFICATION_USER_SERVICE_BASE_URL`
Primary configuration groups:
- process and logging:
  - `NOTIFICATION_SHUTDOWN_TIMEOUT`
  - `NOTIFICATION_LOG_LEVEL`
- internal probe HTTP:
  - `NOTIFICATION_INTERNAL_HTTP_ADDR` with default `:8092`
  - `NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT` with default `2s`
  - `NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT` with default `10s`
  - `NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT` with default `1m`
- Redis connectivity (master/replica/password shape; the deprecated
  `NOTIFICATION_REDIS_ADDR`, `NOTIFICATION_REDIS_USERNAME`, and
  `NOTIFICATION_REDIS_TLS_ENABLED` env vars are rejected at startup):
  - `NOTIFICATION_REDIS_REPLICA_ADDRS` (optional, comma-separated)
  - `NOTIFICATION_REDIS_DB`
  - `NOTIFICATION_REDIS_OPERATION_TIMEOUT`
- PostgreSQL connectivity:
  - `NOTIFICATION_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
  - `NOTIFICATION_POSTGRES_OPERATION_TIMEOUT`
  - `NOTIFICATION_POSTGRES_MAX_OPEN_CONNS`
  - `NOTIFICATION_POSTGRES_MAX_IDLE_CONNS`
  - `NOTIFICATION_POSTGRES_CONN_MAX_LIFETIME`
- stream names:
  - `NOTIFICATION_INTENTS_STREAM` with default `notification:intents`
  - `NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT` with default `2s`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` with default `gateway:client-events`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN` with default `1024`
  - `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` with default `mail:delivery_commands`
- retry and dead-letter:
  - `NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS` with default `3`
  - `NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS` with default `7`
  - `NOTIFICATION_ROUTE_BACKOFF_MIN` with default `1s`
  - `NOTIFICATION_ROUTE_BACKOFF_MAX` with default `5m`
  - `NOTIFICATION_ROUTE_LEASE_TTL` with default `5s`
  - `NOTIFICATION_IDEMPOTENCY_TTL` with default `168h`
- retention (periodic SQL retention worker; replaces the previous
  `NOTIFICATION_DEAD_LETTER_TTL` and `NOTIFICATION_RECORD_TTL` Redis-EXPIRE
  knobs):
  - `NOTIFICATION_RECORD_RETENTION` with default `720h`
  - `NOTIFICATION_MALFORMED_INTENT_RETENTION` with default `2160h`
  - `NOTIFICATION_CLEANUP_INTERVAL` with default `1h`
- `User Service` enrichment:
  - `NOTIFICATION_USER_SERVICE_TIMEOUT` with default `1s`
- administrator routing:
  - `NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED`
  - `NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED`
- OpenTelemetry:
  - standard `OTEL_*` variables
  - `NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED`
  - `NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED`
Each administrator configuration variable stores a comma-separated list of
email addresses for exactly one `notification_type`. v1 does not use one global
admin-recipient list shared across all administrative events.
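
As an illustration of how one comma-separated administrator variable might be turned into a recipient list, consider the sketch below. `parseAdminEmails` is a hypothetical helper; trimming, lower-casing, and de-duplication are assumed normalization steps that this document does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAdminEmails splits one comma-separated configuration value into a
// de-duplicated list that keeps first-seen order. Trimming and lower-casing
// are assumptions of this sketch, not documented service behavior.
func parseAdminEmails(raw string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, part := range strings.Split(raw, ",") {
		addr := strings.ToLower(strings.TrimSpace(part))
		if addr == "" || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	// The addresses are made-up example values.
	fmt.Println(parseAdminEmails(" Ops@example.com, ops@example.com ,oncall@example.com"))
	// An unset or empty variable yields an empty list.
	fmt.Println(len(parseAdminEmails("")))
}
```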
## Stable Input Contract
The service accepts intents from one dedicated Redis Stream:
- `notification:intents`
The canonical envelope is defined in
[api/intents-asyncapi.yaml](api/intents-asyncapi.yaml).
Go producers should use the shared `galaxy/notificationintent` module to build
and append compatible stream entries instead of duplicating field names,
payload structs, or validation rules locally.
Required envelope fields:
- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `occurred_at_ms`
- `payload_json`
Optional envelope fields:
- `recipient_user_ids_json`
- `request_id`
- `trace_id`
Rules:
- `audience_kind=user` requires `recipient_user_ids_json` with one or more
unique stable `user_id` values
- `audience_kind=admin_email` forbids `recipient_user_ids_json`
- `recipient_user_ids_json` is normalized as an unordered recipient set, so
duplicate `user_id` values are invalid and element order does not affect
idempotency
- `request_id` and `trace_id` are observability-only metadata and do not
participate in the idempotency fingerprint
- `payload_json` is type-specific, must remain backward-compatible for each
`notification_type`, and is normalized structurally for duplicate detection:
insignificant whitespace and object key order are ignored while array order
remains significant
- a replay with the same `(producer, idempotency_key)` and the same normalized
payload is treated as a successful duplicate
- a replay with the same `(producer, idempotency_key)` but different normalized
content is recorded as a conflicting duplicate under malformed-intent storage
with `failure_code=idempotency_conflict` and must not create new routes
- during user enrichment, a missing `user_id` in `User Service` is recorded
under malformed-intent storage with `failure_code=recipient_not_found`
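
The normalization rules above can be sketched as follows. This is an illustrative approach, not the service's real fingerprint: `normalizeJSON` and `fingerprint` are hypothetical names, and the exact set of fields that feed the fingerprint, as well as the use of SHA-256, are assumptions. The sketch relies on `encoding/json` writing map keys in sorted order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"sort"
	"strings"
)

// normalizeJSON re-encodes a JSON document so that insignificant whitespace
// and object key order disappear while array order is preserved.
func normalizeJSON(raw string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // keep numeric literals exactly as written
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	// Reject anything other than whitespace after the first JSON value.
	if _, err := dec.Token(); err != io.EOF {
		return "", errors.New("unexpected data after JSON value")
	}
	out, err := json.Marshal(v) // map keys are emitted in sorted order
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// fingerprint hashes the fields assumed to define intent identity.
// request_id and trace_id are deliberately not inputs.
func fingerprint(notificationType, audienceKind string, recipients []string, payloadJSON string) (string, error) {
	payload, err := normalizeJSON(payloadJSON)
	if err != nil {
		return "", err
	}
	// Recipients form an unordered set, so sort a copy before hashing.
	sorted := append([]string(nil), recipients...)
	sort.Strings(sorted)
	// One JSON array keeps the boundaries between parts unambiguous.
	parts, err := json.Marshal([]interface{}{notificationType, audienceKind, sorted, payload})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(parts)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := fingerprint("game.turn.ready", "user", []string{"u2", "u1"},
		`{"game_id":"g1", "turn_number":3}`)
	b, _ := fingerprint("game.turn.ready", "user", []string{"u1", "u2"},
		`{ "turn_number": 3, "game_id": "g1" }`)
	fmt.Println("same fingerprint:", a == b)
}
```

Under these assumptions, a replay whose fingerprint matches the stored one is a successful duplicate, and a replay with the same `(producer, idempotency_key)` but a different fingerprint is an `idempotency_conflict`.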
Malformed stream entries do not create durable notification records. They are
logged, metered, and recorded separately for operator inspection.
Accepted intents use the original Redis Stream `stream_entry_id` as
`notification_id`.
## Notification Catalog
`payload_json` fields are normalized by the producer before publication.
| `notification_type` | Producer | Audience | Channels | Required `payload_json` fields |
| --- | --- | --- | --- | --- |
| `geo.review_recommended` | `Geo Profile Service` (`geoprofile`) | configured admin email list (`audience_kind=admin_email`) | `email` | `user_id`, `user_email`, `observed_country`, `usual_connection_country`, `review_reason` |
| `game.turn.ready` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `turn_number` |
| `game.finished` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `final_turn_number` |
| `game.generation_failed` | `Game Master` (`game_master`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name`, `failure_reason` |
| `lobby.runtime_paused_after_start` | `Game Lobby` (`game_lobby`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name` |
| `lobby.application.submitted` | `Game Lobby` (`game_lobby`) | private owner (`audience_kind=user`) or public admins (`audience_kind=admin_email`) | private: `push+email`, public: `email` | `game_id`, `game_name`, `applicant_user_id`, `applicant_name` |
| `lobby.membership.approved` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.membership.rejected` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.membership.blocked` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `membership_user_id`, `membership_user_name`, `reason` |
| `lobby.invite.created` | `Game Lobby` (`game_lobby`) | invited user (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `inviter_user_id`, `inviter_name` |
| `lobby.invite.redeemed` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
| `lobby.invite.expired` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
| `lobby.race_name.registration_eligible` | `Game Lobby` (`game_lobby`) | capable member (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `race_name`, `eligible_until_ms` |
| `lobby.race_name.registered` | `Game Lobby` (`game_lobby`) | registering user (`audience_kind=user`) | `push+email` | `race_name` |
| `lobby.race_name.registration_denied` | `Game Lobby` (`game_lobby`) | incapable member (`audience_kind=user`) | `email` | `game_id`, `game_name`, `race_name`, `reason` |
Rules:
- v1 supports exactly the fifteen `notification_type` values listed above
- `lobby.application.submitted` keeps one stable `notification_type` and one
stable `payload_json` shape; private games publish `audience_kind=user`
while public games publish `audience_kind=admin_email`
- `lobby.invite.revoked` deliberately produces no notification in v1 and
remains outside the supported catalog
- private-game invite notifications remain user-bound by internal `user_id`
- `lobby.race_name.registration_eligible` and
`lobby.race_name.registration_denied` are emitted by `Game Lobby` at
`game_finished` based on capability evaluation; the former always pairs
with a 30-day `eligible_until_ms` window
- `lobby.race_name.registered` is emitted on successful
`lobby.race_name.register` commit
## Recipient Enrichment And Locale Policy
For `audience_kind=user`, `Notification Service` resolves user records through
the trusted `User Service` lookup endpoint:
- `GET /api/v1/internal/users/{user_id}`
The response supplies:
- `email`
- `preferred_language`
Locale rules:
- current implemented support is exactly one locale: `en`
- exact `preferred_language` is used when supported by `Mail Service`
- unsupported, empty, or invalid language values fall back to `en`
- no intermediate locale reduction is used in v1
- the same resolved locale drives both `push` payload localization decisions
and `Mail Service` template selection
- enrichment runs during intent acceptance before durable route write
- `404 subject_not_found` from `User Service` is treated as permanent producer
input error and becomes malformed-intent `recipient_not_found`
- temporary `User Service` failures stop the consumer before stream-offset
advance so the same stream entry is retried after restart
For `audience_kind=admin_email`, `Notification Service` does not consult
`User Service` and instead resolves recipients from type-specific config.
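
A minimal sketch of the locale rules above, assuming the supported set is passed in by the caller. `resolveLocale` is a hypothetical name, and the `de` entry in the usage example is invented to show the exact-match behavior; the document lists only `en` as implemented.

```go
package main

import "fmt"

const fallbackLocale = "en"

// resolveLocale returns preferred only on an exact match against the
// supported set. No reduction such as "de-AT" -> "de" is attempted, and
// empty or unknown values fall back to "en".
func resolveLocale(preferred string, supported map[string]bool) string {
	if supported[preferred] {
		return preferred
	}
	return fallbackLocale
}

func main() {
	current := map[string]bool{"en": true}
	fmt.Println(resolveLocale("en", current))
	fmt.Println(resolveLocale("fr", current))

	// Hypothetical future set, used only to show exact matching.
	future := map[string]bool{"en": true, "de": true}
	fmt.Println(resolveLocale("de", future))
	fmt.Println(resolveLocale("de-AT", future))
}
```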
## Push Contract Toward Gateway
Push events are published into the existing `Gateway` client-events stream.
Stable routing rules:
- `event_type` equals `notification_type`
- `event_id` equals `<notification_id>/<route_id>`
- `user_id` is derived from `recipient_ref=user:<user_id>` for user-targeted
routes
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
to every active stream of that user
`Notification Service` appends `Gateway` events with `XADD MAXLEN ~` using
`NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`.
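
The routing rules above amount to a small derivation from the route identity. The sketch below is illustrative: `pushEvent`, `userIDFromRecipientRef`, and `buildPushEvent` are hypothetical names, and the struct is not the actual stream entry layout.

```go
package main

import (
	"fmt"
	"strings"
)

// pushEvent holds the routing values derived for one push route.
// There is no device_session_id field, mirroring its intentional omission.
type pushEvent struct {
	EventType string
	EventID   string
	UserID    string
}

// userIDFromRecipientRef extracts <user_id> from "user:<user_id>".
func userIDFromRecipientRef(ref string) (string, bool) {
	const prefix = "user:"
	if !strings.HasPrefix(ref, prefix) || len(ref) == len(prefix) {
		return "", false
	}
	return strings.TrimPrefix(ref, prefix), true
}

// buildPushEvent derives the routing fields for a user-targeted push route.
// It reports false for recipient references that are not user-bound.
func buildPushEvent(notificationID, notificationType, recipientRef string) (pushEvent, bool) {
	userID, ok := userIDFromRecipientRef(recipientRef)
	if !ok {
		return pushEvent{}, false
	}
	routeID := "push:" + recipientRef // route_id = <channel>:<recipient_ref>
	return pushEvent{
		EventType: notificationType,
		EventID:   notificationID + "/" + routeID,
		UserID:    userID,
	}, true
}

func main() {
	// The notification_id is a made-up Redis Stream entry ID.
	ev, ok := buildPushEvent("1714150000000-0", "game.turn.ready", "user:u1")
	fmt.Println(ok, ev.EventType, ev.EventID, ev.UserID)
}
```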
User-facing push payloads use
[pkg/schema/fbs/notification.fbs](../pkg/schema/fbs/notification.fbs).
| `notification_type` | FlatBuffers table | Payload fields |
| --- | --- | --- |
| `game.turn.ready` | `notification.GameTurnReadyEvent` | `game_id`, `turn_number` |
| `game.finished` | `notification.GameFinishedEvent` | `game_id`, `final_turn_number` |
| `lobby.application.submitted` | `notification.LobbyApplicationSubmittedEvent` | `game_id`, `applicant_user_id` |
| `lobby.membership.approved` | `notification.LobbyMembershipApprovedEvent` | `game_id` |
| `lobby.membership.rejected` | `notification.LobbyMembershipRejectedEvent` | `game_id` |
| `lobby.membership.blocked` | `notification.LobbyMembershipBlockedEvent` | `game_id`, `membership_user_id`, `reason` |
| `lobby.invite.created` | `notification.LobbyInviteCreatedEvent` | `game_id`, `inviter_user_id` |
| `lobby.invite.redeemed` | `notification.LobbyInviteRedeemedEvent` | `game_id`, `invitee_user_id` |
| `lobby.race_name.registration_eligible` | `notification.LobbyRaceNameRegistrationEligibleEvent` | `game_id`, `race_name`, `eligible_until_ms` |
| `lobby.race_name.registered` | `notification.LobbyRaceNameRegisteredEvent` | `race_name` |
Only the ten user-facing push notification types above are represented in
`notification.fbs`.
`geo.review_recommended`, `game.generation_failed`,
`lobby.runtime_paused_after_start`, `lobby.invite.expired`, and
`lobby.race_name.registration_denied` remain outside this schema because
they are email-only in v1.
Checked-in generated Go bindings for this schema live under
[`../pkg/schema/fbs/notification`](../pkg/schema/fbs/notification).
`notification_type` alone determines the concrete FlatBuffers table.
No extra envelope or FlatBuffers `union` is added in v1.
The push payload must stay lightweight and must not attempt to mirror full game,
lobby, or profile state.
`game_name`, human-readable user names, and other full business-state fields
stay out of the push schema.
Clients react to the notification and then fetch fresh business state through
normal service APIs.
## Email Contract Toward Mail Service
Email routes are published to `Mail Service` through
`mail:delivery_commands` using the existing generic async command contract.
Rules:
- `delivery_id` equals `<notification_id>/<route_id>`
- `source` is always `notification`
- `payload_mode` is always `template`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `payload_json.to` contains exactly one resolved recipient email
- `payload_json.cc`, `payload_json.bcc`, `payload_json.reply_to`, and
`payload_json.attachments` are empty arrays in v1
- `template_id` equals `notification_type`
- `locale` is the resolved language from the enrichment step or `en`
- template variables are passed through from normalized `payload_json`
`Notification Service` appends `Mail Service` commands with plain `XADD` and
does not manage retention or trimming of `mail:delivery_commands`.
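
The fixed and derived values in the rules above can be collected as in the sketch below. `mailCommand` and `buildMailCommand` are hypothetical names; the real wire layout, including which values live inside `payload_json`, is defined by the Mail Service AsyncAPI contract rather than by this struct.

```go
package main

import "fmt"

// mailCommand collects the values that the email rules pin down.
type mailCommand struct {
	DeliveryID     string
	Source         string
	PayloadMode    string
	IdempotencyKey string
	RequestedAtMs  int64
	TemplateID     string
	Locale         string
	To             []string
}

// buildMailCommand derives one command for one email route.
func buildMailCommand(notificationID, routeID, notificationType, email, locale string, acceptedAtMs int64) mailCommand {
	if locale == "" {
		locale = "en" // fallback when enrichment resolved no language
	}
	id := notificationID + "/" + routeID
	return mailCommand{
		DeliveryID:     id,
		Source:         "notification",
		PayloadMode:    "template",
		IdempotencyKey: "notification:" + id,
		RequestedAtMs:  acceptedAtMs, // requested_at_ms equals accepted_at_ms
		TemplateID:     notificationType,
		Locale:         locale,
		To:             []string{email}, // exactly one resolved recipient
	}
}

func main() {
	// All argument values are made-up examples.
	cmd := buildMailCommand("1714150000000-0", "email:user:u1",
		"game.finished", "player@example.com", "", 1714150000123)
	fmt.Printf("%+v\n", cmd)
}
```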
Auth-code email remains a direct `Auth / Session Service -> Mail Service` flow
and does not pass through `Notification Service`.
Initial notification-owned template assets:
| `notification_type` | `template_id` | Required assets |
| --- | --- | --- |
| `geo.review_recommended` | `geo.review_recommended` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.turn.ready` | `game.turn.ready` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.finished` | `game.finished` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.generation_failed` | `game.generation_failed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.runtime_paused_after_start` | `lobby.runtime_paused_after_start` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.application.submitted` | `lobby.application.submitted` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.approved` | `lobby.membership.approved` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.rejected` | `lobby.membership.rejected` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.blocked` | `lobby.membership.blocked` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.created` | `lobby.invite.created` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.redeemed` | `lobby.invite.redeemed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.expired` | `lobby.invite.expired` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registration_eligible` | `lobby.race_name.registration_eligible` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registered` | `lobby.race_name.registered` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registration_denied` | `lobby.race_name.registration_denied` | `en/subject.tmpl`, `en/text.tmpl` |
`auth.login_code` does not belong to the notification-owned template set.
## Route Model
One accepted intent materializes:
- one `notification_record`
- zero or more `notification_route` entries
Each route represents exactly one `(channel, recipient_ref)` pair.
Stable route statuses:
- `pending`
- `published`
- `failed`
- `dead_letter`
- `skipped`
Rules:
- `pending` means the route is ready for first publish or retry
- `published` means the route was durably handed off to its downstream channel
- `failed` means the last publish attempt failed and a later retry is scheduled
- `dead_letter` means the route exhausted its retry budget
- `skipped` means the route slot was durably materialized but intentionally not
emitted
Materialization rules:
- every derived `recipient_ref` receives one `push` route slot and one `email`
route slot, except that an empty administrator email list materializes one
synthetic `config:<notification_type>` recipient slot with only a skipped
`email` route
- a route slot whose channel is outside the notification type channel matrix is
materialized as `skipped`
- `recipient_ref` is `user:<user_id>` for user-targeted routes
- `recipient_ref` is `email:<normalized_address>` for configured administrator
email routes
- when an administrator email list is empty, the service materializes one
synthetic recipient slot `config:<notification_type>` with one skipped
`email` route so the configuration gap remains durable and operator-visible
- `route_id` is mandatory and equals `<channel>:<recipient_ref>`
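
The materialization rules above can be sketched as a pure function. This is an illustration under stated assumptions: `route` and `materializeRoutes` are hypothetical names, the channel matrix is passed in as a set of enabled channels, and publishable slots are assumed to start as `pending`.

```go
package main

import "fmt"

// route is a simplified view of one materialized route slot.
type route struct {
	RouteID      string
	Channel      string
	RecipientRef string
	Status       string
}

// materializeRoutes creates one push slot and one email slot per recipient.
// Slots whose channel is outside channelMatrix are created as skipped.
// An administrator audience with no configured addresses yields a single
// synthetic config:<notification_type> slot with one skipped email route.
func materializeRoutes(notificationType string, recipientRefs []string, channelMatrix map[string]bool, adminAudience bool) []route {
	if adminAudience && len(recipientRefs) == 0 {
		ref := "config:" + notificationType
		return []route{{RouteID: "email:" + ref, Channel: "email", RecipientRef: ref, Status: "skipped"}}
	}
	var routes []route
	for _, ref := range recipientRefs {
		for _, channel := range []string{"push", "email"} {
			status := "pending"
			if !channelMatrix[channel] {
				status = "skipped"
			}
			routes = append(routes, route{
				RouteID:      channel + ":" + ref, // <channel>:<recipient_ref>
				Channel:      channel,
				RecipientRef: ref,
				Status:       status,
			})
		}
	}
	return routes
}

func main() {
	emailOnly := map[string]bool{"email": true}
	for _, r := range materializeRoutes("lobby.invite.expired", []string{"user:u1"}, emailOnly, false) {
		fmt.Println(r.RouteID, r.Status)
	}
	for _, r := range materializeRoutes("geo.review_recommended", nil, emailOnly, true) {
		fmt.Println(r.RouteID, r.Status)
	}
}
```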
The service-local aggregate notification status is derived from routes and is
not a separate durable source of truth.
## Persistence Model
Durable storage is split between PostgreSQL (table-shaped business state)
and Redis (streams, runtime coordination). The architectural rules live in
[`ARCHITECTURE.md §Persistence Backends`](../ARCHITECTURE.md#persistence-backends);
the per-service decision record is
[`docs/postgres-migration.md`](docs/postgres-migration.md).
### PostgreSQL durable state
The service owns the `notification` schema. Migrations are embedded in the
binary (`internal/adapters/postgres/migrations`) and applied at startup via
`pkg/postgres.RunMigrations` strictly before any HTTP listener becomes
ready. Every time-valued column is `timestamptz`, normalised to UTC by the
adapter on bind and scan.
| Table | Frozen columns |
| --- | --- |
| `records` | `notification_id`, `notification_type`, `producer`, `audience_kind`, `recipient_user_ids` (jsonb), `payload_json`, `idempotency_key`, `request_fingerprint`, `request_id`, `trace_id`, `occurred_at`, `accepted_at`, `updated_at`, `idempotency_expires_at`; `UNIQUE (producer, idempotency_key)` |
| `routes` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `status`, `attempt_count`, `max_attempts`, `next_attempt_at`, `resolved_email`, `resolved_locale`, `last_error_classification`, `last_error_message`, `last_error_at`, `created_at`, `updated_at`, `published_at`, `dead_lettered_at`, `skipped_at`; PRIMARY KEY `(notification_id, route_id)` |
| `dead_letters` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `final_attempt_count`, `max_attempts`, `failure_classification`, `failure_message`, `recovery_hint`, `created_at`; PRIMARY KEY `(notification_id, route_id)` cascading from `routes` |
| `malformed_intents` | `stream_entry_id`, `notification_type`, `producer`, `idempotency_key`, `failure_code`, `failure_message`, `raw_fields` (jsonb), `recorded_at` |
Storage rules:
- the durable `records` row IS the idempotency reservation; the
`(producer, idempotency_key)` UNIQUE constraint surfaces conflicts as
`acceptintent.ErrConflict`
- `next_attempt_at` is non-NULL only while the route is a scheduling
candidate (`status=pending|failed`); the partial index `routes_due_idx`
drives the publishers' `ListDueRoutes` scan
- `payload_json` stores the canonical normalized JSON string used for
idempotency fingerprinting; `recipient_user_ids` is JSONB and omitted
for `audience_kind=admin_email`
- terminal transitions clear `next_attempt_at` and stamp the appropriate
terminal column (`published_at` / `dead_lettered_at` / `skipped_at`)
- record-level retention deletes cascade to `routes` and `dead_letters`
via `ON DELETE CASCADE`
### Redis runtime-coordination state
| Logical artifact | Redis key |
| --- | --- |
| temporary route lease | `notification:route_leases:<notification_id>:<route_id>` |
| stream offset record | `notification:stream_offsets:<stream>` |
| ingress stream | `notification:intents` |
Storage rules:
- dynamic Redis key segments are base64url-encoded
- temporary route lease keys store one opaque worker token and use
`NOTIFICATION_ROUTE_LEASE_TTL`; they are service-local coordination
state rather than durable records, retained on Redis as a per-replica
exclusivity hint atop the SQL claim
- stream offset records persist plain-XREAD consumer progress for
`notification:intents` and never expire
- the outbound streams `gateway:client-events` and `mail:delivery_commands`
remain Redis Streams owned by Gateway and Mail Service respectively;
Notification Service emits one entry through `XADD` before committing
the route's PostgreSQL state transition
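
To illustrate the key-encoding rule above, the sketch below builds a lease key from encoded segments. `encodeSegment`, `decodeSegment`, and `leaseKey` are hypothetical names, and the unpadded base64url variant (`base64.RawURLEncoding`) is an assumption; the document only says that segments are base64url-encoded.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeSegment encodes one dynamic key segment. Unpadded base64url is an
// assumption of this sketch.
func encodeSegment(s string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(s))
}

// decodeSegment reverses encodeSegment and returns "" on malformed input.
func decodeSegment(s string) string {
	b, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return ""
	}
	return string(b)
}

// leaseKey builds notification:route_leases:<notification_id>:<route_id>
// with both dynamic segments encoded. The base64url alphabet has no ':',
// so colons inside a route_id cannot blur the key structure.
func leaseKey(notificationID, routeID string) string {
	return "notification:route_leases:" + encodeSegment(notificationID) + ":" + encodeSegment(routeID)
}

func main() {
	// The identifiers are made-up example values.
	fmt.Println(leaseKey("1714150000000-0", "push:user:u1"))
	fmt.Println(decodeSegment(encodeSegment("push:user:u1")))
}
```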
### Publisher claim and lease coordination
The `push` and `email` publishers share the same scheduling pattern:
- `routes_due_idx` (the partial index on `next_attempt_at`) replaces the
former `notification:route_schedule` ZSET; the SQL query
`SELECT notification_id, route_id FROM routes WHERE next_attempt_at IS
NOT NULL AND next_attempt_at <= now() ORDER BY next_attempt_at ASC LIMIT
N` returns the next due batch
- `push` publishers filter for `route_id` prefix `push:`; `email`
publishers filter for prefix `email:` so the two workers do not contend
- `push` and `email` replicas coordinate through
`notification:route_leases:<notification_id>:<route_id>` with
`NOTIFICATION_ROUTE_LEASE_TTL`
- only the current lease holder finalises one due publication attempt;
the durable transition is a `Complete*` SQL transaction with optimistic
concurrency on `routes.updated_at` so a stale lease cannot overwrite a
fresher row state
- newly accepted publishable routes enter the partial index immediately
with `status=pending` and `next_attempt_at = accepted_at`
- `failed` routes remain in the partial index for retry
- `published`, `dead_letter`, and `skipped` clear `next_attempt_at` and
drop out of the index
## Retry And Dead-Letter Policy
Retry budgets are channel-specific:
- `push` publication to `Gateway`: `3` attempts total
- `email` publication to `Mail Service`: `7` attempts total
Rules:
- the first publication attempt happens immediately at `accepted_at_ms`
- after failed attempt `N`, the next delay is `clamp(NOTIFICATION_ROUTE_BACKOFF_MIN * 2^(N-1), NOTIFICATION_ROUTE_BACKOFF_MIN, NOTIFICATION_ROUTE_BACKOFF_MAX)`
- no jitter is added to the retry delay
- `push` and `email` routes are retried independently
- the shared schedule is filtered by route prefix so `push` publishers claim
only `push:` routes and `email` publishers claim only `email:` routes
- `push` and `email` replicas coordinate through
`notification:route_leases:<notification_id>:<route_id>` with
`NOTIFICATION_ROUTE_LEASE_TTL`
- `push` publication failures are classified minimally as
`payload_encoding_failed` and `gateway_stream_publish_failed`
- `email` publication failures are classified minimally as
`payload_encoding_failed` and `mail_stream_publish_failed`
- when a route exhausts its retry budget, it transitions to `dead_letter`,
creates `notification_dead_letter_entry`, and leaves the due-route schedule
because its `next_attempt_at` is cleared
- one exhausted route entering `dead_letter` must not roll back or invalidate a
sibling route that already reached `published`
- service restarts resume from durable route state and persisted stream offsets
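
The backoff rule above can be worked through in a short sketch. `retryDelay` is a hypothetical name, and the values in `main` are the documented defaults for `NOTIFICATION_ROUTE_BACKOFF_MIN` (`1s`), `NOTIFICATION_ROUTE_BACKOFF_MAX` (`5m`), and the `email` budget of `7` attempts.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns clamp(min * 2^(N-1), min, max) for failed attempt N.
// The doubling stops as soon as the cap is reached, which also prevents
// integer overflow for large attempt numbers. No jitter is added.
func retryDelay(failedAttempt int, min, max time.Duration) time.Duration {
	d := min
	for i := 1; i < failedAttempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	min, max := time.Second, 5*time.Minute

	// Seven email attempts leave six delays between them.
	var total time.Duration
	for n := 1; n <= 6; n++ {
		d := retryDelay(n, min, max)
		total += d
		fmt.Printf("after failed attempt %d: wait %s\n", n, d)
	}
	fmt.Println("total wait across the email budget:", total)

	// The cap only matters for larger attempt numbers.
	fmt.Println("after failed attempt 10:", retryDelay(10, min, max))
}
```

With the defaults, the delays run `1s`, `2s`, `4s`, `8s`, `16s`, `32s`, so an `email` route that fails every attempt reaches `dead_letter` roughly 63 seconds of backoff after acceptance. The `5m` cap would first apply after a tenth failed attempt, which neither default budget reaches.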
Retention rules:
- `records` and their cascaded `routes` / `dead_letters` use
`NOTIFICATION_RECORD_RETENTION` (deleted by the periodic SQL retention
worker after the configured window; cascade clears dependent rows)
- the per-record idempotency window (`records.idempotency_expires_at`)
uses `NOTIFICATION_IDEMPOTENCY_TTL`
- `malformed_intents` use `NOTIFICATION_MALFORMED_INTENT_RETENTION`
(independent retention pass)
- the retention worker runs once per `NOTIFICATION_CLEANUP_INTERVAL`
- stream offset records do not expire
## Observability
The service instruments:
- internal probe HTTP requests
- internal probe HTTP listener startup and shutdown events
- structured logs for accepted, duplicate, and rejected notification intents
- structured logs for `push` and `email` route publication, retry, and
dead-letter transitions
- accepted and duplicate intent outcomes
- malformed intents, including idempotency conflicts and unresolved recipients
- user-enrichment lookup outcomes
- route publish attempts, retries, and dead-letter transitions
- current route-schedule depth and oldest scheduled route age
- oldest unprocessed intent stream entry age
Metric names:
- `notification.intent.outcomes`
- `notification.intent.malformed`
- `notification.user_enrichment.attempts`
- `notification.route.publish_attempts`
- `notification.route.retries`
- `notification.route.dead_letters`
- `notification.route_schedule.depth`
- `notification.route_schedule.oldest_age_ms`
- `notification.intent_stream.oldest_unprocessed_age_ms`
Metrics intentionally avoid high-cardinality attributes such as `user_id`,
email address, `notification_id`, `route_id`, and `idempotency_key`.
Metric attributes may include `notification_type`, `producer`,
`audience_kind`, `channel`, `result`, `outcome`, `failure_code`, and
`failure_classification`.
Structured logs for intent intake, duplicate resolution, malformed-intent
recording, route publication, retry scheduling, and dead-letter transitions use
the same field names where the value exists:
- `notification_id`
- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `route_id`
- `channel`
- `request_id`
- `trace_id`
OpenTelemetry trace context is logged as `otel_trace_id` and `otel_span_id`
when the active context carries a valid span.
## Recovery
The supported manual replay path for a dead-lettered notification route is to
publish a new compatible intent to `notification:intents`.
Recovery rules:
- inspect the `notification_dead_letter_entry`, `notification_route`, and
owning `notification_record`
- confirm the downstream dependency or payload problem has been corrected
- publish a new intent with the same semantic `payload_json` and audience
fields, but with a new producer-owned `idempotency_key`
- keep the old `notification_dead_letter_entry` untouched as audit history
until `NOTIFICATION_RECORD_RETENTION` removes it
Manual mutation of an existing route record or its schedule state is not a
supported replay workflow.
## Verification
Focused service-local coverage verifies:
- configuration loading and validation
- `GET /healthz`
- `GET /readyz`
- absence of `/metrics`
- Redis startup fast-fail behavior
- graceful shutdown of the private probe listener
- valid intent acceptance
- malformed intent rejection
- duplicate and conflicting duplicate handling
- user-targeted route enrichment from `User Service`
- `recipient_not_found` malformed-intent recording for unresolved `user_id`
- temporary `User Service` failure handling without stream-offset advance
- FlatBuffers payload encoding for all ten user-facing `push`
`notification_type` values
- template-mode `Mail Service` command encoding for user and administrator
`email` routes
- due-route loading, lease acquisition, route publication, retry reschedule,
and dead-letter persistence
- `push` worker success, retry, and duplicate-prevention behavior across
concurrent replicas
- `email` worker success, retry, and duplicate-prevention behavior across
concurrent replicas
- OpenTelemetry metric recording for intent outcomes, malformed intents, user
enrichment, route publication attempts, retries, dead letters, route-schedule
gauges, and intent-stream lag
- Redis-backed route-schedule and intent-stream lag snapshots
- structured log field helper coverage through intake and publisher tests
- intent-consumer restart from `0-0` and from persisted stream offsets
- runtime wiring of the intent consumer and both route publishers
- shared `galaxy/notificationintent` producer constructors, validation, and
Redis Stream publication compatibility
Cross-service coverage verifies:
- `Notification Service -> User Service` enrichment compatibility and failure
handling
- `Notification Service -> Gateway` push compatibility for every user-facing
`notification_type`
- `Notification Service -> Mail Service` template-mode handoff for every
supported email type
- producer compatibility for `Game Master`, `Game Lobby`, and
`Geo Profile Service` through `galaxy/notificationintent`
- explicit regression that auth-code email still bypasses `Notification Service`
- real black-box `Notification Service -> Gateway` push fan-out coverage
- real black-box `Notification Service -> Mail Service` template-mode handoff
coverage
Real producer-boundary suites for `Game Master`, `Game Lobby`, and
`Geo Profile Service` should be added only when those service boundaries exist
in code.