# Notification Service

Canonical references:

- [Service-local docs](docs/README.md)
- [Intent AsyncAPI contract](api/intents-asyncapi.yaml)
- [Probe OpenAPI contract](openapi.yaml)
- [Gateway push model](../gateway/README.md)
- [Mail async command contract](../mail/api/delivery-commands-asyncapi.yaml)
- [Notification FlatBuffers payloads](../pkg/schema/fbs/notification.fbs)
- [System architecture](../ARCHITECTURE.md)

## Purpose

`Notification Service` is the internal asynchronous orchestration layer for
platform notifications.

It accepts normalized notification intents from upstream services, materializes
per-recipient routes, enriches user-targeted routes through `User Service`,
publishes client-facing push events toward `Gateway`, publishes non-auth email
commands toward `Mail Service`, and isolates transient downstream failures with
independent retry budgets per channel.

The service is intentionally not a source of truth for:

- game state
- lobby membership
- invite ownership
- review flags
- notification preferences
- email delivery attempts

## Responsibility Boundaries

`Notification Service` is responsible for:

- consuming normalized notification intents from a dedicated Redis Stream
- validating intent envelopes and rejecting malformed or conflicting duplicates
- persisting durable notification and route state
- resolving user contact data from `User Service` by `user_id`
- selecting locale from `User Service.preferred_language` with `en` fallback
- shaping lightweight push payloads for user-facing events
- publishing template-mode email commands to `Mail Service`
- retrying route publication independently for `push` and `email`
- persisting dead-letter entries for exhausted routes

`Notification Service` is not responsible for:

- computing business audiences from `game_id` or other domain identifiers
- owning administrator identity or administrator user records
- sending auth-code email
- storing per-user notification preferences in v1
- exposing an operator REST API in v1

The key design rule is that upstream producers must publish the concrete
`recipient_user_id` values for user-targeted notification intents. For
administrator-only notification types, recipient email addresses are resolved
from `Notification Service` configuration by `notification_type`. Private-game
invite notifications in v1 remain user-bound by internal `user_id` values and
must not target recipients by raw email address.

## Runtime Surface

The implemented process contains:

- one private internal HTTP probe listener
- process-wide structured logging
- process-wide OpenTelemetry runtime
- one shared `galaxy/notificationintent` producer contract module
- one shared Redis client with startup connectivity check
- one trusted `User Service` HTTP enrichment client
- one plain-`XREAD` notification-intent consumer
- one long-lived `push` route publisher
- one long-lived `email` route publisher
- durable accepted-intent, route, idempotency, malformed-intent, and
  stream-offset storage in Redis
- user-targeted route enrichment during intent acceptance before durable write
- client-facing `push` publication toward `Gateway`
- template-mode `email` publication toward `Mail Service`
- durable `push` and `email` retry, dead-letter, and temporary lease
  coordination in Redis
- OpenTelemetry counters and observable gauges for intent intake, user
  enrichment, route publication, route schedule depth, and intent stream lag
- graceful shutdown on process cancellation

Probe contract:

- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}`
- `readyz` is process-local after successful startup and does not perform a
  live Redis ping per request
- there is no `/metrics` route

Runtime behavior:

- the intent consumer reads `notification:intents` with plain `XREAD`
- when no stored stream offset exists, the consumer starts from `0-0`
- the persisted offset advances only after durable acceptance or durable
  malformed-intent recording
- user-targeted routes are enriched through `GET /api/v1/internal/users/{user_id}`
  before durable route write
- `404 subject_not_found` from `User Service` is recorded under
  malformed-intent storage with `failure_code=recipient_not_found`
- temporary `User Service` lookup failures stop the consumer before
  stream-offset advance
- due `push` routes are published toward `Gateway` from the shared
  `notification:route_schedule`
- due `email` routes are published toward `Mail Service` from the shared
  `notification:route_schedule`
- the `push` publisher claims only routes whose `route_id` starts with `push:`
- the `email` publisher claims only routes whose `route_id` starts with `email:`
- replicas coordinate through temporary Redis lease
  `notification:route_leases:<notification_id>:<route_id>`
- `Gateway` publication uses `XADD MAXLEN ~` with
  `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`
- `event_id` equals `<notification_id>/<route_id>`
- `Mail Service` publication uses plain `XADD` with no stream trimming
- `delivery_id` equals `<notification_id>/<route_id>`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
  to every active stream of that user
- Go producers use `galaxy/notificationintent` to construct and publish
  compatible intents into `notification:intents`
- producer publication uses plain `XADD` without stream trimming or hidden
  helper retries
- a producer-side notification publication failure is notification degradation
  and must not roll back the source business state
- metric export uses the configured OpenTelemetry exporters only
- there is still no `/metrics` route
- `notification.route_schedule.depth` and
  `notification.route_schedule.oldest_age_ms` are derived from
  `notification:route_schedule`
- `notification.intent_stream.oldest_unprocessed_age_ms` is derived from the
  persisted intent stream offset and the configured ingress stream
- manual dead-letter replay is performed by publishing a new compatible intent
  with a new `idempotency_key`; existing dead-letter records remain audit
  history until TTL expiry

The target process shape is one internal-only process with:

- one notification-intent consumer
- one `push` route publisher for `Gateway`
- one `email` route publisher for `Mail Service`

Intentional runtime omissions in v1:

- no public ingress
- no dedicated operator REST API
- no direct client delivery
- no direct SMTP integration

## Configuration

Required:

- `NOTIFICATION_REDIS_MASTER_ADDR`
- `NOTIFICATION_REDIS_PASSWORD`
- `NOTIFICATION_POSTGRES_PRIMARY_DSN`
- `NOTIFICATION_USER_SERVICE_BASE_URL`

Primary configuration groups:

- process and logging:
  - `NOTIFICATION_SHUTDOWN_TIMEOUT`
  - `NOTIFICATION_LOG_LEVEL`
- internal probe HTTP:
  - `NOTIFICATION_INTERNAL_HTTP_ADDR` with default `:8092`
  - `NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT` with default `2s`
  - `NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT` with default `10s`
  - `NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT` with default `1m`
- Redis connectivity (master/replica/password shape; the deprecated
  `NOTIFICATION_REDIS_ADDR`, `NOTIFICATION_REDIS_USERNAME`, and
  `NOTIFICATION_REDIS_TLS_ENABLED` env vars are rejected at startup):
  - `NOTIFICATION_REDIS_REPLICA_ADDRS` (optional, comma-separated)
  - `NOTIFICATION_REDIS_DB`
  - `NOTIFICATION_REDIS_OPERATION_TIMEOUT`
- PostgreSQL connectivity:
  - `NOTIFICATION_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
  - `NOTIFICATION_POSTGRES_OPERATION_TIMEOUT`
  - `NOTIFICATION_POSTGRES_MAX_OPEN_CONNS`
  - `NOTIFICATION_POSTGRES_MAX_IDLE_CONNS`
  - `NOTIFICATION_POSTGRES_CONN_MAX_LIFETIME`
- stream names:
  - `NOTIFICATION_INTENTS_STREAM` with default `notification:intents`
  - `NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT` with default `2s`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM` with default `gateway:client-events`
  - `NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN` with default `1024`
  - `NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM` with default `mail:delivery_commands`
- retry and dead-letter:
  - `NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS` with default `3`
  - `NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS` with default `7`
  - `NOTIFICATION_ROUTE_BACKOFF_MIN` with default `1s`
  - `NOTIFICATION_ROUTE_BACKOFF_MAX` with default `5m`
  - `NOTIFICATION_ROUTE_LEASE_TTL` with default `5s`
  - `NOTIFICATION_IDEMPOTENCY_TTL` with default `168h`
- retention (periodic SQL retention worker; replaces the previous
  `NOTIFICATION_DEAD_LETTER_TTL` and `NOTIFICATION_RECORD_TTL` Redis-EXPIRE
  knobs):
  - `NOTIFICATION_RECORD_RETENTION` with default `720h`
  - `NOTIFICATION_MALFORMED_INTENT_RETENTION` with default `2160h`
  - `NOTIFICATION_CLEANUP_INTERVAL` with default `1h`
- `User Service` enrichment:
  - `NOTIFICATION_USER_SERVICE_TIMEOUT` with default `1s`
- administrator routing:
  - `NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED`
  - `NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START`
  - `NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED`
  - `NOTIFICATION_ADMIN_EMAILS_RUNTIME_IMAGE_PULL_FAILED`
  - `NOTIFICATION_ADMIN_EMAILS_RUNTIME_CONTAINER_START_FAILED`
  - `NOTIFICATION_ADMIN_EMAILS_RUNTIME_START_CONFIG_INVALID`
- OpenTelemetry:
  - standard `OTEL_*` variables
  - `NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED`
  - `NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED`

Each administrator configuration variable stores a comma-separated list of
email addresses for exactly one `notification_type`. v1 does not use one global
admin-recipient list shared across all administrative events.

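
As an illustration of the per-type administrator lists described above, the following Go sketch derives the variable name for a `notification_type` and parses its comma-separated value. The function names are hypothetical, the name derivation (uppercase, `.` replaced by `_`) is inferred from the variables listed in this section, and lowercasing plus de-duplication are assumed normalization steps rather than documented behavior.

```go
package main

import (
	"os"
	"strings"
)

// adminEmailsEnvVar derives the per-type variable name, for example
// "geo.review_recommended" -> "NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED".
// The rule is inferred from the variable names in the Configuration section.
func adminEmailsEnvVar(notificationType string) string {
	suffix := strings.ToUpper(strings.ReplaceAll(notificationType, ".", "_"))
	return "NOTIFICATION_ADMIN_EMAILS_" + suffix
}

// parseAdminEmails splits a comma-separated value, trims whitespace,
// lowercases each address (an assumed normalization), and drops empty
// entries and duplicates while keeping first-seen order.
func parseAdminEmails(raw string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, part := range strings.Split(raw, ",") {
		addr := strings.ToLower(strings.TrimSpace(part))
		if addr == "" {
			continue
		}
		if _, dup := seen[addr]; dup {
			continue
		}
		seen[addr] = struct{}{}
		out = append(out, addr)
	}
	return out
}

// adminRecipients resolves the recipients for one notification_type from
// the process environment. An unset variable yields an empty list.
func adminRecipients(notificationType string) []string {
	return parseAdminEmails(os.Getenv(adminEmailsEnvVar(notificationType)))
}
```

An empty result here corresponds to the configuration gap that the route model records as a skipped `config:<notification_type>` slot.
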
## Stable Input Contract

The service accepts intents from one dedicated Redis Stream:

- `notification:intents`

The canonical envelope is defined in
[api/intents-asyncapi.yaml](api/intents-asyncapi.yaml).
Go producers should use the shared `galaxy/notificationintent` module to build
and append compatible stream entries instead of duplicating field names,
payload structs, or validation rules locally.

Required envelope fields:

- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `occurred_at_ms`
- `payload_json`

Optional envelope fields:

- `recipient_user_ids_json`
- `request_id`
- `trace_id`

Rules:

- `audience_kind=user` requires `recipient_user_ids_json` with one or more
  unique stable `user_id` values
- `audience_kind=admin_email` forbids `recipient_user_ids_json`
- `recipient_user_ids_json` is normalized as an unordered recipient set, so
  duplicate `user_id` values are invalid and element order does not affect
  idempotency
- `request_id` and `trace_id` are observability-only metadata and do not
  participate in the idempotency fingerprint
- `payload_json` is type-specific, must remain backward-compatible for each
  `notification_type`, and is normalized structurally for duplicate detection:
  insignificant whitespace and object key order are ignored while array order
  remains significant
- a replay with the same `(producer, idempotency_key)` and the same normalized
  payload is treated as a successful duplicate
- a replay with the same `(producer, idempotency_key)` but different normalized
  content is recorded as a conflicting duplicate under malformed-intent storage
  with `failure_code=idempotency_conflict` and must not create new routes
- during user enrichment, a missing `user_id` in `User Service` is recorded
  under malformed-intent storage with `failure_code=recipient_not_found`

Malformed stream entries do not create durable notification records. They are
logged, metered, and recorded separately for operator inspection.
Accepted intents use the original Redis Stream `stream_entry_id` as
`notification_id`.

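
The normalization rules above can be sketched in Go as follows. This is a minimal illustration, not the service's fingerprint algorithm: the function names, the SHA-256 digest, and the exact set of fields hashed are assumptions. It relies on `encoding/json` writing map keys in sorted order and preserving array order, which matches the stated rule that key order and whitespace are ignored while array order stays significant.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"io"
	"sort"
	"strings"
)

// canonicalJSON re-encodes one JSON document so that insignificant
// whitespace and object key order no longer matter. Array order is kept.
func canonicalJSON(raw string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // keep numeric literals exactly as written
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	if _, err := dec.Token(); err != io.EOF {
		return "", errors.New("trailing data after JSON value")
	}
	out, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// fingerprint hashes the fields assumed to take part in duplicate
// detection. request_id and trace_id are deliberately left out, and the
// recipient list is sorted so that it behaves as an unordered set.
func fingerprint(notificationType, audienceKind string, recipientUserIDs []string, payloadJSON string) (string, error) {
	payload, err := canonicalJSON(payloadJSON)
	if err != nil {
		return "", err
	}
	ids := append([]string(nil), recipientUserIDs...)
	sort.Strings(ids)
	for i := 1; i < len(ids); i++ {
		if ids[i] == ids[i-1] {
			return "", errors.New("duplicate user_id in recipient set")
		}
	}
	doc, err := json.Marshal(struct {
		NotificationType string          `json:"notification_type"`
		AudienceKind     string          `json:"audience_kind"`
		RecipientUserIDs []string        `json:"recipient_user_ids"`
		Payload          json.RawMessage `json:"payload"`
	}{notificationType, audienceKind, ids, json.RawMessage(payload)})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(doc)
	return hex.EncodeToString(sum[:]), nil
}
```

Two replays that differ only in recipient order, key order, or whitespace produce the same digest under this sketch, while a reordered array inside `payload_json` produces a different one.
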
## Notification Catalog

`payload_json` fields are normalized by the producer before publication.

| `notification_type` | Producer | Audience | Channels | Required `payload_json` fields |
| --- | --- | --- | --- | --- |
| `geo.review_recommended` | `Geo Profile Service` (`geoprofile`) | configured admin email list (`audience_kind=admin_email`) | `email` | `user_id`, `user_email`, `observed_country`, `usual_connection_country`, `review_reason` |
| `game.turn.ready` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `turn_number` |
| `game.finished` | `Game Master` (`game_master`) | active accepted participants (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `final_turn_number` |
| `game.generation_failed` | `Game Master` (`game_master`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name`, `failure_reason` |
| `lobby.runtime_paused_after_start` | `Game Lobby` (`game_lobby`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `game_name` |
| `lobby.application.submitted` | `Game Lobby` (`game_lobby`) | private owner (`audience_kind=user`) or public admins (`audience_kind=admin_email`) | private: `push+email`, public: `email` | `game_id`, `game_name`, `applicant_user_id`, `applicant_name` |
| `lobby.membership.approved` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.membership.rejected` | `Game Lobby` (`game_lobby`) | applicant user (`audience_kind=user`) | `push+email` | `game_id`, `game_name` |
| `lobby.membership.blocked` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `membership_user_id`, `membership_user_name`, `reason` |
| `lobby.invite.created` | `Game Lobby` (`game_lobby`) | invited user (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `inviter_user_id`, `inviter_name` |
| `lobby.invite.redeemed` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
| `lobby.invite.expired` | `Game Lobby` (`game_lobby`) | private-game owner (`audience_kind=user`) | `email` | `game_id`, `game_name`, `invitee_user_id`, `invitee_name` |
| `lobby.race_name.registration_eligible` | `Game Lobby` (`game_lobby`) | capable member (`audience_kind=user`) | `push+email` | `game_id`, `game_name`, `race_name`, `eligible_until_ms` |
| `lobby.race_name.registered` | `Game Lobby` (`game_lobby`) | registering user (`audience_kind=user`) | `push+email` | `race_name` |
| `lobby.race_name.registration_denied` | `Game Lobby` (`game_lobby`) | incapable member (`audience_kind=user`) | `email` | `game_id`, `game_name`, `race_name`, `reason` |
| `runtime.image_pull_failed` | `Runtime Manager` (`runtime_manager`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `image_ref`, `error_code`, `error_message`, `attempted_at_ms` |
| `runtime.container_start_failed` | `Runtime Manager` (`runtime_manager`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `image_ref`, `error_code`, `error_message`, `attempted_at_ms` |
| `runtime.start_config_invalid` | `Runtime Manager` (`runtime_manager`) | configured admin email list (`audience_kind=admin_email`) | `email` | `game_id`, `image_ref`, `error_code`, `error_message`, `attempted_at_ms` |

Rules:

- v1 supports exactly the eighteen `notification_type` values listed above
- the three `game.*` types (`game.turn.ready`, `game.finished`, and
  `game.generation_failed`) are produced exclusively by `Game Master`
- `lobby.application.submitted` keeps one stable `notification_type` and one
  stable `payload_json` shape; private games publish `audience_kind=user`
  while public games publish `audience_kind=admin_email`
- `lobby.invite.revoked` deliberately produces no notification in v1 and
  remains outside the supported catalog
- private-game invite notifications remain user-bound by internal `user_id`
- `lobby.race_name.registration_eligible` and
  `lobby.race_name.registration_denied` are emitted by `Game Lobby` at
  `game_finished` based on capability evaluation; the former always pairs
  with a 30-day `eligible_until_ms` window
- `lobby.race_name.registered` is emitted on successful
  `lobby.race_name.register` commit
- the three `runtime.*` types are emitted by `Runtime Manager` only on
  first-touch start failures (image pull, container create/start, start
  configuration validation); they are administrator-only in v1 and have no
  push counterpart. `Runtime Manager` does not publish notifications for
  ongoing health changes; those flow through `runtime:health_events` and
  are escalated by `Game Master` if needed.

## Recipient Enrichment And Locale Policy

For `audience_kind=user`, `Notification Service` resolves user records through
the trusted `User Service` lookup endpoint:

- `GET /api/v1/internal/users/{user_id}`

The response supplies:

- `email`
- `preferred_language`

Locale rules:

- current implemented support is exactly one locale: `en`
- exact `preferred_language` is used when supported by `Mail Service`
- unsupported, empty, or invalid language values fall back to `en`
- no intermediate locale reduction is used in v1
- the same resolved locale drives both `push` payload localization decisions
  and `Mail Service` template selection
- enrichment runs during intent acceptance before durable route write
- `404 subject_not_found` from `User Service` is treated as permanent producer
  input error and becomes malformed-intent `recipient_not_found`
- temporary `User Service` failures stop the consumer before stream-offset
  advance so the same stream entry is retried after restart

For `audience_kind=admin_email`, `Notification Service` does not consult
`User Service` and instead resolves recipients from type-specific config.

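
The locale rules above amount to an exact-match lookup with a fixed fallback. A minimal Go sketch, assuming a hypothetical `resolveLocale` helper that receives the supported set as a parameter (in v1 that set would contain only `en`):

```go
package main

// fallbackLocale is the locale used when preferred_language is
// unsupported, empty, or invalid.
const fallbackLocale = "en"

// resolveLocale returns preferredLanguage only on an exact match against
// the supported set. There is no intermediate reduction, so "de-AT" is
// not reduced to "de"; it falls back to "en" unless listed itself.
func resolveLocale(preferredLanguage string, supported map[string]bool) string {
	if supported[preferredLanguage] {
		return preferredLanguage
	}
	return fallbackLocale
}
```

With the v1 set `map[string]bool{"en": true}`, every input resolves to `en`; the parameter only shows how an added locale would behave without changing the rule.
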
## Push Contract Toward Gateway

Push events are published into the existing `Gateway` client-events stream.

Stable routing rules:

- `event_type` equals `notification_type`
- `event_id` equals `<notification_id>/<route_id>`
- `user_id` is derived from `recipient_ref=user:<user_id>` for user-targeted
  routes
- `request_id` and `trace_id` are forwarded when present
- `device_session_id` is intentionally omitted so `Gateway` fans the event out
  to every active stream of that user

`Notification Service` appends `Gateway` events with `XADD MAXLEN ~` using
`NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN`.

User-facing push payloads use
[pkg/schema/fbs/notification.fbs](../pkg/schema/fbs/notification.fbs).

| `notification_type` | FlatBuffers table | Payload fields |
| --- | --- | --- |
| `game.turn.ready` | `notification.GameTurnReadyEvent` | `game_id`, `turn_number` |
| `game.finished` | `notification.GameFinishedEvent` | `game_id`, `final_turn_number` |
| `lobby.application.submitted` | `notification.LobbyApplicationSubmittedEvent` | `game_id`, `applicant_user_id` |
| `lobby.membership.approved` | `notification.LobbyMembershipApprovedEvent` | `game_id` |
| `lobby.membership.rejected` | `notification.LobbyMembershipRejectedEvent` | `game_id` |
| `lobby.membership.blocked` | `notification.LobbyMembershipBlockedEvent` | `game_id`, `membership_user_id`, `reason` |
| `lobby.invite.created` | `notification.LobbyInviteCreatedEvent` | `game_id`, `inviter_user_id` |
| `lobby.invite.redeemed` | `notification.LobbyInviteRedeemedEvent` | `game_id`, `invitee_user_id` |
| `lobby.race_name.registration_eligible` | `notification.LobbyRaceNameRegistrationEligibleEvent` | `game_id`, `race_name`, `eligible_until_ms` |
| `lobby.race_name.registered` | `notification.LobbyRaceNameRegisteredEvent` | `race_name` |

Only the ten user-facing push notification types above are represented in
`notification.fbs`.
`geo.review_recommended`, `game.generation_failed`,
`lobby.runtime_paused_after_start`, `lobby.invite.expired`,
`lobby.race_name.registration_denied`, and the three `runtime.*` types remain
outside this schema because they are email-only in v1.

Checked-in generated Go bindings for this schema live under
[`../pkg/schema/fbs/notification`](../pkg/schema/fbs/notification).

`notification_type` alone determines the concrete FlatBuffers table.
No extra envelope or FlatBuffers `union` is added in v1.

The push payload must stay lightweight and must not attempt to mirror full game,
lobby, or profile state.
`game_name`, human-readable user names, and other full business-state fields
stay out of the push schema.
Clients react to the notification and then fetch fresh business state through
normal service APIs.

## Email Contract Toward Mail Service

Email routes are published to `Mail Service` through
`mail:delivery_commands` using the existing generic async command contract.

Rules:

- `delivery_id` equals `<notification_id>/<route_id>`
- `source` is always `notification`
- `payload_mode` is always `template`
- `idempotency_key` equals `notification:<notification_id>/<route_id>`
- `requested_at_ms` equals `accepted_at_ms`
- `request_id` and `trace_id` are forwarded when present
- `payload_json.to` contains exactly one resolved recipient email
- `payload_json.cc`, `payload_json.bcc`, `payload_json.reply_to`, and
  `payload_json.attachments` are empty arrays in v1
- `template_id` equals `notification_type`
- `locale` is the resolved language from the enrichment step or `en`
- template variables are passed through from normalized `payload_json`

`Notification Service` appends `Mail Service` commands with plain `XADD` and
does not manage retention or trimming of `mail:delivery_commands`.

Auth-code email remains a direct `Auth / Session Service -> Mail Service` flow
and does not pass through `Notification Service`.

Initial notification-owned template assets:

| `notification_type` | `template_id` | Required assets |
| --- | --- | --- |
| `geo.review_recommended` | `geo.review_recommended` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.turn.ready` | `game.turn.ready` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.finished` | `game.finished` | `en/subject.tmpl`, `en/text.tmpl` |
| `game.generation_failed` | `game.generation_failed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.runtime_paused_after_start` | `lobby.runtime_paused_after_start` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.application.submitted` | `lobby.application.submitted` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.approved` | `lobby.membership.approved` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.rejected` | `lobby.membership.rejected` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.membership.blocked` | `lobby.membership.blocked` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.created` | `lobby.invite.created` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.redeemed` | `lobby.invite.redeemed` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.invite.expired` | `lobby.invite.expired` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registration_eligible` | `lobby.race_name.registration_eligible` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registered` | `lobby.race_name.registered` | `en/subject.tmpl`, `en/text.tmpl` |
| `lobby.race_name.registration_denied` | `lobby.race_name.registration_denied` | `en/subject.tmpl`, `en/text.tmpl` |
| `runtime.image_pull_failed` | `runtime.image_pull_failed` | `en/subject.tmpl`, `en/text.tmpl` |
| `runtime.container_start_failed` | `runtime.container_start_failed` | `en/subject.tmpl`, `en/text.tmpl` |
| `runtime.start_config_invalid` | `runtime.start_config_invalid` | `en/subject.tmpl`, `en/text.tmpl` |

`auth.login_code` does not belong to the notification-owned template set.

## Route Model

One accepted intent materializes:

- one `notification_record`
- zero or more `notification_route` entries

Each route represents exactly one `(channel, recipient_ref)` pair.

Stable route statuses:

- `pending`
- `published`
- `failed`
- `dead_letter`
- `skipped`

Rules:

- `pending` means the route is ready for first publish or retry
- `published` means the route was durably handed off to its downstream channel
- `failed` means the last publish attempt failed and a later retry is scheduled
- `dead_letter` means the route exhausted its retry budget
- `skipped` means the route slot was durably materialized but intentionally not
  emitted

Materialization rules:

- every derived `recipient_ref` receives one `push` route slot and one `email`
  route slot, except that an empty administrator email list materializes one
  synthetic `config:<notification_type>` recipient slot with only a skipped
  `email` route
- a route slot whose channel is outside the notification type channel matrix is
  materialized as `skipped`
- `recipient_ref` is `user:<user_id>` for user-targeted routes
- `recipient_ref` is `email:<normalized_address>` for configured administrator
  email routes
- when an administrator email list is empty, the service materializes one
  synthetic recipient slot `config:<notification_type>` with one skipped
  `email` route so the configuration gap remains durable and operator-visible
- `route_id` is mandatory and equals `<channel>:<recipient_ref>`

The service-local aggregate notification status is derived from routes and is
not a separate durable source of truth.

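
The materialization rules above can be sketched in Go as a pure function. The `routeSlot` type and `materializeRoutes` function are hypothetical names, and the `channels` argument stands in for the notification type's row in the catalog channel matrix.

```go
package main

// routeSlot is one materialized (channel, recipient_ref) pair.
type routeSlot struct {
	RouteID      string // "<channel>:<recipient_ref>"
	Channel      string
	RecipientRef string
	Status       string // "pending" or "skipped" at materialization time
}

// materializeRoutes creates one push slot and one email slot for every
// recipient_ref. A slot whose channel is not enabled for the notification
// type is materialized as skipped. An empty recipient list is only valid
// for an administrator audience with an empty configured email list; it
// yields one synthetic config:<notification_type> slot with a skipped
// email route.
func materializeRoutes(notificationType string, recipientRefs []string, channels map[string]bool) []routeSlot {
	if len(recipientRefs) == 0 {
		ref := "config:" + notificationType
		return []routeSlot{{
			RouteID:      "email:" + ref,
			Channel:      "email",
			RecipientRef: ref,
			Status:       "skipped",
		}}
	}
	var slots []routeSlot
	for _, ref := range recipientRefs {
		for _, channel := range []string{"push", "email"} {
			status := "skipped"
			if channels[channel] {
				status = "pending"
			}
			slots = append(slots, routeSlot{
				RouteID:      channel + ":" + ref,
				Channel:      channel,
				RecipientRef: ref,
				Status:       status,
			})
		}
	}
	return slots
}
```

For an email-only type such as `lobby.invite.expired`, a recipient `user:u1` would produce `push:user:u1` as `skipped` and `email:user:u1` as `pending`.
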
## Persistence Model

Durable storage is split between PostgreSQL (table-shaped business state)
and Redis (streams, runtime coordination). The architectural rules live in
[`ARCHITECTURE.md §Persistence Backends`](../ARCHITECTURE.md#persistence-backends);
the per-service decision record is
[`docs/postgres-migration.md`](docs/postgres-migration.md).

### PostgreSQL durable state

The service owns the `notification` schema. Migrations are embedded in the
binary (`internal/adapters/postgres/migrations`) and applied at startup via
`pkg/postgres.RunMigrations` strictly before any HTTP listener becomes
ready. Every time-valued column is `timestamptz`, normalised to UTC by the
adapter on bind and scan.

| Table | Frozen columns |
| --- | --- |
| `records` | `notification_id`, `notification_type`, `producer`, `audience_kind`, `recipient_user_ids` (jsonb), `payload_json`, `idempotency_key`, `request_fingerprint`, `request_id`, `trace_id`, `occurred_at`, `accepted_at`, `updated_at`, `idempotency_expires_at`; `UNIQUE (producer, idempotency_key)` |
| `routes` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `status`, `attempt_count`, `max_attempts`, `next_attempt_at`, `resolved_email`, `resolved_locale`, `last_error_classification`, `last_error_message`, `last_error_at`, `created_at`, `updated_at`, `published_at`, `dead_lettered_at`, `skipped_at`; PRIMARY KEY `(notification_id, route_id)` |
| `dead_letters` | `notification_id`, `route_id`, `channel`, `recipient_ref`, `final_attempt_count`, `max_attempts`, `failure_classification`, `failure_message`, `recovery_hint`, `created_at`; PRIMARY KEY `(notification_id, route_id)` cascading from `routes` |
| `malformed_intents` | `stream_entry_id`, `notification_type`, `producer`, `idempotency_key`, `failure_code`, `failure_message`, `raw_fields` (jsonb), `recorded_at` |

Storage rules:

- the durable `records` row IS the idempotency reservation; the
  `(producer, idempotency_key)` UNIQUE constraint surfaces conflicts as
  `acceptintent.ErrConflict`
- `next_attempt_at` is non-NULL only while the route is a scheduling
  candidate (`status=pending|failed`); the partial index `routes_due_idx`
  drives the publishers' `ListDueRoutes` scan
- `payload_json` stores the canonical normalized JSON string used for
  idempotency fingerprinting; `recipient_user_ids` is JSONB and omitted
  for `audience_kind=admin_email`
- terminal transitions clear `next_attempt_at` and stamp the appropriate
  terminal column (`published_at` / `dead_lettered_at` / `skipped_at`)
- record-level retention deletes cascade to `routes` and `dead_letters`
  via `ON DELETE CASCADE`

### Redis runtime-coordination state

| Logical artifact | Redis key |
| --- | --- |
| temporary route lease | `notification:route_leases:<notification_id>:<route_id>` |
| stream offset record | `notification:stream_offsets:<stream>` |
| ingress stream | `notification:intents` |

Storage rules:

- dynamic Redis key segments are base64url-encoded
- temporary route lease keys store one opaque worker token and use
  `NOTIFICATION_ROUTE_LEASE_TTL`; they are service-local coordination
  state rather than durable records, retained on Redis as a per-replica
  exclusivity hint atop the SQL claim
- stream offset records persist plain-XREAD consumer progress for
  `notification:intents` and never expire
- the outbound streams `gateway:client-events` and `mail:delivery_commands`
  remain Redis Streams owned by Gateway and Mail Service respectively;
  Notification Service emits one entry through `XADD` before committing
  the route's PostgreSQL state transition

### Publisher claim and lease coordination

`Push` and `Email` publishers share the same scheduling pattern:

- `routes_due_idx` (the partial index on `next_attempt_at`) replaces the
  former `notification:route_schedule` ZSET; the SQL query
  `SELECT notification_id, route_id FROM routes WHERE next_attempt_at IS
  NOT NULL AND next_attempt_at <= now() ORDER BY next_attempt_at ASC LIMIT
  N` returns the next due batch
- `push` publishers filter for `route_id` prefix `push:`; `email`
  publishers filter for prefix `email:` so the two workers do not contend
- `push` and `email` replicas coordinate through
  `notification:route_leases:<notification_id>:<route_id>` with
  `NOTIFICATION_ROUTE_LEASE_TTL`
- only the current lease holder finalises one due publication attempt;
  the durable transition is a `Complete*` SQL transaction with optimistic
  concurrency on `routes.updated_at` so a stale lease cannot overwrite a
  fresher row state
- newly accepted publishable routes enter the partial index immediately
  with `status=pending` and `next_attempt_at = accepted_at`
- `failed` routes remain in the partial index for retry
- `published`, `dead_letter`, and `skipped` clear `next_attempt_at` and
  drop out of the index

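
The due-batch query and the optimistic `updated_at` check can be tried out in isolation. The sketch below is only a model under stated assumptions: it uses an in-memory SQLite database in place of PostgreSQL, stores timestamps as integer milliseconds, applies the channel prefix filter with `LIKE` inside the query (the text does not say where the filter runs), and omits the Redis lease step. The column set and the function names `open_db`, `due_batch`, and `complete_published` are hypothetical.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create a stand-in routes table with a partial index on next_attempt_at."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE routes ("
        " notification_id TEXT NOT NULL,"
        " route_id TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " next_attempt_at INTEGER,"
        " updated_at INTEGER NOT NULL,"
        " PRIMARY KEY (notification_id, route_id))"
    )
    db.execute(
        "CREATE INDEX routes_due_idx ON routes (next_attempt_at)"
        " WHERE next_attempt_at IS NOT NULL"
    )
    return db


def due_batch(db, channel: str, now_ms: int, limit: int):
    """Return due (notification_id, route_id, updated_at) rows for one channel."""
    rows = db.execute(
        "SELECT notification_id, route_id, updated_at FROM routes"
        " WHERE next_attempt_at IS NOT NULL AND next_attempt_at <= ?"
        " AND route_id LIKE ?"
        " ORDER BY next_attempt_at ASC LIMIT ?",
        (now_ms, channel + ":%", limit),
    )
    return rows.fetchall()


def complete_published(db, notification_id, route_id, seen_updated_at, now_ms) -> bool:
    """Mark a route published only if the row is unchanged since it was read."""
    cur = db.execute(
        "UPDATE routes SET status = 'published', next_attempt_at = NULL,"
        " updated_at = ?"
        " WHERE notification_id = ? AND route_id = ? AND updated_at = ?",
        (now_ms, notification_id, route_id, seen_updated_at),
    )
    db.commit()
    # Zero affected rows means another worker already moved the row on.
    return cur.rowcount == 1
```

A worker that read the row at `updated_at = 100` and completes it succeeds once; a second worker still holding the stale value of `100` gets `False` back and leaves the row alone, which is the behaviour the optimistic-concurrency bullet describes.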
## Retry And Dead-Letter Policy

Retry budgets are channel-specific:

- `push` publication to `Gateway`: `3` attempts total
- `email` publication to `Mail Service`: `7` attempts total

Rules:

- the first publication attempt happens immediately at `accepted_at_ms`
- after failed attempt `N`, the next delay is `clamp(NOTIFICATION_ROUTE_BACKOFF_MIN * 2^(N-1), NOTIFICATION_ROUTE_BACKOFF_MIN, NOTIFICATION_ROUTE_BACKOFF_MAX)`
- no jitter is added to the retry delay
- `push` and `email` routes are retried independently
- the shared schedule is filtered by route prefix so `push` publishers claim
  only `push:` routes and `email` publishers claim only `email:` routes
- `push` and `email` replicas coordinate through
  `notification:route_leases:<notification_id>:<route_id>` with
  `NOTIFICATION_ROUTE_LEASE_TTL`
- `push` publication failures are classified minimally as
  `payload_encoding_failed` and `gateway_stream_publish_failed`
- `email` publication failures are classified minimally as
  `payload_encoding_failed` and `mail_stream_publish_failed`
- when a route exhausts its retry budget, it transitions to `dead_letter`, a
  `notification_dead_letter_entry` is created, and `next_attempt_at` is
  cleared so the route drops out of `routes_due_idx` (the replacement for
  the former `notification:route_schedule` ZSET)
- one exhausted route entering `dead_letter` must not roll back or invalidate a
  sibling route that already reached `published`
- service restarts resume from durable route state and persisted stream offsets

Retention rules:

- `records` and their cascaded `routes` / `dead_letters` use
  `NOTIFICATION_RECORD_RETENTION` (deleted by the periodic SQL retention
  worker after the configured window; cascade clears dependent rows)
- the per-record idempotency window (`records.idempotency_expires_at`)
  uses `NOTIFICATION_IDEMPOTENCY_TTL`
- `malformed_intents` use `NOTIFICATION_MALFORMED_INTENT_RETENTION`
  (independent retention pass)
- the retention worker runs once per `NOTIFICATION_CLEANUP_INTERVAL`
- stream offset records do not expire

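
The backoff formula and the per-channel attempt budgets from the retry rules above can be worked through with a short sketch. The function names are hypothetical, and the backoff bounds used in the example (`1000` ms and `10000` ms) are illustrative values only; the real `NOTIFICATION_ROUTE_BACKOFF_MIN` and `NOTIFICATION_ROUTE_BACKOFF_MAX` settings are not stated here.

```python
# Attempt budgets per channel, as listed in the retry policy.
ATTEMPT_BUDGET = {"push": 3, "email": 7}


def retry_delay_ms(failed_attempt: int, backoff_min_ms: int, backoff_max_ms: int) -> int:
    """clamp(min * 2^(N-1), min, max) with no jitter."""
    if failed_attempt < 1:
        raise ValueError("failed_attempt is 1-based")
    raw = backoff_min_ms * 2 ** (failed_attempt - 1)
    return max(backoff_min_ms, min(raw, backoff_max_ms))


def next_step(channel: str, failed_attempt: int, backoff_min_ms: int, backoff_max_ms: int):
    """Decide what follows failed attempt N: another retry or dead_letter."""
    if failed_attempt >= ATTEMPT_BUDGET[channel]:
        return ("dead_letter", None)
    return ("retry", retry_delay_ms(failed_attempt, backoff_min_ms, backoff_max_ms))


# Illustrative bounds only: 1 s minimum, 10 s maximum.
delays = [retry_delay_ms(n, 1000, 10000) for n in range(1, 7)]
# delays == [1000, 2000, 4000, 8000, 10000, 10000]
```

With these example bounds an `email` route waits 1 s, 2 s, 4 s, 8 s, 10 s, and 10 s between its seven attempts, while a `push` route is dead-lettered as soon as its third attempt fails.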
## Observability

The service instruments:

- internal probe HTTP requests
- internal probe HTTP listener startup and shutdown events
- structured logs for accepted, duplicate, and rejected notification intents
- structured logs for `push` and `email` route publication, retry, and
  dead-letter transitions
- accepted and duplicate intent outcomes
- malformed intents, including idempotency conflicts and unresolved recipients
- user-enrichment lookup outcomes
- route publish attempts, retries, and dead-letter transitions
- current route-schedule depth and oldest scheduled route age
- oldest unprocessed intent stream entry age

Metric names:

- `notification.intent.outcomes`
- `notification.intent.malformed`
- `notification.user_enrichment.attempts`
- `notification.route.publish_attempts`
- `notification.route.retries`
- `notification.route.dead_letters`
- `notification.route_schedule.depth`
- `notification.route_schedule.oldest_age_ms`
- `notification.intent_stream.oldest_unprocessed_age_ms`

Metrics intentionally avoid high-cardinality attributes such as `user_id`,
email address, `notification_id`, `route_id`, and `idempotency_key`.

Metric attributes may include `notification_type`, `producer`,
`audience_kind`, `channel`, `result`, `outcome`, `failure_code`, and
`failure_classification`.

Structured logs for intent intake, duplicate resolution, malformed-intent
recording, route publication, retry scheduling, and dead-letter transitions use
the same field names where the value exists:

- `notification_id`
- `notification_type`
- `producer`
- `audience_kind`
- `idempotency_key`
- `route_id`
- `channel`
- `request_id`
- `trace_id`

OpenTelemetry trace context is logged as `otel_trace_id` and `otel_span_id`
when the active context carries a valid span.

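
The split between low-cardinality metric attributes and richer log fields described above can be expressed as two small helpers. This is a hedged sketch: the helper names are hypothetical, and treating `None` and empty strings as "value does not exist" is an assumption about how the service interprets that phrase.

```python
# Shared structured-log field names, in the order listed above.
LOG_FIELDS = (
    "notification_id",
    "notification_type",
    "producer",
    "audience_kind",
    "idempotency_key",
    "route_id",
    "channel",
    "request_id",
    "trace_id",
)

# Attribute names that metrics may carry; identifiers such as user_id,
# notification_id, route_id, and idempotency_key are deliberately absent.
METRIC_ATTRIBUTES = frozenset(
    {
        "notification_type",
        "producer",
        "audience_kind",
        "channel",
        "result",
        "outcome",
        "failure_code",
        "failure_classification",
    }
)


def log_fields(**values):
    """Keep only known log fields whose value is present (not None or empty)."""
    return {name: values[name] for name in LOG_FIELDS if values.get(name)}


def metric_attributes(**values):
    """Reject any attribute outside the low-cardinality allow-list."""
    unknown = sorted(set(values) - METRIC_ATTRIBUTES)
    if unknown:
        raise ValueError("high-cardinality or unknown attributes: " + ", ".join(unknown))
    return dict(values)
```

Routing every metric call through an allow-list like this makes an accidental `user_id` or `route_id` attribute fail fast in tests instead of inflating metric cardinality in production.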
## Recovery

The supported manual replay path for a dead-lettered notification route is to
publish a new compatible intent to `notification:intents`.

Recovery rules:

- inspect the `notification_dead_letter_entry`, `notification_route`, and
  owning `notification_record`
- confirm the downstream dependency or payload problem has been corrected
- publish a new intent with the same semantic `payload_json` and audience
  fields, but with a new producer-owned `idempotency_key`
- keep the old `notification_dead_letter_entry` untouched as audit history
  until its configured retention window expires

Manual mutation of an existing route record or of its schedule state
(`next_attempt_at`, formerly the `notification:route_schedule` ZSET) is not a
supported replay workflow.

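
The replay rule above (same semantic payload and audience, new `idempotency_key`) can be sketched as a pure function over an intent represented as a dictionary. The dictionary shape, the `replay-` key prefix, and the function name are assumptions made for illustration; the authoritative envelope is the Intent AsyncAPI contract, and actual publication should go through the producer's normal `notification:intents` path.

```python
import copy
import uuid


def build_replay_intent(original: dict, new_key: str = "") -> dict:
    """Copy a dead-lettered intent and give it a fresh idempotency key.

    The original dict is left untouched so it can stay as audit history.
    """
    replay = copy.deepcopy(original)
    # Hypothetical key format; producers own their real key scheme.
    replay["idempotency_key"] = new_key or "replay-" + uuid.uuid4().hex
    if replay["idempotency_key"] == original.get("idempotency_key"):
        raise ValueError("replay must use a new idempotency_key")
    return replay


# Example with illustrative field values.
dead_lettered = {
    "notification_type": "example_type",
    "producer": "example_producer",
    "audience_kind": "user",
    "payload_json": '{"example": true}',
    "idempotency_key": "original-key",
}
replay = build_replay_intent(dead_lettered)
```

Reusing the old key would be treated as a duplicate of the dead-lettered intent rather than as a new attempt, which is why the helper refuses it.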
## Verification

Focused service-local coverage verifies:

- configuration loading and validation
- `GET /healthz`
- `GET /readyz`
- absence of `/metrics`
- Redis startup fast-fail behavior
- graceful shutdown of the private probe listener
- valid intent acceptance
- malformed intent rejection
- duplicate and conflicting duplicate handling
- user-targeted route enrichment from `User Service`
- `recipient_not_found` malformed-intent recording for unresolved `user_id`
- temporary `User Service` failure handling without stream-offset advance
- FlatBuffers payload encoding for all seven user-facing `push`
  `notification_type` values
- template-mode `Mail Service` command encoding for user and administrator
  `email` routes
- due-route loading, lease acquisition, route publication, retry reschedule,
  and dead-letter persistence in Redis
- `push` worker success, retry, and duplicate-prevention behavior across
  concurrent replicas
- `email` worker success, retry, and duplicate-prevention behavior across
  concurrent replicas
- OpenTelemetry metric recording for intent outcomes, malformed intents, user
  enrichment, route publication attempts, retries, dead letters, route-schedule
  gauges, and intent-stream lag
- Redis-backed route-schedule and intent-stream lag snapshots
- structured log field helper coverage through intake and publisher tests
- intent-consumer restart from `0-0` and from persisted stream offsets
- runtime wiring of the intent consumer and both route publishers
- shared `galaxy/notificationintent` producer constructors, validation, and
  Redis Stream publication compatibility

Cross-service coverage verifies:

- `Notification Service -> User Service` enrichment compatibility and failure
  handling
- `Notification Service -> Gateway` push compatibility for every user-facing
  `notification_type`
- `Notification Service -> Mail Service` template-mode handoff for every
  supported email type
- producer compatibility for `Game Master`, `Game Lobby`, and
  `Geo Profile Service` through `galaxy/notificationintent`
- explicit regression that auth-code email still bypasses `Notification Service`
- real black-box `Notification Service -> Gateway` push fan-out coverage
- real black-box `Notification Service -> Mail Service` template-mode handoff
  coverage

Real producer-boundary suites for `Game Master`, `Game Lobby`, and
`Geo Profile Service` should be added only when those service boundaries exist
in code.