Notification Service
Canonical references:
- Service-local docs
- Intent AsyncAPI contract
- Probe OpenAPI contract
- Gateway push model
- Mail async command contract
- Notification FlatBuffers payloads
- System architecture
Purpose
Notification Service is the internal asynchronous orchestration layer for
platform notifications.
It accepts normalized notification intents from upstream services, materializes
per-recipient routes, enriches user-targeted routes through User Service,
publishes client-facing push events toward Gateway, publishes non-auth email
commands toward Mail Service, and isolates transient downstream failures with
independent retry budgets per channel.
The service is intentionally not a source of truth for:
- game state
- lobby membership
- invite ownership
- review flags
- notification preferences
- email delivery attempts
Responsibility Boundaries
Notification Service is responsible for:
- consuming normalized notification intents from a dedicated Redis Stream
- validating intent envelopes and rejecting malformed or conflicting duplicates
- persisting durable notification and route state
- resolving user contact data from User Service by user_id
- selecting locale from User Service.preferred_language with en fallback
- shaping lightweight push payloads for user-facing events
- publishing template-mode email commands to Mail Service
- retrying route publication independently for push and email
- persisting dead-letter entries for exhausted routes
Notification Service is not responsible for:
- computing business audiences from game_id or other domain identifiers
- owning administrator identity or administrator user records
- sending auth-code email
- storing per-user notification preferences in v1
- exposing an operator REST API in v1
The key design rule is that upstream producers must publish the concrete
recipient_user_id values for user-targeted notification intents. For
administrator-only notification types, recipient email addresses are resolved
from Notification Service configuration by notification_type. Private-game
invite notifications in v1 remain user-bound by internal user_id values and
must not target recipients by raw email address.
Runtime Surface
The implemented process contains:
- one private internal HTTP probe listener
- process-wide structured logging
- process-wide OpenTelemetry runtime
- one shared galaxy/notificationintent producer contract module
- one shared Redis client with startup connectivity check
- one trusted User Service HTTP enrichment client
- one plain-XREAD notification-intent consumer
- one long-lived push route publisher
- one long-lived email route publisher
- durable accepted-intent, route, idempotency, malformed-intent, and stream-offset storage in Redis
- user-targeted route enrichment during intent acceptance before durable write
- client-facing push publication toward Gateway
- template-mode email publication toward Mail Service
- durable push and email retry, dead-letter, and temporary lease coordination in Redis
- OpenTelemetry counters and observable gauges for intent intake, user enrichment, route publication, route schedule depth, and intent stream lag
- graceful shutdown on process cancellation
Probe contract:
- GET /healthz returns {"status":"ok"}
- GET /readyz returns {"status":"ready"}
- readyz is process-local after successful startup and does not perform a live Redis ping per request
- there is no /metrics route
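The probe contract above can be sketched with the standard library. This is a minimal illustration, not the service's actual listener: newProbeMux, staticJSON, and probe are hypothetical names, and the sketch does not model the "ready only after successful startup" gating.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newProbeMux registers only the two probe routes; any other path,
// including /metrics, falls through to the mux's default 404.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", staticJSON(`{"status":"ok"}`))
	mux.HandleFunc("/readyz", staticJSON(`{"status":"ready"}`))
	return mux
}

// staticJSON answers GET with a fixed body and rejects other methods.
func staticJSON(body string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, body)
	}
}

// probe issues an in-memory GET against h and returns status and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newProbeMux()
	for _, p := range []string{"/healthz", "/readyz", "/metrics"} {
		code, body := probe(mux, p)
		fmt.Println(p, code, body)
	}
}
```

Because readyz answers from process-local state, a handler of this shape never touches Redis on the request path.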
Runtime behavior:
- the intent consumer reads notification:intents with plain XREAD
- when no stored stream offset exists, the consumer starts from 0-0
- the persisted offset advances only after durable acceptance or durable malformed-intent recording
- user-targeted routes are enriched through GET /api/v1/internal/users/{user_id} before durable route write
- 404 subject_not_found from User Service is recorded under malformed-intent storage with failure_code=recipient_not_found
- temporary User Service lookup failures stop the consumer before stream-offset advance
- due push routes are published toward Gateway from the shared notification:route_schedule
- due email routes are published toward Mail Service from the shared notification:route_schedule
- the push publisher claims only routes whose route_id starts with push:
- the email publisher claims only routes whose route_id starts with email:
- replicas coordinate through temporary Redis lease notification:route_leases:<notification_id>:<route_id>
- Gateway publication uses XADD MAXLEN ~ with NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN
- event_id equals <notification_id>/<route_id>
- Mail Service publication uses plain XADD with no stream trimming
- delivery_id equals <notification_id>/<route_id>
- idempotency_key equals notification:<notification_id>/<route_id>
- requested_at_ms equals accepted_at_ms
- request_id and trace_id are forwarded when present
- device_session_id is intentionally omitted so Gateway fans the event out to every active stream of that user
- Go producers use galaxy/notificationintent to construct and publish compatible intents into notification:intents
- producer publication uses plain XADD without stream trimming or hidden helper retries
- a producer-side notification publication failure is notification degradation and must not roll back the source business state
- metric export uses the configured OpenTelemetry exporters only
- there is still no /metrics route
- notification.route_schedule.depth and notification.route_schedule.oldest_age_ms are derived from notification:route_schedule
- notification.intent_stream.oldest_unprocessed_age_ms is derived from the persisted intent stream offset and the configured ingress stream
- manual dead-letter replay is performed by publishing a new compatible intent with a new idempotency_key; existing dead-letter records remain audit history until TTL expiry
The target process shape is one internal-only process with:
- one notification-intent consumer
- one push route publisher for Gateway
- one email route publisher for Mail Service
Intentional runtime omissions in v1:
- no public ingress
- no dedicated operator REST API
- no direct client delivery
- no direct SMTP integration
Configuration
Required:
- NOTIFICATION_REDIS_MASTER_ADDR
- NOTIFICATION_REDIS_PASSWORD
- NOTIFICATION_POSTGRES_PRIMARY_DSN
- NOTIFICATION_USER_SERVICE_BASE_URL
Primary configuration groups:
- process and logging:
  - NOTIFICATION_SHUTDOWN_TIMEOUT
  - NOTIFICATION_LOG_LEVEL
- internal probe HTTP:
  - NOTIFICATION_INTERNAL_HTTP_ADDR with default :8092
  - NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT with default 2s
  - NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT with default 10s
  - NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT with default 1m
- Redis connectivity (master/replica/password shape; the deprecated NOTIFICATION_REDIS_ADDR, NOTIFICATION_REDIS_USERNAME, and NOTIFICATION_REDIS_TLS_ENABLED env vars are rejected at startup):
  - NOTIFICATION_REDIS_REPLICA_ADDRS (optional, comma-separated)
  - NOTIFICATION_REDIS_DB
  - NOTIFICATION_REDIS_OPERATION_TIMEOUT
- PostgreSQL connectivity:
  - NOTIFICATION_POSTGRES_REPLICA_DSNS (optional, comma-separated)
  - NOTIFICATION_POSTGRES_OPERATION_TIMEOUT
  - NOTIFICATION_POSTGRES_MAX_OPEN_CONNS
  - NOTIFICATION_POSTGRES_MAX_IDLE_CONNS
  - NOTIFICATION_POSTGRES_CONN_MAX_LIFETIME
- stream names:
  - NOTIFICATION_INTENTS_STREAM with default notification:intents
  - NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT with default 2s
  - NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM with default gateway:client-events
  - NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN with default 1024
  - NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM with default mail:delivery_commands
- retry and dead-letter:
  - NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS with default 3
  - NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS with default 7
  - NOTIFICATION_ROUTE_BACKOFF_MIN with default 1s
  - NOTIFICATION_ROUTE_BACKOFF_MAX with default 5m
  - NOTIFICATION_ROUTE_LEASE_TTL with default 5s
  - NOTIFICATION_IDEMPOTENCY_TTL with default 168h
- retention (periodic SQL retention worker; replaces the previous NOTIFICATION_DEAD_LETTER_TTL and NOTIFICATION_RECORD_TTL Redis-EXPIRE knobs):
  - NOTIFICATION_RECORD_RETENTION with default 720h
  - NOTIFICATION_MALFORMED_INTENT_RETENTION with default 2160h
  - NOTIFICATION_CLEANUP_INTERVAL with default 1h
- User Service enrichment:
  - NOTIFICATION_USER_SERVICE_TIMEOUT with default 1s
- administrator routing:
  - NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED
  - NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED
  - NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START
  - NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED
  - NOTIFICATION_ADMIN_EMAILS_RUNTIME_IMAGE_PULL_FAILED
  - NOTIFICATION_ADMIN_EMAILS_RUNTIME_CONTAINER_START_FAILED
  - NOTIFICATION_ADMIN_EMAILS_RUNTIME_START_CONFIG_INVALID
- OpenTelemetry:
  - standard OTEL_* variables
  - NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED
  - NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED
Each administrator configuration variable stores a comma-separated list of
email addresses for exactly one notification_type. v1 does not use one global
admin-recipient list shared across all administrative events.
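As an illustration of the per-type administrator lists, the sketch below derives a variable name from a notification_type and parses its comma-separated value. The name mapping is inferred from the variables listed above; the helper names and the trim, lowercase, and de-duplicate normalization are assumptions, not the service's documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// adminEmailsEnvName maps a notification_type to its variable name by
// upper-casing it and replacing dots with underscores (inferred pattern).
func adminEmailsEnvName(notificationType string) string {
	return "NOTIFICATION_ADMIN_EMAILS_" +
		strings.ToUpper(strings.ReplaceAll(notificationType, ".", "_"))
}

// parseAdminEmails splits a comma-separated list, trims whitespace,
// lowercases each address (assumed normalization), and drops empty
// entries and duplicates while keeping first-seen order.
func parseAdminEmails(raw string) []string {
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(raw, ",") {
		addr := strings.ToLower(strings.TrimSpace(part))
		if addr == "" || seen[addr] {
			continue
		}
		seen[addr] = true
		out = append(out, addr)
	}
	return out
}

func main() {
	fmt.Println(adminEmailsEnvName("geo.review_recommended"))
	fmt.Println(parseAdminEmails(" Ops@Example.com, ops@example.com ,,qa@example.com"))
}
```

An empty or unset variable yields an empty list under this sketch.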
Stable Input Contract
The service accepts intents from one dedicated Redis Stream:
notification:intents
The canonical envelope is defined in
api/intents-asyncapi.yaml.
Go producers should use the shared galaxy/notificationintent module to build
and append compatible stream entries instead of duplicating field names,
payload structs, or validation rules locally.
Required envelope fields:
- notification_type
- producer
- audience_kind
- idempotency_key
- occurred_at_ms
- payload_json
Optional envelope fields:
- recipient_user_ids_json
- request_id
- trace_id
Rules:
- audience_kind=user requires recipient_user_ids_json with one or more unique stable user_id values
- audience_kind=admin_email forbids recipient_user_ids_json
- recipient_user_ids_json is normalized as an unordered recipient set, so duplicate user_id values are invalid and element order does not affect idempotency
- request_id and trace_id are observability-only metadata and do not participate in the idempotency fingerprint
- payload_json is type-specific, must remain backward-compatible for each notification_type, and is normalized structurally for duplicate detection: insignificant whitespace and object key order are ignored while array order remains significant
- a replay with the same (producer, idempotency_key) and the same normalized payload is treated as a successful duplicate
- a replay with the same (producer, idempotency_key) but different normalized content is recorded as a conflicting duplicate under malformed-intent storage with failure_code=idempotency_conflict and must not create new routes
- during user enrichment, a missing user_id in User Service is recorded under malformed-intent storage with failure_code=recipient_not_found
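The normalization rules above can be sketched as follows. This is one possible fingerprint, not the service's actual algorithm: the choice of SHA-256, the set of hashed fields, the separator, and the function names are all assumptions. It relies on encoding/json writing map keys in sorted order while preserving array order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// canonicalJSON re-encodes a JSON document so that insignificant
// whitespace and object key order no longer matter. Array elements
// keep their original order.
func canonicalJSON(raw string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // keep numeric literals as written
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// fingerprint hashes the fields assumed to take part in duplicate
// detection. request_id and trace_id are deliberately not inputs.
func fingerprint(notificationType, audienceKind string, recipients []string, payloadJSON string) (string, error) {
	payload, err := canonicalJSON(payloadJSON)
	if err != nil {
		return "", err
	}
	ids := append([]string(nil), recipients...)
	sort.Strings(ids) // recipients form an unordered set
	sum := sha256.Sum256([]byte(strings.Join([]string{
		notificationType, audienceKind, strings.Join(ids, ","), payload,
	}, "\n")))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := fingerprint("game.turn.ready", "user", []string{"u1", "u2"}, `{"game_id":"g1","turn_number":3}`)
	b, _ := fingerprint("game.turn.ready", "user", []string{"u2", "u1"}, `{ "turn_number": 3, "game_id": "g1" }`)
	fmt.Println(a == b)
}
```

Two replays that differ only in whitespace, key order, or recipient order produce the same fingerprint, while a reordered array inside payload_json does not.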
Malformed stream entries do not create durable notification records. They are
logged, metered, and recorded separately for operator inspection.
Accepted intents use the original Redis Stream stream_entry_id as
notification_id.
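The audience rules of the envelope can be sketched as a small validator. The function and error names are hypothetical, and rejecting an unknown audience_kind is an assumption; the real validation lives in the shared galaxy/notificationintent module and the AsyncAPI contract.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRecipientsRequired  = errors.New("audience_kind=user requires recipient_user_ids_json")
	errRecipientsForbidden = errors.New("audience_kind=admin_email forbids recipient_user_ids_json")
	errDuplicateRecipient  = errors.New("duplicate user_id in recipient_user_ids_json")
	errUnknownAudience     = errors.New("unknown audience_kind")
)

// validateAudience checks the recipient list against audience_kind.
func validateAudience(audienceKind string, recipientUserIDs []string) error {
	switch audienceKind {
	case "user":
		if len(recipientUserIDs) == 0 {
			return errRecipientsRequired
		}
		seen := make(map[string]struct{}, len(recipientUserIDs))
		for _, id := range recipientUserIDs {
			if _, dup := seen[id]; dup {
				return errDuplicateRecipient
			}
			seen[id] = struct{}{}
		}
		return nil
	case "admin_email":
		if len(recipientUserIDs) != 0 {
			return errRecipientsForbidden
		}
		return nil
	default:
		return errUnknownAudience
	}
}

func main() {
	fmt.Println(validateAudience("user", []string{"u1", "u2"}))
	fmt.Println(validateAudience("user", []string{"u1", "u1"}))
	fmt.Println(validateAudience("admin_email", []string{"u1"}))
}
```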
Notification Catalog
payload_json fields are normalized by the producer before publication.
| notification_type | Producer | Audience | Channels | Required payload_json fields |
|---|---|---|---|---|
| geo.review_recommended | Geo Profile Service (geoprofile) | configured admin email list (audience_kind=admin_email) | email | user_id, user_email, observed_country, usual_connection_country, review_reason |
| game.turn.ready | Game Master (game_master) | active accepted participants (audience_kind=user) | push+email | game_id, game_name, turn_number |
| game.finished | Game Master (game_master) | active accepted participants (audience_kind=user) | push+email | game_id, game_name, final_turn_number |
| game.generation_failed | Game Master (game_master) | configured admin email list (audience_kind=admin_email) | email | game_id, game_name, failure_reason |
| lobby.runtime_paused_after_start | Game Lobby (game_lobby) | configured admin email list (audience_kind=admin_email) | email | game_id, game_name |
| lobby.application.submitted | Game Lobby (game_lobby) | private owner (audience_kind=user) or public admins (audience_kind=admin_email) | private: push+email, public: email | game_id, game_name, applicant_user_id, applicant_name |
| lobby.membership.approved | Game Lobby (game_lobby) | applicant user (audience_kind=user) | push+email | game_id, game_name |
| lobby.membership.rejected | Game Lobby (game_lobby) | applicant user (audience_kind=user) | push+email | game_id, game_name |
| lobby.membership.blocked | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | push+email | game_id, game_name, membership_user_id, membership_user_name, reason |
| lobby.invite.created | Game Lobby (game_lobby) | invited user (audience_kind=user) | push+email | game_id, game_name, inviter_user_id, inviter_name |
| lobby.invite.redeemed | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | push+email | game_id, game_name, invitee_user_id, invitee_name |
| lobby.invite.expired | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | email | game_id, game_name, invitee_user_id, invitee_name |
| lobby.race_name.registration_eligible | Game Lobby (game_lobby) | capable member (audience_kind=user) | push+email | game_id, game_name, race_name, eligible_until_ms |
| lobby.race_name.registered | Game Lobby (game_lobby) | registering user (audience_kind=user) | push+email | race_name |
| lobby.race_name.registration_denied | Game Lobby (game_lobby) | incapable member (audience_kind=user) | email | game_id, game_name, race_name, reason |
| runtime.image_pull_failed | Runtime Manager (runtime_manager) | configured admin email list (audience_kind=admin_email) | email | game_id, image_ref, error_code, error_message, attempted_at_ms |
| runtime.container_start_failed | Runtime Manager (runtime_manager) | configured admin email list (audience_kind=admin_email) | email | game_id, image_ref, error_code, error_message, attempted_at_ms |
| runtime.start_config_invalid | Runtime Manager (runtime_manager) | configured admin email list (audience_kind=admin_email) | email | game_id, image_ref, error_code, error_message, attempted_at_ms |
Rules:
- v1 supports exactly the eighteen notification_type values listed above
- lobby.application.submitted keeps one stable notification_type and one stable payload_json shape; private games publish audience_kind=user while public games publish audience_kind=admin_email
- lobby.invite.revoked deliberately produces no notification in v1 and remains outside the supported catalog
- private-game invite notifications remain user-bound by internal user_id
- lobby.race_name.registration_eligible and lobby.race_name.registration_denied are emitted by Game Lobby at game_finished based on capability evaluation; the former always pairs with a 30-day eligible_until_ms window
- lobby.race_name.registered is emitted on successful lobby.race_name.register commit
- the three runtime.* types are emitted by Runtime Manager only on first-touch start failures (image pull, container create/start, start configuration validation); they are administrator-only in v1 and have no push counterpart. Runtime Manager does not publish notifications for ongoing health changes; those flow through runtime:health_events and are escalated by Game Master if needed.
Recipient Enrichment And Locale Policy
For audience_kind=user, Notification Service resolves user records through
the trusted User Service lookup endpoint:
GET /api/v1/internal/users/{user_id}
The response supplies:
- email
- preferred_language
Locale rules:
- current implemented support is exactly one locale: en
- exact preferred_language is used when supported by Mail Service
- unsupported, empty, or invalid language values fall back to en
- no intermediate locale reduction is used in v1
- the same resolved locale drives both push payload localization decisions and Mail Service template selection
- enrichment runs during intent acceptance before durable route write
- 404 subject_not_found from User Service is treated as permanent producer input error and becomes malformed-intent recipient_not_found
- temporary User Service failures stop the consumer before stream-offset advance so the same stream entry is retried after restart
For audience_kind=admin_email, Notification Service does not consult
User Service and instead resolves recipients from type-specific config.
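The locale rules for user-targeted routes reduce to an exact-match lookup with an en fallback. The sketch below is illustrative only; resolveLocale and its supported-set parameter are hypothetical, and the two-locale set in main exists only to show that no reduction such as de-AT to de takes place.

```go
package main

import "fmt"

const fallbackLocale = "en"

// resolveLocale returns preferred only when it exactly matches a
// supported locale; anything else, including empty or regional
// variants, falls back to "en".
func resolveLocale(preferred string, supported map[string]bool) string {
	if supported[preferred] {
		return preferred
	}
	return fallbackLocale
}

func main() {
	v1 := map[string]bool{"en": true} // the only locale implemented in v1
	fmt.Println(resolveLocale("en", v1), resolveLocale("fr", v1), resolveLocale("", v1))

	// Hypothetical future set, used only to show exact matching.
	future := map[string]bool{"en": true, "de": true}
	fmt.Println(resolveLocale("de", future), resolveLocale("de-AT", future))
}
```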
Push Contract Toward Gateway
Push events are published into the existing Gateway client-events stream.
Stable routing rules:
- event_type equals notification_type
- event_id equals <notification_id>/<route_id>
- user_id is derived from recipient_ref=user:<user_id> for user-targeted routes
- request_id and trace_id are forwarded when present
- device_session_id is intentionally omitted so Gateway fans the event out to every active stream of that user
Notification Service appends Gateway events with XADD MAXLEN ~ using
NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN.
User-facing push payloads use pkg/schema/fbs/notification.fbs.
| notification_type | FlatBuffers table | Payload fields |
|---|---|---|
| game.turn.ready | notification.GameTurnReadyEvent | game_id, turn_number |
| game.finished | notification.GameFinishedEvent | game_id, final_turn_number |
| lobby.application.submitted | notification.LobbyApplicationSubmittedEvent | game_id, applicant_user_id |
| lobby.membership.approved | notification.LobbyMembershipApprovedEvent | game_id |
| lobby.membership.rejected | notification.LobbyMembershipRejectedEvent | game_id |
| lobby.membership.blocked | notification.LobbyMembershipBlockedEvent | game_id, membership_user_id, reason |
| lobby.invite.created | notification.LobbyInviteCreatedEvent | game_id, inviter_user_id |
| lobby.invite.redeemed | notification.LobbyInviteRedeemedEvent | game_id, invitee_user_id |
| lobby.race_name.registration_eligible | notification.LobbyRaceNameRegistrationEligibleEvent | game_id, race_name, eligible_until_ms |
| lobby.race_name.registered | notification.LobbyRaceNameRegisteredEvent | race_name |
Only the ten user-facing push notification types above are represented in
notification.fbs.
geo.review_recommended, game.generation_failed,
lobby.runtime_paused_after_start, lobby.invite.expired,
lobby.race_name.registration_denied, and the three runtime.* types remain
outside this schema because they are email-only in v1.
Checked-in generated Go bindings for this schema live under
../pkg/schema/fbs/notification.
notification_type alone determines the concrete FlatBuffers table.
No extra envelope or FlatBuffers union is added in v1.
The push payload must stay lightweight and must not attempt to mirror full game,
lobby, or profile state.
game_name, human-readable user names, and other full business-state fields
stay out of the push schema.
Clients react to the notification and then fetch fresh business state through
normal service APIs.
Email Contract Toward Mail Service
Email routes are published to Mail Service through
mail:delivery_commands using the existing generic async command contract.
Rules:
- delivery_id equals <notification_id>/<route_id>
- source is always notification
- payload_mode is always template
- idempotency_key equals notification:<notification_id>/<route_id>
- requested_at_ms equals accepted_at_ms
- request_id and trace_id are forwarded when present
- payload_json.to contains exactly one resolved recipient email
- payload_json.cc, payload_json.bcc, payload_json.reply_to, and payload_json.attachments are empty arrays in v1
- template_id equals notification_type
- locale is the resolved language from the enrichment step or en
- template variables are passed through from normalized payload_json
Notification Service appends Mail Service commands with plain XADD and
does not manage retention or trimming of mail:delivery_commands.
Auth-code email remains a direct Auth / Session Service -> Mail Service flow
and does not pass through Notification Service.
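The identifier rules for both channels are plain string compositions of notification_id and route_id. A minimal sketch, with hypothetical helper names and example values:

```go
package main

import "fmt"

// pushEventID is the Gateway event_id: <notification_id>/<route_id>.
func pushEventID(notificationID, routeID string) string {
	return notificationID + "/" + routeID
}

// mailDeliveryID is the Mail Service delivery_id: <notification_id>/<route_id>.
func mailDeliveryID(notificationID, routeID string) string {
	return notificationID + "/" + routeID
}

// mailIdempotencyKey is notification:<notification_id>/<route_id>.
func mailIdempotencyKey(notificationID, routeID string) string {
	return "notification:" + notificationID + "/" + routeID
}

func main() {
	// notification_id is the Redis Stream entry id of the accepted intent;
	// the values below are made up for illustration.
	nid := "1700000000000-0"
	fmt.Println(pushEventID(nid, "push:user:42"))
	fmt.Println(mailDeliveryID(nid, "email:user:42"))
	fmt.Println(mailIdempotencyKey(nid, "email:user:42"))
}
```

Because every identifier is derived from durable route state, a retried publication reuses the same values, which lets downstream consumers de-duplicate.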
Initial notification-owned template assets:
| notification_type | template_id | Required assets |
|---|---|---|
| geo.review_recommended | geo.review_recommended | en/subject.tmpl, en/text.tmpl |
| game.turn.ready | game.turn.ready | en/subject.tmpl, en/text.tmpl |
| game.finished | game.finished | en/subject.tmpl, en/text.tmpl |
| game.generation_failed | game.generation_failed | en/subject.tmpl, en/text.tmpl |
| lobby.runtime_paused_after_start | lobby.runtime_paused_after_start | en/subject.tmpl, en/text.tmpl |
| lobby.application.submitted | lobby.application.submitted | en/subject.tmpl, en/text.tmpl |
| lobby.membership.approved | lobby.membership.approved | en/subject.tmpl, en/text.tmpl |
| lobby.membership.rejected | lobby.membership.rejected | en/subject.tmpl, en/text.tmpl |
| lobby.membership.blocked | lobby.membership.blocked | en/subject.tmpl, en/text.tmpl |
| lobby.invite.created | lobby.invite.created | en/subject.tmpl, en/text.tmpl |
| lobby.invite.redeemed | lobby.invite.redeemed | en/subject.tmpl, en/text.tmpl |
| lobby.invite.expired | lobby.invite.expired | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registration_eligible | lobby.race_name.registration_eligible | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registered | lobby.race_name.registered | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registration_denied | lobby.race_name.registration_denied | en/subject.tmpl, en/text.tmpl |
| runtime.image_pull_failed | runtime.image_pull_failed | en/subject.tmpl, en/text.tmpl |
| runtime.container_start_failed | runtime.container_start_failed | en/subject.tmpl, en/text.tmpl |
| runtime.start_config_invalid | runtime.start_config_invalid | en/subject.tmpl, en/text.tmpl |
auth.login_code does not belong to the notification-owned template set.
Route Model
One accepted intent materializes:
- one notification_record
- zero or more notification_route entries
Each route represents exactly one (channel, recipient_ref) pair.
Stable route statuses:
- pending
- published
- failed
- dead_letter
- skipped
Rules:
- pending means the route is ready for first publish or retry
- published means the route was durably handed off to its downstream channel
- failed means the last publish attempt failed and a later retry is scheduled
- dead_letter means the route exhausted its retry budget
- skipped means the route slot was durably materialized but intentionally not emitted
Materialization rules:
- every derived recipient_ref receives one push route slot and one email route slot, except that an empty administrator email list materializes one synthetic config:<notification_type> recipient slot with only a skipped email route
- a route slot whose channel is outside the notification type channel matrix is materialized as skipped
- recipient_ref is user:<user_id> for user-targeted routes
- recipient_ref is email:<normalized_address> for configured administrator email routes
- when an administrator email list is empty, the service materializes one synthetic recipient slot config:<notification_type> with one skipped email route so the configuration gap remains durable and operator-visible
- route_id is mandatory and equals <channel>:<recipient_ref>
The service-local aggregate notification status is derived from routes and is not a separate durable source of truth.
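The materialization rules above can be sketched as a pure function over the derived recipient_ref values and the channel set of the notification type. The Route struct and the materialize function are hypothetical simplifications; the real rows carry the additional columns listed in the persistence model.

```go
package main

import "fmt"

// Route is a simplified route slot.
type Route struct {
	RouteID      string
	Channel      string
	RecipientRef string
	Status       string
}

// materialize builds one push slot and one email slot per recipient_ref.
// Slots whose channel is outside the type's channel set start as skipped.
// An empty recipient list models the empty administrator email list and
// yields one synthetic, skipped email slot.
func materialize(notificationType string, recipientRefs []string, channels map[string]bool) []Route {
	if len(recipientRefs) == 0 {
		ref := "config:" + notificationType
		return []Route{{RouteID: "email:" + ref, Channel: "email", RecipientRef: ref, Status: "skipped"}}
	}
	var routes []Route
	for _, ref := range recipientRefs {
		for _, ch := range []string{"push", "email"} {
			status := "skipped"
			if channels[ch] {
				status = "pending"
			}
			routes = append(routes, Route{
				RouteID:      ch + ":" + ref, // <channel>:<recipient_ref>
				Channel:      ch,
				RecipientRef: ref,
				Status:       status,
			})
		}
	}
	return routes
}

func main() {
	emailOnly := map[string]bool{"email": true}
	for _, r := range materialize("lobby.invite.expired", []string{"user:42"}, emailOnly) {
		fmt.Printf("%+v\n", r)
	}
	for _, r := range materialize("geo.review_recommended", nil, emailOnly) {
		fmt.Printf("%+v\n", r)
	}
}
```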
Persistence Model
Durable storage is split between PostgreSQL (table-shaped business state)
and Redis (streams, runtime coordination). The architectural rules live in
ARCHITECTURE.md §Persistence Backends;
the per-service decision record is
docs/postgres-migration.md.
PostgreSQL durable state
The service owns the notification schema. Migrations are embedded in the
binary (internal/adapters/postgres/migrations) and applied at startup via
pkg/postgres.RunMigrations strictly before any HTTP listener becomes
ready. Every time-valued column is timestamptz, normalised to UTC by the
adapter on bind and scan.
| Table | Frozen columns |
|---|---|
| records | notification_id, notification_type, producer, audience_kind, recipient_user_ids (jsonb), payload_json, idempotency_key, request_fingerprint, request_id, trace_id, occurred_at, accepted_at, updated_at, idempotency_expires_at; UNIQUE (producer, idempotency_key) |
| routes | notification_id, route_id, channel, recipient_ref, status, attempt_count, max_attempts, next_attempt_at, resolved_email, resolved_locale, last_error_classification, last_error_message, last_error_at, created_at, updated_at, published_at, dead_lettered_at, skipped_at; PRIMARY KEY (notification_id, route_id) |
| dead_letters | notification_id, route_id, channel, recipient_ref, final_attempt_count, max_attempts, failure_classification, failure_message, recovery_hint, created_at; PRIMARY KEY (notification_id, route_id) cascading from routes |
| malformed_intents | stream_entry_id, notification_type, producer, idempotency_key, failure_code, failure_message, raw_fields (jsonb), recorded_at |
Storage rules:
- the durable records row IS the idempotency reservation; the (producer, idempotency_key) UNIQUE constraint surfaces conflicts as acceptintent.ErrConflict
- next_attempt_at is non-NULL only while the route is a scheduling candidate (status=pending|failed); the partial index routes_due_idx drives the publishers' ListDueRoutes scan
- payload_json stores the canonical normalized JSON string used for idempotency fingerprinting; recipient_user_ids is JSONB and omitted for audience_kind=admin_email
- terminal transitions clear next_attempt_at and stamp the appropriate terminal column (published_at / dead_lettered_at / skipped_at)
- record-level retention deletes cascade to routes and dead_letters via ON DELETE CASCADE
Redis runtime-coordination state
| Logical artifact | Redis key |
|---|---|
| temporary route lease | notification:route_leases:<notification_id>:<route_id> |
| stream offset record | notification:stream_offsets:<stream> |
| ingress stream | notification:intents |
Storage rules:
- dynamic Redis key segments are base64url-encoded
- temporary route lease keys store one opaque worker token and use NOTIFICATION_ROUTE_LEASE_TTL; they are service-local coordination state rather than durable records, retained on Redis as a per-replica exclusivity hint atop the SQL claim
- stream offset records persist plain-XREAD consumer progress for notification:intents and never expire
- the outbound streams gateway:client-events and mail:delivery_commands remain Redis Streams owned by Gateway and Mail Service respectively; Notification Service emits one entry through XADD before committing the route's PostgreSQL state transition
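Encoding the dynamic key segments keeps the ":" separators of the key unambiguous even though route_id itself contains colons. A minimal sketch of building a lease key, assuming the unpadded base64url alphabet (the document does not say whether padding is used) and a hypothetical leaseKey helper:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// leaseKey builds notification:route_leases:<notification_id>:<route_id>
// with each dynamic segment base64url-encoded. Unpadded encoding is an
// assumption of this sketch.
func leaseKey(notificationID, routeID string) string {
	enc := base64.RawURLEncoding
	return "notification:route_leases:" +
		enc.EncodeToString([]byte(notificationID)) + ":" +
		enc.EncodeToString([]byte(routeID))
}

func main() {
	fmt.Println(leaseKey("1-0", "push:user:7"))
}
```

A worker would then set this key to its opaque token with a set-if-absent write and an expiry of NOTIFICATION_ROUTE_LEASE_TTL, so an abandoned lease releases itself.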
Publisher claim and lease coordination
Push and Email publishers share the same scheduling pattern:
- routes_due_idx (the partial index on next_attempt_at) replaces the former notification:route_schedule ZSET; the SQL query SELECT notification_id, route_id FROM routes WHERE next_attempt_at IS NOT NULL AND next_attempt_at <= now() ORDER BY next_attempt_at ASC LIMIT N returns the next due batch
- push publishers filter for route_id prefix push:; email publishers filter for prefix email: so the two workers do not contend
- push and email replicas coordinate through notification:route_leases:<notification_id>:<route_id> with NOTIFICATION_ROUTE_LEASE_TTL
- only the current lease holder finalises one due publication attempt; the durable transition is a Complete* SQL transaction with optimistic concurrency on routes.updated_at so a stale lease cannot overwrite a fresher row state
- newly accepted publishable routes enter the partial index immediately with status=pending and next_attempt_at = accepted_at
- failed routes remain in the partial index for retry
- published, dead_letter, and skipped clear next_attempt_at and drop out of the index
Retry And Dead-Letter Policy
Retry budgets are channel-specific:
- push publication to Gateway: 3 attempts total
- email publication to Mail Service: 7 attempts total
Rules:
- the first publication attempt happens immediately at accepted_at_ms
- after failed attempt N, the next delay is clamp(NOTIFICATION_ROUTE_BACKOFF_MIN * 2^(N-1), NOTIFICATION_ROUTE_BACKOFF_MIN, NOTIFICATION_ROUTE_BACKOFF_MAX)
- no jitter is added to the retry delay
- push and email routes are retried independently
- the shared schedule is filtered by route prefix so push publishers claim only push: routes and email publishers claim only email: routes
- push and email replicas coordinate through notification:route_leases:<notification_id>:<route_id> with NOTIFICATION_ROUTE_LEASE_TTL
- push publication failures are classified minimally as payload_encoding_failed and gateway_stream_publish_failed
- email publication failures are classified minimally as payload_encoding_failed and mail_stream_publish_failed
- when a route exhausts its retry budget, it transitions to dead_letter, creates notification_dead_letter_entry, and is removed from notification:route_schedule
- one exhausted route entering dead_letter must not roll back or invalidate a sibling route that already reached published
- service restarts resume from durable route state and persisted stream offsets
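The backoff formula in the rules above can be written out directly. The sketch below uses the documented defaults (1s minimum, 5m maximum); retryDelay and the constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultBackoffMin = time.Second     // NOTIFICATION_ROUTE_BACKOFF_MIN default
	defaultBackoffMax = 5 * time.Minute // NOTIFICATION_ROUTE_BACKOFF_MAX default
)

// retryDelay returns the wait after failed attempt n (1-based):
// clamp(minDelay * 2^(n-1), minDelay, maxDelay). No jitter is applied.
// Doubling stops as soon as the cap is reached, so large n cannot overflow.
func retryDelay(n int, minDelay, maxDelay time.Duration) time.Duration {
	d := minDelay
	for i := 1; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	// The email budget is 7 attempts in total, so at most 6 delays occur.
	for n := 1; n <= 6; n++ {
		fmt.Printf("after failed attempt %d: wait %s\n", n, retryDelay(n, defaultBackoffMin, defaultBackoffMax))
	}
}
```

With the default budgets of 3 and 7 attempts, the longest delay ever scheduled is 32s, so the 5m cap only matters if the budgets or the minimum are raised.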
Retention rules:
- records and their cascaded routes / dead_letters use NOTIFICATION_RECORD_RETENTION (deleted by the periodic SQL retention worker after the configured window; cascade clears dependent rows)
- the per-record idempotency window (records.idempotency_expires_at) uses NOTIFICATION_IDEMPOTENCY_TTL
- malformed_intents use NOTIFICATION_MALFORMED_INTENT_RETENTION (independent retention pass)
- the retention worker runs once per NOTIFICATION_CLEANUP_INTERVAL
- stream offset records do not expire
Observability
The service instruments:
- internal probe HTTP requests
- internal probe HTTP listener startup and shutdown events
- structured logs for accepted, duplicate, and rejected notification intents
- structured logs for push and email route publication, retry, and dead-letter transitions
- accepted and duplicate intent outcomes
- malformed intents, including idempotency conflicts and unresolved recipients
- user-enrichment lookup outcomes
- route publish attempts, retries, and dead-letter transitions
- current route-schedule depth and oldest scheduled route age
- oldest unprocessed intent stream entry age
Metric names:
- notification.intent.outcomes
- notification.intent.malformed
- notification.user_enrichment.attempts
- notification.route.publish_attempts
- notification.route.retries
- notification.route.dead_letters
- notification.route_schedule.depth
- notification.route_schedule.oldest_age_ms
- notification.intent_stream.oldest_unprocessed_age_ms
Metrics intentionally avoid high-cardinality attributes such as user_id,
email address, notification_id, route_id, and idempotency_key.
Metric attributes may include notification_type, producer,
audience_kind, channel, result, outcome, failure_code, and
failure_classification.
Structured logs for intent intake, duplicate resolution, malformed-intent recording, route publication, retry scheduling, and dead-letter transitions use the same field names where the value exists:
- notification_id
- notification_type
- producer
- audience_kind
- idempotency_key
- route_id
- channel
- request_id
- trace_id
OpenTelemetry trace context is logged as otel_trace_id and otel_span_id
when the active context carries a valid span.
Recovery
The supported manual replay path for a dead-lettered notification route is to
publish a new compatible intent to notification:intents.
Recovery rules:
- inspect the notification_dead_letter_entry, notification_route, and owning notification_record
- confirm the downstream dependency or payload problem has been corrected
- publish a new intent with the same semantic payload_json and audience fields, but with a new producer-owned idempotency_key
- keep the old notification_dead_letter_entry untouched as audit history until its configured TTL expires
Manual Redis mutation of an existing route record or
notification:route_schedule is not a supported replay workflow.
Verification
Focused service-local coverage verifies:
- configuration loading and validation
- GET /healthz
- GET /readyz
- absence of /metrics
- Redis startup fast-fail behavior
- graceful shutdown of the private probe listener
- valid intent acceptance
- malformed intent rejection
- duplicate and conflicting duplicate handling
- user-targeted route enrichment from User Service
- recipient_not_found malformed-intent recording for unresolved user_id
- temporary User Service failure handling without stream-offset advance
- FlatBuffers payload encoding for all ten user-facing push notification_type values
- template-mode Mail Service command encoding for user and administrator email routes
- due-route loading, lease acquisition, route publication, retry reschedule, and dead-letter persistence in Redis
- push worker success, retry, and duplicate-prevention behavior across concurrent replicas
- email worker success, retry, and duplicate-prevention behavior across concurrent replicas
- OpenTelemetry metric recording for intent outcomes, malformed intents, user enrichment, route publication attempts, retries, dead letters, route-schedule gauges, and intent-stream lag
- Redis-backed route-schedule and intent-stream lag snapshots
- structured log field helper coverage through intake and publisher tests
- intent-consumer restart from 0-0 and from persisted stream offsets
- runtime wiring of the intent consumer and both route publishers
- shared galaxy/notificationintent producer constructors, validation, and Redis Stream publication compatibility
Cross-service coverage verifies:
- Notification Service -> User Service enrichment compatibility and failure handling
- Notification Service -> Gateway push compatibility for every user-facing notification_type
- Notification Service -> Mail Service template-mode handoff for every supported email type
- producer compatibility for Game Master, Game Lobby, and Geo Profile Service through galaxy/notificationintent
- explicit regression that auth-code email still bypasses Notification Service
- real black-box Notification Service -> Gateway push fan-out coverage
- real black-box Notification Service -> Mail Service template-mode handoff coverage
Real producer-boundary suites for Game Master, Game Lobby, and
Geo Profile Service should be added only when those service boundaries exist
in code.