Notification Service
Canonical references:
- Service-local docs
- Intent AsyncAPI contract
- Probe OpenAPI contract
- Gateway push model
- Mail async command contract
- Notification FlatBuffers payloads
- System architecture
Purpose
Notification Service is the internal asynchronous orchestration layer for
platform notifications.
It accepts normalized notification intents from upstream services, materializes
per-recipient routes, enriches user-targeted routes through User Service,
publishes client-facing push events toward Gateway, publishes non-auth email
commands toward Mail Service, and isolates transient downstream failures with
independent retry budgets per channel.
The service is intentionally not a source of truth for:
- game state
- lobby membership
- invite ownership
- review flags
- notification preferences
- email delivery attempts
Responsibility Boundaries
Notification Service is responsible for:
- consuming normalized notification intents from a dedicated Redis Stream
- validating intent envelopes and rejecting malformed or conflicting duplicates
- persisting durable notification and route state
- resolving user contact data from User Service by user_id
- selecting locale from User Service.preferred_language with en fallback
- shaping lightweight push payloads for user-facing events
- publishing template-mode email commands to Mail Service
- retrying route publication independently for push and email
- persisting dead-letter entries for exhausted routes
Notification Service is not responsible for:
- computing business audiences from game_id or other domain identifiers
- owning administrator identity or administrator user records
- sending auth-code email
- storing per-user notification preferences in v1
- exposing an operator REST API in v1
The key design rule is that upstream producers must publish the concrete
recipient_user_id values for user-targeted notification intents. For
administrator-only notification types, recipient email addresses are resolved
from Notification Service configuration by notification_type. Private-game
invite notifications in v1 remain user-bound by internal user_id values and
must not target recipients by raw email address.
Runtime Surface
The implemented process contains:
- one private internal HTTP probe listener
- process-wide structured logging
- process-wide OpenTelemetry runtime
- one shared galaxy/notificationintent producer contract module
- one shared Redis client with startup connectivity check
- one trusted User Service HTTP enrichment client
- one plain-XREAD notification-intent consumer
- one long-lived push route publisher
- one long-lived email route publisher
- durable accepted-intent, route, idempotency, malformed-intent, and stream-offset storage in Redis
- user-targeted route enrichment during intent acceptance before durable write
- client-facing push publication toward Gateway
- template-mode email publication toward Mail Service
- durable push and email retry, dead-letter, and temporary lease coordination in Redis
- OpenTelemetry counters and observable gauges for intent intake, user enrichment, route publication, route schedule depth, and intent stream lag
- graceful shutdown on process cancellation
Probe contract:
- GET /healthz returns {"status":"ok"}
- GET /readyz returns {"status":"ready"}
- readyz is process-local after successful startup and does not perform a live Redis ping per request
- there is no /metrics route
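The probe contract above can be sketched with the standard library alone. This is a minimal illustration, not the service's actual listener: newProbeMux, staticJSON, and probe are hypothetical names, and the Content-Type header and the 405 response for non-GET methods are assumptions that the contract does not state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// staticJSON returns a handler that answers GET with a fixed JSON body.
// Rejecting other methods with 405 is an assumption, not part of the contract.
func staticJSON(body string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}
}

// newProbeMux (hypothetical name) registers only the two probe routes.
// No /metrics route is registered, so the mux answers it with 404.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", staticJSON(`{"status":"ok"}`))
	mux.HandleFunc("/readyz", staticJSON(`{"status":"ready"}`))
	return mux
}

// probe issues an in-memory GET against the handler and returns status and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newProbeMux()
	for _, path := range []string{"/healthz", "/readyz", "/metrics"} {
		code, body := probe(mux, path)
		fmt.Println(path, code, body)
	}
}
```

Because readyz is process-local, the handler serves a constant body and never touches Redis on the request path.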
Runtime behavior:
- the intent consumer reads notification:intents with plain XREAD
- when no stored stream offset exists, the consumer starts from 0-0
- the persisted offset advances only after durable acceptance or durable malformed-intent recording
- user-targeted routes are enriched through GET /api/v1/internal/users/{user_id} before durable route write
- 404 subject_not_found from User Service is recorded under malformed-intent storage with failure_code=recipient_not_found
- temporary User Service lookup failures stop the consumer before stream-offset advance
- due push routes are published toward Gateway from the shared notification:route_schedule
- due email routes are published toward Mail Service from the shared notification:route_schedule
- the push publisher claims only routes whose route_id starts with push:
- the email publisher claims only routes whose route_id starts with email:
- replicas coordinate through temporary Redis lease notification:route_leases:<notification_id>:<route_id>
- Gateway publication uses XADD MAXLEN ~ with NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN
- event_id equals <notification_id>/<route_id>
- Mail Service publication uses plain XADD with no stream trimming
- delivery_id equals <notification_id>/<route_id>
- idempotency_key equals notification:<notification_id>/<route_id>
- requested_at_ms equals accepted_at_ms
- request_id and trace_id are forwarded when present
- device_session_id is intentionally omitted so Gateway fans the event out to every active stream of that user
- Go producers use galaxy/notificationintent to construct and publish compatible intents into notification:intents
- producer publication uses plain XADD without stream trimming or hidden helper retries
- a producer-side notification publication failure is notification degradation and must not roll back the source business state
- metric export uses the configured OpenTelemetry exporters only
- there is still no /metrics route
- notification.route_schedule.depth and notification.route_schedule.oldest_age_ms are derived from notification:route_schedule
- notification.intent_stream.oldest_unprocessed_age_ms is derived from the persisted intent stream offset and the configured ingress stream
- manual dead-letter replay is performed by publishing a new compatible intent with a new idempotency_key; existing dead-letter records remain audit history until TTL expiry
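The offset rules in the runtime behavior list reduce to a small decision per stream entry. The sketch below is illustrative only: lookupOutcome, decision, and decide are hypothetical names, and which concrete errors count as temporary (timeouts, 5xx responses, network errors) is an assumption.

```go
package main

import "fmt"

// lookupOutcome classifies a User Service lookup result (hypothetical type).
type lookupOutcome int

const (
	lookupOK               lookupOutcome = iota // user record resolved
	lookupNotFound                              // 404 subject_not_found
	lookupTemporaryFailure                      // assumed: timeout, 5xx, network error
)

// decision describes what the consumer does with one stream entry.
type decision struct {
	Accept        bool   // write durable notification and route records
	FailureCode   string // set when the entry is recorded as a malformed intent
	AdvanceOffset bool   // persist the stream offset past this entry
	StopConsumer  bool   // leave the entry unprocessed so it is retried after restart
}

// decide maps an enrichment outcome to the consumer action.
func decide(o lookupOutcome) decision {
	switch o {
	case lookupOK:
		return decision{Accept: true, AdvanceOffset: true}
	case lookupNotFound:
		// Permanent producer input error: record it and move on.
		return decision{FailureCode: "recipient_not_found", AdvanceOffset: true}
	default:
		// Temporary failure: do not advance, so the same entry is read again.
		return decision{StopConsumer: true}
	}
}

func main() {
	for _, o := range []lookupOutcome{lookupOK, lookupNotFound, lookupTemporaryFailure} {
		fmt.Printf("%+v\n", decide(o))
	}
}
```

The point of the shape is that the offset only moves after something durable has been written, either an accepted intent or a malformed-intent record.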
The target process shape is one internal-only process with:
- one notification-intent consumer
- one push route publisher for Gateway
- one email route publisher for Mail Service
Intentional runtime omissions in v1:
- no public ingress
- no dedicated operator REST API
- no direct client delivery
- no direct SMTP integration
Configuration
Required:
- NOTIFICATION_REDIS_ADDR
- NOTIFICATION_USER_SERVICE_BASE_URL
Primary configuration groups:
- process and logging:
  - NOTIFICATION_SHUTDOWN_TIMEOUT
  - NOTIFICATION_LOG_LEVEL
- internal probe HTTP:
  - NOTIFICATION_INTERNAL_HTTP_ADDR with default :8092
  - NOTIFICATION_INTERNAL_HTTP_READ_HEADER_TIMEOUT with default 2s
  - NOTIFICATION_INTERNAL_HTTP_READ_TIMEOUT with default 10s
  - NOTIFICATION_INTERNAL_HTTP_IDLE_TIMEOUT with default 1m
- Redis connectivity:
  - NOTIFICATION_REDIS_USERNAME
  - NOTIFICATION_REDIS_PASSWORD
  - NOTIFICATION_REDIS_DB
  - NOTIFICATION_REDIS_TLS_ENABLED
  - NOTIFICATION_REDIS_OPERATION_TIMEOUT
- stream names:
  - NOTIFICATION_INTENTS_STREAM with default notification:intents
  - NOTIFICATION_INTENTS_READ_BLOCK_TIMEOUT with default 2s
  - NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM with default gateway:client-events
  - NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN with default 1024
  - NOTIFICATION_MAIL_DELIVERY_COMMANDS_STREAM with default mail:delivery_commands
- retry and dead-letter:
  - NOTIFICATION_PUSH_RETRY_MAX_ATTEMPTS with default 3
  - NOTIFICATION_EMAIL_RETRY_MAX_ATTEMPTS with default 7
  - NOTIFICATION_ROUTE_BACKOFF_MIN with default 1s
  - NOTIFICATION_ROUTE_BACKOFF_MAX with default 5m
  - NOTIFICATION_ROUTE_LEASE_TTL with default 5s
  - NOTIFICATION_DEAD_LETTER_TTL with default 720h
  - NOTIFICATION_RECORD_TTL with default 720h
  - NOTIFICATION_IDEMPOTENCY_TTL with default 168h
- User Service enrichment:
  - NOTIFICATION_USER_SERVICE_TIMEOUT with default 1s
- administrator routing:
  - NOTIFICATION_ADMIN_EMAILS_GEO_REVIEW_RECOMMENDED
  - NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED
  - NOTIFICATION_ADMIN_EMAILS_LOBBY_RUNTIME_PAUSED_AFTER_START
  - NOTIFICATION_ADMIN_EMAILS_LOBBY_APPLICATION_SUBMITTED
- OpenTelemetry:
  - standard OTEL_* variables
  - NOTIFICATION_OTEL_STDOUT_TRACES_ENABLED
  - NOTIFICATION_OTEL_STDOUT_METRICS_ENABLED
Each administrator configuration variable stores a comma-separated list of
email addresses for exactly one notification_type. v1 does not use one global
admin-recipient list shared across all administrative events.
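As an illustration of the per-type administrator lists, the sketch below parses one comma-separated value and derives the variable name for a notification_type. Both helpers are hypothetical. The four documented variable names happen to match the upper-cased type with dots replaced by underscores, but the service may just as well use a static table. Lowercasing addresses and dropping duplicates are assumptions about what "normalized" means here.

```go
package main

import (
	"fmt"
	"strings"
)

// adminEmailsEnvVar (hypothetical) derives the variable name for one
// notification_type, e.g. game.generation_failed ->
// NOTIFICATION_ADMIN_EMAILS_GAME_GENERATION_FAILED.
func adminEmailsEnvVar(notificationType string) string {
	return "NOTIFICATION_ADMIN_EMAILS_" +
		strings.ToUpper(strings.ReplaceAll(notificationType, ".", "_"))
}

// parseAdminEmails splits a comma-separated list, trims whitespace, and
// drops empty items. Lowercasing and de-duplication are assumptions.
func parseAdminEmails(raw string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, part := range strings.Split(raw, ",") {
		addr := strings.ToLower(strings.TrimSpace(part))
		if addr == "" {
			continue
		}
		if _, dup := seen[addr]; dup {
			continue
		}
		seen[addr] = struct{}{}
		out = append(out, addr)
	}
	return out
}

func main() {
	fmt.Println(adminEmailsEnvVar("geo.review_recommended"))
	fmt.Println(parseAdminEmails(" Ops@Example.com, ,ops@example.com,oncall@example.com "))
	fmt.Println(len(parseAdminEmails("")) == 0) // an empty list is a valid configuration
}
```

An empty result is not an error: it is the configuration gap that the route model records as a skipped email route.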
Stable Input Contract
The service accepts intents from one dedicated Redis Stream:
notification:intents
The canonical envelope is defined in
api/intents-asyncapi.yaml.
Go producers should use the shared galaxy/notificationintent module to build
and append compatible stream entries instead of duplicating field names,
payload structs, or validation rules locally.
Required envelope fields:
- notification_type
- producer
- audience_kind
- idempotency_key
- occurred_at_ms
- payload_json
Optional envelope fields:
- recipient_user_ids_json
- request_id
- trace_id
Rules:
- audience_kind=user requires recipient_user_ids_json with one or more unique stable user_id values
- audience_kind=admin_email forbids recipient_user_ids_json
- recipient_user_ids_json is normalized as an unordered recipient set, so duplicate user_id values are invalid and element order does not affect idempotency
- request_id and trace_id are observability-only metadata and do not participate in the idempotency fingerprint
- payload_json is type-specific, must remain backward-compatible for each notification_type, and is normalized structurally for duplicate detection: insignificant whitespace and object key order are ignored while array order remains significant
- a replay with the same (producer, idempotency_key) and the same normalized payload is treated as a successful duplicate
- a replay with the same (producer, idempotency_key) but different normalized content is recorded as a conflicting duplicate under malformed-intent storage with failure_code=idempotency_conflict and must not create new routes
- during user enrichment, a missing user_id in User Service is recorded under malformed-intent storage with failure_code=recipient_not_found
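The normalization rules above can be sketched with encoding/json, which writes object keys in sorted order and preserves array order when re-encoding a decoded value. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the exact set of hashed fields and the SHA-256 hex encoding are illustrative choices, not the documented request_fingerprint format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"sort"
	"strings"
)

// normalizePayload re-encodes payload_json so that whitespace and object key
// order no longer matter; array order is preserved by encoding/json.
func normalizePayload(raw string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.UseNumber() // keep numeric literals exact instead of converting to float64
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	if _, err := dec.Token(); !errors.Is(err, io.EOF) {
		return "", errors.New("payload_json must contain exactly one JSON value")
	}
	out, err := json.Marshal(v)
	return string(out), err
}

// fingerprint hashes the normalized content of an intent. request_id and
// trace_id are deliberately absent. The chosen field set is an assumption.
func fingerprint(notificationType, audienceKind string, recipients []string, payload string) (string, error) {
	norm, err := normalizePayload(payload)
	if err != nil {
		return "", err
	}
	ids := append([]string(nil), recipients...)
	sort.Strings(ids) // unordered set: element order must not affect the result
	for i := 1; i < len(ids); i++ {
		if ids[i] == ids[i-1] {
			return "", errors.New("duplicate user_id in recipient_user_ids_json")
		}
	}
	canonical, err := json.Marshal(struct {
		Type       string          `json:"notification_type"`
		Audience   string          `json:"audience_kind"`
		Recipients []string        `json:"recipient_user_ids"`
		Payload    json.RawMessage `json:"payload_json"`
	}{notificationType, audienceKind, ids, json.RawMessage(norm)})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := fingerprint("game.turn.ready", "user", []string{"u2", "u1"}, `{"game_id": "g1", "turn_number": 7}`)
	b, _ := fingerprint("game.turn.ready", "user", []string{"u1", "u2"}, `{"turn_number":7,"game_id":"g1"}`)
	fmt.Println(a == b) // same normalized content, so a replay would be a successful duplicate
}
```

A replay whose fingerprint differs from the stored one for the same (producer, idempotency_key) is the idempotency_conflict case.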
Malformed stream entries do not create durable notification records. They are
logged, metered, and recorded separately for operator inspection.
Accepted intents use the original Redis Stream stream_entry_id as
notification_id.
Notification Catalog
payload_json fields are normalized by the producer before publication.
| notification_type | Producer | Audience | Channels | Required payload_json fields |
|---|---|---|---|---|
| geo.review_recommended | Geo Profile Service (geoprofile) | configured admin email list (audience_kind=admin_email) | email | user_id, user_email, observed_country, usual_connection_country, review_reason |
| game.turn.ready | Game Master (game_master) | active accepted participants (audience_kind=user) | push+email | game_id, game_name, turn_number |
| game.finished | Game Master (game_master) | active accepted participants (audience_kind=user) | push+email | game_id, game_name, final_turn_number |
| game.generation_failed | Game Master (game_master) | configured admin email list (audience_kind=admin_email) | email | game_id, game_name, failure_reason |
| lobby.runtime_paused_after_start | Game Lobby (game_lobby) | configured admin email list (audience_kind=admin_email) | email | game_id, game_name |
| lobby.application.submitted | Game Lobby (game_lobby) | private owner (audience_kind=user) or public admins (audience_kind=admin_email) | private: push+email, public: email | game_id, game_name, applicant_user_id, applicant_name |
| lobby.membership.approved | Game Lobby (game_lobby) | applicant user (audience_kind=user) | push+email | game_id, game_name |
| lobby.membership.rejected | Game Lobby (game_lobby) | applicant user (audience_kind=user) | push+email | game_id, game_name |
| lobby.membership.blocked | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | push+email | game_id, game_name, membership_user_id, membership_user_name, reason |
| lobby.invite.created | Game Lobby (game_lobby) | invited user (audience_kind=user) | push+email | game_id, game_name, inviter_user_id, inviter_name |
| lobby.invite.redeemed | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | push+email | game_id, game_name, invitee_user_id, invitee_name |
| lobby.invite.expired | Game Lobby (game_lobby) | private-game owner (audience_kind=user) | email | game_id, game_name, invitee_user_id, invitee_name |
| lobby.race_name.registration_eligible | Game Lobby (game_lobby) | capable member (audience_kind=user) | push+email | game_id, game_name, race_name, eligible_until_ms |
| lobby.race_name.registered | Game Lobby (game_lobby) | registering user (audience_kind=user) | push+email | race_name |
| lobby.race_name.registration_denied | Game Lobby (game_lobby) | incapable member (audience_kind=user) | email | game_id, game_name, race_name, reason |
Rules:
- v1 supports exactly the fifteen notification_type values listed above
- lobby.application.submitted keeps one stable notification_type and one stable payload_json shape; private games publish audience_kind=user while public games publish audience_kind=admin_email
- lobby.invite.revoked deliberately produces no notification in v1 and remains outside the supported catalog
- private-game invite notifications remain user-bound by internal user_id
- lobby.race_name.registration_eligible and lobby.race_name.registration_denied are emitted by Game Lobby at game_finished based on capability evaluation; the former always pairs with a 30-day eligible_until_ms window
- lobby.race_name.registered is emitted on successful lobby.race_name.register commit
Recipient Enrichment And Locale Policy
For audience_kind=user, Notification Service resolves user records through
the trusted User Service lookup endpoint:
GET /api/v1/internal/users/{user_id}
The response supplies:
- email
- preferred_language
Locale rules:
- current implemented support is exactly one locale: en
- exact preferred_language is used when supported by Mail Service
- unsupported, empty, or invalid language values fall back to en
- no intermediate locale reduction is used in v1
- the same resolved locale drives both push payload localization decisions and Mail Service template selection
- enrichment runs during intent acceptance before durable route write
- 404 subject_not_found from User Service is treated as permanent producer input error and becomes malformed-intent recipient_not_found
- temporary User Service failures stop the consumer before stream-offset advance so the same stream entry is retried after restart
For audience_kind=admin_email, Notification Service does not consult
User Service and instead resolves recipients from type-specific config.
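The locale rules amount to an exact-match lookup with a fixed fallback. A minimal sketch follows, with hypothetical names; the supported set is passed in explicitly only to make the "no intermediate locale reduction" rule visible, since v1 itself supports just en.

```go
package main

import "fmt"

const fallbackLocale = "en"

// resolveLocale returns preferred only on an exact match against the
// supported set. No reduction such as "fr-CA" -> "fr" is attempted.
func resolveLocale(preferred string, supported map[string]struct{}) string {
	if _, ok := supported[preferred]; ok {
		return preferred
	}
	return fallbackLocale
}

func main() {
	v1 := map[string]struct{}{"en": {}}
	fmt.Println(resolveLocale("en", v1))    // exact match
	fmt.Println(resolveLocale("", v1))      // empty falls back
	fmt.Println(resolveLocale("en-US", v1)) // not reduced, falls back

	// Hypothetical future set, shown only to illustrate the no-reduction rule.
	future := map[string]struct{}{"en": {}, "fr": {}}
	fmt.Println(resolveLocale("fr-CA", future))
}
```

Resolving once during enrichment and storing the result on the route keeps push localization and Mail Service template selection on the same value.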
Push Contract Toward Gateway
Push events are published into the existing Gateway client-events stream.
Stable routing rules:
- event_type equals notification_type
- event_id equals <notification_id>/<route_id>
- user_id is derived from recipient_ref=user:<user_id> for user-targeted routes
- request_id and trace_id are forwarded when present
- device_session_id is intentionally omitted so Gateway fans the event out to every active stream of that user
Notification Service appends Gateway events with XADD MAXLEN ~ using
NOTIFICATION_GATEWAY_CLIENT_EVENTS_STREAM_MAX_LEN.
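The routing rules above can be illustrated by deriving the Gateway event fields from one route. The gatewayEvent struct and newGatewayEvent are hypothetical; the real stream entry layout is defined by the Gateway push model, not by this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// gatewayEvent holds only the routing fields discussed above.
// device_session_id is intentionally not part of the struct.
type gatewayEvent struct {
	EventType string
	EventID   string
	UserID    string
}

// newGatewayEvent derives the routing fields for one push route.
func newGatewayEvent(notificationID, notificationType, routeID, recipientRef string) (gatewayEvent, error) {
	const prefix = "user:"
	if !strings.HasPrefix(recipientRef, prefix) || len(recipientRef) == len(prefix) {
		return gatewayEvent{}, errors.New("push routes require recipient_ref=user:<user_id>")
	}
	return gatewayEvent{
		EventType: notificationType,
		EventID:   notificationID + "/" + routeID,
		UserID:    strings.TrimPrefix(recipientRef, prefix),
	}, nil
}

func main() {
	ev, err := newGatewayEvent("1700000000000-0", "game.turn.ready", "push:user:u42", "user:u42")
	fmt.Printf("%+v %v\n", ev, err)
}
```

Since event_id is a pure function of notification_id and route_id, a retried publication of the same route carries the same identifier.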
User-facing push payloads use pkg/schema/fbs/notification.fbs.
| notification_type | FlatBuffers table | Payload fields |
|---|---|---|
| game.turn.ready | notification.GameTurnReadyEvent | game_id, turn_number |
| game.finished | notification.GameFinishedEvent | game_id, final_turn_number |
| lobby.application.submitted | notification.LobbyApplicationSubmittedEvent | game_id, applicant_user_id |
| lobby.membership.approved | notification.LobbyMembershipApprovedEvent | game_id |
| lobby.membership.rejected | notification.LobbyMembershipRejectedEvent | game_id |
| lobby.membership.blocked | notification.LobbyMembershipBlockedEvent | game_id, membership_user_id, reason |
| lobby.invite.created | notification.LobbyInviteCreatedEvent | game_id, inviter_user_id |
| lobby.invite.redeemed | notification.LobbyInviteRedeemedEvent | game_id, invitee_user_id |
| lobby.race_name.registration_eligible | notification.LobbyRaceNameRegistrationEligibleEvent | game_id, race_name, eligible_until_ms |
| lobby.race_name.registered | notification.LobbyRaceNameRegisteredEvent | race_name |
Only the ten user-facing push notification types above are represented in
notification.fbs.
geo.review_recommended, game.generation_failed,
lobby.runtime_paused_after_start, lobby.invite.expired, and
lobby.race_name.registration_denied remain outside this schema because
they are email-only in v1.
Checked-in generated Go bindings for this schema live under
../pkg/schema/fbs/notification.
notification_type alone determines the concrete FlatBuffers table.
No extra envelope or FlatBuffers union is added in v1.
The push payload must stay lightweight and must not attempt to mirror full game,
lobby, or profile state.
game_name, human-readable user names, and other full business-state fields
stay out of the push schema.
Clients react to the notification and then fetch fresh business state through
normal service APIs.
Email Contract Toward Mail Service
Email routes are published to Mail Service through
mail:delivery_commands using the existing generic async command contract.
Rules:
- delivery_id equals <notification_id>/<route_id>
- source is always notification
- payload_mode is always template
- idempotency_key equals notification:<notification_id>/<route_id>
- requested_at_ms equals accepted_at_ms
- request_id and trace_id are forwarded when present
- payload_json.to contains exactly one resolved recipient email
- payload_json.cc, payload_json.bcc, payload_json.reply_to, and payload_json.attachments are empty arrays in v1
- template_id equals notification_type
- locale is the resolved language from the enrichment step or en
- template variables are passed through from normalized payload_json
Notification Service appends Mail Service commands with plain XADD and
does not manage retention or trimming of mail:delivery_commands.
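The identifier and addressing rules above can be sketched as follows. The struct and function names are hypothetical, the element type of attachments is a stand-in (only its emptiness is documented), and the sketch covers only the addressing part of payload_json; the full command layout is defined by the Mail async command contract.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mailIDs carries the two identifiers derived from one email route.
type mailIDs struct {
	DeliveryID     string
	IdempotencyKey string
}

// newMailIDs applies the <notification_id>/<route_id> rule.
func newMailIDs(notificationID, routeID string) mailIDs {
	id := notificationID + "/" + routeID
	return mailIDs{DeliveryID: id, IdempotencyKey: "notification:" + id}
}

// mailAddressing models only the addressing fields of payload_json.
// []string for attachments is an assumption made for this sketch.
type mailAddressing struct {
	To          []string `json:"to"`
	Cc          []string `json:"cc"`
	Bcc         []string `json:"bcc"`
	ReplyTo     []string `json:"reply_to"`
	Attachments []string `json:"attachments"`
}

// addressingJSON encodes exactly one recipient and explicit empty arrays.
// Non-nil empty slices are used so they encode as [] rather than null.
func addressingJSON(resolvedEmail string) (string, error) {
	b, err := json.Marshal(mailAddressing{
		To:          []string{resolvedEmail},
		Cc:          []string{},
		Bcc:         []string{},
		ReplyTo:     []string{},
		Attachments: []string{},
	})
	return string(b), err
}

func main() {
	fmt.Printf("%+v\n", newMailIDs("1700000000000-0", "email:user:u42"))
	body, _ := addressingJSON("owner@example.com")
	fmt.Println(body)
}
```

One command per route with a single to address is what lets each recipient's email route retry and dead-letter independently.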
Auth-code email remains a direct Auth / Session Service -> Mail Service flow
and does not pass through Notification Service.
Initial notification-owned template assets:
| notification_type | template_id | Required assets |
|---|---|---|
| geo.review_recommended | geo.review_recommended | en/subject.tmpl, en/text.tmpl |
| game.turn.ready | game.turn.ready | en/subject.tmpl, en/text.tmpl |
| game.finished | game.finished | en/subject.tmpl, en/text.tmpl |
| game.generation_failed | game.generation_failed | en/subject.tmpl, en/text.tmpl |
| lobby.runtime_paused_after_start | lobby.runtime_paused_after_start | en/subject.tmpl, en/text.tmpl |
| lobby.application.submitted | lobby.application.submitted | en/subject.tmpl, en/text.tmpl |
| lobby.membership.approved | lobby.membership.approved | en/subject.tmpl, en/text.tmpl |
| lobby.membership.rejected | lobby.membership.rejected | en/subject.tmpl, en/text.tmpl |
| lobby.membership.blocked | lobby.membership.blocked | en/subject.tmpl, en/text.tmpl |
| lobby.invite.created | lobby.invite.created | en/subject.tmpl, en/text.tmpl |
| lobby.invite.redeemed | lobby.invite.redeemed | en/subject.tmpl, en/text.tmpl |
| lobby.invite.expired | lobby.invite.expired | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registration_eligible | lobby.race_name.registration_eligible | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registered | lobby.race_name.registered | en/subject.tmpl, en/text.tmpl |
| lobby.race_name.registration_denied | lobby.race_name.registration_denied | en/subject.tmpl, en/text.tmpl |
auth.login_code does not belong to the notification-owned template set.
Route Model
One accepted intent materializes:
- one notification_record
- zero or more notification_route entries
Each route represents exactly one (channel, recipient_ref) pair.
Stable route statuses:
- pending
- published
- failed
- dead_letter
- skipped
Rules:
- pending means the route is ready for first publish or retry
- published means the route was durably handed off to its downstream channel
- failed means the last publish attempt failed and a later retry is scheduled
- dead_letter means the route exhausted its retry budget
- skipped means the route slot was durably materialized but intentionally not emitted
Materialization rules:
- every derived recipient_ref receives one push route slot and one email route slot, except that an empty administrator email list materializes one synthetic config:<notification_type> recipient slot with only a skipped email route
- a route slot whose channel is outside the notification type channel matrix is materialized as skipped
- recipient_ref is user:<user_id> for user-targeted routes
- recipient_ref is email:<normalized_address> for configured administrator email routes
- when an administrator email list is empty, the service materializes one synthetic recipient slot config:<notification_type> with one skipped email route so the configuration gap remains durable and operator-visible
- route_id is mandatory and equals <channel>:<recipient_ref>
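The materialization rules can be sketched as a pure function from recipients and a channel matrix to route slots. The route struct and materialize are hypothetical names, and the map-based channel matrix is an illustrative representation of the Channels column in the catalog.

```go
package main

import "fmt"

// route is a reduced view of notification_route for this sketch.
type route struct {
	RouteID      string
	Channel      string
	RecipientRef string
	Status       string
}

// materialize builds one push slot and one email slot per recipient_ref.
// Slots whose channel is outside the matrix are created as skipped. An empty
// administrator list yields one synthetic config:<notification_type> slot
// with only a skipped email route.
func materialize(notificationType, audienceKind string, recipientRefs []string, matrix map[string]bool) []route {
	if audienceKind == "admin_email" && len(recipientRefs) == 0 {
		ref := "config:" + notificationType
		return []route{{RouteID: "email:" + ref, Channel: "email", RecipientRef: ref, Status: "skipped"}}
	}
	var routes []route
	for _, ref := range recipientRefs {
		for _, ch := range []string{"push", "email"} {
			status := "skipped"
			if matrix[ch] {
				status = "pending"
			}
			routes = append(routes, route{RouteID: ch + ":" + ref, Channel: ch, RecipientRef: ref, Status: status})
		}
	}
	return routes
}

func main() {
	emailOnly := map[string]bool{"email": true}
	// lobby.invite.expired is email-only, so the push slot is skipped.
	for _, r := range materialize("lobby.invite.expired", "user", []string{"user:u1"}, emailOnly) {
		fmt.Printf("%+v\n", r)
	}
	// No administrators configured for this type.
	for _, r := range materialize("game.generation_failed", "admin_email", nil, emailOnly) {
		fmt.Printf("%+v\n", r)
	}
}
```

Writing skipped slots instead of omitting them keeps the per-recipient route set uniform, which makes a configuration gap or an email-only type visible in stored state.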
The service-local aggregate notification status is derived from routes and is not a separate durable source of truth.
Redis Logical Model
Storage rules:
- durable records are stored as strict JSON blobs
- timestamps are stored in Unix milliseconds
- dynamic Redis key segments are base64url-encoded
- notification:route_schedule is one shared sorted set for both push and email
| Logical artifact | Redis key |
|---|---|
| notification_record | notification:records:<notification_id> |
| notification_route | notification:routes:<notification_id>:<route_id> |
| temporary route lease | notification:route_leases:<notification_id>:<route_id> |
| notification_idempotency_record | notification:idempotency:<producer>:<idempotency_key> |
| notification_dead_letter_entry | notification:dead_letters:<notification_id>:<route_id> |
| malformed intent record | notification:malformed_intents:<stream_entry_id> |
| stream offset record | notification:stream_offsets:<stream> |
| ingress stream | notification:intents |
| route schedule sorted set | notification:route_schedule |
| Record | Frozen fields |
|---|---|
| notification_record | notification_id, notification_type, producer, audience_kind, normalized recipient_user_ids, normalized payload_json, idempotency_key, request_fingerprint, optional request_id, optional trace_id, occurred_at_ms, accepted_at_ms, updated_at_ms |
| notification_route | notification_id, route_id, channel, recipient_ref, status, attempt_count, max_attempts, next_attempt_at_ms, optional resolved_email, optional resolved_locale, optional last_error_classification, optional last_error_message, optional last_error_at_ms, created_at_ms, updated_at_ms, optional published_at_ms, optional dead_lettered_at_ms, optional skipped_at_ms |
| notification_idempotency_record | producer, idempotency_key, notification_id, request_fingerprint, created_at_ms, expires_at_ms |
| notification_dead_letter_entry | notification_id, route_id, channel, recipient_ref, final_attempt_count, max_attempts, failure_classification, failure_message, created_at_ms, optional recovery_hint |
| malformed intent record | stream_entry_id, optional notification_type, optional producer, optional idempotency_key, failure_code, failure_message, raw_fields_json, recorded_at_ms |
| stream offset record | stream, last_processed_entry_id, updated_at_ms |
notification_record.recipient_user_ids stores a normalized array of unique
user_id values and is omitted for audience_kind=admin_email.
notification_record.payload_json stores the canonical normalized JSON string
used for idempotency fingerprinting.
Temporary route lease keys store one opaque worker token and use
NOTIFICATION_ROUTE_LEASE_TTL; they are service-local coordination state
rather than durable records.
notification:route_schedule stores one member per scheduled route where score
= next_attempt_at_ms and member = full Redis route key with encoded dynamic
segments.
Newly accepted publishable routes enter the schedule immediately with
status=pending and next_attempt_at_ms = accepted_at_ms.
failed routes remain scheduled for retry.
published, dead_letter, and skipped are absent from the schedule.
Only the current lease holder may finalize one due publication attempt.
Retry And Dead-Letter Policy
Retry budgets are channel-specific:
- push publication to Gateway: 3 attempts total
- email publication to Mail Service: 7 attempts total
Rules:
- the first publication attempt happens immediately at accepted_at_ms
- after failed attempt N, the next delay is clamp(NOTIFICATION_ROUTE_BACKOFF_MIN * 2^(N-1), NOTIFICATION_ROUTE_BACKOFF_MIN, NOTIFICATION_ROUTE_BACKOFF_MAX)
- no jitter is added to the retry delay
- push and email routes are retried independently
- the shared schedule is filtered by route prefix so push publishers claim only push: routes and email publishers claim only email: routes
- push and email replicas coordinate through notification:route_leases:<notification_id>:<route_id> with NOTIFICATION_ROUTE_LEASE_TTL
- push publication failures are classified minimally as payload_encoding_failed and gateway_stream_publish_failed
- email publication failures are classified minimally as payload_encoding_failed and mail_stream_publish_failed
- when a route exhausts its retry budget, it transitions to dead_letter, creates notification_dead_letter_entry, and is removed from notification:route_schedule
- one exhausted route entering dead_letter must not roll back or invalidate a sibling route that already reached published
- service restarts resume from durable route state and persisted stream offsets
Retention rules:
- notification_record and notification_route use NOTIFICATION_RECORD_TTL
- notification_idempotency_record uses NOTIFICATION_IDEMPOTENCY_TTL
- notification_dead_letter_entry and malformed intent records use NOTIFICATION_DEAD_LETTER_TTL
- stream offset records do not use TTL
Observability
The service instruments:
- internal probe HTTP requests
- internal probe HTTP listener startup and shutdown events
- structured logs for accepted, duplicate, and rejected notification intents
- structured logs for push and email route publication, retry, and dead-letter transitions
- accepted and duplicate intent outcomes
- malformed intents, including idempotency conflicts and unresolved recipients
- user-enrichment lookup outcomes
- route publish attempts, retries, and dead-letter transitions
- current route-schedule depth and oldest scheduled route age
- oldest unprocessed intent stream entry age
Metric names:
- notification.intent.outcomes
- notification.intent.malformed
- notification.user_enrichment.attempts
- notification.route.publish_attempts
- notification.route.retries
- notification.route.dead_letters
- notification.route_schedule.depth
- notification.route_schedule.oldest_age_ms
- notification.intent_stream.oldest_unprocessed_age_ms
Metrics intentionally avoid high-cardinality attributes such as user_id,
email address, notification_id, route_id, and idempotency_key.
Metric attributes may include notification_type, producer,
audience_kind, channel, result, outcome, failure_code, and
failure_classification.
Structured logs for intent intake, duplicate resolution, malformed-intent recording, route publication, retry scheduling, and dead-letter transitions use the same field names where the value exists:
- notification_id
- notification_type
- producer
- audience_kind
- idempotency_key
- route_id
- channel
- request_id
- trace_id
OpenTelemetry trace context is logged as otel_trace_id and otel_span_id
when the active context carries a valid span.
Recovery
The supported manual replay path for a dead-lettered notification route is to
publish a new compatible intent to notification:intents.
Recovery rules:
- inspect the notification_dead_letter_entry, notification_route, and owning notification_record
- confirm the downstream dependency or payload problem has been corrected
- publish a new intent with the same semantic payload_json and audience fields, but with a new producer-owned idempotency_key
- keep the old notification_dead_letter_entry untouched as audit history until its configured TTL expires
Manual Redis mutation of an existing route record or
notification:route_schedule is not a supported replay workflow.
Verification
Focused service-local coverage verifies:
- configuration loading and validation
- GET /healthz
- GET /readyz
- absence of /metrics
- Redis startup fast-fail behavior
- graceful shutdown of the private probe listener
- valid intent acceptance
- malformed intent rejection
- duplicate and conflicting duplicate handling
- user-targeted route enrichment from User Service
- recipient_not_found malformed-intent recording for unresolved user_id
- temporary User Service failure handling without stream-offset advance
- FlatBuffers payload encoding for all ten user-facing push notification_type values
- template-mode Mail Service command encoding for user and administrator email routes
- due-route loading, lease acquisition, route publication, retry reschedule, and dead-letter persistence in Redis
- push worker success, retry, and duplicate-prevention behavior across concurrent replicas
- email worker success, retry, and duplicate-prevention behavior across concurrent replicas
- OpenTelemetry metric recording for intent outcomes, malformed intents, user enrichment, route publication attempts, retries, dead letters, route-schedule gauges, and intent-stream lag
- Redis-backed route-schedule and intent-stream lag snapshots
- structured log field helper coverage through intake and publisher tests
- intent-consumer restart from 0-0 and from persisted stream offsets
- runtime wiring of the intent consumer and both route publishers
- shared galaxy/notificationintent producer constructors, validation, and Redis Stream publication compatibility
Cross-service coverage verifies:
- Notification Service -> User Service enrichment compatibility and failure handling
- Notification Service -> Gateway push compatibility for every user-facing notification_type
- Notification Service -> Mail Service template-mode handoff for every supported email type
- producer compatibility for Game Master, Game Lobby, and Geo Profile Service through galaxy/notificationintent
- explicit regression that auth-code email still bypasses Notification Service
- real black-box Notification Service -> Gateway push fan-out coverage
- real black-box Notification Service -> Mail Service template-mode handoff coverage
Real producer-boundary suites for Game Master, Game Lobby, and
Geo Profile Service should be added only when those service boundaries exist
in code.