# Mail Service

`Mail Service` is the internal e-mail delivery service of Galaxy.

Canonical contracts:

- [Internal REST API](api/internal-openapi.yaml)
- [Async generic command contract](api/delivery-commands-asyncapi.yaml)
- [Extended service docs](docs/README.md)

## Purpose

`Mail Service` owns durable intake, rendering, execution, retry, audit, and
operator recovery for outbound e-mail.

It does not decide whether a business event should become e-mail. That
decision belongs to `Notification Service`.

## Responsibility Boundaries

`Mail Service` is responsible for:

- direct auth-code mail intake from `Auth / Session Service`
- async generic mail intake from `Notification Service`
- validation of recipient envelope, payload shape, locale, and attachments
- deterministic template rendering for template-mode deliveries
- provider execution through `stub` or `smtp`
- retry scheduling, dead-letter escalation, and operator-visible audit state
- trusted operator reads and resend by clone creation

`Mail Service` is not responsible for:

- end-user authentication or authorization
- notification preference ownership
- deciding whether non-auth mail should be sent at all
- direct calls from `Geo Profile Service`
- hot-reloading templates or editing template catalog state at runtime

Cross-service routing rules:

- `Auth / Session Service -> Mail Service` is synchronous trusted REST
- `Notification Service -> Mail Service` is asynchronous `Redis Streams`
- `Geo Profile Service` must route optional admin e-mail through
  `Notification Service`, not directly to `Mail Service`

## Runtime Surface

`cmd/mail` starts one internal-only process with:

- one trusted internal HTTP listener on `MAIL_INTERNAL_HTTP_ADDR`
- one async command consumer
- one attempt scheduler
- one attempt worker pool
- one cleanup worker

The service has no public ingress and no dedicated admin listener.

Intentional runtime omissions:

- no `/healthz`
- no `/readyz`
- no `/metrics`

Operational behavior:

- startup performs bounded Redis connectivity checks and fails fast on invalid
  runtime configuration
- the template catalog is parsed once at startup and kept immutable for the
  lifetime of the process
- template changes require process restart
- operator handlers execute under `MAIL_OPERATOR_REQUEST_TIMEOUT`

## Configuration

Required for all starts:

- `MAIL_REDIS_ADDR`

Primary configuration groups:

- process and logging:
  - `MAIL_SHUTDOWN_TIMEOUT`
  - `MAIL_LOG_LEVEL`
- internal HTTP:
  - `MAIL_INTERNAL_HTTP_ADDR`
  - `MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_READ_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_IDLE_TIMEOUT`
- Redis connectivity:
  - `MAIL_REDIS_USERNAME`
  - `MAIL_REDIS_PASSWORD`
  - `MAIL_REDIS_DB`
  - `MAIL_REDIS_TLS_ENABLED`
  - `MAIL_REDIS_OPERATION_TIMEOUT`
  - `MAIL_REDIS_COMMAND_STREAM`
- SMTP provider:
  - `MAIL_SMTP_MODE=stub|smtp`
  - `MAIL_SMTP_ADDR`
  - `MAIL_SMTP_USERNAME`
  - `MAIL_SMTP_PASSWORD`
  - `MAIL_SMTP_FROM_EMAIL`
  - `MAIL_SMTP_FROM_NAME`
  - `MAIL_SMTP_TIMEOUT`
  - `MAIL_SMTP_INSECURE_SKIP_VERIFY`
- template catalog:
  - `MAIL_TEMPLATE_DIR`
- worker and operator behavior:
  - `MAIL_ATTEMPT_WORKER_CONCURRENCY`
  - `MAIL_STREAM_BLOCK_TIMEOUT`
  - `MAIL_OPERATOR_REQUEST_TIMEOUT`
- OpenTelemetry:
  - `OTEL_SERVICE_NAME`
  - `OTEL_TRACES_EXPORTER`
  - `OTEL_METRICS_EXPORTER`
  - `OTEL_EXPORTER_OTLP_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
  - `MAIL_OTEL_STDOUT_TRACES_ENABLED`
  - `MAIL_OTEL_STDOUT_METRICS_ENABLED`

Defaults worth knowing:

- `MAIL_INTERNAL_HTTP_ADDR=:8080`
- `MAIL_SMTP_MODE=stub`
- `MAIL_SMTP_TIMEOUT=15s`
- `MAIL_TEMPLATE_DIR=templates`
- `MAIL_ATTEMPT_WORKER_CONCURRENCY=4`
- `MAIL_STREAM_BLOCK_TIMEOUT=2s`
- `MAIL_OPERATOR_REQUEST_TIMEOUT=5s`
- `MAIL_SHUTDOWN_TIMEOUT=5s`

Additional SMTP note:

- `MAIL_SMTP_INSECURE_SKIP_VERIFY` defaults to `false`; enabling it is intended
  only for local self-signed SMTP capture or similar non-production
  environments
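
As a rough illustration of the fail-fast behavior and the documented defaults, the sketch below loads a small subset of the settings. The `Config` type, `loadConfig`, and `envOr` are hypothetical names, not the service's actual code, and parsing durations with `time.ParseDuration` is an assumption based on values such as `15s`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config is a hypothetical subset of the documented settings.
type Config struct {
	RedisAddr              string
	InternalHTTPAddr       string
	SMTPMode               string
	SMTPTimeout            time.Duration
	OperatorRequestTimeout time.Duration
}

// envOr returns the variable's value, or fallback when it is unset or empty.
func envOr(getenv func(string) string, key, fallback string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return fallback
}

// loadConfig applies the documented defaults and rejects invalid input early.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		RedisAddr:        getenv("MAIL_REDIS_ADDR"),
		InternalHTTPAddr: envOr(getenv, "MAIL_INTERNAL_HTTP_ADDR", ":8080"),
		SMTPMode:         envOr(getenv, "MAIL_SMTP_MODE", "stub"),
	}
	if cfg.RedisAddr == "" {
		return Config{}, errors.New("MAIL_REDIS_ADDR is required")
	}
	if cfg.SMTPMode != "stub" && cfg.SMTPMode != "smtp" {
		return Config{}, fmt.Errorf("MAIL_SMTP_MODE must be stub or smtp, got %q", cfg.SMTPMode)
	}
	var err error
	if cfg.SMTPTimeout, err = time.ParseDuration(envOr(getenv, "MAIL_SMTP_TIMEOUT", "15s")); err != nil {
		return Config{}, fmt.Errorf("MAIL_SMTP_TIMEOUT: %w", err)
	}
	if cfg.OperatorRequestTimeout, err = time.ParseDuration(envOr(getenv, "MAIL_OPERATOR_REQUEST_TIMEOUT", "5s")); err != nil {
		return Config{}, fmt.Errorf("MAIL_OPERATOR_REQUEST_TIMEOUT: %w", err)
	}
	return cfg, nil
}

func main() {
	// A fixed map stands in for the process environment; a real process
	// would pass os.Getenv instead.
	env := map[string]string{"MAIL_REDIS_ADDR": "localhost:6379"}
	cfg, err := loadConfig(func(key string) string { return env[key] })
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the loader testable without touching the real environment.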

Current implementation caveats:

- `MAIL_REDIS_COMMAND_STREAM` is effective for the async command consumer
- `MAIL_REDIS_ATTEMPT_SCHEDULE_KEY` and `MAIL_REDIS_DEAD_LETTER_PREFIX` are
  parsed but the Redis adapters still use the fixed keys
  `mail:attempt_schedule` and `mail:dead_letters:<delivery_id>`
- `MAIL_IDEMPOTENCY_TTL`, `MAIL_DELIVERY_TTL`, and `MAIL_ATTEMPT_TTL` are
  parsed but the Redis adapters still enforce fixed retentions of `7d`, `30d`,
  and `90d`

## Stable Input Contracts

### 1. Auth delivery REST

Route:

- `POST /api/v1/internal/login-code-deliveries`

Headers:

- required `Idempotency-Key`

Request body:

- `email`
- `code`
- `locale`

Stable success outcomes:

- `sent`
- `suppressed`

Important semantics:

- `sent` means the request was durably accepted into the internal
  mail-delivery pipeline
- `sent` does not mean that SMTP delivery has already completed
- new durable auth deliveries surface as:
  - `queued` in `MAIL_SMTP_MODE=smtp`
  - `suppressed` in `MAIL_SMTP_MODE=stub`
- duplicate replays with the same normalized request return the same stable
  outcome
- mismatched replays on the same `(source, idempotency_key)` return
  `409 conflict`
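
A minimal sketch of how a caller might build this request in Go follows. The route, the `Idempotency-Key` header, and the three body fields come from the contract above; the helper name `newLoginCodeRequest`, the base URL, the key value, and the `Content-Type` header are assumptions made for illustration. The response body shape is defined in the OpenAPI contract and is not modeled here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// loginCodeDelivery mirrors the three documented request body fields.
type loginCodeDelivery struct {
	Email  string `json:"email"`
	Code   string `json:"code"`
	Locale string `json:"locale"`
}

// newLoginCodeRequest is a hypothetical helper that builds the POST request.
func newLoginCodeRequest(baseURL, idempotencyKey string, body loginCodeDelivery) (*http.Request, error) {
	if idempotencyKey == "" {
		return nil, errors.New("idempotency key is required")
	}
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/internal/login-code-deliveries", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Idempotency-Key", idempotencyKey)
	req.Header.Set("Content-Type", "application/json") // assumed, not stated in this document
	return req, nil
}

func main() {
	// The host name and key value are placeholders.
	req, err := newLoginCodeRequest("http://mail.internal:8080", "login-7f3a",
		loginCodeDelivery{Email: "user@example.com", Code: "123456", Locale: "en"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Idempotency-Key"))
}
```

When retrying after a network error, the caller should reuse the same key and the same body so that the replay returns the original outcome instead of `409 conflict`.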

### 2. Async generic command intake

Ingress stream:

- `mail:delivery_commands`

Stable envelope fields:

- `delivery_id`
- `source`
- `payload_mode`
- `idempotency_key`
- `request_id`
- `trace_id`
- `payload_json`

Contract rules:

- async `source` is fixed to `notification`
- supported `payload_mode` values are `rendered` and `template`
- `request_id` and `trace_id` are observability-only metadata and do not
  participate in idempotency fingerprinting
- malformed commands are metered, logged, and recorded as dedicated
  malformed-command entries
- malformed commands do not create a durable delivery record
- stream offset advances only after durable acceptance or durable
  malformed-command recording
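
The sketch below shows one way to model the envelope and a fingerprint that ignores the observability-only fields. It is illustrative only: this document does not state which fields the real fingerprint covers or how the payload is normalized, so the choice of `delivery_id`, `payload_mode`, and `payload_json`, the SHA-256 hash, and the example payload are all assumptions. The authoritative payload shape lives in the AsyncAPI contract.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// commandEnvelope mirrors the stable envelope fields of one stream entry.
type commandEnvelope struct {
	DeliveryID     string
	Source         string
	PayloadMode    string
	IdempotencyKey string
	RequestID      string
	TraceID        string
	PayloadJSON    string
}

// streamFields returns the flat field/value pairs a producer would append
// to the mail:delivery_commands stream.
func (e commandEnvelope) streamFields() map[string]string {
	return map[string]string{
		"delivery_id":     e.DeliveryID,
		"source":          e.Source,
		"payload_mode":    e.PayloadMode,
		"idempotency_key": e.IdempotencyKey,
		"request_id":      e.RequestID,
		"trace_id":        e.TraceID,
		"payload_json":    e.PayloadJSON,
	}
}

// fingerprint hashes the fields assumed to define request identity.
// request_id and trace_id are deliberately excluded, as the contract requires.
func (e commandEnvelope) fingerprint() string {
	h := sha256.New()
	for _, part := range []string{e.DeliveryID, e.PayloadMode, e.PayloadJSON} {
		// The length prefix keeps adjacent fields from blending together.
		fmt.Fprintf(h, "%d:%s|", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	first := commandEnvelope{
		DeliveryID: "d-1", Source: "notification", PayloadMode: "rendered",
		IdempotencyKey: "evt-42", RequestID: "req-a", TraceID: "trace-a",
		PayloadJSON: `{"example":"hypothetical payload"}`,
	}
	replay := first
	replay.RequestID, replay.TraceID = "req-b", "trace-b"
	fmt.Println(first.fingerprint() == replay.fingerprint())
}
```

Under these assumptions a replay that differs only in `request_id` or `trace_id` compares equal to the original, while any payload change produces a different fingerprint and would be treated as a mismatched replay.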

### 3. Trusted operator REST

Routes:

- `GET /api/v1/internal/deliveries`
- `GET /api/v1/internal/deliveries/{delivery_id}`
- `GET /api/v1/internal/deliveries/{delivery_id}/attempts`
- `POST /api/v1/internal/deliveries/{delivery_id}/resend`

List filters:

- `recipient`
- `status`
- `source`
- `template_id`
- `idempotency_key`
- `from_created_at_ms`
- `to_created_at_ms`
- `limit`
- `cursor`

Stable list behavior:

- ordering is `created_at_ms DESC`, then `delivery_id DESC`
- cursor is an opaque base64url encoding of `created_at_ms:delivery_id`
- `idempotency_key` without `source` matches across all stable sources
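
For readers working on the server side, the cursor format can be sketched as below. The function names are hypothetical, and the unpadded base64url variant (`base64.RawURLEncoding`) is an assumption, since the document does not say whether padding is used. API clients should keep treating the cursor as opaque.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// encodeCursor packs the last list item's position as created_at_ms:delivery_id.
func encodeCursor(createdAtMs int64, deliveryID string) string {
	raw := strconv.FormatInt(createdAtMs, 10) + ":" + deliveryID
	return base64.RawURLEncoding.EncodeToString([]byte(raw))
}

// decodeCursor reverses encodeCursor and rejects malformed input.
func decodeCursor(cursor string) (int64, string, error) {
	raw, err := base64.RawURLEncoding.DecodeString(cursor)
	if err != nil {
		return 0, "", fmt.Errorf("decode cursor: %w", err)
	}
	// Split on the first colon only, so delivery IDs may contain colons.
	ts, id, ok := strings.Cut(string(raw), ":")
	if !ok || id == "" {
		return 0, "", errors.New("malformed cursor")
	}
	ms, err := strconv.ParseInt(ts, 10, 64)
	if err != nil {
		return 0, "", fmt.Errorf("cursor timestamp: %w", err)
	}
	return ms, id, nil
}

func main() {
	cursor := encodeCursor(1700000000000, "d-42")
	ms, id, err := decodeCursor(cursor)
	fmt.Println(cursor, ms, id, err)
}
```

Carrying both sort keys in the cursor lets the next page resume exactly after the last returned item, even when several deliveries share the same `created_at_ms`.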

Stable resend rules:

- resend is clone-only
- resend is allowed only for terminal delivery states
- resend creates a new delivery with `source=operator_resend`
- resend clones preserve audit history of the original instead of mutating it

## Delivery Model

### Source vocabulary

Stable `mail_delivery.source` values:

- `authsession`
- `notification`
- `operator_resend`

### Payload modes

Stable `mail_delivery.payload_mode` values:

- `rendered`
- `template`

Rules:

- `rendered` stores final `subject`, `text_body`, and optional `html_body`
- `template` stores `template_id`, canonical `locale`, and strict JSON-object
  `template_variables`
- raw attachment bodies are stored separately from the delivery audit record

### Delivery statuses

Stable operator-visible `mail_delivery.status` values:

- `queued`
- `rendered`
- `sending`
- `sent`
- `suppressed`
- `failed`
- `dead_letter`

Status meanings:

- `queued`: durable intake completed and the next attempt is scheduled
- `rendered`: template content has been materialized
- `sending`: one worker currently owns the active attempt
- `sent`: provider accepted the envelope
- `suppressed`: delivery was intentionally skipped as a successful business
  outcome
- `failed`: terminal failure without dead-letter escalation
- `dead_letter`: retry budget was exhausted and operator follow-up is required

Stable transition rules:

- newly accepted durable deliveries surface as `queued` or `suppressed`
- `queued -> rendered` is used only for `payload_mode=template`
- `queued|rendered -> sending` happens on successful claim
- `sending -> sent|suppressed|failed|queued|dead_letter` depends on provider
  classification and retry policy

The internal type `delivery.StatusAccepted` still exists in code, but it is
not part of the stable public delivery-status vocabulary and is not emitted by
the current runtime.
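
The listed transition rules can be written down as a small lookup table. This is a sketch that encodes only the transitions named above; the `transitions` table and `canTransition` function are hypothetical and do not reflect how the service itself enforces state changes.

```go
package main

import "fmt"

// transitions lists the allowed next statuses for each non-terminal status,
// taken from the stable transition rules.
var transitions = map[string][]string{
	"queued":   {"rendered", "sending"},
	"rendered": {"sending"},
	"sending":  {"sent", "suppressed", "failed", "queued", "dead_letter"},
}

// canTransition reports whether from -> to is allowed for the payload mode.
func canTransition(from, to, payloadMode string) bool {
	// queued -> rendered exists only for template-mode deliveries.
	if from == "queued" && to == "rendered" && payloadMode != "template" {
		return false
	}
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("queued", "rendered", "template")) // allowed for template mode
	fmt.Println(canTransition("queued", "rendered", "rendered")) // not allowed for rendered mode
	fmt.Println(canTransition("sent", "queued", "rendered"))     // sent has no outgoing transitions
}
```

Statuses with no entry in the table (`sent`, `suppressed`, `failed`, `dead_letter`) have no outgoing transitions in this sketch.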

### Attempt statuses

Stable `mail_attempt.status` values:

- `scheduled`
- `in_progress`
- `render_failed`
- `provider_accepted`
- `provider_rejected`
- `transport_failed`
- `timed_out`

Rules:

- there is at most one active `in_progress` attempt per delivery
- `render_failed` means template rendering failed before provider execution
- `provider_accepted` ends the delivery as `sent`
- `provider_rejected` is used for:
  - provider-side suppression ending in `suppressed`
  - permanent provider failure ending in `failed`
- retryable paths are expressed through:
  - `transport_failed`
  - `timed_out`

## Template and Locale Policy

Template layout:

- `<template_id>/<locale>/subject.tmpl`
- `<template_id>/<locale>/text.tmpl`
- optional `<template_id>/<locale>/html.tmpl`

Required auth fallback files:

- `auth.login_code/en/subject.tmpl`
- `auth.login_code/en/text.tmpl`

Rendering rules:

- the process loads the full catalog at startup
- exact locale match is attempted first
- the only fallback locale is `en`
- there are no intermediate reductions such as `fr-CA -> fr -> en`
- `locale_fallback_used=true` is stored durably when fallback is applied
- subject and text use `text/template`
- optional HTML uses `html/template`
- missing required variables and template lookup failures are classified into
  stable render-failure codes
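
The locale rules above can be illustrated with a small in-memory catalog. This is a sketch under stated assumptions: the `catalog` type, `resolve`, `render`, and `newDemoCatalog` are hypothetical, the template text and the `code` variable name are invented for the example, and using the `missingkey=error` option of `text/template` to detect missing variables is one possible approach, not a confirmed detail of the service.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// catalog maps template_id -> locale -> parsed text template.
type catalog map[string]map[string]*template.Template

// resolve tries the exact locale first and falls back only to "en".
// The boolean result reports whether the fallback was used.
func (c catalog) resolve(templateID, locale string) (*template.Template, bool, error) {
	locales, ok := c[templateID]
	if !ok {
		return nil, false, fmt.Errorf("unknown template %q", templateID)
	}
	if tmpl, ok := locales[locale]; ok {
		return tmpl, false, nil
	}
	if tmpl, ok := locales["en"]; ok {
		return tmpl, true, nil
	}
	return nil, false, fmt.Errorf("template %q has no %q or en variant", templateID, locale)
}

// render executes the template and fails when a referenced variable is missing.
func render(tmpl *template.Template, vars map[string]any) (string, error) {
	var sb strings.Builder
	if err := tmpl.Execute(&sb, vars); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// newDemoCatalog builds a tiny catalog with invented template text.
func newDemoCatalog() catalog {
	parse := func(name, text string) *template.Template {
		return template.Must(template.New(name).Option("missingkey=error").Parse(text))
	}
	return catalog{"auth.login_code": {
		"en": parse("en", "Your login code is {{.code}}"),
		"fr": parse("fr", "Votre code de connexion est {{.code}}"),
	}}
}

func main() {
	c := newDemoCatalog()
	// fr-CA has no exact match and is not reduced to fr; it falls back to en.
	tmpl, fallbackUsed, err := c.resolve("auth.login_code", "fr-CA")
	if err != nil {
		panic(err)
	}
	out, err := render(tmpl, map[string]any{"code": "123456"})
	fmt.Println(out, fallbackUsed, err)
}
```

Skipping intermediate reductions keeps resolution deterministic: a request either gets its exact locale or `en`, and the stored `locale_fallback_used` flag records which one happened.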

## Redis Logical Model

Primary keys:

- `mail:deliveries:<delivery_id>`
- `mail:attempts:<delivery_id>:<attempt_no>`
- `mail:idempotency:<source>:<idempotency_key>`
- `mail:dead_letters:<delivery_id>`
- `mail:delivery_payloads:<delivery_id>`
- `mail:malformed_commands:<stream_entry_id>`
- `mail:stream_offsets:<stream>`

Scheduling and ingress keys:

- `mail:delivery_commands`
- `mail:attempt_schedule`

Operator indexes:

- `mail:idx:recipient:<email>`
- `mail:idx:status:<status>`
- `mail:idx:source:<source>`
- `mail:idx:template:<template_id>`
- `mail:idx:idempotency:<source>:<idempotency_key>`
- `mail:idx:created_at`
- `mail:idx:malformed_command:created_at`

Storage rules:

- dynamic Redis key segments are base64url-encoded
- durable records are stored as strict JSON blobs
- timestamps are stored in Unix milliseconds
- raw attachment payloads are separated from audit metadata
- malformed async commands are stored idempotently by `stream_entry_id`
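
A short sketch of how such keys could be assembled follows. The helper names are hypothetical, and the unpadded base64url variant is an assumption; the key prefixes come from the lists above.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// segment encodes one dynamic key part so that characters such as ':' or '@'
// in the raw value cannot be confused with key separators.
func segment(raw string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(raw))
}

// deliveryKey builds mail:deliveries:<delivery_id>.
func deliveryKey(deliveryID string) string {
	return "mail:deliveries:" + segment(deliveryID)
}

// idempotencyKey builds mail:idempotency:<source>:<idempotency_key>.
func idempotencyKey(source, key string) string {
	return "mail:idempotency:" + segment(source) + ":" + segment(key)
}

// recipientIndexKey builds mail:idx:recipient:<email>.
func recipientIndexKey(email string) string {
	return "mail:idx:recipient:" + segment(email)
}

func main() {
	fmt.Println(deliveryKey("abc")) // mail:deliveries:YWJj
	fmt.Println(idempotencyKey("notification", "a:b"))
	fmt.Println(recipientIndexKey("user@example.com"))
}
```

Encoding each segment means a caller-supplied idempotency key that itself contains `:` still maps to exactly one Redis key.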

Current fixed retentions:

- idempotency: `7d`
- deliveries and payload audit: `30d`
- attempts and dead letters: `90d`
- malformed commands: `90d`

## Provider, Retry, and Failure Policy

Provider modes:

- `stub`
- `smtp`

SMTP rules:

- outbound SMTP requires `STARTTLS`
- servers without `STARTTLS` support are treated as permanent failure
- SMTP authentication is enabled only when both username and password are set

Retry ladder:

- attempt `1 -> 2`: `1m`
- attempt `2 -> 3`: `5m`
- attempt `3 -> 4`: `30m`
- after attempt `4`: `dead_letter`

Failure handling:

- retryable provider failures become `transport_failed` or `timed_out`, then
  either reschedule or escalate to `dead_letter`
- permanent provider failures become `failed`
- render failures become `failed` with `render_failed`
- stale claimed work is recovered after `MAIL_SMTP_TIMEOUT + 30s`
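
The retry ladder and failure handling above can be expressed as a few pure functions. This is a sketch with hypothetical names (`nextRetryDelay`, `deliveryStatusAfter`, `staleClaimAfter`); only the delays, the four-attempt budget, and the `+ 30s` stale-claim margin come from this document.

```go
package main

import (
	"fmt"
	"time"
)

// retryLadder holds the delay applied after attempts 1, 2, and 3.
var retryLadder = []time.Duration{time.Minute, 5 * time.Minute, 30 * time.Minute}

// nextRetryDelay returns the delay before the next attempt after a retryable
// failure of failedAttempt. It returns false once the ladder is exhausted;
// out-of-range attempt numbers are treated as exhausted as well.
func nextRetryDelay(failedAttempt int) (time.Duration, bool) {
	if failedAttempt < 1 || failedAttempt > len(retryLadder) {
		return 0, false
	}
	return retryLadder[failedAttempt-1], true
}

// deliveryStatusAfter maps a finished attempt to the resulting delivery status.
// provider_rejected is not handled because it needs the provider's own
// classification (suppressed versus failed); it returns "" here.
func deliveryStatusAfter(attemptStatus string, attemptNo int) string {
	switch attemptStatus {
	case "provider_accepted":
		return "sent"
	case "render_failed":
		return "failed"
	case "transport_failed", "timed_out":
		if _, ok := nextRetryDelay(attemptNo); ok {
			return "queued"
		}
		return "dead_letter"
	}
	return ""
}

// staleClaimAfter is the age at which claimed work is considered stale.
func staleClaimAfter(smtpTimeout time.Duration) time.Duration {
	return smtpTimeout + 30*time.Second
}

func main() {
	for attempt := 1; attempt <= 4; attempt++ {
		delay, ok := nextRetryDelay(attempt)
		fmt.Println(attempt, delay, ok, deliveryStatusAfter("timed_out", attempt))
	}
	fmt.Println(staleClaimAfter(15 * time.Second))
}
```

With the default `MAIL_SMTP_TIMEOUT=15s`, the stale-claim threshold in this sketch works out to 45 seconds.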

## Observability

The runtime exports telemetry through configured OpenTelemetry exporters only.

Main signals:

- `mail.delivery.accepted_auth`
- `mail.delivery.accepted_generic`
- `mail.delivery.suppressed`
- `mail.delivery.status_transitions`
- `mail.attempt.outcomes`
- `mail.delivery.dead_letters`
- `mail.template.locale_fallback`
- `mail.attempt_schedule.depth`
- `mail.attempt_schedule.oldest_age_ms`
- `mail.provider.send.duration_ms`
- `mail.stream_commands.malformed`

Additional behavior:

- internal HTTP uses `otelhttp`
- Redis clients use `redisotel`
- structured logs include `otel_trace_id` and `otel_span_id` when available

## Verification

Relevant commands:

- `cd mail && go test ./...`
- `cd integration && go test ./authsessionmail/...`
- `cd integration && go test ./gatewayauthsessionmail/...`

Extended references:

- [Runtime and components](docs/runtime.md)
- [Main flows](docs/flows.md)
- [Configuration and contract examples](docs/examples.md)
- [Operator runbook](docs/runbook.md)