feat: mail service

This commit is contained in:
Ilia Denisov
2026-04-17 18:39:16 +02:00
committed by GitHub
parent 23ffcb7535
commit 5b7593e6f6
183 changed files with 31215 additions and 248 deletions
# Mail Service
`Mail Service` is the internal e-mail delivery service of Galaxy.
Canonical contracts:
- [Internal REST API](api/internal-openapi.yaml)
- [Async generic command contract](api/delivery-commands-asyncapi.yaml)
- [Extended service docs](docs/README.md)
## Purpose
`Mail Service` owns durable intake, rendering, execution, retry, audit, and
operator recovery for outbound e-mail.
It does not decide whether a business event should result in an e-mail; that
decision belongs to `Notification Service`.
## Responsibility Boundaries
`Mail Service` is responsible for:
- direct auth-code mail intake from `Auth / Session Service`
- async generic mail intake from `Notification Service`
- validation of recipient envelope, payload shape, locale, and attachments
- deterministic template rendering for template-mode deliveries
- provider execution through `stub` or `smtp`
- retry scheduling, dead-letter escalation, and operator-visible audit state
- trusted operator reads and resend by clone creation
`Mail Service` is not responsible for:
- end-user authentication or authorization
- notification preference ownership
- deciding whether non-auth mail should be sent at all
- accepting direct calls from `Geo Profile Service`
- hot-reloading templates or editing template catalog state at runtime
Cross-service routing rules:
- `Auth / Session Service -> Mail Service` is synchronous trusted REST
- `Notification Service -> Mail Service` is asynchronous `Redis Streams`
- `Geo Profile Service` must route optional admin e-mail through
`Notification Service`, not directly to `Mail Service`
## Runtime Surface
`cmd/mail` starts one internal-only process with:
- one trusted internal HTTP listener on `MAIL_INTERNAL_HTTP_ADDR`
- one async command consumer
- one attempt scheduler
- one attempt worker pool
- one cleanup worker
The service has no public ingress and no dedicated admin listener.
Intentional runtime omissions:
- no `/healthz`
- no `/readyz`
- no `/metrics`
Operational behavior:
- startup performs bounded Redis connectivity checks and fails fast on invalid
runtime configuration
- the template catalog is parsed once at startup and kept immutable for the
lifetime of the process
- template changes require process restart
- operator handlers execute under `MAIL_OPERATOR_REQUEST_TIMEOUT`
## Configuration
Required for all starts:
- `MAIL_REDIS_ADDR`
Primary configuration groups:
- process and logging:
  - `MAIL_SHUTDOWN_TIMEOUT`
  - `MAIL_LOG_LEVEL`
- internal HTTP:
  - `MAIL_INTERNAL_HTTP_ADDR`
  - `MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_READ_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_IDLE_TIMEOUT`
- Redis connectivity:
  - `MAIL_REDIS_USERNAME`
  - `MAIL_REDIS_PASSWORD`
  - `MAIL_REDIS_DB`
  - `MAIL_REDIS_TLS_ENABLED`
  - `MAIL_REDIS_OPERATION_TIMEOUT`
  - `MAIL_REDIS_COMMAND_STREAM`
- SMTP provider:
  - `MAIL_SMTP_MODE=stub|smtp`
  - `MAIL_SMTP_ADDR`
  - `MAIL_SMTP_USERNAME`
  - `MAIL_SMTP_PASSWORD`
  - `MAIL_SMTP_FROM_EMAIL`
  - `MAIL_SMTP_FROM_NAME`
  - `MAIL_SMTP_TIMEOUT`
  - `MAIL_SMTP_INSECURE_SKIP_VERIFY`
- template catalog:
  - `MAIL_TEMPLATE_DIR`
- worker and operator behavior:
  - `MAIL_ATTEMPT_WORKER_CONCURRENCY`
  - `MAIL_STREAM_BLOCK_TIMEOUT`
  - `MAIL_OPERATOR_REQUEST_TIMEOUT`
- OpenTelemetry:
  - `OTEL_SERVICE_NAME`
  - `OTEL_TRACES_EXPORTER`
  - `OTEL_METRICS_EXPORTER`
  - `OTEL_EXPORTER_OTLP_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
  - `MAIL_OTEL_STDOUT_TRACES_ENABLED`
  - `MAIL_OTEL_STDOUT_METRICS_ENABLED`
Defaults worth knowing:
- `MAIL_INTERNAL_HTTP_ADDR=:8080`
- `MAIL_SMTP_MODE=stub`
- `MAIL_SMTP_TIMEOUT=15s`
- `MAIL_SMTP_INSECURE_SKIP_VERIFY=false`
- `MAIL_TEMPLATE_DIR=templates`
- `MAIL_ATTEMPT_WORKER_CONCURRENCY=4`
- `MAIL_STREAM_BLOCK_TIMEOUT=2s`
- `MAIL_OPERATOR_REQUEST_TIMEOUT=5s`
- `MAIL_SHUTDOWN_TIMEOUT=5s`
Additional SMTP note:
- setting `MAIL_SMTP_INSECURE_SKIP_VERIFY=true` is intended only for local
self-signed SMTP capture or similar non-production environments
Current implementation caveats:
- `MAIL_REDIS_COMMAND_STREAM` is honored by the async command consumer
- `MAIL_REDIS_ATTEMPT_SCHEDULE_KEY` and `MAIL_REDIS_DEAD_LETTER_PREFIX` are
parsed, but the Redis adapters still use the fixed keys
`mail:attempt_schedule` and `mail:dead_letters:<delivery_id>`
- `MAIL_IDEMPOTENCY_TTL`, `MAIL_DELIVERY_TTL`, and `MAIL_ATTEMPT_TTL` are
parsed, but the Redis adapters still enforce the fixed retentions of `7d`,
`30d`, and `90d` respectively
## Stable Input Contracts
### 1. Auth delivery REST
Route:
- `POST /api/v1/internal/login-code-deliveries`
Headers:
- required `Idempotency-Key`
Request body:
- `email`
- `code`
- `locale`
Stable success outcomes:
- `sent`
- `suppressed`
Important semantics:
- `sent` means the request was durably accepted into the internal
mail-delivery pipeline
- `sent` does not mean that SMTP delivery has already completed
- a new durable auth delivery starts with delivery status:
  - `queued` in `MAIL_SMTP_MODE=smtp`
  - `suppressed` in `MAIL_SMTP_MODE=stub`
- duplicate replays with the same normalized request return the same stable
outcome
- mismatched replays on the same `(source, idempotency_key)` return
`409 conflict`
### 2. Async generic command intake
Ingress stream:
- `mail:delivery_commands`
Stable envelope fields:
- `delivery_id`
- `source`
- `payload_mode`
- `idempotency_key`
- `request_id`
- `trace_id`
- `payload_json`
Contract rules:
- async `source` is fixed to `notification`
- supported `payload_mode` values are `rendered` and `template`
- `request_id` and `trace_id` are observability-only metadata and do not
participate in idempotency fingerprinting
- malformed commands are metered, logged, and recorded as dedicated
malformed-command entries
- malformed commands do not create a durable delivery record
- stream offset advances only after durable acceptance or durable
malformed-command recording
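
The envelope rules can be sketched as a validator plus an idempotency fingerprint. This is an illustration under assumptions: the hypothetical `parseCommand` receives the stream entry's field/value pairs as a map and treats `delivery_id` and `idempotency_key` as required, and the hypothetical `fingerprint` hashes every stable field except `request_id` and `trace_id`. The service's real normalization rules are not reproduced.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// command carries the stable envelope fields of one stream entry.
type command struct {
	DeliveryID     string
	Source         string
	PayloadMode    string
	IdempotencyKey string
	RequestID      string // observability only
	TraceID        string // observability only
	PayloadJSON    string
}

// parseCommand validates one entry; a non-nil error marks it as malformed.
func parseCommand(fields map[string]string) (command, error) {
	c := command{
		DeliveryID:     fields["delivery_id"],
		Source:         fields["source"],
		PayloadMode:    fields["payload_mode"],
		IdempotencyKey: fields["idempotency_key"],
		RequestID:      fields["request_id"],
		TraceID:        fields["trace_id"],
		PayloadJSON:    fields["payload_json"],
	}
	for _, f := range []struct{ name, value string }{
		{"delivery_id", c.DeliveryID},
		{"idempotency_key", c.IdempotencyKey},
	} {
		if f.value == "" {
			return command{}, fmt.Errorf("missing %s", f.name)
		}
	}
	if c.Source != "notification" {
		return command{}, fmt.Errorf("unsupported async source %q", c.Source)
	}
	if c.PayloadMode != "rendered" && c.PayloadMode != "template" {
		return command{}, fmt.Errorf("unsupported payload_mode %q", c.PayloadMode)
	}
	// payload_json must decode to a JSON object; "null" yields a nil map.
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(c.PayloadJSON), &obj); err != nil || obj == nil {
		return command{}, errors.New("payload_json must be a JSON object")
	}
	return c, nil
}

// fingerprint hashes the fields compared on replay. RequestID and TraceID
// are excluded so new observability metadata cannot cause a conflict.
func fingerprint(c command) string {
	h := sha256.New()
	for _, part := range []string{c.DeliveryID, c.Source, c.PayloadMode, c.IdempotencyKey, c.PayloadJSON} {
		// Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
		fmt.Fprintf(h, "%d:%s;", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

A non-nil error from `parseCommand` corresponds to the malformed path: the entry would be recorded under `mail:malformed_commands:<stream_entry_id>` before the stream offset advances.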
### 3. Trusted operator REST
Routes:
- `GET /api/v1/internal/deliveries`
- `GET /api/v1/internal/deliveries/{delivery_id}`
- `GET /api/v1/internal/deliveries/{delivery_id}/attempts`
- `POST /api/v1/internal/deliveries/{delivery_id}/resend`
List filters:
- `recipient`
- `status`
- `source`
- `template_id`
- `idempotency_key`
- `from_created_at_ms`
- `to_created_at_ms`
- `limit`
- `cursor`
Stable list behavior:
- ordering is `created_at_ms DESC`, then `delivery_id DESC`
- cursor is an opaque base64url encoding of `created_at_ms:delivery_id`
- `idempotency_key` without `source` matches across all stable sources
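
The ordering and cursor format above can be sketched as follows. The helper names are hypothetical, and unpadded base64url (`base64.RawURLEncoding`) is an assumption, since the contract only says base64url. Splitting on the first `:` is safe because `created_at_ms` is numeric, even if a `delivery_id` contains a colon.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

type listItem struct {
	CreatedAtMs int64
	DeliveryID  string
}

// encodeCursor builds the opaque cursor from the last item of a page.
func encodeCursor(it listItem) string {
	raw := strconv.FormatInt(it.CreatedAtMs, 10) + ":" + it.DeliveryID
	return base64.RawURLEncoding.EncodeToString([]byte(raw))
}

func decodeCursor(cursor string) (listItem, error) {
	raw, err := base64.RawURLEncoding.DecodeString(cursor)
	if err != nil {
		return listItem{}, fmt.Errorf("invalid cursor: %w", err)
	}
	ms, id, ok := strings.Cut(string(raw), ":")
	if !ok || id == "" {
		return listItem{}, errors.New("invalid cursor: missing delivery_id")
	}
	createdAt, err := strconv.ParseInt(ms, 10, 64)
	if err != nil {
		return listItem{}, fmt.Errorf("invalid cursor: %w", err)
	}
	return listItem{CreatedAtMs: createdAt, DeliveryID: id}, nil
}

// before reports whether a sorts before b under
// created_at_ms DESC, then delivery_id DESC.
func before(a, b listItem) bool {
	if a.CreatedAtMs != b.CreatedAtMs {
		return a.CreatedAtMs > b.CreatedAtMs
	}
	return a.DeliveryID > b.DeliveryID
}

// page returns up to limit items that sort strictly after the cursor item.
func page(items []listItem, after *listItem, limit int) []listItem {
	sorted := append([]listItem(nil), items...)
	sort.Slice(sorted, func(i, j int) bool { return before(sorted[i], sorted[j]) })
	out := []listItem{}
	for _, it := range sorted {
		if after != nil && !before(*after, it) {
			continue
		}
		out = append(out, it)
		if len(out) == limit {
			break
		}
	}
	return out
}
```

The `delivery_id` tie-breaker matters: without it, two deliveries created in the same millisecond could straddle a page boundary and be skipped or repeated.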
Stable resend rules:
- resend is clone-only
- resend is allowed only for terminal delivery states
- resend creates a new delivery with `source=operator_resend`
- resend never mutates the original delivery, so its audit history stays intact
## Delivery Model
### Source vocabulary
Stable `mail_delivery.source` values:
- `authsession`
- `notification`
- `operator_resend`
### Payload modes
Stable `mail_delivery.payload_mode` values:
- `rendered`
- `template`
Rules:
- `rendered` stores final `subject`, `text_body`, and optional `html_body`
- `template` stores `template_id`, canonical `locale`, and strict JSON-object
`template_variables`
- raw attachment bodies are stored separately from the delivery audit record
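
The two payload shapes can be sketched as strictly decoded Go structs. The struct and function names, and the assumption that `payload_json` uses exactly these field names, are illustrative. Attachments are omitted; because unknown fields are rejected, a real decoder would have to declare whatever attachment field the contract defines.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

type renderedPayload struct {
	Subject  string  `json:"subject"`
	TextBody string  `json:"text_body"`
	HTMLBody *string `json:"html_body,omitempty"` // optional
}

type templatePayload struct {
	TemplateID        string                     `json:"template_id"`
	Locale            string                     `json:"locale"`
	TemplateVariables map[string]json.RawMessage `json:"template_variables"`
}

// decodeStrict rejects unknown struct fields and trailing data.
func decodeStrict(raw []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return err
	}
	if _, err := dec.Token(); err != io.EOF {
		return errors.New("unexpected data after JSON object")
	}
	return nil
}

func decodePayload(mode string, raw []byte) (any, error) {
	switch mode {
	case "rendered":
		var p renderedPayload
		if err := decodeStrict(raw, &p); err != nil {
			return nil, err
		}
		if p.Subject == "" || p.TextBody == "" {
			return nil, errors.New("rendered payload needs subject and text_body")
		}
		return p, nil
	case "template":
		var p templatePayload
		if err := decodeStrict(raw, &p); err != nil {
			return nil, err
		}
		if p.TemplateID == "" || p.Locale == "" {
			return nil, errors.New("template payload needs template_id and locale")
		}
		// A missing or null template_variables leaves the map nil;
		// an array or scalar already failed to decode above.
		if p.TemplateVariables == nil {
			return nil, errors.New("template_variables must be a JSON object")
		}
		return p, nil
	default:
		return nil, fmt.Errorf("unsupported payload_mode %q", mode)
	}
}
```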
### Delivery statuses
Stable operator-visible `mail_delivery.status` values:
- `queued`
- `rendered`
- `sending`
- `sent`
- `suppressed`
- `failed`
- `dead_letter`
Status meanings:
- `queued`: durable intake completed and the next attempt is scheduled
- `rendered`: template content has been materialized
- `sending`: one worker currently owns the active attempt
- `sent`: provider accepted the envelope
- `suppressed`: delivery was intentionally skipped as a successful business
outcome
- `failed`: terminal failure without dead-letter escalation
- `dead_letter`: retry budget was exhausted and operator follow-up is required
Stable transition rules:
- newly accepted durable deliveries surface as `queued` or `suppressed`
- `queued -> rendered` is used only for `payload_mode=template`
- `queued|rendered -> sending` happens on successful claim
- `sending -> sent|suppressed|failed|queued|dead_letter` depends on provider
classification and retry policy
The internal type `delivery.StatusAccepted` still exists in code, but it is
not part of the stable public delivery-status vocabulary and is not emitted by
the current runtime.
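
The transition rules can be encoded as a small table. This sketch covers only the rules listed above; the `Status` type and `transition` function are hypothetical, and the initial `queued` or `suppressed` assignment at intake is not modeled.

```go
package main

import "fmt"

type Status string

const (
	Queued     Status = "queued"
	Rendered   Status = "rendered"
	Sending    Status = "sending"
	Sent       Status = "sent"
	Suppressed Status = "suppressed"
	Failed     Status = "failed"
	DeadLetter Status = "dead_letter"
)

// allowed lists the permitted successors of each non-terminal status.
var allowed = map[Status][]Status{
	Queued:   {Rendered, Sending},
	Rendered: {Sending},
	Sending:  {Sent, Suppressed, Failed, Queued, DeadLetter},
}

// transition checks a single status change against the table.
func transition(from, to Status, payloadMode string) error {
	if from == Queued && to == Rendered && payloadMode != "template" {
		return fmt.Errorf("queued -> rendered requires payload_mode=template, got %q", payloadMode)
	}
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("transition %s -> %s is not allowed", from, to)
}
```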
### Attempt statuses
Stable `mail_attempt.status` values:
- `scheduled`
- `in_progress`
- `render_failed`
- `provider_accepted`
- `provider_rejected`
- `transport_failed`
- `timed_out`
Rules:
- there is at most one active `in_progress` attempt per delivery
- `render_failed` means template rendering failed before provider execution
- `provider_accepted` ends the delivery as `sent`
- `provider_rejected` is used for:
  - provider-side suppression ending in `suppressed`
  - permanent provider failure ending in `failed`
- retryable paths are expressed through:
  - `transport_failed`
  - `timed_out`
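
These rules, together with the failure handling in the provider policy, suggest a mapping from a finished attempt to the next delivery status. In this hypothetical sketch the inputs `suppressedByProvider` and `retriesLeft` stand in for provider classification and the retry ladder, which a real worker would compute itself.

```go
package main

import "fmt"

type AttemptStatus string

const (
	Scheduled        AttemptStatus = "scheduled"
	InProgress       AttemptStatus = "in_progress"
	RenderFailed     AttemptStatus = "render_failed"
	ProviderAccepted AttemptStatus = "provider_accepted"
	ProviderRejected AttemptStatus = "provider_rejected"
	TransportFailed  AttemptStatus = "transport_failed"
	TimedOut         AttemptStatus = "timed_out"
)

// deliveryStatusAfter returns the delivery status that follows a
// finished attempt.
func deliveryStatusAfter(a AttemptStatus, suppressedByProvider, retriesLeft bool) (string, error) {
	switch a {
	case ProviderAccepted:
		return "sent", nil
	case ProviderRejected:
		if suppressedByProvider {
			return "suppressed", nil
		}
		return "failed", nil
	case RenderFailed:
		return "failed", nil
	case TransportFailed, TimedOut:
		if retriesLeft {
			return "queued", nil // rescheduled
		}
		return "dead_letter", nil
	default:
		return "", fmt.Errorf("attempt status %q is not final", a)
	}
}
```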
## Template and Locale Policy
Template layout:
- `<template_id>/<locale>/subject.tmpl`
- `<template_id>/<locale>/text.tmpl`
- optional `<template_id>/<locale>/html.tmpl`
Required auth fallback files:
- `auth.login_code/en/subject.tmpl`
- `auth.login_code/en/text.tmpl`
Rendering rules:
- the process loads the full catalog at startup
- exact locale match is attempted first
- the only fallback locale is `en`
- there is no intermediate fallback chain such as `fr-CA -> fr -> en`
- `locale_fallback_used=true` is stored durably when fallback is applied
- subject and text use `text/template`
- optional HTML uses `html/template`
- missing required variables and template lookup failures are classified into
stable render-failure codes
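
The lookup and rendering rules can be sketched with the standard `text/template` and `html/template` packages. The `catalog` type and helper names are hypothetical; `Option("missingkey=error")` is the standard-library setting that turns a missing map key into an execution error, which is one way to detect missing required variables (an HTML template would need the same option). Mapping errors to the stable render-failure codes is not shown.

```go
package main

import (
	"bytes"
	"fmt"
	htmltemplate "html/template"
	"io"
	"text/template"
)

// localeTemplates is one parsed <template_id>/<locale> directory.
type localeTemplates struct {
	subject, text *template.Template
	html          *htmltemplate.Template // nil when html.tmpl is absent
}

// catalog maps template_id -> locale -> templates; built once at startup.
type catalog map[string]map[string]localeTemplates

// resolve tries the exact locale, then "en", with no intermediate steps.
// The bool result is what would be stored as locale_fallback_used.
func (c catalog) resolve(templateID, locale string) (localeTemplates, bool, error) {
	locales, ok := c[templateID]
	if !ok {
		return localeTemplates{}, false, fmt.Errorf("unknown template %q", templateID)
	}
	if lt, ok := locales[locale]; ok {
		return lt, false, nil
	}
	if lt, ok := locales["en"]; ok {
		return lt, true, nil
	}
	return localeTemplates{}, false, fmt.Errorf("template %q has neither %q nor en", templateID, locale)
}

// mustText parses a text/template that fails on missing map keys.
func mustText(name, src string) *template.Template {
	return template.Must(template.New(name).Option("missingkey=error").Parse(src))
}

// executor is satisfied by both *template.Template and *htmltemplate.Template.
type executor interface {
	Execute(w io.Writer, data any) error
}

func execute(t executor, vars map[string]any) (string, error) {
	var buf bytes.Buffer
	if err := t.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// render materializes subject, text, and the optional HTML body.
func render(lt localeTemplates, vars map[string]any) (subject, text, html string, err error) {
	if subject, err = execute(lt.subject, vars); err != nil {
		return
	}
	if text, err = execute(lt.text, vars); err != nil {
		return
	}
	if lt.html != nil {
		html, err = execute(lt.html, vars)
	}
	return
}
```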
## Redis Logical Model
Primary keys:
- `mail:deliveries:<delivery_id>`
- `mail:attempts:<delivery_id>:<attempt_no>`
- `mail:idempotency:<source>:<idempotency_key>`
- `mail:dead_letters:<delivery_id>`
- `mail:delivery_payloads:<delivery_id>`
- `mail:malformed_commands:<stream_entry_id>`
- `mail:stream_offsets:<stream>`
Scheduling and ingress keys:
- `mail:delivery_commands`
- `mail:attempt_schedule`
Operator indexes:
- `mail:idx:recipient:<email>`
- `mail:idx:status:<status>`
- `mail:idx:source:<source>`
- `mail:idx:template:<template_id>`
- `mail:idx:idempotency:<source>:<idempotency_key>`
- `mail:idx:created_at`
- `mail:idx:malformed_command:created_at`
Storage rules:
- dynamic Redis key segments are base64url-encoded
- durable records are stored as strict JSON blobs
- timestamps are stored in Unix milliseconds
- raw attachment payloads are separated from audit metadata
- malformed async commands are stored idempotently by `stream_entry_id`
Current fixed retentions:
- idempotency: `7d`
- deliveries and payload audit: `30d`
- attempts and dead letters: `90d`
- malformed commands: `90d`
## Provider, Retry, and Failure Policy
Provider modes:
- `stub`
- `smtp`
SMTP rules:
- outbound SMTP requires `STARTTLS`
- servers without `STARTTLS` support are treated as permanent failure
- SMTP authentication is enabled only when both username and password are set
Retry ladder:
- attempt `1 -> 2`: `1m`
- attempt `2 -> 3`: `5m`
- attempt `3 -> 4`: `30m`
- after a retryable failure of attempt `4`: `dead_letter`
Failure handling:
- retryable provider failures are recorded as `transport_failed` or
`timed_out` attempts, after which the delivery is either rescheduled or
escalated to `dead_letter`
- permanent provider failures record a `provider_rejected` attempt and end the
delivery as `failed`
- render failures record a `render_failed` attempt and end the delivery as
`failed`
- stale claimed work is recovered after `MAIL_SMTP_TIMEOUT + 30s`
## Observability
The runtime exports telemetry through configured OpenTelemetry exporters only.
Main signals:
- `mail.delivery.accepted_auth`
- `mail.delivery.accepted_generic`
- `mail.delivery.suppressed`
- `mail.delivery.status_transitions`
- `mail.attempt.outcomes`
- `mail.delivery.dead_letters`
- `mail.template.locale_fallback`
- `mail.attempt_schedule.depth`
- `mail.attempt_schedule.oldest_age_ms`
- `mail.provider.send.duration_ms`
- `mail.stream_commands.malformed`
Additional behavior:
- internal HTTP uses `otelhttp`
- Redis clients use `redisotel`
- structured logs include `otel_trace_id` and `otel_span_id` when available
## Verification
Relevant commands:
- `cd mail && go test ./...`
- `cd integration && go test ./authsessionmail/...`
- `cd integration && go test ./gatewayauthsessionmail/...`
Extended references:
- [Runtime and components](docs/runtime.md)
- [Main flows](docs/flows.md)
- [Configuration and contract examples](docs/examples.md)
- [Operator runbook](docs/runbook.md)