# Mail Service

`Mail Service` is the internal e-mail delivery service of Galaxy.

Canonical contracts:

- [Internal REST API](api/internal-openapi.yaml)
- [Async generic command contract](api/delivery-commands-asyncapi.yaml)
- [Extended service docs](docs/README.md)

## Purpose

`Mail Service` owns durable intake, rendering, execution, retry, audit, and
operator recovery for outbound e-mail.

It does not decide whether a business event should become e-mail. That
decision belongs to `Notification Service`.

## Responsibility Boundaries

`Mail Service` is responsible for:

- direct auth-code mail intake from `Auth / Session Service`
- async generic mail intake from `Notification Service`
- validation of recipient envelope, payload shape, locale, and attachments
- deterministic template rendering for template-mode deliveries
- provider execution through `stub` or `smtp`
- retry scheduling, dead-letter escalation, and operator-visible audit state
- trusted operator reads and resend by clone creation

`Mail Service` is not responsible for:

- end-user authentication or authorization
- notification preference ownership
- deciding whether non-auth mail should be sent at all
- direct calls from `Geo Profile Service`
- hot-reloading templates or editing template catalog state at runtime

Cross-service routing rules:

- `Auth / Session Service -> Mail Service` is synchronous trusted REST
- `Notification Service -> Mail Service` is asynchronous `Redis Streams`
- `Geo Profile Service` must route optional admin e-mail through
  `Notification Service`, not directly to `Mail Service`
- auth-code delivery remains a direct `Auth / Session Service -> Mail Service`
  flow and does not pass through `Notification Service`

## Runtime Surface

`cmd/mail` starts one internal-only process with:

- one trusted internal HTTP listener on `MAIL_INTERNAL_HTTP_ADDR`
- one async command consumer reading from `MAIL_REDIS_COMMAND_STREAM`
- one attempt scheduler driven by Postgres `FOR UPDATE SKIP LOCKED`
- one attempt worker pool
- one SQL retention worker

The service has no public ingress and no dedicated admin listener.

Persistence split (steady state, see `docs/postgres-migration.md`):

- PostgreSQL is the source of truth for durable mail state: accepted
  deliveries, attempts, dead letters, payload bundles, malformed-command
  audit records, and idempotency reservations.
- Redis is the source of truth only for the inbound `mail:delivery_commands`
  stream and its persisted consumer offset.

Intentional runtime omissions:

- no `/healthz`
- no `/readyz`
- no `/metrics`

Operational behavior:

- startup performs bounded Redis and PostgreSQL connectivity checks and fails
  fast on invalid runtime configuration
- embedded goose migrations are applied strictly before any HTTP listener
  opens; a migration failure exits with non-zero status
- the template catalog is parsed once at startup and kept immutable for the
  lifetime of the process
- template changes require process restart
- operator handlers execute under `MAIL_OPERATOR_REQUEST_TIMEOUT`

## Configuration

Required for all starts:

- `MAIL_REDIS_MASTER_ADDR`
- `MAIL_REDIS_PASSWORD`
- `MAIL_POSTGRES_PRIMARY_DSN`

Primary configuration groups:

- process and logging:
  - `MAIL_SHUTDOWN_TIMEOUT`
  - `MAIL_LOG_LEVEL`
- internal HTTP:
  - `MAIL_INTERNAL_HTTP_ADDR`
  - `MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_READ_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_IDLE_TIMEOUT`
- Redis connectivity (`pkg/redisconn` shape):
  - `MAIL_REDIS_MASTER_ADDR`
  - `MAIL_REDIS_REPLICA_ADDRS` (comma-separated, optional)
  - `MAIL_REDIS_PASSWORD`
  - `MAIL_REDIS_DB`
  - `MAIL_REDIS_OPERATION_TIMEOUT`
  - `MAIL_REDIS_COMMAND_STREAM`
- PostgreSQL connectivity (`pkg/postgres` shape):
  - `MAIL_POSTGRES_PRIMARY_DSN`
  - `MAIL_POSTGRES_REPLICA_DSNS` (comma-separated, optional; reserved for
    future read routing)
  - `MAIL_POSTGRES_OPERATION_TIMEOUT`
  - `MAIL_POSTGRES_MAX_OPEN_CONNS`
  - `MAIL_POSTGRES_MAX_IDLE_CONNS`
  - `MAIL_POSTGRES_CONN_MAX_LIFETIME`
- SMTP provider:
  - `MAIL_SMTP_MODE=stub|smtp`
  - `MAIL_SMTP_ADDR`
  - `MAIL_SMTP_USERNAME`
  - `MAIL_SMTP_PASSWORD`
  - `MAIL_SMTP_FROM_EMAIL`
  - `MAIL_SMTP_FROM_NAME`
  - `MAIL_SMTP_TIMEOUT`
  - `MAIL_SMTP_INSECURE_SKIP_VERIFY`
- template catalog:
  - `MAIL_TEMPLATE_DIR`
- worker and operator behavior:
  - `MAIL_ATTEMPT_WORKER_CONCURRENCY`
  - `MAIL_STREAM_BLOCK_TIMEOUT`
  - `MAIL_OPERATOR_REQUEST_TIMEOUT`
  - `MAIL_IDEMPOTENCY_TTL`
- SQL retention worker:
  - `MAIL_DELIVERY_RETENTION` (default `30d`)
  - `MAIL_MALFORMED_COMMAND_RETENTION` (default `90d`)
  - `MAIL_CLEANUP_INTERVAL` (default `1h`)
- OpenTelemetry:
  - `OTEL_SERVICE_NAME`
  - `OTEL_TRACES_EXPORTER`
  - `OTEL_METRICS_EXPORTER`
  - `OTEL_EXPORTER_OTLP_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
  - `MAIL_OTEL_STDOUT_TRACES_ENABLED`
  - `MAIL_OTEL_STDOUT_METRICS_ENABLED`

Defaults worth knowing:

- `MAIL_INTERNAL_HTTP_ADDR=:8080`
- `MAIL_SMTP_MODE=stub`
- `MAIL_SMTP_TIMEOUT=15s`
- `MAIL_TEMPLATE_DIR=templates`
- `MAIL_ATTEMPT_WORKER_CONCURRENCY=4`
- `MAIL_STREAM_BLOCK_TIMEOUT=2s`
- `MAIL_OPERATOR_REQUEST_TIMEOUT=5s`
- `MAIL_SHUTDOWN_TIMEOUT=5s`
- `MAIL_IDEMPOTENCY_TTL=168h` (`7d`)
- `MAIL_DELIVERY_RETENTION=720h` (`30d`)
- `MAIL_MALFORMED_COMMAND_RETENTION=2160h` (`90d`)
- `MAIL_CLEANUP_INTERVAL=1h`

Additional SMTP note:

- `MAIL_SMTP_INSECURE_SKIP_VERIFY=false` by default and is intended only for
  local self-signed SMTP capture or similar non-production environments

Retired (Stage 4 of `PG_PLAN.md`): `MAIL_REDIS_ADDR`, `MAIL_REDIS_USERNAME`,
`MAIL_REDIS_TLS_ENABLED`, `MAIL_REDIS_ATTEMPT_SCHEDULE_KEY`,
`MAIL_REDIS_DEAD_LETTER_PREFIX`, `MAIL_DELIVERY_TTL`, `MAIL_ATTEMPT_TTL`.
The new connection envelope is supplied by `pkg/redisconn` and `pkg/postgres`,
and durable retention is enforced by the SQL retention worker against the
PostgreSQL-backed source of truth (see `docs/postgres-migration.md`).

## Stable Input Contracts

### 1. Auth delivery REST

Route:

- `POST /api/v1/internal/login-code-deliveries`

Headers:

- required `Idempotency-Key`

Request body:

- `email`
- `code`
- `locale`

Stable success outcomes:

- `sent`
- `suppressed`

Important semantics:

- `sent` means the request was durably accepted into the internal
  mail-delivery pipeline
- `sent` does not mean that SMTP delivery has already completed
- new durable auth deliveries surface as:
  - `queued` in `MAIL_SMTP_MODE=smtp`
  - `suppressed` in `MAIL_SMTP_MODE=stub`
- duplicate replays with the same normalized request return the same stable
  outcome
- mismatched replays on the same `(source, idempotency_key)` return
  `409 conflict`

### 2. Async generic command intake

Ingress stream:

- `mail:delivery_commands`

Stable envelope fields:

- `delivery_id`
- `source`
- `payload_mode`
- `idempotency_key`
- `requested_at_ms`
- `request_id`
- `trace_id`
- `payload_json`

Contract rules:

- async `source` is fixed to `notification`
- supported `payload_mode` values are `rendered` and `template`
- `Notification Service` uses only `payload_mode=template` for
  notification-generated mail, even though the generic async contract keeps
  both `rendered` and `template`
- notification-owned `template_id` values are identical to the
  `notification_type` vocabulary, for example `game.turn.ready` and
  `lobby.membership.approved`
- the real `Notification Service -> Mail Service` integration suite verifies
  template-mode handoff for notification-owned mail
- `requested_at_ms` stores the publisher-side original request timestamp in
  Unix milliseconds
- `request_id` and `trace_id` are observability-only metadata and do not
  participate in idempotency fingerprinting
- malformed commands are metered, logged, and recorded as dedicated
  malformed-command entries
- malformed commands do not create a durable delivery record
- stream offset advances only after durable acceptance or durable
  malformed-command recording

### 3. Trusted operator REST

Routes:

- `GET /api/v1/internal/deliveries`
- `GET /api/v1/internal/deliveries/{delivery_id}`
- `GET /api/v1/internal/deliveries/{delivery_id}/attempts`
- `POST /api/v1/internal/deliveries/{delivery_id}/resend`

List filters:

- `recipient`
- `status`
- `source`
- `template_id`
- `idempotency_key`
- `from_created_at_ms`
- `to_created_at_ms`
- `limit`
- `cursor`

Stable list behavior:

- ordering is `created_at_ms DESC`, then `delivery_id DESC`
- cursor is an opaque base64url encoding of `created_at_ms:delivery_id`
- `idempotency_key` without `source` matches across all stable sources

Stable resend rules:

- resend is clone-only
- resend is allowed only for terminal delivery states
- resend creates a new delivery with `source=operator_resend`
- resend clones preserve audit history of the original instead of mutating it

## Delivery Model

### Source vocabulary

Stable `mail_delivery.source` values:

- `authsession`
- `notification`
- `operator_resend`

### Payload modes

Stable `mail_delivery.payload_mode` values:

- `rendered`
- `template`

Rules:

- `rendered` stores final `subject`, `text_body`, and optional `html_body`
- `template` stores `template_id`, canonical `locale`, and strict JSON-object
  `template_variables`
- raw attachment bodies are stored separately from the delivery audit record

### Delivery statuses

Stable operator-visible `mail_delivery.status` values:

- `queued`
- `rendered`
- `sending`
- `sent`
- `suppressed`
- `failed`
- `dead_letter`

Status meanings:

- `queued`: durable intake completed and the next attempt is scheduled
- `rendered`: template content has been materialized
- `sending`: one worker currently owns the active attempt
- `sent`: provider accepted the envelope
- `suppressed`: delivery was intentionally skipped as a successful business
  outcome
- `failed`: terminal failure without dead-letter escalation
- `dead_letter`: retry budget was exhausted and operator follow-up is required

Stable transition rules:

- newly accepted durable deliveries surface as `queued` or `suppressed`
- `queued -> rendered` is used only for `payload_mode=template`
- `queued|rendered -> sending` happens on successful claim
- `sending -> sent|suppressed|failed|queued|dead_letter` depends on provider
  classification and retry policy

The internal type `delivery.StatusAccepted` still exists in code, but it is
not part of the stable public delivery-status vocabulary and is not emitted by
the current runtime.

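The transition rules above can be read as a small lookup table. The sketch below encodes only the transitions listed in this section; it is not the service's state machine, and the names `allowedTransitions` and `canTransition` are hypothetical.

```go
package main

import "fmt"

// allowedTransitions lists the documented edges. States without an entry
// (sent, suppressed, failed, dead_letter) have no outgoing edges here.
var allowedTransitions = map[string][]string{
	"queued":   {"rendered", "sending"},
	"rendered": {"sending"},
	"sending":  {"sent", "suppressed", "failed", "queued", "dead_letter"},
}

// canTransition reports whether from -> to is one of the documented edges
// for a delivery with the given payload_mode.
func canTransition(from, to, payloadMode string) bool {
	// queued -> rendered exists only for template-mode deliveries.
	if from == "queued" && to == "rendered" && payloadMode != "template" {
		return false
	}
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("queued", "rendered", "template"))
	fmt.Println(canTransition("queued", "rendered", "rendered"))
	fmt.Println(canTransition("sending", "queued", "template"))
	fmt.Println(canTransition("sent", "queued", "template"))
}
```

The `sending -> queued` edge is the retry path: a retryable failure puts the delivery back in the queue with a new scheduled attempt.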
### Attempt statuses

Stable `mail_attempt.status` values:

- `scheduled`
- `in_progress`
- `render_failed`
- `provider_accepted`
- `provider_rejected`
- `transport_failed`
- `timed_out`

Rules:

- there is at most one active `in_progress` attempt per delivery
- `render_failed` means template rendering failed before provider execution
- `provider_accepted` ends the delivery as `sent`
- `provider_rejected` is used for:
  - provider-side suppression ending in `suppressed`
  - permanent provider failure ending in `failed`
- retryable paths are expressed through:
  - `transport_failed`
  - `timed_out`

## Template and Locale Policy

Template layout:

- `<template_id>/<locale>/subject.tmpl`
- `<template_id>/<locale>/text.tmpl`
- optional `<template_id>/<locale>/html.tmpl`

Required auth fallback files:

- `auth.login_code/en/subject.tmpl`
- `auth.login_code/en/text.tmpl`

Notification-owned English template directories are frozen by
[`../notification/README.md`](../notification/README.md) and the service-local
[`Notification Service` docs](../notification/docs/README.md).
`auth.login_code` remains the required auth template family for the direct
`Auth / Session Service -> Mail Service` flow and is not part of the
notification-owned template set.

Rendering rules:

- the process loads the full catalog at startup
- exact locale match is attempted first
- the only fallback locale is `en`
- there are no intermediate reductions such as `fr-CA -> fr -> en`
- `locale_fallback_used=true` is stored durably when fallback is applied
- subject and text use `text/template`
- optional HTML uses `html/template`
- missing required variables and template lookup failures are classified into
  stable render-failure codes

## Persistence Layout

PostgreSQL `mail` schema (source of truth; see
[`docs/postgres-migration.md`](docs/postgres-migration.md)):

- `deliveries(delivery_id PK, source, status, payload_mode, …,
  idempotency_key, request_fingerprint, idempotency_expires_at,
  attempt_count, next_attempt_at, created_at, updated_at, …)` with
  `UNIQUE (source, idempotency_key)` and a partial scheduler index on
  `next_attempt_at`
- `delivery_recipients(delivery_id FK, kind, position, email)` with
  `kind ∈ {'to','cc','bcc','reply_to'}` and an `email` index that excludes
  `reply_to`
- `attempts(delivery_id FK, attempt_no, status, scheduled_for, started_at,
  finished_at, provider_classification, provider_summary)`,
  `PRIMARY KEY (delivery_id, attempt_no)`
- `dead_letters(delivery_id PK FK, final_attempt_no, failure_classification,
  provider_summary, recovery_hint, created_at)`
- `delivery_payloads(delivery_id PK FK, payload jsonb)` for raw attachment
  bundles
- `malformed_commands(stream_entry_id PK, delivery_id, source,
  idempotency_key, failure_code, failure_message, raw_fields jsonb,
  recorded_at)`

Redis surface (intake stream + offset only):

- `mail:delivery_commands`: async ingress Redis Stream
- `mail:stream_offsets:<stream>`: persisted consumer offset for the
  intake stream

Storage rules:

- timestamps are stored as PostgreSQL `timestamptz` and normalised to UTC
  at the adapter boundary
- malformed async commands are stored idempotently by `stream_entry_id`
- the `idempotency_expires_at` column is set per acceptance from
  `MAIL_IDEMPOTENCY_TTL` (default `7d`); resends store an empty fingerprint
  and a synthetic far-future expiry that the read helper treats as
  non-idempotent
- the SQL retention worker periodically deletes deliveries older than
  `MAIL_DELIVERY_RETENTION` (cascade) and malformed commands older than
  `MAIL_MALFORMED_COMMAND_RETENTION`

## Provider, Retry, and Failure Policy

Provider modes:

- `stub`
- `smtp`

SMTP rules:

- outbound SMTP requires `STARTTLS`
- servers without `STARTTLS` support are treated as permanent failure
- SMTP authentication is enabled only when both username and password are set

Retry ladder:

- attempt `1 -> 2`: `1m`
- attempt `2 -> 3`: `5m`
- attempt `3 -> 4`: `30m`
- after attempt `4`: `dead_letter`

Failure handling:

- retryable provider failures become `transport_failed` or `timed_out`, then
  either reschedule or escalate to `dead_letter`
- permanent provider failures become `failed`
- render failures become `failed` with `render_failed`
- stale claimed work is recovered after `MAIL_SMTP_TIMEOUT + 30s`

## Observability

The runtime exports telemetry through configured OpenTelemetry exporters only.

Main signals:

- `mail.delivery.accepted_auth`
- `mail.delivery.accepted_generic`
- `mail.delivery.suppressed`
- `mail.delivery.status_transitions`
- `mail.attempt.outcomes`
- `mail.delivery.dead_letters`
- `mail.template.locale_fallback`
- `mail.attempt_schedule.depth`
- `mail.attempt_schedule.oldest_age_ms`
- `mail.provider.send.duration_ms`
- `mail.stream_commands.malformed`

Additional behavior:

- internal HTTP uses `otelhttp`
- Redis clients use `redisotel`
- structured logs include `otel_trace_id` and `otel_span_id` when available

## Verification

Relevant commands:

- `cd mail && go test ./...`
- `cd integration && go test ./authsessionmail/...`
- `cd integration && go test ./gatewayauthsessionmail/...`

Extended references:

- [Runtime and components](docs/runtime.md)
- [Main flows](docs/flows.md)
- [Configuration and contract examples](docs/examples.md)
- [Operator runbook](docs/runbook.md)