# Mail Service
`Mail Service` is the internal e-mail delivery service of Galaxy.
Canonical contracts:
- [Internal REST API](api/internal-openapi.yaml)
- [Async generic command contract](api/delivery-commands-asyncapi.yaml)
- [Extended service docs](docs/README.md)
## Purpose
`Mail Service` owns durable intake, rendering, execution, retry, audit, and
operator recovery for outbound e-mail.
It does not decide whether a business event should become e-mail. That
decision belongs to `Notification Service`.
## Responsibility Boundaries
`Mail Service` is responsible for:
- direct auth-code mail intake from `Auth / Session Service`
- async generic mail intake from `Notification Service`
- validation of recipient envelope, payload shape, locale, and attachments
- deterministic template rendering for template-mode deliveries
- provider execution through `stub` or `smtp`
- retry scheduling, dead-letter escalation, and operator-visible audit state
- trusted operator reads and resend by clone creation
`Mail Service` is not responsible for:
- end-user authentication or authorization
- notification preference ownership
- deciding whether non-auth mail should be sent at all
- direct calls from `Geo Profile Service`
- hot-reloading templates or editing template catalog state at runtime
Cross-service routing rules:
- `Auth / Session Service -> Mail Service` is synchronous trusted REST
- `Notification Service -> Mail Service` is asynchronous `Redis Streams`
- `Geo Profile Service` must route optional admin e-mail through
`Notification Service`, not directly to `Mail Service`
- auth-code delivery remains a direct `Auth / Session Service -> Mail Service`
flow and does not pass through `Notification Service`
## Runtime Surface
`cmd/mail` starts one internal-only process with:
- one trusted internal HTTP listener on `MAIL_INTERNAL_HTTP_ADDR`
- one async command consumer reading from `MAIL_REDIS_COMMAND_STREAM`
- one attempt scheduler driven by Postgres `FOR UPDATE SKIP LOCKED`
- one attempt worker pool
- one SQL retention worker
The service has no public ingress and no dedicated admin listener.
Persistence split (steady state, see `docs/postgres-migration.md`):
- PostgreSQL is the source of truth for durable mail state — accepted
deliveries, attempts, dead letters, payload bundles, malformed-command
audit records, and idempotency reservations.
- Redis is the source of truth only for the inbound `mail:delivery_commands`
stream and its persisted consumer offset.
Intentional runtime omissions:
- no `/healthz`
- no `/readyz`
- no `/metrics`
Operational behavior:
- startup performs bounded Redis and PostgreSQL connectivity checks and fails
fast on invalid runtime configuration
- embedded goose migrations are applied strictly before any HTTP listener
opens; a migration failure exits with non-zero status
- the template catalog is parsed once at startup and kept immutable for the
lifetime of the process
- template changes require process restart
- operator handlers execute under `MAIL_OPERATOR_REQUEST_TIMEOUT`
## Configuration
Required for all starts:
- `MAIL_REDIS_MASTER_ADDR`
- `MAIL_REDIS_PASSWORD`
- `MAIL_POSTGRES_PRIMARY_DSN`
Primary configuration groups:
- process and logging:
  - `MAIL_SHUTDOWN_TIMEOUT`
  - `MAIL_LOG_LEVEL`
- internal HTTP:
  - `MAIL_INTERNAL_HTTP_ADDR`
  - `MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_READ_TIMEOUT`
  - `MAIL_INTERNAL_HTTP_IDLE_TIMEOUT`
- Redis connectivity (`pkg/redisconn` shape):
  - `MAIL_REDIS_MASTER_ADDR`
  - `MAIL_REDIS_REPLICA_ADDRS` (comma-separated, optional)
  - `MAIL_REDIS_PASSWORD`
  - `MAIL_REDIS_DB`
  - `MAIL_REDIS_OPERATION_TIMEOUT`
  - `MAIL_REDIS_COMMAND_STREAM`
- PostgreSQL connectivity (`pkg/postgres` shape):
  - `MAIL_POSTGRES_PRIMARY_DSN`
  - `MAIL_POSTGRES_REPLICA_DSNS` (comma-separated, optional; reserved for
    future read routing)
  - `MAIL_POSTGRES_OPERATION_TIMEOUT`
  - `MAIL_POSTGRES_MAX_OPEN_CONNS`
  - `MAIL_POSTGRES_MAX_IDLE_CONNS`
  - `MAIL_POSTGRES_CONN_MAX_LIFETIME`
- SMTP provider:
  - `MAIL_SMTP_MODE=stub|smtp`
  - `MAIL_SMTP_ADDR`
  - `MAIL_SMTP_USERNAME`
  - `MAIL_SMTP_PASSWORD`
  - `MAIL_SMTP_FROM_EMAIL`
  - `MAIL_SMTP_FROM_NAME`
  - `MAIL_SMTP_TIMEOUT`
  - `MAIL_SMTP_INSECURE_SKIP_VERIFY`
- template catalog:
  - `MAIL_TEMPLATE_DIR`
- worker and operator behavior:
  - `MAIL_ATTEMPT_WORKER_CONCURRENCY`
  - `MAIL_STREAM_BLOCK_TIMEOUT`
  - `MAIL_OPERATOR_REQUEST_TIMEOUT`
  - `MAIL_IDEMPOTENCY_TTL`
- SQL retention worker:
  - `MAIL_DELIVERY_RETENTION` (default `30d`)
  - `MAIL_MALFORMED_COMMAND_RETENTION` (default `90d`)
  - `MAIL_CLEANUP_INTERVAL` (default `1h`)
- OpenTelemetry:
  - `OTEL_SERVICE_NAME`
  - `OTEL_TRACES_EXPORTER`
  - `OTEL_METRICS_EXPORTER`
  - `OTEL_EXPORTER_OTLP_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
  - `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
  - `MAIL_OTEL_STDOUT_TRACES_ENABLED`
  - `MAIL_OTEL_STDOUT_METRICS_ENABLED`
Defaults worth knowing:
- `MAIL_INTERNAL_HTTP_ADDR=:8080`
- `MAIL_SMTP_MODE=stub`
- `MAIL_SMTP_TIMEOUT=15s`
- `MAIL_TEMPLATE_DIR=templates`
- `MAIL_ATTEMPT_WORKER_CONCURRENCY=4`
- `MAIL_STREAM_BLOCK_TIMEOUT=2s`
- `MAIL_OPERATOR_REQUEST_TIMEOUT=5s`
- `MAIL_SHUTDOWN_TIMEOUT=5s`
- `MAIL_IDEMPOTENCY_TTL=168h` (`7d`)
- `MAIL_DELIVERY_RETENTION=720h` (`30d`)
- `MAIL_MALFORMED_COMMAND_RETENTION=2160h` (`90d`)
- `MAIL_CLEANUP_INTERVAL=1h`
Additional SMTP note:
- `MAIL_SMTP_INSECURE_SKIP_VERIFY=false` by default and is intended only for
local self-signed SMTP capture or similar non-production environments
Retired (Stage 4 of `PG_PLAN.md`): `MAIL_REDIS_ADDR`, `MAIL_REDIS_USERNAME`,
`MAIL_REDIS_TLS_ENABLED`, `MAIL_REDIS_ATTEMPT_SCHEDULE_KEY`,
`MAIL_REDIS_DEAD_LETTER_PREFIX`, `MAIL_DELIVERY_TTL`, `MAIL_ATTEMPT_TTL`.
The new connection envelope is supplied by `pkg/redisconn` and `pkg/postgres`,
and durable retention is enforced by the SQL retention worker against the
PostgreSQL-backed source of truth (see `docs/postgres-migration.md`).
## Stable Input Contracts
### 1. Auth delivery REST
Route:
- `POST /api/v1/internal/login-code-deliveries`
Headers:
- required `Idempotency-Key`
Request body:
- `email`
- `code`
- `locale`
Stable success outcomes:
- `sent`
- `suppressed`
Important semantics:
- `sent` means the request was durably accepted into the internal
mail-delivery pipeline
- `sent` does not mean that SMTP delivery has already completed
- new durable auth deliveries surface with delivery status:
  - `queued` in `MAIL_SMTP_MODE=smtp`
  - `suppressed` in `MAIL_SMTP_MODE=stub`
- duplicate replays with the same normalized request return the same stable
outcome
- mismatched replays on the same `(source, idempotency_key)` return
`409 conflict`
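
A caller-side sketch of the auth delivery request described above. It assumes a JSON request body and a hypothetical base URL; the authoritative shapes live in the linked OpenAPI document, and the type and function names here are illustrative only:

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// loginCodeDelivery carries the three documented request fields.
// JSON encoding of the body is an assumption of this sketch.
type loginCodeDelivery struct {
	Email  string `json:"email"`
	Code   string `json:"code"`
	Locale string `json:"locale"`
}

// newLoginCodeRequest builds the POST request with the required
// Idempotency-Key header. It does not send anything.
func newLoginCodeRequest(baseURL, idempotencyKey string, body loginCodeDelivery) (*http.Request, error) {
	if idempotencyKey == "" {
		return nil, errors.New("idempotency key is required")
	}
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/internal/login-code-deliveries", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Idempotency-Key", idempotencyKey)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// The host name and key below are placeholders.
	req, err := newLoginCodeRequest("http://mail.internal:8080", "login-7f3a",
		loginCodeDelivery{Email: "pilot@example.com", Code: "123456", Locale: "en"})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("Idempotency-Key"))
}
```

Retrying with the same key and the same body is expected to return the same stable outcome, so a caller can safely resend the request built here after a timeout; changing the body under the same key should be treated as a programming error because it yields `409 conflict`.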
### 2. Async generic command intake
Ingress stream:
- `mail:delivery_commands`
Stable envelope fields:
- `delivery_id`
- `source`
- `payload_mode`
- `idempotency_key`
- `requested_at_ms`
- `request_id`
- `trace_id`
- `payload_json`
Contract rules:
- async `source` is fixed to `notification`
- supported `payload_mode` values are `rendered` and `template`
- `Notification Service` uses only `payload_mode=template` for
notification-generated mail, even though the generic async contract keeps
both `rendered` and `template`
- notification-owned `template_id` values are identical to the
`notification_type` vocabulary, for example `game.turn.ready` and
`lobby.membership.approved`
- the real `Notification Service -> Mail Service` integration suite verifies
template-mode handoff for notification-owned mail
- `requested_at_ms` stores the publisher-side original request timestamp in
Unix milliseconds
- `request_id` and `trace_id` are observability-only metadata and do not
participate in idempotency fingerprinting
- malformed commands are metered, logged, and recorded as dedicated
malformed-command entries
- malformed commands do not create a durable delivery record
- stream offset advances only after durable acceptance or durable
malformed-command recording
### 3. Trusted operator REST
Routes:
- `GET /api/v1/internal/deliveries`
- `GET /api/v1/internal/deliveries/{delivery_id}`
- `GET /api/v1/internal/deliveries/{delivery_id}/attempts`
- `POST /api/v1/internal/deliveries/{delivery_id}/resend`
List filters:
- `recipient`
- `status`
- `source`
- `template_id`
- `idempotency_key`
- `from_created_at_ms`
- `to_created_at_ms`
- `limit`
- `cursor`
Stable list behavior:
- ordering is `created_at_ms DESC`, then `delivery_id DESC`
- cursor is an opaque base64url encoding of `created_at_ms:delivery_id`
- `idempotency_key` without `source` matches across all stable sources
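
The cursor format can be illustrated with a small round trip. This is a sketch under one assumption: that the encoding is unpadded base64url (`base64.RawURLEncoding`); the padded variant would work the same way. Clients should still treat the cursor as opaque, and the function names are hypothetical.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// encodeCursor packs the last row of a page as created_at_ms:delivery_id.
func encodeCursor(createdAtMs int64, deliveryID string) string {
	raw := strconv.FormatInt(createdAtMs, 10) + ":" + deliveryID
	return base64.RawURLEncoding.EncodeToString([]byte(raw))
}

// decodeCursor reverses encodeCursor. It splits on the first colon only,
// so a delivery ID that itself contains a colon survives the round trip.
func decodeCursor(cursor string) (int64, string, error) {
	raw, err := base64.RawURLEncoding.DecodeString(cursor)
	if err != nil {
		return 0, "", fmt.Errorf("decode cursor: %w", err)
	}
	ms, id, found := strings.Cut(string(raw), ":")
	if !found || id == "" {
		return 0, "", errors.New("cursor must encode created_at_ms:delivery_id")
	}
	createdAtMs, err := strconv.ParseInt(ms, 10, 64)
	if err != nil {
		return 0, "", fmt.Errorf("decode cursor timestamp: %w", err)
	}
	return createdAtMs, id, nil
}

func main() {
	cursor := encodeCursor(1700000000000, "d-123")
	ms, id, err := decodeCursor(cursor)
	fmt.Println(cursor, ms, id, err)
}
```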
Stable resend rules:
- resend is clone-only
- resend is allowed only for terminal delivery states
- resend creates a new delivery with `source=operator_resend`
- a resend clone leaves the original delivery and its audit history unchanged
## Delivery Model
### Source vocabulary
Stable `mail_delivery.source` values:
- `authsession`
- `notification`
- `operator_resend`
### Payload modes
Stable `mail_delivery.payload_mode` values:
- `rendered`
- `template`
Rules:
- `rendered` stores final `subject`, `text_body`, and optional `html_body`
- `template` stores `template_id`, canonical `locale`, and strict JSON-object
`template_variables`
- raw attachment bodies are stored separately from the delivery audit record
### Delivery statuses
Stable operator-visible `mail_delivery.status` values:
- `queued`
- `rendered`
- `sending`
- `sent`
- `suppressed`
- `failed`
- `dead_letter`
Status meanings:
- `queued`: durable intake completed and the next attempt is scheduled
- `rendered`: template content has been materialized
- `sending`: one worker currently owns the active attempt
- `sent`: provider accepted the envelope
- `suppressed`: delivery was intentionally skipped as a successful business
outcome
- `failed`: terminal failure without dead-letter escalation
- `dead_letter`: retry budget was exhausted and operator follow-up is required
Stable transition rules:
- newly accepted durable deliveries surface as `queued` or `suppressed`
- `queued -> rendered` is used only for `payload_mode=template`
- `queued|rendered -> sending` happens on successful claim
- `sending -> sent|suppressed|failed|queued|dead_letter` depends on provider
classification and retry policy
The internal type `delivery.StatusAccepted` still exists in code, but it is
not part of the stable public delivery-status vocabulary and is not emitted by
the current runtime.
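
The transition rules can be read as a small lookup table. The sketch below encodes only the transitions listed above; the path a render failure takes to `failed` is not spelled out in those rules, so it is deliberately not modeled. The variable and function names are hypothetical.

```go
package main

import "fmt"

// allowedTransitions encodes the listed stable transition rules only.
var allowedTransitions = map[string][]string{
	"queued":   {"rendered", "sending"}, // "rendered" applies to payload_mode=template
	"rendered": {"sending"},
	"sending":  {"sent", "suppressed", "failed", "queued", "dead_letter"},
}

// terminalStatuses are the statuses with no outgoing transition in the table.
var terminalStatuses = map[string]bool{
	"sent": true, "suppressed": true, "failed": true, "dead_letter": true,
}

// canTransition reports whether from -> to appears in the table.
func canTransition(from, to string) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// isTerminal reports whether a delivery in this status would be eligible
// for operator resend, which is allowed only for terminal states.
func isTerminal(status string) bool {
	return terminalStatuses[status]
}

func main() {
	fmt.Println(canTransition("queued", "sending"), canTransition("sent", "queued"))
	fmt.Println(isTerminal("dead_letter"), isTerminal("sending"))
}
```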
### Attempt statuses
Stable `mail_attempt.status` values:
- `scheduled`
- `in_progress`
- `render_failed`
- `provider_accepted`
- `provider_rejected`
- `transport_failed`
- `timed_out`
Rules:
- there is at most one active `in_progress` attempt per delivery
- `render_failed` means template rendering failed before provider execution
- `provider_accepted` ends the delivery as `sent`
- `provider_rejected` is used for:
  - provider-side suppression ending in `suppressed`
  - permanent provider failure ending in `failed`
- retryable paths are expressed through:
  - `transport_failed`
  - `timed_out`
## Template and Locale Policy
Template layout:
- `<template_id>/<locale>/subject.tmpl`
- `<template_id>/<locale>/text.tmpl`
- optional `<template_id>/<locale>/html.tmpl`
Required auth fallback files:
- `auth.login_code/en/subject.tmpl`
- `auth.login_code/en/text.tmpl`
Notification-owned English template directories are frozen by
[`../notification/README.md`](../notification/README.md) and the service-local
[`Notification Service` docs](../notification/docs/README.md).
`auth.login_code` remains the required auth template family for the direct
`Auth / Session Service -> Mail Service` flow and is not part of the
notification-owned template set.
Rendering rules:
- the process loads the full catalog at startup
- exact locale match is attempted first
- the only fallback locale is `en`
- there are no intermediate reductions such as `fr-CA -> fr -> en`
- `locale_fallback_used=true` is stored durably when fallback is applied
- subject and text use `text/template`
- optional HTML uses `html/template`
- missing required variables and template lookup failures are classified into
stable render-failure codes
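
The locale rule and the missing-variable rule can be sketched together. The locale lookup follows the rules above directly. The rendering half assumes the `missingkey=error` option of `text/template` is how missing variables are detected, which this README does not confirm; the template text, the `code` variable, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

const fallbackLocale = "en"

// resolveLocale tries the exact locale first and then "en" only.
// There is no intermediate reduction such as fr-CA -> fr.
func resolveLocale(available map[string]bool, requested string) (locale string, fallbackUsed, ok bool) {
	if available[requested] {
		return requested, false, true
	}
	if available[fallbackLocale] {
		return fallbackLocale, true, true
	}
	return "", false, false
}

// renderSubject executes a subject template and fails when a referenced
// variable is absent from vars (assumed behavior, via missingkey=error).
func renderSubject(source string, vars map[string]any) (string, error) {
	tmpl, err := template.New("subject").Option("missingkey=error").Parse(source)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, vars); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	available := map[string]bool{"en": true, "fr": true}
	locale, fallbackUsed, ok := resolveLocale(available, "fr-CA")
	fmt.Println(locale, fallbackUsed, ok)

	subject, err := renderSubject("Your login code is {{.code}}", map[string]any{"code": "123456"})
	fmt.Println(subject, err)
}
```

With only `en` and `fr` available, a request for `fr-CA` resolves to `en` with the fallback flag set, which corresponds to `locale_fallback_used=true` in the stored delivery.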
## Persistence Layout
PostgreSQL `mail` schema (source of truth — see
[`docs/postgres-migration.md`](docs/postgres-migration.md)):
- `deliveries(delivery_id PK, source, status, payload_mode, …,
idempotency_key, request_fingerprint, idempotency_expires_at,
attempt_count, next_attempt_at, created_at, updated_at, …)` with
`UNIQUE (source, idempotency_key)` and a partial scheduler index on
`next_attempt_at`
- `delivery_recipients(delivery_id FK, kind, position, email)` with
`kind ∈ {'to','cc','bcc','reply_to'}` and an `email` index that excludes
`reply_to`
- `attempts(delivery_id FK, attempt_no, status, scheduled_for, started_at,
finished_at, provider_classification, provider_summary)`,
`PRIMARY KEY (delivery_id, attempt_no)`
- `dead_letters(delivery_id PK FK, final_attempt_no, failure_classification,
provider_summary, recovery_hint, created_at)`
- `delivery_payloads(delivery_id PK FK, payload jsonb)` for raw attachment
bundles
- `malformed_commands(stream_entry_id PK, delivery_id, source,
idempotency_key, failure_code, failure_message, raw_fields jsonb,
recorded_at)`
Redis surface (intake stream + offset only):
- `mail:delivery_commands` — async ingress Redis Stream
- `mail:stream_offsets:<stream>` — persisted consumer offset for the
intake stream
Storage rules:
- timestamps are stored as PostgreSQL `timestamptz` and normalised to UTC
at the adapter boundary
- malformed async commands are stored idempotently by `stream_entry_id`
- the `idempotency_expires_at` column is set per acceptance from
`MAIL_IDEMPOTENCY_TTL` (default `7d`); resends store an empty fingerprint
and a synthetic far-future expiry that the read helper treats as
non-idempotent
- the SQL retention worker periodically deletes deliveries older than
`MAIL_DELIVERY_RETENTION` (cascade) and malformed commands older than
`MAIL_MALFORMED_COMMAND_RETENTION`
## Provider, Retry, and Failure Policy
Provider modes:
- `stub`
- `smtp`
SMTP rules:
- outbound SMTP requires `STARTTLS`
- a server without `STARTTLS` support is treated as a permanent failure
- SMTP authentication is enabled only when both username and password are set
Retry ladder:
- attempt `1 -> 2`: `1m`
- attempt `2 -> 3`: `5m`
- attempt `3 -> 4`: `30m`
- after attempt `4`: `dead_letter`
Failure handling:
- retryable provider failures are recorded as attempt status
`transport_failed` or `timed_out`; the delivery is then either rescheduled
or escalated to `dead_letter`
- permanent provider failures end the delivery as `failed`
- render failures end the delivery as `failed`, with attempt status
`render_failed`
- stale claimed work is recovered after `MAIL_SMTP_TIMEOUT + 30s`
## Observability
The runtime exports telemetry through configured OpenTelemetry exporters only.
Main signals:
- `mail.delivery.accepted_auth`
- `mail.delivery.accepted_generic`
- `mail.delivery.suppressed`
- `mail.delivery.status_transitions`
- `mail.attempt.outcomes`
- `mail.delivery.dead_letters`
- `mail.template.locale_fallback`
- `mail.attempt_schedule.depth`
- `mail.attempt_schedule.oldest_age_ms`
- `mail.provider.send.duration_ms`
- `mail.stream_commands.malformed`
Additional behavior:
- internal HTTP uses `otelhttp`
- Redis clients use `redisotel`
- structured logs include `otel_trace_id` and `otel_span_id` when available
## Verification
Relevant commands:
- `cd mail && go test ./...`
- `cd integration && go test ./authsessionmail/...`
- `cd integration && go test ./gatewayauthsessionmail/...`
Extended references:
- [Runtime and components](docs/runtime.md)
- [Main flows](docs/flows.md)
- [Configuration and contract examples](docs/examples.md)
- [Operator runbook](docs/runbook.md)