Mail Service

Mail Service is the internal e-mail delivery service of Galaxy.

Canonical contracts:

Purpose

Mail Service owns durable intake, rendering, execution, retry, audit, and operator recovery for outbound e-mail.

It does not decide whether a business event should become e-mail. That decision belongs to Notification Service.

Responsibility Boundaries

Mail Service is responsible for:

  • direct auth-code mail intake from Auth / Session Service
  • async generic mail intake from Notification Service
  • validation of recipient envelope, payload shape, locale, and attachments
  • deterministic template rendering for template-mode deliveries
  • provider execution through stub or smtp
  • retry scheduling, dead-letter escalation, and operator-visible audit state
  • trusted operator reads and resend by clone creation

Mail Service is not responsible for:

  • end-user authentication or authorization
  • notification preference ownership
  • deciding whether non-auth mail should be sent at all
  • direct calls from Geo Profile Service
  • hot-reloading templates or editing template catalog state at runtime

Cross-service routing rules:

  • Auth / Session Service -> Mail Service is synchronous trusted REST
  • Notification Service -> Mail Service is asynchronous Redis Streams
  • Geo Profile Service must route optional admin e-mail through Notification Service, not directly to Mail Service
  • auth-code delivery remains a direct Auth / Session Service -> Mail Service flow and does not pass through Notification Service

Runtime Surface

cmd/mail starts one internal-only process with:

  • one trusted internal HTTP listener on MAIL_INTERNAL_HTTP_ADDR
  • one async command consumer reading from MAIL_REDIS_COMMAND_STREAM
  • one attempt scheduler driven by Postgres FOR UPDATE SKIP LOCKED
  • one attempt worker pool
  • one SQL retention worker

The service has no public ingress and no dedicated admin listener.

Persistence split (steady state, see docs/postgres-migration.md):

  • PostgreSQL is the source of truth for durable mail state — accepted deliveries, attempts, dead letters, payload bundles, malformed-command audit records, and idempotency reservations.
  • Redis is the source of truth only for the inbound mail:delivery_commands stream and its persisted consumer offset.

Intentional runtime omissions:

  • no /healthz
  • no /readyz
  • no /metrics

Operational behavior:

  • startup performs bounded Redis and PostgreSQL connectivity checks and fails fast on invalid runtime configuration
  • embedded goose migrations are applied strictly before any HTTP listener opens; a migration failure exits with non-zero status
  • the template catalog is parsed once at startup and kept immutable for the lifetime of the process
  • template changes require process restart
  • operator handlers execute under MAIL_OPERATOR_REQUEST_TIMEOUT

Configuration

Required for all starts:

  • MAIL_REDIS_MASTER_ADDR
  • MAIL_REDIS_PASSWORD
  • MAIL_POSTGRES_PRIMARY_DSN

Primary configuration groups:

  • process and logging:
    • MAIL_SHUTDOWN_TIMEOUT
    • MAIL_LOG_LEVEL
  • internal HTTP:
    • MAIL_INTERNAL_HTTP_ADDR
    • MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT
    • MAIL_INTERNAL_HTTP_READ_TIMEOUT
    • MAIL_INTERNAL_HTTP_IDLE_TIMEOUT
  • Redis connectivity (pkg/redisconn shape):
    • MAIL_REDIS_MASTER_ADDR
    • MAIL_REDIS_REPLICA_ADDRS (comma-separated, optional)
    • MAIL_REDIS_PASSWORD
    • MAIL_REDIS_DB
    • MAIL_REDIS_OPERATION_TIMEOUT
    • MAIL_REDIS_COMMAND_STREAM
  • PostgreSQL connectivity (pkg/postgres shape):
    • MAIL_POSTGRES_PRIMARY_DSN
    • MAIL_POSTGRES_REPLICA_DSNS (comma-separated, optional; reserved for future read routing)
    • MAIL_POSTGRES_OPERATION_TIMEOUT
    • MAIL_POSTGRES_MAX_OPEN_CONNS
    • MAIL_POSTGRES_MAX_IDLE_CONNS
    • MAIL_POSTGRES_CONN_MAX_LIFETIME
  • SMTP provider:
    • MAIL_SMTP_MODE=stub|smtp
    • MAIL_SMTP_ADDR
    • MAIL_SMTP_USERNAME
    • MAIL_SMTP_PASSWORD
    • MAIL_SMTP_FROM_EMAIL
    • MAIL_SMTP_FROM_NAME
    • MAIL_SMTP_TIMEOUT
    • MAIL_SMTP_INSECURE_SKIP_VERIFY
  • template catalog:
    • MAIL_TEMPLATE_DIR
  • worker and operator behavior:
    • MAIL_ATTEMPT_WORKER_CONCURRENCY
    • MAIL_STREAM_BLOCK_TIMEOUT
    • MAIL_OPERATOR_REQUEST_TIMEOUT
    • MAIL_IDEMPOTENCY_TTL
  • SQL retention worker:
    • MAIL_DELIVERY_RETENTION (default 720h, i.e. 30d)
    • MAIL_MALFORMED_COMMAND_RETENTION (default 2160h, i.e. 90d)
    • MAIL_CLEANUP_INTERVAL (default 1h)
  • OpenTelemetry:
    • OTEL_SERVICE_NAME
    • OTEL_TRACES_EXPORTER
    • OTEL_METRICS_EXPORTER
    • OTEL_EXPORTER_OTLP_PROTOCOL
    • OTEL_EXPORTER_OTLP_TRACES_PROTOCOL
    • OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
    • MAIL_OTEL_STDOUT_TRACES_ENABLED
    • MAIL_OTEL_STDOUT_METRICS_ENABLED

Defaults worth knowing:

  • MAIL_INTERNAL_HTTP_ADDR=:8080
  • MAIL_SMTP_MODE=stub
  • MAIL_SMTP_TIMEOUT=15s
  • MAIL_TEMPLATE_DIR=templates
  • MAIL_ATTEMPT_WORKER_CONCURRENCY=4
  • MAIL_STREAM_BLOCK_TIMEOUT=2s
  • MAIL_OPERATOR_REQUEST_TIMEOUT=5s
  • MAIL_SHUTDOWN_TIMEOUT=5s
  • MAIL_IDEMPOTENCY_TTL=168h (7d)
  • MAIL_DELIVERY_RETENTION=720h (30d)
  • MAIL_MALFORMED_COMMAND_RETENTION=2160h (90d)
  • MAIL_CLEANUP_INTERVAL=1h

Additional SMTP note:

  • MAIL_SMTP_INSECURE_SKIP_VERIFY defaults to false; enabling it is intended only for local self-signed SMTP capture or similar non-production environments

Retired (Stage 4 of PG_PLAN.md): MAIL_REDIS_ADDR, MAIL_REDIS_USERNAME, MAIL_REDIS_TLS_ENABLED, MAIL_REDIS_ATTEMPT_SCHEDULE_KEY, MAIL_REDIS_DEAD_LETTER_PREFIX, MAIL_DELIVERY_TTL, MAIL_ATTEMPT_TTL. The new connection envelope is supplied by pkg/redisconn and pkg/postgres, and durable retention is enforced by the SQL retention worker against the PostgreSQL-backed source of truth (see docs/postgres-migration.md).

Stable Input Contracts

1. Auth delivery REST

Route:

  • POST /api/v1/internal/login-code-deliveries

Headers:

  • required Idempotency-Key

Request body:

  • email
  • code
  • locale

Stable success outcomes:

  • sent
  • suppressed

Important semantics:

  • sent means the request was durably accepted into the internal mail-delivery pipeline
  • sent does not mean that SMTP delivery has already completed
  • the delivery record created for a new auth request surfaces with delivery status:
    • queued in MAIL_SMTP_MODE=smtp
    • suppressed in MAIL_SMTP_MODE=stub
  • duplicate replays with the same normalized request return the same stable outcome
  • mismatched replays on the same (source, idempotency_key) return 409 conflict
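
The replay semantics above can be illustrated with a small in-memory model. This is a sketch only: the durable store is PostgreSQL, and the fingerprint algorithm, the normalization (lowercasing and trimming the e-mail address), and all type and function names here are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// errConflict stands in for the 409 conflict response.
var errConflict = errors.New("idempotency conflict")

// reservation is what the sketch remembers per (source, idempotency_key).
type reservation struct {
	fingerprint string
	outcome     string
}

// reservations is a hypothetical in-memory stand-in for the durable store.
type reservations map[[2]string]reservation

// fingerprint hashes the normalized request body. The normalization used
// here is an assumption, not the documented one.
func fingerprint(email, code, locale string) string {
	normalized := strings.Join([]string{
		strings.ToLower(strings.TrimSpace(email)), code, locale,
	}, "\x00")
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

// accept returns the stored outcome for a matching replay, errConflict for a
// mismatched replay, and records the outcome for a first-time request.
func (r reservations) accept(source, key, fp, outcome string) (string, error) {
	k := [2]string{source, key}
	if prev, ok := r[k]; ok {
		if prev.fingerprint != fp {
			return "", errConflict
		}
		return prev.outcome, nil
	}
	r[k] = reservation{fingerprint: fp, outcome: outcome}
	return outcome, nil
}

func main() {
	store := reservations{}
	fp := fingerprint("pilot@example.com", "123456", "en")

	first, err := store.accept("authsession", "key-1", fp, "sent")
	fmt.Println(first, err)

	// Same normalized request: the original outcome is returned again.
	replay, err := store.accept("authsession", "key-1", fp, "suppressed")
	fmt.Println(replay, err)

	// Same key, different body: conflict.
	_, err = store.accept("authsession", "key-1",
		fingerprint("pilot@example.com", "654321", "en"), "sent")
	fmt.Println(err)
}
```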

2. Async generic command intake

Ingress stream:

  • mail:delivery_commands

Stable envelope fields:

  • delivery_id
  • source
  • payload_mode
  • idempotency_key
  • requested_at_ms
  • request_id
  • trace_id
  • payload_json

Contract rules:

  • async source is fixed to notification
  • supported payload_mode values are rendered and template
  • Notification Service uses only payload_mode=template for notification-generated mail, even though the generic async contract keeps both rendered and template
  • notification-owned template_id values are identical to the notification_type vocabulary, for example game.turn.ready and lobby.membership.approved
  • the real Notification Service -> Mail Service integration suite verifies template-mode handoff for notification-owned mail
  • requested_at_ms stores the publisher-side original request timestamp in Unix milliseconds
  • request_id and trace_id are observability-only metadata and do not participate in idempotency fingerprinting
  • malformed commands are metered, logged, and recorded as dedicated malformed-command entries
  • malformed commands do not create a durable delivery record
  • stream offset advances only after durable acceptance or durable malformed-command recording
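
As an illustration of the envelope rules, the sketch below validates one stream entry represented as a string map, which is how Redis Stream fields arrive. The `command` type, the `parseCommand` name, and the choice of which fields are treated as required are assumptions; only the field names, the fixed `notification` source, and the two payload modes come from the contract above.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// command mirrors the stable envelope fields. Hypothetical type.
type command struct {
	DeliveryID     string
	Source         string
	PayloadMode    string
	IdempotencyKey string
	RequestedAtMS  int64
	RequestID      string // observability only
	TraceID        string // observability only
	PayloadJSON    string
}

// parseCommand validates a raw stream entry. A non-nil error means the entry
// would be recorded as a malformed command rather than as a delivery.
func parseCommand(fields map[string]string) (command, error) {
	// Assumed required set: everything except request_id and trace_id.
	required := []string{
		"delivery_id", "source", "payload_mode",
		"idempotency_key", "requested_at_ms", "payload_json",
	}
	for _, name := range required {
		if fields[name] == "" {
			return command{}, fmt.Errorf("missing field %q", name)
		}
	}
	if fields["source"] != "notification" {
		return command{}, errors.New("source must be notification")
	}
	switch fields["payload_mode"] {
	case "rendered", "template":
	default:
		return command{}, fmt.Errorf("unsupported payload_mode %q", fields["payload_mode"])
	}
	ms, err := strconv.ParseInt(fields["requested_at_ms"], 10, 64)
	if err != nil {
		return command{}, fmt.Errorf("requested_at_ms: %w", err)
	}
	return command{
		DeliveryID:     fields["delivery_id"],
		Source:         fields["source"],
		PayloadMode:    fields["payload_mode"],
		IdempotencyKey: fields["idempotency_key"],
		RequestedAtMS:  ms,
		RequestID:      fields["request_id"],
		TraceID:        fields["trace_id"],
		PayloadJSON:    fields["payload_json"],
	}, nil
}

func main() {
	cmd, err := parseCommand(map[string]string{
		"delivery_id":     "d-1",
		"source":          "notification",
		"payload_mode":    "template",
		"idempotency_key": "k-1",
		"requested_at_ms": "1714156479000",
		"payload_json":    `{"template_id":"game.turn.ready"}`,
	})
	fmt.Println(cmd.DeliveryID, cmd.PayloadMode, err)
}
```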

3. Trusted operator REST

Routes:

  • GET /api/v1/internal/deliveries
  • GET /api/v1/internal/deliveries/{delivery_id}
  • GET /api/v1/internal/deliveries/{delivery_id}/attempts
  • POST /api/v1/internal/deliveries/{delivery_id}/resend

List filters:

  • recipient
  • status
  • source
  • template_id
  • idempotency_key
  • from_created_at_ms
  • to_created_at_ms
  • limit
  • cursor

Stable list behavior:

  • ordering is created_at_ms DESC, then delivery_id DESC
  • cursor is an opaque base64url encoding of created_at_ms:delivery_id
  • filtering by idempotency_key without source matches deliveries from every stable source

Stable resend rules:

  • resend is clone-only
  • resend is allowed only for terminal delivery states
  • resend creates a new delivery with source=operator_resend
  • a resend clone leaves the audit history of the original delivery intact instead of mutating it

Delivery Model

Source vocabulary

Stable mail_delivery.source values:

  • authsession
  • notification
  • operator_resend

Payload modes

Stable mail_delivery.payload_mode values:

  • rendered
  • template

Rules:

  • rendered stores final subject, text_body, and optional html_body
  • template stores template_id, canonical locale, and strict JSON-object template_variables
  • raw attachment bodies are stored separately from the delivery audit record

Delivery statuses

Stable operator-visible mail_delivery.status values:

  • queued
  • rendered
  • sending
  • sent
  • suppressed
  • failed
  • dead_letter

Status meanings:

  • queued: durable intake completed and the next attempt is scheduled
  • rendered: template content has been materialized
  • sending: one worker currently owns the active attempt
  • sent: provider accepted the envelope
  • suppressed: delivery was intentionally skipped as a successful business outcome
  • failed: terminal failure without dead-letter escalation
  • dead_letter: retry budget was exhausted and operator follow-up is required

Stable transition rules:

  • newly accepted durable deliveries surface as queued or suppressed
  • queued -> rendered is used only for payload_mode=template
  • queued|rendered -> sending happens on successful claim
  • sending -> sent|suppressed|failed|queued|dead_letter depends on provider classification and retry policy

The internal type delivery.StatusAccepted still exists in code, but it is not part of the stable public delivery-status vocabulary and is not emitted by the current runtime.
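
The stable transition rules can be written down as a small predicate. This sketch encodes only the transitions listed above and nothing else; the function name `canTransition` is hypothetical, and the real runtime may enforce transitions differently, for example in SQL.

```go
package main

import "fmt"

// canTransition reports whether a delivery may move from one stable status
// to another, according to the listed transition rules only.
func canTransition(from, to, payloadMode string) bool {
	switch from {
	case "queued":
		// queued -> rendered is used only for template-mode deliveries.
		return to == "sending" || (to == "rendered" && payloadMode == "template")
	case "rendered":
		return to == "sending"
	case "sending":
		switch to {
		case "sent", "suppressed", "failed", "queued", "dead_letter":
			return true
		}
	}
	// sent, suppressed, failed, and dead_letter have no outgoing transitions.
	return false
}

func main() {
	fmt.Println(canTransition("queued", "rendered", "template"))
	fmt.Println(canTransition("queued", "rendered", "rendered"))
	fmt.Println(canTransition("sent", "queued", "template"))
}
```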

Attempt statuses

Stable mail_attempt.status values:

  • scheduled
  • in_progress
  • render_failed
  • provider_accepted
  • provider_rejected
  • transport_failed
  • timed_out

Rules:

  • there is at most one active in_progress attempt per delivery
  • render_failed means template rendering failed before provider execution
  • provider_accepted ends the delivery as sent
  • provider_rejected is used for:
    • provider-side suppression ending in suppressed
    • permanent provider failure ending in failed
  • retryable paths are expressed through:
    • transport_failed
    • timed_out
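
To make the relationship between attempt outcomes and delivery statuses concrete, here is a hedged sketch of the mapping described by the rules above and by the failure-handling policy later in this document. The function name and the two boolean inputs are assumptions; in particular, how the runtime distinguishes provider-side suppression from permanent rejection is not specified here.

```go
package main

import "fmt"

// deliveryStatusAfter maps a finished attempt to the resulting delivery
// status. suppressed distinguishes the two provider_rejected cases, and
// retriesLeft says whether the retry budget still allows another attempt.
func deliveryStatusAfter(attemptStatus string, suppressed, retriesLeft bool) (string, error) {
	switch attemptStatus {
	case "provider_accepted":
		return "sent", nil
	case "provider_rejected":
		if suppressed {
			return "suppressed", nil
		}
		return "failed", nil
	case "render_failed":
		return "failed", nil
	case "transport_failed", "timed_out":
		if retriesLeft {
			return "queued", nil // rescheduled for the next attempt
		}
		return "dead_letter", nil
	}
	// scheduled and in_progress are not finished outcomes.
	return "", fmt.Errorf("attempt status %q is not a finished outcome", attemptStatus)
}

func main() {
	status, err := deliveryStatusAfter("timed_out", false, false)
	fmt.Println(status, err)
}
```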

Template and Locale Policy

Template layout:

  • <template_id>/<locale>/subject.tmpl
  • <template_id>/<locale>/text.tmpl
  • optional <template_id>/<locale>/html.tmpl

Required auth fallback files:

  • auth.login_code/en/subject.tmpl
  • auth.login_code/en/text.tmpl

Notification-owned English template directories are frozen by ../notification/README.md and the service-local Notification Service docs. auth.login_code remains the required auth template family for the direct Auth / Session Service -> Mail Service flow and is not part of the notification-owned template set.

Rendering rules:

  • the process loads the full catalog at startup
  • exact locale match is attempted first
  • the only fallback locale is en
  • there are no intermediate reductions such as fr-CA -> fr -> en
  • locale_fallback_used=true is stored durably when fallback is applied
  • subject and text use text/template
  • optional HTML uses html/template
  • missing required variables and template lookup failures are classified into stable render-failure codes
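
The locale rules above amount to a two-step lookup: exact match, then `en`. The sketch below shows that lookup together with strict variable checking via the `missingkey=error` option of `text/template`. The `catalog` type and function names are hypothetical, whether the service uses `missingkey=error` is an assumption, and for brevity the sketch parses the template at render time, whereas the real process parses the catalog once at startup.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"text/template"
)

const fallbackLocale = "en"

var errTemplateNotFound = errors.New("template not found")

// catalog maps template_id -> locale -> subject template source.
type catalog map[string]map[string]string

// resolve tries the exact locale first and then falls back to "en".
// There is no intermediate reduction such as fr-CA -> fr.
func (c catalog) resolve(templateID, locale string) (src string, fallbackUsed bool, err error) {
	locales, ok := c[templateID]
	if !ok {
		return "", false, errTemplateNotFound
	}
	if src, ok := locales[locale]; ok {
		return src, false, nil
	}
	if src, ok := locales[fallbackLocale]; ok {
		return src, true, nil
	}
	return "", false, errTemplateNotFound
}

// renderSubject executes the template and fails on any missing variable.
func renderSubject(src string, vars map[string]any) (string, error) {
	tmpl, err := template.New("subject").Option("missingkey=error").Parse(src)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := tmpl.Execute(&b, vars); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	c := catalog{"auth.login_code": {"en": "Your login code is {{.code}}"}}
	src, fallbackUsed, err := c.resolve("auth.login_code", "fr-CA")
	if err != nil {
		fmt.Println(err)
		return
	}
	subject, err := renderSubject(src, map[string]any{"code": "123456"})
	fmt.Println(subject, fallbackUsed, err)
}
```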

Persistence Layout

PostgreSQL mail schema (source of truth — see docs/postgres-migration.md):

  • deliveries(delivery_id PK, source, status, payload_mode, …, idempotency_key, request_fingerprint, idempotency_expires_at, attempt_count, next_attempt_at, created_at, updated_at, …) with UNIQUE (source, idempotency_key) and a partial scheduler index on next_attempt_at
  • delivery_recipients(delivery_id FK, kind, position, email) with kind ∈ {'to','cc','bcc','reply_to'} and an email index that excludes reply_to
  • attempts(delivery_id FK, attempt_no, status, scheduled_for, started_at, finished_at, provider_classification, provider_summary), PRIMARY KEY (delivery_id, attempt_no)
  • dead_letters(delivery_id PK FK, final_attempt_no, failure_classification, provider_summary, recovery_hint, created_at)
  • delivery_payloads(delivery_id PK FK, payload jsonb) for raw attachment bundles
  • malformed_commands(stream_entry_id PK, delivery_id, source, idempotency_key, failure_code, failure_message, raw_fields jsonb, recorded_at)
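
For readers unfamiliar with the `FOR UPDATE SKIP LOCKED` pattern that drives the attempt scheduler, the following sketch shows how a worker could claim due deliveries inside a transaction using the columns listed above. The query text, the unqualified table name (which assumes the `mail` schema is on the search path), and the `claimDue` helper are illustrative assumptions and not the service's actual SQL.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// claimDueQuery selects due deliveries and locks the rows. Rows already
// locked by another worker's transaction are skipped instead of waited on,
// so concurrent workers never claim the same delivery.
const claimDueQuery = `
SELECT delivery_id
FROM deliveries
WHERE status IN ('queued', 'rendered')
  AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT $1
FOR UPDATE SKIP LOCKED`

// claimDue must run inside a transaction: the row locks are held until the
// caller commits or rolls back, typically after marking the rows as sending.
func claimDue(ctx context.Context, tx *sql.Tx, limit int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, claimDueQuery, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

func main() {
	// No database is opened here; the sketch only prints the query it would run.
	fmt.Println(claimDueQuery)
}
```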

Redis surface (intake stream + offset only):

  • mail:delivery_commands — async ingress Redis Stream
  • mail:stream_offsets:<stream> — persisted consumer offset for the intake stream

Storage rules:

  • timestamps are stored as PostgreSQL timestamptz and normalised to UTC at the adapter boundary
  • malformed async commands are stored idempotently by stream_entry_id
  • the idempotency_expires_at column is set per acceptance from MAIL_IDEMPOTENCY_TTL (default 7d); resends store an empty fingerprint and a synthetic far-future expiry that the read helper treats as non-idempotent
  • the SQL retention worker periodically deletes deliveries older than MAIL_DELIVERY_RETENTION (cascade) and malformed commands older than MAIL_MALFORMED_COMMAND_RETENTION

Provider, Retry, and Failure Policy

Provider modes:

  • stub
  • smtp

SMTP rules:

  • outbound SMTP requires STARTTLS
  • a server that does not support STARTTLS is treated as a permanent failure
  • SMTP authentication is enabled only when both username and password are set
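
The authentication rule can be expressed in a few lines. This sketch uses the standard library `net/smtp` package and PLAIN authentication purely for illustration; the document does not state which SMTP client or mechanism the service uses, and the `smtpAuth` helper is hypothetical.

```go
package main

import (
	"fmt"
	"net/smtp"
)

// smtpAuth returns nil, meaning "do not authenticate", unless both the
// username and the password are configured.
func smtpAuth(username, password, host string) smtp.Auth {
	if username == "" || password == "" {
		return nil
	}
	return smtp.PlainAuth("", username, password, host)
}

func main() {
	fmt.Println(smtpAuth("mailer", "", "smtp.example.test") == nil)
	fmt.Println(smtpAuth("mailer", "secret", "smtp.example.test") != nil)
}
```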

Retry ladder:

  • attempt 1 -> 2: 1m
  • attempt 2 -> 3: 5m
  • attempt 3 -> 4: 30m
  • after attempt 4: dead_letter

Failure handling:

  • retryable provider failures become transport_failed or timed_out, then either reschedule or escalate to dead_letter
  • permanent provider failures become failed
  • render failures end the delivery as failed, with the attempt recorded as render_failed
  • stale claimed work is recovered after MAIL_SMTP_TIMEOUT + 30s

Observability

The runtime exports telemetry through configured OpenTelemetry exporters only.

Main signals:

  • mail.delivery.accepted_auth
  • mail.delivery.accepted_generic
  • mail.delivery.suppressed
  • mail.delivery.status_transitions
  • mail.attempt.outcomes
  • mail.delivery.dead_letters
  • mail.template.locale_fallback
  • mail.attempt_schedule.depth
  • mail.attempt_schedule.oldest_age_ms
  • mail.provider.send.duration_ms
  • mail.stream_commands.malformed

Additional behavior:

  • internal HTTP uses otelhttp
  • Redis clients use redisotel
  • structured logs include otel_trace_id and otel_span_id when available

Verification

Relevant commands:

  • cd mail && go test ./...
  • cd integration && go test ./authsessionmail/...
  • cd integration && go test ./gatewayauthsessionmail/...

Extended references: