feat: mail service

Ilia Denisov
2026-04-17 18:39:16 +02:00
committed by GitHub
parent 23ffcb7535
commit 5b7593e6f6
183 changed files with 31215 additions and 248 deletions
+23
@@ -0,0 +1,23 @@
# Mail Service Docs
This directory keeps service-local documentation that is more operational or
more example-heavy than [`../README.md`](../README.md).
Sections:
- [Runtime and components](runtime.md)
- [Main flows](flows.md)
- [Configuration and contract examples](examples.md)
- [Operator runbook](runbook.md)

Primary references:
- [`../README.md`](../README.md) for stable service scope, contracts, data
model, Redis layout, and retry policy
- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml) for the
trusted internal REST contract
- [`../api/delivery-commands-asyncapi.yaml`](../api/delivery-commands-asyncapi.yaml)
for the trusted async generic command contract
- [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for system-level service
boundaries and transport rules
- [`../../TESTING.md`](../../TESTING.md) for the cross-service testing matrix
+129
@@ -0,0 +1,129 @@
# Configuration and Contract Examples
The examples below are illustrative. IDs, timestamps, and keys are placeholders
unless explicitly stated otherwise.
## Example Environment
Minimal local runtime with the stub provider:
```dotenv
MAIL_REDIS_ADDR=127.0.0.1:6379
MAIL_INTERNAL_HTTP_ADDR=:8080
MAIL_TEMPLATE_DIR=templates
MAIL_SMTP_MODE=stub
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
SMTP-backed configuration:
```dotenv
MAIL_REDIS_ADDR=127.0.0.1:6379
MAIL_INTERNAL_HTTP_ADDR=:8080
MAIL_TEMPLATE_DIR=templates
MAIL_SMTP_MODE=smtp
MAIL_SMTP_ADDR=127.0.0.1:1025
MAIL_SMTP_FROM_EMAIL=noreply@example.com
MAIL_SMTP_TIMEOUT=15s
# Optional for local self-signed SMTP capture only:
# MAIL_SMTP_INSECURE_SKIP_VERIFY=true
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
## Auth Delivery REST
Request:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/internal/login-code-deliveries \
-H 'Content-Type: application/json' \
-H 'Idempotency-Key: challenge-123' \
-d '{
"email": "pilot@example.com",
"code": "123456",
"locale": "fr-FR"
}'
```
Success response:
```json
{
"outcome": "sent"
}
```
Suppressed response:
```json
{
"outcome": "suppressed"
}
```
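
The request above can also be built from code. The following is a minimal sketch using only Python's standard library; the helper names `build_login_code_request` and `send_login_code` are hypothetical, and only the path, headers, and JSON fields are taken from the `curl` example.

```python
import json
import urllib.request


def build_login_code_request(base_url, email, code, locale, idempotency_key):
    """Build (but do not send) the login-code delivery request."""
    body = json.dumps({"email": email, "code": code, "locale": locale}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/internal/login-code-deliveries",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
    )


def send_login_code(request, timeout=5.0):
    """Send the request and return the `outcome` field of the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["outcome"]


# Placeholder values, matching the curl example above.
req = build_login_code_request(
    "http://127.0.0.1:8080", "pilot@example.com", "123456", "fr-FR", "challenge-123"
)
```

Calling `send_login_code(req)` against a running instance should return either `sent` or `suppressed`, the two outcomes shown above.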
## Async Generic Command Examples
Rendered payload:
```bash
redis-cli XADD mail:delivery_commands '*' \
delivery_id mail-123 \
source notification \
payload_mode rendered \
idempotency_key notification:mail-123 \
request_id req-123 \
trace_id trace-123 \
payload_json '{"to":["pilot@example.com"],"cc":[],"bcc":[],"reply_to":[],"subject":"Turn ready","text_body":"Turn 54 is ready.","html_body":"<p>Turn <strong>54</strong> is ready.</p>","attachments":[]}'
```
Template payload:
```bash
redis-cli XADD mail:delivery_commands '*' \
delivery_id mail-124 \
source notification \
payload_mode template \
idempotency_key notification:mail-124 \
request_id req-124 \
trace_id trace-124 \
payload_json '{"to":["pilot@example.com"],"cc":[],"bcc":[],"reply_to":[],"template_id":"game.turn_ready","locale":"fr-FR","variables":{"turn_number":54},"attachments":[]}'
```
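
For producers that assemble commands in code, the sketch below builds the same `XADD` argument list as the examples above. It assumes Python's standard library; `build_xadd_args` is a hypothetical helper, and the field names are copied from the examples rather than from the AsyncAPI contract itself.

```python
import json


def build_xadd_args(stream, delivery_id, source, payload_mode, payload,
                    idempotency_key, request_id, trace_id):
    """Return the argv for `redis-cli XADD` matching the examples above."""
    fields = {
        "delivery_id": delivery_id,
        "source": source,
        "payload_mode": payload_mode,
        "idempotency_key": idempotency_key,
        "request_id": request_id,
        "trace_id": trace_id,
        # The payload travels as one JSON string field, not as separate stream fields.
        "payload_json": json.dumps(payload, separators=(",", ":")),
    }
    args = ["redis-cli", "XADD", stream, "*"]
    for name, value in fields.items():
        args += [name, value]
    return args


# Placeholder values, matching the template payload example above.
template_payload = {
    "to": ["pilot@example.com"], "cc": [], "bcc": [], "reply_to": [],
    "template_id": "game.turn_ready", "locale": "fr-FR",
    "variables": {"turn_number": 54}, "attachments": [],
}
args = build_xadd_args(
    "mail:delivery_commands", "mail-124", "notification", "template",
    template_payload, "notification:mail-124", "req-124", "trace-124",
)
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a host where `redis-cli` is installed.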
## Operator API Examples
List deliveries:
```bash
curl 'http://127.0.0.1:8080/api/v1/internal/deliveries?source=authsession&status=sent&limit=10'
```
Get one delivery:
```bash
curl http://127.0.0.1:8080/api/v1/internal/deliveries/delivery-123
```
List attempts:
```bash
curl http://127.0.0.1:8080/api/v1/internal/deliveries/delivery-123/attempts
```
Resend one terminal delivery:
```bash
curl -X POST http://127.0.0.1:8080/api/v1/internal/deliveries/delivery-123/resend
```
Example resend response:
```json
{
"delivery_id": "delivery-456"
}
```
+100
@@ -0,0 +1,100 @@
# Main Flows
## Auth / Session -> Mail
```mermaid
sequenceDiagram
participant Auth as Auth / Session Service
participant Mail as Mail Service
participant Redis
participant Scheduler
participant SMTP as Provider
Auth->>Mail: POST /api/v1/internal/login-code-deliveries + Idempotency-Key
Mail->>Mail: validate request and idempotency scope
alt MAIL_SMTP_MODE = stub
Mail->>Redis: persist delivery as suppressed
Mail-->>Auth: 200 {outcome=suppressed}
else MAIL_SMTP_MODE = smtp
Mail->>Redis: persist delivery as queued + attempt #1 scheduled
Mail-->>Auth: 200 {outcome=sent}
Scheduler->>Redis: claim due attempt
Scheduler->>SMTP: send rendered auth mail
SMTP-->>Scheduler: accepted or classified failure
Scheduler->>Redis: commit sent / retry / failed / dead_letter
end
```
`sent` on this boundary means durable intake into the mail-delivery pipeline.
It does not mean SMTP completion.
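
The gap between the persisted status and the REST outcome can be summarized in a few lines. This is an illustrative sketch of the two branches in the diagram above, not the service's code; `auth_intake_result` is a hypothetical name.

```python
def auth_intake_result(smtp_mode):
    """Map MAIL_SMTP_MODE to (persisted delivery status, REST outcome)."""
    if smtp_mode == "stub":
        return ("suppressed", "suppressed")
    if smtp_mode == "smtp":
        # `sent` acknowledges durable intake; the delivery itself is only queued.
        return ("queued", "sent")
    raise ValueError(f"unsupported MAIL_SMTP_MODE: {smtp_mode!r}")
```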
## Notification -> Mail
```mermaid
sequenceDiagram
participant Notify as Notification Service
participant Stream as Redis Stream mail:delivery_commands
participant Consumer as Command consumer
participant Mail as Mail Service
participant Redis
Notify->>Stream: XADD generic command
Consumer->>Stream: XREAD from last stored offset
Consumer->>Mail: decode and validate command
alt malformed or conflicting command
Mail->>Redis: record malformed command entry
Consumer->>Redis: save stream offset
else valid command
Mail->>Redis: persist delivery + first attempt + optional payload bundle
Consumer->>Redis: save stream offset
end
```
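
The offset rule in this flow (save the stream offset only after a durable write, whether the command was accepted or recorded as malformed) can be sketched with an in-memory stand-in for Redis. Everything here is an assumption made for illustration: the validation rules, the failure codes, the `state` layout, and the use of `(ms, seq)` tuples in place of stream entry IDs.

```python
import json

PAYLOAD_MODES = {"rendered", "template"}  # the two modes shown in the examples


def validate(fields):
    """Return None when the command looks valid, else a hypothetical failure code."""
    if not fields.get("delivery_id"):
        return "missing_delivery_id"
    if fields.get("payload_mode") not in PAYLOAD_MODES:
        return "invalid_payload_mode"
    try:
        json.loads(fields.get("payload_json", ""))
    except ValueError:
        return "invalid_payload_json"
    return None


def consume(stream, state):
    """Process entries after the stored offset, one durable write per entry."""
    for entry_id, fields in stream:
        if entry_id <= state["offset"]:
            continue  # already handled before a restart
        failure = validate(fields)
        if failure is None:
            state["deliveries"].append(fields["delivery_id"])
        else:
            state["malformed"].append((entry_id, failure))
        # The offset moves only after one of the two writes above.
        state["offset"] = entry_id
```

Because the offset trails the write, running `consume` again over the same entries skips everything at or below the stored offset instead of recording it twice.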
## Retry and Dead Letter
```mermaid
sequenceDiagram
participant Scheduler
participant Redis
participant Worker as Attempt worker
participant SMTP as Provider
Scheduler->>Redis: find next due delivery
Scheduler->>Redis: load work item
alt template delivery not yet rendered
Scheduler->>Redis: render and store materialized content
end
Scheduler->>Redis: claim scheduled attempt
Scheduler->>Worker: enqueue claimed work
Worker->>SMTP: send materialized message
SMTP-->>Worker: accepted / suppressed / transient_failure / permanent_failure
alt accepted
Worker->>Redis: commit sent + provider_accepted
else suppressed
Worker->>Redis: commit suppressed + provider_rejected
else transient failure before retry budget ends
Worker->>Redis: commit transport_failed|timed_out + next scheduled attempt
else retry budget exhausted
Worker->>Redis: commit dead_letter + dead-letter entry
else permanent failure
Worker->>Redis: commit failed + provider_rejected
end
```
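
The branches of this diagram reduce to a small decision table. The sketch below is an illustration only: `commit_for` and its `max_attempts` parameter are hypothetical (the length of the retry ladder is not stated here), and the delivery status is returned as `None` in the retry branch because the diagram names no new delivery status there.

```python
def commit_for(outcome, attempt_number, max_attempts, timed_out=False):
    """Return (delivery_status, attempt_status, schedule_next) for one provider outcome."""
    if outcome == "accepted":
        return ("sent", "provider_accepted", False)
    if outcome == "suppressed":
        return ("suppressed", "provider_rejected", False)
    if outcome == "permanent_failure":
        return ("failed", "provider_rejected", False)
    if outcome == "transient_failure":
        attempt_status = "timed_out" if timed_out else "transport_failed"
        if attempt_number < max_attempts:
            # Retry budget remains: keep the delivery alive and schedule another attempt.
            return (None, attempt_status, True)
        # Retry budget exhausted: the delivery is dead-lettered.
        return ("dead_letter", attempt_status, False)
    raise ValueError(f"unknown provider outcome: {outcome!r}")
```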
## Operator Resend
```mermaid
sequenceDiagram
participant Ops as Trusted operator
participant Mail as Mail Service
participant Redis
Ops->>Mail: POST /api/v1/internal/deliveries/{delivery_id}/resend
Mail->>Redis: load original delivery and optional payload bundle
Mail->>Mail: verify original status is terminal
Mail->>Redis: create clone delivery with source=operator_resend
Mail-->>Ops: 200 {delivery_id=<clone>}
```
Resend always creates a new delivery and never mutates the original delivery or
its attempt history.
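
A minimal sketch of the clone-on-resend rule, using a plain dictionary in place of Redis. The set of terminal statuses, the `queued` status on the clone, and the record layout are assumptions made for illustration; only `source=operator_resend` and the rule that the original is never mutated come from the flow above.

```python
import copy

# Assumed terminal statuses, inferred from the flows in this document.
TERMINAL_STATUSES = {"sent", "suppressed", "failed", "dead_letter"}


def resend(store, delivery_id, new_id):
    """Clone a terminal delivery under a new ID without touching the original."""
    original = store[delivery_id]
    if original["status"] not in TERMINAL_STATUSES:
        raise ValueError(f"delivery {delivery_id} is not terminal")
    clone = copy.deepcopy(original)
    clone.update({
        "delivery_id": new_id,
        "source": "operator_resend",
        "status": "queued",   # assumption: the clone starts a fresh attempt cycle
        "attempts": [],       # the original keeps its own attempt history
    })
    store[new_id] = clone
    return {"delivery_id": new_id}
```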
+177
@@ -0,0 +1,177 @@
# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common `Mail Service` incidents.
## Startup Checks
Before starting the process, confirm:
- `MAIL_REDIS_ADDR` points to the Redis deployment that stores deliveries,
attempts, idempotency reservations, malformed commands, and stream offsets
- the configured Redis ACL, DB, TLS, and timeout settings match the target
environment
- `MAIL_TEMPLATE_DIR` points to the intended immutable template catalog
- if `MAIL_SMTP_MODE=smtp`, the SMTP address, sender identity, and optional
credentials are configured together
- the OpenTelemetry exporter settings point at the intended collector when
traces or metrics are expected outside the process
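
Some of these checks can be automated before the process starts. The sketch below is a hypothetical pre-flight helper, not the service's own validation, which may be stricter or differ in detail. It covers only the rules stated in this documentation: `MAIL_REDIS_ADDR` is always required, and in `smtp` mode the address, sender, and credential pair must be configured together.

```python
def check_startup_env(env):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    if not env.get("MAIL_REDIS_ADDR"):
        problems.append("MAIL_REDIS_ADDR is required")
    if env.get("MAIL_SMTP_MODE") == "smtp":
        for name in ("MAIL_SMTP_ADDR", "MAIL_SMTP_FROM_EMAIL"):
            if not env.get(name):
                problems.append(f"{name} is required when MAIL_SMTP_MODE=smtp")
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_password = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_password:
            problems.append("MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together")
    return problems
```

In practice it can be called with `dict(os.environ)` from a deploy script, aborting the rollout when the returned list is not empty.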
At startup the process performs bounded `PING` checks for both Redis clients
used by the runtime and parses the full template catalog.
Startup fails fast if those checks fail or if the template catalog cannot be
loaded.
Known startup caveats:
- there is no `/healthz`, `/readyz`, or `/metrics` route
- traces and metrics are exported only through the configured OpenTelemetry
exporters
- template changes are not hot-reloaded; restart is required after template
edits
## Steady-State Verification
Because the service exposes no health endpoints, practical readiness verification is:
1. confirm the process emitted startup logs for the internal HTTP listener,
command consumer, scheduler, and worker pool
2. open a TCP connection to `MAIL_INTERNAL_HTTP_ADDR`
3. issue one trusted smoke request such as
`GET /api/v1/internal/deliveries/does-not-exist`
4. verify Redis connectivity and OpenTelemetry exporter health out of band
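
Steps 2 and 3 can be scripted. The following sketch assumes Python's standard library and a Go-style listen address such as `:8080`; the helper names are hypothetical, IPv6 literals are not handled, and the status code to expect from the smoke request is not stated here, so the function only reports it.

```python
import socket
import urllib.error
import urllib.request


def split_listen_addr(addr, default_host="127.0.0.1"):
    """Split a listen address such as ":8080" into (host, port)."""
    host, _, port = addr.rpartition(":")
    return (host or default_host, int(port))


def tcp_open(host, port, timeout=2.0):
    """Step 2: report whether a TCP connection to the listener succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_status(host, port, timeout=2.0):
    """Step 3: return the HTTP status of the trusted smoke request."""
    url = f"http://{host}:{port}/api/v1/internal/deliveries/does-not-exist"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # An HTTP error status still proves the listener is routing requests.
        return err.code
```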
Expected steady-state signals:
- `mail.attempt_schedule.depth` remains bounded
- `mail.attempt_schedule.oldest_age_ms` stays close to the delays of the active
retry ladder instead of growing without bound
- `mail.delivery.dead_letters` changes rarely
- `mail.stream_commands.malformed` changes only on bad upstream commands
- internal HTTP logs include `otel_trace_id` and `otel_span_id`
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- coordinated shutdown is bounded by `MAIL_SHUTDOWN_TIMEOUT`
- the internal HTTP listener is stopped before process resources are closed
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis and template configuration
4. repeat the steady-state verification steps
## Incident Triage
### Attempt Schedule Backlog Grows
Symptoms:
- `mail.attempt_schedule.depth` rises steadily
- `mail.attempt_schedule.oldest_age_ms` increases instead of oscillating
- deliveries remain in `queued` or `rendered` longer than expected

Checks:
1. confirm the scheduler is still logging regular activity
2. confirm Redis connectivity and latency for attempt-schedule keys
3. confirm attempt workers are running and not blocked on SMTP
4. inspect `mail.provider.send.duration_ms` for elevated latency
5. verify `MAIL_ATTEMPT_WORKER_CONCURRENCY` is appropriate for the workload
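
Whether a gauge such as `mail.attempt_schedule.depth` "rises steadily" rather than oscillating can be approximated with a simple heuristic. The sketch below is hypothetical: the window size is arbitrary, and fetching the samples depends on the metrics backend behind the OpenTelemetry exporter.

```python
def rises_steadily(samples, min_points=5):
    """True when the last `min_points` samples never decrease and end higher than they start."""
    window = samples[-min_points:]
    if len(window) < min_points:
        return False  # not enough data to judge a trend
    non_decreasing = all(a <= b for a, b in zip(window, window[1:]))
    return non_decreasing and window[-1] > window[0]
```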
### Dead-Letter Spikes
Symptoms:
- `mail.delivery.dead_letters` increases rapidly
- operator reads show repeated `dead_letter` deliveries with recent
`transport_failed` or `timed_out` attempts
Checks:
1. inspect recent provider summaries on dead-lettered deliveries
2. confirm SMTP reachability from the Mail Service process
3. compare the spike against `mail.provider.send.duration_ms` and timeout logs
4. verify the remote SMTP server is accepting `STARTTLS` and mail submission
Expected behavior:
- dead letters appear only after the fixed retry ladder is exhausted
- each dead-lettered delivery has a matching dead-letter entry
### Repeated `suppressed` Outcomes
Symptoms:
- `mail.delivery.suppressed` rises unexpectedly
- auth or generic deliveries end as `suppressed`
Checks:
1. determine whether the source is `authsession` or `notification`
2. for auth deliveries, confirm the service is not intentionally running in
`MAIL_SMTP_MODE=stub`
3. inspect provider summaries for policy-driven suppression markers
4. confirm the upstream business workflow still expects those deliveries to be
skipped
Expected behavior:
- auth suppression is valid in stub mode and still counts as successful intake
- provider-side suppression is recorded as
`mail_attempt.status=provider_rejected` together with
`mail_delivery.status=suppressed`
### SMTP Authentication Failures
Symptoms:
- provider summaries indicate auth or login failures
- delivery attempts shift toward `failed` or repeated retryable failures,
depending on provider classification
Checks:
1. verify `MAIL_SMTP_USERNAME` and `MAIL_SMTP_PASSWORD` are both configured
2. verify the credential pair is valid for the target SMTP server
3. verify the sender identity matches the allowed submission account
4. confirm the server advertises the expected authentication mechanisms
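
Check 4 can be performed from the Mail Service host with Python's `smtplib`. This is a diagnostic sketch under stated assumptions: `probe_smtp` and `advertised_auth` are hypothetical helper names, and the probe only reads what the server advertises, without attempting a login.

```python
import smtplib


def advertised_auth(features):
    """Extract AUTH mechanisms from an smtplib `esmtp_features` mapping."""
    return set(features.get("auth", "").upper().split())


def probe_smtp(host, port, timeout=15.0):
    """Connect, upgrade with STARTTLS when offered, and report server capabilities."""
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        starttls = client.has_extn("starttls")
        if starttls:
            client.starttls()
            client.ehlo()  # capabilities can change after the TLS upgrade
        return {"starttls": starttls, "auth": advertised_auth(client.esmtp_features)}
```

For example, `probe_smtp("127.0.0.1", 1025)` targets the address from the SMTP-backed example configuration.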
### SMTP Timeouts
Symptoms:
- `mail.attempt.outcomes{status="timed_out"}` increases
- `mail.provider.send.duration_ms` shifts upward
- logs show retry scheduling or dead-letter transitions after timeout paths
Checks:
1. confirm network reachability to `MAIL_SMTP_ADDR`
2. compare observed send duration with `MAIL_SMTP_TIMEOUT`
3. verify the SMTP server is not stalling during `STARTTLS`, auth, or `DATA`
4. confirm the process is not CPU-starved or blocked on Redis
### Malformed Stream Commands
Symptoms:
- `mail.stream_commands.malformed` increases
- logs contain `stream command rejected`
Checks:
1. inspect `failure_code`, `delivery_id`, `source`, and `stream_entry_id`
2. confirm the upstream command payload still matches
[`../api/delivery-commands-asyncapi.yaml`](../api/delivery-commands-asyncapi.yaml)
3. confirm the producer still sends canonical `payload_mode`, locale, and
idempotency fields
4. review stored malformed-command records through the operator tooling or
direct Redis inspection
+187
@@ -0,0 +1,187 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/mail` process and its runtime
dependencies.
```mermaid
flowchart LR
subgraph Callers
Auth["Auth / Session Service"]
Notify["Notification Service"]
Ops["Trusted operators"]
end
subgraph Mail["Mail Service process"]
InternalHTTP["Trusted internal HTTP listener\n/api/v1/internal/*"]
Consumer["Redis Stream command consumer"]
Scheduler["Attempt scheduler"]
Workers["Attempt worker pool"]
Cleanup["Index cleanup worker"]
Services["Application services"]
Templates["Immutable template catalog"]
Telemetry["Logs, traces, metrics"]
end
Redis["Redis\nstate + streams + indexes"]
Provider["SMTP or stub provider"]
Auth --> InternalHTTP
Ops --> InternalHTTP
Notify --> Redis
InternalHTTP --> Services
Consumer --> Services
Scheduler --> Services
Workers --> Services
Cleanup --> Services
Services --> Templates
Services --> Redis
Services --> Provider
InternalHTTP --> Telemetry
Consumer --> Telemetry
Scheduler --> Telemetry
Workers --> Telemetry
```
## Listener
`mail` exposes exactly one HTTP listener:
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8080` | Trusted intake, operator reads, and resend |

Listener timeout defaults:
- read-header timeout: `2s`
- read timeout: `10s`
- idle timeout: `1m`
Intentional omissions:
- no public listener
- no `/healthz`
- no `/readyz`
- no `/metrics`
## Startup Wiring
`cmd/mail` loads config, constructs logging, and builds the runtime through
`internal/app.NewRuntime`.
The runtime wires:
- Redis clients for state access and blocking stream consumption
- filesystem-backed template catalog
- provider adapter selected by `MAIL_SMTP_MODE`
- acceptance, render, execution, operator-read, and resend services
- internal HTTP server
- command consumer
- scheduler
- attempt worker pool
- cleanup worker
Before startup completes, the process performs bounded `PING` checks for both
Redis clients and validates the template catalog. Startup fails fast on invalid
configuration, unavailable Redis, or a template catalog that cannot be loaded.
## Background Components
### Command consumer
- reads a single stream with plain `XREAD` (no consumer group)
- starts from stored offset or `0-0`
- advances offset only after durable command acceptance or durable malformed
command recording
### Scheduler
- polls due work every `250ms`
- recovers stale claims every `30s`
- derives recovery deadline from `MAIL_SMTP_TIMEOUT + 30s`
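
As a worked example of the last rule: with `MAIL_SMTP_TIMEOUT=15s` from the example configuration, the recovery deadline comes out at `45s`. The sketch below parses a small subset of Go-style duration strings; `parse_duration` and `stale_claim_deadline` are hypothetical helpers, not part of the service.

```python
import re
from datetime import timedelta

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_duration(text):
    """Parse durations such as "15s", "1m30s", or "250ms"."""
    pos, seconds = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            break  # unexpected characters between tokens
        seconds += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=seconds)


def stale_claim_deadline(smtp_timeout):
    """Recovery deadline as described above: MAIL_SMTP_TIMEOUT + 30s."""
    return parse_duration(smtp_timeout) + timedelta(seconds=30)
```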
### Attempt worker pool
- processes only already claimed work items
- concurrency is controlled by `MAIL_ATTEMPT_WORKER_CONCURRENCY`
### Cleanup worker
- removes stale delivery-index members after primary delivery expiry
- does not clean `mail:attempt_schedule`
- does not clean malformed-command index entries
## Configuration Groups
Required for all starts:
- `MAIL_REDIS_ADDR`

Core process config:
- `MAIL_SHUTDOWN_TIMEOUT`
- `MAIL_LOG_LEVEL`

Internal HTTP config:
- `MAIL_INTERNAL_HTTP_ADDR`
- `MAIL_INTERNAL_HTTP_READ_HEADER_TIMEOUT`
- `MAIL_INTERNAL_HTTP_READ_TIMEOUT`
- `MAIL_INTERNAL_HTTP_IDLE_TIMEOUT`

Redis connectivity:
- `MAIL_REDIS_USERNAME`
- `MAIL_REDIS_PASSWORD`
- `MAIL_REDIS_DB`
- `MAIL_REDIS_TLS_ENABLED`
- `MAIL_REDIS_OPERATION_TIMEOUT`
- `MAIL_REDIS_COMMAND_STREAM`
- `MAIL_REDIS_ATTEMPT_SCHEDULE_KEY`
- `MAIL_REDIS_DEAD_LETTER_PREFIX`

SMTP provider:
- `MAIL_SMTP_MODE`
- `MAIL_SMTP_ADDR`
- `MAIL_SMTP_USERNAME`
- `MAIL_SMTP_PASSWORD`
- `MAIL_SMTP_FROM_EMAIL`
- `MAIL_SMTP_FROM_NAME`
- `MAIL_SMTP_TIMEOUT`
- `MAIL_SMTP_INSECURE_SKIP_VERIFY`

Templates and workers:
- `MAIL_TEMPLATE_DIR`
- `MAIL_ATTEMPT_WORKER_CONCURRENCY`
- `MAIL_STREAM_BLOCK_TIMEOUT`
- `MAIL_OPERATOR_REQUEST_TIMEOUT`
- `MAIL_IDEMPOTENCY_TTL`
- `MAIL_DELIVERY_TTL`
- `MAIL_ATTEMPT_TTL`

Telemetry:
- `OTEL_SERVICE_NAME`
- `OTEL_TRACES_EXPORTER`
- `OTEL_METRICS_EXPORTER`
- `OTEL_EXPORTER_OTLP_PROTOCOL`
- `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`
- `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL`
- `MAIL_OTEL_STDOUT_TRACES_ENABLED`
- `MAIL_OTEL_STDOUT_METRICS_ENABLED`
## Runtime Notes
- `MAIL_REDIS_COMMAND_STREAM` is the only Redis key override that currently
changes runtime behavior
- `MAIL_SMTP_INSECURE_SKIP_VERIFY` is a local-development escape hatch for
self-signed SMTP capture only and should remain disabled in production
- attempt-schedule and dead-letter key overrides are parsed but not yet wired
into Redis adapters
- the retention overrides (the `MAIL_*_TTL` variables) are parsed, but storage
still uses the fixed `7d`, `30d`, and `90d` values
- template catalog parsing is eager and immutable
- auth deliveries in `MAIL_SMTP_MODE=stub` surface as `suppressed`
- auth deliveries in `MAIL_SMTP_MODE=smtp` are persisted as `queued` (the REST
outcome is `sent`) and later move through normal attempt execution