# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common `Mail Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `MAIL_REDIS_ADDR` points to the Redis deployment that stores deliveries,
  attempts, idempotency reservations, malformed commands, and stream offsets
- the configured Redis ACL, DB, TLS, and timeout settings match the target
  environment
- `MAIL_TEMPLATE_DIR` points to the intended immutable template catalog
- if `MAIL_SMTP_MODE=smtp`, the SMTP address, sender identity, and optional
  credentials are configured together
- the OpenTelemetry exporter settings point at the intended collector when
  traces or metrics are expected outside the process

At startup the process performs bounded `PING` checks for both Redis clients
used by the runtime and parses the full template catalog.

Startup fails fast if those checks fail or if the template catalog cannot be
loaded.

Known startup caveats:

- there is no `/healthz`, `/readyz`, or `/metrics` route
- traces and metrics are exported only through the configured OpenTelemetry
  exporters
- template changes are not hot-reloaded; restart is required after template
  edits

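Part of the pre-start checklist can be screened mechanically before the
process is launched. The following is a minimal sketch, not part of the
service: `check_startup_env` is a hypothetical helper, and `MAIL_SMTP_FROM` is
an assumed name for the sender identity setting, which this runbook does not
name. It only checks that related settings are present together; it does not
validate Redis ACL, TLS, or exporter settings.

```python
def check_startup_env(env):
    """Return a list of problems found in a mapping of environment variables.

    An empty list means no obvious gaps were found, not that startup will succeed.
    """
    problems = []
    for name in ("MAIL_REDIS_ADDR", "MAIL_TEMPLATE_DIR"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    if env.get("MAIL_SMTP_MODE") == "smtp":
        # MAIL_SMTP_FROM is a hypothetical name for the sender identity setting.
        for name in ("MAIL_SMTP_ADDR", "MAIL_SMTP_FROM"):
            if not env.get(name):
                problems.append(f"{name} is required when MAIL_SMTP_MODE=smtp")
        # Credentials are optional, but a lone username or password is a mistake.
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_pass = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_pass:
            problems.append(
                "MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together"
            )
    return problems
```

Calling `check_startup_env(dict(os.environ))` in a deploy script and refusing
to start on a non-empty result is one way to use it.
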
## Steady-State Verification

Because the service exposes no health routes, practical readiness verification
is:

1. confirm the process emitted startup logs for the internal HTTP listener,
   command consumer, scheduler, and worker pool
2. open a TCP connection to `MAIL_INTERNAL_HTTP_ADDR`
3. issue one trusted smoke request such as
   `GET /api/v1/internal/deliveries/does-not-exist`
4. verify Redis connectivity and OpenTelemetry exporter health out of band

Expected steady-state signals:

- `mail.attempt_schedule.depth` remains bounded
- `mail.attempt_schedule.oldest_age_ms` stays near the active retry ladder
- `mail.delivery.dead_letters` changes rarely
- `mail.stream_commands.malformed` changes only on bad upstream commands
- internal HTTP logs include `otel_trace_id` and `otel_span_id`

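Steps 2 and 3 of the readiness verification can be scripted. The sketch below
assumes `MAIL_INTERNAL_HTTP_ADDR` has the form `host:port` or `:port` with an
IPv4 address or hostname, and that the smoke request needs no extra headers
when sent from a trusted network; both are assumptions. The helper names are
hypothetical. Any HTTP response shows that the listener is routing requests;
the exact status returned for a missing delivery is not specified in this
runbook.

```python
import http.client
import socket


def split_addr(addr, default_host="127.0.0.1"):
    """Split 'host:port' (or ':port') into a (host, port) tuple."""
    host, _, port = addr.rpartition(":")
    return (host or default_host), int(port)


def tcp_reachable(addr, timeout=2.0):
    """Step 2: return True if a TCP connection to addr can be opened."""
    try:
        with socket.create_connection(split_addr(addr), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_status(addr, path="/api/v1/internal/deliveries/does-not-exist",
                 timeout=2.0):
    """Step 3: issue the smoke GET and return the HTTP status code."""
    host, port = split_addr(addr)
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```
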
## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `MAIL_SHUTDOWN_TIMEOUT`
- the internal HTTP listener is stopped before process resources are closed
- Redis clients are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis and template configuration
4. repeat the steady-state verification steps

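The first two restart steps can be sketched as follows, assuming the operator
tooling launched the process itself and therefore holds a `subprocess.Popen`
handle; under a process supervisor the equivalent supervisor command applies
instead. `stop_gracefully` is a hypothetical helper, and the wait should be
set somewhat above `MAIL_SHUTDOWN_TIMEOUT` so the process can finish its own
bounded shutdown.

```python
import signal
import subprocess


def stop_gracefully(proc, wait_seconds):
    """Send SIGTERM and wait up to wait_seconds for the process to exit.

    Returns the exit code, or None if the process is still running afterwards.
    The caller decides what to do next; this sketch never escalates to SIGKILL.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result means the process outlived the wait and the shutdown logs
should be checked before anything more forceful is attempted.
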
## Incident Triage

### Attempt Schedule Backlog Grows

Symptoms:

- `mail.attempt_schedule.depth` rises steadily
- `mail.attempt_schedule.oldest_age_ms` increases instead of oscillating
- queued deliveries remain in `queued` or `rendered` longer than expected

Checks:

1. confirm the scheduler is still logging regular activity
2. confirm Redis connectivity and latency for attempt-schedule keys
3. confirm attempt workers are running and not blocked on SMTP
4. inspect `mail.provider.send.duration_ms` for elevated latency
5. verify `MAIL_ATTEMPT_WORKER_CONCURRENCY` is appropriate for the workload

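The difference between a rising backlog and a healthy oscillating one can be
made concrete with a simple trend test over recent metric samples. This is a
sketch only: how the samples are fetched from the metrics backend is left
out, and the window size of five points is an arbitrary assumption.

```python
def rising_steadily(samples, min_points=5):
    """True if the last min_points samples never decrease and end higher."""
    window = samples[-min_points:]
    if len(window) < min_points:
        return False  # not enough data to call it a trend
    non_decreasing = all(b >= a for a, b in zip(window, window[1:]))
    return non_decreasing and window[-1] > window[0]


def backlog_growing(depth_samples, oldest_age_ms_samples, min_points=5):
    """Both symptoms together: depth rises and oldest age climbs as well."""
    return (rising_steadily(depth_samples, min_points)
            and rising_steadily(oldest_age_ms_samples, min_points))
```

A depth series such as `[10, 20, 30, 40, 50]` paired with an age series that
keeps dropping back, such as `[100, 50, 120, 40, 130]`, does not count as a
growing backlog under this test.
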
### Dead-Letter Spikes

Symptoms:

- `mail.delivery.dead_letters` increases rapidly
- operator reads show repeated `dead_letter` deliveries with recent
  `transport_failed` or `timed_out` attempts

Checks:

1. inspect recent provider summaries on dead-lettered deliveries
2. confirm SMTP reachability from the Mail Service process
3. compare the spike against `mail.provider.send.duration_ms` and timeout logs
4. verify the remote SMTP server is accepting `STARTTLS` and mail submission

Expected behavior:

- dead letters appear only after the fixed retry ladder is exhausted
- each dead-lettered delivery has a matching dead-letter entry

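The second expectation can be checked as a set comparison once both ID lists
have been exported. The sketch assumes dead-letter entries can be keyed by
delivery ID; how the IDs are read from Redis or the operator tooling is not
shown, and `deliveries_missing_entry` is a hypothetical name.

```python
def deliveries_missing_entry(dead_lettered_delivery_ids, dead_letter_entry_ids):
    """Return dead-lettered delivery IDs that have no dead-letter entry."""
    entries = set(dead_letter_entry_ids)
    return sorted(set(dead_lettered_delivery_ids) - entries)
```

An empty result matches the expected behavior; any ID returned points at a
delivery whose dead-letter record is missing.
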
### Repeated `suppressed` Outcomes

Symptoms:

- `mail.delivery.suppressed` rises unexpectedly
- auth or generic deliveries end as `suppressed`

Checks:

1. determine whether the source is `authsession` or `notification`
2. for auth deliveries, confirm the service is not intentionally running in
   `MAIL_SMTP_MODE=stub`
3. inspect provider summaries for policy-driven suppression markers
4. confirm the upstream business workflow still expects those deliveries to be
   skipped

Expected behavior:

- auth suppression is valid in stub mode and still counts as successful intake
- provider-side suppression is recorded as
  `mail_attempt.status=provider_rejected` together with
  `mail_delivery.status=suppressed`

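The checks and expected behavior above amount to a small decision table. The
sketch below encodes it under two assumptions: that "auth deliveries" are
those whose source is `authsession`, and that the last attempt status of a
suppressed delivery is available to the operator. `explain_suppression` is a
hypothetical helper, not part of the service.

```python
def explain_suppression(source, smtp_mode, last_attempt_status):
    """Classify one suppressed delivery according to the runbook expectations."""
    if source == "authsession" and smtp_mode == "stub":
        # Valid in stub mode; still counts as successful intake.
        return "expected: auth suppression in stub mode"
    if last_attempt_status == "provider_rejected":
        # Provider-side suppression: inspect the provider summary next.
        return "provider-side suppression"
    return "unexplained: confirm with the upstream workflow"
```
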
### SMTP Authentication Failures

Symptoms:

- provider summaries indicate auth or login failures
- delivery attempts shift toward `failed` or repeated retryable failures,
  depending on provider classification

Checks:

1. verify `MAIL_SMTP_USERNAME` and `MAIL_SMTP_PASSWORD` are both configured
2. verify the credential pair is valid for the target SMTP server
3. verify the sender identity matches the allowed submission account
4. confirm the server advertises the expected authentication mechanisms

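Check 4 can be done from the host that runs the Mail Service with Python's
standard `smtplib`, without sending any mail or credentials. This is a sketch
under the assumption that the server speaks plain SMTP with optional
`STARTTLS` on the configured port (not implicit TLS); `probe_smtp` and
`summarize_features` are hypothetical helper names.

```python
import smtplib


def summarize_features(esmtp_features):
    """Reduce smtplib's EHLO feature map to (starttls_offered, auth_mechanisms)."""
    starttls = "starttls" in esmtp_features
    mechanisms = esmtp_features.get("auth", "").split()
    return starttls, mechanisms


def probe_smtp(host, port, timeout=10.0):
    """Connect, upgrade to TLS when offered, and report what the server advertises."""
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        starttls, mechanisms = summarize_features(client.esmtp_features)
        if starttls:
            client.starttls()
            client.ehlo()  # many servers advertise AUTH only after TLS
            _, mechanisms = summarize_features(client.esmtp_features)
        return {"starttls": starttls, "auth_mechanisms": mechanisms}
```

If the returned mechanism list is empty or lacks the mechanism the service is
expected to use, the failure is likely on the server side rather than in the
configured credentials.
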
### SMTP Timeouts

Symptoms:

- `mail.attempt.outcomes{status="timed_out"}` increases
- `mail.provider.send.duration_ms` shifts upward
- logs show retry scheduling or dead-letter transitions after timeout paths

Checks:

1. confirm network reachability to `MAIL_SMTP_ADDR`
2. compare observed send duration with `MAIL_SMTP_TIMEOUT`
3. verify the SMTP server is not stalling during `STARTTLS`, auth, or `DATA`
4. confirm the process is not CPU-starved or blocked on Redis

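Check 2 is easier to judge as a ratio than by eye. The sketch below reports
what share of recent `mail.provider.send.duration_ms` samples fall close to
the timeout. It assumes the samples and the `MAIL_SMTP_TIMEOUT` value have
already been converted to milliseconds, and the 80% threshold is an arbitrary
illustration, not a documented limit.

```python
def share_near_timeout(durations_ms, timeout_ms, threshold=0.8):
    """Fraction of send durations at or above threshold * timeout_ms."""
    if not durations_ms:
        return 0.0
    limit = threshold * timeout_ms
    near = sum(1 for d in durations_ms if d >= limit)
    return near / len(durations_ms)
```

A share that climbs over successive windows suggests sends are drifting
toward the timeout even before `timed_out` outcomes rise.
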
### Malformed Stream Commands

Symptoms:

- `mail.stream_commands.malformed` increases
- logs contain `stream command rejected`

Checks:

1. inspect `failure_code`, `delivery_id`, `source`, and `stream_entry_id`
2. confirm the upstream command payload still matches
   [`../api/delivery-commands-asyncapi.yaml`](../api/delivery-commands-asyncapi.yaml)
3. confirm the producer still sends canonical `payload_mode`, locale, and
   idempotency fields
4. review stored malformed-command records through the operator tooling or
   direct Redis inspection

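When many commands are rejected at once, grouping them by producer and
failure code usually shows whether one upstream is at fault. The sketch
assumes the rejection log entries or stored malformed-command records have
already been parsed into dictionaries that carry the field names from check
1; the record shape beyond those names is an assumption.

```python
from collections import Counter


def count_rejections(records):
    """Count rejected commands per (source, failure_code) pair."""
    return Counter(
        (r.get("source", "unknown"), r.get("failure_code", "unknown"))
        for r in records
    )
```

`count_rejections(records).most_common(3)` lists the three most frequent
combinations, which is normally enough to decide which producer to contact.
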