# Operator Runbook

This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common `Mail Service` incidents.

## Startup Checks

Before starting the process, confirm:

- `MAIL_REDIS_MASTER_ADDR` and `MAIL_REDIS_PASSWORD` point to the Redis
  deployment that hosts the inbound `mail:delivery_commands` Stream and the
  persisted consumer offset
- `MAIL_POSTGRES_PRIMARY_DSN` points to the PostgreSQL deployment whose
  `mail` schema (provisioned externally for the `mailservice` role) holds the
  durable mail state: deliveries, attempts, dead letters, payloads,
  idempotency reservations, and malformed commands
- `MAIL_TEMPLATE_DIR` points to the intended immutable template catalog
- if `MAIL_SMTP_MODE=smtp`, the SMTP address, sender identity, and optional
  credentials are configured together
- the OpenTelemetry exporter settings point at the intended collector when
  traces or metrics are expected outside the process

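The checklist above can be partly automated before launch. The following is a minimal sketch, not the service's own validation: it assumes every listed variable must be non-empty, and `MAIL_SMTP_FROM` is a hypothetical name for the sender identity setting, which this runbook does not name. It checks presence only and cannot confirm that a value points at the intended deployment.

```python
import os

# Variable names taken from the startup checklist.
REQUIRED = (
    "MAIL_REDIS_MASTER_ADDR",
    "MAIL_REDIS_PASSWORD",
    "MAIL_POSTGRES_PRIMARY_DSN",
    "MAIL_TEMPLATE_DIR",
)


def preflight_problems(env, sender_var="MAIL_SMTP_FROM"):
    """Return a list of human-readable problems; empty means the basics are set.

    sender_var is a hypothetical name for the sender identity setting.
    """
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]

    if env.get("MAIL_SMTP_MODE") == "smtp":
        for name in ("MAIL_SMTP_ADDR", sender_var):
            if not env.get(name):
                problems.append(f"{name} is required when MAIL_SMTP_MODE=smtp")
        # Credentials are optional, but a lone username or password is a mistake.
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_pass = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_pass:
            problems.append(
                "MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together"
            )
    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print("preflight:", problem)
```

Run it in the same environment the process will start in; no output means none of the presence checks failed.
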
At startup the process pings the shared Redis master client, opens the
PostgreSQL pool, applies embedded goose migrations strictly before any HTTP
listener opens, parses the full template catalog, and only then starts the
internal HTTP listener and background workers.

Startup fails fast if any of those steps fails.

Known startup caveats:

- there is no `/healthz`, `/readyz`, or `/metrics` route
- traces and metrics are exported only through the configured OpenTelemetry
  exporters
- template changes are not hot-reloaded; restart is required after template
  edits

## Steady-State Verification

To verify readiness in practice:

1. confirm the process emitted startup logs for the internal HTTP listener,
   command consumer, scheduler, attempt worker pool, and SQL retention
   worker
2. open a TCP connection to `MAIL_INTERNAL_HTTP_ADDR`
3. issue one trusted smoke request such as
   `GET /api/v1/internal/deliveries/does-not-exist`
4. verify Redis and PostgreSQL connectivity, plus OpenTelemetry exporter
   health, out of band

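Steps 2 and 3 can be scripted. The sketch below uses only the Python standard library and assumes `MAIL_INTERNAL_HTTP_ADDR` is a plain `host:port` value (an empty host is treated as localhost) served over unencrypted HTTP. The helper names are illustrative. This runbook does not state which status code the smoke request returns, so the script reports the status instead of asserting one.

```python
import http.client
import socket


def split_addr(addr):
    """Split "host:port" into (host, port); an empty host means localhost."""
    host, _, port = addr.rpartition(":")
    return host or "127.0.0.1", int(port)


def tcp_reachable(addr, timeout=3.0):
    """Step 2: return True if a TCP connection to addr can be opened."""
    try:
        with socket.create_connection(split_addr(addr), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_status(addr, path="/api/v1/internal/deliveries/does-not-exist",
                 timeout=3.0):
    """Step 3: issue the smoke GET and return the HTTP status code."""
    host, port = split_addr(addr)
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Any well-formed HTTP response indicates that the listener accepted and routed the request; compare the status against what the internal API contract specifies for a missing delivery.
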
Expected steady-state signals:

- `mail.attempt_schedule.depth` remains bounded
- `mail.attempt_schedule.oldest_age_ms` stays near the active retry ladder
- `mail.delivery.dead_letters` changes rarely
- `mail.stream_commands.malformed` changes only on bad upstream commands
- internal HTTP logs include `otel_trace_id` and `otel_span_id`

## Shutdown

The process handles `SIGINT` and `SIGTERM`.

Shutdown behavior:

- coordinated shutdown is bounded by `MAIL_SHUTDOWN_TIMEOUT`
- the internal HTTP listener is stopped before process resources are closed
- the Redis master client and PostgreSQL pool are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, PostgreSQL, and template
   configuration
4. repeat the steady-state verification steps

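Steps 1 and 2 of the planned restart can be sketched as a small helper. This is an illustration under assumptions: it presumes the script itself launched the process as a `subprocess.Popen` child (under systemd or a container runtime the supervisor does this instead), and the wait bound is expected to be chosen slightly above `MAIL_SHUTDOWN_TIMEOUT`. `stop_gracefully` is a hypothetical name.

```python
import signal
import subprocess


def stop_gracefully(proc, wait_seconds):
    """Send SIGTERM and wait up to wait_seconds for the process to exit.

    Returns the exit code, or None if the process is still running
    when the deadline passes.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result means the process outlived the bound; check the shutdown logs for the stuck component before escalating to a harder stop.
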
## Incident Triage

### Attempt Schedule Backlog Grows

Symptoms:

- `mail.attempt_schedule.depth` rises steadily
- `mail.attempt_schedule.oldest_age_ms` increases instead of oscillating
- deliveries remain in `queued` or `rendered` longer than expected

Checks:

1. confirm the scheduler is still logging regular activity
2. confirm PostgreSQL connectivity and latency on the `deliveries`
   `(next_attempt_at)` partial index; scheduler claims rely on
   `FOR UPDATE SKIP LOCKED`, so contention here surfaces as backlog
3. confirm attempt workers are running and not blocked on SMTP
4. inspect `mail.provider.send.duration_ms` for elevated latency
5. verify `MAIL_ATTEMPT_WORKER_CONCURRENCY` is appropriate for the workload

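The difference between a steady rise and normal oscillation in the first two symptoms can be made concrete. The sketch below is one possible heuristic, not logic taken from the service: it assumes the operator has already pulled a series of `mail.attempt_schedule.depth` or `mail.attempt_schedule.oldest_age_ms` samples at a fixed interval from the metrics backend, and `rises_steadily` is a hypothetical helper name.

```python
def rises_steadily(samples, tolerance=0.0):
    """Return True if the series never drops by more than tolerance
    between consecutive samples and ends higher than it started."""
    if len(samples) < 2:
        return False
    no_real_drop = all(
        later >= earlier - tolerance
        for earlier, later in zip(samples, samples[1:])
    )
    return no_real_drop and samples[-1] > samples[0]


# A backlog that keeps growing versus one that drains between retries.
growing = [1200, 2500, 4100, 6000]
healthy = [1200, 4000, 900, 3800, 1100]
print(rises_steadily(growing), rises_steadily(healthy))
```

A small `tolerance` absorbs sampling noise; a series that regularly falls back toward its starting level is draining and is more likely ordinary retry-ladder behavior.
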
### Dead-Letter Spikes

Symptoms:

- `mail.delivery.dead_letters` increases rapidly
- operator reads show repeated `dead_letter` deliveries with recent
  `transport_failed` or `timed_out` attempts

Checks:

1. inspect recent provider summaries on dead-lettered deliveries
2. confirm SMTP reachability from the Mail Service process
3. compare the spike against `mail.provider.send.duration_ms` and timeout logs
4. verify the remote SMTP server is accepting `STARTTLS` and mail submission

Expected behavior:

- dead letters appear only after the fixed retry ladder is exhausted
- each dead-lettered delivery has a matching dead-letter entry

### Repeated `suppressed` Outcomes

Symptoms:

- `mail.delivery.suppressed` rises unexpectedly
- auth or generic deliveries end as `suppressed`

Checks:

1. determine whether the source is `authsession` or `notification`
2. for auth deliveries, confirm the service is not intentionally running in
   `MAIL_SMTP_MODE=stub`
3. inspect provider summaries for policy-driven suppression markers
4. confirm the upstream business workflow still expects those deliveries to be
   skipped

Expected behavior:

- auth suppression is valid in stub mode and still counts as successful intake
- provider-side suppression is recorded as
  `mail_attempt.status=provider_rejected` together with
  `mail_delivery.status=suppressed`

### SMTP Authentication Failures

Symptoms:

- provider summaries indicate auth or login failures
- delivery attempts shift toward `failed` or repeated retryable failures,
  depending on provider classification

Checks:

1. verify `MAIL_SMTP_USERNAME` and `MAIL_SMTP_PASSWORD` are both configured
2. verify the credential pair is valid for the target SMTP server
3. verify the sender identity matches the allowed submission account
4. confirm the server advertises the expected authentication mechanisms

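Check 4 can be performed from the host that runs the Mail Service with a short probe. The sketch below uses Python's standard `smtplib`; it assumes the server offers plain SMTP with an optional `STARTTLS` upgrade (not implicit TLS) and a certificate that validates against the system trust store. `probe_smtp` and `parse_auth_mechanisms` are illustrative names, and the probe does not log in or send mail.

```python
import smtplib
import ssl


def parse_auth_mechanisms(auth_feature):
    """Turn the EHLO AUTH parameter string (e.g. "PLAIN LOGIN") into a set."""
    return {mechanism.upper() for mechanism in auth_feature.split()}


def probe_smtp(host, port, timeout=10.0):
    """Connect, upgrade with STARTTLS if offered, and report capabilities."""
    report = {"starttls": False, "auth_mechanisms": set()}
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        if client.has_extn("starttls"):
            report["starttls"] = True
            client.starttls(context=ssl.create_default_context())
            client.ehlo()  # capabilities can change after the TLS upgrade
        report["auth_mechanisms"] = parse_auth_mechanisms(
            client.esmtp_features.get("auth", "")
        )
    return report
```

Split `MAIL_SMTP_ADDR` into host and port before calling `probe_smtp`. An empty `auth_mechanisms` set after the TLS upgrade suggests the server will not accept the configured credentials on that port.
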
### SMTP Timeouts

Symptoms:

- `mail.attempt.outcomes{status="timed_out"}` increases
- `mail.provider.send.duration_ms` shifts upward
- logs show retry scheduling or dead-letter transitions after timeout paths

Checks:

1. confirm network reachability to `MAIL_SMTP_ADDR`
2. compare observed send duration with `MAIL_SMTP_TIMEOUT`
3. verify the SMTP server is not stalling during `STARTTLS`, auth, or `DATA`
4. confirm the process is not CPU-starved or blocked on Redis

### Malformed Stream Commands

Symptoms:

- `mail.stream_commands.malformed` increases
- logs contain `stream command rejected`

Checks:

1. inspect `failure_code`, `delivery_id`, `source`, and `stream_entry_id`
2. confirm the upstream command payload still matches
   [`../api/delivery-commands-asyncapi.yaml`](../api/delivery-commands-asyncapi.yaml)
3. confirm the producer still sends canonical `payload_mode`, locale, and
   idempotency fields
4. review stored malformed-command records through the operator tooling or
   direct inspection of the PostgreSQL `mail` schema, which holds them

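Check 1 is easier across many rejections when the log lines are tallied. The sketch below assumes the process writes one JSON object per log line with the message under a `msg` key; both are assumptions about the log format, which this runbook does not specify. The `source` and `failure_code` field names come from the check above, and `tally_rejections` is a hypothetical helper.

```python
import collections
import json


def tally_rejections(lines, message="stream command rejected"):
    """Count rejected commands by (source, failure_code) from JSON log lines."""
    counts = collections.Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        if message not in str(record.get("msg", "")):
            continue
        counts[(record.get("source"), record.get("failure_code"))] += 1
    return counts
```

Feed it an open log file or any iterable of lines. A single dominant `(source, failure_code)` pair usually points at one producer and one contract violation, which narrows checks 2 and 3.
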