Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common Mail Service incidents.
Startup Checks
Before starting the process, confirm:
- MAIL_REDIS_MASTER_ADDR and MAIL_REDIS_PASSWORD point to the Redis deployment that hosts the inbound mail:delivery_commands Stream and the persisted consumer offset
- MAIL_POSTGRES_PRIMARY_DSN points to the PostgreSQL deployment whose mail schema (provisioned externally for the mailservice role) holds the durable mail state: deliveries, attempts, dead letters, payloads, idempotency reservations, malformed commands
- MAIL_TEMPLATE_DIR points to the intended immutable template catalog
- if MAIL_SMTP_MODE=smtp, the SMTP address, sender identity, and optional credentials are configured together
- the OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
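Part of the checklist above can be automated before launch. The following is a minimal sketch, not the service's own validation: it only looks at variable names that appear in this runbook, the function name missing_startup_settings is hypothetical, and treating MAIL_REDIS_PASSWORD as mandatory is an assumption. The sender identity and OpenTelemetry settings are left out because their variable names are not given here.

```python
import os

# Variables this runbook names for every startup.
# Assumption: all four must be non-empty in the target environment.
REQUIRED = (
    "MAIL_REDIS_MASTER_ADDR",
    "MAIL_REDIS_PASSWORD",
    "MAIL_POSTGRES_PRIMARY_DSN",
    "MAIL_TEMPLATE_DIR",
)


def missing_startup_settings(env):
    """Return a list of human-readable problems found in env (a mapping)."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]

    if env.get("MAIL_SMTP_MODE") == "smtp":
        if not env.get("MAIL_SMTP_ADDR"):
            problems.append("MAIL_SMTP_ADDR is not set but MAIL_SMTP_MODE=smtp")
        # Credentials are optional, but a half-configured pair is an error.
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_pass = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_pass:
            problems.append(
                "MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together"
            )
    return problems


if __name__ == "__main__":
    for problem in missing_startup_settings(os.environ):
        print(problem)
```

An empty result only means the named variables are present; it does not prove that they point at the intended Redis, PostgreSQL, or template deployment.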
At startup the process pings the shared Redis master client, opens the PostgreSQL pool, applies embedded goose migrations strictly before any HTTP listener opens, parses the full template catalog, and only then starts the internal HTTP listener and background workers.
Startup fails fast if any of those steps fail.
Known startup caveats:
- there is no /healthz, /readyz, or /metrics route
- traces and metrics are exported only through the configured OpenTelemetry exporters
- template changes are not hot-reloaded; restart is required after template edits
Steady-State Verification
Because there is no readiness route, verify readiness manually:
- confirm the process emitted startup logs for the internal HTTP listener, command consumer, scheduler, attempt worker pool, and SQL retention worker
- open a TCP connection to MAIL_INTERNAL_HTTP_ADDR
- issue one trusted smoke request such as GET /api/v1/internal/deliveries/does-not-exist
- verify Redis and PostgreSQL connectivity, plus OpenTelemetry exporter health, out of band
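The TCP and smoke-request steps above can be scripted. This is a rough sketch under several assumptions: MAIL_INTERNAL_HTTP_ADDR has the form host:port or :port (IPv6 literals are not handled), the listener speaks plain HTTP, and no extra headers are needed from a trusted network position. The helper names are hypothetical.

```python
import http.client
import socket

SMOKE_PATH = "/api/v1/internal/deliveries/does-not-exist"


def split_addr(addr, default_host="127.0.0.1"):
    """Split 'host:port' or ':port' into (host, port)."""
    host, _, port = addr.rpartition(":")
    return (host or default_host, int(port))


def tcp_reachable(addr, timeout=2.0):
    """Return True if a TCP connection to addr can be opened."""
    try:
        with socket.create_connection(split_addr(addr), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_status(addr, timeout=2.0):
    """Issue the smoke request and return the HTTP status code."""
    host, port = split_addr(addr)
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", SMOKE_PATH)
        return conn.getresponse().status
    finally:
        conn.close()
```

Any HTTP status shows that the listener is accepting and routing requests. Which status the service returns for an unknown delivery is not stated in this runbook, so the sketch reports the code instead of asserting one.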
Expected steady-state signals:
- mail.attempt_schedule.depth remains bounded
- mail.attempt_schedule.oldest_age_ms stays near the active retry ladder
- mail.delivery.dead_letters changes rarely
- mail.stream_commands.malformed changes only on bad upstream commands
- internal HTTP logs include otel_trace_id and otel_span_id
Shutdown
The process handles SIGINT and SIGTERM.
Shutdown behavior:
- coordinated shutdown is bounded by MAIL_SHUTDOWN_TIMEOUT
- the internal HTTP listener is stopped before process resources are closed
- the Redis master client and PostgreSQL pool are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
- send SIGTERM
- wait for listener and worker shutdown logs
- restart the process with the same Redis, PostgreSQL, and template configuration
- repeat the steady-state verification steps
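Where the process is supervised by a script instead of a service manager, the SIGTERM-and-wait step can look like the sketch below. It assumes a POSIX host and a subprocess.Popen handle for the Mail Service process; the function name stop_gracefully is hypothetical, and the wait budget is passed in seconds because the format of MAIL_SHUTDOWN_TIMEOUT is not described here.

```python
import signal
import subprocess


def stop_gracefully(proc, wait_seconds):
    """Send SIGTERM to proc and wait up to wait_seconds for it to exit.

    Returns the exit code, or None if the process is still running.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=wait_seconds)
    except subprocess.TimeoutExpired:
        return None
```

A wait budget slightly above MAIL_SHUTDOWN_TIMEOUT gives the coordinated shutdown room to finish. A None result means the bound was exceeded and the shutdown logs should be inspected before escalating.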
Incident Triage
Attempt Schedule Backlog Grows
Symptoms:
- mail.attempt_schedule.depth rises steadily
- mail.attempt_schedule.oldest_age_ms increases instead of oscillating
- queued deliveries remain in queued or rendered longer than expected
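The difference between "rises steadily" and "oscillates" can be made concrete with a small check over recent metric samples. This is an illustrative sketch only: how samples are pulled from the metrics backend is outside this runbook, and the function name and the five-sample minimum are assumptions.

```python
def rises_steadily(samples, min_points=5):
    """Return True if samples never decrease and end higher than they start."""
    if len(samples) < min_points:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]
```

For example, depth readings of 10, 14, 19, 25, 31 count as a steady rise, while 5, 9, 3, 8, 2, 7 reflect the normal fill-and-drain pattern.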
Checks:
- confirm the scheduler is still logging regular activity
- confirm PostgreSQL connectivity and latency on the deliveries(next_attempt_at) partial index; scheduler claims rely on FOR UPDATE SKIP LOCKED, so contention here surfaces as backlog
- confirm attempt workers are running and not blocked on SMTP
- inspect mail.provider.send.duration_ms for elevated latency
- verify MAIL_ATTEMPT_WORKER_CONCURRENCY is appropriate for the workload
Dead-Letter Spikes
Symptoms:
- mail.delivery.dead_letters increases rapidly
- operator reads show repeated dead_letter deliveries with recent transport_failed or timed_out attempts
Checks:
- inspect recent provider summaries on dead-lettered deliveries
- confirm SMTP reachability from the Mail Service process
- compare the spike against mail.provider.send.duration_ms and timeout logs
- verify the remote SMTP server is accepting STARTTLS and mail submission
Expected behavior:
- dead letters appear only after the fixed retry ladder is exhausted
- each dead-lettered delivery has a matching dead-letter entry
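As a mental model for "dead letters appear only after the fixed retry ladder is exhausted", the sketch below shows one way such a ladder can work. The delays in EXAMPLE_LADDER are invented for illustration, and the rule that each retryable failure consumes one rung is an assumption; the service's actual ladder and transition logic are not documented here.

```python
from datetime import timedelta

# Hypothetical ladder; the real delays are fixed by the service, not by this runbook.
EXAMPLE_LADDER = (timedelta(minutes=1), timedelta(minutes=5), timedelta(minutes=30))


def after_failed_attempt(failed_attempts, now, ladder=EXAMPLE_LADDER):
    """Decide what follows the Nth retryable failure (failed_attempts >= 1).

    Returns ("retry", next_attempt_at) while rungs remain,
    or ("dead_letter", None) once the ladder is exhausted.
    """
    if failed_attempts <= len(ladder):
        return ("retry", now + ladder[failed_attempts - 1])
    return ("dead_letter", None)
```

Under this model a delivery with a three-rung ladder is dead-lettered on its fourth retryable failure, so a dead letter with fewer recorded attempts than the ladder allows would be worth investigating.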
Repeated suppressed Outcomes
Symptoms:
- mail.delivery.suppressed rises unexpectedly
- auth or generic deliveries end as suppressed
Checks:
- determine whether the source is authsession or notification
- for auth deliveries, confirm the service is not intentionally running in MAIL_SMTP_MODE=stub
- inspect provider summaries for policy-driven suppression markers
- confirm the upstream business workflow still expects those deliveries to be skipped
Expected behavior:
- auth suppression is valid in stub mode and still counts as successful intake
- provider-side suppression is recorded as mail_attempt.status=provider_rejected together with mail_delivery.status=suppressed
SMTP Authentication Failures
Symptoms:
- provider summaries indicate auth or login failures
- delivery attempts shift toward failed or repeated retryable failures, depending on provider classification
Checks:
- verify MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD are both configured
- verify the credential pair is valid for the target SMTP server
- verify the sender identity matches the allowed submission account
- confirm the server advertises the expected authentication mechanisms
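To see which authentication mechanisms the server advertises, a probe along these lines can be run from the Mail Service host. It is a sketch using Python's standard smtplib, not the service's own SMTP client; the helper names are hypothetical, and it assumes the server offers STARTTLS on the submission port rather than implicit TLS.

```python
import smtplib


def auth_mechanisms(esmtp_features):
    """Extract AUTH mechanism names from smtplib's esmtp_features mapping."""
    return esmtp_features.get("auth", "").split()


def probe_smtp(host, port, timeout=10.0):
    """Return (starttls_offered, auth_mechanisms) as advertised after EHLO."""
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        starttls = client.has_extn("starttls")
        if starttls:
            # Many servers advertise AUTH only on the encrypted channel.
            client.starttls()
            client.ehlo()
        return starttls, auth_mechanisms(client.esmtp_features)
```

An empty mechanism list after STARTTLS suggests the server does not accept authenticated submission on that port, which would explain login failures regardless of the credential pair.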
SMTP Timeouts
Symptoms:
- mail.attempt.outcomes{status="timed_out"} increases
- mail.provider.send.duration_ms shifts upward
- logs show retry scheduling or dead-letter transitions after timeout paths
Checks:
- confirm network reachability to MAIL_SMTP_ADDR
- compare observed send duration with MAIL_SMTP_TIMEOUT
- verify the SMTP server is not stalling during STARTTLS, auth, or DATA
- confirm the process is not CPU-starved or blocked on Redis
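To find which SMTP phase is stalling, each phase can be timed separately and compared with MAIL_SMTP_TIMEOUT. The sketch below uses Python's standard smtplib and covers only connect, EHLO, and STARTTLS; the auth and DATA phases are omitted because they need real credentials and a real message. The helper names are hypothetical, and the timeout is passed in seconds.

```python
import smtplib
import time


def timed(label, fn, timings):
    """Run fn(), store its wall-clock duration in timings[label], return its result."""
    start = time.monotonic()
    try:
        return fn()
    finally:
        timings[label] = time.monotonic() - start


def smtp_phase_timings(host, port, timeout):
    """Time the connect, EHLO, and STARTTLS phases against one SMTP server."""
    timings = {}
    client = timed("connect", lambda: smtplib.SMTP(host, port, timeout=timeout), timings)
    try:
        timed("ehlo", client.ehlo, timings)
        if client.has_extn("starttls"):
            timed("starttls", client.starttls, timings)
    finally:
        client.close()
    return timings
```

If a phase exceeds the timeout, smtplib raises and the function does not return, so the exception itself identifies the slow step. When all three phases are fast, the stall is more likely in auth, DATA, or on the Mail Service side.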
Malformed Stream Commands
Symptoms:
- mail.stream_commands.malformed increases
- logs contain stream command rejected
Checks:
- inspect failure_code, delivery_id, source, and stream_entry_id
- confirm the upstream command payload still matches ../api/delivery-commands-asyncapi.yaml
- confirm the producer still sends canonical payload_mode, locale, and idempotency fields
- review stored malformed-command records through the operator tooling or direct Redis inspection