2026-04-26 20:34:39 +02:00


# Operator Runbook
This runbook covers the checks that matter most during startup, steady-state
verification, shutdown, and common `Mail Service` incidents.
## Startup Checks
Before starting the process, confirm:
- `MAIL_REDIS_MASTER_ADDR` and `MAIL_REDIS_PASSWORD` point to the Redis
deployment that hosts the inbound `mail:delivery_commands` Stream and the
persisted consumer offset
- `MAIL_POSTGRES_PRIMARY_DSN` points to the PostgreSQL deployment whose
`mail` schema (provisioned externally for the `mailservice` role) holds the
durable mail state: deliveries, attempts, dead letters, payloads,
idempotency reservations, and malformed commands
- `MAIL_TEMPLATE_DIR` points to the intended immutable template catalog
- if `MAIL_SMTP_MODE=smtp`, the SMTP address, sender identity, and optional
credentials are configured together
- the OpenTelemetry exporter settings point at the intended collector when
traces or metrics are expected outside the process
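
The checklist above can be partly automated before the process is launched. The following is a minimal sketch, not part of the service: it only checks that the variables named in this runbook are present and mutually consistent. Treating all four connection and template variables as required, and the function name `preflight_problems`, are assumptions made for illustration. The sender identity is not checked because this runbook does not name its variable.

```python
import os

# Variables this runbook names as startup prerequisites.
# Assumption: all four must be non-empty for a normal deployment.
REQUIRED = (
    "MAIL_REDIS_MASTER_ADDR",
    "MAIL_REDIS_PASSWORD",
    "MAIL_POSTGRES_PRIMARY_DSN",
    "MAIL_TEMPLATE_DIR",
)


def preflight_problems(env):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]

    template_dir = env.get("MAIL_TEMPLATE_DIR")
    if template_dir and not os.path.isdir(template_dir):
        problems.append(f"MAIL_TEMPLATE_DIR is not a directory: {template_dir}")

    if env.get("MAIL_SMTP_MODE") == "smtp":
        if not env.get("MAIL_SMTP_ADDR"):
            problems.append("MAIL_SMTP_MODE=smtp but MAIL_SMTP_ADDR is not set")
        # Credentials are optional, but a username without a password
        # (or the reverse) means they were not configured together.
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_pass = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_pass:
            problems.append(
                "MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together"
            )
    return problems
```

Calling `preflight_problems(os.environ)` in the launch environment and refusing to start on a non-empty result catches the most common misconfigurations. It does not prove that the values point at the intended deployments.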
At startup the process pings the shared Redis master client, opens the
PostgreSQL pool, applies embedded goose migrations strictly before any HTTP
listener opens, parses the full template catalog, and only then starts the
internal HTTP listener and background workers.
Startup fails fast if any of those steps fails.
Known startup caveats:
- there is no `/healthz`, `/readyz`, or `/metrics` route
- traces and metrics are exported only through the configured OpenTelemetry
exporters
- template changes are not hot-reloaded; restart is required after template
edits
## Steady-State Verification
Because the service exposes no readiness endpoint, verify readiness as follows:
1. confirm the process emitted startup logs for the internal HTTP listener,
command consumer, scheduler, attempt worker pool, and SQL retention
worker
2. open a TCP connection to `MAIL_INTERNAL_HTTP_ADDR`
3. issue one trusted smoke request such as
`GET /api/v1/internal/deliveries/does-not-exist`
4. verify Redis and PostgreSQL connectivity, plus OpenTelemetry exporter
health, out of band
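
Steps 2 and 3 can be scripted. The sketch below assumes `MAIL_INTERNAL_HTTP_ADDR` resolves to a plain-HTTP host and port reachable from where the script runs. The helper names are illustrative. This runbook does not state which status code the smoke request returns, so the function only reports the code; any well-formed HTTP response shows that the listener is accepting and routing requests.

```python
import http.client
import socket

SMOKE_PATH = "/api/v1/internal/deliveries/does-not-exist"


def tcp_reachable(host, port, timeout=3.0):
    """Step 2: return True if a TCP connection to the listener can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_status(host, port, timeout=3.0):
    """Step 3: issue the smoke GET and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", SMOKE_PATH)
        return conn.getresponse().status
    finally:
        conn.close()
```

A connection error or timeout from `smoke_status` is the signal to investigate; the status code should be compared against what the service is known to return for a missing delivery.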
Expected steady-state signals:
- `mail.attempt_schedule.depth` remains bounded
- `mail.attempt_schedule.oldest_age_ms` stays close to the delays of the active
retry ladder
- `mail.delivery.dead_letters` changes rarely
- `mail.stream_commands.malformed` changes only on bad upstream commands
- internal HTTP logs include `otel_trace_id` and `otel_span_id`
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behavior:
- coordinated shutdown is bounded by `MAIL_SHUTDOWN_TIMEOUT`
- the internal HTTP listener is stopped before process resources are closed
- the Redis master client and PostgreSQL pool are closed after the app stops
- OpenTelemetry providers are flushed during runtime cleanup
During a planned restart:
1. send `SIGTERM`
2. wait for listener and worker shutdown logs
3. restart the process with the same Redis, PostgreSQL, and template
configuration
4. repeat the steady-state verification steps
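
Steps 1 and 2 of the planned restart can be sketched as a bounded stop. This example assumes the script itself launched the process and holds it as a `subprocess.Popen`; in most deployments a supervisor plays this role instead. The function name is illustrative, and the wait should be somewhat longer than `MAIL_SHUTDOWN_TIMEOUT` so that the service's own coordinated shutdown can finish first.

```python
import signal
import subprocess


def stop_gracefully(proc, timeout_s):
    """Send SIGTERM and wait up to timeout_s seconds.

    Returns the exit code, or None if the process is still running
    when the wait expires (the caller decides what to do next).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
```

A `None` result means shutdown exceeded the allowed window and deserves a look at the listener and worker shutdown logs before anything more forceful is attempted.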
## Incident Triage
### Attempt Schedule Backlog Grows
Symptoms:
- `mail.attempt_schedule.depth` rises steadily
- `mail.attempt_schedule.oldest_age_ms` increases instead of oscillating
- deliveries remain in the `queued` or `rendered` state longer than expected
Checks:
1. confirm the scheduler is still logging regular activity
2. confirm PostgreSQL connectivity and latency on the `deliveries`
`(next_attempt_at)` partial index; scheduler claims rely on
`FOR UPDATE SKIP LOCKED`, so contention here surfaces as backlog
3. confirm attempt workers are running and not blocked on SMTP
4. inspect `mail.provider.send.duration_ms` for elevated latency
5. verify `MAIL_ATTEMPT_WORKER_CONCURRENCY` is appropriate for the workload
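
To tell a steady rise from normal oscillation, a simple heuristic over recent samples of `mail.attempt_schedule.depth` or `mail.attempt_schedule.oldest_age_ms` is often enough. The sketch below is an assumption about how an operator might classify a series, not logic taken from the service; how the samples are fetched from the metrics backend is left out, and `min_points` is an arbitrary illustrative default.

```python
def is_steadily_rising(samples, min_points=5):
    """Return True when the series never decreases and ends above where it began.

    An oscillating series (values going up and down) returns False,
    as does a flat series or one with too few points to judge.
    """
    if len(samples) < min_points:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]
```

For example, `is_steadily_rising([1, 2, 2, 5, 9])` is `True`, while `is_steadily_rising([5, 1, 6, 2, 7])` is `False` because the values fall between samples.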
### Dead-Letter Spikes
Symptoms:
- `mail.delivery.dead_letters` increases rapidly
- operator reads show repeated `dead_letter` deliveries with recent
`transport_failed` or `timed_out` attempts
Checks:
1. inspect recent provider summaries on dead-lettered deliveries
2. confirm SMTP reachability from the Mail Service process
3. compare the spike against `mail.provider.send.duration_ms` and timeout logs
4. verify the remote SMTP server is accepting `STARTTLS` and mail submission
Expected behavior:
- dead letters appear only after the fixed retry ladder is exhausted
- each dead-lettered delivery has a matching dead-letter entry
### Repeated `suppressed` Outcomes
Symptoms:
- `mail.delivery.suppressed` rises unexpectedly
- auth or generic deliveries end as `suppressed`
Checks:
1. determine whether the source is `authsession` or `notification`
2. for auth deliveries, check whether the service is running in
`MAIL_SMTP_MODE=stub`, where suppression is expected
3. inspect provider summaries for policy-driven suppression markers
4. confirm the upstream business workflow still expects those deliveries to be
skipped
Expected behavior:
- auth suppression is valid in stub mode and still counts as successful intake
- provider-side suppression is recorded as
`mail_attempt.status=provider_rejected` together with
`mail_delivery.status=suppressed`
### SMTP Authentication Failures
Symptoms:
- provider summaries indicate auth or login failures
- delivery attempts shift toward `failed` or repeated retryable failures,
depending on provider classification
Checks:
1. verify `MAIL_SMTP_USERNAME` and `MAIL_SMTP_PASSWORD` are both configured
2. verify the credential pair is valid for the target SMTP server
3. verify the sender identity matches the allowed submission account
4. confirm the server advertises the expected authentication mechanisms
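
Check 4 can be performed with Python's standard `smtplib`, which records the extensions a server lists in its `EHLO` reply. The sketch below is a hedged example: the function names are illustrative, and the host and port are assumed to be taken from `MAIL_SMTP_ADDR`. It also reports whether `STARTTLS` is advertised, which is relevant to the dead-letter checks. Some servers advertise `AUTH` only after TLS is established, so an empty mechanism list on a plain connection is not conclusive.

```python
import smtplib


def summarize_esmtp(features):
    """Reduce smtplib's esmtp_features mapping to STARTTLS and AUTH facts."""
    auth = features.get("auth", "")
    return {
        "starttls": "starttls" in features,
        "auth_mechanisms": sorted(set(auth.upper().split())),
    }


def probe_smtp(host, port, timeout=10.0):
    """Connect, send EHLO, and report what the server advertises.

    No credentials are sent and no mail is submitted.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        return summarize_esmtp(client.esmtp_features)
```

Running `probe_smtp` from the same host or network namespace as the Mail Service process also confirms basic reachability; a connection error or timeout there points at the network rather than at credentials.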
### SMTP Timeouts
Symptoms:
- `mail.attempt.outcomes{status="timed_out"}` increases
- `mail.provider.send.duration_ms` shifts upward
- logs show retry scheduling or dead-letter transitions after timeout paths
Checks:
1. confirm network reachability to `MAIL_SMTP_ADDR`
2. compare observed send duration with `MAIL_SMTP_TIMEOUT`
3. verify the SMTP server is not stalling during `STARTTLS`, auth, or `DATA`
4. confirm the process is not CPU-starved or blocked on Redis
### Malformed Stream Commands
Symptoms:
- `mail.stream_commands.malformed` increases
- logs contain `stream command rejected`
Checks:
1. inspect `failure_code`, `delivery_id`, `source`, and `stream_entry_id`
2. confirm the upstream command payload still matches
[`../api/delivery-commands-asyncapi.yaml`](../api/delivery-commands-asyncapi.yaml)
3. confirm the producer still sends canonical `payload_mode`, locale, and
idempotency fields
4. review stored malformed-command records through the operator tooling or
direct inspection of the `mail` schema in PostgreSQL
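
When many commands are rejected at once, grouping the rejection logs by `source` and `failure_code` usually identifies the misbehaving producer quickly. The sketch below assumes the service writes one JSON object per log line and that the fields from check 1 appear as top-level keys. This runbook does not specify the log format, so adjust the parsing to match the real output.

```python
import json
from collections import Counter


def tally_rejections(lines):
    """Count 'stream command rejected' log lines by (source, failure_code).

    Assumes JSON-lines logs; lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        if "stream command rejected" not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            key = (record.get("source"), record.get("failure_code"))
            counts[key] += 1
    return counts
```

Passing an open log file to `tally_rejections` and reading `most_common()` on the result shows which producer and failure code dominate the spike.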