galaxy-game/mail/docs/runbook.md
2026-04-17 18:39:16 +02:00

Operator Runbook

This runbook covers the checks that matter most during startup, steady-state verification, shutdown, and common Mail Service incidents.

Startup Checks

Before starting the process, confirm:

  • MAIL_REDIS_ADDR points to the Redis deployment that stores deliveries, attempts, idempotency reservations, malformed commands, and stream offsets
  • the configured Redis ACL, DB, TLS, and timeout settings match the target environment
  • MAIL_TEMPLATE_DIR points to the intended immutable template catalog
  • if MAIL_SMTP_MODE=smtp, the SMTP address, sender identity, and optional credentials are configured together
  • the OpenTelemetry exporter settings point at the intended collector when traces or metrics are expected outside the process
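
The configuration checks above can be partly automated before the process is started. The following is a minimal sketch, not the service's own validation: `MAIL_SMTP_ADDR`, `MAIL_SMTP_USERNAME`, and `MAIL_SMTP_PASSWORD` appear elsewhere in this runbook, while `MAIL_SMTP_FROM` is a hypothetical name standing in for the sender identity setting, and the function name is illustrative.

```python
import os


def preflight_problems(env):
    """Return a list of human-readable problems found in a mapping of settings."""
    problems = []
    if not env.get("MAIL_REDIS_ADDR"):
        problems.append("MAIL_REDIS_ADDR is not set")

    template_dir = env.get("MAIL_TEMPLATE_DIR", "")
    if not template_dir:
        problems.append("MAIL_TEMPLATE_DIR is not set")
    elif not os.path.isdir(template_dir):
        problems.append(f"MAIL_TEMPLATE_DIR is not a directory: {template_dir}")

    if env.get("MAIL_SMTP_MODE") == "smtp":
        # MAIL_SMTP_FROM is a hypothetical name for the sender identity setting.
        for name in ("MAIL_SMTP_ADDR", "MAIL_SMTP_FROM"):
            if not env.get(name):
                problems.append(f"{name} is required when MAIL_SMTP_MODE=smtp")
        # Credentials are optional, but a half-configured pair is an error.
        has_user = bool(env.get("MAIL_SMTP_USERNAME"))
        has_pass = bool(env.get("MAIL_SMTP_PASSWORD"))
        if has_user != has_pass:
            problems.append(
                "MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD must be set together"
            )
    return problems


if __name__ == "__main__":
    for problem in preflight_problems(os.environ):
        print("preflight:", problem)
```

An empty result only means the settings are present and mutually consistent; it does not prove that they match the target environment.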

At startup the process performs bounded PING checks for both Redis clients used by the runtime and parses the full template catalog.

Startup fails fast if those checks fail or if the template catalog cannot be loaded.

Known startup caveats:

  • there is no /healthz, /readyz, or /metrics route
  • traces and metrics are exported only through the configured OpenTelemetry exporters
  • template changes are not hot-reloaded; restart is required after template edits

Steady-State Verification

With no readiness endpoint available, verify readiness in practice as follows:

  1. confirm the process emitted startup logs for the internal HTTP listener, command consumer, scheduler, and worker pool
  2. open a TCP connection to MAIL_INTERNAL_HTTP_ADDR
  3. issue one trusted smoke request such as GET /api/v1/internal/deliveries/does-not-exist
  4. verify Redis connectivity and OpenTelemetry exporter health out of band
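
Steps 2 and 3 above can be scripted. The sketch below assumes that `MAIL_INTERNAL_HTTP_ADDR` has a `host:port` shape (with an empty host meaning the local machine) and that the listener speaks plain HTTP to the caller; both are assumptions, as are the function names. It returns whatever status the service sends, because the expected status for a missing delivery is not stated in this runbook.

```python
import http.client
import socket


def split_addr(addr, default_host="127.0.0.1"):
    """Split 'host:port' into (host, port); an empty host falls back to default_host."""
    host, _, port = addr.rpartition(":")
    return (host or default_host), int(port)


def smoke_check(host, port,
                path="/api/v1/internal/deliveries/does-not-exist",
                timeout=3.0):
    """Open a TCP connection, then issue one GET and return the HTTP status."""
    # Step 2: a bare TCP connect shows the listener accepts connections.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    # Step 3: one smoke request shows the HTTP stack answers.
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        conn.close()
```

Any well-formed HTTP response shows the listener is serving. A connection error or timeout raises an exception, which is the signal to go back to the startup logs.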

Expected steady-state signals:

  • mail.attempt_schedule.depth remains bounded
  • mail.attempt_schedule.oldest_age_ms stays near the active retry ladder
  • mail.delivery.dead_letters changes rarely
  • mail.stream_commands.malformed changes only on bad upstream commands
  • internal HTTP logs include otel_trace_id and otel_span_id

Shutdown

The process handles SIGINT and SIGTERM.

Shutdown behavior:

  • coordinated shutdown is bounded by MAIL_SHUTDOWN_TIMEOUT
  • the internal HTTP listener is stopped before process resources are closed
  • Redis clients are closed after the app stops
  • OpenTelemetry providers are flushed during runtime cleanup

During a planned restart:

  1. send SIGTERM
  2. wait for listener and worker shutdown logs
  3. restart the process with the same Redis and template configuration
  4. repeat the steady-state verification steps
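
Where the process is launched by operator tooling rather than by a service manager, steps 1 and 2 can be sketched as below. This assumes a POSIX host and that the tooling holds the child process handle; the function name is illustrative, and the timeout passed in should be at least the configured `MAIL_SHUTDOWN_TIMEOUT` expressed in seconds.

```python
import signal
import subprocess


def stop_gracefully(proc, shutdown_timeout_s):
    """Send SIGTERM to a child process and wait up to shutdown_timeout_s seconds.

    Returns the exit code, or None if the process is still running at the deadline.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=shutdown_timeout_s)
    except subprocess.TimeoutExpired:
        return None  # still running; escalate manually rather than automatically
```

A `None` result means coordinated shutdown did not finish in time. The exit code alone does not replace checking the listener and worker shutdown logs.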

Incident Triage

Attempt Schedule Backlog Grows

Symptoms:

  • mail.attempt_schedule.depth rises steadily
  • mail.attempt_schedule.oldest_age_ms increases instead of oscillating
  • deliveries remain in the queued or rendered status longer than expected

Checks:

  1. confirm the scheduler is still logging regular activity
  2. confirm Redis connectivity and latency for attempt-schedule keys
  3. confirm attempt workers are running and not blocked on SMTP
  4. inspect mail.provider.send.duration_ms for elevated latency
  5. verify MAIL_ATTEMPT_WORKER_CONCURRENCY is appropriate for the workload
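
Telling a steady rise from normal oscillation is easier with a few consecutive gauge readings than with a single value. The helper below is a rough heuristic sketch, not an alerting rule from the service: the function name and the 0.8 threshold are assumptions, and the readings are expected to come from whatever metrics backend receives the OpenTelemetry export.

```python
def is_steadily_rising(samples, min_rise_fraction=0.8):
    """Return True when most consecutive samples increase and the series ends higher.

    samples: readings of one gauge in time order, for example
    mail.attempt_schedule.oldest_age_ms scraped at a fixed interval.
    """
    if len(samples) < 3:
        return False  # too few points to tell a trend from noise
    rises = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return samples[-1] > samples[0] and rises / (len(samples) - 1) >= min_rise_fraction
```

A series such as 100, 250, 400, 620, 900 counts as rising, while one that swings between high and low values does not.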

Dead-Letter Spikes

Symptoms:

  • mail.delivery.dead_letters increases rapidly
  • operator reads show repeated dead_letter deliveries with recent transport_failed or timed_out attempts

Checks:

  1. inspect recent provider summaries on dead-lettered deliveries
  2. confirm SMTP reachability from the Mail Service process
  3. compare the spike against mail.provider.send.duration_ms and timeout logs
  4. verify the remote SMTP server is accepting STARTTLS and mail submission

Expected behavior:

  • dead letters appear only after the fixed retry ladder is exhausted
  • each dead-lettered delivery has a matching dead-letter entry
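
The second expectation can be checked by comparing two ID lists. How those lists are obtained (operator reads or direct Redis inspection) is outside this sketch, and the function name is illustrative.

```python
def unmatched_dead_letters(dead_lettered_delivery_ids, dead_letter_entry_ids):
    """Return IDs of dead-lettered deliveries that have no dead-letter entry, sorted."""
    return sorted(set(dead_lettered_delivery_ids) - set(dead_letter_entry_ids))
```

A non-empty result points at deliveries whose dead-letter entry is missing and is worth investigating before the retry path itself.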

Repeated suppressed Outcomes

Symptoms:

  • mail.delivery.suppressed rises unexpectedly
  • auth or generic deliveries end as suppressed

Checks:

  1. determine whether the source is authsession or notification
  2. for auth deliveries, confirm the service is not intentionally running in MAIL_SMTP_MODE=stub
  3. inspect provider summaries for policy-driven suppression markers
  4. confirm the upstream business workflow still expects those deliveries to be skipped

Expected behavior:

  • auth suppression is valid in stub mode and still counts as successful intake
  • provider-side suppression is recorded as mail_attempt.status=provider_rejected together with mail_delivery.status=suppressed

SMTP Authentication Failures

Symptoms:

  • provider summaries indicate auth or login failures
  • delivery attempts shift toward failed or repeated retryable failures, depending on provider classification

Checks:

  1. verify MAIL_SMTP_USERNAME and MAIL_SMTP_PASSWORD are both configured
  2. verify the credential pair is valid for the target SMTP server
  3. verify the sender identity matches the allowed submission account
  4. confirm the server advertises the expected authentication mechanisms
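
Check 4 can be done from the Mail Service host with Python's standard `smtplib`. This is a sketch under assumptions: the server accepts a plain connection followed by optional STARTTLS (an implicit-TLS port would need `smtplib.SMTP_SSL` instead), and the helper names are illustrative. No credentials are sent.

```python
import smtplib


def parse_auth_mechanisms(auth_params):
    """Split the parameter string of the EHLO AUTH extension into mechanism names."""
    return [mechanism.upper() for mechanism in auth_params.split()]


def advertised_auth_mechanisms(host, port, timeout=10.0):
    """Connect, upgrade with STARTTLS when offered, and return advertised mechanisms."""
    with smtplib.SMTP(host, port, timeout=timeout) as client:
        client.ehlo()
        if client.has_extn("starttls"):
            client.starttls()
            client.ehlo()  # extensions must be read again after STARTTLS
        return parse_auth_mechanisms(client.esmtp_features.get("auth", ""))
```

An empty list means the server did not advertise AUTH at all on that connection, which often differs before and after STARTTLS.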

SMTP Timeouts

Symptoms:

  • mail.attempt.outcomes{status="timed_out"} increases
  • mail.provider.send.duration_ms shifts upward
  • logs show retry scheduling or dead-letter transitions after timeout paths

Checks:

  1. confirm network reachability to MAIL_SMTP_ADDR
  2. compare observed send duration with MAIL_SMTP_TIMEOUT
  3. verify the SMTP server is not stalling during STARTTLS, auth, or DATA
  4. confirm the process is not CPU-starved or blocked on Redis
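
To see which SMTP stage stalls, each stage can be timed separately and compared with the timeout. The sketch below covers connect, EHLO, and STARTTLS only; auth and DATA are left out because they need credentials and a real message. The function names and the 0.5 fraction are assumptions, and the timeout is passed in seconds because the format of `MAIL_SMTP_TIMEOUT` is not stated here.

```python
import smtplib
import time


def time_phases(phases):
    """Run (name, callable) pairs in order and return {name: elapsed_seconds}."""
    timings = {}
    for name, action in phases:
        started = time.monotonic()
        action()
        timings[name] = time.monotonic() - started
    return timings


def slow_phases(timings, timeout_s, fraction=0.5):
    """Return names of phases that used at least `fraction` of the timeout."""
    return [name for name, elapsed in timings.items() if elapsed >= fraction * timeout_s]


def probe_smtp(host, port, timeout_s):
    """Time connect, EHLO, and STARTTLS against one SMTP server."""
    client = smtplib.SMTP(timeout=timeout_s)  # no host given, so nothing connects yet
    try:
        return time_phases([
            ("connect", lambda: client.connect(host, port)),
            ("ehlo", client.ehlo),
            ("starttls", client.starttls),
        ])
    finally:
        client.close()
```

Passing the result of `probe_smtp` to `slow_phases` with the same timeout lists the stages that consumed a large share of the budget.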

Malformed Stream Commands

Symptoms:

  • mail.stream_commands.malformed increases
  • logs contain the message "stream command rejected"

Checks:

  1. inspect failure_code, delivery_id, source, and stream_entry_id
  2. confirm the upstream command payload still matches ../api/delivery-commands-asyncapi.yaml
  3. confirm the producer still sends canonical payload_mode, locale, and idempotency fields
  4. review stored malformed-command records through the operator tooling or direct Redis inspection
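
When many records are involved, grouping them by source and failure code usually shows quickly whether one producer is responsible. The sketch below assumes the records have already been exported as dictionaries carrying the `source` and `failure_code` fields named in check 1; the function name is illustrative.

```python
from collections import Counter


def summarize_malformed(records):
    """Count malformed-command records by (source, failure_code), most frequent first."""
    counts = Counter(
        (record.get("source", "?"), record.get("failure_code", "?"))
        for record in records
    )
    return counts.most_common()
```

A single dominant pair suggests a producer-side regression against the command schema rather than scattered bad input.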