fix(backend): retry migrations on transient connection errors #74
Fixes the intermittent `TestDiplomailAsyncFallbackOnUnsupportedPair` failure (and the eight other testcontainer e2e harnesses) seen on `development` after #73, where applying migrations against a freshly started Postgres container occasionally died with `driver: bad connection`.

Root cause. A freshly started Postgres, notably a test container, can reset a pooled connection moments after it logs "ready to accept connections". The harness already waits for that log twice (`WithOccurrence(2)`) and pings before migrating, but goose can still draw a connection that Postgres then resets, killing the migration transaction on the first `CREATE TABLE`. The CI log showed the very next migration attempt succeeding.

Fix. `backendpg.ApplyMigrations` now wraps the schema-create + goose run in a bounded retry (`retryOnTransient`, 5 attempts / 250 ms backoff) that fires only on transient connection errors (`driver.ErrBadConn` plus the connection-failure strings Postgres drivers surface). Both steps are idempotent (`CREATE SCHEMA IF NOT EXISTS` + goose version tracking), so a retry after a dropped connection resumes cleanly. Deterministic SQL errors (syntax, constraint) still fail fast.

`ApplyMigrations` is shared by all nine e2e harnesses and by service startup (`cmd/backend`), so this also hardens production migration-on-startup against a momentarily-unready DB. The change is additive, with no effect on the success path.

New deterministic unit tests cover the classifier and the retry behaviour (no DB needed); the real-container migration path and the formerly-flaky diplomail test were verified locally.