fix(backend): retry migrations on transient connection errors #74

Merged
developer merged 1 commits from feature/pg-migration-transient-retry into development 2026-05-30 12:51:40 +00:00
Owner

Fixes the intermittent TestDiplomailAsyncFallbackOnUnsupportedPair failure (and the eight other testcontainer e2e harnesses) seen on development after #73, where applying migrations against a freshly started Postgres container occasionally died with driver: bad connection.

Root cause. A freshly started Postgres — notably a test container — can reset a pooled connection moments after it logs "ready to accept connections". The harness already waits for that log twice (WithOccurrence(2)) and pings before migrating, but goose can still draw a connection postgres then resets, killing the migration transaction on the first CREATE TABLE. The CI log showed the very next migration attempt succeeding.

Fix. backendpg.ApplyMigrations now wraps the schema-create + goose run in a bounded retry (retryOnTransient, 5 attempts / 250 ms backoff) that fires only on transient connection errors (driver.ErrBadConn plus the connection-failure strings Postgres drivers surface). Both steps are idempotent (CREATE SCHEMA IF NOT EXISTS + goose version tracking), so a retry after a dropped connection resumes cleanly. Deterministic SQL errors (syntax, constraint) still fail fast.

ApplyMigrations is shared by all nine e2e harnesses and by service startup (cmd/backend), so this also hardens production migration-on-startup against a momentarily-unready DB — additive, with no effect on the success path.

New deterministic unit tests cover the classifier and the retry behaviour (no DB needed); the real-container migration path and the formerly-flaky diplomail test were verified locally.

Fixes the intermittent `TestDiplomailAsyncFallbackOnUnsupportedPair` failure (and the eight other testcontainer e2e harnesses) seen on `development` after #73, where applying migrations against a freshly started Postgres container occasionally died with `driver: bad connection`. **Root cause.** A freshly started Postgres — notably a test container — can reset a pooled connection moments after it logs "ready to accept connections". The harness already waits for that log twice (`WithOccurrence(2)`) and pings before migrating, but goose can still draw a connection postgres then resets, killing the migration transaction on the first `CREATE TABLE`. The CI log showed the very next migration attempt succeeding. **Fix.** `backendpg.ApplyMigrations` now wraps the schema-create + goose run in a bounded retry (`retryOnTransient`, 5 attempts / 250 ms backoff) that fires *only* on transient connection errors (`driver.ErrBadConn` plus the connection-failure strings Postgres drivers surface). Both steps are idempotent (`CREATE SCHEMA IF NOT EXISTS` + goose version tracking), so a retry after a dropped connection resumes cleanly. Deterministic SQL errors (syntax, constraint) still fail fast. `ApplyMigrations` is shared by all nine e2e harnesses and by service startup (`cmd/backend`), so this also hardens production migration-on-startup against a momentarily-unready DB — additive, with no effect on the success path. New deterministic unit tests cover the classifier and the retry behaviour (no DB needed); the real-container migration path and the formerly-flaky diplomail test were verified locally.
developer added 1 commit 2026-05-30 12:46:18 +00:00
fix(backend): retry migrations on transient connection errors
Tests · Go / test (push) Successful in 2m1s
Tests · Go / test (pull_request) Successful in 3m0s
Tests · Integration / integration (pull_request) Successful in 1m42s
06a2e631c9
Backend e2e tests (and, more rarely, service startup) intermittently
failed applying migrations with `driver: bad connection`: a freshly
started Postgres — notably a test container — can reset a pooled
connection moments after it reports ready, killing the migration
transaction. The harness already waits for the double "ready" log and
pings before migrating, yet goose can still draw a connection postgres
then resets.

ApplyMigrations now wraps the schema-create + goose run in a bounded
retry that fires only on transient connection errors (driver.ErrBadConn
and the connection-failure messages Postgres drivers surface); both
steps are idempotent, so a retry resumes cleanly. Deterministic SQL
errors still fail fast.

Fixes the intermittent TestDiplomailAsyncFallbackOnUnsupportedPair (and
the eight other testcontainer e2e harnesses that share ApplyMigrations).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
owner approved these changes 2026-05-30 12:51:06 +00:00
developer merged commit 40d6ba6ba4 into development 2026-05-30 12:51:40 +00:00
developer deleted branch feature/pg-migration-transient-retry 2026-05-30 12:51:40 +00:00
Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: developer/galaxy-game#74