# PostgreSQL Migration

PG_PLAN.md §5 migrated `galaxy/notification` from a Redis-only durable store to the steady-state split codified in `ARCHITECTURE.md §Persistence Backends`: PostgreSQL is the source of truth for table-shaped notification state, and Redis keeps only the inbound `notification:intents` stream, the two outbound streams (`gateway:client-events`, `mail:delivery_commands`), the persisted consumer offset, and the short-lived per-route exclusivity lease. This document records the schema decisions and the non-obvious agreements behind them. Use it together with the migration script (`internal/adapters/postgres/migrations/00001_init.sql`) and the runtime wiring (`internal/app/runtime.go`).

## Outcomes

- Schema `notification` (provisioned externally) holds the durable state: `records`, `routes`, `dead_letters`, `malformed_intents`.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`, applies embedded goose migrations strictly before any HTTP listener becomes ready, and exits non-zero when migration or ping fails.
- The runtime opens one shared `*redis.Client` via `pkg/redisconn.NewMasterClient` and passes it to the intent consumer, the publishers (outbound XADDs), the route lease store, and the persisted stream offset store.
- The Redis adapter package (`internal/adapters/redisstate/`) is reduced to the surviving `LeaseStore`, `StreamOffsetStore`, and a slim `Keyspace` exposing only `RouteLease(notificationID, routeID)`, `StreamOffset(stream)`, and `Intents()`. The Lua-backed atomic writer, the route-state mutation scripts, the records/routes/idempotency/dead-letters/malformed-intents keyspace, and the per-record TTL constants are gone.
- Configuration drops `NOTIFICATION_REDIS_USERNAME` / `NOTIFICATION_REDIS_TLS_ENABLED` / `NOTIFICATION_REDIS_ADDR` and introduces `NOTIFICATION_REDIS_MASTER_ADDR` / `NOTIFICATION_REDIS_REPLICA_ADDRS` plus `NOTIFICATION_POSTGRES_*`. The retention knobs `NOTIFICATION_RECORD_TTL` / `NOTIFICATION_DEAD_LETTER_TTL` are renamed to `NOTIFICATION_RECORD_RETENTION` / `NOTIFICATION_MALFORMED_INTENT_RETENTION`, and a new `NOTIFICATION_CLEANUP_INTERVAL` drives the periodic SQL retention worker.

## Decisions

### 1. One schema, externally provisioned role

**Decision.** The `notification` schema and the matching `notificationservice` role are created outside the migration sequence (in tests, by `integration/internal/harness/postgres_container.go::EnsureRoleAndSchema`; in production, by an ops init script not in scope for this stage). The embedded migration `00001_init.sql` contains only DDL for tables and indexes and assumes it runs as the schema owner with `search_path=notification`.

**Why.** Mixing role creation, schema creation, and table DDL into one script forces every consumer of the migration to run as a superuser. The schema-per-service architectural rule (`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the operational split: ops provisions roles and schemas, the service applies schema-scoped migrations.

### 2. Idempotency record IS the records row

**Decision.** The `records` table carries `producer`, `idempotency_key`, `request_fingerprint`, and `idempotency_expires_at` columns and a `UNIQUE (producer, idempotency_key)` constraint. Acceptance flows insert the row directly; a duplicate request races on the UNIQUE constraint and surfaces as `acceptintent.ErrConflict`. There is no separate idempotency table.

**Why.** PG_PLAN.md §3 fixed this rule for every PG-backed service. With the reservation living on the durable record, recovery is a single fact — the row either exists or it does not — so no Redis-loss window can let a duplicate sneak through. The `records.accepted_at` value doubles as the `IdempotencyRecord.CreatedAt` returned to the service layer.
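For concreteness, a minimal sketch of the acceptance insert under this rule. `insertRecord`, the trimmed `Record` struct, and the local `ErrConflict` are illustrative stand-ins; the real adapter (`notificationstore/acceptance.go`) builds the statement through the jet DSL of Decision 10 and binds the full record.

```go
// Sketch only: insertRecord, the Record subset, and the local ErrConflict
// stand in for the real acceptance path, which builds this statement with
// the jet DSL and surfaces the violation as acceptintent.ErrConflict.
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

type Record struct {
	NotificationID, Producer, IdempotencyKey, RequestFingerprint string
	IdempotencyExpiresAt, AcceptedAt                             time.Time
}

var ErrConflict = errors.New("accept intent: duplicate (producer, idempotency_key)")

func insertRecord(ctx context.Context, db *sql.DB, r Record) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO notification.records
		       (notification_id, producer, idempotency_key,
		        request_fingerprint, idempotency_expires_at, accepted_at)
		VALUES ($1, $2, $3, $4, $5, $6)`,
		r.NotificationID, r.Producer, r.IdempotencyKey,
		// UTC at the binding site, per Decision 4.
		r.RequestFingerprint, r.IdempotencyExpiresAt.UTC(), r.AcceptedAt.UTC())

	// The UNIQUE (producer, idempotency_key) constraint is the idempotency
	// check itself; PostgreSQL reports the duplicate as SQLSTATE 23505.
	var pgErr *pgconn.PgError
	if errors.As(err, &pgErr) && pgErr.Code == "23505" {
		return ErrConflict
	}
	return err
}
```

Because the reservation and the record are the same row, a crashed acceptance either committed the row (the retry observes the conflict) or committed nothing (the retry succeeds); there is no partial state to repair.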
### 3. `recipient_user_ids` as JSONB

**Decision.** `records.recipient_user_ids` stores the normalized recipient user-id list as a JSONB column. The codec round-trips a nil slice as `[]` to keep the column NOT NULL while letting the read path return a nil slice when the audience is not user-targeted.

**Why.** The list is opaque to queries (we never element-filter on it). JSONB lines up with the "everything outside primary fields is JSON" pattern Mail Stage 4 already established; PostgreSQL will accept a future GIN index on `recipient_user_ids jsonb_path_ops` if a recipient-filtered operator UI ever lands. `text[]` would have forced a `pgtype.Array[string]` boundary type and a different scan path with no functional benefit today.

### 4. Timestamps are uniformly `timestamptz` and always UTC at the boundary

**Decision.** Every time-valued column on every Stage 5 table uses PostgreSQL's `timestamptz`. The domain model continues to use `time.Time`; the adapter normalises every `time.Time` parameter to UTC at the binding site (`record.X.UTC()` or the `nullableTime` helper that wraps a possibly zero-valued `time.Time`), and re-wraps every scanned `time.Time` with `.UTC()` (directly or via `timeFromNullable` for nullable columns) before it leaves the adapter. The architecture-wide form of this rule lives in `ARCHITECTURE.md §Persistence Backends → Timestamp handling`.

**Why.** PG_PLAN.md §5 originally specified `_ms` epoch-millisecond columns. User Service Stage 3 and Mail Service Stage 4 already use `timestamptz` for every table, and the runtime contract tests expect Go-level `time.Time` semantics throughout. Keeping the same shape across services reduces adapter-layer complexity and avoids two parallel encoding paths in the notificationstore. The deviation from the literal plan is intentional and is documented here. The defensive `.UTC()` rule on both sides eliminates the class of bug where the pgx driver returns scanned values in `time.Local`, which silently breaks equality tests, JSON formatting, and comparison against pointer fields.

### 5. Scheduler claim is non-locking; transitions use optimistic concurrency on `updated_at`

**Decision.** `ListDueRoutes(ctx, now, limit)` is a non-locking `SELECT notification_id, route_id FROM routes WHERE next_attempt_at IS NOT NULL AND next_attempt_at <= $1 ORDER BY next_attempt_at ASC LIMIT $2`. The publisher then takes a Redis lease (`route_leases:*`), reads the route, emits the outbound stream entry, and calls one of `CompleteRoutePublished` / `CompleteRouteFailed` / `CompleteRouteDeadLetter`. Each `Complete*` transaction issues `UPDATE routes SET ... WHERE notification_id = $a AND route_id = $b AND updated_at = $expectedUpdatedAt`; a zero `RowsAffected` count surfaces as `routestate.ErrConflict`, which the publisher treats as a no-op (some other replica progressed the row since the worker loaded it).

**Why.** A `FOR UPDATE` held across the publisher's whole publish window would serialise concurrent publishers and block the outbound stream emit. Per-row optimistic concurrency on `updated_at` keeps the lock duration inside the SQL transaction itself; the lease bounds duplicates atop that. The explicit `next_attempt_at` column (set to `NULL` for terminal states) keeps the partial index `routes_due_idx` narrow and avoids the "schedule out of sync with row" failure mode of the previous Redis ZSET + JSON-payload pair.
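A minimal sketch of one such transition, assuming a hypothetical `status` column value and eliding the attempt bookkeeping; `completeRoutePublished` and the local `ErrConflict` stand in for the real jet-built statement and `routestate.ErrConflict`.

```go
// Sketch only: the status value and the completeRoutePublished signature are
// illustrative; the real Complete* transactions also rewrite the attempt
// bookkeeping and build the statement through the jet DSL.
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// Stands in for routestate.ErrConflict.
var ErrConflict = errors.New("route state: concurrent transition")

func completeRoutePublished(ctx context.Context, tx *sql.Tx,
	notificationID, routeID string, expectedUpdatedAt, now time.Time) error {

	res, err := tx.ExecContext(ctx, `
		UPDATE notification.routes
		   SET status          = 'published',
		       next_attempt_at = NULL, -- terminal: the row leaves routes_due_idx
		       updated_at      = $1
		 WHERE notification_id = $2
		   AND route_id        = $3
		   AND updated_at      = $4`, // the optimistic-concurrency guard
		now.UTC(), notificationID, routeID, expectedUpdatedAt.UTC())
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		// Another replica progressed the row since this worker loaded it;
		// the publisher treats the conflict as a benign no-op.
		return ErrConflict
	}
	return nil
}
```

The `updated_at` guard makes the lost-update window explicit: the UPDATE lands only if the row still looks exactly as it did when the worker read it.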
### 6. Outbound XADD precedes SQL completion (at-least-once across the dual-system boundary)

**Decision.** The publisher emits the outbound stream entry through `*redis.Client.XAdd` *before* the route's SQL state transition is committed. If the XADD succeeds and the SQL update later fails, the next replica retries — the same notification gets a second outbound entry, and the consumer side (Gateway, Mail) deduplicates on the entry id. If the XADD fails, `recordFailure` records a publication failure with classification `gateway_stream_publish_failed` or `mail_stream_publish_failed` and schedules a retry.

**Why.** PG_PLAN.md §5 explicitly endorses this ordering by saying the lease is "atop the SQL claim" rather than replacing it. The lease bounds duplicate emission to one replica per route per lease window; the consumer-side dedupe handles the rare cross-window case. A transactional outbox would eliminate the duplicate but is out of Stage 5 scope; revisit if duplicate traffic ever becomes an operational concern.

### 7. Lease stays on Redis as a hint

**Decision.** The lease key `notification:route_leases:<notificationID>:<routeID>` keeps its existing SETNX/Lua-release semantics, lifted into a dedicated `redisstate.LeaseStore`. The composite `internal/adapters/postgres/routepublisher.Store` wires the SQL state store and the Redis lease store behind the existing publisher-worker interfaces (`PushRouteStateStore`, `EmailRouteStateStore`).

**Why.** PG_PLAN.md §5 retains the lease as a "short-lived, per-process exclusivity hint atop the SQL claim". Without the lease, two replicas selecting overlapping due batches would each XADD before either commits the SQL transition — duplicating outbound traffic during contention. The lease bounds emission to one-per-route-per-lease-TTL even when scans overlap. Keeping the abstraction inside `LeaseStore` (separate from the SQL store) keeps the architectural split visible.

### 8. Periodic SQL retention replaces Redis EXPIRE

**Decision.** A new `worker.SQLRetentionWorker` runs two DELETE statements driven by config:

- `DELETE FROM records WHERE accepted_at < now() - $record_retention` cascades to `routes` and `dead_letters` via `ON DELETE CASCADE`.
- `DELETE FROM malformed_intents WHERE recorded_at < now() - $malformed_intent_retention` is a standalone retention pass.

Three new env vars (`NOTIFICATION_RECORD_RETENTION`, `NOTIFICATION_MALFORMED_INTENT_RETENTION`, `NOTIFICATION_CLEANUP_INTERVAL`) drive the worker. `NOTIFICATION_IDEMPOTENCY_TTL` survives unchanged: the service layer materialises it on each row as `idempotency_expires_at`.

**Why.** PostgreSQL has no per-key EXPIRE, so the previous per-key Redis TTL semantics translate to a periodic batch DELETE. The two-knob shape mirrors Mail Stage 4 (`MAIL_DELIVERY_RETENTION` + `MAIL_MALFORMED_COMMAND_RETENTION`). The legacy `NOTIFICATION_RECORD_TTL` / `NOTIFICATION_DEAD_LETTER_TTL` env vars are intentionally retired without a backward-compat shim — keeping the names would mislead operators reading the runbook because the eviction mechanism genuinely changed.

### 9. Shared Redis client with consumer-driven shutdown

**Decision.** `internal/app/runtime.go` constructs one `redisconn.NewMasterClient(cfg.Redis.Conn)` (via the thin `redisadapter.NewClient` wrapper) and passes it to the intent consumer, the lease store, the stream offset store, and both publishers (for their outbound XADDs). The runtime cleanup tolerates `redis.ErrClosed` so a double-close from any consumer is benign.

**Why.** Each subsequent PG_PLAN stage (Lobby) ships the same pattern; sharing one client is the shape we want all stages to converge on. A dedicated client per consumer is an artefact of the Redis-only architecture: keeping several clients would multiply TCP connections, ping points, and OpenTelemetry instrumentation hooks for no functional benefit.

### 10. Query layer is `go-jet/jet/v2`

**Decision.** All `notificationstore` packages build SQL through the jet builder API (`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the `pg.AND/OR/SET/MIN/COUNT/...` DSL). `cmd/jetgen` (invoked via `make jet`) brings up a transient PostgreSQL container, applies the embedded migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB` against the provisioned schema; the generated table/model code lives under `internal/adapters/postgres/jet/notification/{model,table}/*.go` and is committed to the repo, so build consumers do not need Docker. Statements are executed through the `database/sql` API (`stmt.Sql()` → `db/tx.Exec/Query/QueryRow`); manual `rowScanner` helpers preserve the `codecs.go` boundary translations and domain-type mapping.

**Why.** Aligns with `PG_PLAN.md §Library stack` ("Query layer: `github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated code lives under each service `internal/adapters/postgres/jet/`, regenerated via a `make jet` target and committed to the repo"). Constructs the builder does not cover first-class (`MIN(timestamptz)` aggregates, the optimistic-concurrency `WHERE updated_at = $expected` guard, JSONB params) are expressed through the DSL helpers (`pg.MIN(...)`, `pg.TimestampzT(...)`, direct `[]byte`/string params for JSONB columns).
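For concreteness, a minimal sketch of the jet-to-`database/sql` round trip, using the Decision 5 due-scan as the example. The module path and the `pgtable` identifiers are stand-ins for the committed generated code; the real query lives in `notificationstore/scheduler.go`.

```go
// Sketch only: the import path and the generated pgtable identifiers are
// stand-ins for the committed code under
// internal/adapters/postgres/jet/notification/table.
package sketch

import (
	"context"
	"database/sql"
	"time"

	pg "github.com/go-jet/jet/v2/postgres"

	pgtable "example.invalid/notification/internal/adapters/postgres/jet/notification/table"
)

type dueRoute struct {
	NotificationID string
	RouteID        string
}

func listDueRoutes(ctx context.Context, db *sql.DB, now time.Time, limit int64) ([]dueRoute, error) {
	// The non-locking due-scan from Decision 5, built with the jet DSL.
	stmt := pg.SELECT(pgtable.Routes.NotificationID, pgtable.Routes.RouteID).
		FROM(pgtable.Routes).
		WHERE(pgtable.Routes.NextAttemptAt.IS_NOT_NULL().
			AND(pgtable.Routes.NextAttemptAt.LT_EQ(pg.TimestampzT(now.UTC())))).
		ORDER_BY(pgtable.Routes.NextAttemptAt.ASC()).
		LIMIT(limit)

	// stmt.Sql() yields the parameterised SQL plus its bind arguments;
	// execution stays on the plain database/sql API.
	query, args := stmt.Sql()
	rows, err := db.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// Manual scan loop in place of the adapter's rowScanner helpers.
	var due []dueRoute
	for rows.Next() {
		var d dueRoute
		if err := rows.Scan(&d.NotificationID, &d.RouteID); err != nil {
			return nil, err
		}
		due = append(due, d)
	}
	return due, rows.Err()
}
```

Executing the rendered SQL through `database/sql` rather than jet's own query runner keeps one execution path for every statement and leaves the codec boundary in the manual scanners.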
## Cross-References

- `PG_PLAN.md §5` (Stage 5 — Notification Service migration).
- `ARCHITECTURE.md §Persistence Backends`.
- `internal/adapters/postgres/migrations/00001_init.sql` and `internal/adapters/postgres/migrations/migrations.go`.
- `internal/adapters/postgres/notificationstore/{store,records,routes,acceptance,scheduler,dead_letters,malformed_intents,retention,codecs,helpers}.go` plus the testcontainers-backed unit suite under `notificationstore/{harness,store}_test.go`.
- `internal/adapters/postgres/jet/notification/{model,table}/*.go` (committed generated code) plus `cmd/jetgen/main.go` and the `make jet` Makefile target that regenerate it.
- `internal/adapters/postgres/routepublisher/store.go` (composite PG state + Redis lease behind the publisher contracts).
- `internal/service/routestate/types.go` (storage-agnostic value types).
- `internal/config/{config,env}.go` (`PostgresConfig` plus the `redisconn.Config`-shaped `RedisConfig` envelope).
- `internal/app/runtime.go` (shared Redis client + PG pool open + migration + notificationstore wiring + retention worker startup).
- `internal/worker/sqlretention.go` (periodic SQL retention worker).
- `internal/adapters/redisstate/{keyspace,codecs,errors,lease_store,stream_offset_store}.go` (surviving slim Redis surface).
- `integration/internal/harness/notificationservice.go` (per-suite Postgres container + `notification`/`notificationservice` provisioning).