# PostgreSQL Migration

PG_PLAN.md §5 migrated `galaxy/notification` from a Redis-only durable store to the steady-state split codified in `ARCHITECTURE.md §Persistence Backends`: PostgreSQL is the source of truth for table-shaped notification state, and Redis keeps only the inbound `notification:intents` stream, the two outbound streams (`gateway:client-events`, `mail:delivery_commands`), the persisted consumer offset, and the short-lived per-route exclusivity lease. This document records the schema decisions and the non-obvious agreements behind them. Use it together with the migration script (`internal/adapters/postgres/migrations/00001_init.sql`) and the runtime wiring (`internal/app/runtime.go`).

## Outcomes

- Schema `notification` (provisioned externally) holds the durable state: `records`, `routes`, `dead_letters`, `malformed_intents`.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`, applies embedded goose migrations strictly before any HTTP listener becomes ready, and exits non-zero when migration or ping fails.
- The runtime opens one shared `*redis.Client` via `pkg/redisconn.NewMasterClient` and passes it to the intent consumer, the publishers (outbound XADDs), the route lease store, and the persisted stream offset store.
- The Redis adapter package (`internal/adapters/redisstate/`) is reduced to the surviving `LeaseStore`, `StreamOffsetStore`, and a slim `Keyspace` exposing only `RouteLease(notificationID, routeID)`, `StreamOffset(stream)`, and `Intents()`. The Lua-backed atomic writer, the route-state mutation scripts, the records/routes/idempotency/dead-letters/malformed-intents keyspace, and the per-record TTL constants are gone.
- Configuration drops `NOTIFICATION_REDIS_USERNAME` / `NOTIFICATION_REDIS_TLS_ENABLED` / `NOTIFICATION_REDIS_ADDR` and introduces `NOTIFICATION_REDIS_MASTER_ADDR` / `NOTIFICATION_REDIS_REPLICA_ADDRS` plus `NOTIFICATION_POSTGRES_*`. The retention knobs `NOTIFICATION_RECORD_TTL` / `NOTIFICATION_DEAD_LETTER_TTL` are renamed to `NOTIFICATION_RECORD_RETENTION` / `NOTIFICATION_MALFORMED_INTENT_RETENTION`, and a new `NOTIFICATION_CLEANUP_INTERVAL` drives the periodic SQL retention worker.

## Decisions

### 1. One schema, externally provisioned role

**Decision.** The `notification` schema and the matching `notificationservice` role are created outside the migration sequence (in tests, by `integration/internal/harness/postgres_container.go::EnsureRoleAndSchema`; in production, by an ops init script not in scope for this stage). The embedded migration `00001_init.sql` contains only DDL for tables and indexes and assumes it runs as the schema owner with `search_path=notification`.

**Why.** Mixing role creation, schema creation, and table DDL into one script forces every consumer of the migration to run as a superuser. The schema-per-service architectural rule (`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the operational split: ops provisions roles and schemas, the service applies schema-scoped migrations.

### 2. Idempotency record IS the records row

**Decision.** The `records` table carries `producer`, `idempotency_key`, `request_fingerprint`, and `idempotency_expires_at` columns and a `UNIQUE (producer, idempotency_key)` constraint. Acceptance flows insert the row directly; a duplicate request races on the UNIQUE constraint and surfaces as `acceptintent.ErrConflict`. There is no separate idempotency table.

**Why.** PG_PLAN.md §3 fixed this rule for every PG-backed service. With the reservation living on the durable record, recovery is a single fact — the row either exists or it does not — so no Redis-loss window can let a duplicate sneak through. The `records.accepted_at` value doubles as the `IdempotencyRecord.CreatedAt` returned to the service layer.
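For concreteness, a minimal sketch of the acceptance insert under this rule. `insertRecord`, the trimmed `Record` struct, and the local `ErrConflict` are illustrative stand-ins; the real adapter (`notificationstore/acceptance.go`) builds the statement through the jet DSL of Decision 10 and binds the full record.

```go
// Sketch only: insertRecord, the Record subset, and the local ErrConflict
// stand in for the real acceptance path, which builds this statement with
// the jet DSL and surfaces the violation as acceptintent.ErrConflict.
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

type Record struct {
	NotificationID, Producer, IdempotencyKey, RequestFingerprint string
	IdempotencyExpiresAt, AcceptedAt                             time.Time
}

var ErrConflict = errors.New("accept intent: duplicate (producer, idempotency_key)")

func insertRecord(ctx context.Context, db *sql.DB, r Record) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO notification.records
		       (notification_id, producer, idempotency_key,
		        request_fingerprint, idempotency_expires_at, accepted_at)
		VALUES ($1, $2, $3, $4, $5, $6)`,
		r.NotificationID, r.Producer, r.IdempotencyKey,
		// UTC at the binding site, per Decision 4.
		r.RequestFingerprint, r.IdempotencyExpiresAt.UTC(), r.AcceptedAt.UTC())

	// The UNIQUE (producer, idempotency_key) constraint is the idempotency
	// check itself; PostgreSQL reports the duplicate as SQLSTATE 23505.
	var pgErr *pgconn.PgError
	if errors.As(err, &pgErr) && pgErr.Code == "23505" {
		return ErrConflict
	}
	return err
}
```

Because the reservation and the record are the same row, a crashed acceptance either committed the row (the retry observes the conflict) or committed nothing (the retry succeeds); there is no partial state to repair.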
### 3. `recipient_user_ids` as JSONB

**Decision.** `records.recipient_user_ids` stores the normalized recipient user-id list as a JSONB column. The codec round-trips a nil slice as `[]` to keep the column NOT NULL while letting the read path return a nil slice when the audience is not user-targeted.

**Why.** The list is opaque to queries (we never element-filter on it). JSONB lines up with the "everything outside primary fields is JSON" pattern Mail Stage 4 already established; PostgreSQL will accept a future GIN index on `recipient_user_ids jsonb_path_ops` if a recipient-filtered operator UI ever lands. `text[]` would have forced a `pgtype.Array[string]` boundary type and a different scan path with no functional benefit today.

### 4. Timestamps are uniformly `timestamptz` and always UTC at the boundary

**Decision.** Every time-valued column on every Stage 5 table uses PostgreSQL's `timestamptz`. The domain model continues to use `time.Time`; the adapter normalises every `time.Time` parameter to UTC at the binding site (`record.X.UTC()` or the `nullableTime` helper that wraps a possibly zero-valued `time.Time`), and re-wraps every scanned `time.Time` with `.UTC()` (directly or via `timeFromNullable` for nullable columns) before it leaves the adapter. The architecture-wide form of this rule lives in `ARCHITECTURE.md §Persistence Backends → Timestamp handling`.

**Why.** PG_PLAN.md §5 originally specified `_ms` epoch-millisecond columns. User Service Stage 3 and Mail Service Stage 4 already use `timestamptz` for every table, and the runtime contract tests expect Go-level `time.Time` semantics throughout. Keeping the same shape across services reduces adapter-layer complexity and avoids two parallel encoding paths in the notificationstore. The deviation from the literal plan is intentional and is documented here. The defensive `.UTC()` rule on both sides eliminates the class of bug where the pgx driver returns scanned values in `time.Local`, which silently breaks equality tests, JSON formatting, and comparison against pointer fields.

### 5. Scheduler claim is non-locking; transitions use optimistic concurrency on `updated_at`

**Decision.** `ListDueRoutes(ctx, now, limit)` is a non-locking `SELECT notification_id, route_id FROM routes WHERE next_attempt_at IS NOT NULL AND next_attempt_at <= $1 ORDER BY next_attempt_at ASC LIMIT $2`. The publisher then takes a Redis lease (`route_leases:*`), reads the route, emits the outbound stream entry, and calls one of `CompleteRoutePublished` / `CompleteRouteFailed` / `CompleteRouteDeadLetter`. Each `Complete*` transaction issues `UPDATE routes SET ... WHERE notification_id = $a AND route_id = $b AND updated_at = $expectedUpdatedAt`; a zero `RowsAffected` count surfaces as `routestate.ErrConflict`, which the publisher treats as a no-op (some other replica progressed the row since the worker loaded it).

**Why.** A `FOR UPDATE` held across the publisher's whole publish window would serialise concurrent publishers and block the outbound stream emit. Per-row optimistic concurrency on `updated_at` keeps the lock duration inside the SQL transaction itself; the lease bounds duplicates atop that. The explicit `next_attempt_at` column (set to `NULL` for terminal states) keeps the partial index `routes_due_idx` narrow and avoids the "schedule out of sync with row" failure mode of the previous Redis ZSET + JSON-payload pair.
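A minimal sketch of one such transition, assuming a hypothetical `status` column value and eliding the attempt bookkeeping; `completeRoutePublished` and the local `ErrConflict` stand in for the real jet-built statement and `routestate.ErrConflict`.

```go
// Sketch only: the status value and the completeRoutePublished signature are
// illustrative; the real Complete* transactions also rewrite the attempt
// bookkeeping and build the statement through the jet DSL.
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// Stands in for routestate.ErrConflict.
var ErrConflict = errors.New("route state: concurrent transition")

func completeRoutePublished(ctx context.Context, tx *sql.Tx,
	notificationID, routeID string, expectedUpdatedAt, now time.Time) error {

	res, err := tx.ExecContext(ctx, `
		UPDATE notification.routes
		   SET status          = 'published',
		       next_attempt_at = NULL, -- terminal: the row leaves routes_due_idx
		       updated_at      = $1
		 WHERE notification_id = $2
		   AND route_id        = $3
		   AND updated_at      = $4`, // the optimistic-concurrency guard
		now.UTC(), notificationID, routeID, expectedUpdatedAt.UTC())
	if err != nil {
		return err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		// Another replica progressed the row since this worker loaded it;
		// the publisher treats the conflict as a benign no-op.
		return ErrConflict
	}
	return nil
}
```

The `updated_at` guard makes the lost-update window explicit: the UPDATE lands only if the row still looks exactly as it did when the worker read it.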
### 6. Outbound XADD precedes SQL completion (at-least-once across the dual-system boundary)

**Decision.** The publisher emits the outbound stream entry through `*redis.Client.XAdd` *before* the route's SQL state transition is committed. If the XADD succeeds and the SQL update later fails, the next replica retries — the same notification gets a second outbound entry, and the consumer side (Gateway, Mail) deduplicates on the entry id. If the XADD fails, `recordFailure` records a publication failure with classification `gateway_stream_publish_failed` or `mail_stream_publish_failed` and schedules a retry.

**Why.** PG_PLAN.md §5 explicitly endorses this ordering by saying the lease is "atop the SQL claim" rather than replacing it. The lease bounds duplicate emission to one replica per route per lease window; the consumer-side dedupe handles the rare cross-window case. A transactional outbox would eliminate the duplicate but is out of Stage 5 scope; revisit if duplicate traffic ever becomes an operational concern.

### 7. Lease stays on Redis as a hint

**Decision.** The lease key `notification:route_leases:<notificationID>:<routeID>` keeps its existing SETNX/Lua-release semantics, lifted into a dedicated `redisstate.LeaseStore`. The composite `internal/adapters/postgres/routepublisher.Store` wires the SQL state store and the Redis lease store behind the existing publisher-worker interfaces (`PushRouteStateStore`, `EmailRouteStateStore`).

**Why.** PG_PLAN.md §5 retains the lease as a "short-lived, per-process exclusivity hint atop the SQL claim". Without the lease, two replicas selecting overlapping due batches would each XADD before either commits the SQL transition — duplicating outbound traffic during contention. The lease bounds emission to one-per-route-per-lease-TTL even when scans overlap. Keeping the abstraction inside `LeaseStore` (separate from the SQL store) keeps the architectural split visible.

### 8. Periodic SQL retention replaces Redis EXPIRE

**Decision.** A new `worker.SQLRetentionWorker` runs two DELETE statements driven by config:

- `DELETE FROM records WHERE accepted_at < now() - $record_retention` cascades to `routes` and `dead_letters` via `ON DELETE CASCADE`.
- `DELETE FROM malformed_intents WHERE recorded_at < now() - $malformed_intent_retention` is a standalone retention pass.

Three new env vars (`NOTIFICATION_RECORD_RETENTION`, `NOTIFICATION_MALFORMED_INTENT_RETENTION`, `NOTIFICATION_CLEANUP_INTERVAL`) drive the worker. `NOTIFICATION_IDEMPOTENCY_TTL` survives unchanged: the service layer materialises it on each row as `idempotency_expires_at`.

**Why.** PostgreSQL has no per-key EXPIRE, so the previous per-key Redis TTL semantics translate to a periodic batch DELETE. The two-knob shape mirrors Mail Stage 4 (`MAIL_DELIVERY_RETENTION` + `MAIL_MALFORMED_COMMAND_RETENTION`). The legacy `NOTIFICATION_RECORD_TTL` / `NOTIFICATION_DEAD_LETTER_TTL` env vars are intentionally retired without a backward-compat shim — keeping the names would mislead operators reading the runbook because the eviction mechanism genuinely changed.

### 9. Shared Redis client with consumer-driven shutdown

**Decision.** `internal/app/runtime.go` constructs one `redisconn.NewMasterClient(cfg.Redis.Conn)` (via the thin `redisadapter.NewClient` wrapper) and passes it to the intent consumer, the lease store, the stream offset store, and both publishers (for their outbound XADDs). The runtime cleanup tolerates `redis.ErrClosed` so a double-close from any consumer is benign.

**Why.** Each subsequent PG_PLAN stage (Lobby) ships the same pattern; sharing one client is the shape we want all stages to converge on. A dedicated client per consumer is an artefact of the Redis-only architecture: keeping several clients would multiply TCP connections, ping points, and OpenTelemetry instrumentation hooks for no functional benefit.

### 10. Query layer is `go-jet/jet/v2`

**Decision.** All `notificationstore` packages build SQL through the jet builder API (`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the `pg.AND/OR/SET/MIN/COUNT/...` DSL). `cmd/jetgen` (invoked via `make jet`) brings up a transient PostgreSQL container, applies the embedded migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB` against the provisioned schema; the generated table/model code lives under `internal/adapters/postgres/jet/notification/{model,table}/*.go` and is committed to the repo, so build consumers do not need Docker. Statements are executed through the `database/sql` API (`stmt.Sql()` → `db/tx.Exec/Query/QueryRow`); manual `rowScanner` helpers preserve the `codecs.go` boundary translations and domain-type mapping.

**Why.** Aligns with `PG_PLAN.md §Library stack` ("Query layer: `github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated code lives under each service `internal/adapters/postgres/jet/`, regenerated via a `make jet` target and committed to the repo"). Constructs the builder does not cover first-class (`MIN(timestamptz)` aggregates, the optimistic-concurrency `WHERE updated_at = $expected` guard, JSONB params) are expressed through the DSL helpers (`pg.MIN(...)`, `pg.TimestampzT(...)`, direct `[]byte`/string params for JSONB columns).
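For concreteness, a minimal sketch of the jet-to-`database/sql` round trip, using the Decision 5 due-scan as the example. The module path and the `pgtable` identifiers are stand-ins for the committed generated code; the real query lives in `notificationstore/scheduler.go`.

```go
// Sketch only: the import path and the generated pgtable identifiers are
// stand-ins for the committed code under
// internal/adapters/postgres/jet/notification/table.
package sketch

import (
	"context"
	"database/sql"
	"time"

	pg "github.com/go-jet/jet/v2/postgres"

	pgtable "example.invalid/notification/internal/adapters/postgres/jet/notification/table"
)

type dueRoute struct {
	NotificationID string
	RouteID        string
}

func listDueRoutes(ctx context.Context, db *sql.DB, now time.Time, limit int64) ([]dueRoute, error) {
	// The non-locking due-scan from Decision 5, built with the jet DSL.
	stmt := pg.SELECT(pgtable.Routes.NotificationID, pgtable.Routes.RouteID).
		FROM(pgtable.Routes).
		WHERE(pgtable.Routes.NextAttemptAt.IS_NOT_NULL().
			AND(pgtable.Routes.NextAttemptAt.LT_EQ(pg.TimestampzT(now.UTC())))).
		ORDER_BY(pgtable.Routes.NextAttemptAt.ASC()).
		LIMIT(limit)

	// stmt.Sql() yields the parameterised SQL plus its bind arguments;
	// execution stays on the plain database/sql API.
	query, args := stmt.Sql()
	rows, err := db.QueryContext(ctx, query, args...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	// Manual scan loop in place of the adapter's rowScanner helpers.
	var due []dueRoute
	for rows.Next() {
		var d dueRoute
		if err := rows.Scan(&d.NotificationID, &d.RouteID); err != nil {
			return nil, err
		}
		due = append(due, d)
	}
	return due, rows.Err()
}
```

Executing the rendered SQL through `database/sql` rather than jet's own query runner keeps one execution path for every statement and leaves the codec boundary in the manual scanners.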
## Cross-References

- `PG_PLAN.md §5` (Stage 5 — Notification Service migration).
- `ARCHITECTURE.md §Persistence Backends`.
- `internal/adapters/postgres/migrations/00001_init.sql` and `internal/adapters/postgres/migrations/migrations.go`.
- `internal/adapters/postgres/notificationstore/{store,records,routes,acceptance,scheduler,dead_letters,malformed_intents,retention,codecs,helpers}.go` plus the testcontainers-backed unit suite under `notificationstore/{harness,store}_test.go`.
- `internal/adapters/postgres/jet/notification/{model,table}/*.go` (committed generated code) plus `cmd/jetgen/main.go` and the `make jet` Makefile target that regenerate it.
- `internal/adapters/postgres/routepublisher/store.go` (composite PG state + Redis lease behind the publisher contracts).
- `internal/service/routestate/types.go` (storage-agnostic value types).
- `internal/config/{config,env}.go` (`PostgresConfig` plus the `redisconn.Config`-shaped `RedisConfig` envelope).
- `internal/app/runtime.go` (shared Redis client + PG pool open + migration + notificationstore wiring + retention worker startup).
- `internal/worker/sqlretention.go` (periodic SQL retention worker).
- `internal/adapters/redisstate/{keyspace,codecs,errors,lease_store,stream_offset_store}.go` (surviving slim Redis surface).
- `integration/internal/harness/notificationservice.go` (per-suite Postgres container + `notification`/`notificationservice` provisioning).