feat: use postgres

Ilia Denisov
2026-04-26 20:34:39 +02:00
committed by GitHub
parent 48b0056b49
commit fe829285a6
365 changed files with 29223 additions and 24049 deletions
@@ -0,0 +1,920 @@
# PostgreSQL Migration Plan
This plan has already been implemented and is kept here for historical reasons.
It should NOT be treated as the source of truth for service functionality.
## Context
The Galaxy Game project currently uses Redis as the only persistence backend
across all implemented services (`user`, `mail`, `notification`, `lobby`,
`gateway`, `authsession`). Redis serves both kinds of state: ephemeral and
runtime-coordination state (where it shines — Streams, caches, replay keys,
runtime queues, session caches, leases) and table-shaped business state where
it is a poor fit (durable user accounts, entitlements/sanctions, mail audit
records, notification routes/idempotency, lobby memberships and invites).
Replication and standby for Redis are not configured anywhere. There is no
SQL/migration tooling in the repo at all.
We migrate to a Redis + PostgreSQL split where each backend owns the data it
serves best. PostgreSQL becomes the source of truth for table-shaped business
state, gives us ACID transactions, mature physical/logical replication, and
backup/restore via `pg_dump` and WAL archiving. Redis remains the source of
truth for streams, pub/sub, caches, leases, replay keys, rate limits, session
caches, runtime queues, and stream consumer offsets.
The plan migrates only services already implemented and explicitly excludes
`galaxy/game`. It targets steady-state architecture rules first (one
authoritative document, `ARCHITECTURE.md`), then walks each service end to end
— code, tests, service-local README/docs, and integration suites — so that no
intermediate commit leaves docs and code in conflict.
## Confirmed decisions (with project owner)
1. **Documentation strategy**: `ARCHITECTURE.md` is updated as the very first
stage with the architecture-wide rules. Each per-service README and
per-service `docs/` directory changes inside that service's own stage, paired
with code and tests. This keeps `ARCHITECTURE.md` ≡ policy, README ≡ current
state, and ensures any commit can be checked out without code/doc divergence.
2. **Service scope**: full migration of durable storage to PostgreSQL for
`user`, `mail`, `notification`, `lobby`. Only Redis configuration refactor
(master/replica + mandatory password, drop `TLS_ENABLED` / `USERNAME`) for
`gateway` and `authsession`; these services intentionally stay Redis-only.
`geoprofile` has no implementation; its `PLAN.md` and `README.md`
absorb the new persistence rules so future implementation follows them.
3. **Idempotency and retry-schedule placement**: idempotency records and
retry schedule queues live in PostgreSQL on the same table as the durable
record they protect (`(producer, idempotency_key)` UNIQUE on `records`,
`next_attempt_at` column on `deliveries` / `routes`). One source of truth,
no dual-write hazard between PG and Redis ZSETs.
4. **Stack**: `github.com/jackc/pgx/v5` driver, exposed as `*sql.DB` via
`github.com/jackc/pgx/v5/stdlib`. `github.com/go-jet/jet/v2` for
type-safe query building + code generation, generated against a
testcontainers PostgreSQL instance with migrations applied (Makefile
target per service). `github.com/pressly/goose/v3` library API for
embedded migrations applied at service startup; the `goose` CLI may be
used for local development and rollback investigations but is not in the
service binary path.
5. **Code**: all PostgreSQL queries must use pre-generated `jet` code and the
appropriate builders rather than raw SQL, unless `go-jet` lacks the
functionality needed to implement the business scenario.
## Architectural rules (target steady-state)
These rules land in `ARCHITECTURE.md` in Stage 0 and govern every subsequent
service stage.
### Backend assignment
PostgreSQL is the source of truth for:
- Domain entities with table-shaped business state (`accounts`,
`entitlement_records`, `sanction_records`, `limit_records`,
`blocked_emails`, `deliveries`, `attempts`, `dead_letters`,
`malformed_commands`, `notification_records`, `notification_routes`,
`games`, `applications`, `invites`, `memberships`, `race_names`).
- Idempotency records (UNIQUE constraint on the durable table, not a
separate kv).
- Retry scheduling state (`next_attempt_at` column + supporting index on the
durable table).
- Audit history records that must outlive any Redis snapshot.
Redis is the source of truth for:
- Redis Streams used as the event bus (`user:domain_events`,
`user:lifecycle_events`, `gm:lobby_events`, `runtime:job_results`,
`notification:intents`, `gateway:client-events`, `mail:delivery_commands`).
- Stream consumer offsets (small runtime coordination state, rebuildable).
- Caches and projections (gateway session cache).
- Replay reservation keys.
- Rate limit counters.
- Runtime coordination locks/leases (e.g. notification `route_leases`).
- Authentication challenge state and active session tokens (TTL-bounded; loss
is recoverable by re-authentication).
- Ephemeral per-game runtime aggregates that are deleted at game finish
(lobby `game_turn_stats`, `gap_activated_at`, capability evaluation
marker).
### Database topology
- Single PostgreSQL database `galaxy`.
- Schema-per-service: `user`, `mail`, `notification`, `lobby`. Reserved for
later: `geoprofile`. Not allocated unless needed: `gateway`, `authsession`.
- Per-service PostgreSQL role with grants restricted to its own schema
(defense-in-depth, simple to express in the initial migration).
- Authentication: username + password only. `sslmode=disable`. No client
certificates, no SCRAM channel binding, no custom auth plugins.
- Each service connects to one primary plus zero-or-more read-only replicas.
In this iteration only the primary is used; the replica pool is wired but
receives no traffic. Future read-routing is non-breaking.
### Redis topology
- Each service connects to one master Redis plus zero-or-more replica Redis
hosts.
- All connections use a mandatory password. `USERNAME`/ACL not used. TLS off.
- In this iteration only the master is used; the replica list is wired but
unused — non-breaking switch later when the app starts routing reads.
- Existing env vars `*_REDIS_TLS_ENABLED`, `*_REDIS_USERNAME` are removed
(hard rename; no backward-compat shim — fresh project, no production
deploys to migrate).
### Library stack
- Driver: `github.com/jackc/pgx/v5` (modern, actively maintained), exposed
to `database/sql` via `github.com/jackc/pgx/v5/stdlib` so go-jet's
`qrm.Queryable` interface is satisfied without changes.
- Query layer: `github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated
code lives under each service's `internal/adapters/postgres/jet/`,
regenerated via a `make jet` target and committed to the repo.
- Migrations: `github.com/pressly/goose/v3` library API; migration files
embedded via `//go:embed *.sql`; applied at startup, before opening any
HTTP/gRPC listener; non-zero exit on failure.
- Test infrastructure: `github.com/testcontainers/testcontainers-go` plus
the `modules/postgres` submodule; the same setup is reused by `make jet`
to host a transient instance for jet codegen.
### Migration discipline
- Forward-only sequence-numbered files: `00001_init.sql`, `00002_*.sql`, …
- Lowercase snake_case names; goose `-- +goose Up` / `-- +goose Down`
markers; statements whose body contains semicolons (e.g. PL/pgSQL function
definitions) are delimited with `-- +goose StatementBegin` /
`-- +goose StatementEnd`.
- Migrations apply at service startup; service exits non-zero on failure.
- Per-service decision record at `galaxy/<service>/docs/postgres-migration.md`
captures schema decisions and any non-trivial deviation from the rules.
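The naming rules above are easy to check mechanically. The following is a minimal sketch, not part of the plan: `checkMigrationNames` is a hypothetical helper, and the "no gaps in the sequence" check is an assumption of this sketch rather than a stated rule.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// migrationName matches names such as 00001_init.sql: a five-digit sequence
// number, an underscore, and a lowercase snake_case description.
var migrationName = regexp.MustCompile(`^(\d{5})_[a-z0-9]+(?:_[a-z0-9]+)*\.sql$`)

// checkMigrationNames returns an error for the first file that breaks the
// naming rule or leaves a gap in the sequence (the gap check is an
// assumption of this sketch).
func checkMigrationNames(names []string) error {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	for i, name := range sorted {
		m := migrationName.FindStringSubmatch(name)
		if m == nil {
			return fmt.Errorf("migration %q: want NNNNN_snake_case.sql", name)
		}
		seq, err := strconv.Atoi(m[1])
		if err != nil {
			return fmt.Errorf("migration %q: %w", name, err)
		}
		if seq != i+1 {
			return fmt.Errorf("migration %q: expected sequence number %05d", name, i+1)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkMigrationNames([]string{"00001_init.sql", "00002_race_names.sql"}))
	fmt.Println(checkMigrationNames([]string{"00001_init.sql", "00003_skipped.sql"}))
	fmt.Println(checkMigrationNames([]string{"00001_Init.sql"}))
}
```

A check like this could run in a unit test next to `migrations.go`, so a misnamed file fails CI before it reaches service startup.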
### Per-service code organisation
```text
galaxy/<service>/
  internal/
    adapters/
      postgres/
        migrations/   # *.sql files + migrations.go (//go:embed)
        jet/          # generated; commit-checked
        <portname>/   # adapter implementations matching internal/ports
    config/
      config.go       # adds Postgres + new Redis schema
  Makefile            # `jet` target: testcontainers + goose + jet
```
### Test patterns
- Per-service unit tests against a real PostgreSQL via
`testcontainers-go`; replace the corresponding miniredis test path where
storage moved to PG.
- Shared port-test suites (e.g. `lobby/internal/ports/racenamedirtest/`)
gain a Postgres harness; they remain backend-agnostic in shape.
- `integration/internal/harness/postgres_container.go` is added; integration
suites that need PG declare it next to their existing Redis container.
- Stub adapters (`*stub/`) are kept where the in-memory port is useful for
tests that don't need a real backend. Redis adapters that previously
implemented these ports are removed (no dead code).
### Configuration env vars (target)
For each service `<S>` ∈ { `USERSERVICE`, `MAIL`, `NOTIFICATION`, `LOBBY`,
`GATEWAY`, `AUTHSESSION` }:
- `<S>_REDIS_MASTER_ADDR` (required)
- `<S>_REDIS_REPLICA_ADDRS` (optional, comma-separated; default empty)
- `<S>_REDIS_PASSWORD` (required)
- `<S>_REDIS_DB` (default 0)
- `<S>_REDIS_OPERATION_TIMEOUT` (default 250ms)
For PG-backed services (`USERSERVICE`, `MAIL`, `NOTIFICATION`, `LOBBY`):
- `<S>_POSTGRES_PRIMARY_DSN` (required;
e.g. `postgres://userservice:secret@postgres:5432/galaxy?search_path=user&sslmode=disable`)
- `<S>_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
- `<S>_POSTGRES_OPERATION_TIMEOUT` (default 1s)
- `<S>_POSTGRES_MAX_OPEN_CONNS` (default 25)
- `<S>_POSTGRES_MAX_IDLE_CONNS` (default 5)
- `<S>_POSTGRES_CONN_MAX_LIFETIME` (default 30m)
DSN sets `search_path=<schema>` so unqualified table references resolve into
the service-owned schema; `sslmode=disable` is set explicitly per the
"no TLS" requirement.
Service-prefix-specific stream/keyspace env vars (`*_REDIS_DOMAIN_EVENTS_STREAM`,
`*_REDIS_LIFECYCLE_EVENTS_STREAM`, `*_REDIS_KEYSPACE_PREFIX`,
`MAIL_REDIS_COMMAND_STREAM`, etc.) keep their current names and semantics —
they describe stream/key shapes, not connection topology.
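The primary DSN shape described above (per-service role, `search_path` set to the service schema, `sslmode=disable`) can be assembled with the standard library. This is a minimal sketch; `buildDSN` and its parameters are hypothetical names, not part of `pkg/postgres`.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN is a hypothetical helper that assembles a per-service DSN.
// It pins search_path to the service schema and disables TLS explicitly,
// matching the conventions listed in this section.
func buildDSN(role, password, hostPort, database, schema string) string {
	q := url.Values{}
	q.Set("search_path", schema)
	q.Set("sslmode", "disable")
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(role, password), // escapes special characters
		Host:     hostPort,
		Path:     "/" + database,
		RawQuery: q.Encode(), // keys are emitted in sorted order
	}
	return u.String()
}

func main() {
	fmt.Println(buildDSN("userservice", "secret", "postgres:5432", "galaxy", "user"))
}
```

Building the string through `net/url` rather than `fmt.Sprintf` keeps passwords with reserved characters from corrupting the DSN.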
---
## Stages
Each stage is independently executable and shippable.
### ~~Stage 0~~ — Architecture-wide rules and PG_PLAN.md materialisation
This stage is implemented.
**Goal**: land the steady-state rules in `ARCHITECTURE.md` and place
`PG_PLAN.md` at the project root so subsequent `/stage-implementation`
invocations have an authoritative reference.
**Actions**:
1. Write the contents of this plan file to `/Users/id/src/go/galaxy/PG_PLAN.md`.
2. Add a new section to `ARCHITECTURE.md` (e.g. `§9 Persistence Backends`)
capturing every rule under the *Architectural rules* heading above:
backend assignment, database/Redis topology, library stack, migration
discipline, code organisation, test patterns, env-var conventions.
3. Add a short *Migration Window* sub-section to `ARCHITECTURE.md` noting
that until all `PG_PLAN.md` stages complete, each service's `README.md`
continues to describe its actual current state — this caveat is removed
in Stage 9.
4. Adjust `ARCHITECTURE.md §8` (publisher rules) so cross-references
distinguish "Redis Stream" (event bus, stays Redis) from "PG-backed
table" (durable record).
**Files (modified / new)**:
- `/Users/id/src/go/galaxy/PG_PLAN.md` — new
- `/Users/id/src/go/galaxy/ARCHITECTURE.md` — modified
**Out of scope**: zero service code, zero per-service README/docs, zero
`go.mod` changes, zero new dependencies in service modules.
**Verification**:
- `git diff --stat` reports two paths only: `PG_PLAN.md`, `ARCHITECTURE.md`.
- `ARCHITECTURE.md` reads coherently end to end, with the new section
cross-referenced from §8 and from any other place that today says
"Redis is the v1 backend".
- Manual: read `PG_PLAN.md` top to bottom, confirm every architectural
decision matches the section in `ARCHITECTURE.md`.
---
### ~~Stage 1~~ — Shared infrastructure packages (`pkg/postgres`, `pkg/redisconn`)
This stage is implemented.
**Goal**: provide one canonical helper each for Postgres and Redis so
per-service stages don't reinvent connection/migration wiring. No service
consumes them yet.
**Files (new)**:
- `pkg/postgres/config.go` — `Config` struct (PrimaryDSN, ReplicaDSNs,
OperationTimeout, MaxOpenConns, MaxIdleConns, ConnMaxLifetime); helper
`LoadFromEnv(prefix string) (Config, error)` that reads
`<prefix>_POSTGRES_*`.
- `pkg/postgres/open.go` — `OpenPrimary(ctx, cfg) (*sql.DB, error)` and
`OpenReplicas(ctx, cfg) ([]*sql.DB, error)`, which build a
`pgx.ConnConfig` and pass it to `stdlib.OpenDB(...)`; configures pool sizes
and per-statement context timeout.
- `pkg/postgres/migrate.go` — `RunMigrations(ctx context.Context, db *sql.DB,
fs embed.FS) error` wrapping `goose.SetBaseFS(fs)` + `goose.UpContext`.
- `pkg/postgres/otel.go` — `Instrument(db *sql.DB, telemetry telemetry.Runtime)`
applying `otelsql.RegisterDBStatsMetrics` and statement spans.
- `pkg/postgres/postgres_test.go` — testcontainers-backed smoke test:
open primary, run a one-line migration, insert/select.
- `pkg/redisconn/config.go` — `Config` struct (MasterAddr, ReplicaAddrs,
Password, DB, OperationTimeout); helper `LoadFromEnv(prefix string)
(Config, error)` that reads `<prefix>_REDIS_*` (the new shape only;
rejects deprecated TLS/USERNAME vars with a clear error).
- `pkg/redisconn/client.go` — `NewMasterClient(cfg) *redis.Client` and
`NewReplicaClients(cfg) []*redis.Client` (latter returns nil/empty when
replicas not configured).
- `pkg/redisconn/otel.go` — `Instrument(client *redis.Client,
telemetry telemetry.Runtime)` applying `redisotel.InstrumentTracing` /
`InstrumentMetrics`.
- `pkg/redisconn/redisconn_test.go` — miniredis-backed config and master
client tests.
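As an illustration of the `pkg/redisconn` config loader listed above, the sketch below reads the new `<prefix>_REDIS_*` shape and rejects the removed TLS/USERNAME variables. It is an assumption-laden sketch, not the actual package: the function name `loadRedisConfig` and the injected `lookup` parameter (standing in for `os.LookupEnv`, to keep the example testable) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Config mirrors the fields named in the plan for pkg/redisconn.
type Config struct {
	MasterAddr       string
	ReplicaAddrs     []string
	Password         string
	DB               int
	OperationTimeout time.Duration
}

// loadRedisConfig reads <prefix>_REDIS_* through lookup. In a real service
// lookup would be os.LookupEnv; injecting it is an assumption of this sketch.
func loadRedisConfig(prefix string, lookup func(string) (string, bool)) (Config, error) {
	key := func(name string) string { return prefix + "_REDIS_" + name }

	// The old connection variables are rejected instead of silently ignored.
	for _, old := range []string{"TLS_ENABLED", "USERNAME"} {
		if _, ok := lookup(key(old)); ok {
			return Config{}, fmt.Errorf("%s is no longer supported; remove it", key(old))
		}
	}

	cfg := Config{DB: 0, OperationTimeout: 250 * time.Millisecond} // documented defaults
	var ok bool
	if cfg.MasterAddr, ok = lookup(key("MASTER_ADDR")); !ok || cfg.MasterAddr == "" {
		return Config{}, errors.New(key("MASTER_ADDR") + " is required")
	}
	if cfg.Password, ok = lookup(key("PASSWORD")); !ok || cfg.Password == "" {
		return Config{}, errors.New(key("PASSWORD") + " is required")
	}
	if raw, ok := lookup(key("REPLICA_ADDRS")); ok {
		for _, addr := range strings.Split(raw, ",") {
			if addr = strings.TrimSpace(addr); addr != "" {
				cfg.ReplicaAddrs = append(cfg.ReplicaAddrs, addr)
			}
		}
	}
	if raw, ok := lookup(key("DB")); ok {
		db, err := strconv.Atoi(raw)
		if err != nil {
			return Config{}, fmt.Errorf("%s: %w", key("DB"), err)
		}
		cfg.DB = db
	}
	if raw, ok := lookup(key("OPERATION_TIMEOUT")); ok {
		d, err := time.ParseDuration(raw)
		if err != nil {
			return Config{}, fmt.Errorf("%s: %w", key("OPERATION_TIMEOUT"), err)
		}
		cfg.OperationTimeout = d
	}
	return cfg, nil
}

func main() {
	demoEnv := map[string]string{
		"MAIL_REDIS_MASTER_ADDR": "redis:6379",
		"MAIL_REDIS_PASSWORD":    "secret",
	}
	demoLookup := func(k string) (string, bool) { v, ok := demoEnv[k]; return v, ok }
	fmt.Println(loadRedisConfig("MAIL", demoLookup))
}
```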
**Files (touched)**:
- `pkg/go.mod` — add `github.com/jackc/pgx/v5`,
`github.com/jackc/pgx/v5/stdlib`, `github.com/pressly/goose/v3`,
`github.com/testcontainers/testcontainers-go/modules/postgres`,
`github.com/XSAM/otelsql` (for db instrumentation; alternative:
`go.nhat.io/otelsql` — pick one in implementation).
- `go.work` — confirm `pkg/` is registered (already is).
**Verification**:
- `cd /Users/id/src/go/galaxy/pkg && go test ./postgres/... ./redisconn/...`
passes locally with Docker available.
- `go vet ./...` clean.
---
### ~~Stage 2~~ — Integration test harness extension
This stage is implemented.
**Goal**: extend `integration/internal/harness/` with a Postgres container
helper and a service-bootstrap helper that builds the per-service DSN with
the right `search_path`. All existing integration suites stay green.
**Files (new)**:
- `integration/internal/harness/postgres_container.go` —
`StartPostgresContainer(t testing.TB) *PostgresRuntime`. The runtime
exposes `BaseDSN()`, `DSNForSchema(schema, role string) string`, and
`EnsureRoleAndSchema(ctx, schema, role, password string) error` so each
test can prepare an isolated schema for the service it is booting.
- `integration/internal/harness/postgres_container_test.go` — smoke test.
**Files (touched)**:
- `integration/internal/harness/binary.go` — extend `Process`/launch
helpers with `WithPostgres(rt *PostgresRuntime, schema, role string)`
that injects the right `<S>_POSTGRES_PRIMARY_DSN`. (Existing API already
takes `env map[string]string`; this is a thin wrapper.)
- `integration/go.mod` — add the testcontainers Postgres module.
**Out of scope**: no integration suite is yet wired to Postgres; each
service stage wires in its suites.
**Verification**:
- `cd integration && go test ./internal/harness/...` passes.
- `cd integration && go test ./...` still green for all existing suites
(Redis-only services remain Redis-only).
---
### ~~Stage 3~~ — User Service migration (pilot)
**Goal**: replace User Service's Redis durable storage with PostgreSQL. The
two Redis Streams (`user:domain_events`, `user:lifecycle_events`) remain on
Redis. This stage is the pilot; subsequent service stages copy its shape.
**Schema (`user` schema)**:
- `accounts` (user_id PK, email UNIQUE, user_name UNIQUE, display_name,
preferred_language, time_zone, declared_country, created_at, updated_at,
deleted_at).
- `blocked_emails` (email PK, reason_code, blocked_at, actor_type, actor_id,
resolved_user_id).
- `entitlement_records` (record_id PK, user_id FK, plan_code, is_paid,
starts_at, ends_at, source, actor_type, actor_id, reason_code,
updated_at).
- `entitlement_snapshots` (user_id PK FK → accounts, …current effective
values mirroring Redis snapshot shape).
- `sanction_records` (record_id PK, user_id FK, sanction_code, scope,
reason_code, actor_type, actor_id, applied_at, expires_at, removed_at,
removed_by_type, removed_by_id, removed_reason_code).
- `sanction_active` (user_id, sanction_code, record_id) PRIMARY KEY
(user_id, sanction_code).
- `limit_records`, `limit_active` — analogous to sanctions.
- Indexes: `accounts(created_at DESC, user_id DESC)` for newest-first
pagination; `accounts(declared_country)`;
`entitlement_snapshots(plan_code, is_paid)`;
`entitlement_snapshots(ends_at) WHERE is_paid AND ends_at IS NOT NULL`;
`sanction_active(sanction_code)`; `limit_active(limit_code)`. Eligibility
flags become computed predicates on these columns.
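The `accounts(created_at DESC, user_id DESC)` index suits keyset (cursor) pagination. The plan only says "newest-first pagination", so treating it as keyset pagination is an assumption. The sketch below models the ordering and cursor predicate in memory; the types and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type account struct {
	UserID    string
	CreatedAt int64 // Unix seconds, to keep the sketch short
}

// cursor identifies the last row of the previous page.
type cursor struct {
	CreatedAt int64
	UserID    string
}

// before reports whether (a.created_at, a.user_id) < (c.created_at, c.user_id),
// i.e. whether a comes after the cursor row in newest-first order.
func before(a account, c cursor) bool {
	if a.CreatedAt != c.CreatedAt {
		return a.CreatedAt < c.CreatedAt
	}
	return a.UserID < c.UserID
}

// page mimics:
//   WHERE (created_at, user_id) < ($1, $2)
//   ORDER BY created_at DESC, user_id DESC LIMIT $3
// over an in-memory slice. A nil cursor returns the first page.
func page(all []account, after *cursor, limit int) []account {
	sorted := append([]account(nil), all...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].CreatedAt != sorted[j].CreatedAt {
			return sorted[i].CreatedAt > sorted[j].CreatedAt
		}
		return sorted[i].UserID > sorted[j].UserID
	})
	out := make([]account, 0, limit)
	for _, a := range sorted {
		if after != nil && !before(a, *after) {
			continue
		}
		if len(out) == limit {
			break
		}
		out = append(out, a)
	}
	return out
}

func main() {
	all := []account{{"u1", 100}, {"u2", 200}, {"u3", 200}, {"u4", 50}}
	first := page(all, nil, 2)
	last := first[len(first)-1]
	second := page(all, &cursor{last.CreatedAt, last.UserID}, 2)
	fmt.Println(first, second)
}
```

Including `user_id` as a tie-breaker keeps the order total when several accounts share a `created_at` value, which is why the index covers both columns.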
**Files (new)**:
- `galaxy/user/internal/adapters/postgres/migrations/00001_init.sql` —
full schema with grants (`GRANT USAGE ON SCHEMA "user" TO userservice;
GRANT … ON ALL TABLES …;`). `user` is a reserved word in PostgreSQL, so
the schema name is double-quoted in SQL statements.
- `galaxy/user/internal/adapters/postgres/migrations/migrations.go` —
`//go:embed *.sql` and a `Migrations() embed.FS` accessor.
- `galaxy/user/internal/adapters/postgres/jet/...` — generated code
(commit-checked).
- `galaxy/user/internal/adapters/postgres/userstore/store.go` — Postgres
implementation of `ports.UserAccountStore` and `ports.AuthDirectoryStore`.
- `galaxy/user/internal/adapters/postgres/userstore/entitlement_store.go` —
Postgres implementation of `EntitlementSnapshotStore` and
`EntitlementHistoryStore`.
- `galaxy/user/internal/adapters/postgres/userstore/policy_store.go` —
Postgres implementation of `SanctionStore` and `LimitStore`.
- `galaxy/user/internal/adapters/postgres/userstore/list_store.go` —
Postgres implementation of `UserListStore` (pagination + filters
expressed as SQL).
- `galaxy/user/internal/adapters/postgres/userstore/store_test.go` and
siblings — testcontainers-backed unit tests covering the same matrix the
current Redis tests cover.
- `galaxy/user/Makefile` — `jet` target.
- `galaxy/user/docs/postgres-migration.md` — decision record (schema
shape, why we keep `entitlement_snapshots` denormalised, eligibility
expressed as SQL predicates, schema role grants).
**Files (touched)**:
- `galaxy/user/internal/config/config.go` — add Postgres config; refactor
Redis config to master/replica/password (drop `TLS_ENABLED`, `USERNAME`).
- `galaxy/user/internal/config/config_test.go` — update to new env shape.
- `galaxy/user/internal/app/runtime.go` — open Postgres pool, run
migrations on startup before listeners open, wire postgres adapters
into services. Redis client now serves only the two stream publishers.
- `galaxy/user/README.md` — replace "Redis-backed user state" with the
new persistence model, update env-var section.
- `galaxy/user/docs/runbook.md`, `galaxy/user/docs/runtime.md`,
`galaxy/user/docs/examples.md` — update storage references and
config sections.
- `galaxy/user/go.mod` — add `github.com/jackc/pgx/v5{,/stdlib}`,
`github.com/pressly/goose/v3`, `github.com/go-jet/jet/v2`,
`github.com/testcontainers/testcontainers-go/modules/postgres`. Use
`pkg/postgres`, `pkg/redisconn`.
**Files (deleted)**:
- `galaxy/user/internal/adapters/redis/userstore/` — entire directory.
- The portions of `galaxy/user/internal/adapters/redisstate/keyspace.go`
that defined account/entitlement/sanction/limit/index keys (keep only
what `domainevents` and `lifecycleevents` publishers still require — if
none, delete the file outright).
**Files retained on Redis**:
- `galaxy/user/internal/adapters/redis/domainevents/publisher.go`.
- `galaxy/user/internal/adapters/redis/lifecycleevents/publisher.go`.
**Touched integration suites** (each gets a Postgres container in addition
to the existing Redis one):
- `integration/authsessionuser/`
- `integration/gatewayauthsessionuser/`
- `integration/gatewayauthsessionusermail/`
- `integration/notificationuser/`
- `integration/lobbyuser/`
**Verification**:
- `cd galaxy/user && make jet && go test ./...` (Docker needed).
- `cd integration && go test ./authsessionuser/... ./gatewayauthsessionuser/... ./gatewayauthsessionusermail/... ./notificationuser/... ./lobbyuser/...`
- Manual smoke against a `docker-compose` stack (PG + Redis with
passwords) using flows from `galaxy/user/docs/examples.md`.
---
### ~~Stage 4~~ — Mail Service migration
This stage is implemented.
**Goal**: move durable mail storage (deliveries, attempts, dead letters,
malformed commands, payloads, idempotency, attempt schedule) into
PostgreSQL. Keep Redis only for the inbound `mail:delivery_commands`
stream and its consumer offset.
**Schema (`mail` schema)**:
- `deliveries` (delivery_id PK, source, status, recipient_envelope JSONB,
subject, text_body, html_body, payload_mode, template_id,
idempotency_source, idempotency_key, locale_fallback_used,
next_attempt_at, attempt_count, max_attempts, created_at, updated_at).
- INDEX (status, next_attempt_at) for the scheduler.
- UNIQUE (idempotency_source, idempotency_key) — the idempotency record
IS this row (no separate kv).
- INDEX (created_at DESC) for operator listings; INDEX on status, source,
template_id, recipient as needed.
- `attempts` (delivery_id FK, attempt_no, status, provider_summary,
scheduled_for_ms, started_at_ms, completed_at_ms, PRIMARY KEY
(delivery_id, attempt_no)).
- `dead_letters` (delivery_id PK FK, final_attempt_count, max_attempts,
failure_classification, failure_message, created_at_ms).
- `delivery_payloads` (delivery_id PK FK, template_variables JSONB).
- `malformed_commands` (stream_entry_id PK, failure_code, failure_message,
raw_fields JSONB, recorded_at_ms; INDEX created_at).
**Files**: mirror Stage 3 (postgres adapter package, migrations, jet
codegen, Makefile, decision record, removal of corresponding
`internal/adapters/redisstate/*` files for migrated entities, retention
of stream offset and consumer wiring on Redis).
**Worker change**: the mail attempt scheduler loop replaces
`ZRANGEBYSCORE` over `mail:attempt_schedule` with
`SELECT … FROM deliveries WHERE status IN ('queued','retry_pending') AND next_attempt_at <= now() ORDER BY next_attempt_at LIMIT N FOR UPDATE SKIP LOCKED`.
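The claim query above only protects the selected rows while its transaction is open, so the scheduler has to run it inside a transaction and commit after updating the claimed rows. The sketch below shows that shape with plain `database/sql`. The plan requires go-jet builders for production queries; raw SQL is used here only to keep the example dependency-free. `claimDue` is a hypothetical name, and treating `delivery_id` as a string is an assumption.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// claimDueSQL selects due deliveries and locks them; rows already locked by
// another scheduler instance are skipped instead of waited on.
const claimDueSQL = `
SELECT delivery_id
FROM deliveries
WHERE status IN ('queued', 'retry_pending')
  AND next_attempt_at <= now()
ORDER BY next_attempt_at
LIMIT $1
FOR UPDATE SKIP LOCKED`

// claimDue must be called with an open transaction: the row locks are held
// until the caller commits or rolls back tx.
func claimDue(ctx context.Context, tx *sql.Tx, limit int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, claimDueSQL, limit)
	if err != nil {
		return nil, fmt.Errorf("claim due deliveries: %w", err)
	}
	defer rows.Close()

	var ids []string
	for rows.Next() {
		var id string // assumption: delivery_id is text-like
		if err := rows.Scan(&id); err != nil {
			return nil, fmt.Errorf("scan delivery id: %w", err)
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

func main() {
	// Running claimDue needs a live PostgreSQL connection; here we only
	// print the statement the scheduler would execute.
	fmt.Println(claimDueSQL)
}
```

With `SKIP LOCKED`, several scheduler instances can poll the same table concurrently without blocking each other or claiming the same delivery twice.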
**Files (deleted)**:
- `galaxy/mail/internal/adapters/redisstate/auth_acceptance_store.go`
- `galaxy/mail/internal/adapters/redisstate/generic_acceptance_store.go`
- `galaxy/mail/internal/adapters/redisstate/attempt_execution_store.go`
- `galaxy/mail/internal/adapters/redisstate/operator_store.go`
- `galaxy/mail/internal/adapters/redisstate/malformed_command_store.go`
- `galaxy/mail/internal/adapters/redisstate/render_store.go`
- The portions of `galaxy/mail/internal/adapters/redisstate/keyspace.go`
no longer used (`mail:attempt_schedule`, `mail:idempotency:*`, all
delivery/attempt/dead-letter/index keys).
**Files retained on Redis**:
- `galaxy/mail/internal/adapters/redisstate/stream_offset_store.go` (offset
for `mail:delivery_commands` consumer).
- The command stream consumer wiring itself.
**Touched integration suites**:
- `integration/authsessionmail/`
- `integration/gatewayauthsessionmail/`
- `integration/gatewayauthsessionusermail/`
- `integration/notificationmail/`
**Verification**: per Stage 3 pattern; plus end-to-end smoke that pushes
a delivery through retry_pending → provider_accepted using the SMTP stub.
---
### ~~Stage 5~~ — Notification Service migration
This stage is implemented.
**Goal**: move durable notification storage (records, routes, idempotency,
dead letters, malformed intents) into PostgreSQL. Keep Redis for the
inbound `notification:intents` stream, the outbound `gateway:client-events`
stream, the outbound `mail:delivery_commands` stream, the corresponding
stream offsets, and the short-lived per-route lease (`route_leases:*`).
**Schema (`notification` schema)**:
- `records` (notification_id PK, notification_type, producer, audience_kind,
recipient_user_ids JSONB, payload JSONB, idempotency_key,
request_fingerprint, request_id, trace_id, occurred_at_ms,
accepted_at_ms, updated_at_ms).
- UNIQUE (producer, idempotency_key) — idempotency record IS this row.
- `routes` (notification_id, route_id, channel, recipient_ref, status,
attempt_count, max_attempts, next_attempt_at_ms, resolved_email,
resolved_locale, last_error_classification, last_error_message,
last_error_at_ms, created_at_ms, updated_at_ms, published_at_ms,
dead_lettered_at_ms, skipped_at_ms, PRIMARY KEY
(notification_id, route_id)).
- INDEX (status, next_attempt_at_ms) for the scheduler.
- `dead_letters` (notification_id, route_id PK FK, channel, recipient_ref,
final_attempt_count, max_attempts, failure_classification,
failure_message, recovery_hint, created_at_ms).
- `malformed_intents` (stream_entry_id PK, notification_type, producer,
idempotency_key, failure_code, failure_message, raw_fields JSONB,
recorded_at_ms).
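Because the idempotency record is the `records` row itself, acceptance reduces to "insert, and on a unique-key collision inspect the existing row". The in-memory sketch below models that decision. The three outcomes and the use of `request_fingerprint` to tell a replay from a conflicting reuse of the key are assumptions of this sketch, and all type and function names are hypothetical.

```go
package main

import "fmt"

// recordKey models the UNIQUE (producer, idempotency_key) constraint.
type recordKey struct{ Producer, IdempotencyKey string }

type record struct {
	NotificationID     string
	RequestFingerprint string
}

type outcome string

const (
	accepted  outcome = "accepted"  // new row inserted
	duplicate outcome = "duplicate" // same request replayed
	conflict  outcome = "conflict"  // key reused with a different request
)

// table stands in for notification.records; a map key plays the role of the
// unique index.
type table map[recordKey]record

// accept inserts rec unless the key already exists, in which case it
// returns the existing notification ID and classifies the collision.
func (t table) accept(producer, key string, rec record) (outcome, string) {
	k := recordKey{producer, key}
	existing, found := t[k]
	if !found {
		t[k] = rec
		return accepted, rec.NotificationID
	}
	if existing.RequestFingerprint == rec.RequestFingerprint {
		return duplicate, existing.NotificationID
	}
	return conflict, existing.NotificationID
}

func main() {
	records := table{}
	fmt.Println(records.accept("lobby", "k1", record{"n1", "fp-a"}))
	fmt.Println(records.accept("lobby", "k1", record{"n2", "fp-a"}))
	fmt.Println(records.accept("lobby", "k1", record{"n3", "fp-b"}))
}
```

In PostgreSQL the same decision would hang off the unique-violation path of the insert, with no second store to keep in sync.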
**Worker change**: route publisher selects work via the same
`FOR UPDATE SKIP LOCKED` pattern as Mail. The Redis lease is still used
as a short-lived, per-process exclusivity hint atop the SQL claim.
**Files (deleted)**:
- `galaxy/notification/internal/adapters/redisstate/acceptance_store.go`
- `galaxy/notification/internal/adapters/redisstate/route_state_store.go`
- `galaxy/notification/internal/adapters/redisstate/malformed_intent_store.go`
- The portions of
`galaxy/notification/internal/adapters/redisstate/keyspace.go` no longer
used (records, routes, idempotency, dead_letters, malformed_intents).
**Files retained on Redis**:
- `galaxy/notification/internal/adapters/redisstate/stream_offset_store.go`.
- Route lease key generator (still under `redisstate/`, narrowed to leases
only).
- All stream consumer/publisher wiring.
**Touched integration suites**:
- `integration/notificationgateway/`
- `integration/notificationmail/`
- `integration/notificationuser/`
---
### ~~Stage 6A~~ — Lobby Service: core enrollment entities
**Goal**: move `Game`, `Application`, `Invite`, `Membership` records and
their indexes into PostgreSQL. RaceNameDirectory, GameTurnStats,
GapActivation, EvaluationGuard, StreamOffset remain on Redis until later
sub-stages.
**Schema (`lobby` schema, partial)**:
- `games` (game_id PK, owner_id, kind ('public'|'private'), status,
created_at, updated_at, runtime_snapshot JSONB, runtime_binding JSONB,
…other denormalised game settings).
- INDEX (status, created_at).
- INDEX (owner_id) WHERE kind = 'private'.
- `applications` (application_id PK, game_id FK, user_id, status,
canonical_key, submitted_at, decided_at).
- PARTIAL UNIQUE INDEX (user_id, game_id) WHERE status = 'active' —
enforces the single-active constraint at the DB level (replaces
`lobby:user_game_application:*:*`).
- INDEX (game_id), INDEX (user_id).
- `invites` (invite_id PK, game_id FK, inviter_id, invitee_id, race_name,
status, created_at, expires_at, decided_at).
- INDEX (game_id), INDEX (invitee_id), INDEX (inviter_id).
- INDEX (status, expires_at) for any expiration scanner if needed.
- `memberships` (membership_id PK, game_id FK, user_id, status, joined_at,
canonical_key, …).
- INDEX (game_id), INDEX (user_id).
**Files (new)**:
- `galaxy/lobby/internal/adapters/postgres/migrations/00001_core_entities.sql`.
- `galaxy/lobby/internal/adapters/postgres/migrations/migrations.go`.
- `galaxy/lobby/internal/adapters/postgres/jet/...`.
- `galaxy/lobby/internal/adapters/postgres/gamestore/store.go`.
- `galaxy/lobby/internal/adapters/postgres/applicationstore/store.go`.
- `galaxy/lobby/internal/adapters/postgres/invitestore/store.go`.
- `galaxy/lobby/internal/adapters/postgres/membershipstore/store.go`.
- Test files for each store using the existing test patterns.
- `galaxy/lobby/Makefile` (`jet` target).
- `galaxy/lobby/docs/postgres-migration.md` (decision record covering
this sub-stage and what is intentionally left for 6B/6C).
**Files (touched)**:
- `galaxy/lobby/internal/config/config.go` — add Postgres config; refactor
Redis config to the new shape.
- `galaxy/lobby/internal/app/runtime.go` — open Postgres pool, run
migrations on startup, wire core PG-backed stores into services.
RaceNameDirectory and stats/guard stores still wired to Redis until 6B/6C.
- `galaxy/lobby/README.md` and `galaxy/lobby/docs/runbook.md` — updated
to describe core entities on PG, RND/stats still on Redis until 6B/6C.
**Files (deleted)**:
- `galaxy/lobby/internal/adapters/redisstate/gamestore.go`,
`applicationstore.go`, `invitestore.go`, `membershipstore.go`.
- The corresponding sections of `redisstate/keyspace.go`.
**Stub adapters retained**: `gamestub/`, `applicationstub/`, `invitestub/`,
`membershipstub/` stay — they are pure in-memory ports useful for tests
that don't need real PG.
**Touched integration suites**:
- `integration/lobbyuser/`
- `integration/lobbynotification/`
**Verification**: per Stage 3 pattern; plus the existing lobby HTTP
contract tests against the public/internal ports.
---
### ~~Stage 6B~~ — Lobby Service: RaceNameDirectory
This stage is implemented.
**Goal**: replace the Lua-backed Redis `RaceNameDirectory` with a PG
implementation that preserves the two-tier model (registered / reservation /
pending_registration) and atomic registration semantics via SQL
transactions and (where required) advisory locks.
**Schema (additions to `lobby` schema)**:
- `race_names` (canonical_key PK, holder_user_id, binding_kind ('registered'
| 'reserved' | 'pending_registration'), source_game_id, eligible_until_ms,
registered_at_ms, reserved_at_ms).
- INDEX (holder_user_id) for `ListRegistered`/`ListReservations`/
`ListPendingRegistrations` queries.
- PARTIAL INDEX (eligible_until_ms) WHERE binding_kind =
'pending_registration' for the expiration scanner.
- The confusable-pair policy is enforced at write time inside
`BEGIN … COMMIT` transactions; `Reserve`/`Register`/
`MarkPendingRegistration` use `SELECT … FOR UPDATE` on the canonical
keys involved (or PG advisory locks keyed by `hashtext(canonical_key)`)
to serialise concurrent attempts.
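When one transaction has to lock several canonical keys (for example a candidate name plus its confusable variants), taking the locks in a deterministic order prevents two concurrent transactions from deadlocking on each other. The plan does not state an ordering rule, so the sketch below is an assumption; `lockOrder` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// lockOrder returns the distinct canonical keys in ascending order. Every
// transaction that locks its keys in this order acquires shared keys in the
// same sequence, so lock cycles between transactions cannot form.
func lockOrder(keys ...string) []string {
	seen := make(map[string]struct{}, len(keys))
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		if _, dup := seen[k]; dup {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

func main() {
	// The caller would issue SELECT ... FOR UPDATE (or take an advisory
	// lock) for each key in the returned order.
	fmt.Println(lockOrder("zeta", "alpha", "zeta"))
}
```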
**Files (new)**:
- `galaxy/lobby/internal/adapters/postgres/migrations/00002_race_names.sql`.
- `galaxy/lobby/internal/adapters/postgres/racenamedir/directory.go` —
Postgres implementation of `ports.RaceNameDirectory`.
- `galaxy/lobby/internal/adapters/postgres/racenamedir/directory_test.go`
— runs the existing shared suite at
`galaxy/lobby/internal/ports/racenamedirtest/suite.go`.
**Files (touched)**:
- `galaxy/lobby/internal/app/runtime.go` — wire PG RND.
- `galaxy/lobby/internal/ports/racenamedirtest/suite.go` — only
shape-preserving updates if the suite assumed Redis-only behaviour
(e.g. SCAN-based list ordering).
- `galaxy/lobby/README.md`, `galaxy/lobby/docs/runbook.md` — RND now
PG-backed; canonical_lookup cache no longer needed (PG indexed lookup is
fast enough; remove the Redis cache key from `redisstate/keyspace.go`).
**Files (deleted)**:
- `galaxy/lobby/internal/adapters/redisstate/racenamedir.go` and the
embedded Lua scripts.
- Not deleted: `galaxy/lobby/internal/adapters/racenamestub/` stays (useful
for unit tests that don't need PG).
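**Worker change**: the pending-registration expiration worker switches
from `ZRANGEBYSCORE` on `lobby:race_names:pending_index` to
`SELECT … FROM race_names WHERE binding_kind='pending_registration' AND eligible_until_ms <= $1`,
with the current time in Unix milliseconds bound as `$1` (the `_ms` column
holds milliseconds, so it is not compared with `now()` directly).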
**Verification**: shared port suite (`racenamedirtest`) green against PG
adapter; lobby unit tests green; `integration/lobbyuser/`,
`integration/lobbynotification/` green.
---
### ~~Stage 6C~~ — Lobby Service: workers, ephemeral stores, cleanup
This stage is implemented.
**Goal**: finish the lobby migration. Confirm what stays Redis-only,
update workers that touch both backends, drop dead Redis adapters.
**Stays on Redis (per architectural rules)**:
- `GameTurnStatsStore` — ephemeral per-game aggregate, deleted at game
finish, rebuildable from GM events.
- `EvaluationGuardStore` — ephemeral marker.
- `GapActivationStore` — short-lived gap-window timestamp cache.
- `StreamOffsetStore` — runtime coordination per the architectural rule.
- All stream consumers and publishers (`gm:lobby_events`,
`runtime:job_results`, `user:lifecycle_events`, `notification:intents`).
This is documented in `galaxy/lobby/docs/postgres-migration.md`.
**Files (touched)**:
- `galaxy/lobby/internal/worker/gmevents/consumer.go` — write durable
updates via PG-backed `GameStore`.
- `galaxy/lobby/internal/worker/runtimejobresult/consumer.go` — same.
- `galaxy/lobby/internal/adapters/userlifecycle/consumer.go` (and the
worker that drives it) — RND release, membership/application/invite
cascade all flow through PG.
- `galaxy/lobby/internal/worker/pendingregistration/worker.go` — PG-based
scan, no Redis ZSET.
- `galaxy/lobby/internal/worker/enrollmentautomation/worker.go` — uses PG
`GameStore.GetByStatus("enrollment_open")`.
- `galaxy/lobby/internal/adapters/redisstate/keyspace.go` — pruned to the
remaining Redis keys (turn stats, gap activation, evaluation guard,
stream offsets, lifecycle stream consumer state).
- `galaxy/lobby/README.md`, `galaxy/lobby/docs/runtime.md`,
`galaxy/lobby/docs/runbook.md`, `galaxy/lobby/docs/examples.md` —
finalised storage descriptions.
**Files (deleted)**:
- Anything left in `galaxy/lobby/internal/adapters/redisstate/` whose
only consumer was a port now PG-backed (see 6A/6B deletions).
**Verification**:
- All previously-green lobby unit tests pass with PG-backed adapters.
- `integration/lobbyuser/`, `integration/lobbynotification/` pass.
- `grep -rn "redisstate" galaxy/lobby/internal/` returns only references to
the stores intentionally retained on Redis.
---
### ~~Stage 7~~ — Gateway and Auth/Session: Redis configuration refactor
This stage is implemented.
**Goal**: apply the new Redis configuration shape (master/replica/password,
drop TLS/USERNAME) to Gateway and Auth/Session. No PG migration; these
services intentionally stay Redis-only.
**Files (touched)**:
- `galaxy/gateway/internal/config/config.go` — switch `RedisConfig`
fields to the `pkg/redisconn.Config` shape; update the three
prefixes: `GATEWAY_SESSION_CACHE_REDIS_*`, `GATEWAY_REPLAY_REDIS_*`,
`GATEWAY_SESSION_EVENTS_REDIS_*`. Drop `TLS_ENABLED`, `USERNAME`.
- `galaxy/gateway/internal/session/redis.go`,
`galaxy/gateway/internal/replay/redis.go`,
`galaxy/gateway/internal/events/subscriber.go` — adopt new client
constructor via `pkg/redisconn`.
- `galaxy/gateway/internal/config/config_test.go`,
`galaxy/gateway/internal/session/redis_test.go`,
`galaxy/gateway/internal/replay/redis_test.go` — updated to new env shape.
- `galaxy/authsession/internal/config/config.go` — same pattern; drop
TLS, USERNAME.
- `galaxy/authsession/internal/adapters/redis/sessionstore/store.go`,
`challengestore/store.go`, `projectionpublisher/publisher.go`,
`sendemailcodeabuse/protector.go`, `configprovider/store.go` — adopt
new client.
- `galaxy/authsession/internal/config/config_test.go` — updated.
- `galaxy/gateway/README.md`, `galaxy/authsession/README.md`,
`galaxy/gateway/docs/runbook.md`, `galaxy/authsession/docs/runbook.md`
— note that Redis-only is intentional and reference the `ARCHITECTURE.md`
rule on TTL-bounded auth state.
**No deletions of business logic**; only env-var refactor and adapter
plumbing through `pkg/redisconn`.
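
The env-var refactor described in this stage might look roughly like the loader below. It is a sketch under stated assumptions, not the real `pkg/redisconn` API: the `Config` fields, the `MASTER_ADDR`, `REPLICA_ADDRS`, and `PASSWORD` suffixes, the required password, and the `loadFromEnv` signature are all hypothetical. Only the retired `TLS_ENABLED` and `USERNAME` suffixes and the per-component prefixes come from this plan.

```go
package main

import (
	"fmt"
	"strings"
)

// Config is a hypothetical stand-in for the master/replica/password shape.
type Config struct {
	MasterAddr   string
	ReplicaAddrs []string
	Password     string
}

// retiredSuffixes lists the env-var suffixes this plan retires.
var retiredSuffixes = []string{"TLS_ENABLED", "USERNAME"}

// loadFromEnv reads <prefix>_* variables through lookup, which would be
// os.LookupEnv in a real service. A retired variable that is still set is
// reported as an error instead of being silently ignored.
func loadFromEnv(prefix string, lookup func(string) (string, bool)) (Config, error) {
	for _, suffix := range retiredSuffixes {
		name := prefix + "_" + suffix
		if _, ok := lookup(name); ok {
			return Config{}, fmt.Errorf("%s is retired and must be unset", name)
		}
	}
	master, ok := lookup(prefix + "_MASTER_ADDR")
	if !ok || master == "" {
		return Config{}, fmt.Errorf("%s_MASTER_ADDR is required", prefix)
	}
	password, ok := lookup(prefix + "_PASSWORD")
	if !ok || password == "" {
		return Config{}, fmt.Errorf("%s_PASSWORD is required", prefix)
	}
	cfg := Config{MasterAddr: master, Password: password}
	if raw, ok := lookup(prefix + "_REPLICA_ADDRS"); ok {
		// Replica addresses are optional and comma-separated.
		for _, addr := range strings.Split(raw, ",") {
			if addr = strings.TrimSpace(addr); addr != "" {
				cfg.ReplicaAddrs = append(cfg.ReplicaAddrs, addr)
			}
		}
	}
	return cfg, nil
}
```

Taking the lookup function as a parameter means the three Gateway prefixes can share one loader, and config tests can feed a plain map instead of mutating the process environment.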
**Touched integration suites**:
- `integration/gatewayauthsession/`
- `integration/authsession/`
- (every suite that boots gateway or authsession picks up the new env vars
via the harness; confirm none still pass `*_REDIS_TLS_ENABLED`).
**Verification**:
- `cd galaxy/gateway && go test ./...`
- `cd galaxy/authsession && go test ./...`
- `cd integration && go test ./gatewayauthsession/... ./authsession/...`
---
### ~~Stage 8~~ — GeoProfile: documentation only
**Goal**: ensure the GeoProfile plan and README reflect the new
persistence rules so its future implementation follows them. No code
exists yet.
**Files (touched)**:
- `galaxy/geoprofile/PLAN.md` — add a stage referencing `pkg/postgres`
and `pkg/redisconn`; specify that observed-country aggregates,
declared_country history and review records will live in a `geoprofile`
schema, while ephemeral per-session signals (if any) stay on Redis.
- `galaxy/geoprofile/README.md` — note ownership of the `geoprofile`
schema and the stack choices.
**No code change**.
---
### ~~Stage 9~~ — Final sweep
**Goal**: confirm no dead Redis adapter code, no orphaned stub, no
broken doc reference. Remove the *Migration Window* caveat from
`ARCHITECTURE.md` once all stages are done.
**Activities**:
- Walk every PG-backed service: `grep -rn "redis" galaxy/<svc>/internal/adapters/`
and verify every match belongs to a still-active stream/cache/runtime
use case.
- Walk integration suites: confirm each one provisions only the
containers it actually needs; no stale env vars.
- Update `ARCHITECTURE.md` to drop the *Migration Window* sub-section.
- Combine each sequence of migration `.sql` files into a single initial
file. Rewrite the SQL rather than concatenating the files.
The project is still in development, so all schema updates can go
directly into the first and only step of the relevant migrations. This
rule should be recorded in `ARCHITECTURE.md` as well.
- One round of `go test ./...` in every module plus
`cd integration && go test ./...`.
**Verification**:
- All tests pass in every module.
- No file matches `// TODO.*postgres` or `// TODO.*migrate`.
- `git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)'` returns nothing under
`galaxy/` (these env vars are fully retired).
---
## Verification strategy (whole project)
After each stage:
- `cd /Users/id/src/go/galaxy/pkg && go test ./...`
- `cd /Users/id/src/go/galaxy/<changed_service> && go test ./...`
(with Docker available for testcontainers).
- `cd /Users/id/src/go/galaxy/integration && go test ./<affected_suites>/...`
- Manual smoke against a `docker-compose` stack (PG + Redis, both with
passwords) using the example flows in each service's `docs/examples.md`.
After Stage 9:
- `cd /Users/id/src/go/galaxy/integration && go test ./...` end to end
against real PG + real Redis.
- Confirm `git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)'` returns nothing
under `galaxy/`.
- Confirm `git grep -nE 'TODO.*(postgres|migrate)'` returns nothing.
## Out of scope
- `galaxy/game` — explicitly excluded by the project owner.
- Production deployment manifests (Helm/k8s) — local `docker-compose` is
enough for development.
- Backup/restore tooling configuration — `pg_dump` and WAL archiving are
available out of the box; operational setup is not part of this plan.
- Sentinel/Cluster Redis topology code paths — config exposes replica
addresses for future use; no failover routing implemented yet.
- Read-traffic routing to PG replicas — config exposes
`*_POSTGRES_REPLICA_DSNS` for future use; no routing implemented yet.
- `golangci-lint` config addition — not part of this migration.
- CI pipeline — no `.github/workflows/` exists; not added by this plan.
## Risks and notes
- **`go-jet` codegen requires a live database**. The `make jet` target
per service uses `testcontainers-go` to bring up a transient PG, applies
the same goose migrations the service applies at startup, then runs
`jet -dsn=… -path=internal/adapters/postgres/jet`. Generated code is
committed; consumers don't need Docker just to build.
- **Schema-per-service vs single-DB cross-service joins**: there are no
cross-schema joins in this plan. Each service reads only its own schema;
cross-service data flows go via Redis Streams (event bus) or HTTP
contracts (User Service is queried by Lobby for eligibility) — same as
today. The DB-level role grants enforce this.
- **Pending registration expiration worker**: under Redis it scanned a
global ZSET; under PG it does an indexed scan. The partial index on
`eligible_until_ms WHERE binding_kind='pending_registration'` keeps the
scan cheap.
- **Idempotency under crash**: with idempotency expressed as a UNIQUE
constraint on the durable record, recovery is "the row either exists or
it doesn't" — no Redis-loss window where duplicates can sneak through.
- **lib/pq vs pgx (revisit)**: confirmed pgx/v5 + jet via stdlib adapter.
The `make jet` target will pass `-source=postgres` to jet (the dialect
is independent of which Go driver runs the queries at runtime).
- **No backward-compat shim for env vars**: `*_REDIS_TLS_ENABLED` and
`*_REDIS_USERNAME` are retired in one cut. Any external dev environment
that sets these will start failing fast at startup with a clear error
emitted by `pkg/redisconn.LoadFromEnv`.
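
The idempotency note in the list above can be illustrated with a short Go sketch. The `deliveries` table, its `idempotency_key` and `payload` columns, the `execer` interface, and `recordOnce` are hypothetical names chosen for the example; the only thing taken from this plan is the idea that a UNIQUE constraint on the durable record decides whether a retry is a duplicate.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// execer is the subset of *sql.DB / *sql.Tx this sketch needs (hypothetical).
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// insertOnceSQL relies on a UNIQUE constraint on idempotency_key. A retry
// after a crash hits the conflict and becomes a no-op instead of a duplicate.
const insertOnceSQL = `
INSERT INTO deliveries (idempotency_key, payload)
VALUES ($1, $2)
ON CONFLICT (idempotency_key) DO NOTHING`

// recordOnce reports whether this call created the row. A false result with
// a nil error means an earlier attempt already committed the record.
func recordOnce(ctx context.Context, db execer, key string, payload []byte) (bool, error) {
	if key == "" {
		return false, fmt.Errorf("record once: empty idempotency key")
	}
	res, err := db.ExecContext(ctx, insertOnceSQL, key, payload)
	if err != nil {
		return false, fmt.Errorf("record once %q: %w", key, err)
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, fmt.Errorf("record once %q: rows affected: %w", key, err)
	}
	return n == 1, nil
}
```

The caller would perform side effects only when `recordOnce` returns `true`, or run the insert and the side-effect bookkeeping in the same transaction so both commit or neither does.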