PostgreSQL Migration Plan
This plan has already been implemented and is kept here for historical reasons.
It should NOT be treated as the source of truth for service functionality.
Context
The Galaxy Game project currently uses Redis as the only persistence backend
across all implemented services (user, mail, notification, lobby,
gateway, authsession). Redis serves both kinds of state: ephemeral and
runtime-coordination state (where it shines — Streams, caches, replay keys,
runtime queues, session caches, leases) and table-shaped business state where
it is a poor fit (durable user accounts, entitlements/sanctions, mail audit
records, notification routes/idempotency, lobby memberships and invites).
Replication and standby for Redis are not configured anywhere. There is no
SQL/migration tooling in the repo at all.
We migrate to a Redis + PostgreSQL split where each backend owns the data it
serves best. PostgreSQL becomes the source of truth for table-shaped business
state, gives us ACID transactions, mature physical/logical replication, and
backup/restore via pg_dump and WAL archiving. Redis remains the source of
truth for streams, pub/sub, caches, leases, replay keys, rate limits, session
caches, runtime queues, and stream consumer offsets.
The plan migrates only services already implemented and explicitly excludes
galaxy/game. It targets steady-state architecture rules first (one
authoritative document, ARCHITECTURE.md), then walks each service end to end
— code, tests, service-local README/docs, and integration suites — so that no
intermediate commit leaves docs and code in conflict.
Confirmed decisions (with project owner)
- Documentation strategy: ARCHITECTURE.md is updated as the very first stage with the architecture-wide rules. Each per-service README and per-service docs/ directory changes inside that service's own stage, paired with code and tests. This keeps ARCHITECTURE.md as the statement of policy and each README as the description of current state, and ensures that any commit can be checked out without code/doc divergence.
- Service scope: full migration of durable storage to PostgreSQL for user, mail, notification, and lobby. Gateway and authsession get only a Redis configuration refactor (master/replica, mandatory password, TLS_ENABLED/USERNAME dropped); these services intentionally stay Redis-only. geoprofile has no implementation; its PLAN.md and README.md absorb the new persistence rules so that its future implementation follows them.
- Idempotency and retry-schedule placement: idempotency records and retry schedule queues live in PostgreSQL, on the same table as the durable record they protect (a UNIQUE (producer, idempotency_key) constraint on records, a next_attempt_at column on deliveries/routes). One source of truth, no dual-write hazard between PostgreSQL and Redis ZSETs.
- Stack: the github.com/jackc/pgx/v5 driver, exposed as *sql.DB via github.com/jackc/pgx/v5/stdlib; github.com/go-jet/jet/v2 for type-safe query building and code generation, generated against a testcontainers PostgreSQL instance with migrations applied (one Makefile target per service); the github.com/pressly/goose/v3 library API for embedded migrations applied at service startup. The goose CLI may be used for local development and rollback investigations but is not in the service binary path.
- Code: all PostgreSQL queries must use pre-generated jet code and the appropriate builders rather than raw SQL, unless go-jet lacks the functionality a business scenario requires.
Architectural rules (target steady-state)
These rules land in ARCHITECTURE.md in Stage 0 and govern every subsequent
service stage.
Backend assignment
PostgreSQL is the source of truth for:
- Domain entities with table-shaped business state (accounts, entitlement_records, sanction_records, limit_records, blocked_emails, deliveries, attempts, dead_letters, malformed_commands, notification_records, notification_routes, games, applications, invites, memberships, race_names).
- Idempotency records (UNIQUE constraint on the durable table, not a separate key-value store).
- Retry scheduling state (next_attempt_at column plus a supporting index on the durable table).
- Audit history records that must outlive any Redis snapshot.
Redis is the source of truth for:
- Redis Streams used as the event bus (user:domain_events, user:lifecycle_events, gm:lobby_events, runtime:job_results, notification:intents, gateway:client-events, mail:delivery_commands).
- Stream consumer offsets (small runtime coordination state, rebuildable).
- Caches and projections (gateway session cache).
- Replay reservation keys.
- Rate limit counters.
- Runtime coordination locks/leases (e.g. notification route_leases).
- Authentication challenge state and active session tokens (TTL-bounded; loss is recoverable by re-authentication).
- Ephemeral per-game runtime aggregates that are deleted at game finish (lobby game_turn_stats, gap_activated_at, capability evaluation marker).
Database topology
- Single PostgreSQL database: galaxy.
- Schema-per-service: user, mail, notification, lobby. Reserved for later: geoprofile. Not allocated unless needed: gateway, authsession.
- Per-service PostgreSQL role with grants restricted to its own schema (defense in depth, simple to express in the initial migration).
- Authentication: username + password only. sslmode=disable. No client certificates, no SCRAM channel binding, no custom auth plugins.
- Each service connects to one primary plus zero or more read-only replicas. In this iteration only the primary is used; the replica pool is wired but receives no traffic. Future read routing is non-breaking.
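The "wired but unused" replica rule can be made concrete with a small holder type that separates write handles from read handles. The sketch below uses a hypothetical Pool type. It is generic so that it compiles without the pgx or go-redis modules; in a service, T would be *sql.DB or *redis.Client. In this iteration, Reader deliberately returns the primary.

```go
package main

// Pool is a hypothetical holder for one primary handle plus zero or more
// read-only replica handles. In a service T would be *sql.DB or *redis.Client.
type Pool[T any] struct {
	Primary  T
	Replicas []T
}

// Writer returns the handle for writes; always the primary.
func (p Pool[T]) Writer() T { return p.Primary }

// Reader returns the handle for reads. In this iteration it ignores
// Replicas, so wiring replicas in sends them no traffic; a later version can
// pick a replica here without touching any caller.
func (p Pool[T]) Reader() T { return p.Primary }
```

Every read call site already goes through Reader. Enabling replica reads later is therefore a change inside one method, not a change across callers.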
Redis topology
- Each service connects to one master Redis plus zero or more replica Redis hosts.
- All connections use a mandatory password. USERNAME/ACL is not used. TLS is off.
- In this iteration only the master is used. The replica list is wired but unused, so switching reads to replicas later is non-breaking.
- The existing env vars *_REDIS_TLS_ENABLED and *_REDIS_USERNAME are removed. This is a hard rename with no backward-compatibility shim: the project is fresh and has no production deploys to migrate.
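There is no compatibility shim, so a loader can fail fast when an old variable is still set instead of silently ignoring it. Below is a minimal sketch. It assumes a lookup function with the same signature as os.LookupEnv. checkDeprecatedRedisEnv is a hypothetical name, not the pkg/redisconn API.

```go
package main

import (
	"fmt"
	"strings"
)

// deprecatedRedisSuffixes lists env-var suffixes removed by the new Redis shape.
var deprecatedRedisSuffixes = []string{"_REDIS_TLS_ENABLED", "_REDIS_USERNAME"}

// checkDeprecatedRedisEnv returns an error naming every removed variable that
// is still set for the given service prefix. lookup has the same signature
// as os.LookupEnv, so tests can pass a map-backed fake.
func checkDeprecatedRedisEnv(prefix string, lookup func(string) (string, bool)) error {
	var found []string
	for _, suffix := range deprecatedRedisSuffixes {
		name := prefix + suffix
		if _, ok := lookup(name); ok {
			found = append(found, name)
		}
	}
	if len(found) > 0 {
		return fmt.Errorf("removed Redis settings are still set: %s (TLS and ACL usernames are no longer supported)",
			strings.Join(found, ", "))
	}
	return nil
}
```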
Library stack
- Driver: github.com/jackc/pgx/v5 (modern, actively maintained), exposed to database/sql via github.com/jackc/pgx/v5/stdlib, so go-jet's qrm.Queryable interface is satisfied without changes.
- Query layer: github.com/go-jet/jet/v2 (PostgreSQL dialect). Generated code lives under each service's internal/adapters/postgres/jet/, is regenerated via a make jet target, and is committed to the repo.
- Migrations: the github.com/pressly/goose/v3 library API, with migration files embedded via //go:embed *.sql. Migrations are applied at startup, before any HTTP/gRPC listener opens, and the service exits non-zero on failure.
- Test infrastructure: github.com/testcontainers/testcontainers-go plus its modules/postgres submodule; make jet reuses the same setup to host a transient instance for jet codegen.
Migration discipline
- Forward-only, sequence-numbered files: 00001_init.sql, 00002_*.sql, …
- Lowercase snake_case names with goose -- +goose Up / -- +goose Down markers. Goose runs each migration in a transaction by default. Statements that contain semicolons of their own (e.g. PL/pgSQL function bodies) are wrapped in -- +goose StatementBegin / -- +goose StatementEnd so goose does not split them.
- Migrations apply at service startup; the service exits non-zero on failure.
- A per-service decision record at galaxy/<service>/docs/postgres-migration.md captures schema decisions and any non-trivial deviation from the rules.
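The naming rules are easy to check mechanically, for instance in a unit test next to the embedded migrations. Below is a sketch that assumes five-digit versions as in 00001_init.sql. checkMigrationNames is a hypothetical helper and does not replace goose's own version handling.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// migrationName matches 00001_init.sql-style names: a five-digit version
// followed by a lowercase snake_case description.
var migrationName = regexp.MustCompile(`^([0-9]{5})_([a-z0-9]+(?:_[a-z0-9]+)*)\.sql$`)

// checkMigrationNames rejects badly named files and any gap or duplicate in
// the version sequence, which must run 1, 2, 3, ... without holes.
func checkMigrationNames(names []string) error {
	versions := make([]int, 0, len(names))
	for _, n := range names {
		m := migrationName.FindStringSubmatch(n)
		if m == nil {
			return fmt.Errorf("%q does not match NNNNN_snake_case.sql", n)
		}
		v, _ := strconv.Atoi(m[1]) // cannot fail: m[1] is five ASCII digits
		versions = append(versions, v)
	}
	sort.Ints(versions)
	for i, v := range versions {
		if v != i+1 {
			return fmt.Errorf("expected version %05d, found %05d", i+1, v)
		}
	}
	return nil
}
```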
Per-service code organisation
galaxy/<service>/
internal/
adapters/
postgres/
migrations/ # *.sql files + migrations.go (//go:embed)
jet/ # generated; commit-checked
<portname>/ # adapter implementations matching internal/ports
config/
config.go # adds Postgres + new Redis schema
Makefile # `jet` target: testcontainers + goose + jet
Test patterns
- Per-service unit tests run against a real PostgreSQL via testcontainers-go, replacing the corresponding miniredis test path wherever storage moved to PostgreSQL.
- Shared port-test suites (e.g. lobby/internal/ports/racenamedirtest/) gain a Postgres harness; they remain backend-agnostic in shape.
- integration/internal/harness/postgres_container.go is added; integration suites that need PostgreSQL declare it next to their existing Redis container.
- Stub adapters (*stub/) are kept where the in-memory port is useful for tests that don't need a real backend. Redis adapters that previously implemented these ports are removed (no dead code).
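In a backend-agnostic suite, the suite sees only the port interface and receives a constructor for a fresh implementation. The sketch below uses a hypothetical, heavily reduced MembershipStore port with an in-memory stub. The real ports and suites under internal/ports are richer.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotFound is what every backend returns for an absent key.
var ErrNotFound = errors.New("not found")

// MembershipStore is a hypothetical, heavily reduced port.
type MembershipStore interface {
	Put(id, userID string) error
	Get(id string) (string, error)
}

// stubStore is an in-memory adapter in the spirit of the *stub/ packages.
type stubStore struct {
	mu sync.Mutex
	m  map[string]string
}

func newStubStore() *stubStore { return &stubStore{m: map[string]string{}} }

func (s *stubStore) Put(id, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[id] = userID
	return nil
}

func (s *stubStore) Get(id string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.m[id]
	if !ok {
		return "", ErrNotFound
	}
	return v, nil
}

// runSuite checks only the port contract, so the same checks can run against
// the stub, a Postgres adapter, or any other implementation.
func runSuite(newStore func() MembershipStore) error {
	s := newStore()
	if _, err := s.Get("m1"); !errors.Is(err, ErrNotFound) {
		return fmt.Errorf("missing key: got %v, want ErrNotFound", err)
	}
	if err := s.Put("m1", "u1"); err != nil {
		return err
	}
	if got, err := s.Get("m1"); err != nil || got != "u1" {
		return fmt.Errorf("round trip: got %q, %v", got, err)
	}
	return nil
}
```

A Postgres harness would call runSuite with a constructor that returns an adapter bound to a freshly prepared schema in the testcontainers instance.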
Configuration env vars (target)
For each service <S> ∈ { USERSERVICE, MAIL, NOTIFICATION, LOBBY,
GATEWAY, AUTHSESSION }:
- <S>_REDIS_MASTER_ADDR (required)
- <S>_REDIS_REPLICA_ADDRS (optional, comma-separated; default empty)
- <S>_REDIS_PASSWORD (required)
- <S>_REDIS_DB (default 0)
- <S>_REDIS_OPERATION_TIMEOUT (default 250ms)
For PG-backed services (USERSERVICE, MAIL, NOTIFICATION, LOBBY):
- <S>_POSTGRES_PRIMARY_DSN (required; e.g. postgres://userservice:secret@postgres:5432/galaxy?search_path=user&sslmode=disable)
- <S>_POSTGRES_REPLICA_DSNS (optional, comma-separated)
- <S>_POSTGRES_OPERATION_TIMEOUT (default 1s)
- <S>_POSTGRES_MAX_OPEN_CONNS (default 25)
- <S>_POSTGRES_MAX_IDLE_CONNS (default 5)
- <S>_POSTGRES_CONN_MAX_LIFETIME (default 30m)
DSN sets search_path=<schema> so unqualified table references resolve into
the service-owned schema; sslmode=disable is set explicitly per the
"no TLS" requirement.
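Building the DSN with net/url rather than string concatenation keeps passwords that contain reserved characters (such as @ or :) correctly escaped. dsnForSchema is a hypothetical helper; called with the userservice role, it reproduces the example DSN above.

```go
package main

import "net/url"

// dsnForSchema builds a PostgreSQL URL whose search_path points at the
// service-owned schema and which disables TLS explicitly.
func dsnForSchema(host, database, role, password, schema string) string {
	q := url.Values{}
	q.Set("search_path", schema)
	q.Set("sslmode", "disable")
	u := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(role, password), // escapes @, :, / and ? in credentials
		Host:     host,
		Path:     "/" + database,
		RawQuery: q.Encode(), // keys are sorted: search_path, then sslmode
	}
	return u.String()
}
```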
Service-prefix-specific stream/keyspace env vars (*_REDIS_DOMAIN_EVENTS_STREAM,
*_REDIS_LIFECYCLE_EVENTS_STREAM, *_REDIS_KEYSPACE_PREFIX,
MAIL_REDIS_COMMAND_STREAM, etc.) keep their current names and semantics —
they describe stream/key shapes, not connection topology.
Stages
Each stage is independently executable and shippable.
Stage 0 — Architecture-wide rules and PG_PLAN.md materialisation
This stage is implemented.
Goal: land the steady-state rules in ARCHITECTURE.md and place
PG_PLAN.md at the project root so subsequent /stage-implementation
invocations have an authoritative reference.
Actions:
- Write the contents of this plan file to /Users/id/src/go/galaxy/PG_PLAN.md.
- Add a new section to ARCHITECTURE.md (e.g. §9 Persistence Backends) capturing every rule under the Architectural rules heading above: backend assignment, database/Redis topology, library stack, migration discipline, code organisation, test patterns, env-var conventions.
- Add a short Migration Window sub-section to ARCHITECTURE.md. It notes that until all PG_PLAN.md stages are complete, each service's README.md continues to describe its actual current state. This caveat is removed in Stage 9.
- Adjust ARCHITECTURE.md §8 (publisher rules) so cross-references distinguish "Redis Stream" (event bus, stays on Redis) from "PG-backed table" (durable record).
Files (modified / new):
- /Users/id/src/go/galaxy/PG_PLAN.md: new
- /Users/id/src/go/galaxy/ARCHITECTURE.md: modified
Out of scope: zero service code, zero per-service README/docs, zero
go.mod changes, zero new dependencies in service modules.
Verification:
- git diff --stat reports only two paths: PG_PLAN.md and ARCHITECTURE.md.
- ARCHITECTURE.md reads coherently end to end. The new section is cross-referenced from §8 and from any other place that today says "Redis is the v1 backend".
- Manual: read PG_PLAN.md top to bottom and confirm that every architectural decision matches the section in ARCHITECTURE.md.
Stage 1 — Shared infrastructure packages (pkg/postgres, pkg/redisconn)
This stage is implemented.
Goal: provide one canonical helper package each for PostgreSQL and Redis so per-service stages don't reinvent connection and migration wiring. No service consumes them yet.
Files (new):
- pkg/postgres/config.go: Config struct (PrimaryDSN, ReplicaDSNs, OperationTimeout, MaxOpenConns, MaxIdleConns, ConnMaxLifetime); helper LoadFromEnv(prefix string) (Config, error) that reads <prefix>_POSTGRES_*.
- pkg/postgres/open.go: OpenPrimary(ctx, cfg) (*sql.DB, error) and OpenReplicas(ctx, cfg) ([]*sql.DB, error) using pgx.ConnConfig → stdlib.OpenDB(...); configures pool sizes and a per-statement context timeout.
- pkg/postgres/migrate.go: RunMigrations(ctx context.Context, db *sql.DB, fs embed.FS) error, wrapping goose.SetBaseFS(fs) + goose.UpContext.
- pkg/postgres/otel.go: Instrument(db *sql.DB, telemetry telemetry.Runtime), applying otelsql.RegisterDBStatsMetrics and statement spans.
- pkg/postgres/postgres_test.go: testcontainers-backed smoke test that opens the primary, runs a one-line migration, and does an insert/select.
- pkg/redisconn/config.go: Config struct (MasterAddr, ReplicaAddrs, Password, DB, OperationTimeout); helper LoadFromEnv(prefix string) (Config, error) that reads <prefix>_REDIS_* (the new shape only; rejects the deprecated TLS/USERNAME vars with a clear error).
- pkg/redisconn/client.go: NewMasterClient(cfg) *redis.Client and NewReplicaClients(cfg) []*redis.Client (the latter returns nil/empty when no replicas are configured).
- pkg/redisconn/otel.go: Instrument(client *redis.Client, telemetry telemetry.Runtime), applying redisotel.InstrumentTracing/InstrumentMetrics.
- pkg/redisconn/redisconn_test.go: miniredis-backed config and master client tests.
Files (touched):
- pkg/go.mod: add github.com/jackc/pgx/v5 (which also provides the github.com/jackc/pgx/v5/stdlib package), github.com/pressly/goose/v3, github.com/testcontainers/testcontainers-go/modules/postgres, and github.com/XSAM/otelsql for database instrumentation (alternative: go.nhat.io/otelsql; pick one during implementation).
- go.work: confirm pkg/ is registered (it already is).
Verification:
- cd /Users/id/src/go/galaxy/pkg && go test ./postgres/... ./redisconn/... passes locally with Docker available.
- go vet ./... is clean.
Stage 2 — Integration test harness extension
This stage is implemented.
Goal: extend integration/internal/harness/ with a Postgres container
helper and a service-bootstrap helper that builds the per-service DSN with
the right search_path. All existing integration suites stay green.
Files (new):
- integration/internal/harness/postgres_container.go: StartPostgresContainer(t testing.TB) *PostgresRuntime. The runtime exposes BaseDSN(), DSNForSchema(schema, role string) string, and EnsureRoleAndSchema(ctx, schema, role, password string) error, so each test can prepare an isolated schema for the service it is booting.
- integration/internal/harness/postgres_container_test.go: smoke test.
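EnsureRoleAndSchema has to quote identifiers because user is a reserved word in PostgreSQL: CREATE SCHEMA user fails with a syntax error, while CREATE SCHEMA "user" succeeds. The sketch below only generates the statements. roleAndSchemaSQL is a hypothetical helper. Making the schema owned by the service role is an assumption; it lets the service apply its own migrations at startup.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes a PostgreSQL identifier, doubling embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// quoteLiteral single-quotes a string literal, doubling embedded quotes.
// This assumes standard_conforming_strings is on (the default since 9.1).
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// roleAndSchemaSQL returns statements to run as the container superuser:
// a login role with a password, and a schema owned by that role.
func roleAndSchemaSQL(schema, role, password string) []string {
	return []string{
		fmt.Sprintf("CREATE ROLE %s LOGIN PASSWORD %s", quoteIdent(role), quoteLiteral(password)),
		fmt.Sprintf("CREATE SCHEMA %s AUTHORIZATION %s", quoteIdent(schema), quoteIdent(role)),
	}
}
```

A real Ensure… helper would also have to tolerate a role or schema that already exists, for example by checking pg_roles and pg_namespace first.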
Files (touched):
- integration/internal/harness/binary.go: extend the Process/launch helpers with WithPostgres(rt *PostgresRuntime, schema, role string), which injects the right <S>_POSTGRES_PRIMARY_DSN. (The existing API already takes env map[string]string; this is a thin wrapper.)
- integration/go.mod: add the testcontainers Postgres module.
Out of scope: no integration suite is yet wired to Postgres; each service stage wires in its suites.
Verification:
- cd integration && go test ./internal/harness/... passes.
- cd integration && go test ./... is still green for all existing suites (Redis-only services remain Redis-only).
Stage 3 — User Service migration (pilot)
Goal: replace User Service's Redis durable storage with PostgreSQL. The
two Redis Streams (user:domain_events, user:lifecycle_events) remain on
Redis. This stage is the pilot; subsequent service stages copy its shape.
Schema (user schema):
- accounts(user_id PK, email UNIQUE, user_name UNIQUE, display_name, preferred_language, time_zone, declared_country, created_at, updated_at, deleted_at).
- blocked_emails(email PK, reason_code, blocked_at, actor_type, actor_id, resolved_user_id).
- entitlement_records(record_id PK, user_id FK, plan_code, is_paid, starts_at, ends_at, source, actor_type, actor_id, reason_code, updated_at).
- entitlement_snapshots(user_id PK FK → accounts, …current effective values mirroring the Redis snapshot shape).
- sanction_records(record_id PK, user_id FK, sanction_code, scope, reason_code, actor_type, actor_id, applied_at, expires_at, removed_at, removed_by_type, removed_by_id, removed_reason_code).
- sanction_active(user_id, sanction_code, record_id), PRIMARY KEY (user_id, sanction_code).
- limit_records, limit_active: analogous to sanctions.
- Indexes: accounts(created_at DESC, user_id DESC) for newest-first pagination; accounts(declared_country); entitlement_snapshots(plan_code, is_paid); entitlement_snapshots(ends_at) WHERE is_paid AND ends_at IS NOT NULL; sanction_active(sanction_code); limit_active(limit_code). Eligibility flags become computed predicates on these columns.
Files (new):
- galaxy/user/internal/adapters/postgres/migrations/00001_init.sql: full schema with grants (GRANT USAGE ON SCHEMA "user" TO userservice; GRANT … ON ALL TABLES …;). The schema name must be double-quoted in SQL because user is a reserved word in PostgreSQL.
- galaxy/user/internal/adapters/postgres/migrations/migrations.go: //go:embed *.sql and a Migrations() embed.FS accessor.
- galaxy/user/internal/adapters/postgres/jet/...: generated code (commit-checked).
- galaxy/user/internal/adapters/postgres/userstore/store.go: Postgres implementation of ports.UserAccountStore and ports.AuthDirectoryStore.
- galaxy/user/internal/adapters/postgres/userstore/entitlement_store.go: Postgres implementation of EntitlementSnapshotStore and EntitlementHistoryStore.
- galaxy/user/internal/adapters/postgres/userstore/policy_store.go: Postgres implementation of SanctionStore and LimitStore.
- galaxy/user/internal/adapters/postgres/userstore/list_store.go: Postgres implementation of UserListStore (pagination + filters expressed as SQL).
- galaxy/user/internal/adapters/postgres/userstore/store_test.go and siblings: testcontainers-backed unit tests covering the same matrix as the current Redis tests.
- galaxy/user/Makefile: jet target.
- galaxy/user/docs/postgres-migration.md: decision record (schema shape, why entitlement_snapshots stays denormalised, eligibility expressed as SQL predicates, schema role grants).
Files (touched):
- galaxy/user/internal/config/config.go: add Postgres config; refactor Redis config to master/replica/password (drop TLS_ENABLED, USERNAME).
- galaxy/user/internal/config/config_test.go: update to the new env shape.
- galaxy/user/internal/app/runtime.go: open the Postgres pool, run migrations on startup before listeners open, and wire the Postgres adapters into services. The Redis client now serves only the two stream publishers.
- galaxy/user/README.md: replace "Redis-backed user state" with the new persistence model; update the env-var section.
- galaxy/user/docs/runbook.md, galaxy/user/docs/runtime.md, galaxy/user/docs/examples.md: update storage references and config sections.
- galaxy/user/go.mod: add github.com/jackc/pgx/v5{,/stdlib}, github.com/pressly/goose/v3, github.com/go-jet/jet/v2, github.com/testcontainers/testcontainers-go/modules/postgres. Use pkg/postgres and pkg/redisconn.
Files (deleted):
- galaxy/user/internal/adapters/redis/userstore/: entire directory.
- The portions of galaxy/user/internal/adapters/redisstate/keyspace.go that defined account/entitlement/sanction/limit/index keys. Keep only what the domainevents and lifecycleevents publishers still require; if nothing remains, delete the file outright.
Files retained on Redis:
- galaxy/user/internal/adapters/redis/domainevents/publisher.go.
- galaxy/user/internal/adapters/redis/lifecycleevents/publisher.go.
Touched integration suites (each gets a Postgres container in addition to the existing Redis one):
- integration/authsessionuser/
- integration/gatewayauthsessionuser/
- integration/gatewayauthsessionusermail/
- integration/notificationuser/
- integration/lobbyuser/
Verification:
- cd galaxy/user && make jet && go test ./... (Docker needed).
- cd integration && go test ./authsessionuser/... ./gatewayauthsessionuser/... ./gatewayauthsessionusermail/... ./notificationuser/... ./lobbyuser/...
- Manual smoke test against a docker-compose stack (PostgreSQL + Redis with passwords) using the flows from galaxy/user/docs/examples.md.
Stage 4 — Mail Service migration
This stage is implemented.
Goal: move durable mail storage (deliveries, attempts, dead letters,
malformed commands, payloads, idempotency, attempt schedule) into
PostgreSQL. Keep Redis only for the inbound mail:delivery_commands
stream and its consumer offset.
Schema (mail schema):
- deliveries(delivery_id PK, source, status, recipient_envelope JSONB, subject, text_body, html_body, payload_mode, template_id, idempotency_source, idempotency_key, locale_fallback_used, next_attempt_at, attempt_count, max_attempts, created_at, updated_at).
  - INDEX (status, next_attempt_at) for the scheduler.
  - UNIQUE (idempotency_source, idempotency_key): the idempotency record IS this row (no separate kv).
  - INDEX (created_at DESC) for operator listings; indexes on status, source, template_id, and recipient as needed.
- attempts(delivery_id FK, attempt_no, status, provider_summary, scheduled_for_ms, started_at_ms, completed_at_ms, PRIMARY KEY (delivery_id, attempt_no)).
- dead_letters(delivery_id PK FK, final_attempt_count, max_attempts, failure_classification, failure_message, created_at_ms).
- delivery_payloads(delivery_id PK FK, template_variables JSONB).
- malformed_commands(stream_entry_id PK, failure_code, failure_message, raw_fields JSONB, recorded_at_ms; INDEX (recorded_at_ms)).
Files: mirror Stage 3 (postgres adapter package, migrations, jet
codegen, Makefile, decision record, removal of corresponding
internal/adapters/redisstate/* files for migrated entities, retention
of stream offset and consumer wiring on Redis).
Worker change: the mail attempt scheduler loop replaces
ZRANGEBYSCORE over mail:attempt_schedule with
SELECT … FROM deliveries WHERE status IN ('queued','retry_pending') AND next_attempt_at <= now() ORDER BY next_attempt_at LIMIT N FOR UPDATE SKIP LOCKED.
Files (deleted):
- galaxy/mail/internal/adapters/redisstate/auth_acceptance_store.go
- galaxy/mail/internal/adapters/redisstate/generic_acceptance_store.go
- galaxy/mail/internal/adapters/redisstate/attempt_execution_store.go
- galaxy/mail/internal/adapters/redisstate/operator_store.go
- galaxy/mail/internal/adapters/redisstate/malformed_command_store.go
- galaxy/mail/internal/adapters/redisstate/render_store.go
- The portions of galaxy/mail/internal/adapters/redisstate/keyspace.go that are no longer used (mail:attempt_schedule, mail:idempotency:*, all delivery/attempt/dead-letter/index keys).
Files retained on Redis:
- galaxy/mail/internal/adapters/redisstate/stream_offset_store.go (offset for the mail:delivery_commands consumer).
- The command stream consumer wiring itself.
Touched integration suites:
- integration/authsessionmail/
- integration/gatewayauthsessionmail/
- integration/gatewayauthsessionusermail/
- integration/notificationmail/
Verification: follows the Stage 3 pattern, plus an end-to-end smoke test that pushes a delivery through retry_pending → provider_accepted using the SMTP stub.
Stage 5 — Notification Service migration
This stage is implemented.
Goal: move durable notification storage (records, routes, idempotency,
dead letters, malformed intents) into PostgreSQL. Keep Redis for the
inbound notification:intents stream, the outbound gateway:client-events
stream, the outbound mail:delivery_commands stream, the corresponding
stream offsets, and the short-lived per-route lease (route_leases:*).
Schema (notification schema):
- records(notification_id PK, notification_type, producer, audience_kind, recipient_user_ids JSONB, payload JSONB, idempotency_key, request_fingerprint, request_id, trace_id, occurred_at_ms, accepted_at_ms, updated_at_ms).
  - UNIQUE (producer, idempotency_key): the idempotency record IS this row.
- routes(notification_id, route_id, channel, recipient_ref, status, attempt_count, max_attempts, next_attempt_at_ms, resolved_email, resolved_locale, last_error_classification, last_error_message, last_error_at_ms, created_at_ms, updated_at_ms, published_at_ms, dead_lettered_at_ms, skipped_at_ms, PRIMARY KEY (notification_id, route_id)).
  - INDEX (status, next_attempt_at_ms) for the scheduler.
- dead_letters(notification_id, route_id, channel, recipient_ref, final_attempt_count, max_attempts, failure_classification, failure_message, recovery_hint, created_at_ms). (notification_id, route_id) is both the PRIMARY KEY and a FOREIGN KEY → routes.
- malformed_intents(stream_entry_id PK, notification_type, producer, idempotency_key, failure_code, failure_message, raw_fields JSONB, recorded_at_ms).
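The schema stores request_fingerprint next to the UNIQUE (producer, idempotency_key) constraint. The plan does not say how the fingerprint is computed or compared. The following sketches one possible approach with hypothetical names. It hashes the semantic request fields in a fixed order. When an insert then hits the unique constraint, it compares the stored and incoming fingerprints.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// requestFingerprint hashes the semantic fields of a request in sorted key
// order, so equal requests always produce the same fingerprint. Each key and
// value is length-prefixed so ("ab","c") and ("a","bc") hash differently.
func requestFingerprint(fields map[string]string) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		for _, part := range []string{k, fields[k]} {
			n := len(part)
			h.Write([]byte{byte(n >> 24), byte(n >> 16), byte(n >> 8), byte(n)})
			h.Write([]byte(part))
		}
	}
	return hex.EncodeToString(h.Sum(nil))
}

type replayOutcome int

const (
	acceptNew replayOutcome = iota // insert succeeded: first time this key is seen
	replay                         // same key, same fingerprint: return the stored result
	conflict                       // same key, different request: reject
)

// classify decides the outcome after inserting into records. found reports
// whether the insert hit UNIQUE (producer, idempotency_key).
func classify(found bool, storedFP, incomingFP string) replayOutcome {
	switch {
	case !found:
		return acceptNew
	case storedFP == incomingFP:
		return replay
	default:
		return conflict
	}
}
```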
Worker change: route publisher selects work via the same
FOR UPDATE SKIP LOCKED pattern as Mail. The Redis lease is still used
as a short-lived, per-process exclusivity hint atop the SQL claim.
Files (deleted):
- galaxy/notification/internal/adapters/redisstate/acceptance_store.go
- galaxy/notification/internal/adapters/redisstate/route_state_store.go
- galaxy/notification/internal/adapters/redisstate/malformed_intent_store.go
- The portions of galaxy/notification/internal/adapters/redisstate/keyspace.go that are no longer used (records, routes, idempotency, dead_letters, malformed_intents).
Files retained on Redis:
- galaxy/notification/internal/adapters/redisstate/stream_offset_store.go.
- Route lease key generator (still under redisstate/, narrowed to leases only).
- All stream consumer/publisher wiring.
Touched integration suites:
- integration/notificationgateway/
- integration/notificationmail/
- integration/notificationuser/
Stage 6A — Lobby Service: core enrollment entities
Goal: move Game, Application, Invite, Membership records and
their indexes into PostgreSQL. RaceNameDirectory, GameTurnStats,
GapActivation, EvaluationGuard, StreamOffset remain on Redis until later
sub-stages.
Schema (lobby schema, partial):
- games(game_id PK, owner_id, kind ('public'|'private'), status, created_at, updated_at, runtime_snapshot JSONB, runtime_binding JSONB, …other denormalised game settings).
  - INDEX (status, created_at).
  - INDEX (owner_id) WHERE kind = 'private'.
- applications(application_id PK, game_id FK, user_id, status, canonical_key, submitted_at, decided_at).
  - PARTIAL UNIQUE INDEX (user_id, game_id) WHERE status = 'active': enforces the single-active constraint at the DB level (replaces lobby:user_game_application:*:*).
  - INDEX (game_id), INDEX (user_id).
- invites(invite_id PK, game_id FK, inviter_id, invitee_id, race_name, status, created_at, expires_at, decided_at).
  - INDEX (game_id), INDEX (invitee_id), INDEX (inviter_id).
  - INDEX (status, expires_at) for an expiration scanner, if needed.
- memberships(membership_id PK, game_id FK, user_id, status, joined_at, canonical_key, …).
  - INDEX (game_id), INDEX (user_id).
Files (new):
- galaxy/lobby/internal/adapters/postgres/migrations/00001_core_entities.sql.
- galaxy/lobby/internal/adapters/postgres/migrations/migrations.go.
- galaxy/lobby/internal/adapters/postgres/jet/...
- galaxy/lobby/internal/adapters/postgres/gamestore/store.go.
- galaxy/lobby/internal/adapters/postgres/applicationstore/store.go.
- galaxy/lobby/internal/adapters/postgres/invitestore/store.go.
- galaxy/lobby/internal/adapters/postgres/membershipstore/store.go.
- Test files for each store using the existing test patterns.
- galaxy/lobby/Makefile (jet target).
- galaxy/lobby/docs/postgres-migration.md (decision record covering this sub-stage and what is intentionally left for 6B/6C).
Files (touched):
- galaxy/lobby/internal/config/config.go: add Postgres config; refactor Redis config to the new shape.
- galaxy/lobby/internal/app/runtime.go: open the Postgres pool, run migrations on startup, and wire the core PG-backed stores into services. RaceNameDirectory and the stats/guard stores stay wired to Redis until 6B/6C.
- galaxy/lobby/README.md and galaxy/lobby/docs/runbook.md: updated to describe core entities on PostgreSQL, with RND/stats still on Redis until 6B/6C.
Files (deleted):
- galaxy/lobby/internal/adapters/redisstate/gamestore.go, applicationstore.go, invitestore.go, membershipstore.go.
- The corresponding sections of redisstate/keyspace.go.
Stub adapters retained: gamestub/, applicationstub/, invitestub/,
membershipstub/ stay — they are pure in-memory ports useful for tests
that don't need real PG.
Touched integration suites:
- integration/lobbyuser/
- integration/lobbynotification/
Verification: per Stage 3 pattern; plus the existing lobby HTTP contract tests against the public/internal ports.
Stage 6B — Lobby Service: RaceNameDirectory
This stage is implemented.
Goal: replace the Lua-backed Redis RaceNameDirectory with a PG
implementation that preserves the two-tier model (registered / reservation /
pending_registration) and atomic registration semantics via SQL
transactions and (where required) advisory locks.
Schema (additions to lobby schema):
- race_names(canonical_key PK, holder_user_id, binding_kind ('registered' | 'reserved' | 'pending_registration'), source_game_id, eligible_until_ms, registered_at_ms, reserved_at_ms).
  - INDEX (holder_user_id) for the ListRegistered/ListReservations/ListPendingRegistrations queries.
  - PARTIAL INDEX (eligible_until_ms) WHERE binding_kind = 'pending_registration' for the expiration scanner.
  - The confusable-pair policy is enforced at write time inside BEGIN … COMMIT transactions. Reserve, Register, and MarkPendingRegistration serialise concurrent attempts with SELECT … FOR UPDATE on the canonical keys involved, or with PostgreSQL advisory locks keyed by hashtext(canonical_key).
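Serialising confusable pairs with SELECT … FOR UPDATE is deadlock-prone if two transactions lock the same keys in different orders. Acquiring locks in one global order avoids that class of deadlock. Below is a minimal sketch; lockOrder is a hypothetical helper.

```go
package main

import "sort"

// lockOrder returns the distinct canonical keys in ascending order. If every
// transaction locks its keys in this order, two transactions touching
// overlapping confusable pairs cannot each hold a lock the other is waiting for.
func lockOrder(keys ...string) []string {
	seen := make(map[string]struct{}, len(keys))
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		if _, dup := seen[k]; dup {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}
```

Two caveats apply. First, FOR UPDATE locks only rows that already exist, so it cannot serialise two attempts on a name that has no row yet; advisory locks do not have that limitation. Second, with advisory locks keyed by hashtext(canonical_key), sort by the resulting lock id rather than by the key, because two keys can hash to the same id.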
Files (new):
- galaxy/lobby/internal/adapters/postgres/migrations/00002_race_names.sql.
- galaxy/lobby/internal/adapters/postgres/racenamedir/directory.go: Postgres implementation of ports.RaceNameDirectory.
- galaxy/lobby/internal/adapters/postgres/racenamedir/directory_test.go: runs the existing shared suite at galaxy/lobby/internal/ports/racenamedirtest/suite.go.
Files (touched):
- galaxy/lobby/internal/app/runtime.go: wire the PG RND.
- galaxy/lobby/internal/ports/racenamedirtest/suite.go: only shape-preserving updates, if the suite assumed Redis-only behaviour (e.g. SCAN-based list ordering).
- galaxy/lobby/README.md, galaxy/lobby/docs/runbook.md: RND is now PG-backed. The canonical_lookup cache is no longer needed because the indexed PostgreSQL lookup is fast enough, so its Redis key is removed from redisstate/keyspace.go.
Files (deleted):
- galaxy/lobby/internal/adapters/redisstate/racenamedir.go and the embedded Lua scripts.
Retained: galaxy/lobby/internal/adapters/racenamestub/ stays (useful for unit tests that don't need PostgreSQL).
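Worker change: the pending-registration expiration worker switches from ZRANGEBYSCORE on lobby:race_names:pending_index to SELECT … FROM race_names WHERE binding_kind = 'pending_registration' AND eligible_until_ms <= $1. Here $1 is the current time in Unix milliseconds; eligible_until_ms is an integer column, so comparing it with now(), a timestamptz, would fail with a type error.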
Verification: shared port suite (racenamedirtest) green against PG
adapter; lobby unit tests green; integration/lobbyuser/,
integration/lobbynotification/ green.
Stage 6C — Lobby Service: workers, ephemeral stores, cleanup
This stage is implemented.
Goal: finish the lobby migration. Confirm what stays Redis-only, update workers that touch both backends, drop dead Redis adapters.
Stays on Redis (per architectural rules):
- GameTurnStatsStore: ephemeral per-game aggregate, deleted at game finish, rebuildable from GM events.
- EvaluationGuardStore: ephemeral marker.
- GapActivationStore: short-lived gap-window timestamp cache.
- StreamOffsetStore: runtime coordination, per the architectural rule.
- All stream consumers and publishers (gm:lobby_events, runtime:job_results, user:lifecycle_events, notification:intents).
This is documented in galaxy/lobby/docs/postgres-migration.md.
Files (touched):
- galaxy/lobby/internal/worker/gmevents/consumer.go: write durable updates via the PG-backed GameStore.
- galaxy/lobby/internal/worker/runtimejobresult/consumer.go: same.
- galaxy/lobby/internal/adapters/userlifecycle/consumer.go (and the worker that drives it): RND release and the membership/application/invite cascade all flow through PostgreSQL.
- galaxy/lobby/internal/worker/pendingregistration/worker.go: PG-based scan, no Redis ZSET.
- galaxy/lobby/internal/worker/enrollmentautomation/worker.go: uses the PG GameStore.GetByStatus("enrollment_open").
- galaxy/lobby/internal/adapters/redisstate/keyspace.go: pruned to the remaining Redis keys (turn stats, gap activation, evaluation guard, stream offsets, lifecycle stream consumer state).
- galaxy/lobby/README.md, galaxy/lobby/docs/runtime.md, galaxy/lobby/docs/runbook.md, galaxy/lobby/docs/examples.md: finalised storage descriptions.
Files (deleted):
- Anything left in
galaxy/lobby/internal/adapters/redisstate/whose only consumer was a port now PG-backed (see 6A/6B deletions).
Verification:
- All previously-green lobby unit tests pass with PG-backed adapters.
- integration/lobbyuser/ and integration/lobbynotification/ pass.
- grep -rn "redisstate" galaxy/lobby/internal/ returns only the keys intentionally retained on Redis.
Stage 7 — Gateway and Auth/Session: Redis configuration refactor
This stage is implemented.
Goal: apply the new Redis configuration shape (master/replica/password, drop TLS/USERNAME) to Gateway and Auth/Session. No PG migration; these services intentionally stay Redis-only.
Files (touched):
- galaxy/gateway/internal/config/config.go: switch the RedisConfig fields to the pkg/redisconn.Config shape and update the three prefixes GATEWAY_SESSION_CACHE_REDIS_*, GATEWAY_REPLAY_REDIS_*, and GATEWAY_SESSION_EVENTS_REDIS_*. Drop TLS_ENABLED and USERNAME.
- galaxy/gateway/internal/session/redis.go, galaxy/gateway/internal/replay/redis.go, galaxy/gateway/internal/events/subscriber.go: adopt the new client constructor via pkg/redisconn.
- galaxy/gateway/internal/config/config_test.go, galaxy/gateway/internal/session/redis_test.go, galaxy/gateway/internal/replay/redis_test.go: updated to the new env shape.
- galaxy/authsession/internal/config/config.go: same pattern; drop TLS and USERNAME.
- galaxy/authsession/internal/adapters/redis/sessionstore/store.go, challengestore/store.go, projectionpublisher/publisher.go, sendemailcodeabuse/protector.go, configprovider/store.go: adopt the new client.
- galaxy/authsession/internal/config/config_test.go: updated.
- galaxy/gateway/README.md, galaxy/authsession/README.md, galaxy/gateway/docs/runbook.md, galaxy/authsession/docs/runbook.md: note that Redis-only is intentional and reference the ARCHITECTURE.md rule on TTL-bounded auth state.
No deletions of business logic; only env-var refactor and adapter
plumbing through pkg/redisconn.
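To illustrate the new configuration shape, here is a sketch of a per-prefix loader that reads master/replica/password settings and fails fast when a retired variable is still present. It is a hypothetical stand-in for pkg/redisconn.LoadFromEnv, whose real signature this plan does not show; the field names and the MASTER_ADDR, REPLICA_ADDRS and PASSWORD suffixes are assumptions. Only the retired TLS_ENABLED and USERNAME suffixes and the GATEWAY_* prefix style come from the plan.

```go
package main

import (
	"fmt"
	"strings"
)

// Config mirrors the master/replica/password shape the plan describes for
// pkg/redisconn.Config; the field names here are assumptions.
type Config struct {
	MasterAddr   string
	ReplicaAddrs []string // exposed for future use; no read routing yet
	Password     string
}

// retiredSuffixes lists the variables this migration removes in one cut.
var retiredSuffixes = []string{"REDIS_TLS_ENABLED", "REDIS_USERNAME"}

// loadFromEnv is a hypothetical stand-in for pkg/redisconn.LoadFromEnv.
// lookup has the same signature as os.LookupEnv. The MASTER_ADDR,
// REPLICA_ADDRS and PASSWORD suffixes are illustrative names.
func loadFromEnv(prefix string, lookup func(string) (string, bool)) (Config, error) {
	// Fail fast on retired variables instead of silently ignoring them.
	for _, suffix := range retiredSuffixes {
		name := prefix + "_" + suffix
		if _, set := lookup(name); set {
			return Config{}, fmt.Errorf("%s is no longer supported; remove it from the environment", name)
		}
	}

	var cfg Config
	master, ok := lookup(prefix + "_REDIS_MASTER_ADDR")
	if !ok || master == "" {
		return Config{}, fmt.Errorf("%s_REDIS_MASTER_ADDR is required", prefix)
	}
	cfg.MasterAddr = master

	// Replicas are an optional comma-separated list.
	if raw, ok := lookup(prefix + "_REDIS_REPLICA_ADDRS"); ok {
		for _, addr := range strings.Split(raw, ",") {
			if addr = strings.TrimSpace(addr); addr != "" {
				cfg.ReplicaAddrs = append(cfg.ReplicaAddrs, addr)
			}
		}
	}
	cfg.Password, _ = lookup(prefix + "_REDIS_PASSWORD")
	return cfg, nil
}

func main() {
	env := map[string]string{
		"GATEWAY_REPLAY_REDIS_MASTER_ADDR":   "redis-master:6379",
		"GATEWAY_REPLAY_REDIS_REPLICA_ADDRS": "redis-r1:6379, redis-r2:6379",
		"GATEWAY_REPLAY_REDIS_PASSWORD":      "secret",
		"GATEWAY_REPLAY_REDIS_TLS_ENABLED":   "false",
	}
	lookup := func(k string) (string, bool) { v, ok := env[k]; return v, ok }

	// The retired variable makes startup fail with a readable error.
	if _, err := loadFromEnv("GATEWAY_REPLAY", lookup); err != nil {
		fmt.Println(err)
	}
	delete(env, "GATEWAY_REPLAY_REDIS_TLS_ENABLED")
	cfg, err := loadFromEnv("GATEWAY_REPLAY", lookup)
	fmt.Println(cfg.MasterAddr, cfg.ReplicaAddrs, err)
}
```

In a service, the lookup argument would simply be os.LookupEnv. Checking "is set" rather than "is true" means even TLS_ENABLED=false is rejected, which matches the no-shim rule for retired variables.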
Touched integration suites:
- integration/gatewayauthsession/
- integration/authsession/
- Every suite that boots gateway or authsession picks up the new env vars via the harness; confirm none still pass *_REDIS_TLS_ENABLED.
Verification:
- cd galaxy/gateway && go test ./...
- cd galaxy/authsession && go test ./...
- cd integration && go test ./gatewayauthsession/... ./authsession/...
Stage 8 — GeoProfile: documentation only
Goal: ensure the GeoProfile plan and README reflect the new persistence rules so its future implementation follows them. No code exists yet.
Files (touched):
- galaxy/geoprofile/PLAN.md — add a stage referencing pkg/postgres and pkg/redisconn; specify that observed-country aggregates, declared_country history and review records will live in a geoprofile schema, while ephemeral per-session signals (if any) stay on Redis.
- galaxy/geoprofile/README.md — note ownership of the geoprofile schema and the stack choices.
No code change.
Stage 9 — Final sweep
Goal: confirm no dead Redis adapter code, no orphaned stub, no
broken doc reference. Remove the Migration Window caveat from
ARCHITECTURE.md once all stages are done.
Activities:
- Walk every PG-backed service: run grep -rn "redis" galaxy/<svc>/internal/adapters/ and verify that every match belongs to a still-active stream, cache, or runtime use case.
- Walk the integration suites: confirm each one provisions only the containers it actually needs, with no stale env vars.
- Update ARCHITECTURE.md to drop the Migration Window sub-section.
- Collapse each service's sequence of migration .sql files into a single initial migration. Rewrite the SQL rather than concatenating the files: the project is still in development, so every schema change can go directly into the first (and only) migration of each service. Record this rule in ARCHITECTURE.md as well.
- Run go test ./... once in every module, plus cd integration && go test ./....
Verification:
- All tests pass in every module.
- No file matches // TODO.*postgres or // TODO.*migrate.
- git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)' -- galaxy/ returns nothing (these env vars are fully retired).
Verification strategy (whole project)
After each stage:
- cd /Users/id/src/go/galaxy/pkg && go test ./...
- cd /Users/id/src/go/galaxy/<changed_service> && go test ./... (with Docker available for testcontainers).
- cd /Users/id/src/go/galaxy/integration && go test ./<affected_suites>/...
- Manual smoke test against a docker-compose stack (PG + Redis, both with passwords) using the example flows in each service's docs/examples.md.
After Stage 9:
- cd /Users/id/src/go/galaxy/integration && go test ./... end to end against real PG + real Redis.
- Confirm git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)' returns nothing under galaxy/.
- Confirm git grep -nE 'TODO.*(postgres|migrate)' returns nothing.
Out of scope
- galaxy/game — explicitly excluded by the project owner.
- Production deployment manifests (Helm/k8s) — local docker-compose is enough for development.
- Backup/restore tooling configuration — pg_dump and WAL archiving are available out of the box; operational setup is not part of this plan.
- Sentinel/Cluster Redis topology code paths — config exposes replica addresses for future use; no failover routing is implemented yet.
- Read-traffic routing to PG replicas — config exposes *_POSTGRES_REPLICA_DSNS for future use; no routing is implemented yet.
- golangci-lint config addition — not part of this migration.
- CI pipeline — no .github/workflows/ exists; not added by this plan.
Risks and notes
- go-jet codegen requires a live database. The make jet target in each service uses testcontainers-go to bring up a transient PG, applies the same goose migrations the service applies at startup, then runs jet -dsn=… -path=internal/adapters/postgres/jet. Generated code is committed, so consumers don't need Docker just to build.
- Schema-per-service vs single-DB cross-service joins: there are no cross-schema joins in this plan. Each service reads only its own schema; cross-service data flows go via Redis Streams (event bus) or HTTP contracts (User Service is queried by Lobby for eligibility), same as today. DB-level role grants enforce this.
- Pending registration expiration worker: under Redis it scanned a global ZSET; under PG it does an indexed scan. The partial index on eligible_until_ms WHERE binding_kind='pending_registration' keeps the scan cheap.
- Idempotency under crash: with idempotency expressed as a UNIQUE constraint on the durable record, recovery is "the row either exists or it doesn't", with no Redis-loss window where duplicates can sneak through.
- lib/pq vs pgx (revisit): confirmed pgx/v5 + jet via the stdlib adapter. The make jet target will pass -source=postgres to jet (the dialect is independent of which Go driver runs the queries at runtime).
- No backward-compat shim for env vars: *_REDIS_TLS_ENABLED and *_REDIS_USERNAME are retired in one cut. Any external dev environment that still sets them will fail fast at startup with a clear error emitted by pkg/redisconn.LoadFromEnv.