
PostgreSQL Migration Plan

This plan has already been implemented and remains here for historical reasons.

It should NOT be treated as a source of truth for service functionality.

Context

The Galaxy Game project currently uses Redis as the only persistence backend across all implemented services (user, mail, notification, lobby, gateway, authsession). Redis serves both kinds of state: ephemeral and runtime-coordination state (where it shines — Streams, caches, replay keys, runtime queues, session caches, leases) and table-shaped business state where it is a poor fit (durable user accounts, entitlements/sanctions, mail audit records, notification routes/idempotency, lobby memberships and invites). Replication and standby for Redis are not configured anywhere. There is no SQL/migration tooling in the repo at all.

We migrate to a Redis + PostgreSQL split where each backend owns the data it serves best. PostgreSQL becomes the source of truth for table-shaped business state, gives us ACID transactions, mature physical/logical replication, and backup/restore via pg_dump and WAL archiving. Redis remains the source of truth for streams, pub/sub, caches, leases, replay keys, rate limits, session caches, runtime queues, and stream consumer offsets.

The plan migrates only services already implemented and explicitly excludes galaxy/game. It targets steady-state architecture rules first (one authoritative document, ARCHITECTURE.md), then walks each service end to end — code, tests, service-local README/docs, and integration suites — so that no intermediate commit leaves docs and code in conflict.

Confirmed decisions (with project owner)

  1. Documentation strategy: ARCHITECTURE.md is updated as the very first stage with the architecture-wide rules. Each per-service README and per-service docs/ change inside that service's own stage, paired with code and tests. This keeps ARCHITECTURE.md ≡ policy, README ≡ current state, and ensures any commit can be checked out without code/doc divergence.
  2. Service scope: full migration of durable storage to PostgreSQL for user, mail, notification, lobby. Only Redis configuration refactor (master/replica + mandatory password, drop TLS_ENABLED / USERNAME) for gateway and authsession — these services intentionally stay Redis-only. geoprofile has no implementation; its PLAN.md and README.md absorb the new persistence rules so future implementation follows them.
  3. Idempotency and retry-schedule placement: idempotency records and retry schedule queues live in PostgreSQL on the same table as the durable record they protect ((producer, idempotency_key) UNIQUE on records, next_attempt_at column on deliveries / routes). One source of truth, no dual-write hazard between PG and Redis ZSETs.
  4. Stack: github.com/jackc/pgx/v5 driver, exposed as *sql.DB via github.com/jackc/pgx/v5/stdlib. github.com/go-jet/jet/v2 for type-safe query building + code generation, generated against a testcontainers PostgreSQL instance with migrations applied (Makefile target per service). github.com/pressly/goose/v3 library API for embedded migrations applied at service startup; the goose CLI may be used for local development and rollback investigations but is not in the service binary path.
  5. Code: all Postgres queries must use the pre-generated jet code and the appropriate query builders rather than raw SQL, unless a business scenario cannot be expressed that way due to missing go-jet functionality.
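Decision 3 can be sketched as a single idempotent insert. The column list below is abbreviated and illustrative (the full records shape is defined in the Notification stage); the point is that the UNIQUE constraint makes the durable row itself the idempotency record:

```sql
-- Illustrative sketch: the idempotency record IS the durable row.
-- A duplicate (producer, idempotency_key) pair is silently skipped,
-- so producer retries are safe without a separate Redis key.
INSERT INTO records (notification_id, producer, idempotency_key, payload)
VALUES ($1, $2, $3, $4)
ON CONFLICT (producer, idempotency_key) DO NOTHING
RETURNING notification_id;  -- empty result => duplicate; read the winning row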

Architectural rules (target steady-state)

These rules land in ARCHITECTURE.md in Stage 0 and govern every subsequent service stage.

Backend assignment

PostgreSQL is the source of truth for:

  • Domain entities with table-shaped business state (accounts, entitlement_records, sanction_records, limit_records, blocked_emails, deliveries, attempts, dead_letters, malformed_commands, notification_records, notification_routes, games, applications, invites, memberships, race_names).
  • Idempotency records (UNIQUE constraint on the durable table, not a separate kv).
  • Retry scheduling state (next_attempt_at column + supporting index on the durable table).
  • Audit history records that must outlive any Redis snapshot.

Redis is the source of truth for:

  • Redis Streams used as the event bus (user:domain_events, user:lifecycle_events, gm:lobby_events, runtime:job_results, notification:intents, gateway:client-events, mail:delivery_commands).
  • Stream consumer offsets (small runtime coordination state, rebuildable).
  • Caches and projections (gateway session cache).
  • Replay reservation keys.
  • Rate limit counters.
  • Runtime coordination locks/leases (e.g. notification route_leases).
  • Authentication challenge state and active session tokens (TTL-bounded; loss is recoverable by re-authentication).
  • Ephemeral per-game runtime aggregates that are deleted at game finish (lobby game_turn_stats, gap_activated_at, capability evaluation marker).

Database topology

  • Single PostgreSQL database galaxy.
  • Schema-per-service: user, mail, notification, lobby. Reserved for later: geoprofile. Not allocated unless needed: gateway, authsession.
  • Per-service PostgreSQL role with grants restricted to its own schema (defense-in-depth, simple to express in the initial migration).
  • Authentication: username + password only. sslmode=disable. No client certificates, no SCRAM channel binding, no custom auth plugins.
  • Each service connects to one primary plus zero-or-more read-only replicas. In this iteration only the primary is used; the replica pool is wired but receives no traffic. Future read-routing is non-breaking.

Redis topology

  • Each service connects to one master Redis plus zero-or-more replica Redis hosts.
  • All connections use a mandatory password. USERNAME/ACL not used. TLS off.
  • In this iteration only the master is used; the replica list is wired but unused — non-breaking switch later when the app starts routing reads.
  • Existing env vars *_REDIS_TLS_ENABLED, *_REDIS_USERNAME are removed (hard rename; no backward-compat shim — fresh project, no production deploys to migrate).

Library stack

  • Driver: github.com/jackc/pgx/v5 (modern, actively maintained), exposed to database/sql via github.com/jackc/pgx/v5/stdlib so go-jet's qrm.Queryable interface is satisfied without changes.
  • Query layer: github.com/go-jet/jet/v2 (PostgreSQL dialect). Generated code lives under each service internal/adapters/postgres/jet/, regenerated via a make jet target and committed to the repo.
  • Migrations: github.com/pressly/goose/v3 library API; migration files embedded via //go:embed *.sql; applied at startup, before opening any HTTP/gRPC listener; non-zero exit on failure.
  • Test infrastructure: github.com/testcontainers/testcontainers-go plus the modules/postgres submodule; the same setup is reused by make jet to host a transient instance for jet codegen.

Migration discipline

  • Forward-only sequence-numbered files: 00001_init.sql, 00002_*.sql, …
  • Lowercase snake_case names; goose -- +goose Up / -- +goose Down markers; statements that need transaction-wrapping use -- +goose StatementBegin / -- +goose StatementEnd.
  • Migrations apply at service startup; service exits non-zero on failure.
  • Per-service decision record at galaxy/<service>/docs/postgres-migration.md captures schema decisions and any non-trivial deviation from the rules.
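A minimal migration in this shape (table, column, and function names are illustrative only; real schemas are defined per service) would look like:

```sql
-- 00001_init.sql — illustrative shape only
-- +goose Up
CREATE TABLE accounts (
    user_id    text PRIMARY KEY,
    email      text NOT NULL UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Statements containing embedded semicolons need explicit markers.
-- +goose StatementBegin
CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- +goose StatementEnd

-- +goose Down
DROP FUNCTION touch_updated_at();
DROP TABLE accounts;
```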

Per-service code organisation

galaxy/<service>/
  internal/
    adapters/
      postgres/
        migrations/         # *.sql files + migrations.go (//go:embed)
        jet/                # generated; commit-checked
        <portname>/         # adapter implementations matching internal/ports
    config/
      config.go             # adds Postgres + new Redis schema
  Makefile                  # `jet` target: testcontainers + goose + jet

Test patterns

  • Per-service unit tests against a real PostgreSQL via testcontainers-go; replace the corresponding miniredis test path where storage moved to PG.
  • Shared port-test suites (e.g. lobby/internal/ports/racenamedirtest/) gain a Postgres harness; they remain backend-agnostic in shape.
  • integration/internal/harness/postgres_container.go is added; integration suites that need PG declare it next to their existing Redis container.
  • Stub adapters (*stub/) are kept where the in-memory port is useful for tests that don't need a real backend. Redis adapters that previously implemented these ports are removed (no dead code).

Configuration env vars (target)

For each service <S> ∈ { USERSERVICE, MAIL, NOTIFICATION, LOBBY, GATEWAY, AUTHSESSION }:

  • <S>_REDIS_MASTER_ADDR (required)
  • <S>_REDIS_REPLICA_ADDRS (optional, comma-separated; default empty)
  • <S>_REDIS_PASSWORD (required)
  • <S>_REDIS_DB (default 0)
  • <S>_REDIS_OPERATION_TIMEOUT (default 250ms)

For PG-backed services (USERSERVICE, MAIL, NOTIFICATION, LOBBY):

  • <S>_POSTGRES_PRIMARY_DSN (required; e.g. postgres://userservice:secret@postgres:5432/galaxy?search_path=user&sslmode=disable)
  • <S>_POSTGRES_REPLICA_DSNS (optional, comma-separated)
  • <S>_POSTGRES_OPERATION_TIMEOUT (default 1s)
  • <S>_POSTGRES_MAX_OPEN_CONNS (default 25)
  • <S>_POSTGRES_MAX_IDLE_CONNS (default 5)
  • <S>_POSTGRES_CONN_MAX_LIFETIME (default 30m)

DSN sets search_path=<schema> so unqualified table references resolve into the service-owned schema; sslmode=disable is set explicitly per the "no TLS" requirement.

Service-prefix-specific stream/keyspace env vars (*_REDIS_DOMAIN_EVENTS_STREAM, *_REDIS_LIFECYCLE_EVENTS_STREAM, *_REDIS_KEYSPACE_PREFIX, MAIL_REDIS_COMMAND_STREAM, etc.) keep their current names and semantics — they describe stream/key shapes, not connection topology.


Stages

Each stage is independently executable and shippable.

Stage 0 — Architecture-wide rules and PG_PLAN.md materialisation

This stage is implemented.

Goal: land the steady-state rules in ARCHITECTURE.md and place PG_PLAN.md at the project root so subsequent /stage-implementation invocations have an authoritative reference.

Actions:

  1. Write the contents of this plan file to /Users/id/src/go/galaxy/PG_PLAN.md.
  2. Add a new section to ARCHITECTURE.md (e.g. §9 Persistence Backends) capturing every rule under the Architectural rules heading above: backend assignment, database/Redis topology, library stack, migration discipline, code organisation, test patterns, env-var conventions.
  3. Add a short Migration Window sub-section to ARCHITECTURE.md noting that until all PG_PLAN.md stages complete, each service's README.md continues to describe its actual current state — this caveat is removed in Stage 9.
  4. Adjust ARCHITECTURE.md §8 (publisher rules) so cross-references distinguish "Redis Stream" (event bus, stays Redis) from "PG-backed table" (durable record).

Files (modified / new):

  • /Users/id/src/go/galaxy/PG_PLAN.md — new
  • /Users/id/src/go/galaxy/ARCHITECTURE.md — modified

Out of scope: zero service code, zero per-service README/docs, zero go.mod changes, zero new dependencies in service modules.

Verification:

  • git diff --stat reports two paths only: PG_PLAN.md, ARCHITECTURE.md.
  • ARCHITECTURE.md reads coherently end to end, with the new section cross-referenced from §8 and from any other place that today says "Redis is the v1 backend".
  • Manual: read PG_PLAN.md top to bottom, confirm every architectural decision matches the section in ARCHITECTURE.md.

Stage 1 — Shared infrastructure packages (pkg/postgres, pkg/redisconn)

This stage is implemented.

Goal: provide one canonical helper each for Postgres and Redis so per-service stages don't reinvent connection/migration wiring. No service consumes them yet.

Files (new):

  • pkg/postgres/config.go — Config struct (PrimaryDSN, ReplicaDSNs, OperationTimeout, MaxOpenConns, MaxIdleConns, ConnMaxLifetime); helper LoadFromEnv(prefix string) (Config, error) that reads <prefix>_POSTGRES_*.
  • pkg/postgres/open.go — OpenPrimary(ctx, cfg) (*sql.DB, error) and OpenReplicas(ctx, cfg) ([]*sql.DB, error) using pgx.ConnConfig + stdlib.OpenDB(...); configures pool sizes and per-statement context timeout.
  • pkg/postgres/migrate.go — RunMigrations(ctx context.Context, db *sql.DB, fs embed.FS) error wrapping goose.SetBaseFS(fs) + goose.UpContext.
  • pkg/postgres/otel.go — Instrument(db *sql.DB, telemetry telemetry.Runtime) applying otelsql.RegisterDBStatsMetrics and statement spans.
  • pkg/postgres/postgres_test.go — testcontainers-backed smoke test: open primary, run a one-line migration, insert/select.
  • pkg/redisconn/config.go — Config struct (MasterAddr, ReplicaAddrs, Password, DB, OperationTimeout); helper LoadFromEnv(prefix string) (Config, error) that reads <prefix>_REDIS_* (the new shape only; rejects deprecated TLS/USERNAME vars with a clear error).
  • pkg/redisconn/client.go — NewMasterClient(cfg) *redis.Client and NewReplicaClients(cfg) []*redis.Client (latter returns nil/empty when replicas not configured).
  • pkg/redisconn/otel.go — Instrument(client *redis.Client, telemetry telemetry.Runtime) applying redisotel.InstrumentTracing / InstrumentMetrics.
  • pkg/redisconn/redisconn_test.go — miniredis-backed config and master client tests.

Files (touched):

  • pkg/go.mod — add github.com/jackc/pgx/v5, github.com/jackc/pgx/v5/stdlib, github.com/pressly/goose/v3, github.com/testcontainers/testcontainers-go/modules/postgres, github.com/XSAM/otelsql (for db instrumentation; alternative: go.nhat.io/otelsql — pick one in implementation).
  • go.work — confirm pkg/ is registered (already is).

Verification:

  • cd /Users/id/src/go/galaxy/pkg && go test ./postgres/... ./redisconn/... passes locally with Docker available.
  • go vet ./... clean.

Stage 2 — Integration test harness extension

This stage is implemented.

Goal: extend integration/internal/harness/ with a Postgres container helper and a service-bootstrap helper that builds the per-service DSN with the right search_path. All existing integration suites stay green.

Files (new):

  • integration/internal/harness/postgres_container.go — StartPostgresContainer(t testing.TB) *PostgresRuntime. The runtime exposes BaseDSN(), DSNForSchema(schema, role string) string, and EnsureRoleAndSchema(ctx, schema, role, password string) error so each test can prepare an isolated schema for the service it is booting.
  • integration/internal/harness/postgres_container_test.go — smoke test.

Files (touched):

  • integration/internal/harness/binary.go — extend Process/launch helpers with WithPostgres(rt *PostgresRuntime, schema, role string) that injects the right <S>_POSTGRES_PRIMARY_DSN. (Existing API already takes env map[string]string; this is a thin wrapper.)
  • integration/go.mod — add the testcontainers Postgres module.

Out of scope: no integration suite is yet wired to Postgres; each service stage wires in its suites.

Verification:

  • cd integration && go test ./internal/harness/... passes.
  • cd integration && go test ./... still green for all existing suites (Redis-only services remain Redis-only).

Stage 3 — User Service migration (pilot)

Goal: replace User Service's Redis durable storage with PostgreSQL. The two Redis Streams (user:domain_events, user:lifecycle_events) remain on Redis. This stage is the pilot; subsequent service stages copy its shape.

Schema (user schema):

  • accounts (user_id PK, email UNIQUE, user_name UNIQUE, display_name, preferred_language, time_zone, declared_country, created_at, updated_at, deleted_at).
  • blocked_emails (email PK, reason_code, blocked_at, actor_type, actor_id, resolved_user_id).
  • entitlement_records (record_id PK, user_id FK, plan_code, is_paid, starts_at, ends_at, source, actor_type, actor_id, reason_code, updated_at).
  • entitlement_snapshots (user_id PK FK → accounts, …current effective values mirroring Redis snapshot shape).
  • sanction_records (record_id PK, user_id FK, sanction_code, scope, reason_code, actor_type, actor_id, applied_at, expires_at, removed_at, removed_by_type, removed_by_id, removed_reason_code).
  • sanction_active (user_id, sanction_code, record_id) PRIMARY KEY (user_id, sanction_code).
  • limit_records, limit_active — analogous to sanctions.
  • Indexes: accounts(created_at DESC, user_id DESC) for newest-first pagination; accounts(declared_country); entitlement_snapshots(plan_code, is_paid); entitlement_snapshots(ends_at) WHERE is_paid AND ends_at IS NOT NULL; sanction_active(sanction_code); limit_active(limit_code). Eligibility flags become computed predicates on these columns.
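The "eligibility flags become computed predicates" point might look like this in practice. This query is illustrative only (the exact predicate set lives in the adapter); it combines the snapshot and sanction tables listed above and leans on the pagination and partial indexes:

```sql
-- Illustrative: paid-eligibility computed at read time instead of
-- being maintained as a Redis flag or index key.
SELECT a.user_id
FROM accounts a
JOIN entitlement_snapshots es ON es.user_id = a.user_id
WHERE a.deleted_at IS NULL
  AND es.is_paid
  AND (es.ends_at IS NULL OR es.ends_at > now())
  AND NOT EXISTS (
      SELECT 1 FROM sanction_active sa WHERE sa.user_id = a.user_id
  )
ORDER BY a.created_at DESC, a.user_id DESC   -- matches the pagination index
LIMIT 50;
```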

Files (new):

  • galaxy/user/internal/adapters/postgres/migrations/00001_init.sql — full schema with grants (GRANT USAGE ON SCHEMA user TO userservice; GRANT … ON ALL TABLES …;).
  • galaxy/user/internal/adapters/postgres/migrations/migrations.go — //go:embed *.sql and a Migrations() embed.FS accessor.
  • galaxy/user/internal/adapters/postgres/jet/... — generated code (commit-checked).
  • galaxy/user/internal/adapters/postgres/userstore/store.go — Postgres implementation of ports.UserAccountStore and ports.AuthDirectoryStore.
  • galaxy/user/internal/adapters/postgres/userstore/entitlement_store.go — Postgres implementation of EntitlementSnapshotStore and EntitlementHistoryStore.
  • galaxy/user/internal/adapters/postgres/userstore/policy_store.go — Postgres implementation of SanctionStore and LimitStore.
  • galaxy/user/internal/adapters/postgres/userstore/list_store.go — Postgres implementation of UserListStore (pagination + filters expressed as SQL).
  • galaxy/user/internal/adapters/postgres/userstore/store_test.go and siblings — testcontainers-backed unit tests covering the same matrix the current Redis tests cover.
  • galaxy/user/Makefile — jet target.
  • galaxy/user/docs/postgres-migration.md — decision record (schema shape, why we keep entitlement_snapshots denormalised, eligibility expressed as SQL predicates, schema role grants).

Files (touched):

  • galaxy/user/internal/config/config.go — add Postgres config; refactor Redis config to master/replica/password (drop TLS_ENABLED, USERNAME).
  • galaxy/user/internal/config/config_test.go — update to new env shape.
  • galaxy/user/internal/app/runtime.go — open Postgres pool, run migrations on startup before listeners open, wire postgres adapters into services. Redis client now serves only the two stream publishers.
  • galaxy/user/README.md — replace "Redis-backed user state" with the new persistence model, update env-var section.
  • galaxy/user/docs/runbook.md, galaxy/user/docs/runtime.md, galaxy/user/docs/examples.md — update storage references and config sections.
  • galaxy/user/go.mod — add github.com/jackc/pgx/v5{,/stdlib}, github.com/pressly/goose/v3, github.com/go-jet/jet/v2, github.com/testcontainers/testcontainers-go/modules/postgres. Use pkg/postgres, pkg/redisconn.

Files (deleted):

  • galaxy/user/internal/adapters/redis/userstore/ — entire directory.
  • The portions of galaxy/user/internal/adapters/redisstate/keyspace.go that defined account/entitlement/sanction/limit/index keys (keep only what domainevents and lifecycleevents publishers still require — if none, delete the file outright).

Files retained on Redis:

  • galaxy/user/internal/adapters/redis/domainevents/publisher.go.
  • galaxy/user/internal/adapters/redis/lifecycleevents/publisher.go.

Touched integration suites (each gets a Postgres container in addition to the existing Redis one):

  • integration/authsessionuser/
  • integration/gatewayauthsessionuser/
  • integration/gatewayauthsessionusermail/
  • integration/notificationuser/
  • integration/lobbyuser/

Verification:

  • cd galaxy/user && make jet && go test ./... (Docker needed).
  • cd integration && go test ./authsessionuser/... ./gatewayauthsessionuser/... ./gatewayauthsessionusermail/... ./notificationuser/... ./lobbyuser/...
  • Manual smoke against a docker-compose stack (PG + Redis with passwords) using flows from galaxy/user/docs/examples.md.

Stage 4 — Mail Service migration

This stage is implemented.

Goal: move durable mail storage (deliveries, attempts, dead letters, malformed commands, payloads, idempotency, attempt schedule) into PostgreSQL. Keep Redis only for the inbound mail:delivery_commands stream and its consumer offset.

Schema (mail schema):

  • deliveries (delivery_id PK, source, status, recipient_envelope JSONB, subject, text_body, html_body, payload_mode, template_id, idempotency_source, idempotency_key, locale_fallback_used, next_attempt_at, attempt_count, max_attempts, created_at, updated_at).
    • INDEX (status, next_attempt_at) for the scheduler.
    • UNIQUE (idempotency_source, idempotency_key) — the idempotency record IS this row (no separate kv).
    • INDEX (created_at DESC) for operator listings; INDEX on status, source, template_id, recipient as needed.
  • attempts (delivery_id FK, attempt_no, status, provider_summary, scheduled_for_ms, started_at_ms, completed_at_ms, PRIMARY KEY (delivery_id, attempt_no)).
  • dead_letters (delivery_id PK FK, final_attempt_count, max_attempts, failure_classification, failure_message, created_at_ms).
  • delivery_payloads (delivery_id PK FK, template_variables JSONB).
  • malformed_commands (stream_entry_id PK, failure_code, failure_message, raw_fields JSONB, recorded_at_ms; INDEX created_at).

Files: mirror Stage 3 (postgres adapter package, migrations, jet codegen, Makefile, decision record, removal of corresponding internal/adapters/redisstate/* files for migrated entities, retention of stream offset and consumer wiring on Redis).

Worker change: the mail attempt scheduler loop replaces ZRANGEBYSCORE over mail:attempt_schedule with SELECT … FROM deliveries WHERE status IN ('queued','retry_pending') AND next_attempt_at <= now() ORDER BY next_attempt_at LIMIT N FOR UPDATE SKIP LOCKED.
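The claim above, written out as a full transaction: the batch size and the 'in_flight' status transition are illustrative assumptions, but the SELECT shape is the one described, and SKIP LOCKED lets concurrent workers pass over rows another worker has already claimed:

```sql
BEGIN;

WITH claimed AS (
    SELECT delivery_id
    FROM deliveries
    WHERE status IN ('queued', 'retry_pending')
      AND next_attempt_at <= now()
    ORDER BY next_attempt_at
    LIMIT 10                 -- illustrative batch size N
    FOR UPDATE SKIP LOCKED   -- concurrent workers skip locked rows
)
UPDATE deliveries d
SET status = 'in_flight',    -- hypothetical transitional status
    updated_at = now()
FROM claimed c
WHERE d.delivery_id = c.delivery_id
RETURNING d.delivery_id;

COMMIT;
```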

Files (deleted):

  • galaxy/mail/internal/adapters/redisstate/auth_acceptance_store.go
  • galaxy/mail/internal/adapters/redisstate/generic_acceptance_store.go
  • galaxy/mail/internal/adapters/redisstate/attempt_execution_store.go
  • galaxy/mail/internal/adapters/redisstate/operator_store.go
  • galaxy/mail/internal/adapters/redisstate/malformed_command_store.go
  • galaxy/mail/internal/adapters/redisstate/render_store.go
  • The portions of galaxy/mail/internal/adapters/redisstate/keyspace.go no longer used (mail:attempt_schedule, mail:idempotency:*, all delivery/attempt/dead-letter/index keys).

Files retained on Redis:

  • galaxy/mail/internal/adapters/redisstate/stream_offset_store.go (offset for mail:delivery_commands consumer).
  • The command stream consumer wiring itself.

Touched integration suites:

  • integration/authsessionmail/
  • integration/gatewayauthsessionmail/
  • integration/gatewayauthsessionusermail/
  • integration/notificationmail/

Verification: per Stage 3 pattern; plus end-to-end smoke that pushes a delivery through retry_pending → provider_accepted using the SMTP stub.


Stage 5 — Notification Service migration

This stage is implemented.

Goal: move durable notification storage (records, routes, idempotency, dead letters, malformed intents) into PostgreSQL. Keep Redis for the inbound notification:intents stream, the outbound gateway:client-events stream, the outbound mail:delivery_commands stream, the corresponding stream offsets, and the short-lived per-route lease (route_leases:*).

Schema (notification schema):

  • records (notification_id PK, notification_type, producer, audience_kind, recipient_user_ids JSONB, payload JSONB, idempotency_key, request_fingerprint, request_id, trace_id, occurred_at_ms, accepted_at_ms, updated_at_ms).
    • UNIQUE (producer, idempotency_key) — idempotency record IS this row.
  • routes (notification_id, route_id, channel, recipient_ref, status, attempt_count, max_attempts, next_attempt_at_ms, resolved_email, resolved_locale, last_error_classification, last_error_message, last_error_at_ms, created_at_ms, updated_at_ms, published_at_ms, dead_lettered_at_ms, skipped_at_ms, PRIMARY KEY (notification_id, route_id)).
    • INDEX (status, next_attempt_at_ms) for the scheduler.
  • dead_letters (notification_id, route_id PK FK, channel, recipient_ref, final_attempt_count, max_attempts, failure_classification, failure_message, recovery_hint, created_at_ms).
  • malformed_intents (stream_entry_id PK, notification_type, producer, idempotency_key, failure_code, failure_message, raw_fields JSONB, recorded_at_ms).

Worker change: route publisher selects work via the same FOR UPDATE SKIP LOCKED pattern as Mail. The Redis lease is still used as a short-lived, per-process exclusivity hint atop the SQL claim.

Files (deleted):

  • galaxy/notification/internal/adapters/redisstate/acceptance_store.go
  • galaxy/notification/internal/adapters/redisstate/route_state_store.go
  • galaxy/notification/internal/adapters/redisstate/malformed_intent_store.go
  • The portions of galaxy/notification/internal/adapters/redisstate/keyspace.go no longer used (records, routes, idempotency, dead_letters, malformed_intents).

Files retained on Redis:

  • galaxy/notification/internal/adapters/redisstate/stream_offset_store.go.
  • Route lease key generator (still under redisstate/, narrowed to leases only).
  • All stream consumer/publisher wiring.

Touched integration suites:

  • integration/notificationgateway/
  • integration/notificationmail/
  • integration/notificationuser/

Stage 6A — Lobby Service: core enrollment entities

Goal: move Game, Application, Invite, Membership records and their indexes into PostgreSQL. RaceNameDirectory, GameTurnStats, GapActivation, EvaluationGuard, StreamOffset remain on Redis until later sub-stages.

Schema (lobby schema, partial):

  • games (game_id PK, owner_id, kind ('public'|'private'), status, created_at, updated_at, runtime_snapshot JSONB, runtime_binding JSONB, …other denormalised game settings).
    • INDEX (status, created_at).
    • INDEX (owner_id) WHERE kind = 'private'.
  • applications (application_id PK, game_id FK, user_id, status, canonical_key, submitted_at, decided_at).
    • PARTIAL UNIQUE INDEX (user_id, game_id) WHERE status = 'active' — enforces the single-active constraint at the DB level (replaces lobby:user_game_application:*:*).
    • INDEX (game_id), INDEX (user_id).
  • invites (invite_id PK, game_id FK, inviter_id, invitee_id, race_name, status, created_at, expires_at, decided_at).
    • INDEX (game_id), INDEX (invitee_id), INDEX (inviter_id).
    • INDEX (status, expires_at) for any expiration scanner if needed.
  • memberships (membership_id PK, game_id FK, user_id, status, joined_at, canonical_key, …).
    • INDEX (game_id), INDEX (user_id).
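The single-active-application constraint above can be sketched as follows (the index name is hypothetical):

```sql
-- Replaces the lobby:user_game_application:*:* Redis keys: the database
-- itself rejects a second active application for the same (user, game).
CREATE UNIQUE INDEX applications_one_active
    ON applications (user_id, game_id)
    WHERE status = 'active';
```

A second active INSERT for the same pair then fails with unique_violation, which the adapter can surface as the domain's "already applied" error; decided applications (non-active statuses) never collide.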

Files (new):

  • galaxy/lobby/internal/adapters/postgres/migrations/00001_core_entities.sql.
  • galaxy/lobby/internal/adapters/postgres/migrations/migrations.go.
  • galaxy/lobby/internal/adapters/postgres/jet/....
  • galaxy/lobby/internal/adapters/postgres/gamestore/store.go.
  • galaxy/lobby/internal/adapters/postgres/applicationstore/store.go.
  • galaxy/lobby/internal/adapters/postgres/invitestore/store.go.
  • galaxy/lobby/internal/adapters/postgres/membershipstore/store.go.
  • Test files for each store using the existing test patterns.
  • galaxy/lobby/Makefile (jet target).
  • galaxy/lobby/docs/postgres-migration.md (decision record covering this sub-stage and what is intentionally left for 6B/6C).

Files (touched):

  • galaxy/lobby/internal/config/config.go — add Postgres config; refactor Redis config to the new shape.
  • galaxy/lobby/internal/app/runtime.go — open Postgres pool, run migrations on startup, wire core PG-backed stores into services. RaceNameDirectory and stats/guard stores still wired to Redis until 6B/6C.
  • galaxy/lobby/README.md and galaxy/lobby/docs/runbook.md — updated to describe core entities on PG, RND/stats still on Redis until 6B/6C.

Files (deleted):

  • galaxy/lobby/internal/adapters/redisstate/gamestore.go, applicationstore.go, invitestore.go, membershipstore.go.
  • The corresponding sections of redisstate/keyspace.go.

Stub adapters retained: gamestub/, applicationstub/, invitestub/, membershipstub/ stay — they are pure in-memory ports useful for tests that don't need real PG.

Touched integration suites:

  • integration/lobbyuser/
  • integration/lobbynotification/

Verification: per Stage 3 pattern; plus the existing lobby HTTP contract tests against the public/internal ports.


Stage 6B — Lobby Service: RaceNameDirectory

This stage is implemented.

Goal: replace the Lua-backed Redis RaceNameDirectory with a PG implementation that preserves the two-tier model (registered / reservation / pending_registration) and atomic registration semantics via SQL transactions and (where required) advisory locks.

Schema (additions to lobby schema):

  • race_names (canonical_key PK, holder_user_id, binding_kind ('registered' | 'reserved' | 'pending_registration'), source_game_id, eligible_until_ms, registered_at_ms, reserved_at_ms).
    • INDEX (holder_user_id) for ListRegistered/ListReservations/ListPendingRegistrations queries.
    • PARTIAL INDEX (eligible_until_ms) WHERE binding_kind = 'pending_registration' for the expiration scanner.
    • The confusable-pair policy is enforced at write time inside BEGIN … COMMIT transactions; Reserve/Register/MarkPendingRegistration use SELECT … FOR UPDATE on the canonical keys involved (or PG advisory locks keyed by hashtext(canonical_key)) to serialise concurrent attempts.
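A minimal sketch of the advisory-lock variant of that serialisation, assuming the race_names shape above (confusable-pair checks and error handling elided; $1 = canonical_key, $2 = holder_user_id, $3 = now in ms):

```sql
BEGIN;

-- Serialise concurrent attempts on one canonical key; hashtext() folds
-- the text key into the integer key space advisory locks require. The
-- lock is released automatically at COMMIT/ROLLBACK.
SELECT pg_advisory_xact_lock(hashtext($1));

-- With the key serialised, take the reservation only if no binding exists.
INSERT INTO race_names (canonical_key, holder_user_id, binding_kind, reserved_at_ms)
VALUES ($1, $2, 'reserved', $3)
ON CONFLICT (canonical_key) DO NOTHING;

COMMIT;
```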

Files (new):

  • galaxy/lobby/internal/adapters/postgres/migrations/00002_race_names.sql.
  • galaxy/lobby/internal/adapters/postgres/racenamedir/directory.go — Postgres implementation of ports.RaceNameDirectory.
  • galaxy/lobby/internal/adapters/postgres/racenamedir/directory_test.go — runs the existing shared suite at galaxy/lobby/internal/ports/racenamedirtest/suite.go.

Files (touched):

  • galaxy/lobby/internal/app/runtime.go — wire PG RND.
  • galaxy/lobby/internal/ports/racenamedirtest/suite.go — only shape-preserving updates if the suite assumed Redis-only behaviour (e.g. SCAN-based list ordering).
  • galaxy/lobby/README.md, galaxy/lobby/docs/runbook.md — RND now PG-backed; canonical_lookup cache no longer needed (PG indexed lookup is fast enough; remove the Redis cache key from redisstate/keyspace.go).

Files (deleted):

  • galaxy/lobby/internal/adapters/redisstate/racenamedir.go and the embedded Lua scripts.
  • Retained (not deleted): galaxy/lobby/internal/adapters/racenamestub/ — still useful for unit tests that don't need PG.

Worker change: the pending-registration expiration worker switches from ZRANGEBYSCORE on lobby:race_names:pending_index to SELECT … FROM race_names WHERE binding_kind='pending_registration' AND eligible_until_ms <= now().

Verification: shared port suite (racenamedirtest) green against PG adapter; lobby unit tests green; integration/lobbyuser/, integration/lobbynotification/ green.


Stage 6C — Lobby Service: workers, ephemeral stores, cleanup

This stage is implemented.

Goal: finish the lobby migration. Confirm what stays Redis-only, update workers that touch both backends, drop dead Redis adapters.

Stays on Redis (per architectural rules):

  • GameTurnStatsStore — ephemeral per-game aggregate, deleted at game finish, rebuildable from GM events.
  • EvaluationGuardStore — ephemeral marker.
  • GapActivationStore — short-lived gap-window timestamp cache.
  • StreamOffsetStore — runtime coordination per the architectural rule.
  • All stream consumers and publishers (gm:lobby_events, runtime:job_results, user:lifecycle_events, notification:intents).

This is documented in galaxy/lobby/docs/postgres-migration.md.

Files (touched):

  • galaxy/lobby/internal/worker/gmevents/consumer.go — write durable updates via PG-backed GameStore.
  • galaxy/lobby/internal/worker/runtimejobresult/consumer.go — same.
  • galaxy/lobby/internal/adapters/userlifecycle/consumer.go (and the worker that drives it) — RND release, membership/application/invite cascade all flow through PG.
  • galaxy/lobby/internal/worker/pendingregistration/worker.go — PG-based scan, no Redis ZSET.
  • galaxy/lobby/internal/worker/enrollmentautomation/worker.go — uses PG GameStore.GetByStatus("enrollment_open").
  • galaxy/lobby/internal/adapters/redisstate/keyspace.go — pruned to the remaining Redis keys (turn stats, gap activation, evaluation guard, stream offsets, lifecycle stream consumer state).
  • galaxy/lobby/README.md, galaxy/lobby/docs/runtime.md, galaxy/lobby/docs/runbook.md, galaxy/lobby/docs/examples.md — finalised storage descriptions.
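The PG-backed GameStore port that the gmevents, runtimejobresult and enrollmentautomation workers above depend on can be sketched like this. The `Game` fields and the in-memory stub are illustrative; the real port definitions live under galaxy/lobby/internal/ports.

```go
package main

import "fmt"

// Game and GameStore sketch the port shape; only GetByStatus is shown,
// since that is the call the enrollment-automation worker makes.
type Game struct {
	ID     string
	Status string
}

type GameStore interface {
	// GetByStatus returns all games in the given lifecycle status,
	// e.g. "enrollment_open"; the PG adapter backs it with an indexed query.
	GetByStatus(status string) ([]Game, error)
}

// memStore is a trivial in-memory stand-in, the shape a unit-test stub
// (analogous to racenamestub) would take.
type memStore struct{ games []Game }

func (s *memStore) GetByStatus(status string) ([]Game, error) {
	var out []Game
	for _, g := range s.games {
		if g.Status == status {
			out = append(out, g)
		}
	}
	return out, nil
}

func main() {
	var store GameStore = &memStore{games: []Game{
		{ID: "g1", Status: "enrollment_open"},
		{ID: "g2", Status: "finished"},
	}}
	open, _ := store.GetByStatus("enrollment_open")
	fmt.Println(len(open), open[0].ID)
}
```

Because workers only see the interface, swapping the Redis adapter for the PG one is a wiring change, which is what keeps the previously-green unit tests valid.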

Files (deleted):

  • Anything left in galaxy/lobby/internal/adapters/redisstate/ whose only consumer was a port now PG-backed (see 6A/6B deletions).

Verification:

  • All previously-green lobby unit tests pass with PG-backed adapters.
  • integration/lobbyuser/, integration/lobbynotification/ pass.
  • grep -rn "redisstate" galaxy/lobby/internal/ returns only the keys intentionally retained on Redis.

Stage 7 — Gateway and Auth/Session: Redis configuration refactor

This stage is implemented.

Goal: apply the new Redis configuration shape (master/replica/password, drop TLS/USERNAME) to Gateway and Auth/Session. No PG migration; these services intentionally stay Redis-only.

Files (touched):

  • galaxy/gateway/internal/config/config.go — switch RedisConfig fields to the pkg/redisconn.Config shape; update the three prefixes: GATEWAY_SESSION_CACHE_REDIS_*, GATEWAY_REPLAY_REDIS_*, GATEWAY_SESSION_EVENTS_REDIS_*. Drop TLS_ENABLED, USERNAME.
  • galaxy/gateway/internal/session/redis.go, galaxy/gateway/internal/replay/redis.go, galaxy/gateway/internal/events/subscriber.go — adopt new client constructor via pkg/redisconn.
  • galaxy/gateway/internal/config/config_test.go, galaxy/gateway/internal/session/redis_test.go, galaxy/gateway/internal/replay/redis_test.go — updated to new env shape.
  • galaxy/authsession/internal/config/config.go — same pattern; drop TLS, USERNAME.
  • galaxy/authsession/internal/adapters/redis/sessionstore/store.go, challengestore/store.go, projectionpublisher/publisher.go, sendemailcodeabuse/protector.go, configprovider/store.go — adopt new client.
  • galaxy/authsession/internal/config/config_test.go — updated.
  • galaxy/gateway/README.md, galaxy/authsession/README.md, galaxy/gateway/docs/runbook.md, galaxy/authsession/docs/runbook.md — note that Redis-only is intentional and reference the ARCHITECTURE.md rule on TTL-bounded auth state.

No deletions of business logic; only env-var refactor and adapter plumbing through pkg/redisconn.

Touched integration suites:

  • integration/gatewayauthsession/
  • integration/authsession/
  • (every suite that boots gateway or authsession picks up the new env vars via the harness; confirm none still pass *_REDIS_TLS_ENABLED).

Verification:

  • cd galaxy/gateway && go test ./...
  • cd galaxy/authsession && go test ./...
  • cd integration && go test ./gatewayauthsession/... ./authsession/...

Stage 8 — GeoProfile: documentation only

Goal: ensure the GeoProfile plan and README reflect the new persistence rules so its future implementation follows them. No code exists yet.

Files (touched):

  • galaxy/geoprofile/PLAN.md — add a stage referencing pkg/postgres and pkg/redisconn; specify that observed-country aggregates, declared_country history and review records will live in a geoprofile schema, while ephemeral per-session signals (if any) stay on Redis.
  • galaxy/geoprofile/README.md — note ownership of the geoprofile schema and the stack choices.

No code change.


Stage 9 — Final sweep

Goal: confirm no dead Redis adapter code, no orphaned stub, no broken doc reference. Remove the Migration Window caveat from ARCHITECTURE.md once all stages are done.

Activities:

  • Walk every PG-backed service: grep -rn "redis" galaxy/<svc>/internal/adapters/ and verify every match belongs to a still-active stream/cache/runtime use case.
  • Walk integration suites: confirm each one provisions only the containers it actually needs; no stale env vars.
  • Update ARCHITECTURE.md to drop the Migration Window sub-section.
  • Collapse each service's sequence of migration .sql files into a single initial migration. Rewrite the SQL rather than merely concatenating: the project is still in development, so all schema changes can land directly in the one and only first migration. Record this convention in ARCHITECTURE.md as well.
  • One round of go test ./... in every module plus cd integration && go test ./....

Verification:

  • All tests pass in every module.
  • No file matches // TODO.*postgres or // TODO.*migrate.
  • git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)' returns nothing under galaxy/ (these env vars are fully retired).

Verification strategy (whole project)

After each stage:

  • cd /Users/id/src/go/galaxy/pkg && go test ./...
  • cd /Users/id/src/go/galaxy/<changed_service> && go test ./... (with Docker available for testcontainers).
  • cd /Users/id/src/go/galaxy/integration && go test ./<affected_suites>/...
  • Manual smoke against a docker-compose stack (PG + Redis, both with passwords) using the example flows in each service's docs/examples.md.

After Stage 9:

  • cd /Users/id/src/go/galaxy/integration && go test ./... end to end against real PG + real Redis.
  • Confirm git grep -nE 'REDIS_(TLS_ENABLED|USERNAME)' returns nothing under galaxy/.
  • Confirm git grep -n 'TODO.*(postgres|migrate)' returns nothing.

Out of scope

  • galaxy/game — explicitly excluded by the project owner.
  • Production deployment manifests (Helm/k8s) — local docker-compose is enough for development.
  • Backup/restore tooling configuration — pg_dump and WAL archiving are available out of the box; operational setup is not part of this plan.
  • Sentinel/Cluster Redis topology code paths — config exposes replica addresses for future use; no failover routing implemented yet.
  • Read-traffic routing to PG replicas — config exposes *_POSTGRES_REPLICA_DSNS for future use; no routing implemented yet.
  • golangci-lint config addition — not part of this migration.
  • CI pipeline — no .github/workflows/ exists; not added by this plan.

Risks and notes

  • go-jet codegen requires a live database. The make jet target per service uses testcontainers-go to bring up a transient PG, applies the same goose migrations the service applies at startup, then runs jet -dsn=… -path=internal/adapters/postgres/jet. Generated code is committed; consumers don't need Docker just to build.
  • Schema-per-service vs single-DB cross-service joins: there are no cross-schema joins in this plan. Each service reads only its own schema; cross-service data flows go via Redis Streams (event bus) or HTTP contracts (User Service is queried by Lobby for eligibility) — same as today. The DB-level role grants enforce this.
  • Pending registration expiration worker: under Redis it scanned a global ZSET; under PG it does an indexed scan. The partial index on eligible_until_ms WHERE binding_kind='pending_registration' keeps the scan cheap.
  • Idempotency under crash: with idempotency expressed as a UNIQUE constraint on the durable record, recovery is "the row either exists or it doesn't" — no Redis-loss window where duplicates can sneak through.
  • lib/pq vs pgx (revisit): confirmed pgx/v5 + jet via stdlib adapter. The make jet target will pass -source=postgres to jet (the dialect is independent of which Go driver runs the queries at runtime).
  • No backward-compat shim for env vars: *_REDIS_TLS_ENABLED and *_REDIS_USERNAME are retired in one cut. Any external dev environment that still sets them fails fast at startup with a clear error emitted by pkg/redisconn.LoadFromEnv.