feat: use postgres

Author: Ilia Denisov
Date: 2026-04-26 20:34:39 +02:00
Committed via GitHub
parent 48b0056b49
commit fe829285a6
365 changed files with 29223 additions and 24049 deletions
+7
@@ -10,6 +10,13 @@ Sections:
- [Operator runbook](runbook.md)
- [Contract examples](examples.md)
Decision records:
- [PostgreSQL migration](postgres-migration.md) — schema and storage
decisions landed by `PG_PLAN.md §3`
- [Stage 21 — `user_name` + `display_name` refactor](stage21-user-name-display-name.md)
- [Stage 22 — `permanent_block` + `DeleteUser` soft-delete](stage22-permanent-block-delete-user.md)
Primary references:
- [`../README.md`](../README.md) for stable service scope and business rules
+206
@@ -0,0 +1,206 @@
# PostgreSQL Migration
PG_PLAN.md §3 migrated `galaxy/user` from a Redis-only durable store to the
steady-state split codified in `ARCHITECTURE.md §Persistence Backends`:
PostgreSQL is the source of truth for table-shaped business state, and Redis
keeps only the two streams that publish auxiliary domain events
(`user:domain_events`) and trusted user-lifecycle events
(`user:lifecycle_events`).
This document records the schema decisions and the non-obvious agreements
behind them. Use it together with the migration script
(`internal/adapters/postgres/migrations/00001_init.sql`) and the runtime
wiring (`internal/app/runtime.go`).
## Outcomes
- Schema `user` (provisioned externally) holds the durable state: `accounts`,
`blocked_emails`, `entitlement_records`, `entitlement_snapshots`,
`sanction_records`, `sanction_active`, `limit_records`, `limit_active`.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to both stream publishers
(`internal/adapters/redis/domainevents`,
`internal/adapters/redis/lifecycleevents`); the publishers no longer hold
their own connection topology fields.
- `internal/adapters/redis/userstore/` and the entire
`internal/adapters/redisstate/` package are removed. The Redis Lua scripts,
Watch/Multi optimistic-concurrency loops, and ZSET indexes are gone.
- Configuration drops `USERSERVICE_REDIS_USERNAME`,
`USERSERVICE_REDIS_TLS_ENABLED`, and `USERSERVICE_REDIS_KEYSPACE_PREFIX`.
`USERSERVICE_REDIS_ADDR` is replaced by
`USERSERVICE_REDIS_MASTER_ADDR` + optional
`USERSERVICE_REDIS_REPLICA_ADDRS`. Postgres-specific knobs live under
`USERSERVICE_POSTGRES_*` per the architectural rule.
## Decisions
### 1. One schema, externally provisioned role
**Decision.** The `user` schema and the matching `userservice` role are
created outside the migration sequence (in tests, by
`integration/internal/harness/postgres_container.go::EnsureRoleAndSchema`;
in production, by an ops init script not in scope for this stage). The
embedded migration `00001_init.sql` only contains DDL for tables and
indexes and assumes it runs as the schema owner with `search_path=user`.
**Why.** Mixing role creation, schema creation, and table DDL into one
script forces every consumer of the migration to run as a superuser. The
schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
operational split: ops provisions roles and schemas, the service applies
schema-scoped migrations.
### 2. `entitlement_snapshots` stays denormalised
**Decision.** A dedicated `entitlement_snapshots` table holds exactly one
row per `user_id` mirroring the current effective fields (`plan_code`,
`is_paid`, `starts_at`, `ends_at`, `source`, `actor_*`, `reason_code`,
`updated_at`). Lifecycle operations (`Grant`, `Extend`, `Revoke`,
`RepairExpired`) write the history row and the snapshot row inside one
transaction.
**Why.** The lobby-eligibility hot-path reads exactly one row per user; a
JOIN over `entitlement_records` to compute the current segment would add
latency and wire-format complexity. Keeping the snapshot denormalised
matches the previous Redis shape where the hot read returned a
pre-materialised JSON blob, which preserves the existing service-layer
contract and the public REST envelope.
### 3. `sanction_active` / `limit_active` are the source of truth for "active"
**Decision.** The active state of a sanction or a user-specific limit is
expressed by a small dedicated table (`sanction_active`, `limit_active`)
whose primary key is `(user_id, code)`. Each row references the matching
history record by `record_id`. Lifecycle operations maintain both tables
inside one transaction.
**Why.** The lobby-eligibility hot path needs to enumerate active
sanctions/limits without scanning the full history. Encoding "active"
as a partial index on `removed_at IS NULL` would still require dedup
because a user can apply, remove, and re-apply the same code. Two narrow
tables let the same predicates that the Redis adapter encoded as
`active` keys remain index-only.
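The apply/remove/re-apply argument can be seen in a small sketch; the `chat_ban` code and the field shapes are illustrative, not the real catalogue:

```go
package main

import "fmt"

type activeKey struct{ UserID, Code string }

// sanctions mimics the two-table shape: an append-only history plus a narrow
// active map keyed by (user_id, code), each active entry pointing at the
// history record it came from.
type sanctions struct {
	nextRecordID int
	history      []int             // record IDs, stand-in for sanction_records
	active       map[activeKey]int // (user_id, code) -> record_id
}

func (s *sanctions) apply(userID, code string) int {
	s.nextRecordID++
	s.history = append(s.history, s.nextRecordID)
	// Primary-key upsert: re-applying a code overwrites the active entry,
	// so enumerating active sanctions never needs deduplication.
	s.active[activeKey{userID, code}] = s.nextRecordID
	return s.nextRecordID
}

func (s *sanctions) remove(userID, code string) {
	delete(s.active, activeKey{userID, code})
}

func main() {
	s := &sanctions{active: map[activeKey]int{}}
	s.apply("u1", "chat_ban")
	s.remove("u1", "chat_ban")
	latest := s.apply("u1", "chat_ban") // re-apply the same code
	fmt.Println(len(s.history), len(s.active), s.active[activeKey{"u1", "chat_ban"}] == latest)
}
```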
### 4. Eligibility flags are computed predicates, not stored columns
**Decision.** No `can_login`, `can_create_private_game`, `can_join_game`
columns or indexes exist. The admin listing surface (and the lobby
eligibility snapshot) compute these from `entitlement_snapshots` and
`sanction_active` at read time.
**Why.** Stage 21 expanded the eligibility marker catalogue and Stage 22
added `permanent_block`. Each addition would have required schema work
plus a backfill if eligibility flags were materialised columns. Computed
predicates push that complexity into one place — the SQL query — and
keep the schema small.
### 5. Atomic flows use explicit `BEGIN … COMMIT` with per-row `FOR UPDATE`
**Decision.** Composite operations (`AuthDirectoryStore.{Resolve,
Ensure, Block*}`, `EntitlementLifecycleStore.{Grant, Extend, Revoke,
RepairExpired}`, `PolicyLifecycleStore.{ApplySanction, RemoveSanction,
SetLimit, RemoveLimit}`) execute inside `store.withTx` and acquire row
locks with `SELECT … FOR UPDATE` on the rows they intend to mutate.
Optimistic-replacement guards (`Expected*Record`, `Expected*Snapshot`)
are validated against the locked rows before the write goes through;
mismatches surface as `ports.ErrConflict`.
**Why.** PostgreSQL's default `READ COMMITTED` isolation plus row-level
locks gives us the serialisation property that the previous Redis
WATCH/MULTI loops achieved, without needing the application to retry on
optimistic-failure errors. The explicit `FOR UPDATE` keeps intent
visible; ad-hoc CTE patterns would obscure the locking shape.
### 6. Query layer is `go-jet/jet/v2`
**Decision.** All `userstore` packages build SQL through the jet
builder API (`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/...` DSL). `cmd/jetgen` (invoked via `make jet`) brings
up a transient PostgreSQL container, applies the embedded migrations,
and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
against the provisioned schema; the generated table/model code lives
under `internal/adapters/postgres/jet/user/{model,table}/*.go` and is
committed to the repo, so build consumers do not need Docker.
Statements are run through the `database/sql` API
(`stmt.Sql() → db.Exec/Query/QueryRow`); manual `rowScanner` helpers
preserve domain-type marshalling.
**Why.** This aligns with `PG_PLAN.md` §Library stack ("Query layer:
`github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated code lives
under each service `internal/adapters/postgres/jet/`, regenerated via
a `make jet` target and committed to the repo"). Constructs the jet
builder does not cover natively (`FOR UPDATE`, keyset-pagination
row-comparison, partial UNIQUE WHERE in `CREATE INDEX`) are expressed
through the per-DSL helpers (`.FOR(pg.UPDATE())`, `OR/AND` expansion
of `(created_at, user_id) < (…)`). The ports contract and the schema
do not change.
### 7. Redis publishers share one `*redis.Client`
**Decision.** `internal/app/runtime.go` constructs one
`redisconn.NewMasterClient(cfg.Redis.Conn)` and passes it to both
`domainevents.New(client, cfg)` and `lifecycleevents.New(client,
cfg)`. The publishers no longer carry connection-topology fields and
no longer close the client; the runtime owns it.
**Why.** Each subsequent PG_PLAN stage (Mail, Notification, Lobby)
ships a similar duo of stream publishers; sharing one client is the
shape we want all stages to converge on. Per-publisher clients
multiplied TCP connections, ping points, and OpenTelemetry
instrumentation hooks for no functional benefit.
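The ownership split can be illustrated with stub types (the real publishers take a `*redis.Client`; everything here is a stand-in):

```go
package main

import "fmt"

// streamClient stands in for the shared *redis.Client; only the behaviour
// relevant to ownership is modelled.
type streamClient struct{ closed int }

func (c *streamClient) Close() { c.closed++ }

// The publishers receive the client and never close it; the names mirror
// the two real packages but the types are illustrative.
type domainEventsPublisher struct{ client *streamClient }
type lifecycleEventsPublisher struct{ client *streamClient }

func main() {
	client := &streamClient{} // the runtime owns this
	de := &domainEventsPublisher{client: client}
	le := &lifecycleEventsPublisher{client: client}
	fmt.Println(de.client == le.client) // one connection topology, shared

	// Shutdown: only the runtime closes the client, exactly once.
	client.Close()
	fmt.Println(client.closed)
}
```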
### 8. Mandatory Redis password in tests as well
**Decision.** Unit tests for the publishers configure
`miniredis.RequireAuth("integration")` and pass a matching password
through their direct `redis.NewClient(...)` construction. The runtime
contract test
(`runtime_contract_test.go::newRuntimeContractHarness`) does the same
plus boots a Postgres container.
**Why.** The architectural rule forbids password-less Redis
connections; carrying the constraint into tests prevents the rule
from drifting.
### 9. Listing surface keeps storage-thin pagination
**Decision.** `UserListStore.ListUserIDs` paginates only on
`(created_at DESC, user_id DESC)` with keyset cursors carried by the
opaque page token. Filter matrix evaluation (paid_state,
declared_country, sanction_code, limit_code, can_*) is performed by
the service-layer `adminusers.Lister`, which loads each candidate
through the per-user loader. This mirrors the previous Redis
behaviour exactly.
**Why.** Pushing the filter matrix into SQL is desirable — it eliminates
candidate over-fetching — but doing it without changing the public
`UserListStore.ListUserIDs` contract (which returns a page of
`UserID`, not full records) requires a JOIN-driven query. That work
is a non-breaking optimisation and is intentionally deferred so this
stage focuses on the storage cut-over rather than throughput
improvements. The page-token wire format is preserved bit-for-bit so
already-issued tokens keep working.
## Cross-References
- `PG_PLAN.md §3` (Stage 3 — User Service migration / pilot).
- `ARCHITECTURE.md §Persistence Backends`.
- `internal/adapters/postgres/migrations/00001_init.sql` and
`internal/adapters/postgres/migrations/migrations.go`.
- `internal/adapters/postgres/userstore/{store,accounts,blocked_emails,
auth_directory,entitlement_store,policy_store,list_store,page_token,
helpers}.go` plus the testcontainers-backed unit suite under
`userstore/{harness,store}_test.go`.
- `internal/adapters/postgres/jet/user/{model,table}/*.go` (committed
generated code) plus `cmd/jetgen/main.go` and the `make jet`
Makefile target that regenerate it.
- `internal/config/config.go` (`PostgresConfig`, `RedisConfig` reshape).
- `internal/app/runtime.go` (PG pool open + migration + shared Redis
client wiring).
- `internal/adapters/redis/{domainevents,lifecycleevents}/publisher.go`
(refactored to accept the shared `*redis.Client`).
- `runtime_contract_test.go::startPostgresForContractTest` (shows the
inline Postgres bootstrap used by the existing runtime contract).
+33 -7
@@ -32,20 +32,46 @@ additional process-level operational endpoint.
## Common Failure Modes
### PostgreSQL unavailable
Symptoms:
- process fails during startup with `ping postgres` or `run postgres
migrations` in the error chain
- readiness probe never reports healthy, internal API never opens
- internal API returns `503 service_unavailable` if connectivity is lost
after start
Checks:
- DSN reachable from the service host: `psql "$USERSERVICE_POSTGRES_PRIMARY_DSN" -c "select 1"`
- `userservice` role exists with `LOGIN` and the configured password
- Schema `user` exists and is owned (or grant-accessible) by the
`userservice` role: `\dn user`
- Embedded migrations applied: query `goose_db_version` (the schema-qualified
goose bookkeeping table) and confirm the latest version matches the
binary's expectation
- Pool tuning sane:
`USERSERVICE_POSTGRES_MAX_OPEN_CONNS` ≥ peak request fan-out
### Redis unavailable
Symptoms:
- process fails during startup with `ping redis master` in the error chain
- domain events / lifecycle events stop being published
- internal API still serves reads/writes (PostgreSQL is the source of truth);
publishers degrade gracefully but operators must investigate
Checks:
- connectivity to `USERSERVICE_REDIS_MASTER_ADDR`
- `USERSERVICE_REDIS_PASSWORD` matches the Redis configuration
- Redis DB number is reachable and unblocked
- The retired variables `USERSERVICE_REDIS_ADDR`,
`USERSERVICE_REDIS_USERNAME`, `USERSERVICE_REDIS_TLS_ENABLED`,
`USERSERVICE_REDIS_KEYSPACE_PREFIX` are not set in the deployment
(`pkg/redisconn.LoadFromEnv` rejects them with a clear error)
### Invalid registration context
+66 -22
@@ -63,38 +63,67 @@ Intentional omissions:
`cmd/userservice` loads config, constructs logging and telemetry, and then
creates the runtime through `internal/app.NewRuntime`.
The runtime wires, in order:
- one shared `*redis.Client` opened through `pkg/redisconn` plus a Ping
- one PostgreSQL pool opened through `pkg/postgres`, instrumented with
`db.sql.connection.*` metrics, pinged, and migrated forward via the
embedded `internal/adapters/postgres/migrations` filesystem
- the PostgreSQL-backed user store from
`internal/adapters/postgres/userstore` (accounts, blocked-emails,
entitlement snapshot/history/lifecycle, sanction history/lifecycle,
limit history/lifecycle, listing index)
- two Redis Stream publishers
(`internal/adapters/redis/domainevents` for auxiliary domain events,
`internal/adapters/redis/lifecycleevents` for trusted user-lifecycle
events) sharing the same `*redis.Client`
- the trusted internal HTTP router
- the optional admin metrics listener
- service-local helpers for clock, IDs, and validation/policy adapters
Startup fails fast when Redis or PostgreSQL connectivity is unavailable, the
mandatory connection-topology environment variables are missing, the
embedded migration sequence cannot be applied, or configuration is otherwise
invalid. The HTTP listeners do not open until every dependency check passes.
## Storage Backends
The service is split between two backends per
[`../../ARCHITECTURE.md §Persistence Backends`](../../ARCHITECTURE.md):
PostgreSQL holds source-of-truth durable state in the `user` schema:
- `accounts` (with `email` and `user_name` UNIQUE; `deleted_at` records the
Stage 22 soft-delete state)
- `blocked_emails` (one row per blocked address)
- `entitlement_records` plus the denormalised `entitlement_snapshots`
one-row-per-user current view
- `sanction_records` plus `sanction_active(user_id, sanction_code)`
- `limit_records` plus `limit_active(user_id, limit_code)`
Indexes serve the listing surface (`accounts(created_at DESC, user_id
DESC)`), the reverse-lookup filters (`accounts(declared_country)`,
`entitlement_snapshots(plan_code, is_paid)`,
`entitlement_snapshots(ends_at) WHERE is_paid AND ends_at IS NOT NULL`,
`sanction_active(sanction_code)`, `limit_active(limit_code)`), and the
per-user history scans.
Redis hosts only the two Stream publishers
(`USERSERVICE_REDIS_DOMAIN_EVENTS_STREAM`,
`USERSERVICE_REDIS_LIFECYCLE_EVENTS_STREAM`). It does not store any
durable user state after Stage 3 of `PG_PLAN.md`.
Decision records:
[`postgres-migration.md`](postgres-migration.md) for the schema and
storage decisions.
## Configuration Groups
Required for all process starts:
- `USERSERVICE_REDIS_MASTER_ADDR`
- `USERSERVICE_REDIS_PASSWORD`
- `USERSERVICE_POSTGRES_PRIMARY_DSN`
Core process config:
@@ -116,16 +145,31 @@ Admin HTTP config:
- `USERSERVICE_ADMIN_HTTP_READ_TIMEOUT`
- `USERSERVICE_ADMIN_HTTP_IDLE_TIMEOUT`
Redis connectivity (consumed by `pkg/redisconn`):
- `USERSERVICE_REDIS_PASSWORD`
- `USERSERVICE_REDIS_REPLICA_ADDRS` (optional, comma-separated)
- `USERSERVICE_REDIS_DB`
- `USERSERVICE_REDIS_OPERATION_TIMEOUT`
Stream-shape (kept service-local):
- `USERSERVICE_REDIS_DOMAIN_EVENTS_STREAM`
- `USERSERVICE_REDIS_DOMAIN_EVENTS_STREAM_MAX_LEN`
- `USERSERVICE_REDIS_LIFECYCLE_EVENTS_STREAM`
- `USERSERVICE_REDIS_LIFECYCLE_EVENTS_STREAM_MAX_LEN`
PostgreSQL connectivity (consumed by `pkg/postgres`):
- `USERSERVICE_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
- `USERSERVICE_POSTGRES_OPERATION_TIMEOUT`
- `USERSERVICE_POSTGRES_MAX_OPEN_CONNS`
- `USERSERVICE_POSTGRES_MAX_IDLE_CONNS`
- `USERSERVICE_POSTGRES_CONN_MAX_LIFETIME`
The retired Redis variables `USERSERVICE_REDIS_ADDR`,
`USERSERVICE_REDIS_USERNAME`, `USERSERVICE_REDIS_TLS_ENABLED`,
`USERSERVICE_REDIS_KEYSPACE_PREFIX` produce a startup error from
`pkg/redisconn` if set; unset them before starting the service.
Telemetry: