feat: use postgres

Ilia Denisov
2026-04-26 20:34:39 +02:00
committed by GitHub
parent 48b0056b49
commit fe829285a6
365 changed files with 29223 additions and 24049 deletions
+150 -23
@@ -37,7 +37,7 @@ Core product properties:
* in-place upgrade of a running game is allowed only as a patch update within the same semver major/minor line;
* player commands are turn-bound and are accepted only before the next scheduled turn generation cutoff.
The current v1 platform uses Redis as the main data store and Redis Streams as the internal event bus.
The platform stores durable business state in PostgreSQL (one shared database, schema per service) and uses Redis with Redis Streams for ephemeral state, caches, and the internal event bus. The backend split, library stack, and staged migration plan live in [`PG_PLAN.md`](PG_PLAN.md) and the [Persistence Backends](#persistence-backends) section below.
## Main Principles
@@ -124,7 +124,8 @@ flowchart LR
Mail["Mail Service"]
Geo["Geo Profile Service"]
Billing["Billing Service\nfuture"]
Redis["Redis\nKV + Streams"]
Redis["Redis\nCache, Streams, Leases"]
Postgres["PostgreSQL\nDurable Business State"]
Telemetry["Telemetry"]
Client --> Gateway
@@ -162,6 +163,13 @@ flowchart LR
Notify --> Redis
Runtime --> Redis
Mail --> Redis
User --> Postgres
Mail --> Postgres
Notify --> Postgres
Lobby --> Postgres
Billing --> User
Telemetry --- Gateway
Telemetry --- Auth
@@ -332,8 +340,10 @@ For auth callers, a successful result means the request was durably accepted
into the mail-delivery pipeline or intentionally suppressed; it does not
require that the external SMTP exchange already completed before the response
is returned.
Stable service-local delivery rules, retry semantics, and Redis-backed
processing details belong in [`mail/README.md`](mail/README.md), not in the
Stable service-local delivery rules, retry semantics, and storage details
(PostgreSQL for the durable delivery record, attempt history, dead letters,
and audit; Redis for the inbound `mail:delivery_commands` stream and its
consumer offset) belong in [`mail/README.md`](mail/README.md), not in the
root architecture document.
## 5. [Geo Profile Service](geoprofile/README.md)
@@ -490,7 +500,7 @@ service-layer logic.
RND owns three levels of state per name:
- **registered** — platform-unique permanent names owned by one regular user.
* **registered** — platform-unique permanent names owned by one regular user.
A registered name cannot be transferred, released, or renamed; the only path
back to availability is `permanent_block` or `DeleteUser` on the owning
account. The number of registered names a user can hold is bounded by the
@@ -498,13 +508,13 @@ RND owns three levels of state per name:
snapshot): `free=1`, `paid_monthly=2`, `paid_yearly=6`,
`paid_lifetime=unlimited`. Tariff downgrade never revokes existing
registrations; it only constrains new ones.
- **reservation** — per-game binding created when a participant joins a game
* **reservation** — per-game binding created when a participant joins a game
through application approval or invite redeem. The reservation key is
`(game_id, canonical_key)`. One user may hold the same name simultaneously
across multiple active games. A reservation survives until the game
finishes, then either becomes a `pending_registration` (see below) or is
released.
- **pending_registration** — a reservation that survived a capable finish and
* **pending_registration** — a reservation that survived a capable finish and
is now waiting up to 30 days for the owner to upgrade it into a registered
name via `lobby.race_name.register`. Expiration releases the binding.
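A minimal sketch of these three tiers as Go types, purely for illustration; the type and field names are hypothetical rather than the service's actual model:

```go
package racename

import "time"

// BindingState enumerates the three levels of state RND tracks per name.
type BindingState string

const (
	StateRegistered          BindingState = "registered"           // permanent, platform-unique, non-transferable
	StateReservation         BindingState = "reservation"          // per-game, keyed by (game_id, canonical_key)
	StatePendingRegistration BindingState = "pending_registration" // 30-day upgrade window after a capable finish
)

// Binding is one name binding in whichever tier it currently occupies.
type Binding struct {
	GameID       string     // empty for registered names, which are not game-scoped
	CanonicalKey string
	OwnerUserID  string
	State        BindingState
	ExpiresAt    *time.Time // set only while the binding is a pending_registration
}
```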
@@ -807,25 +817,143 @@ The main example is `Lobby -> Game Master`:
* synchronous for critical registration/update after successful start;
* asynchronous for secondary propagation and denormalized status fan-out.
## Redis as Data and Event Infrastructure
## Persistence Backends
Redis is the initial shared infrastructure for:
The platform splits its state across two backends.
* main persistent data of services where no SQL backend is yet introduced;
* gateway session cache backing data;
* replay reservation store for gateway;
* session lifecycle projection;
* internal event bus using Redis Streams;
* notification-intent ingress through `notification:intents`;
* notification fan-out;
* runtime job completion events;
* lobby/game-master propagation events;
* geo auxiliary events.
PostgreSQL is the source of truth for table-shaped business state:
Redis Streams are therefore the platform event bus in v1.
* user identity, profile settings, tariffs/entitlements, sanctions, limits,
and the blocked-email registry;
* mail deliveries, attempt history, dead letters, payloads, and
malformed-command audit;
* notification records, route materialisations, dead letters, and
malformed-intent audit;
* lobby games, applications, invites, memberships, and the race-name
registry (registered/reservation/pending tiers);
* idempotency records, expressed as `UNIQUE` constraints on the durable
table — not as a separate key-value store;
* retry scheduling state, expressed as a `next_attempt_at` column on the
durable table and worked off via `SELECT ... FOR UPDATE SKIP LOCKED`
(a worker sketch follows below).
This is an accepted trade-off for simpler early infrastructure.
Service boundaries must still stay storage-agnostic where future SQL migration is expected, especially in `Auth / Session Service`.
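A minimal sketch of that retry pattern, assuming an illustrative `mail.deliveries` table with `id` and `next_attempt_at` columns (the real table and column names live in the service schemas):

```go
package retryworker

import (
	"context"
	"database/sql"
	"time"
)

// claimDueDeliveries atomically claims up to limit rows whose retry time has
// passed, pushes their next_attempt_at into the future so other workers skip
// them, and returns the claimed ids. Rows already locked by a competing
// worker are skipped rather than waited on.
func claimDueDeliveries(ctx context.Context, db *sql.DB, limit int) ([]int64, error) {
	now := time.Now().UTC() // .UTC() at the binding site, per the timestamp rule below

	rows, err := db.QueryContext(ctx, `
		UPDATE mail.deliveries d
		   SET next_attempt_at = $1
		 WHERE d.id IN (
		       SELECT id
		         FROM mail.deliveries
		        WHERE next_attempt_at <= $2
		        ORDER BY next_attempt_at
		        LIMIT $3
		          FOR UPDATE SKIP LOCKED)
		 RETURNING d.id`,
		now.Add(5*time.Minute), now, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```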
Redis is the source of truth for ephemeral and runtime-coordination state:
* the platform event bus implemented as Redis Streams (`user:domain_events`,
`user:lifecycle_events`, `gm:lobby_events`, `runtime:job_results`,
`notification:intents`, `gateway:client-events`, `mail:delivery_commands`);
* stream consumer offsets (see the consumer sketch after this list);
* gateway session cache, replay reservations, rate-limit counters, and
short-lived runtime locks/leases (e.g. notification `route_leases`);
* `Auth / Session Service` challenges and active session tokens, which are
TTL-bounded and where loss is recoverable by re-authentication;
* lobby per-game runtime aggregates that are deleted at game finish
(`game_turn_stats`, `gap_activated_at`, capability evaluation marker).
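A minimal consumer sketch for one of these streams, assuming the `github.com/redis/go-redis/v9` client; the group and consumer names are illustrative, and the stream name would normally come from the service's env vars:

```go
package eventbus

import (
	"context"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

// consumeDeliveryCommands reads mail:delivery_commands through a consumer
// group, so the group's offset (its last-delivered position and pending set)
// lives in Redis alongside the stream itself.
func consumeDeliveryCommands(ctx context.Context, addr, password string) error {
	rdb := redis.NewClient(&redis.Options{
		Addr:     addr,     // <S>_REDIS_MASTER_ADDR
		Password: password, // <S>_REDIS_PASSWORD; no ACL username, no TLS
	})

	const stream, group, consumer = "mail:delivery_commands", "mail", "mail-1"

	// Create the group at the stream tail once; BUSYGROUP means it already exists.
	if err := rdb.XGroupCreateMkStream(ctx, stream, group, "$").Err(); err != nil &&
		!strings.HasPrefix(err.Error(), "BUSYGROUP") {
		return err
	}

	for {
		res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
			Group:    group,
			Consumer: consumer,
			Streams:  []string{stream, ">"}, // only entries not yet delivered to this group
			Count:    16,
			Block:    5 * time.Second,
		}).Result()
		if err == redis.Nil {
			continue // nothing new within the block window
		}
		if err != nil {
			return err
		}
		for _, s := range res {
			for _, msg := range s.Messages {
				handleCommand(msg.Values)            // service-specific processing
				rdb.XAck(ctx, stream, group, msg.ID) // advance the group's acknowledged set
			}
		}
	}
}

func handleCommand(values map[string]interface{}) { /* parse and dispatch */ }
```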
### Database topology
* Single PostgreSQL database `galaxy`.
* Schema per service: `user`, `mail`, `notification`, `lobby`. Reserved for
future use: `geoprofile`. Not allocated unless needed: `gateway`,
`authsession`.
* Each service connects with its own PostgreSQL role whose grants are
restricted to its own schema (defense-in-depth).
* Authentication is username + password only. `sslmode=disable`. No client
certificates and no SCRAM channel binding.
* Each service connects to one primary plus zero-or-more read-only
replicas. Only the primary is used in this iteration; the replica pool
is wired but receives no traffic. Future read-routing is a non-breaking
change (see the wiring sketch below).
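A sketch of that wiring, assuming pgx's `database/sql` driver; the type and function names are illustrative:

```go
package pgpool

import (
	"database/sql"
	"strings"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" database/sql driver
)

// Pools holds a primary pool that takes all traffic today plus replica pools
// that are opened but unused, so turning on read-routing later changes only
// the Reader method, not configuration or call sites.
type Pools struct {
	Primary  *sql.DB
	Replicas []*sql.DB
}

// Open wires the primary DSN and the optional comma-separated replica DSNs.
func Open(primaryDSN, replicaDSNs string) (*Pools, error) {
	primary, err := sql.Open("pgx", primaryDSN)
	if err != nil {
		return nil, err
	}
	p := &Pools{Primary: primary}
	for _, dsn := range strings.Split(replicaDSNs, ",") {
		if dsn = strings.TrimSpace(dsn); dsn == "" {
			continue
		}
		replica, err := sql.Open("pgx", dsn)
		if err != nil {
			return nil, err
		}
		p.Replicas = append(p.Replicas, replica)
	}
	return p, nil
}

// Reader returns the pool read-only queries should use; in this iteration
// that is always the primary.
func (p *Pools) Reader() *sql.DB { return p.Primary }
```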
### Redis topology
* Each service connects to one master plus zero-or-more replicas.
* All connections require a password; Redis ACL usernames are not used.
TLS is off.
* Only the master is used in this iteration; the replica list is wired but
unused. Failover/read routing is added later without a config break.
* The legacy env vars `*_REDIS_TLS_ENABLED` and `*_REDIS_USERNAME` are
removed without a backward-compat shim.
### Library stack and migration discipline
* Driver: `github.com/jackc/pgx/v5`, exposed as `*sql.DB` via
`github.com/jackc/pgx/v5/stdlib` so it is consumable by query builders
written against `database/sql`.
* Query layer: `github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated
code lives under each service's `internal/adapters/postgres/jet/`,
regenerated by a per-service `make jet` target (testcontainers + goose +
jet) and committed to the repo so consumers don't need Docker just to
build.
* Migrations: `github.com/pressly/goose/v3` library API. Migration files
are embedded via `//go:embed *.sql`, applied at service startup before
any listener opens; the service exits non-zero on failure. Files are
forward-only, sequence-numbered, and use the standard `-- +goose Up` /
`-- +goose Down` markers (a startup sketch follows below).
* Single-init policy during pre-launch development: each PG-backed
service ships exactly one migration file, `00001_init.sql`, that
represents the full current schema. New tables, columns, and indexes
are added by editing that file directly rather than by appending
`00002_*.sql`, `00003_*.sql`, etc. The trade-off is intentional —
schema clarity beats migration-history granularity while no production
database exists. Once the platform reaches its first production
deploy, future schema evolution switches to additive sequence-numbered
migrations.
* Test infrastructure: `github.com/testcontainers/testcontainers-go` plus
the `modules/postgres` submodule for unit tests and for `make jet`.
Per-service decision records that capture schema and adapter choices live
at `galaxy/<service>/docs/postgres-migration.md`.
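A startup-migration sketch under those assumptions (embedded `*.sql` files, the goose library API, fail-fast before any listener opens); the package and function names are illustrative:

```go
// Package migrations is assumed to live next to the service's 00001_init.sql.
package migrations

import (
	"database/sql"
	"embed"

	"github.com/pressly/goose/v3"
)

//go:embed *.sql
var files embed.FS

// Up applies all pending migrations against the service's schema. It is
// called during startup before any listener opens; the caller treats an
// error as fatal and exits non-zero.
func Up(db *sql.DB) error {
	goose.SetBaseFS(files)
	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	// "." because the *.sql files are embedded at the package root.
	return goose.Up(db, ".")
}
```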
### Timestamp handling
Every time-valued column in every Galaxy schema is `timestamptz`. The
adapter layer is responsible for ensuring that all `time.Time` values
crossing the SQL boundary carry `time.UTC` as their location; a sketch of
the shared helpers follows the list below.
* **Writes.** Every `time.Time` parameter bound through `database/sql`
(`ExecContext`, `QueryContext`, `QueryRowContext`) is normalised with
`.UTC()` at the binding site. Optional `*time.Time` columns are bound
through a shared helper (`nullableTime` or equivalent per adapter) that
returns `value.UTC()` when non-nil and SQL `NULL` otherwise. Helper
bindings of `cutoff`, `now`, etc. (retention, schedulers) follow the
same rule even when the input was already produced via
`clock.Now().UTC()` — defensive `.UTC()` calls are intentional and
cheap.
* **Reads.** Every `time.Time` scanned out of PostgreSQL is re-wrapped
with `.UTC()` (directly or via a small helper that mirrors
`nullableTime` for the read path) before it leaves the adapter. The
domain layer therefore never observes a `time.Time` whose location is
anything other than `time.UTC`.
* **Why.** PostgreSQL stores `timestamptz` as UTC at rest, but the Go
driver returns scanned values in `time.Local`. Mixing locations across
the boundary produces inequalities in tests, drift in JSON output, and
comparison bugs against pointer fields. The defensive `.UTC()` rule on
both sides removes that class of bug entirely.
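A sketch of the shared helpers this rule implies, assuming plain `database/sql` bindings; only `nullableTime` is named above, the read-path helpers are illustrative counterparts:

```go
package pgtime

import (
	"database/sql"
	"time"
)

// nullableTime covers the write path: a nil pointer becomes SQL NULL, a
// non-nil value is normalised to UTC before it is bound.
func nullableTime(t *time.Time) sql.NullTime {
	if t == nil {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true}
}

// scanTime covers the read path: every value scanned out of PostgreSQL is
// re-wrapped with .UTC() before it leaves the adapter.
func scanTime(t time.Time) time.Time { return t.UTC() }

// scanNullableTime converts an optional column back into a *time.Time whose
// location is always time.UTC.
func scanNullableTime(nt sql.NullTime) *time.Time {
	if !nt.Valid {
		return nil
	}
	u := nt.Time.UTC()
	return &u
}
```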
### Configuration
For each service `<S>` ∈ { `USERSERVICE`, `MAIL`, `NOTIFICATION`,
`LOBBY`, `GATEWAY`, `AUTHSESSION` }, the Redis connection accepts:
* `<S>_REDIS_MASTER_ADDR` (required)
* `<S>_REDIS_REPLICA_ADDRS` (optional, comma-separated)
* `<S>_REDIS_PASSWORD` (required)
* `<S>_REDIS_DB`, `<S>_REDIS_OPERATION_TIMEOUT`
For PG-backed services (`USERSERVICE`, `MAIL`, `NOTIFICATION`, `LOBBY`)
the Postgres connection accepts:
* `<S>_POSTGRES_PRIMARY_DSN` (required;
`postgres://<role>:<pwd>@<host>:5432/galaxy?search_path=<schema>&sslmode=disable`)
* `<S>_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
* `<S>_POSTGRES_OPERATION_TIMEOUT`, `<S>_POSTGRES_MAX_OPEN_CONNS`,
`<S>_POSTGRES_MAX_IDLE_CONNS`, `<S>_POSTGRES_CONN_MAX_LIFETIME`
Stream- and key-shape env vars (`*_REDIS_DOMAIN_EVENTS_STREAM`,
`*_REDIS_LIFECYCLE_EVENTS_STREAM`, `*_REDIS_KEYSPACE_PREFIX`,
`MAIL_REDIS_COMMAND_STREAM`, `NOTIFICATION_INTENTS_STREAM`, etc.) keep
their current names and semantics — they describe stream/key shapes, not
connection topology.
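An illustrative sketch of how a PG-backed service might load these variables; the struct and function names are assumptions, as is treating `*_POSTGRES_OPERATION_TIMEOUT` as a Go duration string:

```go
package config

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// PostgresConfig mirrors the connection variables above for one service.
type PostgresConfig struct {
	PrimaryDSN       string
	ReplicaDSNs      []string
	OperationTimeout time.Duration
}

// LoadPostgres reads <prefix>_POSTGRES_* from the environment, e.g.
// LoadPostgres("MAIL").
func LoadPostgres(prefix string) (PostgresConfig, error) {
	cfg := PostgresConfig{PrimaryDSN: os.Getenv(prefix + "_POSTGRES_PRIMARY_DSN")}
	if cfg.PrimaryDSN == "" {
		return cfg, fmt.Errorf("%s_POSTGRES_PRIMARY_DSN is required", prefix)
	}
	if raw := os.Getenv(prefix + "_POSTGRES_REPLICA_DSNS"); raw != "" {
		for _, dsn := range strings.Split(raw, ",") {
			cfg.ReplicaDSNs = append(cfg.ReplicaDSNs, strings.TrimSpace(dsn))
		}
	}
	if raw := os.Getenv(prefix + "_POSTGRES_OPERATION_TIMEOUT"); raw != "" {
		timeout, err := time.ParseDuration(raw)
		if err != nil {
			return cfg, fmt.Errorf("%s_POSTGRES_OPERATION_TIMEOUT: %w", prefix, err)
		}
		cfg.OperationTimeout = timeout
	}
	return cfg, nil
}
```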
## Main End-to-End Flows
@@ -1122,7 +1250,6 @@ The architecture intentionally does not try to solve all future concerns now.
Current non-goals:
* a separate global SQL storage layer in v1;
* a separate policy engine;
* automatic billing integration in v1;
* automatic match balancing in v1;