feat: use postgres

Ilia Denisov
2026-04-26 20:34:39 +02:00
committed by GitHub
parent 48b0056b49
commit fe829285a6
365 changed files with 29223 additions and 24049 deletions
@@ -6,10 +6,14 @@ and timestamps with values that match the deployment under inspection.
## Example `.env`
A minimum-viable `LOBBY_*` set for a local run against a single Redis
container plus a PostgreSQL container with the `lobby` schema and the
`lobbyservice` role provisioned. The full list with defaults lives in
`../README.md` §Configuration.
```bash
LOBBY_REDIS_ADDR=127.0.0.1:6379
LOBBY_REDIS_MASTER_ADDR=127.0.0.1:6379
LOBBY_REDIS_PASSWORD=local
LOBBY_POSTGRES_PRIMARY_DSN=postgres://lobbyservice:lobbyservice@127.0.0.1:5432/galaxy?search_path=lobby&sslmode=disable
LOBBY_USER_SERVICE_BASE_URL=http://127.0.0.1:8083
LOBBY_GM_BASE_URL=http://127.0.0.1:8096
@@ -19,7 +23,7 @@ LOBBY_INTERNAL_HTTP_ADDR=:8095
LOBBY_LOG_LEVEL=info
LOBBY_SHUTDOWN_TIMEOUT=30s
LOBBY_RACE_NAME_DIRECTORY_BACKEND=postgres
LOBBY_ENROLLMENT_AUTOMATION_INTERVAL=30s
LOBBY_RACE_NAME_EXPIRATION_INTERVAL=1h
@@ -115,16 +119,36 @@ curl -s http://localhost:8095/api/v1/internal/games/game-01HZ...
curl -s http://localhost:8095/api/v1/internal/games/game-01HZ.../memberships
```
## Storage Inspection Examples
### Inspect a game record (PostgreSQL)
```bash
psql "$LOBBY_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM lobby.games WHERE game_id = 'game-01HZ...'"
```
The columns mirror the fields documented in `../README.md` §Game Record Model.
### Inspect open enrollment games (sorted by created_at)
```bash
psql "$LOBBY_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, game_name, created_at FROM lobby.games
WHERE status = 'enrollment_open'
ORDER BY created_at DESC"
```
### Inspect a Race Name Directory binding
```bash
psql "$LOBBY_POSTGRES_PRIMARY_DSN" -c \
"SELECT canonical_key, game_id, holder_user_id, race_name, binding_kind,
source_game_id, eligible_until_ms, registered_at_ms
FROM lobby.race_names WHERE race_name = 'Aurora'"
```
## Redis Examples
### Publish a runtime job result (Runtime Manager simulation)
@@ -162,12 +186,6 @@ redis-cli XADD gm:lobby_events '*' \
finished_at_ms 1714123456789
```
## Notification Intent Format
Lobby produces every notification through `pkg/notificationintent` and
@@ -0,0 +1,386 @@
# PostgreSQL Migration
PG_PLAN.md §6A migrated the four core enrollment entities of Game Lobby
Service — `Game`, `Application`, `Invite`, `Membership` — from Redis-only
durable storage to the steady-state Redis + PostgreSQL split codified in
`ARCHITECTURE.md §Persistence Backends`. PG_PLAN.md §6B then moved the
Race Name Directory onto PostgreSQL, retiring the Redis Lua scripts and
canonical-lookup cache that backed it. PG_PLAN.md §6C confirmed which
runtime-coordination state intentionally stays on Redis (per-game
`game_turn_stats`, `gap_activated_at`, `capability_evaluation:done:*`,
`stream_offsets:*`, plus the event-bus streams themselves) and pruned the
remaining redisstate keyspace.
This document records the schema decisions and the non-obvious agreements
behind them. Use it together with the migration scripts under
`internal/adapters/postgres/migrations/` and the runtime wiring
(`internal/app/runtime.go`).
## Outcomes
- Schema `lobby` (provisioned externally) holds four tables: `games`,
`applications`, `invites`, `memberships`. A partial UNIQUE index on
`applications(applicant_user_id, game_id) WHERE status <> 'rejected'`
enforces the single-active-application constraint at the database
level.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to the Race Name
Directory adapter, the per-game stats / gap-activation /
evaluation-guard / stream-offset stores, the consumer pipelines, and
the notification-intent publisher.
- The Redis adapter package (`internal/adapters/redisstate/`) keeps the
surviving stores (`racenamedir`, `gameturnstatsstore`,
`gapactivationstore`, `evaluationguardstore`, `streamoffsetstore`,
`streamlagprobe`) and the keyspace methods that back them; the
game/application/invite/membership stores, codecs, tests, and
per-record TTL constants are gone.
- Configuration drops `LOBBY_REDIS_ADDR`, `LOBBY_REDIS_USERNAME`,
`LOBBY_REDIS_TLS_ENABLED` and introduces `LOBBY_REDIS_MASTER_ADDR`,
`LOBBY_REDIS_REPLICA_ADDRS`, `LOBBY_REDIS_PASSWORD`,
`LOBBY_POSTGRES_PRIMARY_DSN`, `LOBBY_POSTGRES_REPLICA_DSNS`, plus
  the standard `LOBBY_POSTGRES_*` pool tuning knobs. Setting either
  `LOBBY_REDIS_USERNAME` or `LOBBY_REDIS_TLS_ENABLED` now fails fast at
  startup via the shared `pkg/redisconn.LoadFromEnv` rejection path.
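The fail-fast rejection of the retired env vars can be sketched as a small check over the process environment. The function name and error text below are hypothetical; only the two env-var names come from the text above, and the real `pkg/redisconn.LoadFromEnv` signature may differ.

```go
package main

import "fmt"

// rejectRetired sketches the rejection path: if either retired Redis
// env var is still set, configuration loading returns an error and
// startup aborts instead of silently ignoring the value.
func rejectRetired(env map[string]string) error {
	for _, key := range []string{"LOBBY_REDIS_USERNAME", "LOBBY_REDIS_TLS_ENABLED"} {
		if _, set := env[key]; set {
			return fmt.Errorf("%s was retired in PG_PLAN.md §6A; unset it", key)
		}
	}
	return nil
}

func main() {
	// Clean environment passes; a retired var fails fast.
	fmt.Println(rejectRetired(map[string]string{"LOBBY_REDIS_MASTER_ADDR": "127.0.0.1:6379"}))
	fmt.Println(rejectRetired(map[string]string{"LOBBY_REDIS_TLS_ENABLED": "true"}))
}
```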
## Decisions
### 1. One schema, externally provisioned role
**Decision.** The `lobby` schema and the matching `lobbyservice` role
are created outside the migration sequence (in tests, by
`integration/internal/harness/postgres_container.go::EnsureRoleAndSchema`;
in production, by an ops init script not in scope for this stage). The
embedded migration `00001_init.sql` only contains DDL for tables and
indexes and assumes it runs as the schema owner with
`search_path=lobby`.
**Why.** Mirrors the precedent set by Notification Stage 5 and Mail
Stage 4 and matches the schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`). Mixing role + schema + table
DDL into one script would force every consumer of the migration to run
as a superuser; splitting them lines up with the operational split
(ops provisions roles and schemas, the service applies schema-scoped
migrations).
### 2. Single-active application = partial UNIQUE on `applications`
**Decision.** `applications` carries a partial UNIQUE index on
`(applicant_user_id, game_id) WHERE status <> 'rejected'`. INSERT
attempts that violate the constraint are surfaced to the service layer
as `application.ErrConflict` via the shared
`sqlx.IsUniqueViolation` helper.
**Why.** Replaces the Redis lookup key `lobby:user_game_application:*:*`
with a deterministic database-level invariant. Multiple `rejected`
rows are intentionally allowed (one applicant may submit, get rejected,
and resubmit); the UNIQUE fires only when a second non-rejected row
exists for the same `(user, game)`. The constraint is
race-safe: under concurrent submission attempts one INSERT wins, the
others fail with conflict.
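A minimal sketch of the error translation, assuming a driver error that exposes its SQLSTATE (as `pgconn.PgError` does in real stacks). The local types and names are illustrative, not the actual `sqlx.IsUniqueViolation` helper.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlstateError stands in for a driver error carrying a SQLSTATE code.
type sqlstateError struct{ code string }

func (e *sqlstateError) Error() string { return "SQLSTATE " + e.code }

var errConflict = errors.New("application: conflict")

// mapInsertErr sketches how the store surfaces the partial-UNIQUE
// violation: SQLSTATE 23505 (unique_violation) becomes the domain
// conflict error; everything else passes through untouched.
func mapInsertErr(err error) error {
	var se *sqlstateError
	if errors.As(err, &se) && se.code == "23505" {
		return errConflict
	}
	return err
}

func main() {
	fmt.Println(mapInsertErr(&sqlstateError{code: "23505"}) == errConflict)
	fmt.Println(mapInsertErr(errors.New("connection reset")) == errConflict)
}
```

Under concurrent submissions, both INSERTs race; the loser's driver error carries 23505 and surfaces as `application.ErrConflict`, matching the race-safety claim above.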
### 3. Public games carry an empty `owner_user_id`; partial index excludes them
**Decision.** `games.owner_user_id` is `text NOT NULL DEFAULT ''`, and
the secondary `games_owner_idx` is partial: `WHERE game_type = 'private'`.
Public games (admin-owned) carry an empty owner string and are excluded
from the index entirely.
**Why.** Mirrors the previous Redis behaviour where `games_by_owner:*`
sets were created only for private games. The partial index keeps the
owner lookup tight (only private-game rows participate) while letting
the column stay non-nullable and consistent with the domain model.
### 4. JSONB columns for runtime snapshot and runtime binding
**Decision.** `games.runtime_snapshot` is `jsonb NOT NULL DEFAULT
'{}'::jsonb`; `games.runtime_binding` is `jsonb NULL`. The JSON shapes
used inside both columns are stable and live in
`internal/adapters/postgres/gamestore/codecs.go`. `runtime_binding`
binds NULL when the domain pointer is nil, otherwise an object with
`container_id`, `engine_endpoint`, `runtime_job_id`, `bound_at_ms`
fields.
**Why.** Both fields are opaque to queries — Lobby never filters on
their internals. JSONB matches the "everything outside primary
fields is JSON" pattern Notification Stage 5 already established and
allows a future GIN index without a schema rewrite. The `bound_at_ms`
field inside the binding stays in Unix milliseconds so the encoded
payload is directly comparable across Redis and PostgreSQL audits
during the transition window.
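The nil-pointer-to-NULL convention can be illustrated with a small encoder. The Go struct below is a guess at the domain shape; only the JSON field names come from the decision above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeBinding mirrors the documented JSONB object for
// games.runtime_binding; the Go type itself is illustrative.
type runtimeBinding struct {
	ContainerID    string `json:"container_id"`
	EngineEndpoint string `json:"engine_endpoint"`
	RuntimeJobID   string `json:"runtime_job_id"`
	BoundAtMS      int64  `json:"bound_at_ms"`
}

// encodeBinding returns nil (bound as SQL NULL) for a nil domain
// pointer, otherwise the JSONB payload for the column.
func encodeBinding(b *runtimeBinding) ([]byte, error) {
	if b == nil {
		return nil, nil
	}
	return json.Marshal(b)
}

func main() {
	null, _ := encodeBinding(nil)
	fmt.Println(null == nil) // nil pointer binds as NULL

	payload, _ := encodeBinding(&runtimeBinding{ContainerID: "c-1", BoundAtMS: 1714123456789})
	fmt.Println(string(payload))
}
```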
### 5. Optimistic concurrency via current-status compare-and-swap
**Decision.** `UpdateStatus` on every store is implemented as `UPDATE …
WHERE id = $X AND status = $expected`. A zero-rows result is
disambiguated with a follow-up `SELECT status` probe — missing rows map
to the per-domain `ErrNotFound`, mismatches map to `ErrConflict`.
Snapshot/binding overrides on `games` use the same pattern but only
guard on the primary key (no expected-status gate).
**Why.** Mirrors the previous Redis WATCH/TxPipelined behaviour without
holding a `SELECT … FOR UPDATE` lock across application logic. The
compare-and-swap is local to one statement, never spans more than one
network round trip, and produces the same observable error semantics
the service layer already depends on.
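The zero-rows disambiguation reduces to a pure decision over the UPDATE result and the follow-up probe. This is a sketch with hypothetical error variables and function name, not the store's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotFound = errors.New("game: not found")
	errConflict = errors.New("game: conflict")
)

// classifyCAS sketches the probe after `UPDATE … WHERE id = $X AND
// status = $expected` touched zero rows: a missing row is not-found, a
// surviving row with a different status is a lost compare-and-swap.
func classifyCAS(rowsAffected int64, probeFound bool, probeStatus, expected string) error {
	if rowsAffected > 0 {
		return nil
	}
	if !probeFound {
		return errNotFound
	}
	if probeStatus != expected {
		return errConflict
	}
	// Zero rows but a matching probe status means a concurrent writer
	// flipped the row between the two statements; stay conservative.
	return errConflict
}

func main() {
	fmt.Println(classifyCAS(1, true, "enrollment_open", "enrollment_open"))
	fmt.Println(classifyCAS(0, false, "", "enrollment_open"))
	fmt.Println(classifyCAS(0, true, "running", "enrollment_open"))
}
```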
### 6. Memberships store `race_name` and `canonical_key` side by side
**Decision.** `memberships` carries both `race_name` (original casing)
and `canonical_key` (policy-derived form) as separate `text NOT NULL`
columns. There is no UNIQUE constraint on `canonical_key`.
**Why.** Downstream consumers — capability evaluation and the
user-lifecycle cascade — read the canonical form directly without
re-deriving it from `race_name`, which is the same arrangement the
Redis JSON record had. Race-name uniqueness across the platform
remains the responsibility of the Race Name Directory; enforcing a
UNIQUE on memberships' canonical_key now would duplicate the RND
invariant and create deadlock potential between the two stores.
### 7. ON DELETE CASCADE from games to children
**Decision.** Each child table (`applications`, `invites`,
`memberships`) declares its `game_id` as `REFERENCES games(game_id) ON
DELETE CASCADE`.
**Why.** Lobby code never deletes games today — every terminal status
is a soft state — so the cascade has no live trigger. It exists for
two future paths: scheduled cleanup of `cancelled` games far past
retention, and explicit operator/test resets. CASCADE keeps those paths
trivial and free of dangling references.
### 8. Listing order: most-recent-first for games, oldest-first for child tables
**Decision.** `GetByStatus` and `GetByOwner` on `games` order by
`created_at DESC, game_id DESC`. The per-game/per-user listings on
`applications`, `invites`, `memberships` order by `created_at ASC,
<id> ASC` (memberships order by `joined_at ASC`).
**Why.** Game listings serve user-facing feeds where most-recent-first
is the natural expectation, matching the previous Redis sorted-set
score and the `accounts.created_at DESC` convention from User Stage 3.
Child-table listings serve administrative and cascade flows where the
chronological order helps operators reason about the sequence of
events. The ports doc explicitly says "order is adapter-defined", so
either convention is contract-compatible.
### 9. Heavy `runtime_test.go` / `runtime_smoke_test.go` deleted; integration coverage
**Decision.** The service-local `internal/app/runtime_test.go` and
`runtime_smoke_test.go` were removed. Black-box runtime coverage moves
to the `integration/lobbyuser` and `integration/lobbynotification`
suites, which now spin up both a PostgreSQL container (via
`harness.StartLobbyServicePersistence`) and the existing Redis
container.
**Why.** Mirrors the Mail Stage 4 / Notification Stage 5 precedent.
Booting a full Lobby runtime now requires both PostgreSQL and Redis,
which is the integration-suite shape; duplicating that bootstrap
inside `internal/app/` would be heavy and fragile. The remaining
service-local tests cover units that do not require the full runtime.
### 10. Query layer is `go-jet/jet/v2`
**Decision.** All four PG-store packages build SQL through the jet
builder API (`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/COALESCE/...` DSL). Generated table models live under
`internal/adapters/postgres/jet/lobby/{model,table}/` and are
regenerated by `make jet` (which spins up a transient PostgreSQL via
testcontainers, applies the embedded goose migrations, and runs jet's
generator). Generated code is committed.
**Why.** Aligns with `PG_PLAN.md` §Library stack ("Query layer:
`github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated code lives
under each service `internal/adapters/postgres/jet/`, regenerated via
a `make jet` target and committed to the repo"). PostgreSQL constructs
that the jet builder does not cover natively (`FOR UPDATE`,
`COALESCE`, `LOWER` on subselects, JSONB params) are expressed through
the per-DSL helpers (`.FOR(pg.UPDATE())`, `pg.COALESCE`, `pg.LOWER`,
direct `[]byte`/string params for JSONB columns). Manual `rowScanner`
helpers (`scanGame`, `scanApplication`, `scanInvite`,
`scanMembership`) preserve the codecs.go boundary translations and
domain-type mapping; jet only owns SQL construction.
## Out of scope for §6A
- Read routing through `LOBBY_POSTGRES_REPLICA_DSNS` — config exposes
the field, runtime ignores it.
- Production provisioning of the `lobby` schema and `lobbyservice`
role — operational concern handled outside the service binary.
## §6B — Race Name Directory on PostgreSQL
§6B replaces the Redis-backed Race Name Directory (one Lua script + a
canonical-lookup cache + a pending-index ZSET + per-binding string keys)
with a single PostgreSQL table `race_names` whose rows back all three
binding kinds (`registered`, `reservation`, `pending_registration`).
The `race_names` DDL lives in `00001_init.sql` next to the four core
enrollment tables (it was originally introduced as a separate
`00002_race_names.sql`; PG_PLAN.md §9 collapsed the two files into one
init migration during the pre-launch development window). The adapter
`internal/adapters/postgres/racenamedir/directory.go` is the canonical
reference; the architecture rule is unchanged from §6A.
### 11. One table, composite primary key `(canonical_key, game_id)`
**Decision.** `race_names` carries one row per binding under the
composite primary key `(canonical_key, game_id)`. Reservations and
pending_registrations write the actual game id; registered rows write
`game_id = ''` and keep the source game in `source_game_id`. A partial
UNIQUE index on `(canonical_key)` filtered to `binding_kind =
'registered'` enforces the single-registered-per-canonical rule.
**Why.** PG_PLAN.md §6B sketched the table as `(canonical_key PK, …)`,
but the existing port semantics (`testReserveCrossGame`,
`testReleaseReservationKeepsCrossGame` in
`internal/ports/racenamedirtest/suite.go`) require the same user to hold
several per-game reservations on one canonical key concurrently. A flat
single-PK table cannot model that without losing the per-game
identity. The composite PK matches both invariants — at most one row per
(canonical, game) and at most one registered row per canonical — without
splitting the data into two tables (which would force every write
operation to touch two unrelated indexes and reproduce the old
canonical-lookup cache invariant manually).
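The two invariants can be stated as a small in-memory check. The `row` struct and function below are purely illustrative; server-side, the composite PK and the partial UNIQUE index enforce the same rules.

```go
package main

import "fmt"

// row mirrors the race_names key columns; the struct is illustrative.
type row struct {
	canonicalKey, gameID, bindingKind string
}

// violates reports whether a row set breaks either invariant:
// at most one row per (canonical_key, game_id), and at most one
// 'registered' row per canonical key.
func violates(rows []row) bool {
	pk := map[[2]string]bool{}
	registered := map[string]bool{}
	for _, r := range rows {
		key := [2]string{r.canonicalKey, r.gameID}
		if pk[key] {
			return true
		}
		pk[key] = true
		if r.bindingKind == "registered" {
			if registered[r.canonicalKey] {
				return true
			}
			registered[r.canonicalKey] = true
		}
	}
	return false
}

func main() {
	// One registered row plus two per-game reservations on the same
	// canonical is exactly the state the port test suite requires.
	ok := []row{
		{"aurora", "", "registered"},
		{"aurora", "game-1", "reservation"},
		{"aurora", "game-2", "reservation"},
	}
	fmt.Println(violates(ok))

	// A second registered row (even under a different game_id, which
	// the PK alone would allow) trips the partial UNIQUE.
	fmt.Println(violates(append(ok, row{"aurora", "game-3", "registered"})))
}
```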
### 12. Concurrency: PostgreSQL transactional advisory locks
**Decision.** Every write operation (`Reserve`, `MarkPendingRegistration`,
`Register`, `ReleaseReservation`, the per-row branch of
`ExpirePendingRegistrations`) opens a `BEGIN; …; COMMIT` and acquires
`pg_advisory_xact_lock(hashtextextended($canonical_key, 0))` as the very
first statement. The lock auto-releases on commit or rollback.
`ReleaseAllByUser` is a single `DELETE WHERE holder_user_id = $1` and
takes no advisory lock — it runs on permanent_blocked / deleted
lifecycle events, so the user being deleted cannot be a concurrent
writer on those bindings.
**Why.** PG_PLAN.md §6B explicitly authorised either `SELECT … FOR
UPDATE` or advisory locks. `SELECT … FOR UPDATE` cannot serialize
against not-yet-existing rows (e.g. concurrent first-time `Reserve`s for
the same canonical), so advisory locks are required for race-free
INSERTs. Hashing through `hashtextextended` produces a 64-bit lock key
covering arbitrary canonical strings, sidestepping the 32-bit key
space of the older `hashtext`. Holding the lock for one transaction
keeps the contention surface tight and matches the Notification §5
"narrow CAS, no application-logic-bound row locks" precedent.
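The lock-first statement order can be sketched against a recording stand-in for `*sql.Tx`. The adapter's real method bodies differ (availability checks between the lock and the INSERT are elided here), and the SQL text is an assumption modeled on the decision above.

```go
package main

import "fmt"

// execRecorder stands in for a *sql.Tx; it records the SQL issued so
// the statement order can be checked without a live database.
type execRecorder struct{ stmts []string }

func (t *execRecorder) Exec(query string, args ...any) {
	t.stmts = append(t.stmts, query)
}

// reserve sketches the write-path shape: the advisory lock on the
// canonical key is the very first statement inside the transaction,
// before anything reads or writes race_names. (Availability checks
// that the real adapter runs between the two statements are elided.)
func reserve(tx *execRecorder, canonicalKey, gameID, userID string) {
	tx.Exec("SELECT pg_advisory_xact_lock(hashtextextended($1, 0))", canonicalKey)
	tx.Exec("INSERT INTO lobby.race_names (canonical_key, game_id, holder_user_id, binding_kind) VALUES ($1, $2, $3, 'reservation')",
		canonicalKey, gameID, userID)
}

func main() {
	tx := &execRecorder{}
	reserve(tx, "aurora", "game-1", "user-9")
	fmt.Println(len(tx.stmts))
	fmt.Println(tx.stmts[0]) // the lock precedes every other statement
}
```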
### 13. `binding_kind` values match `ports.Kind*` verbatim
**Decision.** `race_names.binding_kind` stores `"registered"`,
`"reservation"`, or `"pending_registration"` — the same string literals
exported by `ports.KindRegistered`, `ports.KindReservation`,
`ports.KindPendingRegistration`. The adapter returns the raw value
directly through `Availability.Kind` without translation. A `CHECK`
constraint on the column rejects anything else.
**Why.** Avoids one boundary translation and one synonym ("reserved" vs
"reservation") that the Redis adapter carried internally as
`reservationStatusReserved = "reserved"`. With the port-equivalent
literals on disk, future operator-side queries (`SELECT … WHERE
binding_kind = 'reservation'`) match the Go-level constants 1:1, and
the adapter saves a `switch` per `Check` call.
### 14. `Check` returns the strongest binding via in-process priority
**Decision.** `Check` issues `SELECT holder_user_id, binding_kind FROM
race_names WHERE canonical_key = $1` and picks the strongest binding in
Go using a priority rank `registered > pending_registration >
reservation`. There is no SQL `CASE` expression in the ORDER BY.
**Why.** The dataset per canonical is bounded (at most one registered +
one row per active game) and is read frequently by every `Check`. The
Go-side rank avoids a SQL DSL detour that go-jet/v2 would express via
raw SQL anyway, and it keeps the query plan a single index scan on
`canonical_key`.
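A sketch of the in-process rank; the numeric priority values are arbitrary, only their ordering matters, and the helper names are illustrative.

```go
package main

import "fmt"

// rank encodes the priority from the decision above:
// registered > pending_registration > reservation. Unknown kinds rank
// lowest so a corrupt row can never shadow a real binding.
func rank(kind string) int {
	switch kind {
	case "registered":
		return 3
	case "pending_registration":
		return 2
	case "reservation":
		return 1
	}
	return 0
}

type binding struct{ holder, kind string }

// strongest picks the highest-priority binding from the rows returned
// by the single index scan on canonical_key.
func strongest(rows []binding) (binding, bool) {
	var best binding
	found := false
	for _, r := range rows {
		if !found || rank(r.kind) > rank(best.kind) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	rows := []binding{
		{"user-2", "reservation"},
		{"user-1", "registered"},
		{"user-3", "pending_registration"},
	}
	b, _ := strongest(rows)
	fmt.Println(b.kind) // registered wins regardless of row order
}
```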
### 15. `ExpirePendingRegistrations` scans then locks per row
**Decision.** The expirer first runs an indexed scan
`WHERE binding_kind = 'pending_registration' AND eligible_until_ms <=
$cutoff` (served by `race_names_pending_eligible_idx`), then re-reads
each candidate inside its own advisory-locked transaction, asserts the
binding is still pending and still expired, and DELETEs it. Concurrent
`Register` or `ReleaseReservation` simply causes the per-row branch to
skip without error.
**Why.** Mirrors the Redis adapter's two-phase `ZRANGEBYSCORE` +
per-member release loop. A bulk `DELETE … WHERE eligible_until_ms <= …`
would not produce the per-entry `ports.ExpiredPending` slice the worker
needs for telemetry, and would race with `Register` (which targets the
same row).
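The per-row recheck inside the advisory-locked transaction reduces to one predicate; the name below is illustrative.

```go
package main

import "fmt"

// stillExpirable sketches the per-row assertion: the binding must
// still be pending and still past the cutoff, otherwise the expirer
// skips it without error (a concurrent Register or
// ReleaseReservation won the race).
func stillExpirable(kind string, eligibleUntilMS, cutoffMS int64) bool {
	return kind == "pending_registration" && eligibleUntilMS <= cutoffMS
}

func main() {
	cutoff := int64(1714123456789)
	fmt.Println(stillExpirable("pending_registration", cutoff-1, cutoff))      // expire
	fmt.Println(stillExpirable("registered", cutoff-1, cutoff))                // skip: Register won
	fmt.Println(stillExpirable("pending_registration", cutoff+60_000, cutoff)) // skip: re-armed later
}
```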
### 16. Shared port test suite stays on PostgreSQL via a serial harness
**Decision.** The shared `racenamedirtest` suite no longer calls
`t.Parallel()` from its subtests. Every subtest goes through the
factory, which truncates the lobby tables and constructs a fresh
adapter against the package-shared testcontainers PostgreSQL.
**Why.** The PostgreSQL adapter relies on `pgtest.TruncateAll` between
factory invocations; running subtests in parallel against one shared
container would race truncate against other subtests' INSERTs. Spinning
up a per-subtest schema would multiply container provisioning cost
significantly (PG generation step alone takes minutes per fresh
container), and the suite is fast enough serially. The Redis-only
backend retired in §6B no longer needs the parallelism either; only the
in-process stub remains in scope and has trivial setup cost.
## §6C — Workers, ephemeral stores, cleanup
§6C closes the Lobby migration: it confirms what intentionally stays on
Redis, prunes the dead Redis adapter code, and finalises the
service-layer documentation.
### 17. Workers stayed on ports — no functional change
**Decision.** The four Lobby workers (`pendingregistration`,
`gmevents`, `runtimejobresult`, `userlifecycle`) and the
`enrollmentautomation` worker shipped in §6A already consume their
storage through ports. After §6B the `RaceNameDirectory` port resolves
to the PostgreSQL adapter; no worker required code changes.
**Why.** §6A established the port-on-storage seam for `GameStore`,
`ApplicationStore`, `InviteStore`, `MembershipStore`. §6B kept the same
contract for `RaceNameDirectory`. Worker logic depends on the contract,
not the backend, so the migration completes via a wiring switch in
`internal/app/wiring.go::buildRaceNameDirectory` without re-touching
worker code.
### 18. `redisstate` retains only runtime-coordination adapters
**Decision.** After §6C the `internal/adapters/redisstate/` package
implements only `GameTurnStatsStore`, `GapActivationStore`,
`EvaluationGuardStore`, `StreamOffsetStore`, and the `StreamLagProbe`.
The legacy `racenamedir.go`, `racenamedir_lua.go`,
`racenamedir_test.go`, `codecs_racename.go`, and the dead game
codecs (`codecs.go`'s `MarshalGame`/`UnmarshalGame`) are removed. The
`Keyspace` type only builds keys for the surviving adapters
(`GapActivatedAt`, `StreamOffset`, `GameTurnStat`,
`GameTurnStatsByGame`, `CapabilityEvaluationGuard`).
**Why.** Architectural rule (`ARCHITECTURE.md §Persistence Backends`):
Redis owns runtime-coordination state, PostgreSQL owns durable business
state. The retained Redis stores back ephemeral per-game aggregates
(`game_turn_stats`), short-lived sentinels (`gap_activated_at`,
`capability_evaluation:done:*`), and the consumer-offset coordination
state (`stream_offsets:*`) — all rebuildable or losable without
durability impact. Streams stay on Redis because they *are* the event
bus.
### 19. Default Race Name Directory backend is `postgres`
**Decision.** `LOBBY_RACE_NAME_DIRECTORY_BACKEND` defaults to
`"postgres"`. The accepted values are `postgres` (production) and
`stub` (in-process for unit tests that do not need a real PostgreSQL).
The `redis` value, the corresponding `RaceNameDirectoryBackendRedis`
constant, and the wiring branch are removed.
**Why.** The Redis adapter is gone; keeping the value in the validator
would produce a misleading "configuration accepted, but startup fails
when wiring resolves the directory" path. Leaving `stub` as a valid
backend lets per-service unit tests run against a small, fast
in-process directory; integration suites use `postgres` via the
testcontainers harness.
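A sketch of the resulting validator; the function name and error text are hypothetical, only the accepted values come from the decision above.

```go
package main

import "fmt"

// validateBackend sketches the post-§6C validation: only "postgres"
// and "stub" survive; "redis" is rejected at config-load time instead
// of failing later when wiring resolves the directory.
func validateBackend(v string) error {
	switch v {
	case "postgres", "stub":
		return nil
	}
	return fmt.Errorf("LOBBY_RACE_NAME_DIRECTORY_BACKEND: unsupported value %q", v)
}

func main() {
	fmt.Println(validateBackend("postgres"))
	fmt.Println(validateBackend("redis"))
}
```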
@@ -7,8 +7,23 @@ readiness, shutdown, and the handful of recovery paths specific to Lobby.
Before starting the process, confirm:
- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis
deployment used for the runtime-coordination state that intentionally
stays on Redis: stream consumers/publishers, stream offsets, per-game
turn-stats aggregates, gap-activation timestamps, and the
capability-evaluation guard. The deprecated `LOBBY_REDIS_ADDR`,
`LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were
retired in PG_PLAN.md §6A; setting either of the latter two now fails
fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that
hosts the `lobby` schema. The DSN must include `search_path=lobby` and
`sslmode=disable`. Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. After PG_PLAN.md §6A the schema holds
`games`, `applications`, `invites`, `memberships`; after §6B it also
holds `race_names`. The schema and the `lobbyservice` role are
provisioned externally (operator init script in production, the
testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
the network the Lobby pods run in. Lobby does not ping these at boot,
but transport failures against them will surface as request errors.
@@ -19,11 +34,13 @@ Before starting the process, confirm:
- `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
- `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production
(the default after PG_PLAN.md §6B); the `stub` value is only for
unit tests that do not need a real PostgreSQL.
At startup the process opens the PostgreSQL pool, applies migrations,
pings PostgreSQL, then opens the Redis client and pings Redis. Startup
fails fast if any step fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
Expected listener state after a healthy start:
@@ -160,11 +177,15 @@ is reachable again.
To inspect the backlog:
```bash
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
```
Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000`
are expirable on the next tick. The partial index
`race_names_pending_eligible_idx` keeps this scan cheap.
## Cascade Release Operator Notes
@@ -195,26 +216,34 @@ out-of-band.
## Diagnostic Queries
Durable enrollment state and Race Name Directory bindings live in
PostgreSQL; runtime coordination state stays in Redis. A handful of CLI
snippets help during incidents:
```bash
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"
# Inspect a specific game record
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"
# Member roster for a game
psql -c "SELECT user_id, race_name, status, joined_at
FROM lobby.memberships
WHERE game_id = '<game_id>'
ORDER BY joined_at"
# Race name pending entries (oldest first)
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw PostgreSQL and Redis access is for last-resort
triage.
@@ -56,9 +56,10 @@ flowchart LR
Notes:
- `cmd/lobby` refuses startup when Redis connectivity is misconfigured, when
PostgreSQL is unreachable, or when the embedded goose migrations fail to
apply. User Service and Game Master reachability are not verified at boot;
transport failures surface as request errors.
- Both HTTP listeners expose `/healthz` and `/readyz` independently so health
checks can target either port.
- `register-runtime` is an outgoing call from Lobby to Game Master after the
@@ -85,7 +86,7 @@ Probe routes:
- `GET /healthz` returns `{"status":"ok"}`
- `GET /readyz` returns `{"status":"ready"}` once startup wiring completes.
- Neither probe performs a live Redis or PostgreSQL ping per request.
- There is no `/metrics` route. Metrics flow through OpenTelemetry exporters.
## Background Workers
@@ -130,13 +131,20 @@ lags or stalls, the gauge climbs and stays high.
The full env-var list with defaults lives in `../README.md` §Configuration.
The groups below summarize the structure:
- **Required** — `LOBBY_REDIS_MASTER_ADDR`, `LOBBY_REDIS_PASSWORD`,
`LOBBY_POSTGRES_PRIMARY_DSN`, `LOBBY_USER_SERVICE_BASE_URL`,
`LOBBY_GM_BASE_URL`.
- **Process and logging** — `LOBBY_SHUTDOWN_TIMEOUT`, `LOBBY_LOG_LEVEL`.
- **HTTP listeners** — `LOBBY_PUBLIC_HTTP_*`, `LOBBY_INTERNAL_HTTP_*`.
- **Redis connectivity** — `LOBBY_REDIS_MASTER_ADDR`,
`LOBBY_REDIS_REPLICA_ADDRS`, `LOBBY_REDIS_PASSWORD`, `LOBBY_REDIS_DB`,
`LOBBY_REDIS_OPERATION_TIMEOUT` (legacy `LOBBY_REDIS_ADDR`,
`LOBBY_REDIS_TLS_ENABLED`, `LOBBY_REDIS_USERNAME` removed in PG_PLAN.md
§6A).
- **PostgreSQL connectivity** — `LOBBY_POSTGRES_PRIMARY_DSN`,
`LOBBY_POSTGRES_REPLICA_DSNS`, `LOBBY_POSTGRES_OPERATION_TIMEOUT`,
`LOBBY_POSTGRES_MAX_OPEN_CONNS`, `LOBBY_POSTGRES_MAX_IDLE_CONNS`,
`LOBBY_POSTGRES_CONN_MAX_LIFETIME`.
- **Streams** — `LOBBY_GM_EVENTS_STREAM`, `LOBBY_RUNTIME_START_JOBS_STREAM`,
`LOBBY_RUNTIME_STOP_JOBS_STREAM`, `LOBBY_RUNTIME_JOB_RESULTS_STREAM`,
`LOBBY_NOTIFICATION_INTENTS_STREAM`, `LOBBY_USER_LIFECYCLE_STREAM`.
@@ -152,9 +160,9 @@ The groups below summarize the structure:
- `Game Lobby` owns platform game state. Game Master may cache snapshots but
is not the source of truth.
- The Race Name Directory ships a PostgreSQL adapter (default after
PG_PLAN.md §6B) and an in-process stub. The stub is intended for unit
tests and is selected via `LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- A `permanent_block` or `deleted` event from User Service fans out
asynchronously through the `user:lifecycle_events` consumer; in-flight
games owned by the affected user receive a stop-job and transition to