feat: use postgres

Ilia Denisov
2026-04-26 20:34:39 +02:00
committed by GitHub
parent 48b0056b49
commit fe829285a6
365 changed files with 29223 additions and 24049 deletions
+47 -18
@@ -7,8 +7,23 @@ readiness, shutdown, and the handful of recovery paths specific to Lobby.
Before starting the process, confirm:
- `LOBBY_REDIS_ADDR` points to the Redis deployment used for state and the
five Lobby-related streams.
- `LOBBY_REDIS_MASTER_ADDR` and `LOBBY_REDIS_PASSWORD` point to the Redis
deployment used for the runtime-coordination state that intentionally
stays on Redis: stream consumers/publishers, stream offsets, per-game
turn-stats aggregates, gap-activation timestamps, and the
capability-evaluation guard. The `LOBBY_REDIS_ADDR`,
`LOBBY_REDIS_USERNAME`, and `LOBBY_REDIS_TLS_ENABLED` env vars were
retired in PG_PLAN.md §6A; setting `LOBBY_REDIS_USERNAME` or
`LOBBY_REDIS_TLS_ENABLED` now fails fast at startup.
- `LOBBY_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary that
hosts the `lobby` schema. The DSN must include `search_path=lobby` and
`sslmode=disable`. Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. After PG_PLAN.md §6A the schema holds
`games`, `applications`, `invites`, `memberships`; after §6B it also
holds `race_names`. The schema and the `lobbyservice` role are
provisioned externally (operator init script in production, the
testcontainers harness in tests).
- `LOBBY_USER_SERVICE_BASE_URL` and `LOBBY_GM_BASE_URL` are reachable from
the network the Lobby pods run in. Lobby does not ping these at boot,
but transport failures against them will surface as request errors.
@@ -19,11 +34,13 @@ Before starting the process, confirm:
- `LOBBY_RUNTIME_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `LOBBY_USER_LIFECYCLE_STREAM` (default `user:lifecycle_events`)
- `LOBBY_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `redis` for production; the
`stub` value is only for unit tests.
- `LOBBY_RACE_NAME_DIRECTORY_BACKEND` is `postgres` for production
(the default after PG_PLAN.md §6B); the `stub` value is only for
unit tests that do not need a real PostgreSQL.
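The DSN requirements in the checklist above (`search_path=lobby` and
`sslmode=disable`) can be verified before the process starts. The helper
below is a minimal sketch, not part of the Lobby codebase: the function names
`dsn_has_param` and `check_lobby_dsn` are hypothetical, and it assumes the DSN
is written in URL form (`postgres://...?key=value&...`).

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper; not part of the Lobby codebase.
# Assumes a URL-form DSN: postgres://user:pass@host:port/db?key=value&...

# dsn_has_param DSN KEY VALUE
# Succeeds only when the query string carries exactly KEY=VALUE.
dsn_has_param() {
  local dsn=$1 want="$2=$3" query pair
  # No '?' means the DSN has no query string at all.
  [[ $dsn == *\?* ]] || return 1
  query=${dsn#*\?}
  local IFS='&'
  for pair in $query; do
    [[ $pair == "$want" ]] && return 0
  done
  return 1
}

# check_lobby_dsn DSN
# Reports every missing parameter on stderr and returns non-zero if any is absent.
check_lobby_dsn() {
  local dsn=$1 rc=0
  dsn_has_param "$dsn" search_path lobby ||
    { echo "DSN is missing search_path=lobby" >&2; rc=1; }
  dsn_has_param "$dsn" sslmode disable ||
    { echo "DSN is missing sslmode=disable" >&2; rc=1; }
  return "$rc"
}
```

A deploy script could call `check_lobby_dsn "$LOBBY_POSTGRES_PRIMARY_DSN" || exit 1`
before launching the service. The comparison is exact per parameter, so a value
such as `search_path=lobby_old` is rejected rather than matched as a substring.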
At startup the process performs a bounded `PING` against Redis. Startup
fails fast if the ping fails. There are no liveness checks against User
At startup the process opens the PostgreSQL pool, applies migrations,
pings PostgreSQL, then opens the Redis client and pings Redis. Startup
fails fast if any step fails. There are no liveness checks against User
Service or Game Master at boot; those are surfaced at request time.
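The Redis half of the startup sequence can be reproduced from a shell when a
start fails and it is unclear which dependency is at fault. This is a sketch
under stated assumptions: the function names are hypothetical,
`LOBBY_REDIS_MASTER_ADDR` is assumed to have the form `host:port`, and the
5-second bound is illustrative rather than the value the service uses.

```bash
#!/usr/bin/env bash
# Hypothetical operator-side check; mirrors the bounded Redis PING at startup.
# Assumption: LOBBY_REDIS_MASTER_ADDR has the form host:port.

redis_host() { echo "${1%:*}"; }   # everything before the last ':'
redis_port() { echo "${1##*:}"; }  # everything after the last ':'

# redis_preflight runs in a subshell so the exported password does not leak
# into the caller's environment.
redis_preflight() (
  addr=${LOBBY_REDIS_MASTER_ADDR:?LOBBY_REDIS_MASTER_ADDR is not set}
  if [[ -n ${LOBBY_REDIS_PASSWORD:-} ]]; then
    # redis-cli reads REDISCLI_AUTH, which keeps the password off the command line.
    export REDISCLI_AUTH=$LOBBY_REDIS_PASSWORD
  fi
  # Bound the ping so an unreachable host fails quickly (5 s is an assumed limit).
  reply=$(timeout 5 redis-cli -h "$(redis_host "$addr")" \
                              -p "$(redis_port "$addr")" PING) || exit 1
  [[ $reply == PONG ]]
)
```

Running `redis_preflight && echo "redis ok"` on a host inside the same network
as the Lobby pods gives a quick signal on whether the Redis step would pass.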
Expected listener state after a healthy start:
@@ -160,11 +177,15 @@ is reachable again.
To inspect the backlog:
```bash
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
```
Entries with `score < now()` (Unix milliseconds) are expirable on the next
tick.
Rows whose `eligible_until_ms` is at or below `extract(epoch from now()) * 1000`
are expirable on the next tick. The partial index
`race_names_pending_eligible_idx` keeps this scan cheap.
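The expiry rule above can be expressed as a small set of shell helpers, which
is handy when checking a single row by hand or listing only the rows that are
already expirable. This is a sketch: `now_ms`, `is_expirable`, and
`list_expirable` are hypothetical names, and `psql` is assumed to pick up its
connection settings from the environment, as in the other snippets in this
runbook.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the pending-registration expiry rule.

# now_ms prints the current Unix time in milliseconds (second resolution).
now_ms() { echo $(( $(date +%s) * 1000 )); }

# is_expirable ELIGIBLE_UNTIL_MS [NOW_MS]
# Succeeds when the row is at or below the current time in Unix milliseconds.
is_expirable() {
  local eligible_until_ms=$1 now=${2:-$(now_ms)}
  (( eligible_until_ms <= now ))
}

# list_expirable shows only rows the next tick would expire.
# The extra predicate applies the same comparison inside PostgreSQL.
list_expirable() {
  psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
           FROM lobby.race_names
           WHERE binding_kind = 'pending_registration'
             AND eligible_until_ms <= extract(epoch from now()) * 1000
           ORDER BY eligible_until_ms ASC"
}
```

For example, `is_expirable 1700000000000 && echo expirable` checks one
`eligible_until_ms` value copied from the backlog query against the local clock.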
## Cascade Release Operator Notes
@@ -195,26 +216,34 @@ out-of-band.
## Diagnostic Queries
A handful of Redis CLI snippets help during incidents:
Durable enrollment state and Race Name Directory bindings live in
PostgreSQL; runtime coordination state stays in Redis. A handful of CLI
snippets help during incidents:
```bash
# Live game count by status
redis-cli ZCARD lobby:games_by_status:enrollment_open
redis-cli ZCARD lobby:games_by_status:running
# Live game count by status (PostgreSQL)
psql -c "SELECT status, COUNT(*) FROM lobby.games GROUP BY status"
# Inspect a specific game record
redis-cli GET lobby:games:<game_id>
psql -c "SELECT * FROM lobby.games WHERE game_id = '<game_id>'"
# Member roster for a game
redis-cli SMEMBERS lobby:game_memberships:<game_id>
psql -c "SELECT user_id, race_name, status, joined_at
FROM lobby.memberships
WHERE game_id = '<game_id>'
ORDER BY joined_at"
# Race name pending entries (oldest first)
redis-cli ZRANGE lobby:race_names:pending_index 0 -1 WITHSCORES
psql -c "SELECT canonical_key, game_id, holder_user_id, eligible_until_ms
FROM lobby.race_names
WHERE binding_kind = 'pending_registration'
ORDER BY eligible_until_ms ASC"
# Stream lag inspection
# Stream lag inspection (Redis)
redis-cli XINFO STREAM gm:lobby_events
redis-cli GET lobby:stream_offsets:gm_events
```
The gauges and counters surfaced through OpenTelemetry are the primary
observability surface; raw Redis access is for last-resort triage.
observability surface; raw PostgreSQL and Redis access is for last-resort
triage.