# PostgreSQL Schema Decisions
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
no Redis-only predecessor and no migration window. This document
records the schema decisions and the non-obvious agreements behind
them, mirroring the shape of
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
and serving the same role: a single coherent reference for "why does
the persistence layer look this way".
Use this document together with the migration script
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
and the runtime wiring
[`../internal/app/runtime.go`](../internal/app/runtime.go).
## Outcomes
- Schema `rtmanager` (provisioned externally) holds the durable
service state across three tables: `runtime_records`,
`operation_log`, `health_snapshots`. The three tables map onto the
three runtime concerns documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
current state per game, audit trail per operation, and latest
technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
A boot against an already-migrated schema still exits zero — the
`pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
store, the per-game lease store, the consumer pipelines, and every
publisher (`runtime:job_results`, `runtime:health_events`,
`notification:intents`).
- The Redis adapter package
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
owns one shared `Keyspace` struct with the
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
for stream offsets and the per-game lease.
- Generated jet code under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
is committed; `make -C rtmanager jet` regenerates it via the
testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable.
The schema-per-service rule from
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
applies: each service's role is grant-restricted to its own
schema; RTM never touches Lobby's `lobby` schema or vice versa.
## Decisions
### 1. One schema, externally-provisioned `rtmanagerservice` role
**Decision.** The `rtmanager` schema and the matching
`rtmanagerservice` role are created outside the migration sequence
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
and by the integration harness; in production, by an ops init script
not in scope for any service stage). The embedded migration
`00001_init.sql` only contains DDL for the service-owned tables and
indexes and assumes it runs as the schema owner with
`search_path=rtmanager`.
**Why.** Mixing role creation, schema creation, and table DDL into
one script forces every consumer of the migration to run as a
superuser. The schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
operational split: ops provisions roles and schemas, the service
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
from its runtime role would relax the
"each service's role grants are restricted to its own schema"
defense-in-depth rule.
### 2. `runtime_records.game_id` is the natural primary key
**Decision.** `runtime_records` uses
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
column carries a CHECK constraint enforcing the
`running | stopped | removed` enum.
```sql
CREATE TABLE runtime_records (
    game_id text PRIMARY KEY,
    status  text NOT NULL,
    -- ...
    CONSTRAINT runtime_records_status_chk
        CHECK (status IN ('running', 'stopped', 'removed'))
);
```
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
RTM stores at most one record per game, ever. A surrogate
`bigserial` would force every cross-service join to translate
through a lookup table; the natural key keeps RTM's persistence
layer pin-compatible with the streams contract (every
`runtime:start_jobs` envelope already names the `game_id`). The
status CHECK reproduces the Go-level enum from
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
as a defense-in-depth gate at the storage boundary. Decision context:
[`domain-and-ports.md`](domain-and-ports.md).
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
**Decision.** `runtime_records_status_last_op_idx` is a composite
index on `(status, last_op_at)`. The container cleanup worker scans
`status='stopped' AND last_op_at < cutoff`; the
`runtimerecordstore.ListByStatus` adapter method orders rows
`last_op_at DESC, game_id ASC`.
```sql
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
```
**Why.** Both read shapes share the same composite. The cleanup
worker drives the index from one direction (range scan on
`last_op_at` filtered by status); `ListByStatus` drives it from the
other (equality on status, sorted by `last_op_at`). PostgreSQL
satisfies both shapes through one index scan once the planner picks
the index for the WHERE clause. The secondary `game_id ASC` tiebreak
in the adapter ORDER BY is satisfied by primary-key ordering after
the index returns the rows.
A second supporting index for the cleanup worker was considered and
rejected: the workload is so small (single-instance v1, bounded
running game count) that one composite index is clearly cheaper than
maintaining two narrow ones.
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
and is written exclusively through INSERT — there is no UPDATE
pathway, no soft-delete column, and no foreign key to
`runtime_records`. The audit index
`operation_log_game_started_idx (game_id, started_at DESC)` drives
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
results `started_at DESC, id DESC` and applies `LIMIT $2`.
```sql
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
```
**Why.** The audit's correctness invariant is "every operation RTM
performed gets exactly one row"; CASCADE deletes from
`runtime_records` would silently lose history when an admin removes
a runtime and would break the
[`../README.md` §Persistence Layout](../README.md) commitment. The
secondary `id DESC` tiebreak inside the adapter is necessary because
the audit log can write multiple rows in the same millisecond when
`reconcile_adopt` and a real operation interleave on a single tick;
without the tiebreak the test that asserts insertion-order-stable
reads becomes flaky. A non-positive `limit` is rejected before the
SQL is issued; an empty result set returns as `nil` (matching the
lobby pattern, so service-layer callers can do `len(entries) == 0`
without an extra allocation).
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
**Decision.** `operation_log` reproduces the three Go-level enums
as CHECK constraints:
```sql
CONSTRAINT operation_log_op_kind_chk
    CHECK (op_kind IN (
        'start', 'stop', 'restart', 'patch',
        'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
    )),
CONSTRAINT operation_log_op_source_chk
    CHECK (op_source IN (
        'lobby_stream', 'gm_rest', 'admin_rest',
        'auto_ttl', 'auto_reconcile'
    )),
CONSTRAINT operation_log_outcome_chk
    CHECK (outcome IN ('success', 'failure'))
```
The Go-level enums in
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
remain the source of truth.
**Why.** A defense-in-depth gate at the storage boundary catches any
adapter regression that would otherwise persist an unexpected
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
benefit from the enum being verifiable directly in psql without
consulting the Go source. Adding a new value requires editing two
places (the Go enum and the migration), which is the right friction
level: every new value is a wire-protocol change and deserves an
explicit migration. The alternative of using PostgreSQL's `CREATE
TYPE … AS ENUM` was rejected because adding a value to a PG enum
type requires `ALTER TYPE … ADD VALUE` (disallowed inside a
transaction block before PostgreSQL 12; even on newer versions the
new value cannot be used in the same transaction) and complicates
the single-init pre-launch policy (decision §12).
### 6. `health_snapshots` is one row per game; status enum collapses event types
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
and stores the latest technical health observation per game. The
`status` column enumerates the **observed engine state**, not the
**triggering event type**:
```sql
CONSTRAINT health_snapshots_status_chk
    CHECK (status IN (
        'healthy', 'probe_failed', 'exited',
        'oom', 'inspect_unhealthy', 'container_disappeared'
    ))
```
The `runtime:health_events` `event_type` enum has seven values
(`container_started`, `container_exited`, `container_oom`,
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
`probe_recovered`). The snapshot status has six — the two probe
events fold into `healthy` (after `probe_recovered`) and
`probe_failed`, and `container_started` collapses into `healthy`.
**Why.** Health snapshots answer "what state is the engine in
**right now**", not "what event was just emitted". A consumer who
wants the event firehose reads `runtime:health_events`; a consumer
who wants the latest verdict reads `health_snapshots`. The two
surfaces have different lifetimes (stream entries are bounded only
by Redis trim; snapshot rows are overwritten on every new
observation), so collapsing the seven event types into six status
states aligns the column with the consumer's mental model. The
adapter that implements this collapse lives in
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
every emission to the stream also upserts the snapshot.
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
guard into a single `WHERE … AND …` clause. Status must equal the
caller's `ExpectedFrom`; when the caller supplies a non-empty
`ExpectedContainerID`, `current_container_id` must equal it as
well:
```sql
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
  AND status = $4
  [AND current_container_id = $5]
```
A `RowsAffected() == 0` result is ambiguous — the row may be absent
or the predicate may have failed. The adapter resolves the ambiguity
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
The probe runs only on the slow path; happy-path UPDATEs cost a
single round trip.
**Why.** The two-axis CAS is what services need: a stop driven by an
old container_id (from a stale REST request) must not clobber a
fresh `running` record installed by a concurrent restart. Status-only
CAS would collapse those two cases. The optional shape on
`ExpectedContainerID` lets reconciliation flows that legitimately
target "this game in `running` state without caring which container"
omit the second predicate. The follow-up probe matches the
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
and produces clean per-error sentinels at the service layer.
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
eight goroutines racing the same transition: exactly one returns
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
because PostgreSQL serialises concurrent UPDATEs of the same row
through its row-level lock; the losers re-evaluate the CAS predicate
against the winner's tuple and fail it.
### 8. Destination-driven `SET` clause on `UpdateStatus`
**Decision.** `UpdateStatus` updates a different column subset
depending on the destination status:

| Destination | Columns set |
| --- | --- |
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |
The implementation switches on `input.To` and writes the UPDATE
chain inline per branch — three short branches read better than one
parametric helper.
**Why.** Each destination has a different invariant. `stopped`
records the wall-clock at which the engine ceased serving; `removed`
nulls the container_id because the row no longer points at any
Docker resource; `running` only updates the status and the
last-op timestamp because the running invariants
(`current_container_id`, fresh `started_at`, `current_image_ref`,
`engine_endpoint`) are installed through `Upsert` on the `start`
path.
A previous draft built the SET list via `[]pg.Column` / `[]any`
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
variadic refuses a `[]postgres.Column` slice spread because the
element type does not match `jet.Column` after the type-alias
resolution. The final code switches inline per branch.
The `running` destination is implemented even though the start
service uses `Upsert` for the inner start of restart and patch.
Keeping the `running` path live preserves a one-to-one match between
`runtime.AllowedTransitions()` and the adapter's capability matrix —
otherwise a future caller exercising the `stopped → running`
transition through `UpdateStatus` would hit a runtime error inside
the adapter rather than a domain rejection. The path only updates
`status` and `last_op_at`; callers responsible for the running
invariants install them through `Upsert` first.
### 9. `created_at` preservation on `Upsert`
**Decision.** `runtimerecordstore.Upsert` is implemented as
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
column from EXCLUDED>`. `created_at` is deliberately omitted from
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
value never overwrites the stored timestamp.
```sql
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status               = EXCLUDED.status,
    current_container_id = EXCLUDED.current_container_id,
    current_image_ref    = EXCLUDED.current_image_ref,
    engine_endpoint      = EXCLUDED.engine_endpoint,
    state_path           = EXCLUDED.state_path,
    docker_network       = EXCLUDED.docker_network,
    started_at           = EXCLUDED.started_at,
    stopped_at           = EXCLUDED.stopped_at,
    removed_at           = EXCLUDED.removed_at,
    last_op_at           = EXCLUDED.last_op_at
-- created_at intentionally NOT updated
```
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
invariant.
**Why.** `runtime_records.created_at` records "first time RTM saw
the game". Every restart and every reconcile_adopt re-Upserts the
row with the current wall-clock as `CreatedAt` from the adapter
boundary; without the omission rule the timestamp would drift
forward. Preserving the original creation time keeps a stable
horizon for retention reasoning and matches
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
same approach for the `games.created_at` column.
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
'{}'::jsonb`. The jet-generated model declares
`Details string` (jet maps `jsonb` to `string`). The adapter:
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
`snapshot.Details` is empty, so the column never holds a non-JSON
empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
`json.RawMessage` so the caller receives verbatim bytes without
an extra round of parsing.
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
`TestUpsertAndGetRoundTrip` cover the two cases.
**Why.** The detail payload is type-specific (the keys differ
between `probe_failed` and `inspect_unhealthy`) and is opaque to
queries — the column is never element-filtered. JSONB matches the
"everything outside primary fields is JSON" pattern that the
Notification Service already established and allows a future
GIN index (e.g. for an admin search-by-key feature) without a
schema rewrite. Substituting the SQL DEFAULT for an empty
parameter avoids the trap where the database accepts `''` for
`text` but rejects it for `jsonb`.
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
**Decision.** Every time-valued column on every RTM table uses
PostgreSQL's `timestamptz`. The domain model continues to use
`time.Time`; the adapter normalises every `time.Time` parameter to
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
helper that wraps a possibly-zero `time.Time`), and re-wraps every
scanned `time.Time` with `.UTC()` (directly or via
`timeFromNullable` for nullable columns) before the value leaves
the adapter.
The architecture-wide form of this rule lives in
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model
needs a `time.Time` API the service layer can compare and do
arithmetic on.
Without explicit `.UTC()` on the bind site, the pgx driver returns
scanned values in `time.Local`, which silently breaks equality
tests, JSON formatting, and comparison against pointer fields
elsewhere in the codebase. The defensive `.UTC()` rule on both
sides eliminates the class of bug where a timezone difference
between the adapter and the test harness flips assertions
intermittently.
The same shape is used in User Service, Mail Service, and
Notification Service — RTM matches the existing convention rather
than introducing a fourth encoding path.
### 12. Single-init pre-launch policy
**Decision.** `00001_init.sql` evolves in place until first
production deploy. Adding a column, an index, or a new table during
the pre-launch development window edits this file directly rather
than producing `00002_*.sql`. The runtime applies the migration on
every boot; if the schema is already at head, `pkg/postgres`'s
goose adapter exits zero.
**Why.** The schema-per-service architectural rule
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
endorses a single-init policy for pre-launch services. The
pre-launch window allows non-additive changes (column rename, type
narrowing, CHECK tightening) that a multi-step migration sequence
would force into awkward two-step rewrites. Once the service ships
to production, the next schema change becomes `00002_*.sql` and
the policy lifts; from that point onward edits to `00001_init.sql`
are rejected by code review.
This applies to RTM exactly the same way it applies to every other
PG-backed service in the workspace; the README explicitly carries
the reminder. The exit-zero behaviour for already-applied
migrations is what makes the policy operationally cheap: a
freshly-spawned replica re-applies the same `00001_init.sql` with
no work to do, no logged error, and proceeds to open its
listeners.
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
**Decision.** All three RTM PG-store packages
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
build SQL through the jet builder API
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/COALESCE/...` DSL).
Generated table models live under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
and are regenerated by `make -C rtmanager jet`. The target invokes
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
transient PostgreSQL container via testcontainers, provisions the
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
against the provisioned schema. Generated code is committed to the
repo, so build consumers do not need Docker.
Statements are run through the `database/sql` API
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
helpers preserve the codecs.go boundary translations and
domain-type mapping (status enum decoding, `time.Time` UTC
normalisation, JSONB `[]byte` → `json.RawMessage`).
PostgreSQL constructs that the jet builder does not cover natively
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
through the dialect's DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct
`[]byte`/string params for JSONB columns).
**Why.** Aligns with the workspace-wide convention from
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
service. Hand-rolled SQL would multiply boundary-translation paths
and require per-store query-builder helpers for what jet already
covers. Committing generated code keeps `go build ./...` working
without Docker.
### 14. `redisstate` keyspace ownership and per-store subpackages
**Decision.** The
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
package owns one shared `Keyspace` struct with a
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
lives in its own subpackage:
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
for the stream offset store consumed by the start-jobs and
stop-jobs consumers;
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
for the per-game lease store consumed by every lifecycle service
and the reconciler.
Both subpackages take a `redisstate.Keyspace{}` value and use it to
build their key shapes (`rtmanager:stream_offsets:{label}`,
`rtmanager:game_lease:{game_id}`).
**Why.** Keeping the parent package as the single owner of the prefix
and the key-shape builder mirrors the way Lobby's `redisstate`
namespace centralises every key shape and supports multiple
Redis-backed adapters (stream offsets, the per-game lease) without a
restructure as the surface grows.
The per-store subpackage choice (rather than Lobby's flat
single-package shape) is driven by three considerations:
- It keeps the docker mock generator scoped to one package, since
`mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a
dev-only dep, and keeping the `streamoffsets` package
self-contained leaves room for `gamelease` to depend only on the
production `redis` client.
- Each subpackage carries its own tests, which keeps the test
surface focused on one Redis primitive rather than mixing offset
semantics with lease semantics in shared fixtures.
## Cross-References
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— the embedded schema migration.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
— the `//go:embed *.sql` and `FS()` exporter consumed by the runtime.
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
— the three jet-backed PG adapters and their testcontainers-driven
unit suites.
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
— committed generated jet models.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
[`../Makefile`](../Makefile) `jet` target — the regeneration
pipeline.
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
— Redis adapter package layout.
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
— runtime wiring: PG pool open + migration apply + Redis client
open + adapter assembly.
- [`../internal/config/`](../internal/config) — the config groups
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
`Coordination`).
- Companion design rationales:
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
domain shape, [`adapters.md`](adapters.md) for the redisstate
publishers and clients.