# PostgreSQL Schema Decisions

Runtime Manager has been PostgreSQL-and-Redis from day one — there is no Redis-only predecessor and no migration window. This document records the schema decisions and the non-obvious agreements behind them, mirroring the shape of [`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md) and serving the same role: a single coherent reference for "why does the persistence layer look this way". Use this document together with the migration script [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql) and the runtime wiring [`../internal/app/runtime.go`](../internal/app/runtime.go).

## Outcomes

- Schema `rtmanager` (provisioned externally) holds the durable service state across three tables: `runtime_records`, `operation_log`, `health_snapshots`. The three tables map onto the three runtime concerns documented in [`../README.md` §Persistence Layout](../README.md#persistence-layout): current state per game, audit trail per operation, and latest technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`, applies embedded goose migrations strictly before any HTTP listener becomes ready, and exits non-zero when migration or ping fails. A boot against an already-migrated schema exits zero — the `pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via `pkg/redisconn.NewMasterClient` and passes it to the stream offset store, the per-game lease store, the consumer pipelines, and every publisher (`runtime:job_results`, `runtime:health_events`, `notification:intents`).
- The Redis adapter package [`../internal/adapters/redisstate/`](../internal/adapters/redisstate) owns one shared `Keyspace` struct with the `defaultPrefix = "rtmanager:"` constant and per-store subpackages for stream offsets and the per-game lease.
- Generated jet code under [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet) is committed; `make -C rtmanager jet` regenerates it via the testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable. The schema-per-service rule from [`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md) applies: each service's role is grant-restricted to its own schema; RTM never touches Lobby's `lobby` schema or vice versa.

## Decisions

### 1. One schema, externally-provisioned `rtmanagerservice` role

**Decision.** The `rtmanager` schema and the matching `rtmanagerservice` role are created outside the migration sequence (in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema` and by the integration harness; in production, by an ops init script not in scope for any service stage). The embedded migration `00001_init.sql` only contains DDL for the service-owned tables and indexes and assumes it runs as the schema owner with `search_path=rtmanager`.

**Why.** Mixing role creation, schema creation, and table DDL into one script forces every consumer of the migration to run as a superuser. The schema-per-service architectural rule (`ARCHITECTURE.md` §Persistence Backends) lines up neatly with the operational split: ops provisions roles and schemas, the service applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA` from its runtime role would relax the "each service's role grants are restricted to its own schema" defense-in-depth rule.

### 2. `runtime_records.game_id` is the natural primary key

**Decision.** `runtime_records` uses `game_id text PRIMARY KEY`. There is no surrogate key. The `status` column carries a CHECK constraint enforcing the `running | stopped | removed` enum.

```sql
CREATE TABLE runtime_records (
    game_id text PRIMARY KEY,
    status  text NOT NULL,
    -- ...
    CONSTRAINT runtime_records_status_chk
        CHECK (status IN ('running', 'stopped', 'removed'))
);
```

**Why.** `game_id` is the platform-wide identifier owned by Lobby; RTM stores at most one record per game ever. A surrogate `bigserial` would force every cross-service join to translate through a lookup table; the natural key keeps RTM's persistence layer pin-compatible with the streams contract (every `runtime:start_jobs` envelope already names the `game_id`). The status CHECK reproduces the Go-level enum from [`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go) as a defense-in-depth gate at the storage boundary. Decision context: [`domain-and-ports.md`](domain-and-ports.md).

### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`

**Decision.** `runtime_records_status_last_op_idx` is a composite index on `(status, last_op_at)`. The container cleanup worker scans `status='stopped' AND last_op_at < cutoff`; the `runtimerecordstore.ListByStatus` adapter method orders rows `last_op_at DESC, game_id ASC`.

```sql
CREATE INDEX runtime_records_status_last_op_idx
    ON runtime_records (status, last_op_at);
```

**Why.** Both read shapes share the same composite. The cleanup worker drives the index from one direction (range scan on `last_op_at` filtered by status); `ListByStatus` drives it from the other (equality on status, sorted by `last_op_at`). PostgreSQL satisfies both shapes through one index scan once the planner picks the index for the WHERE clause. The secondary `game_id ASC` tiebreak in the adapter ORDER BY is satisfied by primary-key ordering after the index returns the rows. A second supporting index for the cleanup worker was considered and rejected: the workload is so small (single-instance v1, bounded running game count) that one composite is dominantly cheaper than two narrow ones.

### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index

**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY` and is written exclusively through INSERT — there is no UPDATE pathway, no soft-delete column, and no foreign key to `runtime_records`. The audit index `operation_log_game_started_idx (game_id, started_at DESC)` drives the GM/Admin REST audit reads. The adapter's `ListByGame` orders results `started_at DESC, id DESC` and applies `LIMIT $2`.

```sql
CREATE INDEX operation_log_game_started_idx
    ON operation_log (game_id, started_at DESC);
```

**Why.** The audit's correctness invariant is "every operation RTM performed gets exactly one row"; CASCADE deletes from `runtime_records` would silently lose history when an admin removes a runtime and would break the [`../README.md` §Persistence Layout](../README.md) commitment. The secondary `id DESC` tiebreak inside the adapter is necessary because the audit log can write multiple rows in the same millisecond when `reconcile_adopt` and a real operation interleave on a single tick; without the tiebreak the test that asserts insertion-order-stable reads becomes flaky. A non-positive `limit` is rejected before the SQL is issued; an empty result set returns as `nil` (matching the lobby pattern, so service-layer callers can do `len(entries) == 0` without an extra allocation).

### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`

**Decision.** `operation_log` reproduces the three Go-level enums as CHECK constraints:

```sql
CONSTRAINT operation_log_op_kind_chk CHECK (op_kind IN (
    'start', 'stop', 'restart', 'patch',
    'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk CHECK (op_source IN (
    'lobby_stream', 'gm_rest', 'admin_rest', 'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk CHECK (outcome IN ('success', 'failure'))
```

The Go-level enums in [`../internal/domain/operation/log.go`](../internal/domain/operation/log.go) remain the source of truth.

**Why.** A defense-in-depth gate at the storage boundary catches any adapter regression that would otherwise persist an unexpected string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`) benefit from the enum being verifiable directly in psql without consulting the Go source. Adding a new value requires editing two places (the Go enum and the migration), which is the right friction level: every new value is a wire-protocol change and deserves an explicit migration. The alternative of using PostgreSQL's `CREATE TYPE … AS ENUM` was rejected because adding a value to a PG enum type requires `ALTER TYPE` outside a transaction and complicates the single-init pre-launch policy (decision §12).

### 6. `health_snapshots` is one row per game; status enum collapses event types

**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY` and stores the latest technical health observation per game.
The `status` column enumerates the **observed engine state**, not the **triggering event type**:

```sql
CONSTRAINT health_snapshots_status_chk CHECK (status IN (
    'healthy', 'probe_failed', 'exited', 'oom',
    'inspect_unhealthy', 'container_disappeared'
))
```

The `runtime:health_events` `event_type` enum has seven values (`container_started`, `container_exited`, `container_oom`, `container_disappeared`, `inspect_unhealthy`, `probe_failed`, `probe_recovered`). The snapshot status has six — the two probe events fold into `healthy` (after `probe_recovered`) and `probe_failed`, and `container_started` collapses into `healthy`.

**Why.** Health snapshots answer "what state is the engine in **right now**", not "what event was just emitted". A consumer who wants the event firehose reads `runtime:health_events`; a consumer who wants the latest verdict reads `health_snapshots`. The two surfaces have different lifetimes (stream entries are bounded only by Redis trim; snapshot rows are overwritten on every new observation), so collapsing the seven event types into six status states aligns the column with the consumer's mental model. The adapter that implements this collapse lives in [`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go); every emission to the stream also upserts the snapshot.

### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`

**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS guard into a single `WHERE … AND …` clause. Status must equal the caller's `ExpectedFrom`; when the caller supplies a non-empty `ExpectedContainerID`, `current_container_id` must equal it as well:

```sql
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
  AND status = $4
  [AND current_container_id = $5]
```

A `RowsAffected() == 0` result is ambiguous — the row may be absent or the predicate may have failed.
The adapter resolves the ambiguity through a follow-up `SELECT status FROM ... WHERE game_id = $1`: missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`. The probe runs only on the slow path; happy-path UPDATEs cost a single round trip.

**Why.** The two-axis CAS is what services need: a stop driven by an old container_id (from a stale REST request) must not clobber a fresh `running` record installed by a concurrent restart. Status-only CAS would collapse those two cases. The optional shape on `ExpectedContainerID` lets reconciliation flows that legitimately target "this game in `running` state without caring which container" omit the second predicate. The follow-up probe matches the gamestore / invitestore precedent in `lobby/internal/adapters/postgres` and produces clean per-error sentinels at the service layer. `TestUpdateStatusConcurrentCAS` exercises the path end to end with eight goroutines racing the same transition: exactly one returns `nil`, the rest see `runtime.ErrConflict`. The test is deterministic because PostgreSQL serialises row-level UPDATEs through the row's MVCC tuple.

### 8. Destination-driven `SET` clause on `UpdateStatus`

**Decision.** `UpdateStatus` updates a different column subset depending on the destination status:

| Destination | Columns set |
| --- | --- |
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |

The implementation switches on `input.To` and writes the UPDATE chain inline per branch — three short branches read better than one parametric helper.

**Why.** Each destination has a different invariant.
`stopped` records the wall-clock at which the engine ceased serving; `removed` nulls the container_id because the row no longer points at any Docker resource; `running` only updates the status and the last-op timestamp because the running invariants (`current_container_id`, fresh `started_at`, `current_image_ref`, `engine_endpoint`) are installed through `Upsert` on the `start` path. A previous draft built the SET list via `[]pg.Column` / `[]any` slices and a helper, but jet's `UPDATE(columns ...jet.Column)` variadic refuses a `[]postgres.Column` slice spread because the element type does not match `jet.Column` after the type-alias resolution. The final code switches inline per branch.

The `running` destination is implemented even though the start service uses `Upsert` for the inner start of restart and patch. Keeping the `running` path live preserves a one-to-one match between `runtime.AllowedTransitions()` and the adapter's capability matrix — otherwise a future caller exercising the `stopped → running` transition through `UpdateStatus` would hit a runtime error inside the adapter rather than a domain rejection. The path only updates `status` and `last_op_at`; callers responsible for the running invariants install them through `Upsert` first.

### 9. `created_at` preservation on `Upsert`

**Decision.** `runtimerecordstore.Upsert` is implemented as `INSERT … ON CONFLICT (game_id) DO UPDATE SET` — `created_at` is deliberately omitted from the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt` value never overwrites the stored timestamp.

```sql
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE SET
    status               = EXCLUDED.status,
    current_container_id = EXCLUDED.current_container_id,
    current_image_ref    = EXCLUDED.current_image_ref,
    engine_endpoint      = EXCLUDED.engine_endpoint,
    state_path           = EXCLUDED.state_path,
    docker_network       = EXCLUDED.docker_network,
    started_at           = EXCLUDED.started_at,
    stopped_at           = EXCLUDED.stopped_at,
    removed_at           = EXCLUDED.removed_at,
    last_op_at           = EXCLUDED.last_op_at
    -- created_at intentionally NOT updated
```

`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the invariant.

**Why.** `runtime_records.created_at` records "first time RTM saw the game". Every restart and every reconcile_adopt re-Upserts the row with the current wall-clock as `CreatedAt` from the adapter boundary; without the omission rule the timestamp would drift forward. Preserving the original creation time keeps a stable horizon for retention reasoning and matches `lobby/internal/adapters/postgres/gamestore.Save`, which uses the same approach for the `games.created_at` column.

### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default

**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT '{}'::jsonb`. The jet-generated model declares `Details string` (jet maps `jsonb` to `string`). The adapter:

- on `Upsert`, substitutes the SQL DEFAULT `{}` when `snapshot.Details` is empty, so the column never holds a non-JSON empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a `json.RawMessage` so the caller receives verbatim bytes without an extra round of parsing.

`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and `TestUpsertAndGetRoundTrip` cover the two cases.

**Why.** The detail payload is type-specific (the keys differ between `probe_failed` and `inspect_unhealthy`) and is opaque to queries — the column is never element-filtered. JSONB matches the "everything outside primary fields is JSON" pattern that the Notification Service already established and allows a future GIN index (e.g. for an admin search-by-key feature) without a schema rewrite. Substituting the SQL DEFAULT for an empty parameter avoids the trap where the database accepts `''` for `text` but rejects it for `jsonb`.

### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary

**Decision.** Every time-valued column on every RTM table uses PostgreSQL's `timestamptz`. The domain model continues to use `time.Time`; the adapter normalises every `time.Time` parameter to UTC at the binding site (`record.X.UTC()` or the `nullableTime` helper that wraps a possibly-zero `time.Time`), and re-wraps every scanned `time.Time` with `.UTC()` (directly or via `timeFromNullable` for nullable columns) before the value leaves the adapter. The architecture-wide form of this rule lives in [`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).

**Why.** `timestamptz` is the right column type for every cross-service timestamp the platform observes, and the domain model needs a `time.Time` API the service layer can compare and do arithmetic on. Without explicit `.UTC()` on the bind site, the pgx driver returns scanned values in `time.Local`, which silently breaks equality tests, JSON formatting, and comparison against pointer fields elsewhere in the codebase. The defensive `.UTC()` rule on both sides eliminates the class of bug where a timezone difference between the adapter and the test harness flips assertions intermittently. The same shape is used in User Service, Mail Service, and Notification Service — RTM matches the existing convention rather than introducing a fourth encoding path.

### 12. Single-init pre-launch policy

**Decision.** `00001_init.sql` evolves in place until first production deploy. Adding a column, an index, or a new table during the pre-launch development window edits this file directly rather than producing `00002_*.sql`.
The runtime applies the migration on every boot; if the schema is already at head, `pkg/postgres`'s goose adapter exits zero.

**Why.** The schema-per-service architectural rule ([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)) endorses a single-init policy for pre-launch services. The pre-launch window allows non-additive changes (column rename, type narrowing, CHECK tightening) that a multi-step migration sequence would force into awkward two-step rewrites. Once the service ships to production, the next schema change becomes `00002_*.sql` and the policy lifts; from that point onward edits to `00001_init.sql` are rejected by code review. This applies to RTM exactly the same way it applies to every other PG-backed service in the workspace; the README explicitly carries the reminder. The exit-zero behaviour for already-applied migrations is what makes the policy operationally cheap: a freshly-spawned replica re-applies the same `00001_init.sql` with no work to do, no logged error, and proceeds to open its listeners.

### 13. Query layer is `go-jet/jet/v2`; generated code is committed

**Decision.** All three RTM PG-store packages ([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore), [`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore), [`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)) build SQL through the jet builder API (`pgtable.INSERT/SELECT/UPDATE/DELETE` plus the `pg.AND/OR/SET/COALESCE/...` DSL). Generated table models live under [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet) and are regenerated by `make -C rtmanager jet`.
The target invokes [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a transient PostgreSQL container via testcontainers, provisions the `rtmanager` schema and `rtmanagerservice` role, applies the embedded goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB` against the provisioned schema. Generated code is committed to the repo, so build consumers do not need Docker.

Statements are run through the `database/sql` API (`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner` helpers preserve the codecs.go boundary translations and domain-type mapping (status enum decoding, `time.Time` UTC normalisation, JSONB `[]byte` ↔ `json.RawMessage`). PostgreSQL constructs that the jet builder does not cover natively (`COALESCE`, `LOWER` on subselects, JSONB params) are expressed through the per-DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct `[]byte`/string params for JSONB columns).

**Why.** Aligns with the workspace-wide convention from [`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is `github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed service. Hand-rolled SQL would multiply boundary-translation paths and require per-store query-builder helpers for what jet already covers. Committing generated code keeps `go build ./...` working without Docker.

### 14. `redisstate` keyspace ownership and per-store subpackages

**Decision.** The [`../internal/adapters/redisstate/`](../internal/adapters/redisstate) package owns one shared `Keyspace` struct with a `defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter lives in its own subpackage:

- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/) for the stream offset store consumed by the start-jobs and stop-jobs consumers;
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/) for the per-game lease store consumed by every lifecycle service and the reconciler.
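
The keyspace ownership can be sketched as follows. The `defaultPrefix` constant and the two key shapes are taken from this document; the struct field and method names below are illustrative assumptions, not the adapter's actual API:

```go
package main

import "fmt"

// defaultPrefix is the constant named in this document.
const defaultPrefix = "rtmanager:"

// Keyspace owns every Redis key shape for the service (field/method
// names are assumptions for illustration).
type Keyspace struct {
	Prefix string // empty means defaultPrefix
}

func (k Keyspace) prefix() string {
	if k.Prefix == "" {
		return defaultPrefix
	}
	return k.Prefix
}

// StreamOffset builds rtmanager:stream_offsets:{label}.
func (k Keyspace) StreamOffset(label string) string {
	return k.prefix() + "stream_offsets:" + label
}

// GameLease builds rtmanager:game_lease:{game_id}.
func (k Keyspace) GameLease(gameID string) string {
	return k.prefix() + "game_lease:" + gameID
}

func main() {
	ks := Keyspace{}
	fmt.Println(ks.StreamOffset("start_jobs")) // rtmanager:stream_offsets:start_jobs
	fmt.Println(ks.GameLease("g-42"))          // rtmanager:game_lease:g-42
}
```

Centralising the builders this way means a prefix change (for example a test namespace) touches exactly one struct value rather than every subpackage.
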
Both subpackages take a `redisstate.Keyspace{}` value and use it to build their key shapes (`rtmanager:stream_offsets:{label}`, `rtmanager:game_lease:{game_id}`).

**Why.** Keeping the parent package as the single owner of the prefix and the key-shape builder mirrors the way Lobby's `redisstate` namespace centralises every key shape and supports multiple Redis-backed adapters (stream offsets, the per-game lease) without a restructure as the surface grows. The per-store subpackage choice (rather than Lobby's flat single-package shape) is driven by three considerations:

- It keeps the docker mock generator scoped to one package, since `mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a dev-only dep, and keeping the `streamoffsets` package self-contained leaves room for `gamelease` to depend only on the production `redis` client.
- Each subpackage carries its own tests, which keeps the test surface focused on one Redis primitive rather than mixing offset semantics with lease semantics in shared fixtures.

## Cross-References

- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql) — the embedded schema migration.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go) — `//go:embed *.sql` and `FS()` exporter consumed by the runtime.
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore), [`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore), [`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore) — the three jet-backed PG adapters and their testcontainers-driven unit suites.
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet) — committed generated jet models.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and [`../Makefile`](../Makefile) `jet` target — the regeneration pipeline.
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate), [`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/), [`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/) — Redis adapter package layout.
- [`../internal/app/runtime.go`](../internal/app/runtime.go) — runtime wiring: PG pool open + migration apply + Redis client open + adapter assembly.
- [`../internal/config/`](../internal/config) — the config groups consumed by the wiring (`Postgres`, `Redis`, `Streams`, `Coordination`).
- Companion design rationales: [`domain-and-ports.md`](domain-and-ports.md) for status enum and domain shape, [`adapters.md`](adapters.md) for the redisstate publishers and clients.