PostgreSQL Schema Decisions
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
no Redis-only predecessor and no migration window. This document
records the schema decisions and the non-obvious agreements behind
them, mirroring the shape of
../../notification/docs/postgres-migration.md
and serving the same role: a single coherent reference for "why does
the persistence layer look this way".
Use this document together with the migration script
../internal/adapters/postgres/migrations/00001_init.sql
and the runtime wiring
../internal/app/runtime.go.
Outcomes
- Schema `rtmanager` (provisioned externally) holds the durable service state across three tables: `runtime_records`, `operation_log`, `health_snapshots`. The three tables map onto the three runtime concerns documented in `../README.md` §Persistence Layout: current state per game, audit trail per operation, and latest technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`, applies embedded goose migrations strictly before any HTTP listener becomes ready, and exits non-zero when migration or ping fails. Already-applied migrations exit zero — the `pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via `pkg/redisconn.NewMasterClient` and passes it to the stream offset store, the per-game lease store, the consumer pipelines, and every publisher (`runtime:job_results`, `runtime:health_events`, `notification:intents`).
- The Redis adapter package `../internal/adapters/redisstate/` owns one shared `Keyspace` struct with the `defaultPrefix = "rtmanager:"` constant and per-store subpackages for stream offsets and the per-game lease.
- Generated jet code under `../internal/adapters/postgres/jet/` is committed; `make -C rtmanager jet` regenerates it via the testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable. The schema-per-service rule from `../../ARCHITECTURE.md` §Persistence Backends applies: each service's role is grant-restricted to its own schema; RTM never touches Lobby's `lobby` schema or vice versa.
Decisions
1. One schema, externally-provisioned rtmanager service role
Decision. The rtmanager schema and the matching
rtmanager service role are created outside the migration sequence
(in tests, by the testcontainers harness in cmd/jetgen/main.go::provisionRoleAndSchema
and by the integration harness; in production, by an ops init script
not in scope for any service stage). The embedded migration
00001_init.sql only contains DDL for the service-owned tables and
indexes and assumes it runs as the schema owner with
search_path=rtmanager.
Why. Mixing role creation, schema creation, and table DDL into
one script forces every consumer of the migration to run as a
superuser. The schema-per-service architectural rule
(ARCHITECTURE.md §Persistence Backends) lines up neatly with the
operational split: ops provisions roles and schemas, the service
applies schema-scoped migrations. Letting RTM run CREATE SCHEMA
from its runtime role would relax the
"each service's role grants are restricted to its own schema"
defense-in-depth rule.
2. runtime_records.game_id is the natural primary key
Decision. runtime_records uses
game_id text PRIMARY KEY. There is no surrogate key. The status
column carries a CHECK constraint enforcing the
running | stopped | removed enum.
CREATE TABLE runtime_records (
game_id text PRIMARY KEY,
status text NOT NULL,
-- ...
CONSTRAINT runtime_records_status_chk
CHECK (status IN ('running', 'stopped', 'removed'))
);
Why. game_id is the platform-wide identifier owned by Lobby;
RTM stores at most one record per game ever. A surrogate
bigserial would force every cross-service join to translate
through a lookup table; the natural key keeps RTM's persistence
layer pin-compatible with the streams contract (every
runtime:start_jobs envelope already names the game_id). The
status CHECK reproduces the Go-level enum from
../internal/domain/runtime/model.go
as a defense-in-depth gate at the storage boundary. Decision context:
domain-and-ports.md.
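For orientation, a minimal sketch of the shape such a Go-level enum takes (identifier names here are illustrative; the authoritative definitions live in ../internal/domain/runtime/model.go):

```go
// Illustrative sketch only; the real enum in
// ../internal/domain/runtime/model.go is the source of truth and may use
// different identifier names.
package runtime

type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// Valid reports whether s is one of the three values the
// runtime_records_status_chk CHECK constraint also accepts.
func (s Status) Valid() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	}
	return false
}
```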
3. (status, last_op_at) index serves both the cleanup worker and ListByStatus
Decision. runtime_records_status_last_op_idx is a composite
index on (status, last_op_at). The container cleanup worker scans
status='stopped' AND last_op_at < cutoff; the
runtimerecordstore.ListByStatus adapter method orders rows
last_op_at DESC, game_id ASC.
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
Why. Both read shapes share the same composite. The cleanup
worker drives the index from one direction (range scan on
last_op_at filtered by status); ListByStatus drives it from the
other (equality on status, sorted by last_op_at). PostgreSQL
satisfies both shapes through one index scan once the planner picks
the index for the WHERE clause. The secondary game_id ASC tiebreak
in the adapter ORDER BY only has to break ties among rows that share
a last_op_at value, so it adds at most a small sort on top of the
index scan.
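In database/sql terms, the two read shapes look roughly as follows. The SQL text is illustrative; the real adapters assemble equivalent statements through jet:

```go
// Sketch of the two read shapes served by runtime_records_status_last_op_idx.
// Query text and function names are illustrative.
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// staleStoppedGames is the cleanup-worker shape: equality on status plus a
// range predicate on last_op_at.
func staleStoppedGames(ctx context.Context, db *sql.DB, cutoff time.Time) (*sql.Rows, error) {
	return db.QueryContext(ctx,
		`SELECT game_id
		   FROM rtmanager.runtime_records
		  WHERE status = 'stopped' AND last_op_at < $1`, cutoff)
}

// listByStatus is the REST-facing shape: equality on status, newest first,
// game_id as a deterministic tiebreak.
func listByStatus(ctx context.Context, db *sql.DB, status string) (*sql.Rows, error) {
	return db.QueryContext(ctx,
		`SELECT game_id, status, last_op_at
		   FROM rtmanager.runtime_records
		  WHERE status = $1
		  ORDER BY last_op_at DESC, game_id ASC`, status)
}
```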
A second supporting index for the cleanup worker was considered and rejected: the workload is so small (single-instance v1, bounded running game count) that one composite index is clearly cheaper to maintain than two narrower ones.
4. operation_log is append-only with bigserial id and a (game_id, started_at DESC) index
Decision. operation_log carries a bigserial id PRIMARY KEY
and is written exclusively through INSERT — there is no UPDATE
pathway, no soft-delete column, and no foreign key to
runtime_records. The audit index
operation_log_game_started_idx (game_id, started_at DESC) drives
the GM/Admin REST audit reads. The adapter's ListByGame orders
results started_at DESC, id DESC and applies LIMIT $2.
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
Why. The audit's correctness invariant is "every operation RTM
performed gets exactly one row"; CASCADE deletes from
runtime_records would silently lose history when an admin removes
a runtime and would break the
../README.md §Persistence Layout commitment. The
secondary id DESC tiebreak inside the adapter is necessary because
the audit log can write multiple rows in the same millisecond when
reconcile_adopt and a real operation interleave on a single tick;
without the tiebreak the test that asserts insertion-order-stable
reads becomes flaky. A non-positive limit is rejected before the
SQL is issued; an empty result set returns as nil (matching the
lobby pattern, so service-layer callers can do len(entries) == 0
without an extra allocation).
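A hedged sketch of the ListByGame read shape under those rules (the function signature, Entry type, and column list are illustrative; the real adapter builds the statement through jet):

```go
// Illustrative only; the real adapter lives in
// ../internal/adapters/postgres/operationlogstore.
package sketch

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// Entry is a stand-in for the domain-level audit entry type.
type Entry struct {
	ID        int64
	GameID    string
	OpKind    string
	OpSource  string
	Outcome   string
	StartedAt time.Time
}

func listByGame(ctx context.Context, db *sql.DB, gameID string, limit int) ([]Entry, error) {
	if limit <= 0 {
		// Rejected before any SQL is issued.
		return nil, fmt.Errorf("limit must be positive, got %d", limit)
	}
	rows, err := db.QueryContext(ctx,
		`SELECT id, game_id, op_kind, op_source, outcome, started_at
		   FROM rtmanager.operation_log
		  WHERE game_id = $1
		  ORDER BY started_at DESC, id DESC
		  LIMIT $2`, gameID, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var entries []Entry // stays nil when nothing matches
	for rows.Next() {
		var e Entry
		if err := rows.Scan(&e.ID, &e.GameID, &e.OpKind, &e.OpSource, &e.Outcome, &e.StartedAt); err != nil {
			return nil, err
		}
		entries = append(entries, e)
	}
	return entries, rows.Err()
}
```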
5. Enum CHECK constraints on op_kind, op_source, outcome
Decision. operation_log reproduces the three Go-level enums
as CHECK constraints:
CONSTRAINT operation_log_op_kind_chk
CHECK (op_kind IN (
'start', 'stop', 'restart', 'patch',
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk
CHECK (op_source IN (
'lobby_stream', 'gm_rest', 'admin_rest',
'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk
CHECK (outcome IN ('success', 'failure'))
The Go-level enums in
../internal/domain/operation/log.go
remain the source of truth.
Why. A defense-in-depth gate at the storage boundary catches any
adapter regression that would otherwise persist an unexpected
string. Operator-side queries (SELECT … WHERE op_kind = 'restart')
benefit from the enum being verifiable directly in psql without
consulting the Go source. Adding a new value requires editing two
places (the Go enum and the migration), which is the right friction
level: every new value is a wire-protocol change and deserves an
explicit migration. The alternative of using PostgreSQL's CREATE TYPE … AS ENUM
was rejected because adding a value requires ALTER TYPE … ADD VALUE, which
carries awkward transactional restrictions (the new value cannot be used
before the adding transaction commits) and complicates the single-init
pre-launch policy (decision §12).
6. health_snapshots is one row per game; status enum collapses event types
Decision. health_snapshots carries game_id text PRIMARY KEY
and stores the latest technical health observation per game. The
status column enumerates the observed engine state, not the
triggering event type:
CONSTRAINT health_snapshots_status_chk
CHECK (status IN (
'healthy', 'probe_failed', 'exited',
'oom', 'inspect_unhealthy', 'container_disappeared'
))
The runtime:health_events event_type enum has seven values
(container_started, container_exited, container_oom,
container_disappeared, inspect_unhealthy, probe_failed,
probe_recovered). The snapshot status has six: probe_recovered and
container_started both collapse into healthy, probe_failed maps to
probe_failed, and the remaining four event types map onto the matching
status values (exited, oom, inspect_unhealthy, container_disappeared).
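A hedged sketch of that collapse (the mapping follows the enums above; the function name and error handling are illustrative, and the real implementation lives in ../internal/adapters/healtheventspublisher/publisher.go):

```go
// Illustrative mapping from runtime:health_events event types onto the
// health_snapshots status column.
package sketch

import "fmt"

func snapshotStatus(eventType string) (string, error) {
	switch eventType {
	case "container_started", "probe_recovered":
		return "healthy", nil
	case "probe_failed":
		return "probe_failed", nil
	case "container_exited":
		return "exited", nil
	case "container_oom":
		return "oom", nil
	case "inspect_unhealthy":
		return "inspect_unhealthy", nil
	case "container_disappeared":
		return "container_disappeared", nil
	default:
		return "", fmt.Errorf("unknown health event type %q", eventType)
	}
}
```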
Why. Health snapshots answer "what state is the engine in
right now", not "what event was just emitted". A consumer who
wants the event firehose reads runtime:health_events; a consumer
who wants the latest verdict reads health_snapshots. The two
surfaces have different lifetimes (stream entries are bounded only
by Redis trim; snapshot rows are overwritten on every new
observation), so collapsing the seven event types into six status
states aligns the column with the consumer's mental model. The
adapter that implements this collapse lives in
../internal/adapters/healtheventspublisher/publisher.go;
every emission to the stream also upserts the snapshot.
7. Two-axis CAS shape on runtime_records.UpdateStatus
Decision. runtimerecordstore.UpdateStatus compiles its CAS
guard into a single WHERE … AND … clause. Status must equal the
caller's ExpectedFrom; when the caller supplies a non-empty
ExpectedContainerID, current_container_id must equal it as
well:
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
AND status = $4
[AND current_container_id = $5]
A RowsAffected() == 0 result is ambiguous — the row may be absent
or the predicate may have failed. The adapter resolves the ambiguity
through a follow-up SELECT status FROM ... WHERE game_id = $1:
missing row → runtime.ErrNotFound; mismatch → runtime.ErrConflict.
The probe runs only on the slow path; happy-path UPDATEs cost a
single round trip.
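A minimal sketch of the CAS-plus-probe shape, written against database/sql with raw SQL for readability (the real runtimerecordstore.UpdateStatus builds the statement through jet and sets a destination-specific column list, see decision §8; field and error names here are simplified stand-ins):

```go
// Illustrative CAS + follow-up probe; the SET list and input struct are
// simplified relative to the real adapter.
package sketch

import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// Stand-ins for runtime.ErrNotFound and runtime.ErrConflict.
var (
	ErrNotFound = errors.New("runtime record not found")
	ErrConflict = errors.New("runtime record conflict")
)

type UpdateInput struct {
	GameID              string
	To                  string
	ExpectedFrom        string
	ExpectedContainerID string // optional second CAS axis
	Now                 time.Time
}

func updateStatusCAS(ctx context.Context, db *sql.DB, in UpdateInput) error {
	query := `UPDATE rtmanager.runtime_records
	             SET status = $1, last_op_at = $2
	           WHERE game_id = $3 AND status = $4`
	args := []any{in.To, in.Now.UTC(), in.GameID, in.ExpectedFrom}
	if in.ExpectedContainerID != "" {
		query += ` AND current_container_id = $5`
		args = append(args, in.ExpectedContainerID)
	}
	res, err := db.ExecContext(ctx, query, args...)
	if err != nil {
		return err
	}
	if n, err := res.RowsAffected(); err != nil {
		return err
	} else if n > 0 {
		return nil // happy path: one round trip
	}
	// Slow path: disambiguate "row missing" from "predicate failed".
	var current string
	err = db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		in.GameID).Scan(&current)
	switch {
	case errors.Is(err, sql.ErrNoRows):
		return ErrNotFound
	case err != nil:
		return err
	default:
		return ErrConflict
	}
}
```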
Why. The two-axis CAS is what services need: a stop driven by an
old container_id (from a stale REST request) must not clobber a
fresh running record installed by a concurrent restart. Status-only
CAS would collapse those two cases. The optional shape on
ExpectedContainerID lets reconciliation flows that legitimately
target "this game in running state without caring which container"
omit the second predicate. The follow-up probe matches the
gamestore / invitestore precedent in lobby/internal/adapters/postgres
and produces clean per-error sentinels at the service layer.
TestUpdateStatusConcurrentCAS exercises the path end to end with
eight goroutines racing the same transition: exactly one returns
nil, the rest see runtime.ErrConflict. The test is deterministic
because PostgreSQL serialises concurrent UPDATEs of the same row
through its row-level lock: the blocked updaters re-check the WHERE
clause against the winner's committed tuple and match zero rows.
8. Destination-driven SET clause on UpdateStatus
Decision. UpdateStatus updates a different column subset
depending on the destination status:
| Destination | Columns set |
|---|---|
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |
The implementation switches on input.To and writes the UPDATE
chain inline per branch — three short branches read better than one
parametric helper.
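The shape of that switch, sketched with raw SQL for readability (parameter layout is illustrative; the real branches are assembled with the jet UPDATE builder and carry the CAS predicates from decision §7):

```go
// Illustrative per-destination UPDATE shapes mirroring the table above.
package sketch

func updateSQLFor(to string) string {
	switch to {
	case "stopped":
		return `UPDATE rtmanager.runtime_records
		           SET status = $1, last_op_at = $2, stopped_at = $3
		         WHERE game_id = $4 AND status = $5`
	case "removed":
		return `UPDATE rtmanager.runtime_records
		           SET status = $1, last_op_at = $2, removed_at = $3,
		               current_container_id = NULL
		         WHERE game_id = $4 AND status = $5`
	case "running":
		return `UPDATE rtmanager.runtime_records
		           SET status = $1, last_op_at = $2
		         WHERE game_id = $3 AND status = $4`
	default:
		return ""
	}
}
```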
Why. Each destination has a different invariant. stopped
records the wall-clock at which the engine ceased serving; removed
nulls the container_id because the row no longer points at any
Docker resource; running only updates the status and the
last-op timestamp because the running invariants
(current_container_id, fresh started_at, current_image_ref,
engine_endpoint) are installed through Upsert on the start
path.
A previous draft built the SET list via []pg.Column / []any
slices and a helper, but jet's UPDATE(columns ...jet.Column)
variadic refuses a []postgres.Column slice spread because the
element type does not match jet.Column after the type-alias
resolution. The final code switches inline per branch.
The running destination is implemented even though the start
service uses Upsert for the inner start of restart and patch.
Keeping the running path live preserves a one-to-one match between
runtime.AllowedTransitions() and the adapter's capability matrix —
otherwise a future caller exercising the stopped → running
transition through UpdateStatus would hit a runtime error inside
the adapter rather than a domain rejection. The path only updates
status and last_op_at; callers responsible for the running
invariants install them through Upsert first.
9. created_at preservation on Upsert
Decision. runtimerecordstore.Upsert is implemented as
INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable column from EXCLUDED> — created_at is deliberately omitted from
the DO UPDATE list, so a second Upsert with a fresh CreatedAt
value never overwrites the stored timestamp.
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status = EXCLUDED.status,
current_container_id = EXCLUDED.current_container_id,
current_image_ref = EXCLUDED.current_image_ref,
engine_endpoint = EXCLUDED.engine_endpoint,
state_path = EXCLUDED.state_path,
docker_network = EXCLUDED.docker_network,
started_at = EXCLUDED.started_at,
stopped_at = EXCLUDED.stopped_at,
removed_at = EXCLUDED.removed_at,
last_op_at = EXCLUDED.last_op_at
-- created_at intentionally NOT updated
TestUpsertOverwritesMutableColumnsPreservesCreatedAt covers the
invariant.
Why. runtime_records.created_at records "first time RTM saw
the game". Every restart and every reconcile_adopt re-Upserts the
row with the current wall-clock as CreatedAt from the adapter
boundary; without the omission rule the timestamp would drift
forward. Preserving the original creation time keeps a stable
horizon for retention reasoning and matches
lobby/internal/adapters/postgres/gamestore.Save, which uses the
same approach for the games.created_at column.
10. health_snapshots.details JSONB round-trip with '{}'::jsonb default
Decision. health_snapshots.details is jsonb NOT NULL DEFAULT '{}'::jsonb. The jet-generated model declares
Details string (jet maps jsonb to string). The adapter:
- on `Upsert`, substitutes the SQL `DEFAULT` (`{}`) when `snapshot.Details` is empty, so the column never holds a non-JSON empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a `json.RawMessage` so the caller receives verbatim bytes without an extra round of parsing.
TestUpsertEmptyDetailsRoundTripsAsEmptyObject and
TestUpsertAndGetRoundTrip cover the two cases.
Why. The detail payload is type-specific (the keys differ
between probe_failed and inspect_unhealthy) and is opaque to
queries — the column is never element-filtered. JSONB matches the
"everything outside primary fields is JSON" pattern that the
Notification Service already established and allows a future
GIN index (e.g. for an admin search-by-key feature) without a
schema rewrite. Substituting the SQL DEFAULT for an empty
parameter avoids the trap where the database accepts '' for
text but rejects it for jsonb.
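A hedged sketch of the two halves of the round-trip. The column list is simplified, raw SQL stands in for jet, and the empty case is shown by binding `{}` directly rather than by emitting the DEFAULT keyword; the real code lives in healthsnapshotstore:

```go
// Illustrative details round-trip for health_snapshots.
package sketch

import (
	"context"
	"database/sql"
	"encoding/json"
)

// upsertDetails never binds an empty string to the jsonb column: an empty
// payload falls back to the '{}'::jsonb shape the column defaults to.
func upsertDetails(ctx context.Context, db *sql.DB, gameID, status, details string) error {
	if details == "" {
		details = "{}"
	}
	_, err := db.ExecContext(ctx,
		`INSERT INTO rtmanager.health_snapshots (game_id, status, details)
		 VALUES ($1, $2, $3::jsonb)
		 ON CONFLICT (game_id) DO UPDATE
		 SET status = EXCLUDED.status, details = EXCLUDED.details`,
		gameID, status, details)
	return err
}

// getDetails scans the jsonb column as raw bytes and hands them back as a
// json.RawMessage, so callers get verbatim JSON without re-parsing.
func getDetails(ctx context.Context, db *sql.DB, gameID string) (json.RawMessage, error) {
	var raw []byte
	err := db.QueryRowContext(ctx,
		`SELECT details FROM rtmanager.health_snapshots WHERE game_id = $1`,
		gameID).Scan(&raw)
	if err != nil {
		return nil, err
	}
	return json.RawMessage(raw), nil
}
```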
11. Timestamps are uniformly timestamptz with UTC normalisation at the adapter boundary
Decision. Every time-valued column on every RTM table uses
PostgreSQL's timestamptz. The domain model continues to use
time.Time; the adapter normalises every time.Time parameter to
UTC at the binding site (record.X.UTC() or the nullableTime
helper that wraps a possibly-zero time.Time), and re-wraps every
scanned time.Time with .UTC() (directly or via
timeFromNullable for nullable columns) before the value leaves
the adapter.
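A minimal sketch of what those boundary helpers can look like (the real nullableTime and timeFromNullable live in the adapter packages and may differ in detail):

```go
// Illustrative boundary helpers for UTC normalisation.
package sketch

import (
	"database/sql"
	"time"
)

// nullableTime binds a possibly-zero time.Time as a nullable, UTC-normalised
// parameter.
func nullableTime(t time.Time) sql.NullTime {
	if t.IsZero() {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true}
}

// timeFromNullable re-wraps a scanned nullable timestamp in UTC before it
// leaves the adapter; a NULL column comes back as the zero time.Time.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC()
}
```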
The architecture-wide form of this rule lives in
../../ARCHITECTURE.md §Persistence Backends → Timestamp handling.
Why. timestamptz is the right column type for every cross-service
timestamp the platform observes, and the domain model needs a
time.Time API the service layer can compare and do arithmetic on.
Without an explicit .UTC() at the bind site, the pgx driver returns
scanned values in time.Local, which silently breaks equality
tests, JSON formatting, and comparison against pointer fields
elsewhere in the codebase. The defensive .UTC() rule on both
sides eliminates the class of bug where a timezone difference
between the adapter and the test harness flips assertions
intermittently.
The same shape is used in User Service, Mail Service, and Notification Service — RTM matches the existing convention rather than introducing a fourth encoding path.
12. Single-init pre-launch policy
Decision. 00001_init.sql evolves in place until first
production deploy. Adding a column, an index, or a new table during
the pre-launch development window edits this file directly rather
than producing 00002_*.sql. The runtime applies the migration on
every boot; if the schema is already at head, pkg/postgres's
goose adapter exits zero.
Why. The schema-per-service architectural rule
(../../ARCHITECTURE.md §Persistence Backends)
endorses a single-init policy for pre-launch services. The
pre-launch window allows non-additive changes (column rename, type
narrowing, CHECK tightening) that a multi-step migration sequence
would force into awkward two-step rewrites. Once the service ships
to production, the next schema change becomes 00002_*.sql and
the policy lifts; from that point onward edits to 00001_init.sql
are rejected by code review.
This applies to RTM exactly the same way it applies to every other
PG-backed service in the workspace; the README explicitly carries
the reminder. The exit-zero behaviour for already-applied
migrations is what makes the policy operationally cheap: a
freshly-spawned replica re-applies the same 00001_init.sql with
no work to do, no logged error, and proceeds to open its
listeners.
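A minimal sketch of the boot-time apply step, assuming the upstream goose v3 API is called directly (the real runtime delegates this to pkg/postgres and wires it in ../internal/app/runtime.go):

```go
// Hedged sketch of applying the embedded migrations before any listener
// opens; the real wiring goes through pkg/postgres.
package sketch

import (
	"database/sql"
	"embed"
	"fmt"

	"github.com/pressly/goose/v3"
)

//go:embed migrations/*.sql
var migrationsFS embed.FS

func applyMigrations(db *sql.DB) error {
	goose.SetBaseFS(migrationsFS)
	if err := goose.SetDialect("postgres"); err != nil {
		return fmt.Errorf("set dialect: %w", err)
	}
	// goose.Up returns nil when the schema is already at head, which is
	// what makes re-applying 00001_init.sql on every boot operationally
	// cheap.
	if err := goose.Up(db, "migrations"); err != nil {
		return fmt.Errorf("apply migrations: %w", err)
	}
	return nil
}
```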
13. Query layer is go-jet/jet/v2; generated code is committed
Decision. All three RTM PG-store packages
(../internal/adapters/postgres/runtimerecordstore,
../internal/adapters/postgres/operationlogstore,
../internal/adapters/postgres/healthsnapshotstore)
build SQL through the jet builder API
(pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE plus the
pg.AND/OR/SET/COALESCE/... DSL).
Generated table models live under
../internal/adapters/postgres/jet/
and are regenerated by make -C rtmanager jet. The target invokes
../cmd/jetgen/main.go, which spins up a
transient PostgreSQL container via testcontainers, provisions the
rtmanager schema and rtmanager service role, applies the embedded
goose migrations, and runs github.com/go-jet/jet/v2/generator/postgres.GenerateDB
against the provisioned schema. Generated code is committed to the
repo, so build consumers do not need Docker.
Statements are run through the database/sql API
(stmt.Sql() → db/tx.Exec/Query/QueryRow); manual rowScanner
helpers preserve the codecs.go boundary translations and
domain-type mapping (status enum decoding, time.Time UTC
normalisation, JSONB []byte ↔ json.RawMessage).
PostgreSQL constructs that the jet builder does not cover natively
(COALESCE, LOWER on subselects, JSONB params) are expressed
through the per-DSL helpers (pg.COALESCE, pg.LOWER, direct
[]byte/string params for JSONB columns).
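A hedged sketch of the stmt.Sql() → database/sql flow; the generated package import path and identifier names here are assumptions, not copied from the repo:

```go
// Illustrative only: generated identifiers and the import path are assumed.
package sketch

import (
	"context"
	"database/sql"

	pg "github.com/go-jet/jet/v2/postgres"

	// Hypothetical import of the committed generated code under
	// ../internal/adapters/postgres/jet/.
	pgtable "rtmanager/internal/adapters/postgres/jet/rtmanager/table"
)

func selectRecord(ctx context.Context, db *sql.DB, gameID string) *sql.Row {
	stmt := pgtable.RuntimeRecords.
		SELECT(pgtable.RuntimeRecords.AllColumns).
		WHERE(pgtable.RuntimeRecords.GameID.EQ(pg.String(gameID)))

	// Hand the rendered SQL and args to database/sql so the manual
	// rowScanner / codecs.go layer keeps control of domain-type mapping.
	query, args := stmt.Sql()
	return db.QueryRowContext(ctx, query, args...)
}
```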
Why. Aligns with the workspace-wide convention from
../../PG_PLAN.md: the query layer is
github.com/go-jet/jet/v2 (PostgreSQL dialect) for every PG-backed
service. Hand-rolled SQL would multiply boundary-translation paths
and require per-store query-builder helpers for what jet already
covers. Committing generated code keeps go build ./... working
without Docker.
14. redisstate keyspace ownership and per-store subpackages
Decision. The
../internal/adapters/redisstate/
package owns one shared Keyspace struct with a
defaultPrefix = "rtmanager:" constant. Each Redis-backed adapter
lives in its own subpackage:
- `redisstate/streamoffsets` for the stream offset store consumed by the start-jobs and stop-jobs consumers;
- `redisstate/gamelease` for the per-game lease store consumed by every lifecycle service and the reconciler.
Both subpackages take a redisstate.Keyspace{} value and use it to
build their key shapes (rtmanager:stream_offsets:{label},
rtmanager:game_lease:{game_id}).
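A hedged sketch of the keyspace shape (method names are illustrative; only the prefix constant and the two key formats come from the layout above):

```go
// Illustrative sketch of the shared keyspace.
package redisstate

import "fmt"

const defaultPrefix = "rtmanager:"

// Keyspace centralises key-shape construction for the per-store subpackages.
type Keyspace struct{}

// StreamOffsetKey builds rtmanager:stream_offsets:{label}.
func (Keyspace) StreamOffsetKey(label string) string {
	return fmt.Sprintf("%sstream_offsets:%s", defaultPrefix, label)
}

// GameLeaseKey builds rtmanager:game_lease:{game_id}.
func (Keyspace) GameLeaseKey(gameID string) string {
	return fmt.Sprintf("%sgame_lease:%s", defaultPrefix, gameID)
}
```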
Why. Keeping the parent package as the single owner of the prefix
and the key-shape builder mirrors the way Lobby's redisstate
namespace centralises every key shape and supports multiple Redis-
backed adapters (stream offsets, the per-game lease) without a
restructure as the surface grows.
The per-store subpackage choice (rather than Lobby's flat single-package shape) is driven by three considerations:
- It keeps the docker mock generator scoped to one package, since `mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a dev-only dep, and keeping the `streamoffsets` package self-contained leaves room for `gamelease` to depend only on the production `redis` client.
- Each subpackage carries its own tests, which keeps the test surface focused on one Redis primitive rather than mixing offset semantics with lease semantics in shared fixtures.
Cross-References
- `../internal/adapters/postgres/migrations/00001_init.sql` — the embedded schema migration.
- `../internal/adapters/postgres/migrations/migrations.go` — `//go:embed *.sql` and `FS()` exporter consumed by the runtime.
- `../internal/adapters/postgres/runtimerecordstore`, `../internal/adapters/postgres/operationlogstore`, `../internal/adapters/postgres/healthsnapshotstore` — the three jet-backed PG adapters and their testcontainers-driven unit suites.
- `../internal/adapters/postgres/jet/` — committed generated jet models.
- `../cmd/jetgen/main.go` and the `../Makefile` `jet` target — the regeneration pipeline.
- `../internal/adapters/redisstate/`, `../internal/adapters/redisstate/streamoffsets/`, `../internal/adapters/redisstate/gamelease/` — Redis adapter package layout.
- `../internal/app/runtime.go` — runtime wiring: PG pool open + migration apply + Redis client open + adapter assembly.
- `../internal/config/` — the config groups consumed by the wiring (`Postgres`, `Redis`, `Streams`, `Coordination`).
- Companion design rationales: `domain-and-ports.md` for status enum and domain shape, `adapters.md` for the redisstate publishers and clients.