PostgreSQL Schema Decisions

Runtime Manager has been PostgreSQL-and-Redis from day one — there is no Redis-only predecessor and no migration window. This document records the schema decisions and the non-obvious agreements behind them, mirroring the shape of ../../notification/docs/postgres-migration.md and serving the same role: a single coherent reference for "why does the persistence layer look this way".

Use this document together with the migration script ../internal/adapters/postgres/migrations/00001_init.sql and the runtime wiring ../internal/app/runtime.go.

Outcomes

  • Schema rtmanager (provisioned externally) holds the durable service state across three tables: runtime_records, operation_log, health_snapshots. The three tables map onto the three runtime concerns documented in ../README.md §Persistence Layout: current state per game, audit trail per operation, and latest technical health observation per game.
  • The runtime opens one PostgreSQL pool via pkg/postgres.OpenPrimary, applies embedded goose migrations strictly before any HTTP listener becomes ready, and exits non-zero when migration or ping fails. A boot against an already-migrated schema exits zero — the pkg/postgres-supplied migrator treats "no work to do" as success.
  • The runtime opens one shared *redis.Client via pkg/redisconn.NewMasterClient and passes it to the stream offset store, the per-game lease store, the consumer pipelines, and every publisher (runtime:job_results, runtime:health_events, notification:intents).
  • The Redis adapter package ../internal/adapters/redisstate/ owns one shared Keyspace struct with the defaultPrefix = "rtmanager:" constant and per-store subpackages for stream offsets and the per-game lease.
  • Generated jet code under ../internal/adapters/postgres/jet/ is committed; make -C rtmanager jet regenerates it via the testcontainers-driven cmd/jetgen pipeline.
  • Configuration uses the RTMANAGER_ prefix for every variable. The schema-per-service rule from ../../ARCHITECTURE.md §Persistence Backends applies: each service's role is grant-restricted to its own schema; RTM never touches Lobby's lobby schema or vice versa.
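
The boot order in the second bullet compresses into a short wiring sketch. The snippet below is a minimal approximation that uses the stock pgx-stdlib, goose, and go-redis APIs directly instead of the pkg/postgres / pkg/redisconn wrappers; the embed path, environment variable names, and function layout are illustrative, not lifted from ../internal/app/runtime.go.

package main

import (
    "context"
    "database/sql"
    "embed"
    "fmt"
    "log"
    "os"

    _ "github.com/jackc/pgx/v5/stdlib"
    "github.com/pressly/goose/v3"
    "github.com/redis/go-redis/v9"
)

//go:embed migrations/*.sql
var migrationsFS embed.FS

func run(ctx context.Context, pgDSN, redisAddr string) error {
    // One PostgreSQL pool for the whole process.
    db, err := sql.Open("pgx", pgDSN)
    if err != nil {
        return fmt.Errorf("open postgres: %w", err)
    }
    if err := db.PingContext(ctx); err != nil {
        return fmt.Errorf("ping postgres: %w", err)
    }

    // Embedded migrations apply strictly before any listener opens; an
    // already-migrated schema is "no work to do" and returns nil.
    goose.SetBaseFS(migrationsFS)
    if err := goose.SetDialect("postgres"); err != nil {
        return err
    }
    if err := goose.Up(db, "migrations"); err != nil {
        return fmt.Errorf("apply migrations: %w", err)
    }

    // One shared Redis client for offsets, leases, consumers, and publishers.
    rdb := redis.NewClient(&redis.Options{Addr: redisAddr})
    if err := rdb.Ping(ctx).Err(); err != nil {
        return fmt.Errorf("ping redis: %w", err)
    }

    // Stores, consumer pipelines, publishers, and the HTTP listener are
    // wired after this point; nothing serves until the steps above succeed.
    _ = rdb
    return nil
}

func main() {
    if err := run(context.Background(),
        os.Getenv("RTMANAGER_POSTGRES_DSN"),
        os.Getenv("RTMANAGER_REDIS_ADDR")); err != nil {
        log.Fatal(err) // exits non-zero on migration or ping failure
    }
}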

Decisions

1. One schema, externally-provisioned rtmanagerservice role

Decision. The rtmanager schema and the matching rtmanagerservice role are created outside the migration sequence (in tests, by the testcontainers harness in cmd/jetgen/main.go::provisionRoleAndSchema and by the integration harness; in production, by an ops init script not in scope for any service stage). The embedded migration 00001_init.sql only contains DDL for the service-owned tables and indexes and assumes it runs as the schema owner with search_path=rtmanager.

Why. Mixing role creation, schema creation, and table DDL into one script forces every consumer of the migration to run as a superuser. The schema-per-service architectural rule (ARCHITECTURE.md §Persistence Backends) lines up neatly with the operational split: ops provisions roles and schemas, the service applies schema-scoped migrations. Letting RTM run CREATE SCHEMA from its runtime role would relax the "each service's role grants are restricted to its own schema" defense-in-depth rule.

2. runtime_records.game_id is the natural primary key

Decision. runtime_records uses game_id text PRIMARY KEY. There is no surrogate key. The status column carries a CHECK constraint enforcing the running | stopped | removed enum.

CREATE TABLE runtime_records (
    game_id              text PRIMARY KEY,
    status               text NOT NULL,
    -- ...
    CONSTRAINT runtime_records_status_chk
        CHECK (status IN ('running', 'stopped', 'removed'))
);

Why. game_id is the platform-wide identifier owned by Lobby; RTM stores at most one record per game ever. A surrogate bigserial would force every cross-service join to translate through a lookup table; the natural key keeps RTM's persistence layer pin-compatible with the streams contract (every runtime:start_jobs envelope already names the game_id). The status CHECK reproduces the Go-level enum from ../internal/domain/runtime/model.go as a defense-in-depth gate at the storage boundary. Decision context: domain-and-ports.md.
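
For orientation, the Go-level enum that the CHECK mirrors has roughly this shape — a sketch only; the authoritative declarations live in ../internal/domain/runtime/model.go and may differ in naming:

// Status is the runtime record state; the CHECK constraint above repeats
// the same three values as a gate at the storage boundary.
type Status string

const (
    StatusRunning Status = "running"
    StatusStopped Status = "stopped"
    StatusRemoved Status = "removed"
)

// Valid reports whether s is one of the three allowed states.
func (s Status) Valid() bool {
    switch s {
    case StatusRunning, StatusStopped, StatusRemoved:
        return true
    default:
        return false
    }
}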

3. (status, last_op_at) index serves both the cleanup worker and ListByStatus

Decision. runtime_records_status_last_op_idx is a composite index on (status, last_op_at). The container cleanup worker scans status='stopped' AND last_op_at < cutoff; the runtimerecordstore.ListByStatus adapter method orders rows last_op_at DESC, game_id ASC.

CREATE INDEX runtime_records_status_last_op_idx
    ON runtime_records (status, last_op_at);

Why. Both read shapes share the same composite. The cleanup worker drives the index from one direction (range scan on last_op_at filtered by status); ListByStatus drives it from the other (equality on status, sorted by last_op_at). PostgreSQL satisfies both shapes through one index scan once the planner picks the index for the WHERE clause. The secondary game_id ASC tiebreak in the adapter ORDER BY is applied on top of the index output and is cheap at the row counts involved.

A second supporting index for the cleanup worker was considered and rejected: the workload is small enough (single-instance v1, bounded running game count) that one composite index is cheaper to maintain than two narrow ones.
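
Reduced to the SQL text the adapters ultimately issue (both statements are built through jet in the real code), the two read shapes that share the index look roughly like the constants below; column lists are abbreviated and the identifiers are illustrative:

// Cleanup worker: equality on status, range scan on last_op_at.
const cleanupCandidatesSQL = `
    SELECT game_id
    FROM rtmanager.runtime_records
    WHERE status = 'stopped' AND last_op_at < $1`

// ListByStatus: equality on status, newest-first with a stable tiebreak.
const listByStatusSQL = `
    SELECT game_id, status, current_container_id, last_op_at
    FROM rtmanager.runtime_records
    WHERE status = $1
    ORDER BY last_op_at DESC, game_id ASC`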

4. operation_log is append-only with bigserial id and a (game_id, started_at DESC) index

Decision. operation_log carries a bigserial id PRIMARY KEY and is written exclusively through INSERT — there is no UPDATE pathway, no soft-delete column, and no foreign key to runtime_records. The audit index operation_log_game_started_idx (game_id, started_at DESC) drives the GM/Admin REST audit reads. The adapter's ListByGame orders results started_at DESC, id DESC and applies LIMIT $2.

CREATE INDEX operation_log_game_started_idx
    ON operation_log (game_id, started_at DESC);

Why. The audit's correctness invariant is "every operation RTM performed gets exactly one row"; CASCADE deletes from runtime_records would silently lose history when an admin removes a runtime and would break the ../README.md §Persistence Layout commitment. The secondary id DESC tiebreak inside the adapter is necessary because the audit log can write multiple rows in the same millisecond when reconcile_adopt and a real operation interleave on a single tick; without the tiebreak the test that asserts insertion-order-stable reads becomes flaky. A non-positive limit is rejected before the SQL is issued; an empty result set returns as nil (matching the lobby pattern, so service-layer callers can do len(entries) == 0 without an extra allocation).
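
A condensed sketch of the ListByGame read path, written against database/sql with abbreviated columns; Store and Entry are illustrative stand-ins for the adapter's actual types, and the real adapter goes through jet and the generated models:

import (
    "context"
    "database/sql"
    "fmt"
    "time"
)

type Entry struct {
    ID        int64
    GameID    string
    OpKind    string
    OpSource  string
    Outcome   string
    StartedAt time.Time
}

type Store struct{ db *sql.DB }

func (s *Store) ListByGame(ctx context.Context, gameID string, limit int) ([]Entry, error) {
    if limit <= 0 {
        return nil, fmt.Errorf("listbygame: limit must be positive, got %d", limit)
    }
    rows, err := s.db.QueryContext(ctx, `
        SELECT id, game_id, op_kind, op_source, outcome, started_at
        FROM rtmanager.operation_log
        WHERE game_id = $1
        ORDER BY started_at DESC, id DESC
        LIMIT $2`, gameID, limit)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var entries []Entry // stays nil when no rows match
    for rows.Next() {
        var e Entry
        if err := rows.Scan(&e.ID, &e.GameID, &e.OpKind, &e.OpSource, &e.Outcome, &e.StartedAt); err != nil {
            return nil, err
        }
        e.StartedAt = e.StartedAt.UTC()
        entries = append(entries, e)
    }
    return entries, rows.Err()
}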

5. Enum CHECK constraints on op_kind, op_source, outcome

Decision. operation_log reproduces the three Go-level enums as CHECK constraints:

CONSTRAINT operation_log_op_kind_chk
    CHECK (op_kind IN (
        'start', 'stop', 'restart', 'patch',
        'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
    )),
CONSTRAINT operation_log_op_source_chk
    CHECK (op_source IN (
        'lobby_stream', 'gm_rest', 'admin_rest',
        'auto_ttl', 'auto_reconcile'
    )),
CONSTRAINT operation_log_outcome_chk
    CHECK (outcome IN ('success', 'failure'))

The Go-level enums in ../internal/domain/operation/log.go remain the source of truth.

Why. A defense-in-depth gate at the storage boundary catches any adapter regression that would otherwise persist an unexpected string. Operator-side queries (SELECT … WHERE op_kind = 'restart') benefit from the enum being verifiable directly in psql without consulting the Go source. Adding a new value requires editing two places (the Go enum and the migration), which is the right friction level: every new value is a wire-protocol change and deserves an explicit migration. The alternative of using PostgreSQL's CREATE TYPE … AS ENUM was rejected because adding a value to a PG enum type requires ALTER TYPE outside a transaction and complicates the single-init pre-launch policy (decision §12).

6. health_snapshots is one row per game; status enum collapses event types

Decision. health_snapshots carries game_id text PRIMARY KEY and stores the latest technical health observation per game. The status column enumerates the observed engine state, not the triggering event type:

CONSTRAINT health_snapshots_status_chk
    CHECK (status IN (
        'healthy', 'probe_failed', 'exited',
        'oom', 'inspect_unhealthy', 'container_disappeared'
    ))

The runtime:health_events event_type enum has seven values (container_started, container_exited, container_oom, container_disappeared, inspect_unhealthy, probe_failed, probe_recovered). The snapshot status has six: container_started and probe_recovered both collapse into healthy, probe_failed stays probe_failed, and the remaining four event types map one-to-one onto their status counterparts.

Why. Health snapshots answer "what state is the engine in right now", not "what event was just emitted". A consumer who wants the event firehose reads runtime:health_events; a consumer who wants the latest verdict reads health_snapshots. The two surfaces have different lifetimes (stream entries are bounded only by Redis trim; snapshot rows are overwritten on every new observation), so collapsing the seven event types into six status states aligns the column with the consumer's mental model. The adapter that implements this collapse lives in ../internal/adapters/healtheventspublisher/publisher.go; every emission to the stream also upserts the snapshot.
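
The collapse from event type to snapshot status is a pure mapping; sketched as a Go switch it looks like this (the string constants mirror the enums quoted above, but the function name is illustrative — the real mapping lives in ../internal/adapters/healtheventspublisher/publisher.go):

// snapshotStatus maps a runtime:health_events event type onto the
// health_snapshots.status value the upsert stores.
func snapshotStatus(eventType string) string {
    switch eventType {
    case "container_started", "probe_recovered":
        return "healthy"
    case "probe_failed":
        return "probe_failed"
    case "container_exited":
        return "exited"
    case "container_oom":
        return "oom"
    case "inspect_unhealthy":
        return "inspect_unhealthy"
    case "container_disappeared":
        return "container_disappeared"
    default:
        return "" // unknown event types are a programming error upstream
    }
}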

7. Two-axis CAS shape on runtime_records.UpdateStatus

Decision. runtimerecordstore.UpdateStatus compiles its CAS guard into a single WHERE … AND … clause. Status must equal the caller's ExpectedFrom; when the caller supplies a non-empty ExpectedContainerID, current_container_id must equal it as well:

UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
  AND status = $4
  [AND current_container_id = $5]

A RowsAffected() == 0 result is ambiguous — the row may be absent or the predicate may have failed. The adapter resolves the ambiguity through a follow-up SELECT status FROM ... WHERE game_id = $1: missing row → runtime.ErrNotFound; mismatch → runtime.ErrConflict. The probe runs only on the slow path; happy-path UPDATEs cost a single round trip.

Why. The two-axis CAS is what services need: a stop driven by an old container_id (from a stale REST request) must not clobber a fresh running record installed by a concurrent restart. Status-only CAS would collapse those two cases. The optional shape on ExpectedContainerID lets reconciliation flows that legitimately target "this game in running state without caring which container" omit the second predicate. The follow-up probe matches the gamestore / invitestore precedent in lobby/internal/adapters/postgres and produces clean per-error sentinels at the service layer.

TestUpdateStatusConcurrentCAS exercises the path end to end with eight goroutines racing the same transition: exactly one returns nil, the rest see runtime.ErrConflict. The test is deterministic because PostgreSQL serialises concurrent UPDATEs of the same row: the first writer takes the row lock, and the remaining writers re-evaluate the CAS predicate against the updated tuple and match zero rows.
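
A sketch of the slow-path probe, written against database/sql with the two sentinels named above; the function name, and the assumption that the probe runs as a standalone query after RowsAffected() == 0, are illustrative:

// resolveCASMiss disambiguates a zero-row CAS UPDATE: missing row versus
// predicate mismatch. It only runs on the slow path.
func resolveCASMiss(ctx context.Context, db *sql.DB, gameID string) error {
    var status string
    err := db.QueryRowContext(ctx,
        `SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
        gameID).Scan(&status)
    switch {
    case errors.Is(err, sql.ErrNoRows):
        return runtime.ErrNotFound // row absent
    case err != nil:
        return err
    default:
        return runtime.ErrConflict // row exists, CAS predicate did not match
    }
}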

8. Destination-driven SET clause on UpdateStatus

Decision. UpdateStatus updates a different column subset depending on the destination status:

Destination   Columns set
stopped       status, last_op_at, stopped_at
removed       status, last_op_at, removed_at, current_container_id = NULL
running       status, last_op_at

The implementation switches on input.To and writes the UPDATE chain inline per branch — three short branches read better than one parametric helper.

Why. Each destination has a different invariant. stopped records the wall-clock at which the engine ceased serving; removed nulls the container_id because the row no longer points at any Docker resource; running only updates the status and the last-op timestamp because the running invariants (current_container_id, fresh started_at, current_image_ref, engine_endpoint) are installed through Upsert on the start path.

A previous draft built the SET list via []pg.Column / []any slices and a helper, but jet's UPDATE(columns ...jet.Column) variadic refuses a []postgres.Column slice spread because the element type does not match jet.Column after the type-alias resolution. The final code switches inline per branch.

The running destination is implemented even though the start service uses Upsert for the inner start of restart and patch. Keeping the running path live preserves a one-to-one match between runtime.AllowedTransitions() and the adapter's capability matrix — otherwise a future caller exercising the stopped → running transition through UpdateStatus would hit a runtime error inside the adapter rather than a domain rejection. The path only updates status and last_op_at; callers responsible for the running invariants install them through Upsert first.
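
Reduced to the SQL fragments each branch ultimately produces (the real adapter builds them inline with jet, per the note above), the destination-driven SET lists look roughly like this; the function name and parameter positions are illustrative:

// setClauseFor returns the UPDATE ... SET fragment for a destination status.
// The CAS WHERE predicate (decision §7) is appended separately.
func setClauseFor(to string) string {
    switch to {
    case "stopped":
        return `SET status = $1, last_op_at = $2, stopped_at = $3`
    case "removed":
        return `SET status = $1, last_op_at = $2, removed_at = $3, current_container_id = NULL`
    case "running":
        return `SET status = $1, last_op_at = $2`
    default:
        return "" // rejected earlier by runtime.AllowedTransitions()
    }
}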

9. created_at preservation on Upsert

Decision. runtimerecordstore.Upsert is implemented as INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable column from EXCLUDED>. created_at is deliberately omitted from the DO UPDATE list, so a second Upsert with a fresh CreatedAt value never overwrites the stored timestamp.

INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status               = EXCLUDED.status,
    current_container_id = EXCLUDED.current_container_id,
    current_image_ref    = EXCLUDED.current_image_ref,
    engine_endpoint      = EXCLUDED.engine_endpoint,
    state_path           = EXCLUDED.state_path,
    docker_network       = EXCLUDED.docker_network,
    started_at           = EXCLUDED.started_at,
    stopped_at           = EXCLUDED.stopped_at,
    removed_at           = EXCLUDED.removed_at,
    last_op_at           = EXCLUDED.last_op_at
    -- created_at intentionally NOT updated

TestUpsertOverwritesMutableColumnsPreservesCreatedAt covers the invariant.

Why. runtime_records.created_at records "first time RTM saw the game". Every restart and every reconcile_adopt re-Upserts the row, passing the current wall-clock as CreatedAt across the adapter boundary; without the omission rule the timestamp would drift forward. Preserving the original creation time keeps a stable horizon for retention reasoning and matches lobby/internal/adapters/postgres/gamestore.Save, which uses the same approach for the games.created_at column.

10. health_snapshots.details JSONB round-trip with '{}'::jsonb default

Decision. health_snapshots.details is jsonb NOT NULL DEFAULT '{}'::jsonb. The jet-generated model declares Details string (jet maps jsonb to string). The adapter:

  • on Upsert, substitutes the SQL DEFAULT ('{}'::jsonb) when snapshot.Details is empty, so the column never holds a non-JSON empty string;
  • on Get, scans details as []byte and wraps the bytes in a json.RawMessage so the caller receives verbatim bytes without an extra round of parsing.

TestUpsertEmptyDetailsRoundTripsAsEmptyObject and TestUpsertAndGetRoundTrip cover the two cases.

Why. The detail payload is type-specific (the keys differ between probe_failed and inspect_unhealthy) and is opaque to queries — the column is never element-filtered. JSONB matches the "everything outside primary fields is JSON" pattern that the Notification Service already established and allows a future GIN index (e.g. for an admin search-by-key feature) without a schema rewrite. Substituting the SQL DEFAULT for an empty parameter avoids the trap where the database accepts '' for text but rejects it for jsonb.
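
A sketch of the two adapter behaviours in the bullets above, using plain SQL in place of the jet builder; the statement constants, column lists, and function names are illustrative:

const (
    // Illustrative statements; the real adapter's column list is wider.
    upsertDetailsSQL = `
        INSERT INTO rtmanager.health_snapshots (game_id, status, details)
        VALUES ($1, $2, $3)
        ON CONFLICT (game_id) DO UPDATE
        SET status = EXCLUDED.status, details = EXCLUDED.details`
    upsertDefaultDetailsSQL = `
        INSERT INTO rtmanager.health_snapshots (game_id, status, details)
        VALUES ($1, $2, DEFAULT)
        ON CONFLICT (game_id) DO UPDATE
        SET status = EXCLUDED.status, details = DEFAULT`
)

// upsertDetails switches to the column DEFAULT when details is empty, so the
// jsonb column never receives a non-JSON empty string.
func upsertDetails(ctx context.Context, db *sql.DB, gameID, status, details string) error {
    if details == "" {
        _, err := db.ExecContext(ctx, upsertDefaultDetailsSQL, gameID, status)
        return err
    }
    _, err := db.ExecContext(ctx, upsertDetailsSQL, gameID, status, details)
    return err
}

// getDetails scans the jsonb column as raw bytes and hands them back verbatim,
// so callers get json.RawMessage without a parse/re-marshal round trip.
func getDetails(ctx context.Context, db *sql.DB, gameID string) (json.RawMessage, error) {
    var raw []byte
    err := db.QueryRowContext(ctx,
        `SELECT details FROM rtmanager.health_snapshots WHERE game_id = $1`,
        gameID).Scan(&raw)
    if err != nil {
        return nil, err
    }
    return json.RawMessage(raw), nil
}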

11. Timestamps are uniformly timestamptz with UTC normalisation at the adapter boundary

Decision. Every time-valued column on every RTM table uses PostgreSQL's timestamptz. The domain model continues to use time.Time; the adapter normalises every time.Time parameter to UTC at the binding site (record.X.UTC() or the nullableTime helper that wraps a possibly-zero time.Time), and re-wraps every scanned time.Time with .UTC() (directly or via timeFromNullable for nullable columns) before the value leaves the adapter.

The architecture-wide form of this rule lives in ../../ARCHITECTURE.md §Persistence Backends → Timestamp handling.

Why. timestamptz is the right column type for every cross-service timestamp the platform observes, and the domain model needs a time.Time API the service layer can compare and do arithmetic on. Without an explicit .UTC() at the bind site, the pgx driver returns scanned values in time.Local, which silently breaks equality tests, JSON formatting, and comparison against pointer fields elsewhere in the codebase. The defensive .UTC() rule on both sides eliminates the class of bug where a timezone difference between the adapter and the test harness flips assertions intermittently.

The same shape is used in User Service, Mail Service, and Notification Service — RTM matches the existing convention rather than introducing a fourth encoding path.
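
The two helpers named above reduce to a few lines each; a sketch with the names taken from the decision but the bodies guessed:

// nullableTime binds a possibly-zero time.Time as NULL, otherwise as UTC.
func nullableTime(t time.Time) sql.NullTime {
    if t.IsZero() {
        return sql.NullTime{}
    }
    return sql.NullTime{Time: t.UTC(), Valid: true}
}

// timeFromNullable converts a scanned nullable column back to a time.Time
// normalised to UTC; NULL becomes the zero time.
func timeFromNullable(nt sql.NullTime) time.Time {
    if !nt.Valid {
        return time.Time{}
    }
    return nt.Time.UTC()
}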

12. Single-init pre-launch policy

Decision. 00001_init.sql evolves in place until first production deploy. Adding a column, an index, or a new table during the pre-launch development window edits this file directly rather than producing 00002_*.sql. The runtime applies the migration on every boot; if the schema is already at head, pkg/postgres's goose adapter exits zero.

Why. The schema-per-service architectural rule (../../ARCHITECTURE.md §Persistence Backends) endorses a single-init policy for pre-launch services. The pre-launch window allows non-additive changes (column rename, type narrowing, CHECK tightening) that a multi-step migration sequence would force into awkward two-step rewrites. Once the service ships to production, the next schema change becomes 00002_*.sql and the policy lifts; from that point onward edits to 00001_init.sql are rejected by code review.

This applies to RTM exactly the same way it applies to every other PG-backed service in the workspace; the README explicitly carries the reminder. The exit-zero behaviour for already-applied migrations is what makes the policy operationally cheap: a freshly-spawned replica re-applies the same 00001_init.sql with no work to do, no logged error, and proceeds to open its listeners.

13. Query layer is go-jet/jet/v2; generated code is committed

Decision. All three RTM PG-store packages (../internal/adapters/postgres/runtimerecordstore, ../internal/adapters/postgres/operationlogstore, ../internal/adapters/postgres/healthsnapshotstore) build SQL through the jet builder API (pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE plus the pg.AND/OR/SET/COALESCE/... DSL).

Generated table models live under ../internal/adapters/postgres/jet/ and are regenerated by make -C rtmanager jet. The target invokes ../cmd/jetgen/main.go, which spins up a transient PostgreSQL container via testcontainers, provisions the rtmanager schema and rtmanagerservice role, applies the embedded goose migrations, and runs github.com/go-jet/jet/v2/generator/postgres.GenerateDB against the provisioned schema. Generated code is committed to the repo, so build consumers do not need Docker.

Statements are run through the database/sql API (stmt.Sql() → db/tx.Exec/Query/QueryRow); manual rowScanner helpers preserve the codecs.go boundary translations and domain-type mapping (status enum decoding, time.Time UTC normalisation, JSONB []byte → json.RawMessage).

PostgreSQL constructs that the jet builder does not cover natively (COALESCE, LOWER on subselects, JSONB params) are expressed through the per-DSL helpers (pg.COALESCE, pg.LOWER, direct []byte/string params for JSONB columns).

Why. Aligns with the workspace-wide convention from ../../PG_PLAN.md: the query layer is github.com/go-jet/jet/v2 (PostgreSQL dialect) for every PG-backed service. Hand-rolled SQL would multiply boundary-translation paths and require per-store query-builder helpers for what jet already covers. Committing generated code keeps go build ./... working without Docker.
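
A minimal sketch of the jet-to-database/sql hand-off described above; the generated table package import path and the query shape are illustrative:

import (
    "context"
    "database/sql"

    pg "github.com/go-jet/jet/v2/postgres"

    // Generated by make -C rtmanager jet; the exact import path is illustrative.
    "example.invalid/rtmanager/internal/adapters/postgres/jet/rtmanager/table"
)

func listStopped(ctx context.Context, db *sql.DB) (*sql.Rows, error) {
    stmt := pg.SELECT(table.RuntimeRecords.AllColumns).
        FROM(table.RuntimeRecords).
        WHERE(table.RuntimeRecords.Status.EQ(pg.String("stopped"))).
        ORDER_BY(table.RuntimeRecords.LastOpAt.DESC(), table.RuntimeRecords.GameID.ASC())

    query, args := stmt.Sql() // render once, execute through database/sql
    return db.QueryContext(ctx, query, args...)
}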

14. redisstate keyspace ownership and per-store subpackages

Decision. The ../internal/adapters/redisstate/ package owns one shared Keyspace struct with a defaultPrefix = "rtmanager:" constant. Each Redis-backed adapter lives in its own subpackage:

  • streamoffsets — the stream offset store
  • gamelease — the per-game lease store

Both subpackages take a redisstate.Keyspace{} value and use it to build their key shapes (rtmanager:stream_offsets:{label}, rtmanager:game_lease:{game_id}).

Why. Keeping the parent package as the single owner of the prefix and the key-shape builder mirrors the way Lobby's redisstate namespace centralises every key shape and supports multiple Redis-backed adapters (stream offsets, the per-game lease) without a restructure as the surface grows.
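
A sketch of the keyspace shape, assuming a small value type with a default prefix; the real struct lives in ../internal/adapters/redisstate/ and the method names are illustrative:

package redisstate

const defaultPrefix = "rtmanager:"

// Keyspace owns every key shape the Redis-backed adapters use.
type Keyspace struct {
    Prefix string // empty means defaultPrefix
}

func (k Keyspace) prefix() string {
    if k.Prefix == "" {
        return defaultPrefix
    }
    return k.Prefix
}

// StreamOffset returns the key holding the consumer offset for one stream label.
func (k Keyspace) StreamOffset(label string) string {
    return k.prefix() + "stream_offsets:" + label
}

// GameLease returns the key holding the per-game lease.
func (k Keyspace) GameLease(gameID string) string {
    return k.prefix() + "game_lease:" + gameID
}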

The per-store subpackage choice (rather than Lobby's flat single-package shape) is driven by three considerations:

  • It keeps the docker mock generator scoped to one package, since mockgen regenerates per-directory.
  • It allows finer-grained dependency selection: miniredis is a dev-only dep, and keeping the streamoffsets package self-contained leaves room for gamelease to depend only on the production redis client.
  • Each subpackage carries its own tests, which keeps the test surface focused on one Redis primitive rather than mixing offset semantics with lease semantics in shared fixtures.

Cross-References