feat: runtime manager

This commit is contained in:
Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
@@ -0,0 +1,44 @@
# Runtime Manager — Service-Local Documentation
This directory hosts the service-local documentation for `Runtime
Manager`. The top-level [`../README.md`](../README.md) describes the
current-state contract (purpose, scope, lifecycles, surfaces,
configuration, observability); the documents below complement it with
focused content docs and design-rationale records.
## Content docs
- [Runtime and components](runtime.md) — process diagram, listeners,
workers, lifecycle services, stream offsets, configuration groups,
runtime invariants.
- [Flows](flows.md) — mermaid sequence diagrams for the lifecycle and
observability flows.
- [Operator runbook](runbook.md) — startup, readiness, shutdown, and
recovery scenarios.
- [Configuration and contract examples](examples.md) — `.env`,
REST request bodies, stream payloads, storage inspection snippets.
## Design rationale
- [PostgreSQL schema decisions](postgres-migration.md) — the schema
decision record consolidating the persistence-layer agreements
(tables, indexes, CAS shape, `created_at` preservation, jsonb
round-trip, schema/role provisioning split).
- [Domain and ports](domain-and-ports.md) — string-typed enums, the
four allowed runtime transitions, why `Inspect` splits into
`InspectImage` / `InspectContainer`, why `LobbyGameRecord` is
minimal, and other domain-layer choices.
- [Adapters](adapters.md) — Docker SDK adapter, Lobby internal HTTP
client, the three Redis publishers, the `mockgen` convention for
wide ports, and the unit-test strategy for HTTP-backed adapters.
- [Lifecycle services](services.md) — per-game lease semantics, the
`Result`-shaped contract, failure-mode tables, the lease-bypass
`Run` method on inner services, the `X-Galaxy-Caller` header
convention, and the canonical error code → HTTP status mapping.
- [Background workers](workers.md) — single-ownership table per
`event_type`, `container_disappeared` suppression rules, probe
hysteresis, the events listener reconnect policy, the reconciler's
per-game lease and three drift kinds.
- [Service-local integration suite](integration-tests.md) — the
`integration` build tag, the in-process `app.NewRuntime` choice,
the Lobby HTTP stub, and the test isolation strategy.
@@ -0,0 +1,192 @@
# Adapters
This document explains why the production adapters under
[`../internal/adapters/`](../internal/adapters) — Docker SDK,
Lobby internal HTTP client, notification-intent publisher, health-event
publisher, job-result publisher — are shaped the way they are. The
PostgreSQL stores and the Redis-coordination adapters live in
[`postgres-migration.md`](postgres-migration.md).
## 1. `mockgen` is the repo-wide convention for wide ports
The Docker port has nine methods plus eight value types in the
signatures, and the lifecycle services and workers (start, stop,
restart, patch, cleanup, reconcile, events, probe) between them
exercise nearly every method.
A hand-rolled fake would either miss methods or balloon to a per-test
fixture.
`internal/adapters/docker/` therefore uses `go.uber.org/mock` mocks:
- `//go:generate` directives live next to the interface declaration in
`internal/ports/dockerclient.go`;
- generated code is committed under `internal/adapters/docker/mocks/`
(matching the `internal/adapters/postgres/jet/` discipline);
- `make -C rtmanager mocks` is the single command operators run after
a port-signature change.
The maintained `go.uber.org/mock` fork is preferred over the archived
`github.com/golang/mock`. This convention applies to wide / recorder
ports across the repository — Lobby uses the same pipeline for its
narrow recorder ports (`RuntimeManager`, `IntentPublisher`,
`GMClient`, `UserService`); see
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for the cross-service
rule.
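For reference, a hedged sketch of the directive shape from the first
bullet above; the exact `mockgen` flags and output path here are
assumptions, and the committed directive in
`internal/ports/dockerclient.go` is authoritative:
```go
// internal/ports/dockerclient.go — directive shape only; the interface body
// is elided and the flags below are illustrative.
package ports

//go:generate go run go.uber.org/mock/mockgen -source=dockerclient.go -destination=../adapters/docker/mocks/dockerclient_mock.go -package=mocks
```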
The other two RTM ports (`LobbyInternalClient`,
`NotificationIntentPublisher`) keep inline `_test.go` fakes: small
surfaces, easy to fake by hand inside a single test file when needed.
## 2. `EngineEndpoint` is built inside the Docker adapter
The engine port is fixed at `8080`. Pushing it into `RunSpec` would
force the start service to know an engine implementation detail;
pushing it into config would give operators a knob that the engine
image already does not honour. The Docker adapter exposes
`EnginePort = 8080` as a package constant and constructs
`RunResult.EngineEndpoint = "http://" + spec.Hostname + ":8080"`
itself.
The adapter also leaves `container.Config.ExposedPorts` empty: RTM
never publishes ports to the host. The user-defined Docker bridge
network gives every container in the network DNS access to the engine
via `galaxy-game-{game_id}:8080`.
## 3. `Run` removes the container on `ContainerStart` failure
`README.md §Lifecycles → Start` requires no orphan to remain after a
failed start path. If `ContainerCreate` succeeds but `ContainerStart`
fails, the adapter calls `ContainerRemove(force=true)` inside a fresh
`context.Background()` (with a 10s timeout) so the cleanup runs even
when the original ctx is already cancelled. The cleanup is best-effort:
a remove failure is silently discarded because the original start
failure is the actionable error returned to the caller.
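A minimal sketch of that rollback rule, assuming recent Docker SDK
option types (`container.StartOptions`, `container.RemoveOptions`);
the helper name and surrounding plumbing are illustrative, not the
adapter's actual code:
```go
package docker

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// startOrRemove starts an already-created container and, on failure, removes
// it on a fresh background context so the cleanup still runs when the
// caller's ctx is already cancelled.
func startOrRemove(ctx context.Context, cli *client.Client, id string) error {
	if err := cli.ContainerStart(ctx, id, container.StartOptions{}); err != nil {
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		// Best-effort: a remove failure is discarded because the start
		// failure is the actionable error returned to the caller.
		_ = cli.ContainerRemove(cleanupCtx, id, container.RemoveOptions{Force: true})
		return fmt.Errorf("container start: %w", err)
	}
	return nil
}
```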
The alternative — leaving rollback to the start service — would either
duplicate the same code in every caller or invite a service that forgets
to do it. Centralising the rule in the adapter keeps the port contract
simple. The start service adds an additional rollback layer for the
post-`Run` `Upsert` failure path; see [`services.md`](services.md) §5.
## 4. `RunSpec.Cmd` is optional
`ports.RunSpec` exposes an optional `Cmd []string`. Production callers
leave it `nil` so the engine image's own `CMD` runs;
`internal/adapters/docker/smoke_test.go` uses it to drive
`["/bin/sh","-c","sleep 60"]` against `alpine:3.21`.
The alternative — building a dedicated test image with a pre-baked
`sleep` command — would require an extra `Dockerfile` under testdata
and a build step inside the smoke test. The single new field is
documented as optional and ignored when empty; production behaviour is
unchanged.
## 5. `EventsListen` filters at the adapter boundary
The Docker `/events` API accepts a `filters` query parameter, but the
daemon treats it as a hint, not a guarantee. The adapter therefore
double-checks at the boundary: only `Type == events.ContainerEventType`
messages are passed through to the typed `<-chan ports.DockerEvent`.
Doing the filter at the SDK level would still require a defensive
recheck on the consumer side; consolidating the check in the adapter
keeps the contract crisp and the consumer free of Docker-internal type
discriminants.
The decoded event copies the actor's full `Attributes` map into
`DockerEvent.Labels`. Docker mixes container labels and runtime
attributes (`exitCode`, `image`, `name`, etc.) flat in the same map;
RTM consumers filter by the `com.galaxy.` prefix when they care about
labels, and the adapter extracts `exitCode` separately for `die`
events.
## 6. Lobby HTTP client error mapping
`ports.LobbyInternalClient.GetGame` fixes:
- `200` → `LobbyGameRecord` decoded tolerantly (unknown fields
ignored);
- `404` → `ports.ErrLobbyGameNotFound`;
- transport, timeout, or any other non-2xx → `ports.ErrLobbyUnavailable`
wrapped with the original error so callers can `errors.Is` and still
log the cause.
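A hedged sketch of that mapping. The record fields and sentinel error
names follow the text above; the JSON tags, helper name, and HTTP
plumbing are illustrative:
```go
package lobbyclient

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrLobbyGameNotFound = errors.New("lobby game not found")
	ErrLobbyUnavailable  = errors.New("lobby unavailable")
)

type LobbyGameRecord struct {
	GameID              string `json:"game_id"`
	Status              string `json:"status"`
	TargetEngineVersion string `json:"target_engine_version"`
}

// decodeGetGame maps a completed response to the port contract: 200 decodes
// tolerantly, 404 becomes ErrLobbyGameNotFound, anything else becomes
// ErrLobbyUnavailable carrying the underlying cause in its message.
func decodeGetGame(resp *http.Response) (*LobbyGameRecord, error) {
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		var rec LobbyGameRecord
		if err := json.NewDecoder(resp.Body).Decode(&rec); err != nil {
			return nil, fmt.Errorf("%w: decode body: %v", ErrLobbyUnavailable, err)
		}
		return &rec, nil
	case http.StatusNotFound:
		return nil, ErrLobbyGameNotFound
	default:
		return nil, fmt.Errorf("%w: unexpected status %d", ErrLobbyUnavailable, resp.StatusCode)
	}
}
```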
The start service treats `ErrLobbyUnavailable` as recoverable: it
continues without the diagnostic data because the start envelope
already carries the only required field (`image_ref`). The client
mirrors `notification/internal/adapters/userservice/client.go`: cloned
`*http.Transport`, `otelhttp.NewTransport` wrap, per-request
`context.WithTimeout`, idempotent `Close()` releasing idle connections.
JSON decoding is tolerant: unknown fields in the success body do not
break the call, so additive changes to Lobby's `GameRecord` schema do
not require an RTM release.
## 7. Notification publisher wrapper signature
The wrapper drops the entry id returned by
`notificationintent.Publisher.Publish` (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §7). The adapter is a
thin shim:
- `NewPublisher(cfg)` constructs the inner publisher and forwards
validation;
- `Publish(ctx, intent)` calls the inner publisher and discards the
entry id.
The compile-time assertion `var _ ports.NotificationIntentPublisher =
(*Publisher)(nil)` lives in `publisher.go`.
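A minimal sketch of the shim shape; the intent type, field names, and
import path are illustrative, only the discarded entry id is the point:
```go
package notificationpublisher

import (
	"context"

	"example.invalid/galaxy/pkg/notificationintent" // import path illustrative
)

// Publisher wraps the shared intent publisher behind the narrow RTM port.
type Publisher struct {
	inner *notificationintent.Publisher
}

// Publish forwards to the inner publisher and discards the stream entry id.
func (p *Publisher) Publish(ctx context.Context, intent notificationintent.Intent) error {
	_, err := p.inner.Publish(ctx, intent)
	return err
}
```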
## 8. Health-events publisher: snapshot upsert before stream XADD
Every emission goes through
`ports.HealthEventPublisher.Publish`, which both XADDs to
`runtime:health_events` and upserts `health_snapshots`. The snapshot
upsert runs **before** the XADD: a successful Publish always leaves
the snapshot store at least as fresh as the stream, and a partial
failure leaves the snapshot a best-effort lower bound. Reversing the
order would let consumers observe a stream entry whose
`health_snapshots` row reflects the prior observation — a misleading
inversion.
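A hedged sketch of the ordering, with the snapshot store and stream
writer reduced to illustrative single-method interfaces rather than the
adapter's real dependencies:
```go
package healtheventspublisher

import (
	"context"
	"fmt"
)

// snapshotStore and streamWriter stand in for the real dependencies behind
// ports.HealthEventPublisher.
type snapshotStore interface {
	Upsert(ctx context.Context, gameID, status string) error
}

type streamWriter interface {
	XAdd(ctx context.Context, fields map[string]any) error
}

// publish upserts the snapshot first, then XADDs to runtime:health_events,
// so a successful call never leaves the stream ahead of the snapshot store.
func publish(ctx context.Context, snaps snapshotStore, stream streamWriter, gameID, status string, fields map[string]any) error {
	if err := snaps.Upsert(ctx, gameID, status); err != nil {
		return fmt.Errorf("upsert health snapshot: %w", err)
	}
	if err := stream.XAdd(ctx, fields); err != nil {
		// Partial failure: the snapshot stays a best-effort lower bound.
		return fmt.Errorf("xadd health event: %w", err)
	}
	return nil
}
```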
The `event_type → SnapshotStatus / SnapshotSource` mapping mirrors the
table in [`../README.md` §Health Monitoring](../README.md). In
particular, `container_started` collapses to `SnapshotStatusHealthy`
and `probe_recovered` does the same (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §4).
## 9. Unit-test strategy
Both HTTP-backed adapters (Docker SDK, Lobby client) use
`httptest.Server` fixtures. The Docker SDK speaks HTTP under the hood
for both unix sockets and TCP, so adapter unit tests construct a
Docker client with `client.WithHost(server.URL)` and
`client.WithHTTPClient(server.Client())`, which lets table-driven
handlers fake every Docker API endpoint without touching the real
daemon. The Docker API version is pinned to `1.45`
(`client.WithVersion("1.45")`) so the URL prefix is stable across CI
machines whose daemon advertises a different default. Production
wiring (in `internal/app/bootstrap.go`) keeps API negotiation enabled.
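A hedged sketch of the fixture shape; the helper name is illustrative,
while the client options are the ones named above:
```go
package docker_test

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/docker/docker/client"
)

// newFakeDockerClient points a Docker SDK client at an httptest.Server so
// table-driven handlers can fake any API endpoint. Pinning the API version
// keeps the request prefix (/v1.45/...) stable across machines.
func newFakeDockerClient(t *testing.T, handler http.Handler) *client.Client {
	t.Helper()
	srv := httptest.NewServer(handler)
	t.Cleanup(srv.Close)
	cli, err := client.NewClientWithOpts(
		client.WithHost(srv.URL),
		client.WithHTTPClient(srv.Client()),
		client.WithVersion("1.45"),
	)
	if err != nil {
		t.Fatalf("new docker client: %v", err)
	}
	t.Cleanup(func() { _ = cli.Close() })
	return cli
}
```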
The notification publisher uses `miniredis` directly because the
adapter's only side effect is an `XADD`, which `miniredis` reproduces
faithfully; this mirrors every other Galaxy intent test.
## 10. Docker smoke test
`internal/adapters/docker/smoke_test.go` runs on the default
`go test ./...` invocation and calls `t.Skip` unless the local daemon
is reachable (`/var/run/docker.sock` exists or `DOCKER_HOST` is set).
The covered sequence:
1. provision a temporary user-defined bridge network;
2. assert `EnsureNetwork` for present and missing names;
3. pull `alpine:3.21` (`PullPolicyIfMissing`);
4. subscribe to events;
5. run a sleep container with the full `RunSpec` field set;
6. observe a `start` event for the new container id;
7. inspect, stop, remove, and verify `ErrContainerNotFound` is
reported afterwards.
This is the production adapter's only end-to-end check that runs from
the default `go test` pass; the broader service-local integration
suite ([`integration-tests.md`](integration-tests.md)) is gated
behind `-tags=integration`.
@@ -0,0 +1,167 @@
# Domain and Ports
This document explains why the `rtmanager` domain layer
([`../internal/domain/`](../internal/domain)) and the port interfaces
([`../internal/ports/`](../internal/ports)) are shaped the way they are.
The current-state types and method signatures are the source of truth in
the code; this file records the rationale so future readers do not
re-litigate the same trade-offs.
For the surrounding behaviour see
[`../README.md`](../README.md), the SQL CHECK constraints in
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql),
the wire contracts under [`../api/`](../api), and
[`postgres-migration.md`](postgres-migration.md) for the persistence
layer.
## 1. String-typed status enums
`runtime.Status`, `operation.OpKind`, `operation.OpSource`,
`operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and
`health.SnapshotSource` are all `type X string`.
The string approach wins on three counts:
- the SQL CHECK constraints already store the values as `text`, so a
string domain type maps one-to-one with no codec layer;
- it matches Lobby (`game.Status`, `membership.Status`,
`application.Status`), so reviewers do not switch encoding mental
models when crossing service boundaries;
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8`
with stringer-generated names would pay a constant lookup and make raw
SQL columns harder to read in diagnostics.
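A hedged sketch of the pattern for `runtime.Status`; the value set
matches the SQL CHECK constraint, while the exact constant names in the
domain package may differ:
```go
package runtime

// Status is a string-typed enum so the value maps one-to-one onto the text
// column guarded by the SQL CHECK constraint.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown keeps the invariant cheap: a single switch, no lookup table.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	}
	return false
}
```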
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`
The PostgreSQL columns are nullable. The domain model uses plain
`string` with empty == NULL and bridges the SQL nullability inside the
adapter. Pointer fields would force every consumer to dereference
defensively even though business logic rarely cares about the
NULL/empty distinction (removed records may legitimately carry either
form depending on whether the record passed through `stopped` first).
The adapter's job is to translate `sql.NullString``string`; the rest
of the codebase reads the field as a regular value.
## 3. `*time.Time` for nullable timestamps
`StartedAt`, `StoppedAt`, `RemovedAt` retain pointer types. `time.Time{}`
is a real, comparable value in Go (`IsZero` only reports the canonical
zero time); mixing "missing" and "set to UTC zero" through plain
`time.Time` would invite bugs. The jet-generated `model.RuntimeRecords`
already declares the same fields as `*time.Time`, so the domain type
aligns with the persistence type and the adapter does not re-shape
pointers.
## 4. `EventType` and `SnapshotStatus` are deliberately distinct
`EventType` in `runtime-health-asyncapi.yaml` enumerates seven values; the
SQL CHECK on `health_snapshots.status` enumerates six. The two sets
overlap but are not identical:
- `container_started` is an *event*; the snapshot collapses it to
`healthy` (a successful start is observed as the container being
live, not as an ongoing event);
- `probe_recovered` is an *event*; it does not become a snapshot row of
its own — the next inspect/probe overwrites the prior `probe_failed`
with `healthy`.
Modelling them as one shared enum would require a separate "event vs
snapshot" boolean and invite accidental mismatches. Two distinct types
with explicit `IsKnown` matrices keep each surface honest at compile
time.
## 5. `Inspect` split into `InspectImage` + `InspectContainer`
Two narrow methods replace a single polymorphic `Inspect`. The surface
RTM exercises has two shapes:
- the start service inspects the *image* by reference to read resource
limits from labels;
- the periodic inspect worker, the reconciler, and the events listener
inspect *containers* by id to read state, health, restart count, and
exit code.
The inputs differ (ref vs id), and the result types differ
(`ImageInspect.Labels` is the only field used at start time, while
`ContainerInspect` carries a dozen state fields). One polymorphic
method would either split internally on input type or return a tagged
union; either is messier than two narrow methods.
## 6. `LobbyGameRecord` is intentionally minimal
`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and
`TargetEngineVersion`. The fetch is classified as ancillary diagnostics
because the start envelope already carries the only required field
(`image_ref`).
Anything more would invite RTM consumers to depend on Lobby's schema in
ways that violate the "RTM never resolves engine versions" rule.
Future fields are additive: each new field is opt-in to the consumer
and does not break existing call sites. The minimalism is also a hedge
against schema drift — Lobby's `GameRecord` is large and changes more
often than RTM needs to track.
## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`
Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so
business workflows that key on it (idempotency keys, audit
correlation) can capture it. RTM publishes admin-only failure intents
where the entry id has no consumer — failing starts do not loop back
to RTM, and notification routing keys on the producer-supplied
`idempotency_key` rather than the stream id. The adapter wraps
`pkg/notificationintent.Publisher` and discards the entry id at the
wrapper boundary.
## 8. Exactly four allowed runtime transitions
`runtime.AllowedTransitions` covers:
- `running → stopped` — graceful stop, observed exit, reconcile
observed exited;
- `running → removed``reconcile_dispose` when the container
vanished;
- `stopped → running` — restart and patch inner start;
- `stopped → removed` — cleanup TTL or admin DELETE.
Other pairs are intentionally rejected:
- `running → running` and `stopped → stopped` would mean Upsert
overwrote state without a CAS guard. Idempotent re-start / re-stop
never transitions; the service layer returns `replay_no_op` and the
record is left untouched.
- `removed → *` is forbidden because `removed` is terminal. The
reconciler creates fresh records with `reconcile_adopt` rather than
resurrecting old ones.
Encoding the table this way means a future bug where a service tries
to revive a removed record is rejected at the domain layer rather than
the adapter, which keeps the failure mode close to the offending code.
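A hedged sketch of one way to encode the table, building on the
`Status` sketch in §1; the domain package's actual representation may
differ, but the four allowed pairs are the ones listed above:
```go
package runtime

// allowedTransitions encodes the four legal status changes; removed is
// terminal and has no outgoing edges.
var allowedTransitions = map[Status]map[Status]bool{
	StatusRunning: {StatusStopped: true, StatusRemoved: true},
	StatusStopped: {StatusRunning: true, StatusRemoved: true},
}

// CanTransition reports whether the from/to pair is one of the allowed four.
func CanTransition(from, to Status) bool {
	return allowedTransitions[from][to]
}
```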
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`
The same enum exists as `config.ImagePullPolicy`. Importing
`internal/config` from the ports package would couple two unrelated
layers and create a cyclic risk once the wiring layer pulls both in.
The runtime/wiring layer (in `internal/app`) is the single point that
translates between the two type aliases — both are `string`-typed, the
value sets are identical, and the validation lives on each side
independently.
## 10. Compile-time interface assertions live with adapters
Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the
assertion lives in the adapter package (e.g.
`var _ ports.RuntimeRecordStore = (*Store)(nil)` inside
`internal/adapters/postgres/runtimerecordstore`). Putting the
assertions in the port package would force the port package to import
its own implementations and create an obvious import cycle.
## 11. `RunSpec.Validate` lives on the request type
The Docker port carries a non-trivial request type (`RunSpec`) with
eight required fields and per-mount invariants. Putting `Validate` on
the request struct keeps the rule next to the type definition, mirrors
the pattern used by `lobby/internal/ports/gmclient.go`
(`RegisterGameRequest.Validate`), and lets the adapter call it as the
first defensive check before invoking the Docker SDK.
@@ -0,0 +1,429 @@
# Configuration And Contract Examples
The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.
## Example `.env`
A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).
```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0
# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h
# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s
# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
a tag mutation is immediately visible to the next runtime),
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and configure `OTEL_*` against the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).
For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).
## Internal HTTP Examples
Every endpoint admits the optional `X-Galaxy-Caller` header which the
handler records as `op_source` in `operation_log` (`gm``gm_rest`,
`admin``admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
### Probe a runtime record
```bash
curl -s -H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```
Response (`200 OK`):
```json
{
"game_id": "game-01HZ...",
"status": "running",
"current_container_id": "1f2a...",
"current_image_ref": "galaxy/game:1.4.0",
"engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
"state_path": "/var/lib/galaxy/games/game-01HZ...",
"docker_network": "galaxy-net",
"started_at": "2026-04-28T07:18:54Z",
"stopped_at": null,
"removed_at": null,
"last_op_at": "2026-04-28T07:18:54Z",
"created_at": "2026-04-28T07:18:54Z"
}
```
### List all runtimes
```bash
curl -s -H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes
```
The response shape is `{"items":[<RuntimeRecord>...]}`.
### Start a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
-d '{"image_ref": "galaxy/game:1.4.0"}'
```
A `200` returns the `RuntimeRecord` for the running runtime. Failure
shapes use the canonical envelope; e.g. an invalid `image_ref`:
```json
{
"error": {
"code": "start_config_invalid",
"message": "image_ref shape rejected by docker reference parser"
}
}
```
### Stop a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
-d '{"reason": "admin_request"}'
```
Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
### Restart a runtime
```bash
curl -s -X POST \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```
The body is empty; restart re-uses the current `image_ref`.
### Patch a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.
### Cleanup a stopped runtime container
```bash
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```
Cleanup refuses a `running` runtime with `409 conflict`; stop first.
## Stream Payload Examples
Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).
### `runtime:start_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:start_jobs '*' \
game_id 'game-01HZ...' \
image_ref 'galaxy/game:1.4.0' \
requested_at_ms 1714081234567
```
### `runtime:stop_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:stop_jobs '*' \
game_id 'game-01HZ...' \
reason 'cancelled' \
requested_at_ms 1714081234567
```
### `runtime:job_results` (RTM → Lobby)
Success envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code '' \
error_message ''
```
Failure envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'failure' \
container_id '' \
engine_endpoint '' \
error_code 'image_pull_failed' \
error_message 'pull failed: manifest unknown'
```
Idempotent replay envelope (success outcome with explicit
`replay_no_op`):
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code 'replay_no_op' \
error_message ''
```
The contract permits empty `container_id` and `engine_endpoint`
strings on every value of `outcome` so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).
### `runtime:health_events` (RTM out)
The wire shape is the same for every event type — only the
`details` payload differs.
`container_started`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_started' \
occurred_at_ms 1714081234567 \
details '{"image_ref":"galaxy/game:1.4.0"}'
```
`container_exited`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_exited' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137,"oom":false}'
```
`container_oom`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_oom' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137}'
```
`container_disappeared`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_disappeared' \
occurred_at_ms 1714081234567 \
details '{}'
```
`inspect_unhealthy`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'inspect_unhealthy' \
occurred_at_ms 1714081234567 \
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```
`probe_failed` (after the threshold is crossed):
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_failed' \
occurred_at_ms 1714081234567 \
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```
`probe_recovered`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_recovered' \
occurred_at_ms 1714081234567 \
details '{"prior_failure_count":3}'
```
### `notification:intents` (RTM admin notifications)
RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message,
attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
`runtime.image_pull_failed`:
```bash
redis-cli XADD notification:intents '*' \
envelope '{
"type": "runtime.image_pull_failed",
"producer": "rtmanager",
"idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
"audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
"payload": {
"game_id": "game-01HZ...",
"image_ref": "galaxy/game:1.4.0",
"error_code": "image_pull_failed",
"error_message": "pull failed: manifest unknown",
"attempted_at_ms": 1714081234567
}
}'
```
`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.
## Storage Inspection
### Inspect a runtime record (PostgreSQL)
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```
Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).
### Inspect runtime status counts
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```
### Inspect the operation log for a game
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = 'game-01HZ...'
ORDER BY started_at DESC, id DESC
LIMIT 50"
```
### Inspect the latest health snapshot
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, container_id, status, source, observed_at, details
FROM rtmanager.health_snapshots
WHERE game_id = 'game-01HZ...'"
```
### Inspect Redis runtime-coordination keys
```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...
# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50
# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```
@@ -0,0 +1,305 @@
# Flows
This document collects the lifecycle and observability flows that
span Runtime Manager and its synchronous and asynchronous neighbours.
Narrative descriptions of the rules these flows enforce live in
[`../README.md`](../README.md); the diagrams here focus on the message
order across the boundary. Design-rationale records linked from each
section explain the *why*.
## Start (happy path)
```mermaid
sequenceDiagram
participant Lobby as Lobby publisher
participant Stream as runtime:start_jobs
participant Consumer as startjobsconsumer
participant Service as startruntime
participant Lease as Redis lease
participant Docker
participant PG as Postgres
participant Health as runtime:health_events
participant Results as runtime:job_results
Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
Consumer->>Stream: XREAD
Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
Service->>Docker: PullImage(image_ref) per pull policy
Service->>Docker: InspectImage → resource limits
Service->>Service: prepareStateDir(<root>/{game_id})
Service->>Docker: ContainerCreate + ContainerStart
Service->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
Service->>Health: XADD container_started
Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
REST callers (Game Master, Admin Service) drive the same service
through `POST /api/v1/internal/runtimes/{game_id}/start`; the
diagram's last two arrows collapse to an HTTP `200` response carrying
the runtime record. Sources:
[`../README.md` §Lifecycles → Start](../README.md#start),
[`services.md` §3](services.md).
## Start failure (image pull)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
participant Results as runtime:job_results
Service->>Docker: PullImage(image_ref)
Docker-->>Service: error
Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
```
The same shape applies to the configuration-validation failures
(`start_config_invalid` from `EnsureNetwork(ErrNetworkMissing)`,
`prepareStateDir`, or invalid `image_ref` shape) and the Docker
create/start failure (`container_start_failed`); only the error code
and the matching `runtime.*` notification type differ. Three failure
codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, `internal_error`
([`services.md` §4](services.md)).
## Start failure (orphan / Upsert-after-Run rollback)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
Service->>Docker: ContainerCreate + ContainerStart
Docker-->>Service: container running
Service->>PG: Upsert runtime_records
PG-->>Service: error (transport / constraint)
Note over Service: container is now an orphan<br/>(running, no PG record)
Service->>Docker: Remove(container_id) [fresh background context]
Docker-->>Service: ok or logged failure
Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
Service->>Intents: XADD runtime.container_start_failed
Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
```
The Docker adapter already removes the container when `Run` itself
fails after a successful `ContainerCreate`
([`adapters.md` §3](adapters.md)); the start service adds the
post-`Run` rollback for the `Upsert` path. A `Remove` failure is
logged but not propagated; the reconciler adopts surviving orphans on
its periodic pass ([`services.md` §5](services.md)).
## Stop
```mermaid
sequenceDiagram
participant Caller as Lobby / GM / Admin
participant Service as stopruntime
participant Lease as Redis lease
participant PG as Postgres
participant Docker
participant Results as runtime:job_results
Caller->>Service: stop(game_id, reason)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
alt status in {stopped, removed}
Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
Service-->>Caller: success / replay_no_op
else status = running
Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
Docker-->>Service: ok
Service->>PG: UpdateStatus running→stopped (CAS by container_id)
Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
Service-->>Caller: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
Lobby callers receive the outcome through `runtime:job_results`; REST
callers receive an HTTP `200`. The `reason` enum
(`orphan_cleanup | cancelled | finished | admin_request | timeout`)
is recorded in `operation_log` and is otherwise opaque to the stop
service — RTM does not branch on the reason in v1
([`services.md` §15, §17](services.md)).
## Restart
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as restartruntime
participant Stop as stopruntime.Run
participant Start as startruntime.Run
participant Docker
participant PG as Postgres
Admin->>Service: POST /restart
Service->>PG: SELECT runtime_records WHERE game_id
Note over Service: capture current image_ref
Service->>Service: acquire per-game lease (held across both inner ops)
Service->>Stop: Run(game_id) [lease bypass]
Stop->>Docker: ContainerStop
Stop->>PG: UpdateStatus running→stopped
Service->>Docker: ContainerRemove
Service->>Start: Run(game_id, image_ref) [lease bypass]
Start->>Docker: PullImage / Run
Start->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
Service-->>Admin: 200 {runtime_record}
Service->>Service: release lease
```
The lease is acquired by `restartruntime` and held across both inner
operations; `stopruntime.Run` and `startruntime.Run` are
lease-bypass entry points that skip the inner lease acquisition
([`services.md` §12](services.md)). The single `operation_log` row
uses `Input.SourceRef` as a correlation id linking the implicit stop
and start entries ([`services.md` §13](services.md)).
## Patch
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as patchruntime
participant Restart as restartruntime.Run
Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
Service->>Service: parse new image_ref + current image_ref
alt either ref not semver
Service-->>Admin: 422 image_ref_not_semver
else major or minor differ
Service-->>Admin: 422 semver_patch_only
else major.minor match, patch differs (or equal)
Service->>Restart: Run(game_id, new_image_ref)
Restart-->>Service: Result
Service-->>Admin: 200 {runtime_record}
end
```
The semver gate uses the tag fragment of the Docker reference; the
extraction strategy is recorded in [`services.md` §14](services.md).
The restart delegate already owns the lease, the inner stop/start,
the operation log, and the `runtime:health_events container_started`
emission ([`workers.md` §1](workers.md)).
## Cleanup TTL
```mermaid
sequenceDiagram
participant Worker as containercleanup worker
participant PG as Postgres
participant Service as cleanupcontainer
participant Lease as Redis lease
participant Docker
loop every RTMANAGER_CLEANUP_INTERVAL
Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
loop per game
Worker->>Service: cleanup(game_id, op_source=auto_ttl)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: re-read runtime_records WHERE game_id
alt status = running
Service-->>Worker: refused / conflict
else status in {stopped, removed}
Service->>Docker: ContainerRemove(container_id)
Service->>PG: UpdateStatus stopped→removed (CAS)
Service->>PG: INSERT operation_log (op_kind=cleanup_container)
Service-->>Worker: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
end
end
```
Admin-driven cleanup follows the same path through
`DELETE /api/v1/internal/runtimes/{game_id}/container` with
`op_source=admin_rest` instead of `auto_ttl`. The host state directory
is **never** removed by this flow
([`../README.md` §Cleanup](../README.md#cleanup),
[`services.md` §17](services.md),
[`workers.md` §19](workers.md)).
## Reconcile drift adopt
```mermaid
sequenceDiagram
participant Reconciler as reconcile worker
participant Docker
participant PG as Postgres
participant Lease as Redis lease
Note over Reconciler: read pass (lockless)
Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
Reconciler->>PG: ListByStatus(running)
Note over Reconciler: write pass (per-game lease)
loop per Docker container without matching record
Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Reconciler->>PG: re-read runtime_records WHERE game_id
alt record now exists
Reconciler-->>Reconciler: skip (state changed since read pass)
else record still missing
Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
end
Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
end
```
The reconciler **never** stops or removes an unrecorded container —
operators may have started one manually for diagnostics. The
`reconcile_dispose` and `observed_exited` paths follow the same
read-pass / write-pass split, with `dispose` updating the orphaned
record to `removed` and emitting `container_disappeared`, and
`observed_exited` updating to `stopped` and emitting `container_exited`
([`../README.md` §Reconciliation](../README.md#reconciliation),
[`workers.md` §14–§16](workers.md)).
## Health probe hysteresis
```mermaid
sequenceDiagram
participant Worker as healthprobe worker
participant State as in-memory probe state
participant Engine as galaxy-game-{id}:8080
participant Health as runtime:health_events
loop every RTMANAGER_PROBE_INTERVAL
Worker->>Worker: ListByStatus(running)
Worker->>State: prune entries for games no longer running
loop per game (semaphore cap = 16)
Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
alt success
State->>State: consecutiveFailures = 0
opt failurePublished was true
Worker->>Health: XADD probe_recovered {prior_failure_count}
State->>State: failurePublished = false
end
else failure
State->>State: consecutiveFailures++
opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
State->>State: failurePublished = true
end
end
end
end
```
Hysteresis prevents a single transient failure from emitting a
`probe_failed` event, and prevents repeated emission while the failure
persists. State is non-persistent: a process restart re-establishes
the counters from scratch; a game's state is pruned when it transitions
out of the running list ([`workers.md` §5–§6](workers.md)).
@@ -0,0 +1,163 @@
# Service-Local Integration Suite
This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.
The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.
## 1. Build tag `integration`
The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:
```sh
make -C rtmanager integration
```
The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.
## 2. Smoke test runs in the default `go test` pass
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and falls back on
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
the runtime context and call `runtime.Close()`; the goroutine
driving `runtime.Run` returns with `context.Canceled` and the
helper waits on it via the `runDone` channel. With a subprocess the
equivalent dance requires SIGTERM, output capture, and graceful
shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
directly through the harness-owned pool and read every Redis stream
through the harness-owned client. Both observe the exact wire shape
Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
default test output stays focused on assertions; flipping
`EnvOptions.LogToStderr` lights up the runtime's structured logs
for local debugging without requiring any subprocess plumbing.
The cross-service inter-process suite at `integration/lobbyrtm/`
re-uses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
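A hedged sketch of the lifecycle described above. The `app.NewRuntime`
return shape, the helper name, and the logger wiring are assumptions;
only the cancel, `runDone`, then `Close()` ordering is the point:
```go
package harness

import (
	"context"
	"errors"
	"log/slog"
	"testing"
)

// runtimeLike is an illustrative stand-in for whatever app.NewRuntime returns.
type runtimeLike interface {
	Run(ctx context.Context) error
	Close() error
}

// startInProcess runs the runtime on a goroutine and registers deterministic
// cleanup: cancel the context, wait for Run to return, then Close.
func startInProcess(t *testing.T, newRuntime func(context.Context, *slog.Logger) (runtimeLike, error)) runtimeLike {
	t.Helper()
	ctx, cancel := context.WithCancel(context.Background())
	rt, err := newRuntime(ctx, slog.New(slog.DiscardHandler))
	if err != nil {
		cancel()
		t.Fatalf("new runtime: %v", err)
	}
	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()
	t.Cleanup(func() {
		cancel()
		if err := <-runDone; err != nil && !errors.Is(err, context.Canceled) {
			t.Errorf("runtime exited: %v", err)
		}
		if err := rt.Close(); err != nil {
			t.Errorf("close runtime: %v", err)
		}
	})
	return rt
}
```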
## 4. `httptest.Server` stub for the Lobby internal client
Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
## 5. One built engine image, two semver-compatible tags
The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies — the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
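A hedged sketch of the tag-aliasing step; the tag strings are the ones
above, while the helper name is illustrative and the build call and its
error handling are elided:
```go
package harness

import (
	"context"

	"github.com/docker/docker/client"
)

// tagEngineImage aliases the freshly built engine image under the second
// semver-compatible tag; both tags point at the same digest.
func tagEngineImage(ctx context.Context, cli *client.Client) error {
	return cli.ImageTag(ctx, "galaxy/game:1.0.0-rtm-it", "galaxy/game:1.0.1-rtm-it")
}
```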
The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
## 6. Per-test Docker network and per-test state root
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.
`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
## 7. PostgreSQL and Redis containers shared per-package
Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~12 s
difference matters in CI.
## 8. Engine image cache is intentionally retained between runs
`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.
## 9. Scenario coverage
The suite covers the four end-to-end flows operators care about:
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
restart → patch → stop → cleanup. The intermediate `stop` between
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
to remove a running container per
[`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`) — duplicate start / stop entries
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`) — external `docker rm` produces
`container_disappeared`; manual `docker run` is adopted by the
reconciler.
- **notification** (`notification_test.go`) — unresolvable `image_ref`
produces `runtime.image_pull_failed` plus a `failure` job_result.
## 10. Service-local scope only
This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.
@@ -0,0 +1,531 @@
# PostgreSQL Schema Decisions
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
no Redis-only predecessor and no migration window. This document
records the schema decisions and the non-obvious agreements behind
them, mirroring the shape of
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
and serving the same role: a single coherent reference for "why does
the persistence layer look this way".
Use this document together with the migration script
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
and the runtime wiring
[`../internal/app/runtime.go`](../internal/app/runtime.go).
## Outcomes
- Schema `rtmanager` (provisioned externally) holds the durable
service state across three tables: `runtime_records`,
`operation_log`, `health_snapshots`. The three tables map onto the
three runtime concerns documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
current state per game, audit trail per operation, and latest
technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
A start whose migrations are already applied exits zero — the
`pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
store, the per-game lease store, the consumer pipelines, and every
publisher (`runtime:job_results`, `runtime:health_events`,
`notification:intents`).
- The Redis adapter package
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
owns one shared `Keyspace` struct with the
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
for stream offsets and the per-game lease.
- Generated jet code under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
is committed; `make -C rtmanager jet` regenerates it via the
testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable.
The schema-per-service rule from
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
applies: each service's role is grant-restricted to its own
schema; RTM never touches Lobby's `lobby` schema or vice versa.
## Decisions
### 1. One schema, externally-provisioned `rtmanagerservice` role
**Decision.** The `rtmanager` schema and the matching
`rtmanagerservice` role are created outside the migration sequence
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
and by the integration harness; in production, by an ops init script
not in scope for any service stage). The embedded migration
`00001_init.sql` only contains DDL for the service-owned tables and
indexes and assumes it runs as the schema owner with
`search_path=rtmanager`.
**Why.** Mixing role creation, schema creation, and table DDL into
one script forces every consumer of the migration to run as a
superuser. The schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
operational split: ops provisions roles and schemas, the service
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
from its runtime role would relax the
"each service's role grants are restricted to its own schema"
defense-in-depth rule.
### 2. `runtime_records.game_id` is the natural primary key
**Decision.** `runtime_records` uses
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
column carries a CHECK constraint enforcing the
`running | stopped | removed` enum.
```sql
CREATE TABLE runtime_records (
game_id text PRIMARY KEY,
status text NOT NULL,
-- ...
CONSTRAINT runtime_records_status_chk
CHECK (status IN ('running', 'stopped', 'removed'))
);
```
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
RTM stores at most one record per game ever. A surrogate
`bigserial` would force every cross-service join to translate
through a lookup table; the natural key keeps RTM's persistence
layer pin-compatible with the streams contract (every
`runtime:start_jobs` envelope already names the `game_id`). The
status CHECK reproduces the Go-level enum from
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
as a defense-in-depth gate at the storage boundary. Decision context:
[`domain-and-ports.md`](domain-and-ports.md).
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
**Decision.** `runtime_records_status_last_op_idx` is a composite
index on `(status, last_op_at)`. The container cleanup worker scans
`status='stopped' AND last_op_at < cutoff`; the
`runtimerecordstore.ListByStatus` adapter method orders rows
`last_op_at DESC, game_id ASC`.
```sql
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
```
**Why.** Both read shapes share the same composite. The cleanup
worker drives the index from one direction (range scan on
`last_op_at` filtered by status); `ListByStatus` drives it from the
other (equality on status, sorted by `last_op_at`). PostgreSQL
satisfies both shapes through one index scan once the planner picks
the index for the WHERE clause. The secondary `game_id ASC` tiebreak
in the adapter ORDER BY is satisfied by primary-key ordering after
the index returns the rows.
A second supporting index for the cleanup worker was considered and
rejected: the workload is so small (single-instance v1, bounded
running game count) that one composite is dominantly cheaper than
two narrow ones.
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
and is written exclusively through INSERT — there is no UPDATE
pathway, no soft-delete column, and no foreign key to
`runtime_records`. The audit index
`operation_log_game_started_idx (game_id, started_at DESC)` drives
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
results `started_at DESC, id DESC` and applies `LIMIT $2`.
```sql
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
```
**Why.** The audit's correctness invariant is "every operation RTM
performed gets exactly one row"; CASCADE deletes from
`runtime_records` would silently lose history when an admin removes
a runtime and would break the
[`../README.md` §Persistence Layout](../README.md) commitment. The
secondary `id DESC` tiebreak inside the adapter is necessary because
the audit log can write multiple rows in the same millisecond when
`reconcile_adopt` and a real operation interleave on a single tick;
without the tiebreak the test that asserts insertion-order-stable
reads becomes flaky. A non-positive `limit` is rejected before the
SQL is issued; an empty result set returns as `nil` (matching the
lobby pattern, so service-layer callers can do `len(entries) == 0`
without an extra allocation).
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
**Decision.** `operation_log` reproduces the three Go-level enums
as CHECK constraints:
```sql
CONSTRAINT operation_log_op_kind_chk
CHECK (op_kind IN (
'start', 'stop', 'restart', 'patch',
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk
CHECK (op_source IN (
'lobby_stream', 'gm_rest', 'admin_rest',
'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk
CHECK (outcome IN ('success', 'failure'))
```
The Go-level enums in
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
remain the source of truth.
**Why.** A defence-in-depth gate at the storage boundary catches any
adapter regression that would otherwise persist an unexpected
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
benefit from the enum being verifiable directly in psql without
consulting the Go source. Adding a new value requires editing two
places (the Go enum and the migration), which is the right friction
level: every new value is a wire-protocol change and deserves an
explicit migration. The alternative of using PostgreSQL's `CREATE
TYPE … AS ENUM` was rejected because adding a value to a PG enum
type requires `ALTER TYPE` outside a transaction and complicates the
single-init pre-launch policy (decision §12).
### 6. `health_snapshots` is one row per game; status enum collapses event types
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
and stores the latest technical health observation per game. The
`status` column enumerates the **observed engine state**, not the
**triggering event type**:
```sql
CONSTRAINT health_snapshots_status_chk
CHECK (status IN (
'healthy', 'probe_failed', 'exited',
'oom', 'inspect_unhealthy', 'container_disappeared'
))
```
The `runtime:health_events` `event_type` enum has seven values
(`container_started`, `container_exited`, `container_oom`,
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
`probe_recovered`). The snapshot status has six — the two probe
events fold into `healthy` (after `probe_recovered`) and
`probe_failed`, and `container_started` collapses into `healthy`.
**Why.** Health snapshots answer "what state is the engine in
**right now**", not "what event was just emitted". A consumer who
wants the event firehose reads `runtime:health_events`; a consumer
who wants the latest verdict reads `health_snapshots`. The two
surfaces have different lifetimes (stream entries are bounded only
by Redis trim; snapshot rows are overwritten on every new
observation), so collapsing the seven event types into six status
states aligns the column with the consumer's mental model. The
adapter that implements this collapse lives in
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
every emission to the stream also upserts the snapshot.
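The collapse is small enough to show as a sketch (the status strings
follow the CHECK constraint above; the function name is illustrative,
not the adapter's):
```go
// snapshotStatus maps the seven health-event types onto the six
// snapshot statuses described above.
func snapshotStatus(eventType string) string {
	switch eventType {
	case "container_started", "probe_recovered":
		return "healthy"
	case "container_exited":
		return "exited"
	case "container_oom":
		return "oom"
	default:
		// probe_failed, inspect_unhealthy, container_disappeared map 1:1.
		return eventType
	}
}
```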
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
guard into a single `WHERE … AND …` clause. Status must equal the
caller's `ExpectedFrom`; when the caller supplies a non-empty
`ExpectedContainerID`, `current_container_id` must equal it as
well:
```sql
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
AND status = $4
[AND current_container_id = $5]
```
A `RowsAffected() == 0` result is ambiguous — the row may be absent
or the predicate may have failed. The adapter resolves the ambiguity
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
The probe runs only on the slow path; happy-path UPDATEs cost a
single round trip.
**Why.** The two-axis CAS is what services need: a stop driven by an
old container_id (from a stale REST request) must not clobber a
fresh `running` record installed by a concurrent restart. Status-only
CAS would collapse those two cases. The optional shape on
`ExpectedContainerID` lets reconciliation flows that legitimately
target "this game in `running` state without caring which container"
omit the second predicate. The follow-up probe matches the
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
and produces clean per-error sentinels at the service layer.
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
eight goroutines racing the same transition: exactly one returns
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
because PostgreSQL serialises row-level UPDATEs through the row's
MVCC tuple.
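A sketch of the slow-path probe (plain `database/sql`; the local
sentinels stand in for `runtime.ErrNotFound` / `runtime.ErrConflict`):
```go
package example

import (
	"context"
	"database/sql"
	"errors"
)

var (
	errNotFound = errors.New("runtime record not found") // stands in for runtime.ErrNotFound
	errConflict = errors.New("runtime record conflict")  // stands in for runtime.ErrConflict
)

// resolveCASMiss runs only when the CAS UPDATE reported zero affected rows.
func resolveCASMiss(ctx context.Context, db *sql.DB, gameID string) error {
	var current string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		gameID).Scan(&current)
	if errors.Is(err, sql.ErrNoRows) {
		return errNotFound // row absent
	}
	if err != nil {
		return err
	}
	return errConflict // row present, CAS predicate failed
}
```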
### 8. Destination-driven `SET` clause on `UpdateStatus`
**Decision.** `UpdateStatus` updates a different column subset
depending on the destination status:
| Destination | Columns set |
| --- | --- |
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |
The implementation switches on `input.To` and writes the UPDATE
chain inline per branch — three short branches read better than one
parametric helper.
**Why.** Each destination has a different invariant. `stopped`
records the wall-clock at which the engine ceased serving; `removed`
nulls the container_id because the row no longer points at any
Docker resource; `running` only updates the status and the
last-op timestamp because the running invariants
(`current_container_id`, fresh `started_at`, `current_image_ref`,
`engine_endpoint`) are installed through `Upsert` on the `start`
path.
A previous draft built the SET list via `[]pg.Column` / `[]any`
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
variadic refuses a `[]postgres.Column` slice spread because the
element type does not match `jet.Column` after the type-alias
resolution. The final code switches inline per branch.
The `running` destination is implemented even though the start
service uses `Upsert` for the inner start of restart and patch.
Keeping the `running` path live preserves a one-to-one match between
`runtime.AllowedTransitions()` and the adapter's capability matrix —
otherwise a future caller exercising the `stopped → running`
transition through `UpdateStatus` would hit a runtime error inside
the adapter rather than a domain rejection. The path only updates
`status` and `last_op_at`; callers responsible for the running
invariants install them through `Upsert` first.
### 9. `created_at` preservation on `Upsert`
**Decision.** `runtimerecordstore.Upsert` is implemented as
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
column from EXCLUDED>`. `created_at` is deliberately omitted from
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
value never overwrites the stored timestamp.
```sql
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status = EXCLUDED.status,
current_container_id = EXCLUDED.current_container_id,
current_image_ref = EXCLUDED.current_image_ref,
engine_endpoint = EXCLUDED.engine_endpoint,
state_path = EXCLUDED.state_path,
docker_network = EXCLUDED.docker_network,
started_at = EXCLUDED.started_at,
stopped_at = EXCLUDED.stopped_at,
removed_at = EXCLUDED.removed_at,
last_op_at = EXCLUDED.last_op_at
-- created_at intentionally NOT updated
```
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
invariant.
**Why.** `runtime_records.created_at` records "first time RTM saw
the game". Every restart and every reconcile_adopt re-Upserts the
row with the current wall-clock as `CreatedAt` from the adapter
boundary; without the omission rule the timestamp would drift
forward. Preserving the original creation time keeps a stable
horizon for retention reasoning and matches
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
same approach for the `games.created_at` column.
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
'{}'::jsonb`. The jet-generated model declares
`Details string` (jet maps `jsonb` to `string`). The adapter:
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
`snapshot.Details` is empty, so the column never holds a non-JSON
empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
`json.RawMessage` so the caller receives verbatim bytes without
an extra round of parsing.
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
`TestUpsertAndGetRoundTrip` cover the two cases.
**Why.** The detail payload is type-specific (the keys differ
between `probe_failed` and `inspect_unhealthy`) and is opaque to
queries — the column is never element-filtered. JSONB matches the
"everything outside primary fields is JSON" pattern that the
Notification Service already established and allows a future
GIN index (e.g. for an admin search-by-key feature) without a
schema rewrite. Substituting the SQL DEFAULT for an empty
parameter avoids the trap where the database accepts `''` for
`text` but rejects it for `jsonb`.
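A sketch of the two boundary translations, assuming the domain
snapshot carries `Details json.RawMessage` (helper names are
illustrative):
```go
package example

import "encoding/json"

// detailsParam substitutes the SQL DEFAULT when the payload is empty,
// so the jsonb column never receives a non-JSON empty string.
func detailsParam(details json.RawMessage) []byte {
	if len(details) == 0 {
		return []byte(`{}`)
	}
	return details
}

// scanDetails hands the caller the stored bytes verbatim, without an
// extra parse round-trip.
func scanDetails(raw []byte) json.RawMessage {
	return json.RawMessage(raw)
}
```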
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
**Decision.** Every time-valued column on every RTM table uses
PostgreSQL's `timestamptz`. The domain model continues to use
`time.Time`; the adapter normalises every `time.Time` parameter to
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
helper that wraps a possibly-zero `time.Time`), and re-wraps every
scanned `time.Time` with `.UTC()` (directly or via
`timeFromNullable` for nullable columns) before the value leaves
the adapter.
The architecture-wide form of this rule lives in
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model
needs a `time.Time` API the service layer can compare and do
arithmetic on.
Without explicit `.UTC()` on the bind site, the pgx driver returns
scanned values in `time.Local`, which silently breaks equality
tests, JSON formatting, and comparison against pointer fields
elsewhere in the codebase. The defensive `.UTC()` rule on both
sides eliminates the class of bug where a timezone difference
between the adapter and the test harness flips assertions
intermittently.
The same shape is used in User Service, Mail Service, and
Notification Service — RTM matches the existing convention rather
than introducing a fourth encoding path.
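Illustrative helpers of the same shape (the adapter's real
`nullableTime` / `timeFromNullable` may differ in signature):
```go
package example

import (
	"database/sql"
	"time"
)

// nullableTime binds a possibly-zero time.Time as NULL, normalised to UTC.
func nullableTime(t time.Time) sql.NullTime {
	if t.IsZero() {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true}
}

// timeFromNullable re-wraps a scanned nullable column in UTC before
// the value leaves the adapter.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC()
}
```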
### 12. Single-init pre-launch policy
**Decision.** `00001_init.sql` evolves in place until first
production deploy. Adding a column, an index, or a new table during
the pre-launch development window edits this file directly rather
than producing `00002_*.sql`. The runtime applies the migration on
every boot; if the schema is already at head, `pkg/postgres`'s
goose adapter exits zero.
**Why.** The schema-per-service architectural rule
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
endorses a single-init policy for pre-launch services. The
pre-launch window allows non-additive changes (column rename, type
narrowing, CHECK tightening) that a multi-step migration sequence
would force into awkward two-step rewrites. Once the service ships
to production, the next schema change becomes `00002_*.sql` and
the policy lifts; from that point onward edits to `00001_init.sql`
are rejected by code review.
This applies to RTM exactly the same way it applies to every other
PG-backed service in the workspace; the README explicitly carries
the reminder. The exit-zero behaviour for already-applied
migrations is what makes the policy operationally cheap: a
freshly-spawned replica re-applies the same `00001_init.sql` with
no work to do, no logged error, and proceeds to open its
listeners.
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
**Decision.** All three RTM PG-store packages
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
build SQL through the jet builder API
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/COALESCE/...` DSL).
Generated table models live under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
and are regenerated by `make -C rtmanager jet`. The target invokes
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
transient PostgreSQL container via testcontainers, provisions the
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
against the provisioned schema. Generated code is committed to the
repo, so build consumers do not need Docker.
Statements are run through the `database/sql` API
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
helpers preserve the codecs.go boundary translations and
domain-type mapping (status enum decoding, `time.Time` UTC
normalisation, JSONB `[]byte` → `json.RawMessage`).
PostgreSQL constructs that the jet builder does not cover natively
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
through the per-DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct
`[]byte`/string params for JSONB columns).
**Why.** Aligns with the workspace-wide convention from
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
service. Hand-rolled SQL would multiply boundary-translation paths
and require per-store query-builder helpers for what jet already
covers. Committing generated code keeps `go build ./...` working
without Docker.
### 14. `redisstate` keyspace ownership and per-store subpackages
**Decision.** The
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
package owns one shared `Keyspace` struct with a
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
lives in its own subpackage:
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
for the stream offset store consumed by the start-jobs and
stop-jobs consumers;
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
for the per-game lease store consumed by every lifecycle service
and the reconciler.
Both subpackages take a `redisstate.Keyspace{}` value and use it to
build their key shapes (`rtmanager:stream_offsets:{label}`,
`rtmanager:game_lease:{game_id}`).
**Why.** Keeping the parent package as the single owner of the prefix
and the key-shape builder mirrors the way Lobby's `redisstate`
namespace centralises every key shape and supports multiple Redis-
backed adapters (stream offsets, the per-game lease) without a
restructure as the surface grows.
The per-store subpackage choice (rather than Lobby's flat
single-package shape) is driven by three considerations:
- It keeps the docker mock generator scoped to one package, since
`mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a
dev-only dep, and keeping the `streamoffsets` package
self-contained leaves room for `gamelease` to depend only on the
production `redis` client.
- Each subpackage carries its own tests, which keeps the test
surface focused on one Redis primitive rather than mixing offset
semantics with lease semantics in shared fixtures.
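A hypothetical sketch of the shared key-shape builder (field and
method names are illustrative; the lease store may additionally
encode the id, see [`services.md` §1](services.md)):
```go
package redisstate

// Keyspace is the single owner of the rtmanager: prefix (sketch only).
type Keyspace struct {
	Prefix string // empty means the default "rtmanager:"
}

func (k Keyspace) prefix() string {
	if k.Prefix == "" {
		return "rtmanager:"
	}
	return k.Prefix
}

// StreamOffsetKey builds rtmanager:stream_offsets:{label}.
func (k Keyspace) StreamOffsetKey(label string) string {
	return k.prefix() + "stream_offsets:" + label
}

// GameLeaseKey builds rtmanager:game_lease:{game_id}.
func (k Keyspace) GameLeaseKey(gameID string) string {
	return k.prefix() + "game_lease:" + gameID
}
```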
## Cross-References
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— the embedded schema migration.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
— the `//go:embed *.sql` directive and `FS()` exporter consumed by
the runtime.
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
— the three jet-backed PG adapters and their testcontainers-driven
unit suites.
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
— committed generated jet models.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
[`../Makefile`](../Makefile) `jet` target — the regeneration
pipeline.
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
— Redis adapter package layout.
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
— runtime wiring: PG pool open + migration apply + Redis client
open + adapter assembly.
- [`../internal/config/`](../internal/config) — the config groups
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
`Coordination`).
- Companion design rationales:
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
domain shape, [`adapters.md`](adapters.md) for the redisstate
publishers and clients.
+368
View File
@@ -0,0 +1,368 @@
# Operator Runbook
This runbook covers the checks that matter most during startup,
steady-state readiness, shutdown, and the handful of recovery paths
specific to Runtime Manager.
## Startup Checks
Before starting the process, confirm:
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
reaches a Docker daemon the operator controls. RTM is the only
Galaxy service permitted to interact with the Docker socket;
scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
user-defined bridge network that has already been created (e.g.
via `docker network create galaxy-net` in the environment's
bootstrap script). RTM **validates** the network at startup but
never creates it. A missing network is fail-fast and the process
exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
can read and write. Per-game subdirectories are created with
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
uid/gid to match the engine container's user when running with a
non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
that hosts the `rtmanager` schema. The DSN must include
`search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
for production). Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. The `rtmanager` schema and the
matching `rtmanagerservice` role are provisioned externally
([`postgres-migration.md` §1](postgres-migration.md)).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
the Redis deployment used for the runtime-coordination state:
stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
publishers (`runtime:job_results`, `runtime:health_events`,
`notification:intents`), persisted offsets, and the per-game
lease. RTM does not maintain durable business state on Redis.
- Stream names match the producers and consumers RTM integrates with:
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
HTTP listener. RTM's start service issues a diagnostic
`GET /api/v1/internal/games/{game_id}` per start; failure is logged
at debug and does not abort the start
([`services.md` §7](services.md)).
The startup sequence runs in the order recorded in
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):
1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs **once synchronously** and blocks until done.
7. Background workers start.
A failure at any step is fatal. The synchronous reconciler pass is
the reason orphaned containers from a prior process never reach the
periodic workers in an inconsistent state
([`workers.md` §17](workers.md)).
Expected log lines on a healthy boot:
- `migrations applied`,
- `postgres ping ok`,
- `redis ping ok`,
- `docker ping ok` and `docker network found`,
- `telemetry exporter started`,
- `internal http listening`,
- `reconciler initial pass completed`,
- one `worker started` entry per background worker (seven expected).
## Readiness
Use the probes according to what they actually verify:
- `GET /healthz` confirms the listener is alive — no dependency
check.
- `GET /readyz` live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` when every check passes; otherwise
returns `503` with the canonical
`{"error":{"code":"service_unavailable","message":"…"}}` envelope
identifying the first failing dependency.
`/readyz` is the strongest readiness signal RTM exposes; unlike
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
Each request hits the daemon and the database fresh.
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz`;
3. verify `rtmanager.runtime_records_by_status{status="running"}`
gauge tracks the expected live game count after the first start
completes;
4. verify `rtmanager.docker_op_latency` histograms have at least one
sample after the first lifecycle operation.
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behaviour:
- the per-component shutdown budget is controlled by
`RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest
offset before returning; the offset survives the restart
([`workers.md` §9](workers.md));
- the Docker events listener cancels its subscription;
- the in-flight services release their per-game lease through the
surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at
the next lease re-acquisition.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any consumer that was mid-cycle to retry from the persisted
offset on the next process start;
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
## Engine Container Died
A running engine container that exits unexpectedly surfaces through
three observation channels:
- The Docker events listener emits `container_exited` (non-zero exit
code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the
threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the
engine restarts under Docker's healthcheck or if Docker reports an
unexpected status.
Triage:
1. Inspect the `runtime:health_events` stream for the affected
`game_id` and `event_type`:
```bash
redis-cli XRANGE runtime:health_events - + COUNT 200 \
| grep -A4 'game_id\s*<game_id>'
```
2. Read the runtime record and the operation log:
```bash
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code, started_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC LIMIT 20"
```
3. If Lobby has not reacted (the game's status remains `running` in
`lobby.games`), check `runtime:job_results` lag and Lobby's
`runtimejobresult` worker. RTM publishes the result; Lobby is the
consumer.
4. If the container is already gone (`docker ps -a` shows no row for
`galaxy-game-<game_id>`), the reconciler will move the record to
`removed` on its next pass. Triggering the periodic reconcile
manually by sending `SIGHUP` is **not** supported — wait
`RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the
process; the synchronous boot pass will handle the drift.
5. The `notification:intents` stream is **not** the place to look
for ongoing health changes. Only the three first-touch start
failures (`runtime.image_pull_failed`,
`runtime.container_start_failed`,
`runtime.start_config_invalid`) produce a notification intent;
probe failures, OOMs, and exits flow through health events only
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
## Patch Upgrade
A patch upgrade replaces the container with a new `image_ref` while
preserving the bind-mounted state directory.
Pre-conditions:
- The new and current `image_ref` tags both parse as semver. RTM
rejects non-semver tags with `image_ref_not_semver`.
- The new and current major / minor versions match. A cross-major or
cross-minor patch returns `semver_patch_only`.
Driving the upgrade:
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Behaviour:
- The container is stopped, removed, and recreated. The
`current_container_id` changes; the `engine_endpoint`
(`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any
data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and
the old / new image refs.
- A `runtime:health_events container_started` is emitted by the
inner start ([`workers.md` §1](workers.md)).
Post-patch verification:
```bash
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```
The `current_image_ref` field on the runtime record reflects the new
tag.
## Manual Cleanup
The cleanup endpoint removes the container and updates the record to
`removed`. It refuses to remove a `running` container — stop first.
```bash
# Stop, then clean up
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
-d '{"reason":"admin_request"}'
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
is **never** deleted by RTM. Removing the directory is operator
domain (backup tooling, future Admin Service workflow). The
operation_log records `op_kind=cleanup_container` with
`op_source=admin_rest`.
## Reconcile Drift After Docker Daemon Restart
A Docker daemon restart drops every running engine container; PG
records remain. On RTM's next boot (or its next periodic reconcile):
1. The reconciler observes `running` records whose containers are
missing from `docker ps`. It updates each record to `removed`,
appends `operation_log` with `op_kind=reconcile_dispose`, and
publishes `runtime:health_events container_disappeared`
([`workers.md` §14–§15](workers.md)).
2. Lobby's `runtimejobresult` worker does not consume the dispose
event in v1, so the cascade does not auto-restart the engine.
Operators trigger restarts through Lobby's user-facing flow or
directly via the GM/Admin REST `restart` endpoint.
3. If the operator brings up an engine container manually for
diagnostics (`docker run` with the
`com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
the reconciler **adopts** it on the next pass: a new
`runtime_records` row appears with `op_kind=reconcile_adopt`.
The reconciler **never stops or removes** an unrecorded
container — operators stay in control of manual containers
([`../README.md` §Reconciliation](../README.md#reconciliation)).
Three drift kinds run through the same lease-guarded write pass:
`adopt`, `dispose`, and the README-level path
`observed_exited` (a record marked `running` whose container exists
but is in `exited`). Telemetry counter
`rtmanager.reconcile_drift{kind}` exposes the three independently
([`workers.md` §15](workers.md)).
## Testing Locally
```sh
# One-time bootstrap
docker network create galaxy-net
# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
go run ./rtmanager/cmd/rtmanager
```
After start:
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
once PG, Redis, and Docker pings pass and the configured network
exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
brings up `galaxy-game-<game_id>` containers; RTM logs each
lifecycle transition.
The integration suite under `rtmanager/integration/` exercises the
end-to-end flows against the real Docker daemon. The default
`go test ./...` skips it via the `integration` build tag; run
explicitly with:
```sh
make -C rtmanager integration
```
The suite requires a reachable Docker daemon. Without one, the
harness helpers call `t.Skip` and the package becomes a no-op
([`integration-tests.md` §1](integration-tests.md)).
## Diagnostic Queries
Durable runtime state lives in PostgreSQL; runtime-coordination state
stays in Redis. CLI snippets that help during incidents:
```bash
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"
# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC, id DESC
LIMIT 20"
# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"
# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
--format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'
# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100
# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```
The gauges and counters surfaced through OpenTelemetry are the
primary observability surface; raw PostgreSQL and Redis access is
for last-resort triage.
+309
View File
@@ -0,0 +1,309 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.
```mermaid
flowchart LR
subgraph Clients
GM["Game Master"]
Admin["Admin Service"]
Lobby["Game Lobby"]
end
subgraph RTM["Runtime Manager process"]
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
StartJobs["startjobsconsumer"]
StopJobs["stopjobsconsumer"]
DockerEvents["dockerevents listener"]
HealthProbe["healthprobe worker"]
DockerInspect["dockerinspect worker"]
Reconcile["reconcile worker"]
Cleanup["containercleanup worker"]
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
IntentPublisher["notification:intents publisher"]
ResultsPublisher["runtime:job_results publisher"]
HealthPublisher["runtime:health_events publisher"]
Telemetry["Logs, traces, metrics"]
end
Docker["Docker Daemon"]
Engine["galaxy-game-{game_id} container"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases + offsets"]
LobbyHTTP["Lobby internal HTTP"]
Lobby -. runtime:start_jobs .-> StartJobs
Lobby -. runtime:stop_jobs .-> StopJobs
GM --> InternalHTTP
Admin --> InternalHTTP
StartJobs --> Services
StopJobs --> Services
InternalHTTP --> Services
Services --> Docker
Services --> Postgres
Services --> Redis
Services --> ResultsPublisher
Services --> HealthPublisher
Services --> IntentPublisher
Services -. GET diagnostic .-> LobbyHTTP
DockerEvents --> Docker
DockerInspect --> Docker
HealthProbe --> Engine
Reconcile --> Docker
Reconcile --> Postgres
Cleanup --> Postgres
Cleanup --> Services
DockerEvents --> HealthPublisher
DockerInspect --> HealthPublisher
HealthProbe --> HealthPublisher
HealthPublisher --> Redis
ResultsPublisher --> Redis
IntentPublisher --> Redis
StartJobs --> Redis
StopJobs --> Redis
InternalHTTP --> Postgres
Docker -->|create / start / stop / rm| Engine
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
InternalHTTP --> Telemetry
Services --> Telemetry
StartJobs --> Telemetry
StopJobs --> Telemetry
DockerEvents --> Telemetry
HealthProbe --> Telemetry
DockerInspect --> Telemetry
Reconcile --> Telemetry
Cleanup --> Telemetry
```
Notes:
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
goose migrations fail, when Redis ping fails, when the Docker daemon
ping fails, or when the configured Docker network is missing. Lobby
reachability is **not** verified at boot — the start service's
diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
outside of debug logging
([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup before
`app.App.Run` registers any other component, then re-runs
periodically as a regular `Component`. The synchronous pass is the
reason why orphaned containers from a prior process can never be
observed by the events listener with no PG record
([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
and Admin Service. There is no public listener — RTM does not face
end users.
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`
The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source`
([`services.md` §18](services.md)).
Probe routes:
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` only when every check passes; otherwise
returns `503` with the canonical error envelope.
## Background Workers
Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
The events listener and the inspect worker do **not** emit
`container_started` — that event is owned by the start service
([`workers.md` §1](workers.md)). The events listener and the inspect
worker also do not emit `container_disappeared` autonomously when a
record is missing or stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).
## Lifecycle Services
The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.
| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |
All services share three behaviours captured in
[`services.md`](services.md):
- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
not by the caller — which keeps consumer and REST callers symmetric
([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
`ContainerID`, `EngineEndpoint`) is what consumers and REST
handlers translate into job_results / HTTP responses
([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
and three of the failure codes (`start_config_invalid`,
`image_pull_failed`, `container_start_failed`) also publish a
`runtime.*` admin notification intent
([`services.md` §4](services.md)).
## Synchronous Upstream Client
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |
Lobby's outbound transport is the only synchronous client RTM holds.
Every other interaction (Notification Service, Game Master, Admin
Service) crosses an asynchronous boundary or is initiated by the peer.
## Stream Offsets
Each consumer persists its position under a fixed label so process
restart preserves stream progress.
| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).
The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume them
itself.
## Configuration Groups
The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:
- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
`RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
`RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
`RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
`RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
`RTMANAGER_CONTAINER_RETENTION_DAYS`,
`RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
`RTMANAGER_ENGINE_STATE_ENV_NAME`,
`RTMANAGER_GAME_STATE_DIR_MODE`,
`RTMANAGER_GAME_STATE_OWNER_UID`,
`RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
`RTMANAGER_POSTGRES_REPLICA_DSNS`,
`RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
`RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
`RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
`RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
`RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
`RTMANAGER_REDIS_STOP_JOBS_STREAM`,
`RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
`RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
`RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
`RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
`RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
`RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
`RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
`RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
`RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.
## Runtime Notes
- **Single-instance v1.** Multi-instance Runtime Manager with Redis
Streams consumer groups is explicitly out of scope for the current
iteration. The per-game lease serialises operations on one game
across the consumer + REST entry points; cross-instance
coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
`SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
(default `60s`). The lease is **not renewed mid-operation** in v1;
long pulls of multi-GB images can therefore expire the lease
before the operation finishes — the trade-off is documented in
[`services.md` §1](services.md). The reconciler honours the same
lease around every drift mutation
([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
reconcile mutation appends one row to `rtmanager.operation_log`.
The `runtime:health_events` stream and the `notification:intents`
emissions are best-effort — a publish failure logs at `Error` and
proceeds, never rolling back the recorded operation
([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
per-game `consecutiveFailures` and `failurePublished` counters in a
mutex-guarded map. State is non-persistent: a process restart that
loses the counters re-establishes hysteresis from scratch, and
state for a game that transitions through `stopped → running` is
pruned at the start of every probe tick
([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
`if_missing` (default), `always`, and `never`. Image labels
(`com.galaxy.cpu_quota`, `com.galaxy.memory`,
`com.galaxy.pids_limit`) drive resource limits when present; the
matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
mode and uid/gid, but **never deletes them**. Removing the directory
is operator domain (backup tooling, a future Admin Service
workflow). A cleanup that removes the container leaves the
directory intact.
+443
View File
@@ -0,0 +1,443 @@
# Lifecycle Services
This document explains the design of the five lifecycle services
(`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) under [`../internal/service/`](../internal/service)
plus the per-handler REST glue under
[`../internal/api/internalhttp/`](../internal/api/internalhttp).
The current-state behaviour (lifecycle steps, failure tables, the
per-game lease semantics, the wire contracts) lives in
[`../README.md`](../README.md), the OpenAPI spec at
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml), and the
AsyncAPI spec at
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml).
This file records the *why*.
## 1. Per-game lease lives at the service layer
Every lifecycle service acquires `rtmanager:game_lease:{game_id}` via
[`ports.GameLeaseStore`](../internal/ports/gamelease.go) before doing
any work, and releases it on the way out:
- the lease primitive serialises operations on a single game across
every entry point (stream consumers and REST handlers);
- holding the lease at the service layer keeps the consumer / REST
callers symmetric — neither acquires the lease itself, both call
the service the same way;
- the Redis-backed adapter
([`../internal/adapters/redisstate/gamelease/store.go`](../internal/adapters/redisstate/gamelease/store.go))
uses `SET NX PX` on acquire, Lua compare-and-delete on release; a
release whose caller-supplied token no longer matches is a silent
no-op.
The lease key shape is `rtmanager:game_lease:{base64url(game_id)}` so
opaque game ids may contain any characters without leaking through
the key syntax.
The lease TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60s`)
and is **not renewed mid-operation** in v1. A multi-GB image pull can
theoretically expire the lease before the start service finishes;
operators see this as a `reconcile_adopt` event later because the
container is created with the standard owner labels. A renewal helper
is deliberately deferred until a workload makes it necessary.
The reconciler ([`workers.md`](workers.md) §4) honours the same lease
around every drift mutation, which closes the
restart-vs-`reconcile_dispose` race documented in §6 below.
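A minimal sketch of the acquire / release shape, assuming go-redis v9
(the production store in `redisstate/gamelease` may differ in
structure, and the id encoding is elided here):
```go
package gamelease

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Compare-and-delete: only the holder of the matching token releases.
const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

// acquire returns false when another operation already holds the lease.
func acquire(ctx context.Context, rdb *redis.Client, key, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, token, ttl).Result()
}

// release is a silent no-op when the stored token no longer matches.
func release(ctx context.Context, rdb *redis.Client, key, token string) error {
	return rdb.Eval(ctx, releaseScript, []string{key}, token).Err()
}
```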
## 2. Health-events publisher lands with the start service
The start service publishes `container_started` after `docker run`
returns; the events listener intentionally does **not** duplicate
the event ([`workers.md`](workers.md) §1). Centralising the publisher
on the start service avoids a "who emits what" ambiguity and lets the
publisher be a thin port wrapper rather than a worker-specific
helper.
The publisher port lives next to the snapshot-upsert rule
([`adapters.md`](adapters.md) §8): one Publish call updates both
surfaces.
## 3. `Result`-shaped contract
`Service.Handle` returns `(Result, error)`. The Go-level `error` is
reserved for system-level / programmer faults (nil context, nil
service). All business outcomes flow through `Result`:
- `Outcome=success`, `ErrorCode=""` — fresh start succeeded;
- `Outcome=success`, `ErrorCode="replay_no_op"` — idempotent replay;
- `Outcome=failure`, `ErrorCode` set — business failure
(`start_config_invalid` / `image_pull_failed` /
`container_start_failed` / `conflict` / `service_unavailable` /
`internal_error`).
The stream consumer uses `Outcome` and `ErrorCode` to populate
`runtime:job_results` directly; the REST handler maps `Outcome=failure`
plus `ErrorCode` to the matching HTTP status. Both callers are simpler
with this contract than with an `errors.Is`-driven sentinel taxonomy.
`ports.JobResult` and the two `JobOutcome*` string constants live in
the ports package next to `JobResultPublisher` so the wire shape is
defined exactly once. The constants are intentionally not aliases of
`operation.Outcome` — the audit-log enum is allowed to grow without
breaking the wire format.
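An illustrative caller-side view of the contract (the `Record` field
is elided and the branching is a sketch, not the handler code):
```go
package example

// Result mirrors the canonical shape named above.
type Result struct {
	Outcome        string
	ErrorCode      string
	ContainerID    string
	EngineEndpoint string
}

// handleOutcome sketches how both callers branch: a Go error is a
// system-level fault, everything business-shaped rides on Result.
func handleOutcome(r Result, err error) string {
	if err != nil {
		return "internal fault: " + err.Error()
	}
	switch {
	case r.Outcome == "success" && r.ErrorCode == "":
		return "fresh operation succeeded"
	case r.Outcome == "success" && r.ErrorCode == "replay_no_op":
		return "idempotent replay"
	default:
		return "business failure: " + r.ErrorCode
	}
}
```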
## 4. Start service failure-mode mapping
| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | `start_config_invalid` | `runtime.start_config_invalid` |
| Lease busy | `conflict` | — |
| Existing record running with a different image_ref | `conflict` | — |
| Get returns a non-NotFound transport error | `internal_error` | — |
| `image_ref` shape rejected by `distribution/reference` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns `ErrNetworkMissing` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns any other error | `service_unavailable` | — |
| `PullImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `InspectImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `prepareStateDir` failure | `start_config_invalid` | `runtime.start_config_invalid` |
| `Run` failure | `container_start_failed` | `runtime.container_start_failed` |
| `Upsert` failure after successful Run | `container_start_failed` | `runtime.container_start_failed` |
Three error codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, and `internal_error` are operational classes
(another caller is in flight, a dependency is down, an unclassified
fault) where the corrective action is not a configuration change. The
operator already sees them through telemetry and structured logs; an
email per occurrence would be noise.
## 5. Upsert-after-Run rollback
A `Run` that succeeded but whose `Upsert` failed leaves a running
container with no PG record. The service issues a best-effort
`docker.Remove(containerID)` in a fresh `context.Background()` (the
request context may already be cancelled) before recording the failure.
A Remove failure is logged but not propagated; the reconciler adopts
surviving orphans on its periodic pass.
The Docker adapter already removes the container when `Run` itself
returns an error after a successful `ContainerCreate` ([`adapters.md`](adapters.md) §3).
The service-layer rollback covers the additional post-`Run` Upsert
failure path.
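A sketch of the rollback path, assuming a narrowed Docker port and
`log/slog`; the 30-second timeout is an assumption, not the service's
configured value:
```go
package example

import (
	"context"
	"log/slog"
	"time"
)

// containerRemover is a narrowed stand-in for the Docker port's Remove method.
type containerRemover interface {
	Remove(ctx context.Context, containerID string) error
}

// rollbackOrphan removes a container whose Upsert failed, best-effort,
// on a fresh context because the request context may be cancelled.
func rollbackOrphan(docker containerRemover, containerID string, log *slog.Logger) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := docker.Remove(ctx, containerID); err != nil {
		// Logged, never propagated: the reconciler adopts survivors later.
		log.Error("rollback remove failed", "container_id", containerID, "error", err)
	}
}
```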
## 6. Pre-existing record handling
Only `status=running` + same `image_ref` is a `replay_no_op`.
`running` + a different `image_ref` returns `failure / conflict` (use
`patch` to change the image of a running container).
Anything else (`stopped`, `removed`, missing record) proceeds with a
fresh start that ends in `Upsert`. `Upsert` overwrites verbatim and is
not bound by the transitions table, so installing a `running` record
over a `removed` row is permitted — the `removed` terminus rule lives
in `runtime.AllowedTransitions` (which guards `UpdateStatus`), not in
`Upsert`.
`created_at` is preserved across re-starts: the start service reuses
`existing.CreatedAt` when the record was found, so the
"first time RTM saw the game" semantics from
[`postgres-migration.md`](postgres-migration.md) §9 hold even when the
start path goes through `Upsert` rather than through the runtime
adapter's `INSERT ... ON CONFLICT DO UPDATE` EXCLUDED list.
A residual `galaxy-game-{game_id}` container left over from a previous
start that was stopped but never cleaned up will fail at `docker run`
with a name conflict. The service surfaces that as
`container_start_failed`; cleanup plus the reconciler is the standard
remedy. A pre-emptive Remove inside the start service was rejected
because it would silently undo manual operator inspection on stopped
containers.
## 7. `LobbyInternalClient.GetGame` is best-effort
The fetch happens after the lease is acquired and before the Docker
work, with the configured `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
`ErrLobbyUnavailable` and `ErrLobbyGameNotFound` are logged at
`debug`; the start operation continues either way. The fetched
`Status` and `TargetEngineVersion` enrich logs only — the start
envelope already carries the only required field (`image_ref`), and
the port docstring fixes the recoverable-failure contract.
## 8. `image_ref` validation
Validation uses `github.com/distribution/reference.ParseNormalizedNamed`
before any Docker round-trip. Rejected shapes surface as
`start_config_invalid` plus a `runtime.start_config_invalid` intent.
Daemon-side rejections after a valid parse (manifest unknown,
authentication required) surface as `image_pull_failed` plus a
`runtime.image_pull_failed` intent. The split keeps operator-actionable
configuration mistakes distinct from registry-side failures.
## 9. State-directory preparer is overrideable
`Dependencies.PrepareStateDir` is a `func(gameID string) (string, error)`
injection point that defaults to `os.MkdirAll` + `os.Chmod` +
`os.Chown` against `RTMANAGER_GAME_STATE_ROOT`. Tests override it to
point at a `t.TempDir()`-style fake without exercising the real
filesystem permissions (which require either matching uid/gid or
root). This is a deliberate non-port abstraction: the start service
does no other filesystem work and the cost of a new port for one
helper is not worth the indirection.
## 10. Container env: both `GAME_STATE_PATH` and `STORAGE_PATH`
Both names are accepted by the v1 engine. The start service always
sets both; the configured `RTMANAGER_ENGINE_STATE_ENV_NAME` controls
the primary. When the operator overrides the primary to `STORAGE_PATH`,
the deduplicating map collapses the two entries into one.
## 11. Wiring layer construction
`internal/app/wiring.go` is the single point that builds every
production store, adapter, and service from `config.Config`. The
struct exposes typed fields so handlers and workers can grab the
singletons without re-wiring; an `addCloser` slice releases adapter
resources (currently the Lobby HTTP client's idle-connection pool) at
runtime shutdown. The `runtimeRecordsProbe` adapter installed during
construction registers the `rtmanager.runtime_records_by_status`
gauge documented in [`../README.md` §Observability](../README.md).
The persistence-only `CountByStatus` method on the `runtimerecordstore`
adapter is **not** part of `ports.RuntimeRecordStore` because it is
only used by the gauge probe; widening the port for one caller would
force every adapter and test fake to grow with no benefit. The adapter
exposes it directly and the wiring composes a concrete-typed wrapper.
## 12. Shared lease across composed operations (restart, patch)
Restart and patch must hold the lease across the inner
`stop → docker rm → start` sequence, otherwise a concurrent stop or
restart could observe a half-recreated runtime.
`startruntime.Service` and `stopruntime.Service` therefore expose a
second public method:
```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```
`Handle` acquires the lease, defers its release, and calls `Run`.
Restart and patch acquire the outer lease themselves and call `Run`
on the inner services. The inner services record their own
`operation_log` entries, telemetry counters, health events, and admin
notification intents identically to a top-level `Handle`.
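
A minimal sketch of how `Handle` wraps `Run`, with illustrative lease-store,
`Input`, and `Result` shapes (the real types are much wider):

```go
package startruntime // illustrative shapes only

import (
	"context"
	"time"
)

type Input struct{ GameID string }

type Result struct{ ErrorCode string }

type LeaseStore interface {
	Acquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type Service struct {
	leases   LeaseStore
	leaseTTL time.Duration
	newToken func() string
}

// Handle is the external entry point: acquire the per-game lease, defer its
// release, and delegate the lifecycle work to Run.
func (service *Service) Handle(ctx context.Context, input Input) (Result, error) {
	token := service.newToken()
	acquired, err := service.leases.Acquire(ctx, input.GameID, token, service.leaseTTL)
	if err != nil {
		return Result{ErrorCode: "service_unavailable"}, nil
	}
	if !acquired {
		return Result{ErrorCode: "conflict"}, nil
	}
	defer service.leases.Release(ctx, input.GameID, token)
	return service.Run(ctx, input)
}

// Run assumes the lease is already held; restart and patch call it directly
// under their own outer lease.
func (service *Service) Run(ctx context.Context, input Input) (Result, error) {
	// ... Docker, persistence, telemetry, health events ...
	return Result{}, nil
}
```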
A typed `LeaseTicket` parameter (a small internal-package zero-size
struct that only the lease store can construct) was considered and
rejected for v1: only sister services in `internal/service/` ever call
`Run`, the docstring is loud about the precondition, and the pattern
can be tightened later without breaking the public surface that
handlers and consumers rely on.
## 13. Correlation id on `source_ref`
The outer restart and patch services reuse the existing
`Input.SourceRef` as a correlation key:
- when `Input.SourceRef` is non-empty (REST request id, stream entry
id), all three entries — outer restart / patch + inner stop +
inner start — share that value;
- when empty, the outer service generates a 32-byte base64url string
via the same `NewToken` generator that produces lease tokens, and
uses it as the correlation key for all three entries.
The outer entry's `source_ref` keeps its dual semantics: actor ref
when the caller supplied one, generated correlation id otherwise. Pure
top-level operations (caller invokes start, stop, or cleanup directly)
keep the original meaning. Composed operations (restart, patch) use
the same value in three places to make audit queries trivial.
This is not the cleanest end-state — a dedicated `correlation_id`
column would carry the link without ambiguity — but it is the smallest
change that does not touch the schema. A future stage that adds the
column can rename the field and clear up the dual role in one move.
## 14. Semver validation for patch
`internal/service/patchruntime/semver.go` enforces the
patch-precondition (current and new `image_ref` parse as semver, share
major and minor):
- `extractSemverTag(imageRef)` parses with
`github.com/distribution/reference.ParseNormalizedNamed`, casts to
`reference.NamedTagged`, then validates the tag with
`golang.org/x/mod/semver.IsValid` (after prepending `v` when the tag
omits it). Failures map to `image_ref_not_semver`;
- `samePatchSeries(currentSemver, newSemver)` compares
`semver.MajorMinor` of the two canonical strings; mismatch maps to
`semver_patch_only`.
`golang.org/x/mod` is a direct require to avoid a transitive-version
surprise. `github.com/Masterminds/semver/v3` (also in the module
graph) was rejected to avoid two semver libraries on disk for the
same job; `x/mod/semver` already covers Lobby. A hand-rolled
`vMajor.Minor.Patch` parser was rejected as premature.
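
A minimal sketch of the two helpers under the stated library choices; the
error values and the `main` demo are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var (
	errNotSemver       = errors.New("image_ref_not_semver")
	errSemverPatchOnly = errors.New("semver_patch_only")
)

// extractSemverTag returns the canonical "vMAJOR.MINOR.PATCH" form of the
// image tag, or errNotSemver when the ref has no tag or the tag is not semver.
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", errNotSemver
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemver
	}
	tag := tagged.Tag()
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag
	}
	if !semver.IsValid(tag) {
		return "", errNotSemver
	}
	return semver.Canonical(tag), nil
}

// samePatchSeries reports whether two canonical semver strings share the
// same major.minor series.
func samePatchSeries(currentSemver, newSemver string) error {
	if semver.MajorMinor(currentSemver) != semver.MajorMinor(newSemver) {
		return errSemverPatchOnly
	}
	return nil
}

func main() {
	current, _ := extractSemverTag("ghcr.io/acme/engine:1.4.2")
	next, _ := extractSemverTag("ghcr.io/acme/engine:1.4.3")
	fmt.Println(current, next, samePatchSeries(current, next)) // v1.4.2 v1.4.3 <nil>
}
```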
Pre-checks run before any inner stop or `docker rm`: a rejected patch
never disturbs the running runtime. Patch with
`new_image_ref == current_image_ref` proceeds through the recreate
flow unchanged (not `replay_no_op`: the inner start still runs); the
outer `op_kind=patch` entry records the no-op patch for audit.
## 15. `StopReason` placement
The reason enum mirrors `lobby/internal/ports/runtimemanager.go`
verbatim and lives at `internal/service/stopruntime/stopreason.go`.
The stream consumer and the REST handler import `stopruntime` for
the same enum the service requires.
Inner stop calls from restart and patch always pass
`StopReasonAdminRequest`. Restart and patch are platform-internal
recreate flows; `admin_request` is the closest semantic match in the
five-value vocabulary. The actor that originated the recreate (REST
request id, admin user id) flows through the `op_source` /
`source_ref` pair, not through the stop reason.
## 16. Error code centralisation
`internal/service/startruntime/errors.go` is the canonical home for
the stable error codes returned in `Result.ErrorCode`. The other four
services (`stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) import the constants from `startruntime` rather
than redeclaring them. The package comment of `errors.go` flags the
shared usage so future readers do not chase per-service declarations.
`start_config_invalid` is reserved for start because every start
validation failure also raises an admin notification intent. The
other services use the more general `invalid_request` for input
validation failures.
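
A minimal sketch of the shared declarations; the constant names are
illustrative — only the string values are fixed by the contract — and the
patch-specific codes (`image_ref_not_semver`, `semver_patch_only`) are
omitted here:

```go
package startruntime // illustrative constant names

// Stable error codes surfaced in Result.ErrorCode. stopruntime,
// restartruntime, patchruntime, and cleanupcontainer import these rather
// than redeclaring them.
const (
	ErrorCodeInvalidRequest       = "invalid_request"
	ErrorCodeStartConfigInvalid   = "start_config_invalid" // start-only: also raises an admin notification intent
	ErrorCodeImagePullFailed      = "image_pull_failed"
	ErrorCodeContainerStartFailed = "container_start_failed"
	ErrorCodeReplayNoOp           = "replay_no_op"
	ErrorCodeConflict             = "conflict"
	ErrorCodeNotFound             = "not_found"
	ErrorCodeServiceUnavailable   = "service_unavailable"
	ErrorCodeInternalError        = "internal_error"
)
```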
## 17. Stop / restart / patch / cleanup failure tables
### `stopruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | No notification intent. |
| Lease busy | `conflict` | Lease release skipped because acquire returned false. |
| Lease error | `service_unavailable` | Redis unreachable. |
| Record missing | `not_found` | |
| Status `stopped` / `removed` | success / `replay_no_op` | Idempotent re-stop. |
| `docker.Stop` returns `ErrContainerNotFound` | success | Record transitions `running → removed`, `container_disappeared` health event published. |
| `docker.Stop` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` (CAS race) | success / `replay_no_op` | The desired state was reached by another path (reconciler / restart). |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | Record vanished mid-stop. |
| `UpdateStatus` other error | `internal_error` | |
### `restartruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | Same as stop. |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | `image_ref` may be empty; restart cannot proceed. |
| Inner stop fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner stop failed: ". |
| `docker.Remove` fails | `service_unavailable` | Inner stop already moved record to `stopped`; runtime stays in `stopped`. Admin must call `cleanup_container` before retrying restart. |
| Inner start fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner start failed: ". |
The post-stop `docker rm` failure is the only path that leaves the
runtime in a state from which the same operation cannot recover by
itself: a residual `galaxy-game-{game_id}` container blocks a fresh
inner start (the start service surfaces this as
`container_start_failed`). The runbook entry — "call cleanup, then
restart again" — is the standard remedy.
### `patchruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | |
| Current `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| New `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| Major / minor mismatch | `semver_patch_only` | Pre-check; no inner ops fired. |
| Inner stop / `docker rm` / inner start fails | inherits inner code | Same propagation as restart. |
### `cleanupcontainer`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | success / `replay_no_op` | |
| Status `running` | `conflict` | Error message: "stop the runtime first". |
| Status `stopped` | proceed | |
| `docker.Remove` returns `ErrContainerNotFound` | success | Adapter swallows not-found into nil. |
| `docker.Remove` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` | success / `replay_no_op` | Race with reconciler dispose. |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | |
| `UpdateStatus` other error | `internal_error` | |
## 18. REST handler conventions
The internal HTTP handlers under
[`../internal/api/internalhttp/handlers/`](../internal/api/internalhttp/handlers)
follow these rules:
- **`X-Galaxy-Caller` header.** The optional header carries the
calling service identity (`gm` / `admin`); the handler records the
value as `op_source` in `operation_log` (`gm_rest` / `admin_rest`).
Missing or unknown values default to `admin_rest` because every
audit-log query already filters on the cleanup endpoint
(`op_source ∈ {auto_ttl, admin_rest}`); making the default match
the most-restricted surface keeps existing dashboards correct when
an unconfigured client hits the listener. The header is declared as
a reusable parameter (`components.parameters.XGalaxyCallerHeader`)
in the OpenAPI spec and is referenced from each runtime operation
but not from `/healthz` and `/readyz`.
- **Error code → HTTP status mapping.** One canonical table in
`handlers/common.go`:
| ErrorCode | HTTP status |
| --- | ---: |
| (success, including `replay_no_op`) | 200 |
| `invalid_request`, `start_config_invalid`, `image_ref_not_semver` | 400 |
| `not_found` | 404 |
| `conflict`, `semver_patch_only` | 409 |
| `service_unavailable`, `docker_unavailable` | 503 |
| `internal_error`, `image_pull_failed`, `container_start_failed` | 500 |
`image_pull_failed` and `container_start_failed` are operational
failures that originate inside RTM (registry / daemon problems),
not client-side validation issues; they map to `500` so callers
retry through their normal resilience paths instead of treating
the call as a 4xx that must be fixed at the source.
`docker_unavailable` is reserved for future producers; today the
start service emits `service_unavailable` for Docker-daemon
failures. Unknown error codes default to `500`. A sketch of the
mapping helper follows after this list.
- **List and Get bypass the service layer.** `internalListRuntimes`
and `internalGetRuntime` read directly from
`ports.RuntimeRecordStore`. Reads do not produce `operation_log`
rows, do not change Docker state, do not need the per-game lease,
and do not have a stream-side counterpart — none of the lifecycle
service machinery is justified.
- **`RuntimeRecordStore.List(ctx)` returns every record regardless
of status.** A single SELECT ordered by
`(last_op_at DESC, game_id ASC)` — the same direction the
`runtime_records_status_last_op_idx` index supports, so freshly
active games surface first. Pagination is intentionally not
modelled in v1; the working set is bounded by the games tracked
by Lobby.
- **Per-handler service ports use `mockgen`.** The handler layer
depends on five narrow interfaces — one per lifecycle service —
declared in `handlers/services.go`. Production wiring passes the
concrete `*<lifecycle>.Service` pointers (each satisfies the
matching interface implicitly); tests pass the mockgen-generated
mocks under `handlers/mocks/`.
- **Conformance test scope.** `internalhttp/conformance_test.go`
drives every documented runtime operation against a real
`internalhttp.Server` whose service deps are deterministic stubs.
The test uses `kin-openapi/routers/legacy.NewRouter`, calls
`openapi3filter.ValidateRequest` and
`openapi3filter.ValidateResponse` so both directions match the
contract. The scope is happy-path only; the failure-path response
shapes are validated by the per-handler tests.
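
The error-code table above collapses into a single switch; a minimal sketch,
assuming an illustrative helper name for the function in `handlers/common.go`:

```go
package handlers // illustrative

import "net/http"

// httpStatusFor maps a stable ErrorCode to the HTTP status documented above.
// Unknown codes fall through to 500.
func httpStatusFor(errorCode string) int {
	switch errorCode {
	case "", "replay_no_op":
		return http.StatusOK
	case "invalid_request", "start_config_invalid", "image_ref_not_semver":
		return http.StatusBadRequest
	case "not_found":
		return http.StatusNotFound
	case "conflict", "semver_patch_only":
		return http.StatusConflict
	case "service_unavailable", "docker_unavailable":
		return http.StatusServiceUnavailable
	case "internal_error", "image_pull_failed", "container_start_failed":
		return http.StatusInternalServerError
	default:
		return http.StatusInternalServerError
	}
}
```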
+412
View File
@@ -0,0 +1,412 @@
# Background Workers
This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):
- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
[`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
`InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
`/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
periodic TTL cleanup.
The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.
## 1. Single ownership per `event_type`
The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.
| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |
`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.
## 2. `container_disappeared` is conditional on PG state
The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply:
- record missing → suppress (the destroyed container was never owned
by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
swapped to a new container through restart or patch; the destroy is
the expected removal of the prior container id).
Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.
A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.
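
A minimal sketch of the decision, with a pared-down illustrative record
shape; the conservative read-failure case is folded in as the first rule:

```go
package dockerevents // illustrative

// RuntimeRecord is a pared-down stand-in for the real domain type.
type RuntimeRecord struct {
	Status             string
	CurrentContainerID string
}

// shouldEmitContainerDisappeared applies the three suppression rules to a
// Docker "destroy" action observed for eventContainerID.
func shouldEmitContainerDisappeared(record *RuntimeRecord, readErr error, eventContainerID string) bool {
	switch {
	case readErr != nil:
		return false // cannot tell external destroy from RTM-initiated removal
	case record == nil:
		return false // destroyed container was never a tracked runtime
	case record.Status != "running":
		return false // expected tail of a finished stop or cleanup
	case record.CurrentContainerID != eventContainerID:
		return false // restart or patch already swapped to a new container
	default:
		return true // unexpected destroy of the live container
	}
}
```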
## 3. `die` with exit code `0` is suppressed
`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.
## 4. Inspect worker leaves `container_disappeared` to the reconciler
When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:
- the reconciler is the single authority for PG-drift reconciliation.
Adding a third source for `container_disappeared` would risk double
emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
minutes. The latency window for "Docker drops the container, RTM
notices" is therefore at most 5 minutes in v1, which is acceptable
for the kinds of drift the reconciler covers (manual `docker rm`
outside RTM, daemon restart with stale records). If a future
requirement tightens the window, promoting the inspect-side
observation to a real `container_disappeared` is a one-line change.
## 5. Probe hysteresis is in-memory and pruned per tick
The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:
- `consecutiveFailures` — incremented on every failed probe, reset on
every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
the failure persists, and triggers a single `probe_recovered` on the
first success after the threshold was crossed.
The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.
State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.
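
A minimal sketch of the bookkeeping, assuming an illustrative `observe`
helper and a configuration-supplied threshold:

```go
package healthprobe // illustrative

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe returns which health event (if any) to publish for one probe
// outcome: "probe_failed" once when the threshold is crossed,
// "probe_recovered" once on the first success after a published failure.
func (state *probeState) observe(success bool, threshold int) string {
	if success {
		state.consecutiveFailures = 0
		if state.failurePublished {
			state.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	state.consecutiveFailures++
	if state.consecutiveFailures >= threshold && !state.failurePublished {
		state.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```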
## 6. Probe concurrency is bounded by a fixed cap
Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`). Three reasons:
- A single slow engine cannot delay the entire cohort. Sequential
per-game probing would multiply the worst case by `len(records)`,
which is the wrong shape for what is fundamentally a fan-out
observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
cap) was rejected to avoid pathological CPU and connection bursts
if the running list ever grows beyond what RTM was sized for. 16
in-flight probes at the default 2s timeout fit a single RTM
instance well within typical OS file-descriptor and TCP
ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
single-instance and the active-game count is bounded by Lobby; a
configurable cap is something we promote to env if a real workload
demands it.
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.
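
A minimal sketch of the bounded fan-out inside one tick, with illustrative
names for the game list and the probe callback:

```go
package healthprobe // illustrative

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

// probeAll fans out one probe per running game, capped at 16 in flight.
func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	semaphore := make(chan struct{}, defaultMaxConcurrency)
	var waitGroup sync.WaitGroup
	for _, gameID := range gameIDs {
		waitGroup.Add(1)
		semaphore <- struct{}{} // blocks once the cap is reached
		go func(gameID string) {
			defer waitGroup.Done()
			defer func() { <-semaphore }()
			probeOne(ctx, gameID)
		}(gameID)
	}
	waitGroup.Wait()
}
```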
## 7. Events listener reconnects with fixed backoff
The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while ctx is alive.
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.
A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.
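
A minimal sketch of the outer loop, assuming an illustrative `consumeOnce`
stand-in for one subscribe-and-drain pass and an `slog`-style logger:

```go
package dockerevents // illustrative

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

// runWithReconnect re-subscribes after any transport error, indefinitely,
// until the context is cancelled.
func runWithReconnect(ctx context.Context, logger *slog.Logger, consumeOnce func(context.Context) error) error {
	for {
		if err := consumeOnce(ctx); err != nil {
			logger.Warn("docker events subscription ended", "error", err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(defaultReconnectBackoff):
			// fixed backoff, then retry while ctx is alive
		}
	}
}
```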
## 8. Health publisher remains best-effort
Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.
## 9. Stream offset labels are stable identifiers
Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels match `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.
## 10. `OpSource` and `SourceRef` originate at the consumer boundary
Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.
## 11. Replay-no-op detection lives in the service layer
The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.
The wire-shape consequences:
- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
start, the existing record carries `container_id` and
`engine_endpoint`; for stop on `status=removed`, both fields are
empty strings (the record was nulled by an earlier cleanup) — the
AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
returned a zero `Record`; the consumer publishes empty
`container_id` and `engine_endpoint`.
## 12. Per-message errors are absorbed; the offset always advances
The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.
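
A minimal sketch of the per-message policy, with illustrative stand-ins for
the handling pipeline (decode → `Service.Handle` → job-result publish) and
the `StreamOffsetStore.Save` call:

```go
package startjobsconsumer // illustrative

import (
	"context"
	"fmt"
	"log/slog"
)

// processMessage shows the asymmetry: handler failures are absorbed, a
// failed offset save is not.
func processMessage(
	ctx context.Context,
	logger *slog.Logger,
	entryID string,
	handleOne func(context.Context, string) error,
	saveOffset func(context.Context, string) error,
) error {
	if err := handleOne(ctx, entryID); err != nil {
		// Decode errors, service errors, and publish failures: log and move on;
		// the durable side effects already happened on this pass.
		logger.Error("message handling failed", "entry_id", entryID, "error", err)
	}
	if err := saveOffset(ctx, entryID); err != nil {
		// Losing the offset would replay the whole backlog, so this error
		// propagates and the component supervisor exits the process.
		return fmt.Errorf("save stream offset: %w", err)
	}
	return nil
}
```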
## 13. `requested_at_ms` is logged-only
The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.
## 14. Reconciler: per-game lease around every write
A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).
The reconciler splits its work into two phases:
- **Read pass — lockless.**
`docker.List({com.galaxy.owner=rtmanager})` followed by
`RuntimeRecords.ListByStatus(running)`. No lease is taken; both
reads are point-in-time observations of independent systems and a
stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
(`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
per-game lease, re-reads the record under the lease, and then
either applies the mutation or returns when state has changed.
A lease conflict (`acquired=false`) is logged at `info` and the
game is silently skipped — the next tick will retry. A lease-store
error is logged at `warn`; the rest of the pass continues.
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.
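
A minimal sketch of one lease-guarded mutation (`disposeOne` here), with
pared-down illustrative port shapes; `adoptOne` and `observedExitedOne`
follow the same acquire → re-read → mutate-or-skip pattern:

```go
package reconcile // illustrative shapes only

import (
	"context"
	"log/slog"
	"time"
)

type LeaseStore interface {
	Acquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type RuntimeRecord struct{ Status string }

type RecordStore interface {
	Get(ctx context.Context, gameID string) (*RuntimeRecord, error)
}

type Reconciler struct {
	leases   LeaseStore
	records  RecordStore
	logger   *slog.Logger
	leaseTTL time.Duration
	newToken func() string
}

func (reconciler *Reconciler) disposeOne(ctx context.Context, gameID string) {
	token := reconciler.newToken()
	acquired, err := reconciler.leases.Acquire(ctx, gameID, token, reconciler.leaseTTL)
	if err != nil {
		reconciler.logger.Warn("lease store error, skipping game", "game_id", gameID, "error", err)
		return
	}
	if !acquired {
		reconciler.logger.Info("lease busy, skipping game", "game_id", gameID)
		return // next tick retries
	}
	defer reconciler.leases.Release(ctx, gameID, token)

	// Re-read under the lease: the lockless read pass may be stale by now.
	record, err := reconciler.records.Get(ctx, gameID)
	if err != nil || record == nil || record.Status != "running" {
		return // state moved on; nothing to dispose
	}
	// ... mark status=removed via the CAS UpdateStatus, publish
	// container_disappeared, append op_kind=reconcile_dispose ...
}
```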
## 15. Three drift kinds covered by the reconciler
- `adopt` — Docker reports a container labelled
`com.galaxy.owner=rtmanager` for which RTM has no record; insert a
fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
stop or remove the container (operators may have started it
manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
in Docker; mark `status=removed`, publish
`container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
exists but is in `exited`; mark `status=stopped`, publish
`container_exited` with the observed exit code. This third path
exists because the events listener sees only live events; a
container that died while RTM was offline would otherwise stay
`running` indefinitely. The drift is exposed through
`rtmanager.reconcile_drift{kind=observed_exited}` and through the
`container_exited` health event; no `operation_log` entry is
written because the audit log records explicit RTM operations, not
passive observations of Docker state.
## 16. `stopped_at = now (reconciler observation time)`
The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.
The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time` field;
that change is deferred until a consumer materialises.
## 17. Synchronous initial pass + periodic Component
`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.
The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.
## 18. Adopt through `Upsert`, race with start is benign
The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.
Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:
- `created_at` is preserved across upserts by the
`ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
container, same image, same hostname, same state path).
Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.
## 19. Cleanup worker delegates to the service
The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate. The service already owns:
- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
`op_source=auto_ttl`);
- the telemetry counter and structured log fields.
In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.
The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.
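
A minimal sketch of the TTL math and the worker-local interface, with
illustrative shapes — the real `Cleaner` method takes the cleanup service's
input struct rather than a bare game id:

```go
package containercleanup // illustrative

import (
	"context"
	"time"
)

// Cleaner is the worker-local interface satisfied by the real
// cleanupcontainer service; keeping it here keeps the worker's tests light.
type Cleaner interface {
	Handle(ctx context.Context, gameID string) error // OpSource=auto_ttl in the real call
}

// StoppedRecord is a pared-down stand-in for one listed stopped runtime.
type StoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

// cleanExpired filters the listed stopped records in process and hands each
// record older than the retention window to the cleanup service.
func cleanExpired(ctx context.Context, now time.Time, retention time.Duration, stopped []StoppedRecord, cleaner Cleaner) {
	for _, record := range stopped {
		if !record.LastOpAt.Before(now.Add(-retention)) {
			continue // still inside the retention window
		}
		_ = cleaner.Handle(ctx, record.GameID) // failures are the service's concern; next tick retries
	}
}
```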
## 20. Sequential per-game work in reconciler and cleanup
Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.
## 21. Cross-module test boundary for the consumer integration test
[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:
- it XADDs a start envelope in the AsyncAPI wire shape (the same
shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
the persistence stores, the lease, and the notification / health
publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
`runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
`error_code=replay_no_op` outcome with no further Docker calls.
The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.