feat: runtime manager

This commit is contained in:
Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
@@ -0,0 +1,44 @@
# Runtime Manager — Service-Local Documentation
This directory hosts the service-local documentation for `Runtime
Manager`. The top-level [`../README.md`](../README.md) describes the
current-state contract (purpose, scope, lifecycles, surfaces,
configuration, observability); the documents below complement it with
focused content docs and design-rationale records.
## Content docs
- [Runtime and components](runtime.md) — process diagram, listeners,
workers, lifecycle services, stream offsets, configuration groups,
runtime invariants.
- [Flows](flows.md) — mermaid sequence diagrams for the lifecycle and
observability flows.
- [Operator runbook](runbook.md) — startup, readiness, shutdown, and
recovery scenarios.
- [Configuration and contract examples](examples.md) — `.env`,
REST request bodies, stream payloads, storage inspection snippets.
## Design rationale
- [PostgreSQL schema decisions](postgres-migration.md) — the schema
decision record consolidating the persistence-layer agreements
(tables, indexes, CAS shape, `created_at` preservation, jsonb
round-trip, schema/role provisioning split).
- [Domain and ports](domain-and-ports.md) — string-typed enums, the
four allowed runtime transitions, why `Inspect` splits into
`InspectImage` / `InspectContainer`, why `LobbyGameRecord` is
minimal, and other domain-layer choices.
- [Adapters](adapters.md) — Docker SDK adapter, Lobby internal HTTP
client, the three Redis publishers, the `mockgen` convention for
wide ports, and the unit-test strategy for HTTP-backed adapters.
- [Lifecycle services](services.md) — per-game lease semantics, the
`Result`-shaped contract, failure-mode tables, the lease-bypass
`Run` method on inner services, the `X-Galaxy-Caller` header
convention, and the canonical error code → HTTP status mapping.
- [Background workers](workers.md) — single-ownership table per
`event_type`, `container_disappeared` suppression rules, probe
hysteresis, the events listener reconnect policy, the reconciler's
per-game lease and three drift kinds.
- [Service-local integration suite](integration-tests.md) — the
`integration` build tag, the in-process `app.NewRuntime` choice,
the Lobby HTTP stub, and the test isolation strategy.
@@ -0,0 +1,192 @@
# Adapters
This document explains why the production adapters under
[`../internal/adapters/`](../internal/adapters) — Docker SDK,
Lobby internal HTTP client, notification-intent publisher, health-event
publisher, job-result publisher — are shaped the way they are. The
PostgreSQL stores and the Redis-coordination adapters live in
[`postgres-migration.md`](postgres-migration.md).
## 1. `mockgen` is the repo-wide convention for wide ports
The Docker port has nine methods plus eight value types in the
signatures, and the lifecycle services and workers (start, stop,
restart, patch, cleanup, reconcile, events, probe) between them
exercise nearly every method.
A hand-rolled fake would either miss methods or balloon to a per-test
fixture.
`internal/adapters/docker/` therefore uses `go.uber.org/mock` mocks:
- `//go:generate` directives live next to the interface declaration in
`internal/ports/dockerclient.go`;
- generated code is committed under `internal/adapters/docker/mocks/`
(matching the `internal/adapters/postgres/jet/` discipline);
- `make -C rtmanager mocks` is the single command operators run after
a port-signature change.
The maintained `go.uber.org/mock` fork is preferred over the archived
`github.com/golang/mock`. This convention applies to wide / recorder
ports across the repository — Lobby uses the same pipeline for its
narrow recorder ports (`RuntimeManager`, `IntentPublisher`,
`GMClient`, `UserService`); see
[`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for the cross-service
rule.
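For reference, a hedged sketch of the directive shape from the first
bullet above; the exact `mockgen` flags and output path here are
assumptions, and the committed directive in
`internal/ports/dockerclient.go` is authoritative:
```go
// internal/ports/dockerclient.go — directive shape only; the interface body
// is elided and the flags below are illustrative.
package ports

//go:generate go run go.uber.org/mock/mockgen -source=dockerclient.go -destination=../adapters/docker/mocks/dockerclient_mock.go -package=mocks
```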
The other two RTM ports (`LobbyInternalClient`,
`NotificationIntentPublisher`) keep inline `_test.go` fakes: small
surfaces, easy to fake by hand inside a single test file when needed.
## 2. `EngineEndpoint` is built inside the Docker adapter
The engine port is fixed at `8080`. Pushing it into `RunSpec` would
force the start service to know an engine implementation detail;
pushing it into config would give operators a knob that the engine
image already does not honour. The Docker adapter exposes
`EnginePort = 8080` as a package constant and constructs
`RunResult.EngineEndpoint = "http://" + spec.Hostname + ":8080"`
itself.
The adapter also leaves `container.Config.ExposedPorts` empty: RTM
never publishes ports to the host. The user-defined Docker bridge
network gives every container in the network DNS access to the engine
via `galaxy-game-{game_id}:8080`.
## 3. `Run` removes the container on `ContainerStart` failure
`README.md §Lifecycles → Start` requires no orphan to remain after a
failed start path. If `ContainerCreate` succeeds but `ContainerStart`
fails, the adapter calls `ContainerRemove(force=true)` inside a fresh
`context.Background()` (with a 10s timeout) so the cleanup runs even
when the original ctx is already cancelled. The cleanup is best-effort:
a remove failure is silently discarded because the original start
failure is the actionable error returned to the caller.
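A minimal sketch of that rollback rule, assuming recent Docker SDK
option types (`container.StartOptions`, `container.RemoveOptions`);
the helper name and surrounding plumbing are illustrative, not the
adapter's actual code:
```go
package docker

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// startOrRemove starts an already-created container and, on failure, removes
// it on a fresh background context so the cleanup still runs when the
// caller's ctx is already cancelled.
func startOrRemove(ctx context.Context, cli *client.Client, id string) error {
	if err := cli.ContainerStart(ctx, id, container.StartOptions{}); err != nil {
		cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		// Best-effort: a remove failure is discarded because the start
		// failure is the actionable error returned to the caller.
		_ = cli.ContainerRemove(cleanupCtx, id, container.RemoveOptions{Force: true})
		return fmt.Errorf("container start: %w", err)
	}
	return nil
}
```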
The alternative — leaving rollback to the start service — would either
duplicate the same code in every caller or invite a service that forgets
to do it. Centralising the rule in the adapter keeps the port contract
simple. The start service adds an additional rollback layer for the
post-`Run` `Upsert` failure path; see [`services.md`](services.md) §5.
## 4. `RunSpec.Cmd` is optional
`ports.RunSpec` exposes an optional `Cmd []string`. Production callers
leave it `nil` so the engine image's own `CMD` runs;
`internal/adapters/docker/smoke_test.go` uses it to drive
`["/bin/sh","-c","sleep 60"]` against `alpine:3.21`.
The alternative — building a dedicated test image with a pre-baked
`sleep` command — would require an extra `Dockerfile` under testdata
and a build step inside the smoke test. The single new field is
documented as optional and ignored when empty; production behaviour is
unchanged.
## 5. `EventsListen` filters at the adapter boundary
The Docker `/events` API accepts a `filters` query parameter, but the
daemon treats it as a hint, not a guarantee. The adapter therefore
double-checks at the boundary: only `Type == events.ContainerEventType`
messages are passed through to the typed `<-chan ports.DockerEvent`.
Doing the filter at the SDK level would still require a defensive
recheck on the consumer side; consolidating the check in the adapter
keeps the contract crisp and the consumer free of Docker-internal type
discriminants.
The decoded event copies the actor's full `Attributes` map into
`DockerEvent.Labels`. Docker mixes container labels and runtime
attributes (`exitCode`, `image`, `name`, etc.) flat in the same map;
RTM consumers filter by the `com.galaxy.` prefix when they care about
labels, and the adapter extracts `exitCode` separately for `die`
events.
## 6. Lobby HTTP client error mapping
`ports.LobbyInternalClient.GetGame` fixes:
- `200` → `LobbyGameRecord` decoded tolerantly (unknown fields
ignored);
- `404` → `ports.ErrLobbyGameNotFound`;
- transport, timeout, or any other non-2xx → `ports.ErrLobbyUnavailable`
wrapped with the original error so callers can `errors.Is` and still
log the cause.
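A hedged sketch of that mapping. The record fields and sentinel error
names follow the text above; the JSON tags, helper name, and HTTP
plumbing are illustrative:
```go
package lobbyclient

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrLobbyGameNotFound = errors.New("lobby game not found")
	ErrLobbyUnavailable  = errors.New("lobby unavailable")
)

type LobbyGameRecord struct {
	GameID              string `json:"game_id"`
	Status              string `json:"status"`
	TargetEngineVersion string `json:"target_engine_version"`
}

// decodeGetGame maps a completed response to the port contract: 200 decodes
// tolerantly, 404 becomes ErrLobbyGameNotFound, anything else becomes
// ErrLobbyUnavailable carrying the underlying cause in its message.
func decodeGetGame(resp *http.Response) (*LobbyGameRecord, error) {
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusOK:
		var rec LobbyGameRecord
		if err := json.NewDecoder(resp.Body).Decode(&rec); err != nil {
			return nil, fmt.Errorf("%w: decode body: %v", ErrLobbyUnavailable, err)
		}
		return &rec, nil
	case http.StatusNotFound:
		return nil, ErrLobbyGameNotFound
	default:
		return nil, fmt.Errorf("%w: unexpected status %d", ErrLobbyUnavailable, resp.StatusCode)
	}
}
```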
The start service treats `ErrLobbyUnavailable` as recoverable: it
continues without the diagnostic data because the start envelope
already carries the only required field (`image_ref`). The client
mirrors `notification/internal/adapters/userservice/client.go`: cloned
`*http.Transport`, `otelhttp.NewTransport` wrap, per-request
`context.WithTimeout`, idempotent `Close()` releasing idle connections.
JSON decoding is tolerant: unknown fields in the success body do not
break the call, so additive changes to Lobby's `GameRecord` schema do
not require an RTM release.
## 7. Notification publisher wrapper signature
The wrapper drops the entry id returned by
`notificationintent.Publisher.Publish` (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §7). The adapter is a
thin shim:
- `NewPublisher(cfg)` constructs the inner publisher and forwards
validation;
- `Publish(ctx, intent)` calls the inner publisher and discards the
entry id.
The compile-time assertion `var _ ports.NotificationIntentPublisher =
(*Publisher)(nil)` lives in `publisher.go`.
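A minimal sketch of the shim shape; the intent type, field names, and
import path are illustrative, only the discarded entry id is the point:
```go
package notificationpublisher

import (
	"context"

	"example.invalid/galaxy/pkg/notificationintent" // import path illustrative
)

// Publisher wraps the shared intent publisher behind the narrow RTM port.
type Publisher struct {
	inner *notificationintent.Publisher
}

// Publish forwards to the inner publisher and discards the stream entry id.
func (p *Publisher) Publish(ctx context.Context, intent notificationintent.Intent) error {
	_, err := p.inner.Publish(ctx, intent)
	return err
}
```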
## 8. Health-events publisher: snapshot upsert before stream XADD
Every emission goes through
`ports.HealthEventPublisher.Publish`, which both XADDs to
`runtime:health_events` and upserts `health_snapshots`. The snapshot
upsert runs **before** the XADD: a successful Publish always leaves
the snapshot store at least as fresh as the stream, and a partial
failure leaves the snapshot a best-effort lower bound. Reversing the
order would let consumers observe a stream entry whose
`health_snapshots` row reflects the prior observation — a misleading
inversion.
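A hedged sketch of the ordering, with the snapshot store and stream
writer reduced to illustrative single-method interfaces rather than the
adapter's real dependencies:
```go
package healtheventspublisher

import (
	"context"
	"fmt"
)

// snapshotStore and streamWriter stand in for the real dependencies behind
// ports.HealthEventPublisher.
type snapshotStore interface {
	Upsert(ctx context.Context, gameID, status string) error
}

type streamWriter interface {
	XAdd(ctx context.Context, fields map[string]any) error
}

// publish upserts the snapshot first, then XADDs to runtime:health_events,
// so a successful call never leaves the stream ahead of the snapshot store.
func publish(ctx context.Context, snaps snapshotStore, stream streamWriter, gameID, status string, fields map[string]any) error {
	if err := snaps.Upsert(ctx, gameID, status); err != nil {
		return fmt.Errorf("upsert health snapshot: %w", err)
	}
	if err := stream.XAdd(ctx, fields); err != nil {
		// Partial failure: the snapshot stays a best-effort lower bound.
		return fmt.Errorf("xadd health event: %w", err)
	}
	return nil
}
```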
The `event_type → SnapshotStatus / SnapshotSource` mapping mirrors the
table in [`../README.md` §Health Monitoring](../README.md). In
particular, `container_started` collapses to `SnapshotStatusHealthy`
and `probe_recovered` does the same (rationale in
[`domain-and-ports.md`](domain-and-ports.md) §4).
## 9. Unit-test strategy
Both HTTP-backed adapters (Docker SDK, Lobby client) use
`httptest.Server` fixtures. The Docker SDK speaks HTTP under the hood
for both unix sockets and TCP, so adapter unit tests construct a
Docker client with `client.WithHost(server.URL)` and
`client.WithHTTPClient(server.Client())`, which lets table-driven
handlers fake every Docker API endpoint without touching the real
daemon. The Docker API version is pinned to `1.45`
(`client.WithVersion("1.45")`) so the URL prefix is stable across CI
machines whose daemon advertises a different default. Production
wiring (in `internal/app/bootstrap.go`) keeps API negotiation enabled.
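A hedged sketch of the fixture shape; the helper name is illustrative,
while the client options are the ones named above:
```go
package docker_test

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/docker/docker/client"
)

// newFakeDockerClient points a Docker SDK client at an httptest.Server so
// table-driven handlers can fake any API endpoint. Pinning the API version
// keeps the request prefix (/v1.45/...) stable across machines.
func newFakeDockerClient(t *testing.T, handler http.Handler) *client.Client {
	t.Helper()
	srv := httptest.NewServer(handler)
	t.Cleanup(srv.Close)
	cli, err := client.NewClientWithOpts(
		client.WithHost(srv.URL),
		client.WithHTTPClient(srv.Client()),
		client.WithVersion("1.45"),
	)
	if err != nil {
		t.Fatalf("new docker client: %v", err)
	}
	t.Cleanup(func() { _ = cli.Close() })
	return cli
}
```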
The notification publisher uses `miniredis` directly because the
adapter's only side effect is an `XADD`, which `miniredis` reproduces
faithfully; this mirrors every other Galaxy intent test.
## 10. Docker smoke test
`internal/adapters/docker/smoke_test.go` runs on the default
`go test ./...` invocation and calls `t.Skip` unless the local daemon
is reachable (`/var/run/docker.sock` exists or `DOCKER_HOST` is set).
The covered sequence:
1. provision a temporary user-defined bridge network;
2. assert `EnsureNetwork` for present and missing names;
3. pull `alpine:3.21` (`PullPolicyIfMissing`);
4. subscribe to events;
5. run a sleep container with the full `RunSpec` field set;
6. observe a `start` event for the new container id;
7. inspect, stop, remove, and verify `ErrContainerNotFound` is
reported afterwards.
This is the production adapter's only end-to-end check that runs from
the default `go test` pass; the broader service-local integration
suite ([`integration-tests.md`](integration-tests.md)) is gated
behind `-tags=integration`.
@@ -0,0 +1,167 @@
# Domain and Ports
This document explains why the `rtmanager` domain layer
([`../internal/domain/`](../internal/domain)) and the port interfaces
([`../internal/ports/`](../internal/ports)) are shaped the way they are.
The current-state types and method signatures are the source of truth in
the code; this file records the rationale so future readers do not
re-litigate the same trade-offs.
For the surrounding behaviour see
[`../README.md`](../README.md), the SQL CHECK constraints in
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql),
the wire contracts under [`../api/`](../api), and
[`postgres-migration.md`](postgres-migration.md) for the persistence
layer.
## 1. String-typed status enums
`runtime.Status`, `operation.OpKind`, `operation.OpSource`,
`operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and
`health.SnapshotSource` are all `type X string`.
The string approach wins on three counts:
- the SQL CHECK constraints already store the values as `text`, so a
string domain type maps one-to-one with no codec layer;
- it matches Lobby (`game.Status`, `membership.Status`,
`application.Status`), so reviewers do not switch encoding mental
models when crossing service boundaries;
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8`
with stringer-generated names would pay a constant lookup and make raw
SQL columns harder to read in diagnostics.
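A hedged sketch of the pattern for `runtime.Status`; the value set
matches the SQL CHECK constraint, while the exact constant names in the
domain package may differ:
```go
package runtime

// Status is a string-typed enum so the value maps one-to-one onto the text
// column guarded by the SQL CHECK constraint.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown keeps the invariant cheap: a single switch, no lookup table.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	}
	return false
}
```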
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`
The PostgreSQL columns are nullable. The domain model uses plain
`string` with empty == NULL and bridges the SQL nullability inside the
adapter. Pointer fields would force every consumer to dereference
defensively even though business logic rarely cares about the
NULL/empty distinction (removed records may legitimately carry either
form depending on whether the record passed through `stopped` first).
The adapter's job is to translate `sql.NullString``string`; the rest
of the codebase reads the field as a regular value.
## 3. `*time.Time` for nullable timestamps
`StartedAt`, `StoppedAt`, `RemovedAt` retain pointer types. `time.Time{}`
is a real, comparable value in Go (`IsZero` only reports the canonical
zero time); mixing "missing" and "set to UTC zero" through plain
`time.Time` would invite bugs. The jet-generated `model.RuntimeRecords`
already declares the same fields as `*time.Time`, so the domain type
aligns with the persistence type and the adapter does not re-shape
pointers.
## 4. `EventType` and `SnapshotStatus` are deliberately distinct
`EventType` in `runtime-health-asyncapi.yaml` enumerates seven values; the
SQL CHECK on `health_snapshots.status` enumerates six. The two sets
overlap but are not identical:
- `container_started` is an *event*; the snapshot collapses it to
`healthy` (a successful start is observed as the container being
live, not as an ongoing event);
- `probe_recovered` is an *event*; it does not become a snapshot row of
its own — the next inspect/probe overwrites the prior `probe_failed`
with `healthy`.
Modelling them as one shared enum would require a separate "event vs
snapshot" boolean and invite accidental mismatches. Two distinct types
with explicit `IsKnown` matrices keep each surface honest at compile
time.
## 5. `Inspect` split into `InspectImage` + `InspectContainer`
Two narrow methods replace a single polymorphic `Inspect`. The surface
RTM exercises has two shapes:
- the start service inspects the *image* by reference to read resource
limits from labels;
- the periodic inspect worker, the reconciler, and the events listener
inspect *containers* by id to read state, health, restart count, and
exit code.
The inputs differ (ref vs id), and the result types differ
(`ImageInspect.Labels` is the only field used at start time, while
`ContainerInspect` carries a dozen state fields). One polymorphic
method would either split internally on input type or return a tagged
union; either is messier than two narrow methods.
## 6. `LobbyGameRecord` is intentionally minimal
`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and
`TargetEngineVersion`. The fetch is classified as ancillary diagnostics
because the start envelope already carries the only required field
(`image_ref`).
Anything more would invite RTM consumers to depend on Lobby's schema in
ways that violate the "RTM never resolves engine versions" rule.
Future fields are additive: each new field is opt-in to the consumer
and does not break existing call sites. The minimalism is also a hedge
against schema drift — Lobby's `GameRecord` is large and changes more
often than RTM needs to track.
## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`
Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so
business workflows that key on it (idempotency keys, audit
correlation) can capture it. RTM publishes admin-only failure intents
where the entry id has no consumer — failing starts do not loop back
to RTM, and notification routing keys on the producer-supplied
`idempotency_key` rather than the stream id. The adapter wraps
`pkg/notificationintent.Publisher` and discards the entry id at the
wrapper boundary.
## 8. Exactly four allowed runtime transitions
`runtime.AllowedTransitions` covers:
- `running → stopped` — graceful stop, observed exit, reconcile
observed exited;
- `running → removed``reconcile_dispose` when the container
vanished;
- `stopped → running` — restart and patch inner start;
- `stopped → removed` — cleanup TTL or admin DELETE.
Other pairs are intentionally rejected:
- `running → running` and `stopped → stopped` would mean Upsert
overwrote state without a CAS guard. Idempotent re-start / re-stop
never transitions; the service layer returns `replay_no_op` and the
record is left untouched.
- `removed → *` is forbidden because `removed` is terminal. The
reconciler creates fresh records with `reconcile_adopt` rather than
resurrecting old ones.
Encoding the table this way means a future bug where a service tries
to revive a removed record is rejected at the domain layer rather than
the adapter, which keeps the failure mode close to the offending code.
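A hedged sketch of one way to encode the table, building on the
`Status` sketch in §1; the domain package's actual representation may
differ, but the four allowed pairs are the ones listed above:
```go
package runtime

// allowedTransitions encodes the four legal status changes; removed is
// terminal and has no outgoing edges.
var allowedTransitions = map[Status]map[Status]bool{
	StatusRunning: {StatusStopped: true, StatusRemoved: true},
	StatusStopped: {StatusRunning: true, StatusRemoved: true},
}

// CanTransition reports whether the from/to pair is one of the allowed four.
func CanTransition(from, to Status) bool {
	return allowedTransitions[from][to]
}
```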
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`
The same enum exists as `config.ImagePullPolicy`. Importing
`internal/config` from the ports package would couple two unrelated
layers and create a cyclic risk once the wiring layer pulls both in.
The runtime/wiring layer (in `internal/app`) is the single point that
translates between the two type aliases — both are `string`-typed, the
value sets are identical, and the validation lives on each side
independently.
## 10. Compile-time interface assertions live with adapters
Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the
assertion lives in the adapter package (e.g.
`var _ ports.RuntimeRecordStore = (*Store)(nil)` inside
`internal/adapters/postgres/runtimerecordstore`). Putting the
assertions in the port package would force the port package to import
its own implementations and create an obvious import cycle.
## 11. `RunSpec.Validate` lives on the request type
The Docker port carries a non-trivial request type (`RunSpec`) with
eight required fields and per-mount invariants. Putting `Validate` on
the request struct keeps the rule next to the type definition, mirrors
the pattern used by `lobby/internal/ports/gmclient.go`
(`RegisterGameRequest.Validate`), and lets the adapter call it as the
first defensive check before invoking the Docker SDK.
@@ -0,0 +1,429 @@
# Configuration And Contract Examples
The examples below are illustrative. Replace `localhost`, port
numbers, IDs, and timestamps with values that match the deployment
under inspection.
## Example `.env`
A minimum-viable `RTMANAGER_*` set for a local run against a single
Redis container plus a PostgreSQL container with the `rtmanager`
schema and the `rtmanagerservice` role provisioned. The full list
with defaults lives in [`../README.md` §Configuration](../README.md).
```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s
# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0
# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h
# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60
# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s
# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```
For a production-shaped deployment, set
`RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so
a tag mutation is immediately visible to the next runtime),
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine
container's user, and configure `OTEL_*` against the cluster's OTLP
collector. The `RTMANAGER_DOCKER_LOG_DRIVER` /
`RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the
sink the operator runs (fluentd, journald, etc.).
For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and
`RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the
service-local harness brings up
([`integration-tests.md` §7](integration-tests.md)).
## Internal HTTP Examples
Every endpoint admits the optional `X-Galaxy-Caller` header which the
handler records as `op_source` in `operation_log` (`gm``gm_rest`,
`admin``admin_rest`; missing or unknown values default to
`admin_rest` in v1). Decision: [`services.md` §18](services.md).
### Probe a runtime record
```bash
curl -s -H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```
Response (`200 OK`):
```json
{
"game_id": "game-01HZ...",
"status": "running",
"current_container_id": "1f2a...",
"current_image_ref": "galaxy/game:1.4.0",
"engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
"state_path": "/var/lib/galaxy/games/game-01HZ...",
"docker_network": "galaxy-net",
"started_at": "2026-04-28T07:18:54Z",
"stopped_at": null,
"removed_at": null,
"last_op_at": "2026-04-28T07:18:54Z",
"created_at": "2026-04-28T07:18:54Z"
}
```
### List all runtimes
```bash
curl -s -H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes
```
The response shape is `{"items":[<RuntimeRecord>...]}`.
### Start a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: gm' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
-d '{"image_ref": "galaxy/game:1.4.0"}'
```
A `200` returns the `RuntimeRecord` for the running runtime. Failure
shapes use the canonical envelope; e.g. an invalid `image_ref`:
```json
{
"error": {
"code": "start_config_invalid",
"message": "image_ref shape rejected by docker reference parser"
}
}
```
### Stop a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
-d '{"reason": "admin_request"}'
```
Valid `reason` values:
`orphan_cleanup | cancelled | finished | admin_request | timeout`.
### Restart a runtime
```bash
curl -s -X POST \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```
The body is empty; restart re-uses the current `image_ref`.
### Patch a runtime
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Patch enforces the semver-only rule: a non-semver tag returns
`image_ref_not_semver`; a cross-major or cross-minor change returns
`semver_patch_only`.
### Cleanup a stopped runtime container
```bash
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```
Cleanup refuses a `running` runtime with `409 conflict`; stop first.
## Stream Payload Examples
Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`;
the defaults are used below. Field types and required/optional
semantics are frozen by
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml)
and
[`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).
### `runtime:start_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:start_jobs '*' \
game_id 'game-01HZ...' \
image_ref 'galaxy/game:1.4.0' \
requested_at_ms 1714081234567
```
### `runtime:stop_jobs` (Lobby → RTM)
```bash
redis-cli XADD runtime:stop_jobs '*' \
game_id 'game-01HZ...' \
reason 'cancelled' \
requested_at_ms 1714081234567
```
### `runtime:job_results` (RTM → Lobby)
Success envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code '' \
error_message ''
```
Failure envelope:
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'failure' \
container_id '' \
engine_endpoint '' \
error_code 'image_pull_failed' \
error_message 'pull failed: manifest unknown'
```
Idempotent replay envelope (success outcome with explicit
`replay_no_op`):
```bash
redis-cli XADD runtime:job_results '*' \
game_id 'game-01HZ...' \
outcome 'success' \
container_id '1f2a...' \
engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
error_code 'replay_no_op' \
error_message ''
```
The contract permits empty `container_id` and `engine_endpoint`
strings on every value of `outcome` so the consumer can decode the
envelope uniformly ([`workers.md` §11](workers.md)).
### `runtime:health_events` (RTM out)
The wire shape is the same for every event type — only the
`details` payload differs.
`container_started`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_started' \
occurred_at_ms 1714081234567 \
details '{"image_ref":"galaxy/game:1.4.0"}'
```
`container_exited`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_exited' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137,"oom":false}'
```
`container_oom`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_oom' \
occurred_at_ms 1714081234567 \
details '{"exit_code":137}'
```
`container_disappeared`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'container_disappeared' \
occurred_at_ms 1714081234567 \
details '{}'
```
`inspect_unhealthy`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'inspect_unhealthy' \
occurred_at_ms 1714081234567 \
details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```
`probe_failed` (after the threshold is crossed):
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_failed' \
occurred_at_ms 1714081234567 \
details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```
`probe_recovered`:
```bash
redis-cli XADD runtime:health_events '*' \
game_id 'game-01HZ...' \
container_id '1f2a...' \
event_type 'probe_recovered' \
occurred_at_ms 1714081234567 \
details '{"prior_failure_count":3}'
```
### `notification:intents` (RTM admin notifications)
RTM publishes admin-only notification intents only for the three
first-touch start failures. Every payload shares the frozen field
set `{game_id, image_ref, error_code, error_message,
attempted_at_ms}`
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
`runtime.image_pull_failed`:
```bash
redis-cli XADD notification:intents '*' \
envelope '{
"type": "runtime.image_pull_failed",
"producer": "rtmanager",
"idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
"audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
"payload": {
"game_id": "game-01HZ...",
"image_ref": "galaxy/game:1.4.0",
"error_code": "image_pull_failed",
"error_message": "pull failed: manifest unknown",
"attempted_at_ms": 1714081234567
}
}'
```
`runtime.container_start_failed` and `runtime.start_config_invalid`
share the same envelope with their respective `type` and
`error_code` values.
## Storage Inspection
### Inspect a runtime record (PostgreSQL)
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```
Columns mirror the fields documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout).
### Inspect runtime status counts
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```
### Inspect the operation log for a game
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = 'game-01HZ...'
ORDER BY started_at DESC, id DESC
LIMIT 50"
```
### Inspect the latest health snapshot
```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT game_id, container_id, status, source, observed_at, details
FROM rtmanager.health_snapshots
WHERE game_id = 'game-01HZ...'"
```
### Inspect Redis runtime-coordination keys
```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...
# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50
# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```
@@ -0,0 +1,305 @@
# Flows
This document collects the lifecycle and observability flows that
span Runtime Manager and its synchronous and asynchronous neighbours.
Narrative descriptions of the rules these flows enforce live in
[`../README.md`](../README.md); the diagrams here focus on the message
order across the boundary. Design-rationale records linked from each
section explain the *why*.
## Start (happy path)
```mermaid
sequenceDiagram
participant Lobby as Lobby publisher
participant Stream as runtime:start_jobs
participant Consumer as startjobsconsumer
participant Service as startruntime
participant Lease as Redis lease
participant Docker
participant PG as Postgres
participant Health as runtime:health_events
participant Results as runtime:job_results
Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
Consumer->>Stream: XREAD
Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
Service->>Docker: PullImage(image_ref) per pull policy
Service->>Docker: InspectImage → resource limits
Service->>Service: prepareStateDir(<root>/{game_id})
Service->>Docker: ContainerCreate + ContainerStart
Service->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
Service->>Health: XADD container_started
Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
REST callers (Game Master, Admin Service) drive the same service
through `POST /api/v1/internal/runtimes/{game_id}/start`; the
diagram's last two arrows collapse to an HTTP `200` response carrying
the runtime record. Sources:
[`../README.md` §Lifecycles → Start](../README.md#start),
[`services.md` §3](services.md).
## Start failure (image pull)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
participant Results as runtime:job_results
Service->>Docker: PullImage(image_ref)
Docker-->>Service: error
Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
```
The same shape applies to the configuration-validation failures
(`start_config_invalid` from `EnsureNetwork(ErrNetworkMissing)`,
`prepareStateDir`, or invalid `image_ref` shape) and the Docker
create/start failure (`container_start_failed`); only the error code
and the matching `runtime.*` notification type differ. Three failure
codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, `internal_error`
([`services.md` §4](services.md)).
## Start failure (orphan / Upsert-after-Run rollback)
```mermaid
sequenceDiagram
participant Service as startruntime
participant Docker
participant PG as Postgres
participant Intents as notification:intents
Service->>Docker: ContainerCreate + ContainerStart
Docker-->>Service: container running
Service->>PG: Upsert runtime_records
PG-->>Service: error (transport / constraint)
Note over Service: container is now an orphan<br/>(running, no PG record)
Service->>Docker: Remove(container_id) [fresh background context]
Docker-->>Service: ok or logged failure
Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
Service->>Intents: XADD runtime.container_start_failed
Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
```
The Docker adapter already removes the container when `Run` itself
fails after a successful `ContainerCreate`
([`adapters.md` §3](adapters.md)); the start service adds the
post-`Run` rollback for the `Upsert` path. A `Remove` failure is
logged but not propagated; the reconciler adopts surviving orphans on
its periodic pass ([`services.md` §5](services.md)).
## Stop
```mermaid
sequenceDiagram
participant Caller as Lobby / GM / Admin
participant Service as stopruntime
participant Lease as Redis lease
participant PG as Postgres
participant Docker
participant Results as runtime:job_results
Caller->>Service: stop(game_id, reason)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: SELECT runtime_records WHERE game_id
alt status in {stopped, removed}
Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
Service-->>Caller: success / replay_no_op
else status = running
Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
Docker-->>Service: ok
Service->>PG: UpdateStatus running→stopped (CAS by container_id)
Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
Service-->>Caller: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
```
Lobby callers receive the outcome through `runtime:job_results`; REST
callers receive an HTTP `200`. The `reason` enum
(`orphan_cleanup | cancelled | finished | admin_request | timeout`)
is recorded in `operation_log` and is otherwise opaque to the stop
service — RTM does not branch on the reason in v1
([`services.md` §15, §17](services.md)).
## Restart
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as restartruntime
participant Stop as stopruntime.Run
participant Start as startruntime.Run
participant Docker
participant PG as Postgres
Admin->>Service: POST /restart
Service->>PG: SELECT runtime_records WHERE game_id
Note over Service: capture current image_ref
Service->>Service: acquire per-game lease (held across both inner ops)
Service->>Stop: Run(game_id) [lease bypass]
Stop->>Docker: ContainerStop
Stop->>PG: UpdateStatus running→stopped
Service->>Docker: ContainerRemove
Service->>Start: Run(game_id, image_ref) [lease bypass]
Start->>Docker: PullImage / Run
Start->>PG: Upsert runtime_records (status=running)
Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
Service-->>Admin: 200 {runtime_record}
Service->>Service: release lease
```
The lease is acquired by `restartruntime` and held across both inner
operations; `stopruntime.Run` and `startruntime.Run` are
lease-bypass entry points that skip the inner lease acquisition
([`services.md` §12](services.md)). The single `operation_log` row
uses `Input.SourceRef` as a correlation id linking the implicit stop
and start entries ([`services.md` §13](services.md)).
## Patch
```mermaid
sequenceDiagram
participant Admin as GM / Admin
participant Service as patchruntime
participant Restart as restartruntime.Run
Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
Service->>Service: parse new image_ref + current image_ref
alt either ref not semver
Service-->>Admin: 422 image_ref_not_semver
else major or minor differ
Service-->>Admin: 422 semver_patch_only
else major.minor match, patch differs (or equal)
Service->>Restart: Run(game_id, new_image_ref)
Restart-->>Service: Result
Service-->>Admin: 200 {runtime_record}
end
```
The semver gate uses the tag fragment of the Docker reference; the
extraction strategy is recorded in [`services.md` §14](services.md).
The restart delegate already owns the lease, the inner stop/start,
the operation log, and the `runtime:health_events container_started`
emission ([`workers.md` §1](workers.md)).
## Cleanup TTL
```mermaid
sequenceDiagram
participant Worker as containercleanup worker
participant PG as Postgres
participant Service as cleanupcontainer
participant Lease as Redis lease
participant Docker
loop every RTMANAGER_CLEANUP_INTERVAL
Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
loop per game
Worker->>Service: cleanup(game_id, op_source=auto_ttl)
Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Service->>PG: re-read runtime_records WHERE game_id
alt status = running
Service-->>Worker: refused / conflict
else status in {stopped, removed}
Service->>Docker: ContainerRemove(container_id)
Service->>PG: UpdateStatus stopped→removed (CAS)
Service->>PG: INSERT operation_log (op_kind=cleanup_container)
Service-->>Worker: success
end
Service->>Lease: DEL rtmanager:game_lease:{game_id}
end
end
```
Admin-driven cleanup follows the same path through
`DELETE /api/v1/internal/runtimes/{game_id}/container` with
`op_source=admin_rest` instead of `auto_ttl`. The host state directory
is **never** removed by this flow
([`../README.md` §Cleanup](../README.md#cleanup),
[`services.md` §17](services.md),
[`workers.md` §19](workers.md)).
## Reconcile drift adopt
```mermaid
sequenceDiagram
participant Reconciler as reconcile worker
participant Docker
participant PG as Postgres
participant Lease as Redis lease
Note over Reconciler: read pass (lockless)
Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
Reconciler->>PG: ListByStatus(running)
Note over Reconciler: write pass (per-game lease)
loop per Docker container without matching record
Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
Reconciler->>PG: re-read runtime_records WHERE game_id
alt record now exists
Reconciler-->>Reconciler: skip (state changed since read pass)
else record still missing
Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
end
Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
end
```
The reconciler **never** stops or removes an unrecorded container —
operators may have started one manually for diagnostics. The
`reconcile_dispose` and `observed_exited` paths follow the same
read-pass / write-pass split, with `dispose` updating the orphaned
record to `removed` and emitting `container_disappeared`, and
`observed_exited` updating to `stopped` and emitting `container_exited`
([`../README.md` §Reconciliation](../README.md#reconciliation),
[`workers.md` §14–§16](workers.md)).
## Health probe hysteresis
```mermaid
sequenceDiagram
participant Worker as healthprobe worker
participant State as in-memory probe state
participant Engine as galaxy-game-{id}:8080
participant Health as runtime:health_events
loop every RTMANAGER_PROBE_INTERVAL
Worker->>Worker: ListByStatus(running)
Worker->>State: prune entries for games no longer running
loop per game (semaphore cap = 16)
Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
alt success
State->>State: consecutiveFailures = 0
opt failurePublished was true
Worker->>Health: XADD probe_recovered {prior_failure_count}
State->>State: failurePublished = false
end
else failure
State->>State: consecutiveFailures++
opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
State->>State: failurePublished = true
end
end
end
end
```
Hysteresis prevents a single transient failure from emitting a
`probe_failed` event, and prevents repeated emission while the failure
persists. State is non-persistent: a process restart re-establishes
the counters from scratch; a game's state is pruned when it transitions
out of the running list ([`workers.md` §5–§6](workers.md)).
@@ -0,0 +1,163 @@
# Service-Local Integration Suite
This document explains the design of the service-local integration
suite under [`../integration/`](../integration). The current-state
behaviour (harness layout, env knobs, scenario coverage) lives next
to the files themselves; this document records the rationale.
The cross-service Lobby↔RTM suite at
[`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows
different rules (it lives in the top-level `galaxy/integration`
module) and is documented inside that package.
## 1. Build tag `integration`
The scenarios under [`../integration/*_test.go`](../integration) are
guarded by `//go:build integration`. The default `go test ./...`
invocation skips them, while `go test -tags=integration
./integration/...` (and the `make integration` target) runs the full
set:
```sh
make -C rtmanager integration
```
The harness package itself ([`../integration/harness`](../integration/harness))
has no build tag. It compiles on every run because each helper guards
its Docker-dependent paths with `t.Skip` when the daemon is
unavailable. This keeps the harness loadable from a tagless `go vet`
or IDE workflow without dragging Docker into the default `go test`
critical path.
## 2. Smoke test runs in the default `go test` pass
[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go)
runs in the regular `go test ./...` pass and falls back on
`skipUnlessDockerAvailable` when no Docker socket is present. The
smoke test is intentionally kept separate from the new `integration/`
suite because it exercises the production adapter shape (one
container at a time against `alpine:3.21`), not the full runtime;
both surfaces are useful.
## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess
The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg,
logger)` directly rather than spawning the binary from
`cmd/rtmanager/main.go`:
- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()`
the runtime context and call `runtime.Close()`; the goroutine
driving `runtime.Run` returns with `context.Canceled` and the
helper waits on it via the `runDone` channel. With a subprocess the
equivalent dance requires SIGTERM, output capture, and graceful
shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state
directly through the harness-owned pool and read every Redis stream
through the harness-owned client. Both observe the exact wire shape
Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the
default test output stays focused on assertions; flipping
`EnvOptions.LogToStderr` lights up the runtime's structured logs
for local debugging without requiring any subprocess plumbing.
The cross-service inter-process suite at `integration/lobbyrtm/`
re-uses the existing `integration/internal/harness` binary-spawn
helpers; the in-process choice here is specific to the service-local
scope.
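A hedged sketch of the lifecycle described above. The `app.NewRuntime`
return shape, the helper name, and the logger wiring are assumptions;
only the cancel, `runDone`, then `Close()` ordering is the point:
```go
package harness

import (
	"context"
	"errors"
	"log/slog"
	"testing"
)

// runtimeLike is an illustrative stand-in for whatever app.NewRuntime returns.
type runtimeLike interface {
	Run(ctx context.Context) error
	Close() error
}

// startInProcess runs the runtime on a goroutine and registers deterministic
// cleanup: cancel the context, wait for Run to return, then Close.
func startInProcess(t *testing.T, newRuntime func(context.Context, *slog.Logger) (runtimeLike, error)) runtimeLike {
	t.Helper()
	ctx, cancel := context.WithCancel(context.Background())
	rt, err := newRuntime(ctx, slog.New(slog.DiscardHandler))
	if err != nil {
		cancel()
		t.Fatalf("new runtime: %v", err)
	}
	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()
	t.Cleanup(func() {
		cancel()
		if err := <-runDone; err != nil && !errors.Is(err, context.Canceled) {
			t.Errorf("runtime exited: %v", err)
		}
		if err := rt.Close(); err != nil {
			t.Errorf("close runtime: %v", err)
		}
	})
	return rt
}
```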
## 4. `httptest.Server` stub for the Lobby internal client
Runtime Manager configuration requires a non-empty
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
as a no-op (the start envelope already carries the only required
field, `image_ref`; rationale in [`services.md`](services.md) §7).
The harness therefore stands up a tiny `httptest.Server` per test
that returns a stable `200 OK` response. The stub is intentionally
unconfigurable: every integration scenario produces the same
ancillary fetch, and adding routing/error injection would invite
test code to depend on a contract the start service deliberately
ignores.
## 5. One built engine image, two semver-compatible tags
The patch lifecycle expects the new and current image refs to share
the same major / minor version (`semver_patch_only` failure
otherwise). Building two distinct images would multiply the per-run
build cost without changing what the test verifies — the patch path
exercises `image_ref_not_semver` and `semver_patch_only` validation
plus the recreate-with-new-tag flow, none of which depend on
distinct image *content*. The harness builds the engine once and
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
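A hedged sketch of the tag-aliasing step; the tag strings are the ones
above, while the helper name is illustrative and the build call and its
error handling are elided:
```go
package harness

import (
	"context"

	"github.com/docker/docker/client"
)

// tagEngineImage aliases the freshly built engine image under the second
// semver-compatible tag; both tags point at the same digest.
func tagEngineImage(ctx context.Context, cli *client.Client) error {
	return cli.ImageTag(ctx, "galaxy/game:1.0.0-rtm-it", "galaxy/game:1.0.1-rtm-it")
}
```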
The integration tags use the `*-rtm-it` suffix (rather than plain
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
accidentally consume a hand-built dev image, and so a `docker image
rm` of integration leftovers does not nuke a production-shaped tag.
## 6. Per-test Docker network and per-test state root
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
and registers cleanup; `t.ArtifactDir()` provides the per-game state
root. Both ensure that two scenarios running back-to-back cannot
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
filesystem state. Game ids are themselves unique per test
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
the per-test network and state root, the suite is safe to run with
`-count` greater than one.
`t.ArtifactDir()` keeps the engine state directory around when a
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
failure and inspect what the engine wrote. On success the directory
is automatically cleaned up.
## 7. PostgreSQL and Redis containers shared per-package
Both fixtures use `sync.Once` to start one testcontainer per test
package, mirroring the
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
pattern. `TruncatePostgres` and `FlushRedis` reset state between
tests so each scenario starts on an empty stack. The trade-off versus
per-test containers is the standard one: container startup dominates
the per-package latency, so amortising it across the suite keeps the
loop tight while the truncate/flush ensures isolation. The ~12 s
difference matters in CI.
## 8. Engine image cache is intentionally retained between runs
`buildAndTagEngineImage` runs once per package via `sync.Once` and
leaves both image tags in the local Docker cache after the suite
exits. The cache is a substantial speed-up on a developer laptop
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
hot), and a stale image is unlikely because the tags carry the
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
with multiple test runs. Operators who suspect a stale image can
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
the next run rebuilds.
## 9. Scenario coverage
The suite covers the four end-to-end flows operators care about:
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
restart → patch → stop → cleanup. The intermediate `stop` between
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
to remove a running container per
[`../README.md` §Cleanup](../README.md#cleanup).
- **replay** (`replay_test.go`) — duplicate start / stop entries
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
- **health** (`health_test.go`) — external `docker rm` produces
`container_disappeared`; manual `docker run` is adopted by the
reconciler.
- **notification** (`notification_test.go`) — unresolvable `image_ref`
produces `runtime.image_pull_failed` plus a `failure` job_result.
## 10. Service-local scope only
This suite runs Runtime Manager against a real Docker daemon plus
testcontainers PG / Redis but **does not** include any other Galaxy
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
in the top-level `galaxy/integration/` module, where the harness
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.
@@ -0,0 +1,531 @@
# PostgreSQL Schema Decisions
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
no Redis-only predecessor and no migration window. This document
records the schema decisions and the non-obvious agreements behind
them, mirroring the shape of
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
and serving the same role: a single coherent reference for "why does
the persistence layer look this way".
Use this document together with the migration script
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
and the runtime wiring
[`../internal/app/runtime.go`](../internal/app/runtime.go).
## Outcomes
- Schema `rtmanager` (provisioned externally) holds the durable
service state across three tables: `runtime_records`,
`operation_log`, `health_snapshots`. The three tables map onto the
three runtime concerns documented in
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
current state per game, audit trail per operation, and latest
technical health observation per game.
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
applies embedded goose migrations strictly before any HTTP listener
becomes ready, and exits non-zero when migration or ping fails.
A start whose migrations are already applied exits zero — the
`pkg/postgres`-supplied migrator treats "no work to do" as success.
- The runtime opens one shared `*redis.Client` via
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
store, the per-game lease store, the consumer pipelines, and every
publisher (`runtime:job_results`, `runtime:health_events`,
`notification:intents`).
- The Redis adapter package
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
owns one shared `Keyspace` struct with the
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
for stream offsets and the per-game lease.
- Generated jet code under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
is committed; `make -C rtmanager jet` regenerates it via the
testcontainers-driven `cmd/jetgen` pipeline.
- Configuration uses the `RTMANAGER_` prefix for every variable.
The schema-per-service rule from
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
applies: each service's role is grant-restricted to its own
schema; RTM never touches Lobby's `lobby` schema or vice versa.
## Decisions
### 1. One schema, externally-provisioned `rtmanagerservice` role
**Decision.** The `rtmanager` schema and the matching
`rtmanagerservice` role are created outside the migration sequence
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
and by the integration harness; in production, by an ops init script
not in scope for any service stage). The embedded migration
`00001_init.sql` only contains DDL for the service-owned tables and
indexes and assumes it runs as the schema owner with
`search_path=rtmanager`.
**Why.** Mixing role creation, schema creation, and table DDL into
one script forces every consumer of the migration to run as a
superuser. The schema-per-service architectural rule
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
operational split: ops provisions roles and schemas, the service
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
from its runtime role would relax the
"each service's role grants are restricted to its own schema"
defense-in-depth rule.
### 2. `runtime_records.game_id` is the natural primary key
**Decision.** `runtime_records` uses
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
column carries a CHECK constraint enforcing the
`running | stopped | removed` enum.
```sql
CREATE TABLE runtime_records (
game_id text PRIMARY KEY,
status text NOT NULL,
-- ...
CONSTRAINT runtime_records_status_chk
CHECK (status IN ('running', 'stopped', 'removed'))
);
```
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
RTM stores at most one record per game ever. A surrogate
`bigserial` would force every cross-service join to translate
through a lookup table; the natural key keeps RTM's persistence
layer pin-compatible with the streams contract (every
`runtime:start_jobs` envelope already names the `game_id`). The
status CHECK reproduces the Go-level enum from
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
as a defense-in-depth gate at the storage boundary. Decision context:
[`domain-and-ports.md`](domain-and-ports.md).
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
**Decision.** `runtime_records_status_last_op_idx` is a composite
index on `(status, last_op_at)`. The container cleanup worker scans
`status='stopped' AND last_op_at < cutoff`; the
`runtimerecordstore.ListByStatus` adapter method orders rows
`last_op_at DESC, game_id ASC`.
```sql
CREATE INDEX runtime_records_status_last_op_idx
ON runtime_records (status, last_op_at);
```
**Why.** Both read shapes share the same composite. The cleanup
worker drives the index from one direction (range scan on
`last_op_at` filtered by status); `ListByStatus` drives it from the
other (equality on status, sorted by `last_op_at`). PostgreSQL
satisfies both shapes through one index scan once the planner picks
the index for the WHERE clause. The secondary `game_id ASC` tiebreak
in the adapter ORDER BY is satisfied by primary-key ordering after
the index returns the rows.
A second supporting index for the cleanup worker was considered and
rejected: the workload is so small (single-instance v1, bounded
running game count) that one composite is dominantly cheaper than
two narrow ones.
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
and is written exclusively through INSERT — there is no UPDATE
pathway, no soft-delete column, and no foreign key to
`runtime_records`. The audit index
`operation_log_game_started_idx (game_id, started_at DESC)` drives
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
results `started_at DESC, id DESC` and applies `LIMIT $2`.
```sql
CREATE INDEX operation_log_game_started_idx
ON operation_log (game_id, started_at DESC);
```
**Why.** The audit's correctness invariant is "every operation RTM
performed gets exactly one row"; CASCADE deletes from
`runtime_records` would silently lose history when an admin removes
a runtime and would break the
[`../README.md` §Persistence Layout](../README.md) commitment. The
secondary `id DESC` tiebreak inside the adapter is necessary because
the audit log can write multiple rows in the same millisecond when
`reconcile_adopt` and a real operation interleave on a single tick;
without the tiebreak the test that asserts insertion-order-stable
reads becomes flaky. A non-positive `limit` is rejected before the
SQL is issued; an empty result set returns as `nil` (matching the
lobby pattern, so service-layer callers can do `len(entries) == 0`
without an extra allocation).
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
**Decision.** `operation_log` reproduces the three Go-level enums
as CHECK constraints:
```sql
CONSTRAINT operation_log_op_kind_chk
CHECK (op_kind IN (
'start', 'stop', 'restart', 'patch',
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
)),
CONSTRAINT operation_log_op_source_chk
CHECK (op_source IN (
'lobby_stream', 'gm_rest', 'admin_rest',
'auto_ttl', 'auto_reconcile'
)),
CONSTRAINT operation_log_outcome_chk
CHECK (outcome IN ('success', 'failure'))
```
The Go-level enums in
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
remain the source of truth.
**Why.** A defence-in-depth gate at the storage boundary catches any
adapter regression that would otherwise persist an unexpected
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
benefit from the enum being verifiable directly in psql without
consulting the Go source. Adding a new value requires editing two
places (the Go enum and the migration), which is the right friction
level: every new value is a wire-protocol change and deserves an
explicit migration. The alternative of using PostgreSQL's `CREATE
TYPE … AS ENUM` was rejected because adding a value to a PG enum
type requires `ALTER TYPE` outside a transaction and complicates the
single-init pre-launch policy (decision §12).
### 6. `health_snapshots` is one row per game; status enum collapses event types
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
and stores the latest technical health observation per game. The
`status` column enumerates the **observed engine state**, not the
**triggering event type**:
```sql
CONSTRAINT health_snapshots_status_chk
CHECK (status IN (
'healthy', 'probe_failed', 'exited',
'oom', 'inspect_unhealthy', 'container_disappeared'
))
```
The `runtime:health_events` `event_type` enum has seven values
(`container_started`, `container_exited`, `container_oom`,
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
`probe_recovered`). The snapshot status has six — the two probe
events fold into `healthy` (after `probe_recovered`) and
`probe_failed`, and `container_started` collapses into `healthy`.
**Why.** Health snapshots answer "what state is the engine in
**right now**", not "what event was just emitted". A consumer who
wants the event firehose reads `runtime:health_events`; a consumer
who wants the latest verdict reads `health_snapshots`. The two
surfaces have different lifetimes (stream entries are bounded only
by Redis trim; snapshot rows are overwritten on every new
observation), so collapsing the seven event types into six status
states aligns the column with the consumer's mental model. The
adapter that implements this collapse lives in
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
every emission to the stream also upserts the snapshot.
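The collapse is small enough to show as a sketch (the status strings
follow the CHECK constraint above; the function name is illustrative,
not the adapter's):
```go
// snapshotStatus maps the seven health-event types onto the six
// snapshot statuses described above.
func snapshotStatus(eventType string) string {
	switch eventType {
	case "container_started", "probe_recovered":
		return "healthy"
	case "container_exited":
		return "exited"
	case "container_oom":
		return "oom"
	default:
		// probe_failed, inspect_unhealthy, container_disappeared map 1:1.
		return eventType
	}
}
```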
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
guard into a single `WHERE … AND …` clause. Status must equal the
caller's `ExpectedFrom`; when the caller supplies a non-empty
`ExpectedContainerID`, `current_container_id` must equal it as
well:
```sql
UPDATE rtmanager.runtime_records
SET status = $1, last_op_at = $2, ...
WHERE game_id = $3
AND status = $4
[AND current_container_id = $5]
```
A `RowsAffected() == 0` result is ambiguous — the row may be absent
or the predicate may have failed. The adapter resolves the ambiguity
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
The probe runs only on the slow path; happy-path UPDATEs cost a
single round trip.
**Why.** The two-axis CAS is what services need: a stop driven by an
old container_id (from a stale REST request) must not clobber a
fresh `running` record installed by a concurrent restart. Status-only
CAS would collapse those two cases. The optional shape on
`ExpectedContainerID` lets reconciliation flows that legitimately
target "this game in `running` state without caring which container"
omit the second predicate. The follow-up probe matches the
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
and produces clean per-error sentinels at the service layer.
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
eight goroutines racing the same transition: exactly one returns
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
because PostgreSQL serialises row-level UPDATEs through the row's
MVCC tuple.
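A sketch of the slow-path probe (plain `database/sql`; the local
sentinels stand in for `runtime.ErrNotFound` / `runtime.ErrConflict`):
```go
package example

import (
	"context"
	"database/sql"
	"errors"
)

var (
	errNotFound = errors.New("runtime record not found") // stands in for runtime.ErrNotFound
	errConflict = errors.New("runtime record conflict")  // stands in for runtime.ErrConflict
)

// resolveCASMiss runs only when the CAS UPDATE reported zero affected rows.
func resolveCASMiss(ctx context.Context, db *sql.DB, gameID string) error {
	var current string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		gameID).Scan(&current)
	if errors.Is(err, sql.ErrNoRows) {
		return errNotFound // row absent
	}
	if err != nil {
		return err
	}
	return errConflict // row present, CAS predicate failed
}
```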
### 8. Destination-driven `SET` clause on `UpdateStatus`
**Decision.** `UpdateStatus` updates a different column subset
depending on the destination status:
| Destination | Columns set |
| --- | --- |
| `stopped` | `status`, `last_op_at`, `stopped_at` |
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
| `running` | `status`, `last_op_at` |
The implementation switches on `input.To` and writes the UPDATE
chain inline per branch — three short branches read better than one
parametric helper.
**Why.** Each destination has a different invariant. `stopped`
records the wall-clock at which the engine ceased serving; `removed`
nulls the container_id because the row no longer points at any
Docker resource; `running` only updates the status and the
last-op timestamp because the running invariants
(`current_container_id`, fresh `started_at`, `current_image_ref`,
`engine_endpoint`) are installed through `Upsert` on the `start`
path.
A previous draft built the SET list via `[]pg.Column` / `[]any`
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
variadic refuses a `[]postgres.Column` slice spread because the
element type does not match `jet.Column` after the type-alias
resolution. The final code switches inline per branch.
The `running` destination is implemented even though the start
service uses `Upsert` for the inner start of restart and patch.
Keeping the `running` path live preserves a one-to-one match between
`runtime.AllowedTransitions()` and the adapter's capability matrix —
otherwise a future caller exercising the `stopped → running`
transition through `UpdateStatus` would hit a runtime error inside
the adapter rather than a domain rejection. The path only updates
`status` and `last_op_at`; callers responsible for the running
invariants install them through `Upsert` first.
### 9. `created_at` preservation on `Upsert`
**Decision.** `runtimerecordstore.Upsert` is implemented as
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
column from EXCLUDED>`. `created_at` is deliberately omitted from
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
value never overwrites the stored timestamp.
```sql
INSERT INTO rtmanager.runtime_records (...)
VALUES (...)
ON CONFLICT (game_id) DO UPDATE
SET status = EXCLUDED.status,
current_container_id = EXCLUDED.current_container_id,
current_image_ref = EXCLUDED.current_image_ref,
engine_endpoint = EXCLUDED.engine_endpoint,
state_path = EXCLUDED.state_path,
docker_network = EXCLUDED.docker_network,
started_at = EXCLUDED.started_at,
stopped_at = EXCLUDED.stopped_at,
removed_at = EXCLUDED.removed_at,
last_op_at = EXCLUDED.last_op_at
-- created_at intentionally NOT updated
```
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
invariant.
**Why.** `runtime_records.created_at` records "first time RTM saw
the game". Every restart and every reconcile_adopt re-Upserts the
row with the current wall-clock as `CreatedAt` from the adapter
boundary; without the omission rule the timestamp would drift
forward. Preserving the original creation time keeps a stable
horizon for retention reasoning and matches
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
same approach for the `games.created_at` column.
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
'{}'::jsonb`. The jet-generated model declares
`Details string` (jet maps `jsonb` to `string`). The adapter:
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
`snapshot.Details` is empty, so the column never holds a non-JSON
empty string;
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
`json.RawMessage` so the caller receives verbatim bytes without
an extra round of parsing.
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
`TestUpsertAndGetRoundTrip` cover the two cases.
**Why.** The detail payload is type-specific (the keys differ
between `probe_failed` and `inspect_unhealthy`) and is opaque to
queries — the column is never element-filtered. JSONB matches the
"everything outside primary fields is JSON" pattern that the
Notification Service already established and allows a future
GIN index (e.g. for an admin search-by-key feature) without a
schema rewrite. Substituting the SQL DEFAULT for an empty
parameter avoids the trap where the database accepts `''` for
`text` but rejects it for `jsonb`.
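A sketch of the two boundary translations, assuming the domain
snapshot carries `Details json.RawMessage` (helper names are
illustrative):
```go
package example

import "encoding/json"

// detailsParam substitutes the SQL DEFAULT when the payload is empty,
// so the jsonb column never receives a non-JSON empty string.
func detailsParam(details json.RawMessage) []byte {
	if len(details) == 0 {
		return []byte(`{}`)
	}
	return details
}

// scanDetails hands the caller the stored bytes verbatim, without an
// extra parse round-trip.
func scanDetails(raw []byte) json.RawMessage {
	return json.RawMessage(raw)
}
```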
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
**Decision.** Every time-valued column on every RTM table uses
PostgreSQL's `timestamptz`. The domain model continues to use
`time.Time`; the adapter normalises every `time.Time` parameter to
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
helper that wraps a possibly-zero `time.Time`), and re-wraps every
scanned `time.Time` with `.UTC()` (directly or via
`timeFromNullable` for nullable columns) before the value leaves
the adapter.
The architecture-wide form of this rule lives in
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model
needs a `time.Time` API the service layer can compare and do
arithmetic on.
Without explicit `.UTC()` on the bind site, the pgx driver returns
scanned values in `time.Local`, which silently breaks equality
tests, JSON formatting, and comparison against pointer fields
elsewhere in the codebase. The defensive `.UTC()` rule on both
sides eliminates the class of bug where a timezone difference
between the adapter and the test harness flips assertions
intermittently.
The same shape is used in User Service, Mail Service, and
Notification Service — RTM matches the existing convention rather
than introducing a fourth encoding path.
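Illustrative helpers of the same shape (the adapter's real
`nullableTime` / `timeFromNullable` may differ in signature):
```go
package example

import (
	"database/sql"
	"time"
)

// nullableTime binds a possibly-zero time.Time as NULL, normalised to UTC.
func nullableTime(t time.Time) sql.NullTime {
	if t.IsZero() {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true}
}

// timeFromNullable re-wraps a scanned nullable column in UTC before
// the value leaves the adapter.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC()
}
```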
### 12. Single-init pre-launch policy
**Decision.** `00001_init.sql` evolves in place until first
production deploy. Adding a column, an index, or a new table during
the pre-launch development window edits this file directly rather
than producing `00002_*.sql`. The runtime applies the migration on
every boot; if the schema is already at head, `pkg/postgres`'s
goose adapter exits zero.
**Why.** The schema-per-service architectural rule
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
endorses a single-init policy for pre-launch services. The
pre-launch window allows non-additive changes (column rename, type
narrowing, CHECK tightening) that a multi-step migration sequence
would force into awkward two-step rewrites. Once the service ships
to production, the next schema change becomes `00002_*.sql` and
the policy lifts; from that point onward edits to `00001_init.sql`
are rejected by code review.
This applies to RTM exactly the same way it applies to every other
PG-backed service in the workspace; the README explicitly carries
the reminder. The exit-zero behaviour for already-applied
migrations is what makes the policy operationally cheap: a
freshly-spawned replica re-applies the same `00001_init.sql` with
no work to do, no logged error, and proceeds to open its
listeners.
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
**Decision.** All three RTM PG-store packages
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
build SQL through the jet builder API
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
`pg.AND/OR/SET/COALESCE/...` DSL).
Generated table models live under
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
and are regenerated by `make -C rtmanager jet`. The target invokes
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
transient PostgreSQL container via testcontainers, provisions the
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
against the provisioned schema. Generated code is committed to the
repo, so build consumers do not need Docker.
Statements are run through the `database/sql` API
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
helpers preserve the codecs.go boundary translations and
domain-type mapping (status enum decoding, `time.Time` UTC
normalisation, JSONB `[]byte` → `json.RawMessage`).
PostgreSQL constructs that the jet builder does not cover natively
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
through the per-DSL helpers (`pg.COALESCE`, `pg.LOWER`, direct
`[]byte`/string params for JSONB columns).
**Why.** Aligns with the workspace-wide convention from
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
service. Hand-rolled SQL would multiply boundary-translation paths
and require per-store query-builder helpers for what jet already
covers. Committing generated code keeps `go build ./...` working
without Docker.
### 14. `redisstate` keyspace ownership and per-store subpackages
**Decision.** The
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
package owns one shared `Keyspace` struct with a
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
lives in its own subpackage:
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
for the stream offset store consumed by the start-jobs and
stop-jobs consumers;
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
for the per-game lease store consumed by every lifecycle service
and the reconciler.
Both subpackages take a `redisstate.Keyspace{}` value and use it to
build their key shapes (`rtmanager:stream_offsets:{label}`,
`rtmanager:game_lease:{game_id}`).
**Why.** Keeping the parent package as the single owner of the prefix
and the key-shape builder mirrors the way Lobby's `redisstate`
namespace centralises every key shape and supports multiple Redis-
backed adapters (stream offsets, the per-game lease) without a
restructure as the surface grows.
The per-store subpackage choice (rather than Lobby's flat
single-package shape) is driven by three considerations:
- It keeps the docker mock generator scoped to one package, since
`mockgen` regenerates per-directory.
- It allows finer-grained dependency selection: `miniredis` is a
dev-only dep, and keeping the `streamoffsets` package
self-contained leaves room for `gamelease` to depend only on the
production `redis` client.
- Each subpackage carries its own tests, which keeps the test
surface focused on one Redis primitive rather than mixing offset
semantics with lease semantics in shared fixtures.
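A hypothetical sketch of the shared key-shape builder (field and
method names are illustrative; the lease store may additionally
encode the id, see [`services.md` §1](services.md)):
```go
package redisstate

// Keyspace is the single owner of the rtmanager: prefix (sketch only).
type Keyspace struct {
	Prefix string // empty means the default "rtmanager:"
}

func (k Keyspace) prefix() string {
	if k.Prefix == "" {
		return "rtmanager:"
	}
	return k.Prefix
}

// StreamOffsetKey builds rtmanager:stream_offsets:{label}.
func (k Keyspace) StreamOffsetKey(label string) string {
	return k.prefix() + "stream_offsets:" + label
}

// GameLeaseKey builds rtmanager:game_lease:{game_id}.
func (k Keyspace) GameLeaseKey(gameID string) string {
	return k.prefix() + "game_lease:" + gameID
}
```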
## Cross-References
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
— the embedded schema migration.
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
— the `//go:embed *.sql` directive and `FS()` exporter consumed by
the runtime.
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
— the three jet-backed PG adapters and their testcontainers-driven
unit suites.
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
— committed generated jet models.
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
[`../Makefile`](../Makefile) `jet` target — the regeneration
pipeline.
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
— Redis adapter package layout.
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
— runtime wiring: PG pool open + migration apply + Redis client
open + adapter assembly.
- [`../internal/config/`](../internal/config) — the config groups
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
`Coordination`).
- Companion design rationales:
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
domain shape, [`adapters.md`](adapters.md) for the redisstate
publishers and clients.
+368
View File
@@ -0,0 +1,368 @@
# Operator Runbook
This runbook covers the checks that matter most during startup,
steady-state readiness, shutdown, and the handful of recovery paths
specific to Runtime Manager.
## Startup Checks
Before starting the process, confirm:
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
reaches a Docker daemon the operator controls. RTM is the only
Galaxy service permitted to interact with the Docker socket;
scoping the daemon to RTM-only callers is operator domain.
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
user-defined bridge network that has already been created (e.g.
via `docker network create galaxy-net` in the environment's
bootstrap script). RTM **validates** the network at startup but
never creates it. A missing network is fail-fast and the process
exits non-zero before opening any listener.
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
can read and write. Per-game subdirectories are created with
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
uid/gid to match the engine container's user when running with a
non-root engine.
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
that hosts the `rtmanager` schema. The DSN must include
`search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
for production). Embedded goose migrations apply at startup before
any HTTP listener opens; a migration or ping failure terminates the
process with a non-zero exit. The `rtmanager` schema and the
matching `rtmanagerservice` role are provisioned externally
([`postgres-migration.md` §1](postgres-migration.md)).
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
the Redis deployment used for the runtime-coordination state:
stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
publishers (`runtime:job_results`, `runtime:health_events`,
`notification:intents`), persisted offsets, and the per-game
lease. RTM does not maintain durable business state on Redis.
- Stream names match the producers and consumers RTM integrates with:
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
HTTP listener. RTM's start service issues a diagnostic
`GET /api/v1/internal/games/{game_id}` per start; failure is logged
at debug and does not abort the start
([`services.md` §7](services.md)).
The startup sequence runs in the order recorded in
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):
1. PostgreSQL primary opens; goose migrations apply synchronously.
2. Redis master client opens and pings.
3. Docker daemon ping; configured network presence check.
4. Telemetry exporter (OTLP grpc/http or stdout).
5. Internal HTTP listener.
6. Reconciler runs **once synchronously** and blocks until done.
7. Background workers start.
A failure at any step is fatal. The synchronous reconciler pass is
the reason orphaned containers from a prior process never reach the
periodic workers in an inconsistent state
([`workers.md` §17](workers.md)).
Expected log lines on a healthy boot:
- `migrations applied`,
- `postgres ping ok`,
- `redis ping ok`,
- `docker ping ok` and `docker network found`,
- `telemetry exporter started`,
- `internal http listening`,
- `reconciler initial pass completed`,
- one `worker started` entry per background worker (seven expected).
## Readiness
Use the probes according to what they actually verify:
- `GET /healthz` confirms the listener is alive — no dependency
check.
- `GET /readyz` live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` when every check passes; otherwise
returns `503` with the canonical
`{"error":{"code":"service_unavailable","message":"…"}}` envelope
identifying the first failing dependency.
`/readyz` is the strongest readiness signal RTM exposes; unlike
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
Each request hits the daemon and the database fresh.
For a practical readiness check in production:
1. confirm the process emitted the listener and worker startup logs;
2. check `GET /healthz` and `GET /readyz`;
3. verify `rtmanager.runtime_records_by_status{status="running"}`
gauge tracks the expected live game count after the first start
completes;
4. verify `rtmanager.docker_op_latency` histograms have at least one
sample after the first lifecycle operation.
## Shutdown
The process handles `SIGINT` and `SIGTERM`.
Shutdown behaviour:
- the per-component shutdown budget is controlled by
`RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
- the internal HTTP listener drains in-flight requests before closing;
- stream consumers stop their `XREAD` loops and persist the latest
offset before returning; the offset survives the restart
([`workers.md` §9](workers.md));
- the Docker events listener cancels its subscription;
- the in-flight services release their per-game lease through the
surrounding context cancellation;
- the reconciler completes its current pass or aborts mid-write at
the next lease re-acquisition.
During planned restarts:
1. send `SIGTERM`;
2. wait for the listener and component-stop logs;
3. expect any consumer that was mid-cycle to retry from the persisted
offset on the next process start;
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
## Engine Container Died
A running engine container that exits unexpectedly surfaces through
three observation channels:
- The Docker events listener emits `container_exited` (non-zero exit
code) or `container_oom` (Docker action `oom`).
- The active probe worker eventually emits `probe_failed` once the
threshold is crossed.
- The Docker inspect worker may emit `inspect_unhealthy` if the
engine restarts under Docker's healthcheck or if Docker reports an
unexpected status.
Triage:
1. Inspect the `runtime:health_events` stream for the affected
`game_id` and `event_type`:
```bash
redis-cli XRANGE runtime:health_events - + COUNT 200 \
| grep -A4 'game_id\s*<game_id>'
```
2. Read the runtime record and the operation log:
```bash
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code, started_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC LIMIT 20"
```
3. If Lobby has not reacted (the game's status remains `running` in
`lobby.games`), check `runtime:job_results` lag and Lobby's
`runtimejobresult` worker. RTM publishes the result; Lobby is the
consumer.
4. If the container is already gone (`docker ps -a` shows no row for
`galaxy-game-<game_id>`), the reconciler will move the record to
`removed` on its next pass. Triggering the periodic reconcile
manually by sending `SIGHUP` is **not** supported — wait
`RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the
process; the synchronous boot pass will handle the drift.
5. The `notification:intents` stream is **not** the place to look
for ongoing health changes. Only the three first-touch start
failures (`runtime.image_pull_failed`,
`runtime.container_start_failed`,
`runtime.start_config_invalid`) produce a notification intent;
probe failures, OOMs, and exits flow through health events only
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
## Patch Upgrade
A patch upgrade replaces the container with a new `image_ref` while
preserving the bind-mounted state directory.
Pre-conditions:
- The new and current `image_ref` tags both parse as semver. RTM
rejects non-semver tags with `image_ref_not_semver`.
- The new and current major / minor versions match. A cross-major or
cross-minor patch returns `semver_patch_only`.
Driving the upgrade:
```bash
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
-d '{"image_ref": "galaxy/game:1.4.2"}'
```
Behaviour:
- The container is stopped, removed, and recreated. The
`current_container_id` changes; the `engine_endpoint`
(`http://galaxy-game-<game_id>:8080`) is stable.
- The engine reads its state from the bind mount on startup, so any
data written before the patch survives.
- A single `operation_log` row is appended with `op_kind=patch` and
the old / new image refs.
- A `runtime:health_events container_started` is emitted by the
inner start ([`workers.md` §1](workers.md)).
Post-patch verification:
```bash
curl -s http://galaxy-game-<game_id>:8080/healthz
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
```
The `current_image_ref` field on the runtime record reflects the new
tag.
## Manual Cleanup
The cleanup endpoint removes the container and updates the record to
`removed`. It refuses to remove a `running` container — stop first.
```bash
# Stop, then clean up
curl -s -X POST \
-H 'Content-Type: application/json' \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
-d '{"reason":"admin_request"}'
curl -s -X DELETE \
-H 'X-Galaxy-Caller: admin' \
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
```
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
is **never** deleted by RTM. Removing the directory is operator
domain (backup tooling, future Admin Service workflow). The
operation_log records `op_kind=cleanup_container` with
`op_source=admin_rest`.
## Reconcile Drift After Docker Daemon Restart
A Docker daemon restart drops every running engine container; PG
records remain. On RTM's next boot (or its next periodic reconcile):
1. The reconciler observes `running` records whose containers are
missing from `docker ps`. It updates each record to `removed`,
appends `operation_log` with `op_kind=reconcile_dispose`, and
publishes `runtime:health_events container_disappeared`
([`workers.md` §14–§15](workers.md)).
2. Lobby's `runtimejobresult` worker does not consume the dispose
event in v1, so the cascade does not auto-restart the engine.
Operators trigger restarts through Lobby's user-facing flow or
directly via the GM/Admin REST `restart` endpoint.
3. If the operator brings up an engine container manually for
diagnostics (`docker run` with the
`com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
the reconciler **adopts** it on the next pass: a new
`runtime_records` row appears with `op_kind=reconcile_adopt`.
The reconciler **never stops or removes** an unrecorded
container — operators stay in control of manual containers
([`../README.md` §Reconciliation](../README.md#reconciliation)).
Three drift kinds run through the same lease-guarded write pass:
`adopt`, `dispose`, and the README-level path
`observed_exited` (a record marked `running` whose container exists
but is in `exited`). Telemetry counter
`rtmanager.reconcile_drift{kind}` exposes the three independently
([`workers.md` §15](workers.md)).
## Testing Locally
```sh
# One-time bootstrap
docker network create galaxy-net
# Minimal env (see docs/examples.md for a complete .env)
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
export RTMANAGER_DOCKER_NETWORK=galaxy-net
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
export RTMANAGER_REDIS_PASSWORD=local
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
go run ./rtmanager/cmd/rtmanager
```
After start:
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
once PG, Redis, and Docker pings pass and the configured network
exists;
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
brings up `galaxy-game-<game_id>` containers; RTM logs each
lifecycle transition.
The integration suite under `rtmanager/integration/` exercises the
end-to-end flows against the real Docker daemon. The default
`go test ./...` skips it via the `integration` build tag; run
explicitly with:
```sh
make -C rtmanager integration
```
The suite requires a reachable Docker daemon. Without one, the
harness helpers call `t.Skip` and the package becomes a no-op
([`integration-tests.md` §1](integration-tests.md)).
## Diagnostic Queries
Durable runtime state lives in PostgreSQL; runtime-coordination state
stays in Redis. CLI snippets that help during incidents:
```bash
# Live runtime count by status (PostgreSQL)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
# Inspect a specific runtime record
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"
# Last 20 operations for a game (newest first)
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT id, op_kind, op_source, outcome, error_code,
started_at, finished_at
FROM rtmanager.operation_log
WHERE game_id = '<game_id>'
ORDER BY started_at DESC, id DESC
LIMIT 20"
# Latest health snapshot
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
"SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"
# Containers RTM owns (Docker)
docker ps --filter label=com.galaxy.owner=rtmanager \
--format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'
# Stream lag (Redis)
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs
# Recent health events (oldest first)
redis-cli XRANGE runtime:health_events - + COUNT 100
# Per-game lease (only present while an operation runs)
redis-cli GET rtmanager:game_lease:<game_id>
redis-cli TTL rtmanager:game_lease:<game_id>
```
The gauges and counters surfaced through OpenTelemetry are the
primary observability surface; raw PostgreSQL and Redis access is
for last-resort triage.
+309
View File
@@ -0,0 +1,309 @@
# Runtime and Components
The diagram below focuses on the deployed `galaxy/rtmanager` process
and its runtime dependencies. The current-state contract for every
listener, worker, and adapter lives in [`../README.md`](../README.md);
this document is the navigation aid that points at the right code path
and the right design-rationale record.
```mermaid
flowchart LR
subgraph Clients
GM["Game Master"]
Admin["Admin Service"]
Lobby["Game Lobby"]
end
subgraph RTM["Runtime Manager process"]
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
StartJobs["startjobsconsumer"]
StopJobs["stopjobsconsumer"]
DockerEvents["dockerevents listener"]
HealthProbe["healthprobe worker"]
DockerInspect["dockerinspect worker"]
Reconcile["reconcile worker"]
Cleanup["containercleanup worker"]
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
IntentPublisher["notification:intents publisher"]
ResultsPublisher["runtime:job_results publisher"]
HealthPublisher["runtime:health_events publisher"]
Telemetry["Logs, traces, metrics"]
end
Docker["Docker Daemon"]
Engine["galaxy-game-{game_id} container"]
Postgres["PostgreSQL\nschema rtmanager"]
Redis["Redis\nstreams + leases + offsets"]
LobbyHTTP["Lobby internal HTTP"]
Lobby -. runtime:start_jobs .-> StartJobs
Lobby -. runtime:stop_jobs .-> StopJobs
GM --> InternalHTTP
Admin --> InternalHTTP
StartJobs --> Services
StopJobs --> Services
InternalHTTP --> Services
Services --> Docker
Services --> Postgres
Services --> Redis
Services --> ResultsPublisher
Services --> HealthPublisher
Services --> IntentPublisher
Services -. GET diagnostic .-> LobbyHTTP
DockerEvents --> Docker
DockerInspect --> Docker
HealthProbe --> Engine
Reconcile --> Docker
Reconcile --> Postgres
Cleanup --> Postgres
Cleanup --> Services
DockerEvents --> HealthPublisher
DockerInspect --> HealthPublisher
HealthProbe --> HealthPublisher
HealthPublisher --> Redis
ResultsPublisher --> Redis
IntentPublisher --> Redis
StartJobs --> Redis
StopJobs --> Redis
InternalHTTP --> Postgres
Docker -->|create / start / stop / rm| Engine
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
InternalHTTP --> Telemetry
Services --> Telemetry
StartJobs --> Telemetry
StopJobs --> Telemetry
DockerEvents --> Telemetry
HealthProbe --> Telemetry
DockerInspect --> Telemetry
Reconcile --> Telemetry
Cleanup --> Telemetry
```
Notes:
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
goose migrations fail, when Redis ping fails, when the Docker daemon
ping fails, or when the configured Docker network is missing. Lobby
reachability is **not** verified at boot — the start service's
diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
outside of debug logging
([`services.md` §7](services.md)).
- The reconciler runs **synchronously** once on startup before
`app.App.Run` registers any other component, then re-runs
periodically as a regular `Component`. The synchronous pass is the
reason why orphaned containers from a prior process can never be
observed by the events listener with no PG record
([`workers.md` §17](workers.md)).
- A single internal HTTP listener exposes both probes
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
and Admin Service. There is no public listener — RTM does not face
end users.
## Listeners
| Listener | Default addr | Purpose |
| --- | --- | --- |
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
- read timeout: `5s`
- write timeout: `15s`
- idle timeout: `60s`
The listener is unauthenticated and assumes a trusted network segment.
The `X-Galaxy-Caller` request header carries an optional caller
identity (`gm` or `admin`) that the handler records as
`operation_log.op_source`
([`services.md` §18](services.md)).
Probe routes:
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
the listener is up.
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
Docker daemon, then asserts the configured Docker network exists.
Returns `{"status":"ready"}` only when every check passes; otherwise
returns `503` with the canonical error envelope.
## Background Workers
Every worker runs as an `app.Component` and is registered in the
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
| Worker | Source | Trigger | Function |
| --- | --- | --- | --- |
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
The events listener and the inspect worker do **not** emit
`container_started` — that event is owned by the start service
([`workers.md` §1](workers.md)). The events listener and the inspect
worker also do not emit `container_disappeared` autonomously when a
record is missing or stale; the conditional emission rules live in
[`workers.md` §2](workers.md) and [`§4`](workers.md).
## Lifecycle Services
The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.
| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |
All services share three behaviours captured in
[`services.md`](services.md):
- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
not by the caller — which keeps consumer and REST callers symmetric
([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
`ContainerID`, `EngineEndpoint`) is what consumers and REST
handlers translate into job_results / HTTP responses
([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
and three of the failure codes (`start_config_invalid`,
`image_pull_failed`, `container_start_failed`) also publish a
`runtime.*` admin notification intent
([`services.md` §4](services.md)).
## Synchronous Upstream Client
| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |
Lobby's outbound transport is the only synchronous client RTM holds.
Every other interaction (Notification Service, Game Master, Admin
Service) crosses an asynchronous boundary or is initiated by the peer.
## Stream Offsets
Each consumer persists its position under a fixed label so process
restart preserves stream progress.
| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).
The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume them
itself.
## Configuration Groups
The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:
- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
`RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
`RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
`RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
`RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
`RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
`RTMANAGER_CONTAINER_RETENTION_DAYS`,
`RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
`RTMANAGER_ENGINE_STATE_ENV_NAME`,
`RTMANAGER_GAME_STATE_DIR_MODE`,
`RTMANAGER_GAME_STATE_OWNER_UID`,
`RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
`RTMANAGER_POSTGRES_REPLICA_DSNS`,
`RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
`RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
`RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
`RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
`RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
`RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
`RTMANAGER_REDIS_STOP_JOBS_STREAM`,
`RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
`RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
`RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
`RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
`RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
`RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
`RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
`RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
`RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.
## Runtime Notes
- **Single-instance v1.** Multi-instance Runtime Manager with Redis
Streams consumer groups is explicitly out of scope for the current
iteration. The per-game lease serialises operations on one game
across the consumer + REST entry points; cross-instance
coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
`SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
(default `60s`). The lease is **not renewed mid-operation** in v1;
long pulls of multi-GB images can therefore expire the lease
before the operation finishes — the trade-off is documented in
[`services.md` §1](services.md). The reconciler honours the same
lease around every drift mutation
([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
reconcile mutation appends one row to `rtmanager.operation_log`.
The `runtime:health_events` stream and the `notification:intents`
emissions are best-effort — a publish failure logs at `Error` and
proceeds, never rolling back the recorded operation
([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
per-game `consecutiveFailures` and `failurePublished` counters in a
mutex-guarded map. State is non-persistent: a process restart that
loses the counters re-establishes hysteresis from scratch, and
state for a game that transitions through `stopped → running` is
pruned at the start of every probe tick
([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
`if_missing` (default), `always`, and `never`. Image labels
(`com.galaxy.cpu_quota`, `com.galaxy.memory`,
`com.galaxy.pids_limit`) drive resource limits when present; the
matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
label is absent or unparseable. Producers never pass limits.
- **State directory ownership.** RTM creates per-game state
directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
mode and uid/gid, but **never deletes them**. Removing the directory
is operator domain (backup tooling, a future Admin Service
workflow). A cleanup that removes the container leaves the
directory intact.
+443
View File
@@ -0,0 +1,443 @@
# Lifecycle Services
This document explains the design of the five lifecycle services
(`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) under [`../internal/service/`](../internal/service)
plus the per-handler REST glue under
[`../internal/api/internalhttp/`](../internal/api/internalhttp).
The current-state behaviour (lifecycle steps, failure tables, the
per-game lease semantics, the wire contracts) lives in
[`../README.md`](../README.md), the OpenAPI spec at
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml), and the
AsyncAPI spec at
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml).
This file records the *why*.
## 1. Per-game lease lives at the service layer
Every lifecycle service acquires `rtmanager:game_lease:{game_id}` via
[`ports.GameLeaseStore`](../internal/ports/gamelease.go) before doing
any work, and releases it on the way out:
- the lease primitive serialises operations on a single game across
every entry point (stream consumers and REST handlers);
- holding the lease at the service layer keeps the consumer / REST
callers symmetric — neither acquires the lease itself, both call
the service the same way;
- the Redis-backed adapter
([`../internal/adapters/redisstate/gamelease/store.go`](../internal/adapters/redisstate/gamelease/store.go))
uses `SET NX PX` on acquire, Lua compare-and-delete on release; a
release whose caller-supplied token no longer matches is a silent
no-op.
The lease key shape is `rtmanager:game_lease:{base64url(game_id)}` so
opaque game ids may contain any characters without leaking through
the key syntax.
The lease TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60s`)
and is **not renewed mid-operation** in v1. A multi-GB image pull can
theoretically expire the lease before the start service finishes;
operators see this as a `reconcile_adopt` event later because the
container is created with the standard owner labels. A renewal helper
is deliberately deferred until a workload makes it necessary.
The reconciler ([`workers.md`](workers.md) §4) honours the same lease
around every drift mutation, which closes the
restart-vs-`reconcile_dispose` race documented in §6 below.
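A minimal sketch of the acquire / release shape, assuming go-redis v9
(the production store in `redisstate/gamelease` may differ in
structure, and the id encoding is elided here):
```go
package gamelease

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Compare-and-delete: only the holder of the matching token releases.
const releaseScript = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`

// acquire returns false when another operation already holds the lease.
func acquire(ctx context.Context, rdb *redis.Client, key, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, key, token, ttl).Result()
}

// release is a silent no-op when the stored token no longer matches.
func release(ctx context.Context, rdb *redis.Client, key, token string) error {
	return rdb.Eval(ctx, releaseScript, []string{key}, token).Err()
}
```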
## 2. Health-events publisher lands with the start service
The start service publishes `container_started` after `docker run`
returns; the events listener intentionally does **not** duplicate
the event ([`workers.md`](workers.md) §1). Centralising the publisher
on the start service avoids a "who emits what" ambiguity and lets the
publisher be a thin port wrapper rather than a worker-specific
helper.
The publisher port lives next to the snapshot-upsert rule
([`adapters.md`](adapters.md) §8): one Publish call updates both
surfaces.
## 3. `Result`-shaped contract
`Service.Handle` returns `(Result, error)`. The Go-level `error` is
reserved for system-level / programmer faults (nil context, nil
service). All business outcomes flow through `Result`:
- `Outcome=success`, `ErrorCode=""` — fresh start succeeded;
- `Outcome=success`, `ErrorCode="replay_no_op"` — idempotent replay;
- `Outcome=failure`, `ErrorCode` set — business failure
(`start_config_invalid` / `image_pull_failed` /
`container_start_failed` / `conflict` / `service_unavailable` /
`internal_error`).
The stream consumer uses `Outcome` and `ErrorCode` to populate
`runtime:job_results` directly; the REST handler maps `Outcome=failure`
plus `ErrorCode` to the matching HTTP status. Both callers are simpler
with this contract than with an `errors.Is`-driven sentinel taxonomy.
`ports.JobResult` and the two `JobOutcome*` string constants live in
the ports package next to `JobResultPublisher` so the wire shape is
defined exactly once. The constants are intentionally not aliases of
`operation.Outcome` — the audit-log enum is allowed to grow without
breaking the wire format.
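An illustrative caller-side view of the contract (the `Record` field
is elided and the branching is a sketch, not the handler code):
```go
package example

// Result mirrors the canonical shape named above.
type Result struct {
	Outcome        string
	ErrorCode      string
	ContainerID    string
	EngineEndpoint string
}

// handleOutcome sketches how both callers branch: a Go error is a
// system-level fault, everything business-shaped rides on Result.
func handleOutcome(r Result, err error) string {
	if err != nil {
		return "internal fault: " + err.Error()
	}
	switch {
	case r.Outcome == "success" && r.ErrorCode == "":
		return "fresh operation succeeded"
	case r.Outcome == "success" && r.ErrorCode == "replay_no_op":
		return "idempotent replay"
	default:
		return "business failure: " + r.ErrorCode
	}
}
```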
## 4. Start service failure-mode mapping
| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | `start_config_invalid` | `runtime.start_config_invalid` |
| Lease busy | `conflict` | — |
| Existing record running with a different image_ref | `conflict` | — |
| Get returns a non-NotFound transport error | `internal_error` | — |
| `image_ref` shape rejected by `distribution/reference` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns `ErrNetworkMissing` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns any other error | `service_unavailable` | — |
| `PullImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `InspectImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `prepareStateDir` failure | `start_config_invalid` | `runtime.start_config_invalid` |
| `Run` failure | `container_start_failed` | `runtime.container_start_failed` |
| `Upsert` failure after successful Run | `container_start_failed` | `runtime.container_start_failed` |
Three error codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, and `internal_error` are operational classes
(another caller is in flight, a dependency is down, an unclassified
fault) where the corrective action is not a configuration change. The
operator already sees them through telemetry and structured logs; an
email per occurrence would be noise.
## 5. Upsert-after-Run rollback
A `Run` that succeeded but whose `Upsert` failed leaves a running
container with no PG record. The service issues a best-effort
`docker.Remove(containerID)` in a fresh `context.Background()` (the
request context may already be cancelled) before recording the failure.
A Remove failure is logged but not propagated; the reconciler adopts
surviving orphans on its periodic pass.
The Docker adapter already removes the container when `Run` itself
returns an error after a successful `ContainerCreate` ([`adapters.md`](adapters.md) §3).
The service-layer rollback covers the additional post-`Run` Upsert
failure path.
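A sketch of the rollback path, assuming a narrowed Docker port and
`log/slog`; the 30-second timeout is an assumption, not the service's
configured value:
```go
package example

import (
	"context"
	"log/slog"
	"time"
)

// containerRemover is a narrowed stand-in for the Docker port's Remove method.
type containerRemover interface {
	Remove(ctx context.Context, containerID string) error
}

// rollbackOrphan removes a container whose Upsert failed, best-effort,
// on a fresh context because the request context may be cancelled.
func rollbackOrphan(docker containerRemover, containerID string, log *slog.Logger) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := docker.Remove(ctx, containerID); err != nil {
		// Logged, never propagated: the reconciler adopts survivors later.
		log.Error("rollback remove failed", "container_id", containerID, "error", err)
	}
}
```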
## 6. Pre-existing record handling
Only `status=running` + same `image_ref` is a `replay_no_op`.
`running` + a different `image_ref` returns `failure / conflict` (use
`patch` to change the image of a running container).
Anything else (`stopped`, `removed`, missing record) proceeds with a
fresh start that ends in `Upsert`. `Upsert` overwrites verbatim and is
not bound by the transitions table, so installing a `running` record
over a `removed` row is permitted — the `removed` terminus rule lives
in `runtime.AllowedTransitions` (which guards `UpdateStatus`), not in
`Upsert`.
`created_at` is preserved across re-starts: the start service reuses
`existing.CreatedAt` when the record was found, so the
"first time RTM saw the game" semantics from
[`postgres-migration.md`](postgres-migration.md) §9 hold even when the
start path goes through `Upsert` rather than through the runtime
adapter's `INSERT ... ON CONFLICT DO UPDATE` EXCLUDED list.
A residual `galaxy-game-{game_id}` container left over from a previous
start that was stopped but never cleaned up will fail at `docker run`
with a name conflict. The service surfaces that as
`container_start_failed`; cleanup plus the reconciler is the standard
remedy. A pre-emptive Remove inside the start service was rejected
because it would silently undo manual operator inspection on stopped
containers.
## 7. `LobbyInternalClient.GetGame` is best-effort
The fetch happens after the lease is acquired and before the Docker
work, with the configured `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
`ErrLobbyUnavailable` and `ErrLobbyGameNotFound` are logged at
`debug`; the start operation continues either way. The fetched
`Status` and `TargetEngineVersion` enrich logs only — the start
envelope already carries the only required field (`image_ref`), and
the port docstring fixes the recoverable-failure contract.
## 8. `image_ref` validation
Validation uses `github.com/distribution/reference.ParseNormalizedNamed`
before any Docker round-trip. Rejected shapes surface as
`start_config_invalid` plus a `runtime.start_config_invalid` intent.
Daemon-side rejections after a valid parse (manifest unknown,
authentication required) surface as `image_pull_failed` plus a
`runtime.image_pull_failed` intent. The split keeps operator-actionable
configuration mistakes distinct from registry-side failures.
## 9. State-directory preparer is overrideable
`Dependencies.PrepareStateDir` is a `func(gameID string) (string, error)`
injection point that defaults to `os.MkdirAll` + `os.Chmod` +
`os.Chown` against `RTMANAGER_GAME_STATE_ROOT`. Tests override it to
point at a `t.TempDir()`-style fake without exercising the real
filesystem permissions (which require either matching uid/gid or
root). This is a deliberate non-port abstraction: the start service
does no other filesystem work and the cost of a new port for one
helper is not worth the indirection.
## 10. Container env: both `GAME_STATE_PATH` and `STORAGE_PATH`
Both names are accepted by the v1 engine. The start service always
sets both; the configured `RTMANAGER_ENGINE_STATE_ENV_NAME` controls
the primary. When the operator overrides the primary to `STORAGE_PATH`,
the deduplicating map collapses the two entries into one.
## 11. Wiring layer construction
`internal/app/wiring.go` is the single point that builds every
production store, adapter, and service from `config.Config`. The
struct exposes typed fields so handlers and workers can grab the
singletons without re-wiring; an `addCloser` slice releases adapter
resources (currently the Lobby HTTP client's idle-connection pool) at
runtime shutdown. The `runtimeRecordsProbe` adapter installed during
construction registers the `rtmanager.runtime_records_by_status`
gauge documented in [`../README.md` §Observability](../README.md).
The persistence-only `CountByStatus` method on the `runtimerecordstore`
adapter is **not** part of `ports.RuntimeRecordStore` because it is
only used by the gauge probe; widening the port for one caller would
force every adapter and test fake to grow with no benefit. The adapter
exposes it directly and the wiring composes a concrete-typed wrapper.
## 12. Shared lease across composed operations (restart, patch)
Restart and patch must hold the lease across the inner
`stop → docker rm → start` sequence, otherwise a concurrent stop or
restart could observe a half-recreated runtime.
`startruntime.Service` and `stopruntime.Service` therefore expose a
second public method:
```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```
`Handle` acquires the lease, defers its release, and calls `Run`.
Restart and patch acquire the outer lease themselves and call `Run`
on the inner services. The inner services record their own
`operation_log` entries, telemetry counters, health events, and admin
notification intents identically to a top-level `Handle`.
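
A minimal sketch of how `Handle` wraps `Run`, with illustrative lease-store,
`Input`, and `Result` shapes (the real types are much wider):

```go
package startruntime // illustrative shapes only

import (
	"context"
	"time"
)

type Input struct{ GameID string }

type Result struct{ ErrorCode string }

type LeaseStore interface {
	Acquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type Service struct {
	leases   LeaseStore
	leaseTTL time.Duration
	newToken func() string
}

// Handle is the external entry point: acquire the per-game lease, defer its
// release, and delegate the lifecycle work to Run.
func (service *Service) Handle(ctx context.Context, input Input) (Result, error) {
	token := service.newToken()
	acquired, err := service.leases.Acquire(ctx, input.GameID, token, service.leaseTTL)
	if err != nil {
		return Result{ErrorCode: "service_unavailable"}, nil
	}
	if !acquired {
		return Result{ErrorCode: "conflict"}, nil
	}
	defer service.leases.Release(ctx, input.GameID, token)
	return service.Run(ctx, input)
}

// Run assumes the lease is already held; restart and patch call it directly
// under their own outer lease.
func (service *Service) Run(ctx context.Context, input Input) (Result, error) {
	// ... Docker, persistence, telemetry, health events ...
	return Result{}, nil
}
```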
A typed `LeaseTicket` parameter (a small internal-package zero-size
struct that only the lease store can construct) was considered and
rejected for v1: only sister services in `internal/service/` ever call
`Run`, the docstring is loud about the precondition, and the pattern
can be tightened later without breaking the public surface that
handlers and consumers rely on.
## 13. Correlation id on `source_ref`
The outer restart and patch services reuse the existing
`Input.SourceRef` as a correlation key:
- when `Input.SourceRef` is non-empty (REST request id, stream entry
id), all three entries — outer restart / patch + inner stop +
inner start — share that value;
- when empty, the outer service generates a 32-byte base64url string
via the same `NewToken` generator that produces lease tokens, and
uses it as the correlation key for all three entries.
The outer entry's `source_ref` keeps its dual semantics: actor ref
when the caller supplied one, generated correlation id otherwise. Pure
top-level operations (caller invokes start, stop, or cleanup directly)
keep the original meaning. Composed operations (restart, patch) use
the same value in three places to make audit queries trivial.
This is not the cleanest end-state — a dedicated `correlation_id`
column would carry the link without ambiguity — but it is the smallest
change that does not touch the schema. A future stage that adds the
column can rename the field and clear up the dual role in one move.
## 14. Semver validation for patch
`internal/service/patchruntime/semver.go` enforces the
patch-precondition (current and new `image_ref` parse as semver, share
major and minor):
- `extractSemverTag(imageRef)` parses with
`github.com/distribution/reference.ParseNormalizedNamed`, casts to
`reference.NamedTagged`, then validates the tag with
`golang.org/x/mod/semver.IsValid` (after prepending `v` when the tag
omits it). Failures map to `image_ref_not_semver`;
- `samePatchSeries(currentSemver, newSemver)` compares
`semver.MajorMinor` of the two canonical strings; mismatch maps to
`semver_patch_only`.
`golang.org/x/mod` is a direct require to avoid a transitive-version
surprise. `github.com/Masterminds/semver/v3` (also in the module
graph) was rejected to avoid two semver libraries on disk for the
same job; `x/mod/semver` already covers Lobby. A hand-rolled
`vMajor.Minor.Patch` parser was rejected as premature.
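
A minimal sketch of the two helpers under the stated library choices; the
error values and the `main` demo are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var (
	errNotSemver       = errors.New("image_ref_not_semver")
	errSemverPatchOnly = errors.New("semver_patch_only")
)

// extractSemverTag returns the canonical "vMAJOR.MINOR.PATCH" form of the
// image tag, or errNotSemver when the ref has no tag or the tag is not semver.
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", errNotSemver
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemver
	}
	tag := tagged.Tag()
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag
	}
	if !semver.IsValid(tag) {
		return "", errNotSemver
	}
	return semver.Canonical(tag), nil
}

// samePatchSeries reports whether two canonical semver strings share the
// same major.minor series.
func samePatchSeries(currentSemver, newSemver string) error {
	if semver.MajorMinor(currentSemver) != semver.MajorMinor(newSemver) {
		return errSemverPatchOnly
	}
	return nil
}

func main() {
	current, _ := extractSemverTag("ghcr.io/acme/engine:1.4.2")
	next, _ := extractSemverTag("ghcr.io/acme/engine:1.4.3")
	fmt.Println(current, next, samePatchSeries(current, next)) // v1.4.2 v1.4.3 <nil>
}
```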
Pre-checks run before any inner stop or `docker rm`: a rejected patch
never disturbs the running runtime. Patch with
`new_image_ref == current_image_ref` proceeds through the recreate
flow unchanged (not `replay_no_op`: the inner start still runs); the
outer `op_kind=patch` entry records the no-op patch for audit.
## 15. `StopReason` placement
The reason enum mirrors `lobby/internal/ports/runtimemanager.go`
verbatim and lives at `internal/service/stopruntime/stopreason.go`.
The stream consumer and the REST handler import `stopruntime` for
the same enum the service requires.
Inner stop calls from restart and patch always pass
`StopReasonAdminRequest`. Restart and patch are platform-internal
recreate flows; `admin_request` is the closest semantic match in the
five-value vocabulary. The actor that originated the recreate (REST
request id, admin user id) flows through the `op_source` /
`source_ref` pair, not through the stop reason.
## 16. Error code centralisation
`internal/service/startruntime/errors.go` is the canonical home for
the stable error codes returned in `Result.ErrorCode`. The other four
services (`stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) import the constants from `startruntime` rather
than redeclaring them. The package comment of `errors.go` flags the
shared usage so future readers do not chase per-service declarations.
`start_config_invalid` is reserved for start because every start
validation failure also raises an admin notification intent. The
other services use the more general `invalid_request` for input
validation failures.
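
A minimal sketch of the shared declarations; the constant names are
illustrative — only the string values are fixed by the contract — and the
patch-specific codes (`image_ref_not_semver`, `semver_patch_only`) are
omitted here:

```go
package startruntime // illustrative constant names

// Stable error codes surfaced in Result.ErrorCode. stopruntime,
// restartruntime, patchruntime, and cleanupcontainer import these rather
// than redeclaring them.
const (
	ErrorCodeInvalidRequest       = "invalid_request"
	ErrorCodeStartConfigInvalid   = "start_config_invalid" // start-only: also raises an admin notification intent
	ErrorCodeImagePullFailed      = "image_pull_failed"
	ErrorCodeContainerStartFailed = "container_start_failed"
	ErrorCodeReplayNoOp           = "replay_no_op"
	ErrorCodeConflict             = "conflict"
	ErrorCodeNotFound             = "not_found"
	ErrorCodeServiceUnavailable   = "service_unavailable"
	ErrorCodeInternalError        = "internal_error"
)
```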
## 17. Stop / restart / patch / cleanup failure tables
### `stopruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | No notification intent. |
| Lease busy | `conflict` | Lease release skipped because acquire returned false. |
| Lease error | `service_unavailable` | Redis unreachable. |
| Record missing | `not_found` | |
| Status `stopped` / `removed` | success / `replay_no_op` | Idempotent re-stop. |
| `docker.Stop` returns `ErrContainerNotFound` | success | Record transitions `running → removed`, `container_disappeared` health event published. |
| `docker.Stop` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` (CAS race) | success / `replay_no_op` | The desired state was reached by another path (reconciler / restart). |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | Record vanished mid-stop. |
| `UpdateStatus` other error | `internal_error` | |
### `restartruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | Same as stop. |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | `image_ref` may be empty; restart cannot proceed. |
| Inner stop fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner stop failed: ". |
| `docker.Remove` fails | `service_unavailable` | Inner stop already moved record to `stopped`; runtime stays in `stopped`. Admin must call `cleanup_container` before retrying restart. |
| Inner start fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner start failed: ". |
The post-stop `docker rm` failure is the only path that leaves the
runtime in a state from which the same operation cannot recover by
itself: a residual `galaxy-game-{game_id}` container blocks a fresh
inner start (the start service surfaces this as
`container_start_failed`). The runbook entry — "call cleanup, then
restart again" — is the standard remedy.
### `patchruntime`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | |
| Current `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| New `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| Major / minor mismatch | `semver_patch_only` | Pre-check; no inner ops fired. |
| Inner stop / `docker rm` / inner start fails | inherits inner code | Same propagation as restart. |
### `cleanupcontainer`
| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | success / `replay_no_op` | |
| Status `running` | `conflict` | Error message: "stop the runtime first". |
| Status `stopped` | proceed | |
| `docker.Remove` returns `ErrContainerNotFound` | success | Adapter swallows not-found into nil. |
| `docker.Remove` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` | success / `replay_no_op` | Race with reconciler dispose. |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | |
| `UpdateStatus` other error | `internal_error` | |
## 18. REST handler conventions
The internal HTTP handlers under
[`../internal/api/internalhttp/handlers/`](../internal/api/internalhttp/handlers)
follow these rules:
- **`X-Galaxy-Caller` header.** The optional header carries the
calling service identity (`gm` / `admin`); the handler records the
value as `op_source` in `operation_log` (`gm_rest` / `admin_rest`).
Missing or unknown values default to `admin_rest` because every
audit-log query already filters on the cleanup endpoint
(`op_source ∈ {auto_ttl, admin_rest}`); making the default match
the most-restricted surface keeps existing dashboards correct when
an unconfigured client hits the listener. The header is declared as
a reusable parameter (`components.parameters.XGalaxyCallerHeader`)
in the OpenAPI spec and is referenced from each runtime operation
but not from `/healthz` and `/readyz`.
- **Error code → HTTP status mapping.** One canonical table in
`handlers/common.go`:
| ErrorCode | HTTP status |
| --- | ---: |
| (success, including `replay_no_op`) | 200 |
| `invalid_request`, `start_config_invalid`, `image_ref_not_semver` | 400 |
| `not_found` | 404 |
| `conflict`, `semver_patch_only` | 409 |
| `service_unavailable`, `docker_unavailable` | 503 |
| `internal_error`, `image_pull_failed`, `container_start_failed` | 500 |
`image_pull_failed` and `container_start_failed` are operational
failures that originate inside RTM (registry / daemon problems),
not client-side validation issues; they map to `500` so callers
retry through their normal resilience paths instead of treating
the call as a 4xx that must be fixed at the source.
`docker_unavailable` is reserved for future producers; today the
start service emits `service_unavailable` for Docker-daemon
failures. Unknown error codes default to `500`. A sketch of the
mapping helper follows after this list.
- **List and Get bypass the service layer.** `internalListRuntimes`
and `internalGetRuntime` read directly from
`ports.RuntimeRecordStore`. Reads do not produce `operation_log`
rows, do not change Docker state, do not need the per-game lease,
and do not have a stream-side counterpart — none of the lifecycle
service machinery is justified.
- **`RuntimeRecordStore.List(ctx)` returns every record regardless
of status.** A single SELECT ordered by
`(last_op_at DESC, game_id ASC)` — the same direction the
`runtime_records_status_last_op_idx` index supports, so freshly
active games surface first. Pagination is intentionally not
modelled in v1; the working set is bounded by the games tracked
by Lobby.
- **Per-handler service ports use `mockgen`.** The handler layer
depends on five narrow interfaces — one per lifecycle service —
declared in `handlers/services.go`. Production wiring passes the
concrete `*<lifecycle>.Service` pointers (each satisfies the
matching interface implicitly); tests pass the mockgen-generated
mocks under `handlers/mocks/`.
- **Conformance test scope.** `internalhttp/conformance_test.go`
drives every documented runtime operation against a real
`internalhttp.Server` whose service deps are deterministic stubs.
The test uses `kin-openapi/routers/legacy.NewRouter`, calls
`openapi3filter.ValidateRequest` and
`openapi3filter.ValidateResponse` so both directions match the
contract. The scope is happy-path only; the failure-path response
shapes are validated by the per-handler tests.
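
The error-code table above collapses into a single switch; a minimal sketch,
assuming an illustrative helper name for the function in `handlers/common.go`:

```go
package handlers // illustrative

import "net/http"

// httpStatusFor maps a stable ErrorCode to the HTTP status documented above.
// Unknown codes fall through to 500.
func httpStatusFor(errorCode string) int {
	switch errorCode {
	case "", "replay_no_op":
		return http.StatusOK
	case "invalid_request", "start_config_invalid", "image_ref_not_semver":
		return http.StatusBadRequest
	case "not_found":
		return http.StatusNotFound
	case "conflict", "semver_patch_only":
		return http.StatusConflict
	case "service_unavailable", "docker_unavailable":
		return http.StatusServiceUnavailable
	case "internal_error", "image_pull_failed", "container_start_failed":
		return http.StatusInternalServerError
	default:
		return http.StatusInternalServerError
	}
}
```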
+412
View File
@@ -0,0 +1,412 @@
# Background Workers
This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):
- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
[`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
`InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
`/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
periodic TTL cleanup.
The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.
## 1. Single ownership per `event_type`
The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.
| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |
`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.
## 2. `container_disappeared` is conditional on PG state
The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply:
- record missing → suppress (the destroyed container was never owned
by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
swapped to a new container through restart or patch; the destroy is
the expected removal of the prior container id).
Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.
A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.
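
A minimal sketch of the decision, with a pared-down illustrative record
shape; the conservative read-failure case is folded in as the first rule:

```go
package dockerevents // illustrative

// RuntimeRecord is a pared-down stand-in for the real domain type.
type RuntimeRecord struct {
	Status             string
	CurrentContainerID string
}

// shouldEmitContainerDisappeared applies the three suppression rules to a
// Docker "destroy" action observed for eventContainerID.
func shouldEmitContainerDisappeared(record *RuntimeRecord, readErr error, eventContainerID string) bool {
	switch {
	case readErr != nil:
		return false // cannot tell external destroy from RTM-initiated removal
	case record == nil:
		return false // destroyed container was never a tracked runtime
	case record.Status != "running":
		return false // expected tail of a finished stop or cleanup
	case record.CurrentContainerID != eventContainerID:
		return false // restart or patch already swapped to a new container
	default:
		return true // unexpected destroy of the live container
	}
}
```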
## 3. `die` with exit code `0` is suppressed
`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.
## 4. Inspect worker leaves `container_disappeared` to the reconciler
When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:
- the reconciler is the single authority for PG-drift reconciliation.
Adding a third source for `container_disappeared` would risk double
emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
minutes. The latency window for "Docker drops the container, RTM
notices" is therefore at most 5 minutes in v1, which is acceptable
for the kinds of drift the reconciler covers (manual `docker rm`
outside RTM, daemon restart with stale records). If a future
requirement tightens the window, promoting the inspect-side
observation to a real `container_disappeared` is a one-line change.
## 5. Probe hysteresis is in-memory and pruned per tick
The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:
- `consecutiveFailures` — incremented on every failed probe, reset on
every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
the failure persists, and triggers a single `probe_recovered` on the
first success after the threshold was crossed.
The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.
State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through stopped → running
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.
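
A minimal sketch of the bookkeeping, assuming an illustrative `observe`
helper and a configuration-supplied threshold:

```go
package healthprobe // illustrative

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe returns which health event (if any) to publish for one probe
// outcome: "probe_failed" once when the threshold is crossed,
// "probe_recovered" once on the first success after a published failure.
func (state *probeState) observe(success bool, threshold int) string {
	if success {
		state.consecutiveFailures = 0
		if state.failurePublished {
			state.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	state.consecutiveFailures++
	if state.consecutiveFailures >= threshold && !state.failurePublished {
		state.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```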
## 6. Probe concurrency is bounded by a fixed cap
Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`). Three reasons:
- A single slow engine cannot delay the entire cohort. Sequential
per-game probing would multiply the worst case by `len(records)`,
which is the wrong shape for what is fundamentally a fan-out
observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
cap) was rejected to avoid pathological CPU and connection bursts
if the running list ever grows beyond what RTM was sized for. 16
in-flight probes at the default 2s timeout fit a single RTM
instance well within typical OS file-descriptor and TCP
ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
single-instance and the active-game count is bounded by Lobby; a
configurable cap is something we promote to env if a real workload
demands it.
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.
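
A minimal sketch of the bounded fan-out inside one tick, with illustrative
names for the game list and the probe callback:

```go
package healthprobe // illustrative

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

// probeAll fans out one probe per running game, capped at 16 in flight.
func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	semaphore := make(chan struct{}, defaultMaxConcurrency)
	var waitGroup sync.WaitGroup
	for _, gameID := range gameIDs {
		waitGroup.Add(1)
		semaphore <- struct{}{} // blocks once the cap is reached
		go func(gameID string) {
			defer waitGroup.Done()
			defer func() { <-semaphore }()
			probeOne(ctx, gameID)
		}(gameID)
	}
	waitGroup.Wait()
}
```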
## 7. Events listener reconnects with fixed backoff
The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while ctx is alive.
Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.
A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.
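
A minimal sketch of the outer loop, assuming an illustrative `consumeOnce`
stand-in for one subscribe-and-drain pass and an `slog`-style logger:

```go
package dockerevents // illustrative

import (
	"context"
	"log/slog"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

// runWithReconnect re-subscribes after any transport error, indefinitely,
// until the context is cancelled.
func runWithReconnect(ctx context.Context, logger *slog.Logger, consumeOnce func(context.Context) error) error {
	for {
		if err := consumeOnce(ctx); err != nil {
			logger.Warn("docker events subscription ended", "error", err)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(defaultReconnectBackoff):
			// fixed backoff, then retry while ctx is alive
		}
	}
}
```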
## 8. Health publisher remains best-effort
Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.
## 9. Stream offset labels are stable identifiers
Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels match `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.
## 10. `OpSource` and `SourceRef` originate at the consumer boundary
Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.
## 11. Replay-no-op detection lives in the service layer
The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.
The wire-shape consequences:
- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
start, the existing record carries `container_id` and
`engine_endpoint`; for stop on `status=removed`, both fields are
empty strings (the record was nulled by an earlier cleanup) — the
AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
returned a zero `Record`; the consumer publishes empty
`container_id` and `engine_endpoint`.
## 12. Per-message errors are absorbed; the offset always advances
The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.
The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.
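
A minimal sketch of the per-message policy, with illustrative stand-ins for
the handling pipeline (decode → `Service.Handle` → job-result publish) and
the `StreamOffsetStore.Save` call:

```go
package startjobsconsumer // illustrative

import (
	"context"
	"fmt"
	"log/slog"
)

// processMessage shows the asymmetry: handler failures are absorbed, a
// failed offset save is not.
func processMessage(
	ctx context.Context,
	logger *slog.Logger,
	entryID string,
	handleOne func(context.Context, string) error,
	saveOffset func(context.Context, string) error,
) error {
	if err := handleOne(ctx, entryID); err != nil {
		// Decode errors, service errors, and publish failures: log and move on;
		// the durable side effects already happened on this pass.
		logger.Error("message handling failed", "entry_id", entryID, "error", err)
	}
	if err := saveOffset(ctx, entryID); err != nil {
		// Losing the offset would replay the whole backlog, so this error
		// propagates and the component supervisor exits the process.
		return fmt.Errorf("save stream offset: %w", err)
	}
	return nil
}
```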
## 13. `requested_at_ms` is logged-only
The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.
## 14. Reconciler: per-game lease around every write
A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).
The reconciler splits its work into two phases:
- **Read pass — lockless.**
`docker.List({com.galaxy.owner=rtmanager})` followed by
`RuntimeRecords.ListByStatus(running)`. No lease is taken; both
reads are point-in-time observations of independent systems and a
stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
(`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
per-game lease, re-reads the record under the lease, and then
either applies the mutation or returns when state has changed.
A lease conflict (`acquired=false`) is logged at `info` and the
game is silently skipped — the next tick will retry. A lease-store
error is logged at `warn`; the rest of the pass continues.
The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.
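
A minimal sketch of one lease-guarded mutation (`disposeOne` here), with
pared-down illustrative port shapes; `adoptOne` and `observedExitedOne`
follow the same acquire → re-read → mutate-or-skip pattern:

```go
package reconcile // illustrative shapes only

import (
	"context"
	"log/slog"
	"time"
)

type LeaseStore interface {
	Acquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type RuntimeRecord struct{ Status string }

type RecordStore interface {
	Get(ctx context.Context, gameID string) (*RuntimeRecord, error)
}

type Reconciler struct {
	leases   LeaseStore
	records  RecordStore
	logger   *slog.Logger
	leaseTTL time.Duration
	newToken func() string
}

func (reconciler *Reconciler) disposeOne(ctx context.Context, gameID string) {
	token := reconciler.newToken()
	acquired, err := reconciler.leases.Acquire(ctx, gameID, token, reconciler.leaseTTL)
	if err != nil {
		reconciler.logger.Warn("lease store error, skipping game", "game_id", gameID, "error", err)
		return
	}
	if !acquired {
		reconciler.logger.Info("lease busy, skipping game", "game_id", gameID)
		return // next tick retries
	}
	defer reconciler.leases.Release(ctx, gameID, token)

	// Re-read under the lease: the lockless read pass may be stale by now.
	record, err := reconciler.records.Get(ctx, gameID)
	if err != nil || record == nil || record.Status != "running" {
		return // state moved on; nothing to dispose
	}
	// ... mark status=removed via the CAS UpdateStatus, publish
	// container_disappeared, append op_kind=reconcile_dispose ...
}
```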
## 15. Three drift kinds covered by the reconciler
- `adopt` — Docker reports a container labelled
`com.galaxy.owner=rtmanager` for which RTM has no record; insert a
fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
stop or remove the container (operators may have started it
manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
in Docker; mark `status=removed`, publish
`container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
exists but is in `exited`; mark `status=stopped`, publish
`container_exited` with the observed exit code. This third path
exists because the events listener sees only live events; a
container that died while RTM was offline would otherwise stay
`running` indefinitely. The drift is exposed through
`rtmanager.reconcile_drift{kind=observed_exited}` and through the
`container_exited` health event; no `operation_log` entry is
written because the audit log records explicit RTM operations, not
passive observations of Docker state.
## 16. `stopped_at = now (reconciler observation time)`
The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.
The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time` field;
that change is deferred until a consumer materialises.
## 17. Synchronous initial pass + periodic Component
`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.
The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.
## 18. Adopt through `Upsert`, race with start is benign
The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.
Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:
- `created_at` is preserved across upserts by the
`ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
container, same image, same hostname, same state path).
Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.
## 19. Cleanup worker delegates to the service
The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate. The service already owns:
- the per-game lease around the Docker `Remove` call;
- the `running → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
`op_source=auto_ttl`);
- the telemetry counter and structured log fields.
In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.
The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.
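
A minimal sketch of the TTL math and the worker-local interface, with
illustrative shapes — the real `Cleaner` method takes the cleanup service's
input struct rather than a bare game id:

```go
package containercleanup // illustrative

import (
	"context"
	"time"
)

// Cleaner is the worker-local interface satisfied by the real
// cleanupcontainer service; keeping it here keeps the worker's tests light.
type Cleaner interface {
	Handle(ctx context.Context, gameID string) error // OpSource=auto_ttl in the real call
}

// StoppedRecord is a pared-down stand-in for one listed stopped runtime.
type StoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

// cleanExpired filters the listed stopped records in process and hands each
// record older than the retention window to the cleanup service.
func cleanExpired(ctx context.Context, now time.Time, retention time.Duration, stopped []StoppedRecord, cleaner Cleaner) {
	for _, record := range stopped {
		if !record.LastOpAt.Before(now.Add(-retention)) {
			continue // still inside the retention window
		}
		_ = cleaner.Handle(ctx, record.GameID) // failures are the service's concern; next tick retries
	}
}
```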
## 20. Sequential per-game work in reconciler and cleanup
Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.
## 21. Cross-module test boundary for the consumer integration test
[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:
- it XADDs a start envelope in the AsyncAPI wire shape (the same
shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
the persistence stores, the lease, and the notification / health
publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
`runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
`error_code=replay_no_op` outcome with no further Docker calls.
The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.