feat: runtime manager
@@ -0,0 +1,44 @@
# Runtime Manager — Service-Local Documentation

This directory hosts the service-local documentation for `Runtime Manager`. The top-level [`../README.md`](../README.md) describes the current-state contract (purpose, scope, lifecycles, surfaces, configuration, observability); the documents below complement it with focused content docs and design-rationale records.

## Content docs

- [Runtime and components](runtime.md) — process diagram, listeners, workers, lifecycle services, stream offsets, configuration groups, runtime invariants.
- [Flows](flows.md) — mermaid sequence diagrams for the lifecycle and observability flows.
- [Operator runbook](runbook.md) — startup, readiness, shutdown, and recovery scenarios.
- [Configuration and contract examples](examples.md) — `.env`, REST request bodies, stream payloads, storage inspection snippets.

## Design rationale

- [PostgreSQL schema decisions](postgres-migration.md) — the schema decision record consolidating the persistence-layer agreements (tables, indexes, CAS shape, `created_at` preservation, jsonb round-trip, schema/role provisioning split).
- [Domain and ports](domain-and-ports.md) — string-typed enums, the four allowed runtime transitions, why `Inspect` splits into `InspectImage` / `InspectContainer`, why `LobbyGameRecord` is minimal, and other domain-layer choices.
- [Adapters](adapters.md) — Docker SDK adapter, Lobby internal HTTP client, the three Redis publishers, the `mockgen` convention for wide ports, and the unit-test strategy for HTTP-backed adapters.
- [Lifecycle services](services.md) — per-game lease semantics, the `Result`-shaped contract, failure-mode tables, the lease-bypass `Run` method on inner services, the `X-Galaxy-Caller` header convention, and the canonical error code → HTTP status mapping.
- [Background workers](workers.md) — single-ownership table per `event_type`, `container_disappeared` suppression rules, probe hysteresis, the events listener reconnect policy, the reconciler's per-game lease and three drift kinds.
- [Service-local integration suite](integration-tests.md) — the `integration` build tag, the in-process `app.NewRuntime` choice, the Lobby HTTP stub, and the test isolation strategy.
@@ -0,0 +1,192 @@
# Adapters

This document explains why the production adapters under [`../internal/adapters/`](../internal/adapters) — Docker SDK, Lobby internal HTTP client, notification-intent publisher, health-event publisher, job-result publisher — are shaped the way they are. The PostgreSQL stores and the Redis-coordination adapters live in [`postgres-migration.md`](postgres-migration.md).

## 1. `mockgen` is the repo-wide convention for wide ports

The Docker port has nine methods plus eight value types in the signatures, and most lifecycle services exercise nearly every method pair (start, stop, restart, patch, cleanup, reconcile, events, probe). A hand-rolled fake would either miss methods or balloon to a per-test fixture.

`internal/adapters/docker/` therefore uses `go.uber.org/mock` mocks:

- `//go:generate` directives live next to the interface declaration in `internal/ports/dockerclient.go`;
- generated code is committed under `internal/adapters/docker/mocks/` (matching the `internal/adapters/postgres/jet/` discipline);
- `make -C rtmanager mocks` is the single command operators run after a port-signature change.

The maintained `go.uber.org/mock` fork is preferred over the archived `github.com/golang/mock`. This convention applies to wide / recorder ports across the repository — Lobby uses the same pipeline for its narrow recorder ports (`RuntimeManager`, `IntentPublisher`, `GMClient`, `UserService`); see [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md) for the cross-service rule.

The other two RTM ports (`LobbyInternalClient`, `NotificationIntentPublisher`) keep inline `_test.go` fakes: small surfaces, easy to fake by hand inside a single test file when needed.
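A minimal sketch of how such a directive can sit next to the port declaration. The `mockgen` flags and the two-method interface shown here are illustrative assumptions, not the repository's exact surface; the committed convention is only that the directive lives in `internal/ports/dockerclient.go` and that `make -C rtmanager mocks` regenerates into `internal/adapters/docker/mocks/`.

```go
// internal/ports/dockerclient.go (sketch)
package ports

import "context"

//go:generate go run go.uber.org/mock/mockgen -source=dockerclient.go -destination=../adapters/docker/mocks/dockerclient.go -package=mocks

// DockerClient is the wide port the lifecycle services and workers depend on;
// the two methods shown here stand in for the full nine-method surface.
type DockerClient interface {
	PullImage(ctx context.Context, ref string) error
	ContainerStop(ctx context.Context, containerID string) error
}
```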
## 2. `EngineEndpoint` is built inside the Docker adapter

The engine port is fixed at `8080`. Pushing it into `RunSpec` would force the start service to know an engine implementation detail; pushing it into config would give operators a knob that the engine image already does not honour. The Docker adapter exposes `EnginePort = 8080` as a package constant and constructs `RunResult.EngineEndpoint = "http://" + spec.Hostname + ":8080"` itself.

The adapter also leaves `container.Config.ExposedPorts` empty: RTM never publishes ports to the host. The user-defined Docker bridge network gives every container in the network DNS access to the engine via `galaxy-game-{game_id}:8080`.

## 3. `Run` removes the container on `ContainerStart` failure

`README.md §Lifecycles → Start` requires no orphan to remain after a failed start path. If `ContainerCreate` succeeds but `ContainerStart` fails, the adapter calls `ContainerRemove(force=true)` inside a fresh `context.Background()` (with a 10s timeout) so the cleanup runs even when the original ctx is already cancelled. The cleanup is best-effort: a remove failure is silently discarded because the original start failure is the actionable error returned to the caller.

The alternative — leaving rollback to the start service — would either duplicate the same code in every caller or invite a service that forgets to do it. Centralising the rule in the adapter keeps the port contract simple. The start service adds an additional rollback layer for the post-`Run` `Upsert` failure path; see [`services.md`](services.md) §5.
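A minimal sketch of the rollback rule, with the Docker SDK hidden behind an illustrative `engineAPI` interface so the shape stands alone; names and signatures are assumptions, not the adapter's actual code.

```go
package docker

import (
	"context"
	"fmt"
	"time"
)

// engineAPI is an illustrative stand-in for the Docker SDK calls the adapter
// makes (ContainerCreate, ContainerStart, ContainerRemove).
type engineAPI interface {
	Create(ctx context.Context, hostname, imageRef string) (containerID string, err error)
	Start(ctx context.Context, containerID string) error
	Remove(ctx context.Context, containerID string, force bool) error
}

// runWithRollback sketches the rule: a container that was created but failed
// to start is removed before the start error is returned.
func runWithRollback(ctx context.Context, api engineAPI, hostname, imageRef string) (string, error) {
	id, err := api.Create(ctx, hostname, imageRef)
	if err != nil {
		return "", fmt.Errorf("container create: %w", err)
	}
	if err := api.Start(ctx, id); err != nil {
		// Best-effort cleanup in a fresh context: the caller's ctx may already
		// be cancelled, and the start failure stays the actionable error.
		rmCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer cancel()
		_ = api.Remove(rmCtx, id, true)
		return "", fmt.Errorf("container start: %w", err)
	}
	return id, nil
}
```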
## 4. `RunSpec.Cmd` is optional

`ports.RunSpec` exposes an optional `Cmd []string`. Production callers leave it `nil` so the engine image's own `CMD` runs; `internal/adapters/docker/smoke_test.go` uses it to drive `["/bin/sh","-c","sleep 60"]` against `alpine:3.21`.

The alternative — building a dedicated test image with a pre-baked `sleep` command — would require an extra `Dockerfile` under testdata and a build step inside the smoke test. The single new field is documented as optional and ignored when empty; production behaviour is unchanged.

## 5. `EventsListen` filters at the adapter boundary

The Docker `/events` API accepts a `filters` query parameter, but the daemon treats it as a hint, not a guarantee. The adapter therefore double-checks at the boundary: only `Type == events.ContainerEventType` messages are passed through to the typed `<-chan ports.DockerEvent`. Doing the filter at the SDK level would still require a defensive recheck on the consumer side; consolidating the check in the adapter keeps the contract crisp and the consumer free of Docker-internal type discriminants.

The decoded event copies the actor's full `Attributes` map into `DockerEvent.Labels`. Docker mixes container labels and runtime attributes (`exitCode`, `image`, `name`, etc.) flat in the same map; RTM consumers filter by the `com.galaxy.` prefix when they care about labels, and the adapter extracts `exitCode` separately for `die` events.

## 6. Lobby HTTP client error mapping

`ports.LobbyInternalClient.GetGame` fixes:

- `200` → `LobbyGameRecord` decoded tolerantly (unknown fields ignored);
- `404` → `ports.ErrLobbyGameNotFound`;
- transport, timeout, or any other non-2xx → `ports.ErrLobbyUnavailable` wrapped with the original error so callers can `errors.Is` and still log the cause.

The start service treats `ErrLobbyUnavailable` as recoverable: it continues without the diagnostic data because the start envelope already carries the only required field (`image_ref`). The client mirrors `notification/internal/adapters/userservice/client.go`: cloned `*http.Transport`, `otelhttp.NewTransport` wrap, per-request `context.WithTimeout`, idempotent `Close()` releasing idle connections.

JSON decoding is tolerant: unknown fields in the success body do not break the call, so additive changes to Lobby's `GameRecord` schema do not require an RTM release.
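A sketch of the status-to-error mapping under stated assumptions: the sentinel error names and record fields mirror the port described above, while the JSON tags, request construction, and helper signature are illustrative.

```go
package lobbyclient

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// Sentinel errors named after the port errors described above.
var (
	ErrLobbyGameNotFound = errors.New("lobby game not found")
	ErrLobbyUnavailable  = errors.New("lobby unavailable")
)

// GameRecord is a stand-in for ports.LobbyGameRecord; the JSON tags are assumptions.
type GameRecord struct {
	GameID              string `json:"game_id"`
	Status              string `json:"status"`
	TargetEngineVersion string `json:"target_engine_version"`
}

// getGame sketches the 200 / 404 / everything-else mapping.
func getGame(ctx context.Context, hc *http.Client, baseURL, gameID string) (GameRecord, error) {
	var rec GameRecord
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/v1/internal/games/"+gameID, nil)
	if err != nil {
		return rec, err
	}
	resp, err := hc.Do(req)
	if err != nil {
		// transport error or timeout: unavailable, wrapped so errors.Is still matches
		return rec, fmt.Errorf("%w: %v", ErrLobbyUnavailable, err)
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		// Tolerant decode: unknown fields in the body are ignored by encoding/json.
		if err := json.NewDecoder(resp.Body).Decode(&rec); err != nil {
			return rec, fmt.Errorf("%w: decode: %v", ErrLobbyUnavailable, err)
		}
		return rec, nil
	case http.StatusNotFound:
		return rec, ErrLobbyGameNotFound
	default:
		return rec, fmt.Errorf("%w: unexpected status %d", ErrLobbyUnavailable, resp.StatusCode)
	}
}
```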
## 7. Notification publisher wrapper signature

The wrapper drops the entry id returned by `notificationintent.Publisher.Publish` (rationale in [`domain-and-ports.md`](domain-and-ports.md) §7). The adapter is a thin shim:

- `NewPublisher(cfg)` constructs the inner publisher and forwards validation;
- `Publish(ctx, intent)` calls the inner publisher and discards the entry id.

The compile-time assertion `var _ ports.NotificationIntentPublisher = (*Publisher)(nil)` lives in `publisher.go`.

## 8. Health-events publisher: snapshot upsert before stream XADD

Every emission goes through `ports.HealthEventPublisher.Publish`, which both XADDs to `runtime:health_events` and upserts `health_snapshots`. The snapshot upsert runs **before** the XADD: a successful Publish always leaves the snapshot store at least as fresh as the stream, and a partial failure leaves the snapshot a best-effort lower bound. Reversing the order would let consumers observe a stream entry whose `health_snapshots` row reflects the prior observation — a misleading inversion.

The `event_type → SnapshotStatus / SnapshotSource` mapping mirrors the table in [`../README.md` §Health Monitoring](../README.md). In particular, `container_started` collapses to `SnapshotStatusHealthy` and `probe_recovered` does the same (rationale in [`domain-and-ports.md`](domain-and-ports.md) §4).
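A sketch of the ordering rule, with the snapshot store and the Redis stream hidden behind illustrative interfaces; only the upsert-before-XADD order is the documented contract.

```go
package healthevents

import (
	"context"
	"fmt"
)

// snapshotStore and streamAppender are illustrative stand-ins for the real
// snapshot-store port and the Redis client.
type snapshotStore interface {
	Upsert(ctx context.Context, gameID, status string, fields map[string]string) error
}

type streamAppender interface {
	XAdd(ctx context.Context, stream string, fields map[string]string) error
}

// publish sketches the rule: snapshot upsert first, stream XADD second, so a
// partial failure can only leave the snapshot ahead of (never behind) the stream.
func publish(ctx context.Context, snaps snapshotStore, stream streamAppender, gameID, status string, fields map[string]string) error {
	if err := snaps.Upsert(ctx, gameID, status, fields); err != nil {
		return fmt.Errorf("snapshot upsert: %w", err)
	}
	if err := stream.XAdd(ctx, "runtime:health_events", fields); err != nil {
		return fmt.Errorf("xadd health event: %w", err)
	}
	return nil
}
```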
## 9. Unit-test strategy

Both HTTP-backed adapters (Docker SDK, Lobby client) use `httptest.Server` fixtures. The Docker SDK speaks HTTP under the hood for both unix sockets and TCP, so adapter unit tests construct a Docker client with `client.WithHost(server.URL)` and `client.WithHTTPClient(server.Client())`, which lets table-driven handlers fake every Docker API endpoint without touching the real daemon. The Docker API version is pinned to `1.45` (`client.WithVersion("1.45")`) so the URL prefix is stable across CI machines whose daemon advertises a different default. Production wiring (in `internal/app/bootstrap.go`) keeps API negotiation enabled.
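The fixture shape this paragraph describes, sketched as a test helper; the handler contents are left to the individual table-driven test.

```go
package dockertest

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/docker/docker/client"
)

// newFakeDockerClient points a Docker SDK client at an httptest server and pins
// the API version so the /v1.45/... request paths the handler sees are stable.
func newFakeDockerClient(t *testing.T, handler http.Handler) *client.Client {
	t.Helper()
	srv := httptest.NewServer(handler)
	t.Cleanup(srv.Close)

	cli, err := client.NewClientWithOpts(
		client.WithHost(srv.URL),
		client.WithHTTPClient(srv.Client()),
		client.WithVersion("1.45"),
	)
	if err != nil {
		t.Fatalf("new docker client: %v", err)
	}
	return cli
}
```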
The notification publisher uses `miniredis` directly because the adapter's only side effect is an `XADD`, which `miniredis` reproduces faithfully and matches every other Galaxy intent test.

## 10. Docker smoke test

`internal/adapters/docker/smoke_test.go` runs on the default `go test ./...` invocation and calls `t.Skip` unless the local daemon is reachable (`/var/run/docker.sock` exists or `DOCKER_HOST` is set). The covered sequence:

1. provision a temporary user-defined bridge network;
2. assert `EnsureNetwork` for present and missing names;
3. pull `alpine:3.21` (`PullPolicyIfMissing`);
4. subscribe to events;
5. run a sleep container with the full `RunSpec` field set;
6. observe a `start` event for the new container id;
7. inspect, stop, remove, and verify `ErrContainerNotFound` is reported afterwards.

This is the production adapter's only end-to-end check that runs from the default `go test` pass; the broader service-local integration suite ([`integration-tests.md`](integration-tests.md)) is gated behind `-tags=integration`.
@@ -0,0 +1,167 @@
# Domain and Ports

This document explains why the `rtmanager` domain layer ([`../internal/domain/`](../internal/domain)) and the port interfaces ([`../internal/ports/`](../internal/ports)) are shaped the way they are. The current-state types and method signatures are the source of truth in the code; this file records the rationale so future readers do not re-litigate the same trade-offs.

For the surrounding behaviour see [`../README.md`](../README.md), the SQL CHECK constraints in [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql), the wire contracts under [`../api/`](../api), and [`postgres-migration.md`](postgres-migration.md) for the persistence layer.

## 1. String-typed status enums

`runtime.Status`, `operation.OpKind`, `operation.OpSource`, `operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and `health.SnapshotSource` are all `type X string`.

The string approach wins on three counts:

- the SQL CHECK constraints already store the values as `text`, so a string domain type maps one-to-one with no codec layer;
- it matches Lobby (`game.Status`, `membership.Status`, `application.Status`), so reviewers do not switch encoding mental models when crossing service boundaries;
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8` with stringer-generated names would pay a constant lookup and make raw SQL columns harder to read in diagnostics.
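A sketch of the pattern for `runtime.Status`, using the `running` / `stopped` / `removed` values from §8; the constant names are assumptions.

```go
package runtime

// Status is the string-typed enum described above; it maps one-to-one onto the
// text column guarded by the SQL CHECK constraint.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown keeps the validity check a single switch.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	default:
		return false
	}
}
```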
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`

The PostgreSQL columns are nullable. The domain model uses plain `string` with empty == NULL and bridges the SQL nullability inside the adapter. Pointer fields would force every consumer to dereference defensively even though business logic rarely cares about the NULL/empty distinction (removed records may legitimately carry either form depending on whether the record passed through `stopped` first).

The adapter's job is to translate `sql.NullString` ⇄ `string`; the rest of the codebase reads the field as a regular value.

## 3. `*time.Time` for nullable timestamps

`StartedAt`, `StoppedAt`, `RemovedAt` retain pointer types. `time.Time{}` is a real, comparable value in Go (`IsZero` only reports the canonical zero time); mixing "missing" and "set to UTC zero" through plain `time.Time` would invite bugs. The jet-generated `model.RuntimeRecords` already declares the same fields as `*time.Time`, so the domain type aligns with the persistence type and the adapter does not re-shape pointers.

## 4. `EventType` and `SnapshotStatus` are deliberately distinct

`runtime-health-asyncapi.yaml.EventType` enumerates seven values; the SQL CHECK on `health_snapshots.status` enumerates six. The two sets overlap but are not identical:

- `container_started` is an *event*; the snapshot collapses it to `healthy` (a successful start is observed as the container being live, not as an ongoing event);
- `probe_recovered` is an *event*; it does not become a snapshot row of its own — the next inspect/probe overwrites the prior `probe_failed` with `healthy`.

Modelling them as one shared enum would require a separate "event vs snapshot" boolean and invite accidental mismatches. Two distinct types with explicit `IsKnown` matrices keep each surface honest at compile time.

## 5. `Inspect` split into `InspectImage` + `InspectContainer`

Two narrow methods replace a single polymorphic `Inspect`. The surface RTM exercises has two shapes:

- the start service inspects the *image* by reference to read resource limits from labels;
- the periodic inspect worker, the reconciler, and the events listener inspect *containers* by id to read state, health, restart count, and exit code.

The inputs differ (ref vs id), and the result types differ (`ImageInspect.Labels` is the only field used at start time, while `ContainerInspect` carries a dozen state fields). One polymorphic method would either split internally on input type or return a tagged union; either is messier than two narrow methods.

## 6. `LobbyGameRecord` is intentionally minimal

`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and `TargetEngineVersion`. The fetch is classified as ancillary diagnostics because the start envelope already carries the only required field (`image_ref`).

Anything more would invite RTM consumers to depend on Lobby's schema in ways that violate the "RTM never resolves engine versions" rule. Future fields are additive: each new field is opt-in to the consumer and does not break existing call sites. The minimalism is also a hedge against schema drift — Lobby's `GameRecord` is large and changes more often than RTM needs to track.

## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`

Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so business workflows that key on it (idempotency keys, audit correlation) can capture it. RTM publishes admin-only failure intents where the entry id has no consumer — failing starts do not loop back to RTM, and notification routing keys on the producer-supplied `idempotency_key` rather than the stream id. The adapter wraps `pkg/notificationintent.Publisher` and discards the entry id at the wrapper boundary.

## 8. Exactly four allowed runtime transitions

`runtime.AllowedTransitions` covers:

- `running → stopped` — graceful stop, observed exit, reconcile observed exited;
- `running → removed` — `reconcile_dispose` when the container vanished;
- `stopped → running` — restart and patch inner start;
- `stopped → removed` — cleanup TTL or admin DELETE.

Other pairs are intentionally rejected:

- `running → running` and `stopped → stopped` would mean Upsert overwrote state without a CAS guard. Idempotent re-start / re-stop never transitions; the service layer returns `replay_no_op` and the record is left untouched.
- `removed → *` is forbidden because `removed` is terminal. The reconciler creates fresh records with `reconcile_adopt` rather than resurrecting old ones.

Encoding the table this way means a future bug where a service tries to revive a removed record is rejected at the domain layer rather than the adapter, which keeps the failure mode close to the offending code.
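One way the four-pair table could be encoded, reusing the `Status` values sketched under §1; the map shape and helper name are assumptions, while the pairs themselves are the documented contract.

```go
// allowedTransitions encodes the four documented pairs. StatusRemoved is
// terminal and has no outgoing transitions.
var allowedTransitions = map[Status]map[Status]bool{
	StatusRunning: {StatusStopped: true, StatusRemoved: true},
	StatusStopped: {StatusRunning: true, StatusRemoved: true},
}

// CanTransition reports whether from → to is one of the four allowed pairs;
// everything else, including same-state "transitions" and any move out of
// removed, is rejected at the domain layer.
func CanTransition(from, to Status) bool {
	return allowedTransitions[from][to]
}
```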
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`

The same enum exists as `config.ImagePullPolicy`. Importing `internal/config` from the ports package would couple two unrelated layers and create a cyclic risk once the wiring layer pulls both in. The runtime/wiring layer (in `internal/app`) is the single point that translates between the two type aliases — both are `string`-typed, the value sets are identical, and the validation lives on each side independently.

## 10. Compile-time interface assertions live with adapters

Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the assertion lives in the adapter package (e.g. `var _ ports.RuntimeRecordStore = (*Store)(nil)` inside `internal/adapters/postgres/runtimerecordstore`). Putting the assertions in the port package would force the port package to import its own implementations and create an obvious import cycle.

## 11. `RunSpec.Validate` lives on the request type

The Docker port carries a non-trivial request type (`RunSpec`) with eight required fields and per-mount invariants. Putting `Validate` on the request struct keeps the rule next to the type definition, mirrors the pattern used by `lobby/internal/ports/gmclient.go` (`RegisterGameRequest.Validate`), and lets the adapter call it as the first defensive check before invoking the Docker SDK.
@@ -0,0 +1,429 @@
# Configuration And Contract Examples

The examples below are illustrative. Replace `localhost`, port numbers, IDs, and timestamps with values that match the deployment under inspection.

## Example `.env`

A minimum-viable `RTMANAGER_*` set for a local run against a single Redis container plus a PostgreSQL container with the `rtmanager` schema and the `rtmanagerservice` role provisioned. The full list with defaults lives in [`../README.md` §Configuration](../README.md).

```bash
# Required
RTMANAGER_INTERNAL_HTTP_ADDR=:8096
RTMANAGER_POSTGRES_PRIMARY_DSN=postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable
RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
RTMANAGER_REDIS_PASSWORD=local
RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
RTMANAGER_DOCKER_NETWORK=galaxy-net
RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games

# Lobby internal client (diagnostic GET only in v1)
RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
RTMANAGER_LOBBY_INTERNAL_TIMEOUT=2s

# Container defaults (image labels override these per container)
RTMANAGER_DEFAULT_CPU_QUOTA=1.0
RTMANAGER_DEFAULT_MEMORY=512m
RTMANAGER_DEFAULT_PIDS_LIMIT=512
RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS=30
RTMANAGER_CONTAINER_RETENTION_DAYS=30
RTMANAGER_ENGINE_STATE_MOUNT_PATH=/var/lib/galaxy-game
RTMANAGER_ENGINE_STATE_ENV_NAME=GAME_STATE_PATH
RTMANAGER_GAME_STATE_DIR_MODE=0750
RTMANAGER_GAME_STATE_OWNER_UID=0
RTMANAGER_GAME_STATE_OWNER_GID=0

# Workers
RTMANAGER_INSPECT_INTERVAL=30s
RTMANAGER_PROBE_INTERVAL=15s
RTMANAGER_PROBE_TIMEOUT=2s
RTMANAGER_PROBE_FAILURES_THRESHOLD=3
RTMANAGER_RECONCILE_INTERVAL=5m
RTMANAGER_CLEANUP_INTERVAL=1h

# Coordination
RTMANAGER_GAME_LEASE_TTL_SECONDS=60

# Process and logging
RTMANAGER_LOG_LEVEL=info
RTMANAGER_SHUTDOWN_TIMEOUT=30s

# Telemetry (disabled for local dev — enable to ship traces / metrics)
OTEL_SERVICE_NAME=galaxy-rtmanager
OTEL_TRACES_EXPORTER=none
OTEL_METRICS_EXPORTER=none
```

For a production-shaped deployment, set `RTMANAGER_IMAGE_PULL_POLICY=always` (forces a pull on every start so a tag mutation is immediately visible to the next runtime), `RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` to match the engine container's user, and configure `OTEL_*` against the cluster's OTLP collector. The `RTMANAGER_DOCKER_LOG_DRIVER` / `RTMANAGER_DOCKER_LOG_OPTS` pair routes engine stdout/stderr to the sink the operator runs (fluentd, journald, etc.).

For tests, point `RTMANAGER_POSTGRES_PRIMARY_DSN` and `RTMANAGER_REDIS_MASTER_ADDR` at the testcontainers fixtures the service-local harness brings up ([`integration-tests.md` §7](integration-tests.md)).

## Internal HTTP Examples

Every endpoint admits the optional `X-Galaxy-Caller` header which the handler records as `op_source` in `operation_log` (`gm` → `gm_rest`, `admin` → `admin_rest`; missing or unknown values default to `admin_rest` in v1). Decision: [`services.md` §18](services.md).

### Probe a runtime record

```bash
curl -s -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ...
```

Response (`200 OK`):

```json
{
  "game_id": "game-01HZ...",
  "status": "running",
  "current_container_id": "1f2a...",
  "current_image_ref": "galaxy/game:1.4.0",
  "engine_endpoint": "http://galaxy-game-game-01HZ...:8080",
  "state_path": "/var/lib/galaxy/games/game-01HZ...",
  "docker_network": "galaxy-net",
  "started_at": "2026-04-28T07:18:54Z",
  "stopped_at": null,
  "removed_at": null,
  "last_op_at": "2026-04-28T07:18:54Z",
  "created_at": "2026-04-28T07:18:54Z"
}
```

### List all runtimes

```bash
curl -s -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes
```

The response shape is `{"items":[<RuntimeRecord>...]}`.

### Start a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: gm' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../start \
  -d '{"image_ref": "galaxy/game:1.4.0"}'
```

A `200` returns the `RuntimeRecord` for the running runtime. Failure shapes use the canonical envelope; e.g. an invalid `image_ref`:

```json
{
  "error": {
    "code": "start_config_invalid",
    "message": "image_ref shape rejected by docker reference parser"
  }
}
```

### Stop a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../stop \
  -d '{"reason": "admin_request"}'
```

Valid `reason` values: `orphan_cleanup | cancelled | finished | admin_request | timeout`.

### Restart a runtime

```bash
curl -s -X POST \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../restart
```

The body is empty; restart re-uses the current `image_ref`.

### Patch a runtime

```bash
curl -s -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../patch \
  -d '{"image_ref": "galaxy/game:1.4.2"}'
```

Patch enforces the semver-only rule: a non-semver tag returns `image_ref_not_semver`; a cross-major or cross-minor change returns `semver_patch_only`.
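A sketch of the gate those two error codes enforce, using `golang.org/x/mod/semver` as an illustrative parser; the service's actual parsing library, function names, and error plumbing may differ. The inputs are the tag fragments of the Docker references (e.g. `1.4.2` from `galaxy/game:1.4.2`).

```go
package patchgate

import (
	"errors"

	"golang.org/x/mod/semver"
)

var (
	ErrImageRefNotSemver = errors.New("image_ref_not_semver")
	ErrSemverPatchOnly   = errors.New("semver_patch_only")
)

// checkPatch enforces: both tags must be semver, and only the patch component
// may differ between the current and the new tag.
func checkPatch(currentTag, newTag string) error {
	cur, next := "v"+currentTag, "v"+newTag // x/mod/semver expects a leading "v"
	if !semver.IsValid(cur) || !semver.IsValid(next) {
		return ErrImageRefNotSemver
	}
	if semver.MajorMinor(cur) != semver.MajorMinor(next) {
		return ErrSemverPatchOnly
	}
	return nil // same major.minor; patch may differ or be equal
}
```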
### Cleanup a stopped runtime container

```bash
curl -s -X DELETE \
  -H 'X-Galaxy-Caller: admin' \
  http://localhost:8096/api/v1/internal/runtimes/game-01HZ.../container
```

Cleanup refuses a `running` runtime with `409 conflict`; stop first.

## Stream Payload Examples

Every stream key shape is configurable via `RTMANAGER_REDIS_*_STREAM`; the defaults are used below. Field types and required/optional semantics are frozen by [`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml) and [`../api/runtime-health-asyncapi.yaml`](../api/runtime-health-asyncapi.yaml).

### `runtime:start_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:start_jobs '*' \
  game_id 'game-01HZ...' \
  image_ref 'galaxy/game:1.4.0' \
  requested_at_ms 1714081234567
```

### `runtime:stop_jobs` (Lobby → RTM)

```bash
redis-cli XADD runtime:stop_jobs '*' \
  game_id 'game-01HZ...' \
  reason 'cancelled' \
  requested_at_ms 1714081234567
```

### `runtime:job_results` (RTM → Lobby)

Success envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code '' \
  error_message ''
```

Failure envelope:

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'failure' \
  container_id '' \
  engine_endpoint '' \
  error_code 'image_pull_failed' \
  error_message 'pull failed: manifest unknown'
```

Idempotent replay envelope (success outcome with explicit `replay_no_op`):

```bash
redis-cli XADD runtime:job_results '*' \
  game_id 'game-01HZ...' \
  outcome 'success' \
  container_id '1f2a...' \
  engine_endpoint 'http://galaxy-game-game-01HZ...:8080' \
  error_code 'replay_no_op' \
  error_message ''
```

The contract permits empty `container_id` and `engine_endpoint` strings on every value of `outcome` so the consumer can decode the envelope uniformly ([`workers.md` §11](workers.md)).

### `runtime:health_events` (RTM out)

The wire shape is the same for every event type — only the `details` payload differs.

`container_started`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_started' \
  occurred_at_ms 1714081234567 \
  details '{"image_ref":"galaxy/game:1.4.0"}'
```

`container_exited`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_exited' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137,"oom":false}'
```

`container_oom`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_oom' \
  occurred_at_ms 1714081234567 \
  details '{"exit_code":137}'
```

`container_disappeared`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'container_disappeared' \
  occurred_at_ms 1714081234567 \
  details '{}'
```

`inspect_unhealthy`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'inspect_unhealthy' \
  occurred_at_ms 1714081234567 \
  details '{"restart_count":3,"state":"running","health":"unhealthy"}'
```

`probe_failed` (after the threshold is crossed):

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_failed' \
  occurred_at_ms 1714081234567 \
  details '{"consecutive_failures":3,"last_status":0,"last_error":"context deadline exceeded"}'
```

`probe_recovered`:

```bash
redis-cli XADD runtime:health_events '*' \
  game_id 'game-01HZ...' \
  container_id '1f2a...' \
  event_type 'probe_recovered' \
  occurred_at_ms 1714081234567 \
  details '{"prior_failure_count":3}'
```

### `notification:intents` (RTM admin notifications)

RTM publishes admin-only notification intents only for the three first-touch start failures. Every payload shares the frozen field set `{game_id, image_ref, error_code, error_message, attempted_at_ms}` ([`../README.md` §Notification Contracts](../README.md#notification-contracts)).

`runtime.image_pull_failed`:

```bash
redis-cli XADD notification:intents '*' \
  envelope '{
    "type": "runtime.image_pull_failed",
    "producer": "rtmanager",
    "idempotency_key": "runtime.image_pull_failed:game-01HZ...:1714081234567",
    "audience": {"kind": "admin_email", "email_address_kind": "runtime_image_pull_failed"},
    "payload": {
      "game_id": "game-01HZ...",
      "image_ref": "galaxy/game:1.4.0",
      "error_code": "image_pull_failed",
      "error_message": "pull failed: manifest unknown",
      "attempted_at_ms": 1714081234567
    }
  }'
```

`runtime.container_start_failed` and `runtime.start_config_invalid` share the same envelope with their respective `type` and `error_code` values.

## Storage Inspection

### Inspect a runtime record (PostgreSQL)

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT * FROM rtmanager.runtime_records WHERE game_id = 'game-01HZ...'"
```

Columns mirror the fields documented in [`../README.md` §Persistence Layout](../README.md#persistence-layout).

### Inspect runtime status counts

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
```

### Inspect the operation log for a game

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT id, op_kind, op_source, outcome, error_code,
          started_at, finished_at
     FROM rtmanager.operation_log
    WHERE game_id = 'game-01HZ...'
    ORDER BY started_at DESC, id DESC
    LIMIT 50"
```

### Inspect the latest health snapshot

```bash
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
  "SELECT game_id, container_id, status, source, observed_at, details
     FROM rtmanager.health_snapshots
    WHERE game_id = 'game-01HZ...'"
```

### Inspect Redis runtime-coordination keys

```bash
# Stream offsets
redis-cli GET rtmanager:stream_offsets:startjobs
redis-cli GET rtmanager:stream_offsets:stopjobs

# Per-game lease (only present while an operation is in flight)
redis-cli GET rtmanager:game_lease:game-01HZ...
redis-cli TTL rtmanager:game_lease:game-01HZ...

# Recent stream entries
redis-cli XRANGE runtime:start_jobs - + COUNT 20
redis-cli XRANGE runtime:job_results - + COUNT 20
redis-cli XRANGE runtime:health_events - + COUNT 50

# Stream metadata
redis-cli XINFO STREAM runtime:start_jobs
redis-cli XINFO STREAM runtime:stop_jobs
redis-cli XINFO STREAM runtime:health_events
```
@@ -0,0 +1,305 @@
# Flows

This document collects the lifecycle and observability flows that span Runtime Manager and its synchronous and asynchronous neighbours. Narrative descriptions of the rules these flows enforce live in [`../README.md`](../README.md); the diagrams here focus on the message order across the boundary. Design-rationale records linked from each section explain the *why*.

## Start (happy path)

```mermaid
sequenceDiagram
    participant Lobby as Lobby publisher
    participant Stream as runtime:start_jobs
    participant Consumer as startjobsconsumer
    participant Service as startruntime
    participant Lease as Redis lease
    participant Docker
    participant PG as Postgres
    participant Health as runtime:health_events
    participant Results as runtime:job_results

    Lobby->>Stream: XADD {game_id, image_ref, requested_at_ms}
    Consumer->>Stream: XREAD
    Consumer->>Service: Handle(game_id, image_ref, OpSourceLobbyStream, entry_id)
    Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
    Service->>PG: SELECT runtime_records WHERE game_id
    Service->>Docker: PullImage(image_ref) per pull policy
    Service->>Docker: InspectImage → resource limits
    Service->>Service: prepareStateDir(<root>/{game_id})
    Service->>Docker: ContainerCreate + ContainerStart
    Service->>PG: Upsert runtime_records (status=running)
    Service->>PG: INSERT operation_log (op_kind=start, outcome=success)
    Service->>Health: XADD container_started
    Service-->>Consumer: Result{Outcome=success, ContainerID, EngineEndpoint}
    Consumer->>Results: XADD {outcome=success, container_id, engine_endpoint}
    Service->>Lease: DEL rtmanager:game_lease:{game_id}
```

REST callers (Game Master, Admin Service) drive the same service through `POST /api/v1/internal/runtimes/{game_id}/start`; the diagram's last two arrows collapse to an HTTP `200` response carrying the runtime record. Sources: [`../README.md` §Lifecycles → Start](../README.md#start), [`services.md` §3](services.md).

## Start failure (image pull)

```mermaid
sequenceDiagram
    participant Service as startruntime
    participant Docker
    participant PG as Postgres
    participant Intents as notification:intents
    participant Results as runtime:job_results

    Service->>Docker: PullImage(image_ref)
    Docker-->>Service: error
    Service->>PG: INSERT operation_log (op_kind=start, outcome=failure, error_code=image_pull_failed)
    Service->>Intents: XADD runtime.image_pull_failed {game_id, image_ref, error_code, error_message, attempted_at_ms}
    Service-->>Service: Result{Outcome=failure, ErrorCode=image_pull_failed}
    Service->>Results: XADD {outcome=failure, error_code=image_pull_failed}
```

The same shape applies to the configuration-validation failures (`start_config_invalid` from `EnsureNetwork(ErrNetworkMissing)`, `prepareStateDir`, or invalid `image_ref` shape) and the Docker create/start failure (`container_start_failed`); only the error code and the matching `runtime.*` notification type differ. Three failure codes do **not** raise an admin notification: `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)).

## Start failure (orphan / Upsert-after-Run rollback)

```mermaid
sequenceDiagram
    participant Service as startruntime
    participant Docker
    participant PG as Postgres
    participant Intents as notification:intents

    Service->>Docker: ContainerCreate + ContainerStart
    Docker-->>Service: container running
    Service->>PG: Upsert runtime_records
    PG-->>Service: error (transport / constraint)
    Note over Service: container is now an orphan<br/>(running, no PG record)
    Service->>Docker: Remove(container_id) [fresh background context]
    Docker-->>Service: ok or logged failure
    Service->>PG: INSERT operation_log (outcome=failure, error_code=container_start_failed)
    Service->>Intents: XADD runtime.container_start_failed
    Service-->>Service: Result{Outcome=failure, ErrorCode=container_start_failed}
```

The Docker adapter already removes the container when `Run` itself fails after a successful `ContainerCreate` ([`adapters.md` §3](adapters.md)); the start service adds the post-`Run` rollback for the `Upsert` path. A `Remove` failure is logged but not propagated; the reconciler adopts surviving orphans on its periodic pass ([`services.md` §5](services.md)).

## Stop

```mermaid
sequenceDiagram
    participant Caller as Lobby / GM / Admin
    participant Service as stopruntime
    participant Lease as Redis lease
    participant PG as Postgres
    participant Docker
    participant Results as runtime:job_results

    Caller->>Service: stop(game_id, reason)
    Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
    Service->>PG: SELECT runtime_records WHERE game_id
    alt status in {stopped, removed}
        Service->>PG: INSERT operation_log (outcome=success, error_code=replay_no_op)
        Service-->>Caller: success / replay_no_op
    else status = running
        Service->>Docker: ContainerStop(container_id, RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS)
        Docker-->>Service: ok
        Service->>PG: UpdateStatus running→stopped (CAS by container_id)
        Service->>PG: INSERT operation_log (op_kind=stop, outcome=success)
        Service-->>Caller: success
    end
    Service->>Lease: DEL rtmanager:game_lease:{game_id}
```

Lobby callers receive the outcome through `runtime:job_results`; REST callers receive an HTTP `200`. The `reason` enum (`orphan_cleanup | cancelled | finished | admin_request | timeout`) is recorded in `operation_log` and is otherwise opaque to the stop service — RTM does not branch on the reason in v1 ([`services.md` §15, §17](services.md)).

## Restart

```mermaid
sequenceDiagram
    participant Admin as GM / Admin
    participant Service as restartruntime
    participant Stop as stopruntime.Run
    participant Start as startruntime.Run
    participant Docker
    participant PG as Postgres

    Admin->>Service: POST /restart
    Service->>PG: SELECT runtime_records WHERE game_id
    Note over Service: capture current image_ref
    Service->>Service: acquire per-game lease (held across both inner ops)
    Service->>Stop: Run(game_id) [lease bypass]
    Stop->>Docker: ContainerStop
    Stop->>PG: UpdateStatus running→stopped
    Service->>Docker: ContainerRemove
    Service->>Start: Run(game_id, image_ref) [lease bypass]
    Start->>Docker: PullImage / Run
    Start->>PG: Upsert runtime_records (status=running)
    Service->>PG: INSERT operation_log (op_kind=restart, outcome=success, source_ref=correlation_id)
    Service-->>Admin: 200 {runtime_record}
    Service->>Service: release lease
```

The lease is acquired by `restartruntime` and held across both inner operations; `stopruntime.Run` and `startruntime.Run` are lease-bypass entry points that skip the inner lease acquisition ([`services.md` §12](services.md)). The single `operation_log` row uses `Input.SourceRef` as a correlation id linking the implicit stop and start entries ([`services.md` §13](services.md)).

## Patch

```mermaid
sequenceDiagram
    participant Admin as GM / Admin
    participant Service as patchruntime
    participant Restart as restartruntime.Run

    Admin->>Service: POST /patch {image_ref: "galaxy/game:1.4.2"}
    Service->>Service: parse new image_ref + current image_ref
    alt either ref not semver
        Service-->>Admin: 422 image_ref_not_semver
    else major or minor differ
        Service-->>Admin: 422 semver_patch_only
    else major.minor match, patch differs (or equal)
        Service->>Restart: Run(game_id, new_image_ref)
        Restart-->>Service: Result
        Service-->>Admin: 200 {runtime_record}
    end
```

The semver gate uses the tag fragment of the Docker reference; the extraction strategy is recorded in [`services.md` §14](services.md). The restart delegate already owns the lease, the inner stop/start, the operation log, and the `runtime:health_events container_started` emission ([`workers.md` §1](workers.md)).

## Cleanup TTL

```mermaid
sequenceDiagram
    participant Worker as containercleanup worker
    participant PG as Postgres
    participant Service as cleanupcontainer
    participant Lease as Redis lease
    participant Docker

    loop every RTMANAGER_CLEANUP_INTERVAL
        Worker->>PG: SELECT runtime_records WHERE status='stopped' AND last_op_at < now - retention
        loop per game
            Worker->>Service: cleanup(game_id, op_source=auto_ttl)
            Service->>Lease: SET NX PX rtmanager:game_lease:{game_id}
            Service->>PG: re-read runtime_records WHERE game_id
            alt status = running
                Service-->>Worker: refused / conflict
            else status in {stopped, removed}
                Service->>Docker: ContainerRemove(container_id)
                Service->>PG: UpdateStatus stopped→removed (CAS)
                Service->>PG: INSERT operation_log (op_kind=cleanup_container)
                Service-->>Worker: success
            end
            Service->>Lease: DEL rtmanager:game_lease:{game_id}
        end
    end
```

Admin-driven cleanup follows the same path through `DELETE /api/v1/internal/runtimes/{game_id}/container` with `op_source=admin_rest` instead of `auto_ttl`. The host state directory is **never** removed by this flow ([`../README.md` §Cleanup](../README.md#cleanup), [`services.md` §17](services.md), [`workers.md` §19](workers.md)).

## Reconcile drift adopt

```mermaid
sequenceDiagram
    participant Reconciler as reconcile worker
    participant Docker
    participant PG as Postgres
    participant Lease as Redis lease

    Note over Reconciler: read pass (lockless)
    Reconciler->>Docker: List({label=com.galaxy.owner=rtmanager})
    Reconciler->>PG: ListByStatus(running)
    Note over Reconciler: write pass (per-game lease)
    loop per Docker container without matching record
        Reconciler->>Lease: SET NX PX rtmanager:game_lease:{game_id}
        Reconciler->>PG: re-read runtime_records WHERE game_id
        alt record now exists
            Reconciler-->>Reconciler: skip (state changed since read pass)
        else record still missing
            Reconciler->>PG: Upsert runtime_records (status=running, image_ref, started_at)
            Reconciler->>PG: INSERT operation_log (op_kind=reconcile_adopt, op_source=auto_reconcile)
        end
        Reconciler->>Lease: DEL rtmanager:game_lease:{game_id}
    end
```

The reconciler **never** stops or removes an unrecorded container — operators may have started one manually for diagnostics. The `reconcile_dispose` and `observed_exited` paths follow the same read-pass / write-pass split, with `dispose` updating the orphaned record to `removed` and emitting `container_disappeared`, and `observed_exited` updating to `stopped` and emitting `container_exited` ([`../README.md` §Reconciliation](../README.md#reconciliation), [`workers.md` §14–§16](workers.md)).

## Health probe hysteresis

```mermaid
sequenceDiagram
    participant Worker as healthprobe worker
    participant State as in-memory probe state
    participant Engine as galaxy-game-{id}:8080
    participant Health as runtime:health_events

    loop every RTMANAGER_PROBE_INTERVAL
        Worker->>Worker: ListByStatus(running)
        Worker->>State: prune entries for games no longer running
        loop per game (semaphore cap = 16)
            Worker->>Engine: GET /healthz (RTMANAGER_PROBE_TIMEOUT)
            alt success
                State->>State: consecutiveFailures = 0
                opt failurePublished was true
                    Worker->>Health: XADD probe_recovered {prior_failure_count}
                    State->>State: failurePublished = false
                end
            else failure
                State->>State: consecutiveFailures++
                opt consecutiveFailures == RTMANAGER_PROBE_FAILURES_THRESHOLD AND not failurePublished
                    Worker->>Health: XADD probe_failed {consecutive_failures, last_status, last_error}
                    State->>State: failurePublished = true
                end
            end
        end
    end
```

Hysteresis prevents a single transient failure from emitting a `probe_failed` event, and prevents repeated emission while the failure persists. State is non-persistent: a process restart re-establishes the counters from scratch; a game's state is pruned when it transitions out of the running list ([`workers.md` §5–§6](workers.md)).
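The hysteresis bookkeeping in the diagram above, reduced to a per-game state sketch; field and function names are illustrative, not the worker's identifiers.

```go
package healthprobe

// probeState holds the per-game counters the worker keeps in memory between ticks.
type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

// observe applies one probe result and reports which event, if any, should be
// emitted: "probe_failed" once when the threshold is first crossed, and
// "probe_recovered" once when a success follows a published failure.
func (s *probeState) observe(success bool, threshold int) (event string) {
	if success {
		s.consecutiveFailures = 0
		if s.failurePublished {
			s.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	s.consecutiveFailures++
	if s.consecutiveFailures >= threshold && !s.failurePublished {
		s.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```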
@@ -0,0 +1,163 @@
# Service-Local Integration Suite

This document explains the design of the service-local integration suite under [`../integration/`](../integration). The current-state behaviour (harness layout, env knobs, scenario coverage) lives next to the files themselves; this document records the rationale.

The cross-service Lobby↔RTM suite at [`../../integration/lobbyrtm/`](../../integration/lobbyrtm) follows different rules (it lives in the top-level `galaxy/integration` module) and is documented inside that package.

## 1. Build tag `integration`

The scenarios under [`../integration/*_test.go`](../integration) are guarded by `//go:build integration`. The default `go test ./...` invocation skips them, while `go test -tags=integration ./integration/...` (and the `make integration` target) runs the full set:

```sh
make -C rtmanager integration
```

The harness package itself ([`../integration/harness`](../integration/harness)) has no build tag. It compiles on every run because each helper guards its Docker-dependent paths with `t.Skip` when the daemon is unavailable. This keeps the harness loadable from a tagless `go vet` or IDE workflow without dragging Docker into the default `go test` critical path.
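A minimal sketch of a guarded scenario-file header; the package name and test body are illustrative.

```go
//go:build integration

package integration

import "testing"

// The build constraint on the first line is what keeps this file out of the
// default `go test ./...` pass; it compiles only with -tags=integration.
func TestStartStopLifecycle(t *testing.T) {
	t.Log("runs only under `go test -tags=integration ./integration/...`")
}
```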
## 2. Smoke test runs in the default `go test` pass

[`../internal/adapters/docker/smoke_test.go`](../internal/adapters/docker/smoke_test.go) runs in the regular `go test ./...` pass and skips itself via `skipUnlessDockerAvailable` when no Docker socket is present. The smoke test is intentionally kept separate from the new `integration/` suite because it exercises the production adapter shape (one container at a time against `alpine:3.21`), not the full runtime; both surfaces are useful.

## 3. In-process `app.NewRuntime` instead of a `cmd/rtmanager` subprocess

The harness drives Runtime Manager through `app.NewRuntime(ctx, cfg, logger)` directly rather than spawning the binary from `cmd/rtmanager/main.go`:

- **Cleanup is deterministic.** A `t.Cleanup` block can `cancel()` the runtime context and call `runtime.Close()`; the goroutine driving `runtime.Run` returns with `context.Canceled` and the helper waits on it via the `runDone` channel. With a subprocess the equivalent dance requires SIGTERM, output capture, and graceful shutdown timing tied to the child's signal handler.
- **Goroutine and store visibility.** Tests read the durable PG state directly through the harness-owned pool and read every Redis stream through the harness-owned client. Both observe the exact wire shape Lobby will see in the cross-service suite.
- **Logger isolation.** The harness defaults to `slog.Discard` so the default test output stays focused on assertions; flipping `EnvOptions.LogToStderr` lights up the runtime's structured logs for local debugging without requiring any subprocess plumbing.

The cross-service inter-process suite at `integration/lobbyrtm/` re-uses the existing `integration/internal/harness` binary-spawn helpers; the in-process choice here is specific to the service-local scope.
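A sketch of the deterministic-cleanup shape, assuming a harness helper that owns the runtime lifecycle; the helper name, the `app` import path, and the exact call order are illustrative, while `app.NewRuntime`, `runtime.Run`, `runtime.Close`, and the `runDone` channel come from the description above.

```go
package harness

import (
	"context"
	"log/slog"
	"testing"

	"galaxy/rtmanager/internal/app" // import path is an assumption
)

// StartRuntime boots the in-process runtime and registers a cleanup that
// cancels it, closes it, and waits for the run goroutine to return.
func StartRuntime(t *testing.T, cfg app.Config, logger *slog.Logger) {
	t.Helper()

	ctx, cancel := context.WithCancel(context.Background())
	rt, err := app.NewRuntime(ctx, cfg, logger)
	if err != nil {
		t.Fatalf("new runtime: %v", err)
	}

	runDone := make(chan error, 1)
	go func() { runDone <- rt.Run(ctx) }()

	t.Cleanup(func() {
		cancel()       // the goroutine driving rt.Run returns with context.Canceled
		_ = rt.Close() // release pools, clients, listeners
		<-runDone      // wait for the run goroutine before the test exits
	})
}
```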
## 4. `httptest.Server` stub for the Lobby internal client
|
||||
|
||||
Runtime Manager configuration requires a non-empty
|
||||
`RTMANAGER_LOBBY_INTERNAL_BASE_URL`, and the start service makes a
|
||||
diagnostic `GET /api/v1/internal/games/{game_id}` call that v1 treats
|
||||
as a no-op (the start envelope already carries the only required
|
||||
field, `image_ref`; rationale in [`services.md`](services.md) §7).
|
||||
The harness therefore stands up a tiny `httptest.Server` per test
|
||||
that returns a stable `200 OK` response. The stub is intentionally
|
||||
unconfigurable: every integration scenario produces the same
|
||||
ancillary fetch, and adding routing/error injection would invite
|
||||
test code to depend on a contract the start service deliberately
|
||||
ignores.
|
||||
|
||||
## 5. One built engine image, two semver-compatible tags
|
||||
|
||||
The patch lifecycle expects the new and current image refs to share
|
||||
the same major / minor version (`semver_patch_only` failure
|
||||
otherwise). Building two distinct images would multiply the per-run
|
||||
build cost without changing what the test verifies — the patch path
|
||||
exercises `image_ref_not_semver` and `semver_patch_only` validation
|
||||
plus the recreate-with-new-tag flow, none of which depend on
|
||||
distinct image *content*. The harness builds the engine once and
|
||||
calls `client.ImageTag` to alias it as both `galaxy/game:1.0.0-rtm-it`
|
||||
and `galaxy/game:1.0.1-rtm-it`. Both share the same digest.
|
||||
|
||||
The integration tags use the `*-rtm-it` suffix (rather than plain
|
||||
`galaxy/game:1.0.0`) so an operator running the suite locally cannot
|
||||
accidentally consume a hand-built dev image, and so a `docker image
|
||||
rm` of integration leftovers does not nuke a production-shaped tag.
|
||||
|
||||
## 6. Per-test Docker network and per-test state root
|
||||
|
||||
`EnsureNetwork(t)` creates a uniquely-named bridge network per test
|
||||
and registers cleanup; `t.ArtifactDir()` provides the per-game state
|
||||
root. Both ensure that two scenarios running back-to-back cannot
|
||||
collide on the per-game DNS hostname (`galaxy-game-{game_id}`) or on
|
||||
filesystem state. Game ids are themselves unique per test
|
||||
(`harness.IDFromTestName` adds a nanosecond suffix) — combined with
|
||||
the per-test network and state root, the suite is safe to run with
|
||||
`-count` greater than one.
|
||||
|
||||
`t.ArtifactDir()` keeps the engine state directory around when a
|
||||
test fails (Go ≥ 1.25), so an operator can `cd` into it after a CI
|
||||
failure and inspect what the engine wrote. On success the directory
|
||||
is automatically cleaned up.
|
||||
|
||||
## 7. PostgreSQL and Redis containers shared per-package
|
||||
|
||||
Both fixtures use `sync.Once` to start one testcontainer per test
|
||||
package, mirroring the
|
||||
[`../internal/adapters/postgres/internal/pgtest`](../internal/adapters/postgres/internal/pgtest)
|
||||
pattern. `TruncatePostgres` and `FlushRedis` reset state between
|
||||
tests so each scenario starts on an empty stack. The trade-off versus
|
||||
per-test containers is the standard one: container startup dominates
|
||||
the per-package latency, so amortising it across the suite keeps the
|
||||
loop tight while the truncate/flush ensures isolation. The ~1–2 s
|
||||
difference matters in CI.
|
||||
|
||||
## 8. Engine image cache is intentionally retained between runs
|
||||
|
||||
`buildAndTagEngineImage` runs once per package via `sync.Once` and
|
||||
leaves both image tags in the local Docker cache after the suite
|
||||
exits. The cache is a substantial speed-up on a developer laptop
|
||||
(`docker build` of `galaxy/game` takes 30+ seconds cold, sub-second
|
||||
hot), and a stale image is unlikely because the tags carry the
|
||||
`*-rtm-it` suffix and the underlying Dockerfile is forward-compatible
|
||||
with multiple test runs. Operators who suspect a stale image can
|
||||
`docker image rm galaxy/game:1.0.0-rtm-it galaxy/game:1.0.1-rtm-it`;
|
||||
the next run rebuilds.
|
||||
|
||||
## 9. Scenario coverage
|
||||
|
||||
The suite covers the four end-to-end flows operators care about:
|
||||
|
||||
- **lifecycle** (`lifecycle_test.go`) — start → inspect → stop →
|
||||
restart → patch → stop → cleanup. The intermediate `stop` between
|
||||
`patch` and `cleanup` is intentional: the cleanup endpoint refuses
|
||||
to remove a running container per
|
||||
[`../README.md` §Cleanup](../README.md#cleanup).
|
||||
- **replay** (`replay_test.go`) — duplicate start / stop entries
|
||||
surface as `replay_no_op` per [`workers.md`](workers.md) §11.
|
||||
- **health** (`health_test.go`) — external `docker rm` produces
|
||||
`container_disappeared`; manual `docker run` is adopted by the
|
||||
reconciler.
|
||||
- **notification** (`notification_test.go`) — unresolvable `image_ref`
|
||||
produces `runtime.image_pull_failed` plus a `failure` job_result.
|
||||
|
||||
## 10. Service-local scope only
|
||||
|
||||
This suite runs Runtime Manager against a real Docker daemon plus
|
||||
testcontainers PG / Redis but **does not** include any other Galaxy
|
||||
service. Cross-service flows (Lobby ↔ RTM, RTM ↔ Notification) live
|
||||
in the top-level `galaxy/integration/` module, where the harness
|
||||
spawns multiple service binaries and uses real (not stubbed)
cross-service streams.
|
||||
@@ -0,0 +1,531 @@
|
||||
# PostgreSQL Schema Decisions
|
||||
|
||||
Runtime Manager has been PostgreSQL-and-Redis from day one — there is
|
||||
no Redis-only predecessor and no migration window. This document
|
||||
records the schema decisions and the non-obvious agreements behind
|
||||
them, mirroring the shape of
|
||||
[`../../notification/docs/postgres-migration.md`](../../notification/docs/postgres-migration.md)
|
||||
and serving the same role: a single coherent reference for "why does
|
||||
the persistence layer look this way".
|
||||
|
||||
Use this document together with the migration script
|
||||
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
and the runtime wiring
|
||||
[`../internal/app/runtime.go`](../internal/app/runtime.go).
|
||||
|
||||
## Outcomes
|
||||
|
||||
- Schema `rtmanager` (provisioned externally) holds the durable
|
||||
service state across three tables: `runtime_records`,
|
||||
`operation_log`, `health_snapshots`. The three tables map onto the
|
||||
three runtime concerns documented in
|
||||
[`../README.md` §Persistence Layout](../README.md#persistence-layout):
|
||||
current state per game, audit trail per operation, and latest
|
||||
technical health observation per game.
|
||||
- The runtime opens one PostgreSQL pool via `pkg/postgres.OpenPrimary`,
|
||||
applies embedded goose migrations strictly before any HTTP listener
|
||||
becomes ready, and exits non-zero when migration or ping fails.
|
||||
Already-applied migrations exit zero — the
|
||||
`pkg/postgres`-supplied migrator treats "no work to do" as success.
|
||||
- The runtime opens one shared `*redis.Client` via
|
||||
`pkg/redisconn.NewMasterClient` and passes it to the stream offset
|
||||
store, the per-game lease store, the consumer pipelines, and every
|
||||
publisher (`runtime:job_results`, `runtime:health_events`,
|
||||
`notification:intents`).
|
||||
- The Redis adapter package
|
||||
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
|
||||
owns one shared `Keyspace` struct with the
|
||||
`defaultPrefix = "rtmanager:"` constant and per-store subpackages
|
||||
for stream offsets and the per-game lease.
|
||||
- Generated jet code under
|
||||
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
is committed; `make -C rtmanager jet` regenerates it via the
|
||||
testcontainers-driven `cmd/jetgen` pipeline.
|
||||
- Configuration uses the `RTMANAGER_` prefix for every variable.
|
||||
The schema-per-service rule from
|
||||
[`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md)
|
||||
applies: each service's role is grant-restricted to its own
|
||||
schema; RTM never touches Lobby's `lobby` schema or vice versa.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. One schema, externally-provisioned `rtmanagerservice` role
|
||||
|
||||
**Decision.** The `rtmanager` schema and the matching
|
||||
`rtmanagerservice` role are created outside the migration sequence
|
||||
(in tests, by the testcontainers harness in `cmd/jetgen/main.go::provisionRoleAndSchema`
|
||||
and by the integration harness; in production, by an ops init script
|
||||
not in scope for any service stage). The embedded migration
|
||||
`00001_init.sql` only contains DDL for the service-owned tables and
|
||||
indexes and assumes it runs as the schema owner with
|
||||
`search_path=rtmanager`.
|
||||
|
||||
**Why.** Mixing role creation, schema creation, and table DDL into
|
||||
one script forces every consumer of the migration to run as a
|
||||
superuser. The schema-per-service architectural rule
|
||||
(`ARCHITECTURE.md §Persistence Backends`) lines up neatly with the
|
||||
operational split: ops provisions roles and schemas, the service
|
||||
applies schema-scoped migrations. Letting RTM run `CREATE SCHEMA`
|
||||
from its runtime role would relax the
|
||||
"each service's role grants are restricted to its own schema"
|
||||
defence-in-depth rule.
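
A sketch of what that external provisioning step looks like (the exact
role options, password handling, and grants are the ops script's and
test harness's concern; shown here only to make the split concrete):

```sql
-- Run once by ops / the test harness, as a superuser or database owner;
-- never part of the embedded goose migrations.
CREATE ROLE rtmanagerservice LOGIN PASSWORD '<managed elsewhere>';
CREATE SCHEMA rtmanager AUTHORIZATION rtmanagerservice;
ALTER ROLE rtmanagerservice SET search_path = rtmanager;
-- The role stays confined to its own schema; no grants on other schemas.
```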
|
||||
|
||||
### 2. `runtime_records.game_id` is the natural primary key
|
||||
|
||||
**Decision.** `runtime_records` uses
|
||||
`game_id text PRIMARY KEY`. There is no surrogate key. The `status`
|
||||
column carries a CHECK constraint enforcing the
|
||||
`running | stopped | removed` enum.
|
||||
|
||||
```sql
|
||||
CREATE TABLE runtime_records (
|
||||
game_id text PRIMARY KEY,
|
||||
status text NOT NULL,
|
||||
-- ...
|
||||
CONSTRAINT runtime_records_status_chk
|
||||
CHECK (status IN ('running', 'stopped', 'removed'))
|
||||
);
|
||||
```
|
||||
|
||||
**Why.** `game_id` is the platform-wide identifier owned by Lobby;
|
||||
RTM stores at most one record per game ever. A surrogate
|
||||
`bigserial` would force every cross-service join to translate
|
||||
through a lookup table; the natural key keeps RTM's persistence
|
||||
layer pin-compatible with the streams contract (every
|
||||
`runtime:start_jobs` envelope already names the `game_id`). The
|
||||
status CHECK reproduces the Go-level enum from
|
||||
[`../internal/domain/runtime/model.go`](../internal/domain/runtime/model.go)
|
||||
as a defence-in-depth gate at the storage boundary. Decision context:
|
||||
[`domain-and-ports.md`](domain-and-ports.md).
|
||||
|
||||
### 3. `(status, last_op_at)` index serves both the cleanup worker and `ListByStatus`
|
||||
|
||||
**Decision.** `runtime_records_status_last_op_idx` is a composite
|
||||
index on `(status, last_op_at)`. The container cleanup worker scans
|
||||
`status='stopped' AND last_op_at < cutoff`; the
|
||||
`runtimerecordstore.ListByStatus` adapter method orders rows
|
||||
`last_op_at DESC, game_id ASC`.
|
||||
|
||||
```sql
|
||||
CREATE INDEX runtime_records_status_last_op_idx
|
||||
ON runtime_records (status, last_op_at);
|
||||
```
|
||||
|
||||
**Why.** Both read shapes share the same composite. The cleanup
|
||||
worker drives the index from one direction (range scan on
|
||||
`last_op_at` filtered by status); `ListByStatus` drives it from the
|
||||
other (equality on status, sorted by `last_op_at`). PostgreSQL
|
||||
satisfies both shapes through one index scan once the planner picks
|
||||
the index for the WHERE clause. The secondary `game_id ASC` tiebreak
|
||||
in the adapter ORDER BY is satisfied by primary-key ordering after
|
||||
the index returns the rows.
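
The two read shapes, written out as plain SQL for illustration (the
adapters build the jet equivalents; the retention cutoff shown is
arbitrary):

```sql
-- Cleanup worker: status equality plus a range scan on last_op_at.
SELECT game_id
  FROM rtmanager.runtime_records
 WHERE status = 'stopped'
   AND last_op_at < now() - interval '7 days';

-- ListByStatus: status equality, newest first, game_id as tiebreak.
SELECT *
  FROM rtmanager.runtime_records
 WHERE status = 'running'
 ORDER BY last_op_at DESC, game_id ASC;
```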
|
||||
|
||||
A second supporting index for the cleanup worker was considered and
|
||||
rejected: the workload is so small (single-instance v1, bounded
|
||||
running game count) that one composite index is clearly cheaper to
maintain than two narrow ones.
|
||||
|
||||
### 4. `operation_log` is append-only with `bigserial id` and a `(game_id, started_at DESC)` index
|
||||
|
||||
**Decision.** `operation_log` carries a `bigserial id PRIMARY KEY`
|
||||
and is written exclusively through INSERT — there is no UPDATE
|
||||
pathway, no soft-delete column, and no foreign key to
|
||||
`runtime_records`. The audit index
|
||||
`operation_log_game_started_idx (game_id, started_at DESC)` drives
|
||||
the GM/Admin REST audit reads. The adapter's `ListByGame` orders
|
||||
results `started_at DESC, id DESC` and applies `LIMIT $2`.
|
||||
|
||||
```sql
|
||||
CREATE INDEX operation_log_game_started_idx
|
||||
ON operation_log (game_id, started_at DESC);
|
||||
```
|
||||
|
||||
**Why.** The audit's correctness invariant is "every operation RTM
|
||||
performed gets exactly one row"; CASCADE deletes from
|
||||
`runtime_records` would silently lose history when an admin removes
|
||||
a runtime and would break the
|
||||
[`../README.md` §Persistence Layout](../README.md) commitment. The
|
||||
secondary `id DESC` tiebreak inside the adapter is necessary because
|
||||
the audit log can write multiple rows in the same millisecond when
|
||||
`reconcile_adopt` and a real operation interleave on a single tick;
|
||||
without the tiebreak the test that asserts insertion-order-stable
|
||||
reads becomes flaky. A non-positive `limit` is rejected before the
|
||||
SQL is issued; an empty result set returns as `nil` (matching the
|
||||
lobby pattern, so service-layer callers can do `len(entries) == 0`
|
||||
without an extra allocation).
|
||||
|
||||
### 5. Enum CHECK constraints on `op_kind`, `op_source`, `outcome`
|
||||
|
||||
**Decision.** `operation_log` reproduces the three Go-level enums
|
||||
as CHECK constraints:
|
||||
|
||||
```sql
|
||||
CONSTRAINT operation_log_op_kind_chk
|
||||
CHECK (op_kind IN (
|
||||
'start', 'stop', 'restart', 'patch',
|
||||
'cleanup_container', 'reconcile_adopt', 'reconcile_dispose'
|
||||
)),
|
||||
CONSTRAINT operation_log_op_source_chk
|
||||
CHECK (op_source IN (
|
||||
'lobby_stream', 'gm_rest', 'admin_rest',
|
||||
'auto_ttl', 'auto_reconcile'
|
||||
)),
|
||||
CONSTRAINT operation_log_outcome_chk
|
||||
CHECK (outcome IN ('success', 'failure'))
|
||||
```
|
||||
|
||||
The Go-level enums in
|
||||
[`../internal/domain/operation/log.go`](../internal/domain/operation/log.go)
|
||||
remain the source of truth.
|
||||
|
||||
**Why.** A defence-in-depth gate at the storage boundary catches any
|
||||
adapter regression that would otherwise persist an unexpected
|
||||
string. Operator-side queries (`SELECT … WHERE op_kind = 'restart'`)
|
||||
benefit from the enum being verifiable directly in psql without
|
||||
consulting the Go source. Adding a new value requires editing two
|
||||
places (the Go enum and the migration), which is the right friction
|
||||
level: every new value is a wire-protocol change and deserves an
|
||||
explicit migration. The alternative of using PostgreSQL's `CREATE
|
||||
TYPE … AS ENUM` was rejected because adding a value to a PG enum
|
||||
type requires `ALTER TYPE` outside a transaction and complicates the
|
||||
single-init pre-launch policy (decision §12).
|
||||
|
||||
### 6. `health_snapshots` is one row per game; status enum collapses event types
|
||||
|
||||
**Decision.** `health_snapshots` carries `game_id text PRIMARY KEY`
|
||||
and stores the latest technical health observation per game. The
|
||||
`status` column enumerates the **observed engine state**, not the
|
||||
**triggering event type**:
|
||||
|
||||
```sql
|
||||
CONSTRAINT health_snapshots_status_chk
|
||||
CHECK (status IN (
|
||||
'healthy', 'probe_failed', 'exited',
|
||||
'oom', 'inspect_unhealthy', 'container_disappeared'
|
||||
))
|
||||
```
|
||||
|
||||
The `runtime:health_events` `event_type` enum has seven values
|
||||
(`container_started`, `container_exited`, `container_oom`,
|
||||
`container_disappeared`, `inspect_unhealthy`, `probe_failed`,
|
||||
`probe_recovered`). The snapshot status has six — the two probe
|
||||
events fold into `healthy` (after `probe_recovered`) and
|
||||
`probe_failed`, and `container_started` collapses into `healthy`.
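
A sketch of that collapse as a plain switch (the real mapping lives in
the publisher adapter and may be shaped differently):

```go
package healtheventspublisher

// snapshotStatus maps a runtime:health_events event_type onto the
// health_snapshots status enum; unknown event types leave the snapshot
// untouched.
func snapshotStatus(eventType string) (status string, ok bool) {
	switch eventType {
	case "container_started", "probe_recovered":
		return "healthy", true
	case "probe_failed":
		return "probe_failed", true
	case "container_exited":
		return "exited", true
	case "container_oom":
		return "oom", true
	case "inspect_unhealthy":
		return "inspect_unhealthy", true
	case "container_disappeared":
		return "container_disappeared", true
	default:
		return "", false
	}
}
```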
|
||||
|
||||
**Why.** Health snapshots answer "what state is the engine in
|
||||
**right now**", not "what event was just emitted". A consumer who
|
||||
wants the event firehose reads `runtime:health_events`; a consumer
|
||||
who wants the latest verdict reads `health_snapshots`. The two
|
||||
surfaces have different lifetimes (stream entries are bounded only
|
||||
by Redis trim; snapshot rows are overwritten on every new
|
||||
observation), so collapsing the seven event types into six status
|
||||
states aligns the column with the consumer's mental model. The
|
||||
adapter that implements this collapse lives in
|
||||
[`../internal/adapters/healtheventspublisher/publisher.go`](../internal/adapters/healtheventspublisher/publisher.go);
|
||||
every emission to the stream also upserts the snapshot.
|
||||
|
||||
### 7. Two-axis CAS shape on `runtime_records.UpdateStatus`
|
||||
|
||||
**Decision.** `runtimerecordstore.UpdateStatus` compiles its CAS
|
||||
guard into a single `WHERE … AND …` clause. Status must equal the
|
||||
caller's `ExpectedFrom`; when the caller supplies a non-empty
|
||||
`ExpectedContainerID`, `current_container_id` must equal it as
|
||||
well:
|
||||
|
||||
```sql
|
||||
UPDATE rtmanager.runtime_records
|
||||
SET status = $1, last_op_at = $2, ...
|
||||
WHERE game_id = $3
|
||||
AND status = $4
|
||||
[AND current_container_id = $5]
|
||||
```
|
||||
|
||||
A `RowsAffected() == 0` result is ambiguous — the row may be absent
|
||||
or the predicate may have failed. The adapter resolves the ambiguity
|
||||
through a follow-up `SELECT status FROM ... WHERE game_id = $1`:
|
||||
missing row → `runtime.ErrNotFound`; mismatch → `runtime.ErrConflict`.
|
||||
The probe runs only on the slow path; happy-path UPDATEs cost a
|
||||
single round trip.
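
A sketch of the slow-path probe (local sentinels stand in for the
domain's `runtime.ErrNotFound` / `runtime.ErrConflict`):

```go
package runtimerecordstore

import (
	"context"
	"database/sql"
	"errors"
)

var (
	errNotFound = errors.New("runtime record not found") // stands in for runtime.ErrNotFound
	errConflict = errors.New("runtime record conflict")  // stands in for runtime.ErrConflict
)

// resolveZeroRows runs only when the CAS UPDATE reported zero affected
// rows: one follow-up SELECT distinguishes "row absent" from "predicate
// failed".
func resolveZeroRows(ctx context.Context, db *sql.DB, gameID string) error {
	var status string
	err := db.QueryRowContext(ctx,
		`SELECT status FROM rtmanager.runtime_records WHERE game_id = $1`,
		gameID,
	).Scan(&status)
	switch {
	case errors.Is(err, sql.ErrNoRows):
		return errNotFound
	case err != nil:
		return err
	default:
		return errConflict
	}
}
```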
|
||||
|
||||
**Why.** The two-axis CAS is what services need: a stop driven by an
|
||||
old container_id (from a stale REST request) must not clobber a
|
||||
fresh `running` record installed by a concurrent restart. Status-only
|
||||
CAS would collapse those two cases. The optional shape on
|
||||
`ExpectedContainerID` lets reconciliation flows that legitimately
|
||||
target "this game in `running` state without caring which container"
|
||||
omit the second predicate. The follow-up probe matches the
|
||||
gamestore / invitestore precedent in `lobby/internal/adapters/postgres`
|
||||
and produces clean per-error sentinels at the service layer.
|
||||
|
||||
`TestUpdateStatusConcurrentCAS` exercises the path end to end with
|
||||
eight goroutines racing the same transition: exactly one returns
|
||||
`nil`, the rest see `runtime.ErrConflict`. The test is deterministic
|
||||
because PostgreSQL serialises concurrent UPDATEs of the same row
through its row-level lock.
|
||||
|
||||
### 8. Destination-driven `SET` clause on `UpdateStatus`
|
||||
|
||||
**Decision.** `UpdateStatus` updates a different column subset
|
||||
depending on the destination status:
|
||||
|
||||
| Destination | Columns set |
|
||||
| --- | --- |
|
||||
| `stopped` | `status`, `last_op_at`, `stopped_at` |
|
||||
| `removed` | `status`, `last_op_at`, `removed_at`, `current_container_id = NULL` |
|
||||
| `running` | `status`, `last_op_at` |
|
||||
|
||||
The implementation switches on `input.To` and writes the UPDATE
|
||||
chain inline per branch — three short branches read better than one
|
||||
parametric helper.
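
The branch structure, sketched with literal SQL instead of the jet
chains the adapter actually emits:

```go
package runtimerecordstore

// updateSQL sketches the destination-driven SET clause; placeholders are
// numbered per branch because each destination binds a different set of
// parameters, and the optional ExpectedContainerID predicate is omitted.
func updateSQL(to string) string {
	switch to {
	case "stopped":
		return `UPDATE rtmanager.runtime_records
		   SET status = $1, last_op_at = $2, stopped_at = $3
		 WHERE game_id = $4 AND status = $5`
	case "removed":
		return `UPDATE rtmanager.runtime_records
		   SET status = $1, last_op_at = $2, removed_at = $3,
		       current_container_id = NULL
		 WHERE game_id = $4 AND status = $5`
	case "running":
		return `UPDATE rtmanager.runtime_records
		   SET status = $1, last_op_at = $2
		 WHERE game_id = $3 AND status = $4`
	default:
		return ""
	}
}
```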
|
||||
|
||||
**Why.** Each destination has a different invariant. `stopped`
|
||||
records the wall-clock at which the engine ceased serving; `removed`
|
||||
nulls the container_id because the row no longer points at any
|
||||
Docker resource; `running` only updates the status and the
|
||||
last-op timestamp because the running invariants
|
||||
(`current_container_id`, fresh `started_at`, `current_image_ref`,
|
||||
`engine_endpoint`) are installed through `Upsert` on the `start`
|
||||
path.
|
||||
|
||||
A previous draft built the SET list via `[]pg.Column` / `[]any`
|
||||
slices and a helper, but jet's `UPDATE(columns ...jet.Column)`
|
||||
variadic refuses a `[]postgres.Column` slice spread because the
|
||||
element type does not match `jet.Column` after the type-alias
|
||||
resolution. The final code switches inline per branch.
|
||||
|
||||
The `running` destination is implemented even though the start
|
||||
service uses `Upsert` for the inner start of restart and patch.
|
||||
Keeping the `running` path live preserves a one-to-one match between
|
||||
`runtime.AllowedTransitions()` and the adapter's capability matrix —
|
||||
otherwise a future caller exercising the `stopped → running`
|
||||
transition through `UpdateStatus` would hit a runtime error inside
|
||||
the adapter rather than a domain rejection. The path only updates
|
||||
`status` and `last_op_at`; callers responsible for the running
|
||||
invariants install them through `Upsert` first.
|
||||
|
||||
### 9. `created_at` preservation on `Upsert`
|
||||
|
||||
**Decision.** `runtimerecordstore.Upsert` is implemented as
|
||||
`INSERT ... ON CONFLICT (game_id) DO UPDATE SET <every mutable
|
||||
column from EXCLUDED>` — `created_at` is deliberately omitted from
|
||||
the DO UPDATE list, so a second `Upsert` with a fresh `CreatedAt`
|
||||
value never overwrites the stored timestamp.
|
||||
|
||||
```sql
|
||||
INSERT INTO rtmanager.runtime_records (...)
|
||||
VALUES (...)
|
||||
ON CONFLICT (game_id) DO UPDATE
|
||||
SET status = EXCLUDED.status,
|
||||
current_container_id = EXCLUDED.current_container_id,
|
||||
current_image_ref = EXCLUDED.current_image_ref,
|
||||
engine_endpoint = EXCLUDED.engine_endpoint,
|
||||
state_path = EXCLUDED.state_path,
|
||||
docker_network = EXCLUDED.docker_network,
|
||||
started_at = EXCLUDED.started_at,
|
||||
stopped_at = EXCLUDED.stopped_at,
|
||||
removed_at = EXCLUDED.removed_at,
|
||||
last_op_at = EXCLUDED.last_op_at
|
||||
-- created_at intentionally NOT updated
|
||||
```
|
||||
|
||||
`TestUpsertOverwritesMutableColumnsPreservesCreatedAt` covers the
|
||||
invariant.
|
||||
|
||||
**Why.** `runtime_records.created_at` records "first time RTM saw
|
||||
the game". Every restart and every reconcile_adopt re-Upserts the
|
||||
row with the current wall-clock as `CreatedAt` from the adapter
|
||||
boundary; without the omission rule the timestamp would drift
|
||||
forward. Preserving the original creation time keeps a stable
|
||||
horizon for retention reasoning and matches
|
||||
`lobby/internal/adapters/postgres/gamestore.Save`, which uses the
|
||||
same approach for the `games.created_at` column.
|
||||
|
||||
### 10. `health_snapshots.details` JSONB round-trip with `'{}'::jsonb` default
|
||||
|
||||
**Decision.** `health_snapshots.details` is `jsonb NOT NULL DEFAULT
|
||||
'{}'::jsonb`. The jet-generated model declares
|
||||
`Details string` (jet maps `jsonb` to `string`). The adapter:
|
||||
|
||||
- on `Upsert`, substitutes the SQL DEFAULT `{}` when
|
||||
`snapshot.Details` is empty, so the column never holds a non-JSON
|
||||
empty string;
|
||||
- on `Get`, scans `details` as `[]byte` and wraps the bytes in a
|
||||
`json.RawMessage` so the caller receives verbatim bytes without
|
||||
an extra round of parsing.
|
||||
|
||||
`TestUpsertEmptyDetailsRoundTripsAsEmptyObject` and
|
||||
`TestUpsertAndGetRoundTrip` cover the two cases.
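
A sketch of the two adapter behaviours (helper names are illustrative):

```go
package healthsnapshotstore

import "encoding/json"

// normalizeDetails implements the write-side rule: an empty payload is
// persisted as the JSON object {} so the jsonb column never receives a
// non-JSON empty string.
func normalizeDetails(details json.RawMessage) []byte {
	if len(details) == 0 {
		return []byte(`{}`)
	}
	return details
}

// decodeDetails is the read-side counterpart: scanned bytes are handed
// back verbatim as json.RawMessage, with no extra parse.
func decodeDetails(raw []byte) json.RawMessage {
	if len(raw) == 0 {
		return json.RawMessage(`{}`)
	}
	return json.RawMessage(raw)
}
```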
|
||||
|
||||
**Why.** The detail payload is type-specific (the keys differ
|
||||
between `probe_failed` and `inspect_unhealthy`) and is opaque to
|
||||
queries — the column is never element-filtered. JSONB matches the
|
||||
"everything outside primary fields is JSON" pattern that the
|
||||
Notification Service already established and allows a future
|
||||
GIN index (e.g. for an admin search-by-key feature) without a
|
||||
schema rewrite. Substituting the SQL DEFAULT for an empty
|
||||
parameter avoids the trap where the database accepts `''` for
|
||||
`text` but rejects it for `jsonb`.
|
||||
|
||||
### 11. Timestamps are uniformly `timestamptz` with UTC normalisation at the adapter boundary
|
||||
|
||||
**Decision.** Every time-valued column on every RTM table uses
|
||||
PostgreSQL's `timestamptz`. The domain model continues to use
|
||||
`time.Time`; the adapter normalises every `time.Time` parameter to
|
||||
UTC at the binding site (`record.X.UTC()` or the `nullableTime`
|
||||
helper that wraps a possibly-zero `time.Time`), and re-wraps every
|
||||
scanned `time.Time` with `.UTC()` (directly or via
|
||||
`timeFromNullable` for nullable columns) before the value leaves
|
||||
the adapter.
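
A sketch of the two helpers named above (signatures are illustrative):

```go
package runtimerecordstore

import (
	"database/sql"
	"time"
)

// nullableTime is the bind-side helper: a zero time.Time maps to SQL
// NULL, anything else is normalised to UTC before binding.
func nullableTime(t time.Time) sql.NullTime {
	if t.IsZero() {
		return sql.NullTime{}
	}
	return sql.NullTime{Time: t.UTC(), Valid: true}
}

// timeFromNullable is the scan-side counterpart: NULL becomes the zero
// time.Time, everything else is re-wrapped as UTC before leaving the
// adapter.
func timeFromNullable(nt sql.NullTime) time.Time {
	if !nt.Valid {
		return time.Time{}
	}
	return nt.Time.UTC()
}
```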
|
||||
|
||||
The architecture-wide form of this rule lives in
|
||||
[`../../ARCHITECTURE.md` §Persistence Backends → Timestamp handling](../../ARCHITECTURE.md).
|
||||
|
||||
**Why.** `timestamptz` is the right column type for every
cross-service timestamp the platform observes, and the domain model
needs a `time.Time` API the service layer can compare and do
arithmetic on.
|
||||
Without explicit `.UTC()` on the bind site, the pgx driver returns
|
||||
scanned values in `time.Local`, which silently breaks equality
|
||||
tests, JSON formatting, and comparison against pointer fields
|
||||
elsewhere in the codebase. The defensive `.UTC()` rule on both
|
||||
sides eliminates the class of bug where a timezone difference
|
||||
between the adapter and the test harness flips assertions
|
||||
intermittently.
|
||||
|
||||
The same shape is used in User Service, Mail Service, and
|
||||
Notification Service — RTM matches the existing convention rather
|
||||
than introducing a fourth encoding path.
|
||||
|
||||
### 12. Single-init pre-launch policy
|
||||
|
||||
**Decision.** `00001_init.sql` evolves in place until first
|
||||
production deploy. Adding a column, an index, or a new table during
|
||||
the pre-launch development window edits this file directly rather
|
||||
than producing `00002_*.sql`. The runtime applies the migration on
|
||||
every boot; if the schema is already at head, `pkg/postgres`'s
|
||||
goose adapter exits zero.
|
||||
|
||||
**Why.** The schema-per-service architectural rule
|
||||
([`../../ARCHITECTURE.md` §Persistence Backends](../../ARCHITECTURE.md))
|
||||
endorses a single-init policy for pre-launch services. The
|
||||
pre-launch window allows non-additive changes (column rename, type
|
||||
narrowing, CHECK tightening) that a multi-step migration sequence
|
||||
would force into awkward two-step rewrites. Once the service ships
|
||||
to production, the next schema change becomes `00002_*.sql` and
|
||||
the policy lifts; from that point onward edits to `00001_init.sql`
|
||||
are rejected by code review.
|
||||
|
||||
This applies to RTM exactly the same way it applies to every other
|
||||
PG-backed service in the workspace; the README explicitly carries
|
||||
the reminder. The exit-zero behaviour for already-applied
|
||||
migrations is what makes the policy operationally cheap: a
|
||||
freshly-spawned replica re-applies the same `00001_init.sql` with
|
||||
no work to do, no logged error, and proceeds to open its
|
||||
listeners.
|
||||
|
||||
### 13. Query layer is `go-jet/jet/v2`; generated code is committed
|
||||
|
||||
**Decision.** All three RTM PG-store packages
|
||||
([`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
|
||||
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
|
||||
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore))
|
||||
build SQL through the jet builder API
|
||||
(`pgtable.<Table>.INSERT/SELECT/UPDATE/DELETE` plus the
|
||||
`pg.AND/OR/SET/COALESCE/...` DSL).
|
||||
|
||||
Generated table models live under
|
||||
[`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
and are regenerated by `make -C rtmanager jet`. The target invokes
|
||||
[`../cmd/jetgen/main.go`](../cmd/jetgen/main.go), which spins up a
|
||||
transient PostgreSQL container via testcontainers, provisions the
|
||||
`rtmanager` schema and `rtmanagerservice` role, applies the embedded
|
||||
goose migrations, and runs `github.com/go-jet/jet/v2/generator/postgres.GenerateDB`
|
||||
against the provisioned schema. Generated code is committed to the
|
||||
repo, so build consumers do not need Docker.
|
||||
|
||||
Statements are run through the `database/sql` API
|
||||
(`stmt.Sql() → db/tx.Exec/Query/QueryRow`); manual `rowScanner`
|
||||
helpers preserve the codecs.go boundary translations and
|
||||
domain-type mapping (status enum decoding, `time.Time` UTC
|
||||
normalisation, JSONB `[]byte` ↔ `json.RawMessage`).
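
A sketch of that execution boundary; the commented statement shows the
builder shape, assuming the generated `OperationLog` table model
(generated-code import paths elided):

```go
package operationlogstore

import (
	"context"
	"database/sql"

	pg "github.com/go-jet/jet/v2/postgres"
)

// A typical statement built from the generated table model, for reference:
//
//	stmt := pg.SELECT(OperationLog.AllColumns).
//		FROM(OperationLog).
//		WHERE(OperationLog.GameID.EQ(pg.String(gameID))).
//		ORDER_BY(OperationLog.StartedAt.DESC(), OperationLog.ID.DESC()).
//		LIMIT(limit)

// queryRows renders a jet-built statement with Sql() and runs it through
// the plain database/sql API; row scanning stays in the store's own
// rowScanner helpers.
func queryRows(ctx context.Context, db *sql.DB, stmt pg.SelectStatement) (*sql.Rows, error) {
	query, args := stmt.Sql()
	return db.QueryContext(ctx, query, args...)
}
```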
|
||||
|
||||
PostgreSQL constructs that the generated column methods do not cover
(`COALESCE`, `LOWER` on subselects, JSONB params) are expressed
through jet's expression helpers (`pg.COALESCE`, `pg.LOWER`, direct
`[]byte`/string params for JSONB columns).
|
||||
|
||||
**Why.** Aligns with the workspace-wide convention from
|
||||
[`../../PG_PLAN.md`](../../PG_PLAN.md): the query layer is
|
||||
`github.com/go-jet/jet/v2` (PostgreSQL dialect) for every PG-backed
|
||||
service. Hand-rolled SQL would multiply boundary-translation paths
|
||||
and require per-store query-builder helpers for what jet already
|
||||
covers. Committing generated code keeps `go build ./...` working
|
||||
without Docker.
|
||||
|
||||
### 14. `redisstate` keyspace ownership and per-store subpackages
|
||||
|
||||
**Decision.** The
|
||||
[`../internal/adapters/redisstate/`](../internal/adapters/redisstate)
|
||||
package owns one shared `Keyspace` struct with a
|
||||
`defaultPrefix = "rtmanager:"` constant. Each Redis-backed adapter
|
||||
lives in its own subpackage:
|
||||
|
||||
- [`redisstate/streamoffsets`](../internal/adapters/redisstate/streamoffsets/)
|
||||
for the stream offset store consumed by the start-jobs and
|
||||
stop-jobs consumers;
|
||||
- [`redisstate/gamelease`](../internal/adapters/redisstate/gamelease/)
|
||||
for the per-game lease store consumed by every lifecycle service
|
||||
and the reconciler.
|
||||
|
||||
Both subpackages take a `redisstate.Keyspace{}` value and use it to
|
||||
build their key shapes (`rtmanager:stream_offsets:{label}`,
|
||||
`rtmanager:game_lease:{game_id}`).
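
A sketch of the shared key-shape owner (field and method names are
illustrative):

```go
package redisstate

import "fmt"

const defaultPrefix = "rtmanager:"

// Keyspace owns the key prefix every Redis-backed subpackage builds on.
type Keyspace struct {
	Prefix string // empty selects defaultPrefix
}

func (k Keyspace) prefix() string {
	if k.Prefix == "" {
		return defaultPrefix
	}
	return k.Prefix
}

// StreamOffsetKey builds rtmanager:stream_offsets:{label}.
func (k Keyspace) StreamOffsetKey(label string) string {
	return fmt.Sprintf("%sstream_offsets:%s", k.prefix(), label)
}

// GameLeaseKey builds rtmanager:game_lease:{game_id}.
func (k Keyspace) GameLeaseKey(gameID string) string {
	return fmt.Sprintf("%sgame_lease:%s", k.prefix(), gameID)
}
```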
|
||||
|
||||
**Why.** Keeping the parent package as the single owner of the prefix
|
||||
and the key-shape builder mirrors the way Lobby's `redisstate`
|
||||
namespace centralises every key shape and supports multiple
Redis-backed adapters (stream offsets, the per-game lease) without a
|
||||
restructure as the surface grows.
|
||||
|
||||
The per-store subpackage choice (rather than Lobby's flat
|
||||
single-package shape) is driven by three considerations:
|
||||
|
||||
- It keeps the docker mock generator scoped to one package, since
|
||||
`mockgen` regenerates per-directory.
|
||||
- It allows finer-grained dependency selection: `miniredis` is a
|
||||
dev-only dep, and keeping the `streamoffsets` package
|
||||
self-contained leaves room for `gamelease` to depend only on the
|
||||
production `redis` client.
|
||||
- Each subpackage carries its own tests, which keeps the test
|
||||
surface focused on one Redis primitive rather than mixing offset
|
||||
semantics with lease semantics in shared fixtures.
|
||||
|
||||
## Cross-References
|
||||
|
||||
- [`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql)
|
||||
— the embedded schema migration.
|
||||
- [`../internal/adapters/postgres/migrations/migrations.go`](../internal/adapters/postgres/migrations/migrations.go)
|
||||
— `//go:embed *.sql` and `FS()` exporter consumed by the runtime.
|
||||
- [`../internal/adapters/postgres/runtimerecordstore`](../internal/adapters/postgres/runtimerecordstore),
|
||||
[`../internal/adapters/postgres/operationlogstore`](../internal/adapters/postgres/operationlogstore),
|
||||
[`../internal/adapters/postgres/healthsnapshotstore`](../internal/adapters/postgres/healthsnapshotstore)
|
||||
— the three jet-backed PG adapters and their testcontainers-driven
|
||||
unit suites.
|
||||
- [`../internal/adapters/postgres/jet/`](../internal/adapters/postgres/jet)
|
||||
— committed generated jet models.
|
||||
- [`../cmd/jetgen/main.go`](../cmd/jetgen/main.go) and
|
||||
[`../Makefile`](../Makefile) `jet` target — the regeneration
|
||||
pipeline.
|
||||
- [`../internal/adapters/redisstate/`](../internal/adapters/redisstate),
|
||||
[`../internal/adapters/redisstate/streamoffsets/`](../internal/adapters/redisstate/streamoffsets/),
|
||||
[`../internal/adapters/redisstate/gamelease/`](../internal/adapters/redisstate/gamelease/)
|
||||
— Redis adapter package layout.
|
||||
- [`../internal/app/runtime.go`](../internal/app/runtime.go)
|
||||
— runtime wiring: PG pool open + migration apply + Redis client
|
||||
open + adapter assembly.
|
||||
- [`../internal/config/`](../internal/config) — the config groups
|
||||
consumed by the wiring (`Postgres`, `Redis`, `Streams`,
|
||||
`Coordination`).
|
||||
- Companion design rationales:
|
||||
[`domain-and-ports.md`](domain-and-ports.md) for status enum and
|
||||
domain shape, [`adapters.md`](adapters.md) for the redisstate
|
||||
publishers and clients.
|
||||
@@ -0,0 +1,368 @@
|
||||
# Operator Runbook
|
||||
|
||||
This runbook covers the checks that matter most during startup,
|
||||
steady-state readiness, shutdown, and the handful of recovery paths
|
||||
specific to Runtime Manager.
|
||||
|
||||
## Startup Checks
|
||||
|
||||
Before starting the process, confirm:
|
||||
|
||||
- `RTMANAGER_DOCKER_HOST` (default `unix:///var/run/docker.sock`)
|
||||
reaches a Docker daemon the operator controls. RTM is the only
|
||||
Galaxy service permitted to interact with the Docker socket;
|
||||
scoping the daemon to RTM-only callers is operator domain.
|
||||
- `RTMANAGER_DOCKER_NETWORK` (default `galaxy-net`) names a
|
||||
user-defined bridge network that has already been created (e.g.
|
||||
via `docker network create galaxy-net` in the environment's
|
||||
bootstrap script). RTM **validates** the network at startup but
|
||||
never creates it. A missing network is fail-fast and the process
|
||||
exits non-zero before opening any listener.
|
||||
- `RTMANAGER_GAME_STATE_ROOT` is a host directory the daemon's user
|
||||
can read and write. Per-game subdirectories are created with
|
||||
`RTMANAGER_GAME_STATE_DIR_MODE` (default `0750`) and
|
||||
`RTMANAGER_GAME_STATE_OWNER_UID` / `_GID` (default `0:0`); set the
|
||||
uid/gid to match the engine container's user when running with a
|
||||
non-root engine.
|
||||
- `RTMANAGER_POSTGRES_PRIMARY_DSN` points to the PostgreSQL primary
|
||||
that hosts the `rtmanager` schema. The DSN must include
|
||||
`search_path=rtmanager` and `sslmode=disable` (or a real SSL mode
|
||||
for production). Embedded goose migrations apply at startup before
|
||||
any HTTP listener opens; a migration or ping failure terminates the
|
||||
process with a non-zero exit. The `rtmanager` schema and the
|
||||
matching `rtmanagerservice` role are provisioned externally
|
||||
([`postgres-migration.md` §1](postgres-migration.md)).
|
||||
- `RTMANAGER_REDIS_MASTER_ADDR` and `RTMANAGER_REDIS_PASSWORD` reach
|
||||
the Redis deployment used for the runtime-coordination state:
|
||||
stream consumers (`runtime:start_jobs`, `runtime:stop_jobs`),
|
||||
publishers (`runtime:job_results`, `runtime:health_events`,
|
||||
`notification:intents`), persisted offsets, and the per-game
|
||||
lease. RTM does not maintain durable business state on Redis.
|
||||
- Stream names match the producers and consumers RTM integrates with:
|
||||
- `RTMANAGER_REDIS_START_JOBS_STREAM` (default `runtime:start_jobs`)
|
||||
- `RTMANAGER_REDIS_STOP_JOBS_STREAM` (default `runtime:stop_jobs`)
|
||||
- `RTMANAGER_REDIS_JOB_RESULTS_STREAM` (default `runtime:job_results`)
|
||||
- `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM` (default `runtime:health_events`)
|
||||
- `RTMANAGER_NOTIFICATION_INTENTS_STREAM` (default `notification:intents`)
|
||||
- `RTMANAGER_LOBBY_INTERNAL_BASE_URL` resolves to Lobby's internal
|
||||
HTTP listener. RTM's start service issues a diagnostic
|
||||
`GET /api/v1/internal/games/{game_id}` per start; failure is logged
|
||||
at debug and does not abort the start
|
||||
([`services.md` §7](services.md)).
|
||||
|
||||
The startup sequence runs in the order recorded in
|
||||
[`../README.md` §Startup dependencies](../README.md#startup-dependencies):
|
||||
|
||||
1. PostgreSQL primary opens; goose migrations apply synchronously.
|
||||
2. Redis master client opens and pings.
|
||||
3. Docker daemon ping; configured network presence check.
|
||||
4. Telemetry exporter (OTLP grpc/http or stdout).
|
||||
5. Internal HTTP listener.
|
||||
6. Reconciler runs **once synchronously** and blocks until done.
|
||||
7. Background workers start.
|
||||
|
||||
A failure at any step is fatal. The synchronous reconciler pass is
|
||||
the reason orphaned containers from a prior process never reach the
|
||||
periodic workers in an inconsistent state
|
||||
([`workers.md` §17](workers.md)).
|
||||
|
||||
Expected log lines on a healthy boot:
|
||||
|
||||
- `migrations applied`,
|
||||
- `postgres ping ok`,
|
||||
- `redis ping ok`,
|
||||
- `docker ping ok` and `docker network found`,
|
||||
- `telemetry exporter started`,
|
||||
- `internal http listening`,
|
||||
- `reconciler initial pass completed`,
|
||||
- one `worker started` entry per background worker (seven expected).
|
||||
|
||||
## Readiness
|
||||
|
||||
Use the probes according to what they actually verify:
|
||||
|
||||
- `GET /healthz` confirms the listener is alive — no dependency
|
||||
check.
|
||||
- `GET /readyz` live-pings PostgreSQL primary, Redis master, and the
|
||||
Docker daemon, then asserts the configured Docker network exists.
|
||||
Returns `{"status":"ready"}` when every check passes; otherwise
|
||||
returns `503` with the canonical
|
||||
`{"error":{"code":"service_unavailable","message":"…"}}` envelope
|
||||
identifying the first failing dependency.
|
||||
|
||||
`/readyz` is the strongest readiness signal RTM exposes; unlike
|
||||
Lobby's `/readyz`, it does **not** rely on a one-shot boot ping.
|
||||
Each request hits the daemon and the database fresh.
|
||||
|
||||
For a practical readiness check in production:
|
||||
|
||||
1. confirm the process emitted the listener and worker startup logs;
|
||||
2. check `GET /healthz` and `GET /readyz`;
|
||||
3. verify `rtmanager.runtime_records_by_status{status="running"}`
|
||||
gauge tracks the expected live game count after the first start
|
||||
completes;
|
||||
4. verify `rtmanager.docker_op_latency` histograms have at least one
|
||||
sample after the first lifecycle operation.
|
||||
|
||||
## Shutdown
|
||||
|
||||
The process handles `SIGINT` and `SIGTERM`.
|
||||
|
||||
Shutdown behaviour:
|
||||
|
||||
- the per-component shutdown budget is controlled by
|
||||
`RTMANAGER_SHUTDOWN_TIMEOUT` (default `30s`);
|
||||
- the internal HTTP listener drains in-flight requests before closing;
|
||||
- stream consumers stop their `XREAD` loops and persist the latest
|
||||
offset before returning; the offset survives the restart
|
||||
([`workers.md` §9](workers.md));
|
||||
- the Docker events listener cancels its subscription;
|
||||
- the in-flight services release their per-game lease through the
|
||||
surrounding context cancellation;
|
||||
- the reconciler completes its current pass or aborts mid-write at
|
||||
the next lease re-acquisition.
|
||||
|
||||
During planned restarts:
|
||||
|
||||
1. send `SIGTERM`;
|
||||
2. wait for the listener and component-stop logs;
|
||||
3. expect any consumer that was mid-cycle to retry from the persisted
|
||||
offset on the next process start;
|
||||
4. investigate only if shutdown exceeds `RTMANAGER_SHUTDOWN_TIMEOUT`.
|
||||
|
||||
## Engine Container Died
|
||||
|
||||
A running engine container that exits unexpectedly surfaces through
|
||||
three observation channels:
|
||||
|
||||
- The Docker events listener emits `container_exited` (non-zero exit
|
||||
code) or `container_oom` (Docker action `oom`).
|
||||
- The active probe worker eventually emits `probe_failed` once the
|
||||
threshold is crossed.
|
||||
- The Docker inspect worker may emit `inspect_unhealthy` if the
|
||||
engine restarts under Docker's healthcheck or if Docker reports an
|
||||
unexpected status.
|
||||
|
||||
Triage:
|
||||
|
||||
1. Inspect the `runtime:health_events` stream for the affected
|
||||
`game_id` and `event_type`:
|
||||
```bash
|
||||
redis-cli XRANGE runtime:health_events - + COUNT 200 \
|
||||
| grep -A4 'game_id\s*<game_id>'
|
||||
```
|
||||
2. Read the runtime record and the operation log:
|
||||
```bash
|
||||
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT id, op_kind, op_source, outcome, error_code, started_at
|
||||
FROM rtmanager.operation_log
|
||||
WHERE game_id = '<game_id>'
|
||||
ORDER BY started_at DESC LIMIT 20"
|
||||
```
|
||||
3. If Lobby has not reacted (the game's status remains `running` in
|
||||
`lobby.games`), check `runtime:job_results` lag and Lobby's
|
||||
`runtimejobresult` worker. RTM publishes the result; Lobby is the
|
||||
consumer.
|
||||
4. If the container is already gone (`docker ps -a` shows no row for
|
||||
`galaxy-game-<game_id>`), the reconciler will move the record to
|
||||
   `removed` on its next pass. Triggering the periodic reconcile manually
|
||||
by sending `SIGHUP` is **not** supported — wait
|
||||
`RTMANAGER_RECONCILE_INTERVAL` (default `5m`) or restart the
|
||||
process; the synchronous boot pass will handle the drift.
|
||||
5. The `notification:intents` stream is **not** the place to look
|
||||
for ongoing health changes. Only the three first-touch start
|
||||
failures (`runtime.image_pull_failed`,
|
||||
`runtime.container_start_failed`,
|
||||
`runtime.start_config_invalid`) produce a notification intent;
|
||||
probe failures, OOMs, and exits flow through health events only
|
||||
([`../README.md` §Notification Contracts](../README.md#notification-contracts)).
|
||||
|
||||
## Patch Upgrade
|
||||
|
||||
A patch upgrade replaces the container with a new `image_ref` while
|
||||
preserving the bind-mounted state directory.
|
||||
|
||||
Pre-conditions:
|
||||
|
||||
- The new and current `image_ref` tags both parse as semver. RTM
|
||||
rejects non-semver tags with `image_ref_not_semver`.
|
||||
- The new and current major / minor versions match. A cross-major or
|
||||
cross-minor patch returns `semver_patch_only`.
|
||||
|
||||
Driving the upgrade:
|
||||
|
||||
```bash
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/patch \
|
||||
-d '{"image_ref": "galaxy/game:1.4.2"}'
|
||||
```
|
||||
|
||||
Behaviour:
|
||||
|
||||
- The container is stopped, removed, and recreated. The
|
||||
`current_container_id` changes; the `engine_endpoint`
|
||||
(`http://galaxy-game-<game_id>:8080`) is stable.
|
||||
- The engine reads its state from the bind mount on startup, so any
|
||||
data written before the patch survives.
|
||||
- A single `operation_log` row is appended with `op_kind=patch` and
|
||||
the old / new image refs.
|
||||
- A `runtime:health_events container_started` is emitted by the
|
||||
inner start ([`workers.md` §1](workers.md)).
|
||||
|
||||
Post-patch verification:
|
||||
|
||||
```bash
|
||||
curl -s http://galaxy-game-<game_id>:8080/healthz
|
||||
curl -s http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>
|
||||
```
|
||||
|
||||
The `current_image_ref` field on the runtime record reflects the new
|
||||
tag.
|
||||
|
||||
## Manual Cleanup
|
||||
|
||||
The cleanup endpoint removes the container and updates the record to
|
||||
`removed`. It refuses to remove a `running` container — stop first.
|
||||
|
||||
```bash
|
||||
# Stop, then clean up
|
||||
curl -s -X POST \
|
||||
-H 'Content-Type: application/json' \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/stop \
|
||||
-d '{"reason":"admin_request"}'
|
||||
|
||||
curl -s -X DELETE \
|
||||
-H 'X-Galaxy-Caller: admin' \
|
||||
http://<rtm-host>:8096/api/v1/internal/runtimes/<game_id>/container
|
||||
```
|
||||
|
||||
The host state directory under `<RTMANAGER_GAME_STATE_ROOT>/<game_id>`
|
||||
is **never** deleted by RTM. Removing the directory is operator
|
||||
domain (backup tooling, future Admin Service workflow). The
|
||||
operation_log records `op_kind=cleanup_container` with
|
||||
`op_source=admin_rest`.
|
||||
|
||||
## Reconcile Drift After Docker Daemon Restart
|
||||
|
||||
A Docker daemon restart drops every running engine container; PG
|
||||
records remain. On RTM's next boot (or its next periodic reconcile):
|
||||
|
||||
1. The reconciler observes `running` records whose containers are
|
||||
missing from `docker ps`. It updates each record to `removed`,
|
||||
appends `operation_log` with `op_kind=reconcile_dispose`, and
|
||||
publishes `runtime:health_events container_disappeared`
|
||||
([`workers.md` §14–§15](workers.md)).
|
||||
2. Lobby's `runtimejobresult` worker does not consume the dispose
|
||||
event in v1, so the cascade does not auto-restart the engine.
|
||||
Operators trigger restarts through Lobby's user-facing flow or
|
||||
directly via the GM/Admin REST `restart` endpoint.
|
||||
3. If the operator brings up an engine container manually for
|
||||
diagnostics (`docker run` with the
|
||||
`com.galaxy.owner=rtmanager,com.galaxy.game_id=<game_id>` labels),
|
||||
the reconciler **adopts** it on the next pass: a new
|
||||
`runtime_records` row appears with `op_kind=reconcile_adopt`.
|
||||
The reconciler **never stops or removes** an unrecorded
|
||||
container — operators stay in control of manual containers
|
||||
([`../README.md` §Reconciliation](../README.md#reconciliation)).
|
||||
|
||||
Three drift kinds run through the same lease-guarded write pass:
|
||||
`adopt`, `dispose`, and the README-level path
|
||||
`observed_exited` (a record marked `running` whose container exists
|
||||
but is in `exited`). Telemetry counter
|
||||
`rtmanager.reconcile_drift{kind}` exposes the three independently
|
||||
([`workers.md` §15](workers.md)).
|
||||
|
||||
## Testing Locally
|
||||
|
||||
```sh
|
||||
# One-time bootstrap
|
||||
docker network create galaxy-net
|
||||
|
||||
# Minimal env (see docs/examples.md for a complete .env)
|
||||
export RTMANAGER_GAME_STATE_ROOT=/var/lib/galaxy/games
|
||||
export RTMANAGER_DOCKER_NETWORK=galaxy-net
|
||||
export RTMANAGER_INTERNAL_HTTP_ADDR=:8096
|
||||
export RTMANAGER_DOCKER_HOST=unix:///var/run/docker.sock
|
||||
export RTMANAGER_POSTGRES_PRIMARY_DSN='postgres://rtmanagerservice:rtmanagerservice@127.0.0.1:5432/galaxy?search_path=rtmanager&sslmode=disable'
|
||||
export RTMANAGER_REDIS_MASTER_ADDR=127.0.0.1:6379
|
||||
export RTMANAGER_REDIS_PASSWORD=local
|
||||
export RTMANAGER_LOBBY_INTERNAL_BASE_URL=http://127.0.0.1:8095
|
||||
|
||||
go run ./rtmanager/cmd/rtmanager
|
||||
```
|
||||
|
||||
After start:
|
||||
|
||||
- `curl http://localhost:8096/healthz` returns `{"status":"ok"}`;
|
||||
- `curl http://localhost:8096/readyz` returns `{"status":"ready"}`
|
||||
once PG, Redis, and Docker pings pass and the configured network
|
||||
exists;
|
||||
- driving Lobby through its public flow (`POST /api/v1/lobby/games/<id>/start`)
|
||||
brings up `galaxy-game-<game_id>` containers; RTM logs each
|
||||
lifecycle transition.
|
||||
|
||||
The integration suite under `rtmanager/integration/` exercises the
|
||||
end-to-end flows against the real Docker daemon. The default
|
||||
`go test ./...` skips it via the `integration` build tag; run
|
||||
explicitly with:
|
||||
|
||||
```sh
|
||||
make -C rtmanager integration
|
||||
```
|
||||
|
||||
The suite requires a reachable Docker daemon. Without one, the
|
||||
harness helpers call `t.Skip` and the package becomes a no-op
|
||||
([`integration-tests.md` §1](integration-tests.md)).
|
||||
|
||||
## Diagnostic Queries
|
||||
|
||||
Durable runtime state lives in PostgreSQL; runtime-coordination state
|
||||
stays in Redis. CLI snippets that help during incidents:
|
||||
|
||||
```bash
|
||||
# Live runtime count by status (PostgreSQL)
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT status, COUNT(*) FROM rtmanager.runtime_records GROUP BY status"
|
||||
|
||||
# Inspect a specific runtime record
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT * FROM rtmanager.runtime_records WHERE game_id = '<game_id>'"
|
||||
|
||||
# Last 20 operations for a game (newest first)
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT id, op_kind, op_source, outcome, error_code,
|
||||
started_at, finished_at
|
||||
FROM rtmanager.operation_log
|
||||
WHERE game_id = '<game_id>'
|
||||
ORDER BY started_at DESC, id DESC
|
||||
LIMIT 20"
|
||||
|
||||
# Latest health snapshot
|
||||
psql "$RTMANAGER_POSTGRES_PRIMARY_DSN" -c \
|
||||
"SELECT * FROM rtmanager.health_snapshots WHERE game_id = '<game_id>'"
|
||||
|
||||
# Containers RTM owns (Docker)
|
||||
docker ps --filter label=com.galaxy.owner=rtmanager \
|
||||
--format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Labels}}'
|
||||
|
||||
# Stream lag (Redis)
|
||||
redis-cli XINFO STREAM runtime:start_jobs
|
||||
redis-cli XINFO STREAM runtime:stop_jobs
|
||||
redis-cli GET rtmanager:stream_offsets:startjobs
|
||||
redis-cli GET rtmanager:stream_offsets:stopjobs
|
||||
|
||||
# Recent health events (oldest first)
|
||||
redis-cli XRANGE runtime:health_events - + COUNT 100
|
||||
|
||||
# Per-game lease (only present while an operation runs)
|
||||
redis-cli GET rtmanager:game_lease:<game_id>
|
||||
redis-cli TTL rtmanager:game_lease:<game_id>
|
||||
```
|
||||
|
||||
The gauges and counters surfaced through OpenTelemetry are the
primary observability surface; raw PostgreSQL and Redis access is
for last-resort triage.
|
||||
@@ -0,0 +1,309 @@
|
||||
# Runtime and Components
|
||||
|
||||
The diagram below focuses on the deployed `galaxy/rtmanager` process
|
||||
and its runtime dependencies. The current-state contract for every
|
||||
listener, worker, and adapter lives in [`../README.md`](../README.md);
|
||||
this document is the navigation aid that points at the right code path
|
||||
and the right design-rationale record.
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph Clients
|
||||
GM["Game Master"]
|
||||
Admin["Admin Service"]
|
||||
Lobby["Game Lobby"]
|
||||
end
|
||||
|
||||
subgraph RTM["Runtime Manager process"]
|
||||
InternalHTTP["Internal HTTP listener\n:8096 /healthz /readyz + REST"]
|
||||
StartJobs["startjobsconsumer"]
|
||||
StopJobs["stopjobsconsumer"]
|
||||
DockerEvents["dockerevents listener"]
|
||||
HealthProbe["healthprobe worker"]
|
||||
DockerInspect["dockerinspect worker"]
|
||||
Reconcile["reconcile worker"]
|
||||
Cleanup["containercleanup worker"]
|
||||
Services["lifecycle services\n(start, stop, restart, patch, cleanupcontainer)"]
|
||||
IntentPublisher["notification:intents publisher"]
|
||||
ResultsPublisher["runtime:job_results publisher"]
|
||||
HealthPublisher["runtime:health_events publisher"]
|
||||
Telemetry["Logs, traces, metrics"]
|
||||
end
|
||||
|
||||
Docker["Docker Daemon"]
|
||||
Engine["galaxy-game-{game_id} container"]
|
||||
Postgres["PostgreSQL\nschema rtmanager"]
|
||||
Redis["Redis\nstreams + leases + offsets"]
|
||||
LobbyHTTP["Lobby internal HTTP"]
|
||||
|
||||
Lobby -. runtime:start_jobs .-> StartJobs
|
||||
Lobby -. runtime:stop_jobs .-> StopJobs
|
||||
GM --> InternalHTTP
|
||||
Admin --> InternalHTTP
|
||||
|
||||
StartJobs --> Services
|
||||
StopJobs --> Services
|
||||
InternalHTTP --> Services
|
||||
|
||||
Services --> Docker
|
||||
Services --> Postgres
|
||||
Services --> Redis
|
||||
Services --> ResultsPublisher
|
||||
Services --> HealthPublisher
|
||||
Services --> IntentPublisher
|
||||
Services -. GET diagnostic .-> LobbyHTTP
|
||||
|
||||
DockerEvents --> Docker
|
||||
DockerInspect --> Docker
|
||||
HealthProbe --> Engine
|
||||
Reconcile --> Docker
|
||||
Reconcile --> Postgres
|
||||
Cleanup --> Postgres
|
||||
Cleanup --> Services
|
||||
|
||||
DockerEvents --> HealthPublisher
|
||||
DockerInspect --> HealthPublisher
|
||||
HealthProbe --> HealthPublisher
|
||||
|
||||
HealthPublisher --> Redis
|
||||
ResultsPublisher --> Redis
|
||||
IntentPublisher --> Redis
|
||||
|
||||
StartJobs --> Redis
|
||||
StopJobs --> Redis
|
||||
InternalHTTP --> Postgres
|
||||
|
||||
Docker -->|create / start / stop / rm| Engine
|
||||
Engine -. bind mount .- StateDir["host:\n<RTMANAGER_GAME_STATE_ROOT>/{game_id}"]
|
||||
|
||||
InternalHTTP --> Telemetry
|
||||
Services --> Telemetry
|
||||
StartJobs --> Telemetry
|
||||
StopJobs --> Telemetry
|
||||
DockerEvents --> Telemetry
|
||||
HealthProbe --> Telemetry
|
||||
DockerInspect --> Telemetry
|
||||
Reconcile --> Telemetry
|
||||
Cleanup --> Telemetry
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- `cmd/rtmanager` refuses startup when PostgreSQL is unreachable, when
|
||||
goose migrations fail, when Redis ping fails, when the Docker daemon
|
||||
ping fails, or when the configured Docker network is missing. Lobby
|
||||
reachability is **not** verified at boot — the start service's
|
||||
diagnostic `GET /api/v1/internal/games/{game_id}` call is a no-op
|
||||
outside of debug logging
|
||||
([`services.md` §7](services.md)).
|
||||
- The reconciler runs **synchronously** once on startup before
|
||||
`app.App.Run` registers any other component, then re-runs
|
||||
periodically as a regular `Component`. The synchronous pass is the
|
||||
reason why orphaned containers from a prior process can never be
|
||||
observed by the events listener with no PG record
|
||||
([`workers.md` §17](workers.md)).
|
||||
- A single internal HTTP listener exposes both probes
|
||||
(`/healthz`, `/readyz`) and the trusted REST surface for Game Master
|
||||
and Admin Service. There is no public listener — RTM does not face
|
||||
end users.
|
||||
|
||||
## Listeners
|
||||
|
||||
| Listener | Default addr | Purpose |
|
||||
| --- | --- | --- |
|
||||
| Internal HTTP | `:8096` | Probes (`/healthz`, `/readyz`) plus the trusted REST surface for `Game Master` and `Admin Service` |
|
||||
|
||||
Shared listener defaults from `RTMANAGER_INTERNAL_HTTP_*`:
|
||||
|
||||
- read timeout: `5s`
|
||||
- write timeout: `15s`
|
||||
- idle timeout: `60s`
|
||||
|
||||
The listener is unauthenticated and assumes a trusted network segment.
|
||||
The `X-Galaxy-Caller` request header carries an optional caller
|
||||
identity (`gm` or `admin`) that the handler records as
|
||||
`operation_log.op_source`
|
||||
([`services.md` §18](services.md)).
|
||||
|
||||
Probe routes:
|
||||
|
||||
- `GET /healthz` — process liveness; returns `{"status":"ok"}` while
|
||||
the listener is up.
|
||||
- `GET /readyz` — live-pings PostgreSQL primary, Redis master, and the
|
||||
Docker daemon, then asserts the configured Docker network exists.
|
||||
Returns `{"status":"ready"}` only when every check passes; otherwise
|
||||
returns `503` with the canonical error envelope.
|
||||
|
||||
## Background Workers
|
||||
|
||||
Every worker runs as an `app.Component` and is registered in the
|
||||
order below by [`internal/app/runtime.go`](../internal/app/runtime.go).
|
||||
|
||||
| Worker | Source | Trigger | Function |
|
||||
| --- | --- | --- | --- |
|
||||
| Start jobs consumer | [`internal/worker/startjobsconsumer`](../internal/worker/startjobsconsumer) | Redis `XREAD runtime:start_jobs` | Decodes `{game_id, image_ref, requested_at_ms}` and invokes `startruntime.Service`; publishes the outcome to `runtime:job_results` |
|
||||
| Stop jobs consumer | [`internal/worker/stopjobsconsumer`](../internal/worker/stopjobsconsumer) | Redis `XREAD runtime:stop_jobs` | Decodes `{game_id, reason, requested_at_ms}` and invokes `stopruntime.Service`; publishes the outcome to `runtime:job_results` |
|
||||
| Docker events listener | [`internal/worker/dockerevents`](../internal/worker/dockerevents) | Docker `/events` API filtered by `com.galaxy.owner=rtmanager` | Emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. Reconnects on transport errors with a fixed 5s backoff ([`workers.md` §7](workers.md)) |
|
||||
| Health probe worker | [`internal/worker/healthprobe`](../internal/worker/healthprobe) | Periodic `RTMANAGER_PROBE_INTERVAL` | `GET {engine_endpoint}/healthz` for every running runtime; in-memory hysteresis emits `probe_failed` after `RTMANAGER_PROBE_FAILURES_THRESHOLD` consecutive failures and `probe_recovered` on the first success thereafter ([`workers.md` §5–§6](workers.md)) |
|
||||
| Docker inspect worker | [`internal/worker/dockerinspect`](../internal/worker/dockerinspect) | Periodic `RTMANAGER_INSPECT_INTERVAL` | Calls `InspectContainer` for every running runtime; emits `inspect_unhealthy` on `RestartCount` growth, unexpected status, or Docker `HEALTHCHECK=unhealthy` |
|
||||
| Reconciler | [`internal/worker/reconcile`](../internal/worker/reconcile) | Synchronous startup pass + periodic `RTMANAGER_RECONCILE_INTERVAL` | Adopts unrecorded containers (`reconcile_adopt`), disposes records whose container vanished (`reconcile_dispose`), records observed exits (`observed_exited`); every mutation runs under the per-game lease ([`workers.md` §14–§15](workers.md)) |
|
||||
| Container cleanup | [`internal/worker/containercleanup`](../internal/worker/containercleanup) | Periodic `RTMANAGER_CLEANUP_INTERVAL` | Lists `runtime_records` rows with `status=stopped AND last_op_at < now - retention`, delegates to `cleanupcontainer.Service` per game ([`workers.md` §19](workers.md)) |
|
||||
|
||||
The events listener and the inspect worker do **not** emit
|
||||
`container_started` — that event is owned by the start service
|
||||
([`workers.md` §1](workers.md)). The events listener and the inspect
|
||||
worker also do not emit `container_disappeared` autonomously when a
|
||||
record is missing or stale; the conditional emission rules live in
|
||||
[`workers.md` §2](workers.md) and [`§4`](workers.md).
|
||||
|
||||
## Lifecycle Services

The five lifecycle services are pure orchestrators called from both
the stream consumers and the REST handlers. Each service owns the
per-game lease for the duration of its operation.

| Service | Source | Triggers | Failure envelope |
| --- | --- | --- | --- |
| `startruntime` | [`internal/service/startruntime`](../internal/service/startruntime) | `runtime:start_jobs`, `POST /api/v1/internal/runtimes/{id}/start` | `start_config_invalid`, `image_pull_failed`, `container_start_failed`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §4](services.md)) |
| `stopruntime` | [`internal/service/stopruntime`](../internal/service/stopruntime) | `runtime:stop_jobs`, `POST /api/v1/internal/runtimes/{id}/stop` | `conflict`, `service_unavailable`, `internal_error`, `not_found` ([`services.md` §17](services.md)) |
| `restartruntime` | [`internal/service/restartruntime`](../internal/service/restartruntime) | `POST /api/v1/internal/runtimes/{id}/restart` | inherited from inner stop / start; lease covers both inner ops ([`services.md` §12, §17](services.md)) |
| `patchruntime` | [`internal/service/patchruntime`](../internal/service/patchruntime) | `POST /api/v1/internal/runtimes/{id}/patch` | `image_ref_not_semver`, `semver_patch_only`, plus inherited start/stop codes ([`services.md` §14, §17](services.md)) |
| `cleanupcontainer` | [`internal/service/cleanupcontainer`](../internal/service/cleanupcontainer) | `DELETE /api/v1/internal/runtimes/{id}/container`, periodic cleanup worker | `not_found`, `conflict`, `service_unavailable`, `internal_error` ([`services.md` §17](services.md)) |

All services share three behaviours captured in
[`services.md`](services.md):

- the per-game Redis lease (`rtmanager:game_lease:{game_id}`,
  TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`) is acquired by the service,
  not by the caller — which keeps consumer and REST callers symmetric
  ([`services.md` §1](services.md));
- the canonical `Result` shape (`Outcome`, `ErrorCode`, `Record`,
  `ContainerID`, `EngineEndpoint`) is what consumers and REST
  handlers translate into job_results / HTTP responses
  ([`services.md` §3](services.md));
- failures pass through one `operation_log` write before returning,
  and three of the failure codes (`start_config_invalid`,
  `image_pull_failed`, `container_start_failed`) also publish a
  `runtime.*` admin notification intent
  ([`services.md` §4](services.md)).

## Synchronous Upstream Client

| Client | Endpoint | Failure mapping |
| --- | --- | --- |
| `Game Lobby` internal | `GET {RTMANAGER_LOBBY_INTERNAL_BASE_URL}/api/v1/internal/games/{game_id}` | Diagnostic-only in v1; the start service ignores the body and absorbs network failures with a debug log. Decision: [`services.md` §7](services.md) |

The Lobby internal client is the only synchronous outbound client RTM
holds. Every other interaction (Notification Service, Game Master,
Admin Service) crosses an asynchronous boundary or is initiated by the
peer.

## Stream Offsets

Each consumer persists its position under a fixed label so process
restart preserves stream progress.

| Stream | Offset key | Block timeout env |
| --- | --- | --- |
| `runtime:start_jobs` | `rtmanager:stream_offsets:startjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |
| `runtime:stop_jobs` | `rtmanager:stream_offsets:stopjobs` | `RTMANAGER_STREAM_BLOCK_TIMEOUT` |

The labels `startjobs` and `stopjobs` are stable identifiers — they
are decoupled from the underlying stream key. An operator who renames
a stream via `RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM` does not lose the persisted offset.
Decision: [`workers.md` §9](workers.md).

The `runtime:job_results`, `runtime:health_events`, and
`notification:intents` streams are outbound; RTM does not consume them
itself.

## Configuration Groups

The full env-var list with defaults lives in
[`../README.md` §Configuration](../README.md). The groups below
summarise the structure:

- **Required** — `RTMANAGER_INTERNAL_HTTP_ADDR`,
  `RTMANAGER_POSTGRES_PRIMARY_DSN`, `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_PASSWORD`, `RTMANAGER_DOCKER_HOST`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_GAME_STATE_ROOT`.
- **Listener** — `RTMANAGER_INTERNAL_HTTP_*` timeouts.
- **Docker** — `RTMANAGER_DOCKER_HOST`, `RTMANAGER_DOCKER_API_VERSION`,
  `RTMANAGER_DOCKER_NETWORK`, `RTMANAGER_DOCKER_LOG_DRIVER`,
  `RTMANAGER_DOCKER_LOG_OPTS`, `RTMANAGER_IMAGE_PULL_POLICY`.
- **Container defaults** — `RTMANAGER_DEFAULT_CPU_QUOTA`,
  `RTMANAGER_DEFAULT_MEMORY`, `RTMANAGER_DEFAULT_PIDS_LIMIT`,
  `RTMANAGER_CONTAINER_STOP_TIMEOUT_SECONDS`,
  `RTMANAGER_CONTAINER_RETENTION_DAYS`,
  `RTMANAGER_ENGINE_STATE_MOUNT_PATH`,
  `RTMANAGER_ENGINE_STATE_ENV_NAME`,
  `RTMANAGER_GAME_STATE_DIR_MODE`,
  `RTMANAGER_GAME_STATE_OWNER_UID`,
  `RTMANAGER_GAME_STATE_OWNER_GID`.
- **PostgreSQL connectivity** — `RTMANAGER_POSTGRES_PRIMARY_DSN`,
  `RTMANAGER_POSTGRES_REPLICA_DSNS`,
  `RTMANAGER_POSTGRES_OPERATION_TIMEOUT`,
  `RTMANAGER_POSTGRES_MAX_OPEN_CONNS`,
  `RTMANAGER_POSTGRES_MAX_IDLE_CONNS`,
  `RTMANAGER_POSTGRES_CONN_MAX_LIFETIME`.
- **Redis connectivity** — `RTMANAGER_REDIS_MASTER_ADDR`,
  `RTMANAGER_REDIS_REPLICA_ADDRS`, `RTMANAGER_REDIS_PASSWORD`,
  `RTMANAGER_REDIS_DB`, `RTMANAGER_REDIS_OPERATION_TIMEOUT`.
- **Streams** — `RTMANAGER_REDIS_START_JOBS_STREAM`,
  `RTMANAGER_REDIS_STOP_JOBS_STREAM`,
  `RTMANAGER_REDIS_JOB_RESULTS_STREAM`,
  `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
  `RTMANAGER_NOTIFICATION_INTENTS_STREAM`,
  `RTMANAGER_STREAM_BLOCK_TIMEOUT`.
- **Health monitoring** — `RTMANAGER_INSPECT_INTERVAL`,
  `RTMANAGER_PROBE_INTERVAL`, `RTMANAGER_PROBE_TIMEOUT`,
  `RTMANAGER_PROBE_FAILURES_THRESHOLD`.
- **Reconciler / cleanup** — `RTMANAGER_RECONCILE_INTERVAL`,
  `RTMANAGER_CLEANUP_INTERVAL`.
- **Coordination** — `RTMANAGER_GAME_LEASE_TTL_SECONDS`.
- **Lobby internal client** — `RTMANAGER_LOBBY_INTERNAL_BASE_URL`,
  `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
- **Process and logging** — `RTMANAGER_LOG_LEVEL`,
  `RTMANAGER_SHUTDOWN_TIMEOUT`.
- **Telemetry** — standard `OTEL_*`.

## Runtime Notes

- **Single-instance v1.** Multi-instance Runtime Manager with Redis
  Streams consumer groups is explicitly out of scope for the current
  iteration. The per-game lease serialises operations on one game
  across the consumer + REST entry points; cross-instance
  coordination is deferred until a real workload demands it.
- **Lease semantics.** `rtmanager:game_lease:{game_id}` is
  `SET ... NX PX <ttl>` with TTL `RTMANAGER_GAME_LEASE_TTL_SECONDS`
  (default `60s`). The lease is **not renewed mid-operation** in v1;
  long pulls of multi-GB images can therefore expire the lease
  before the operation finishes — the trade-off is documented in
  [`services.md` §1](services.md). The reconciler honours the same
  lease around every drift mutation
  ([`workers.md` §14](workers.md)).
- **Operation log is the source of truth.** Every lifecycle and
  reconcile mutation appends one row to `rtmanager.operation_log`.
  The `runtime:health_events` stream and the `notification:intents`
  emissions are best-effort — a publish failure logs at `Error` and
  proceeds, never rolling back the recorded operation
  ([`workers.md` §8](workers.md)).
- **In-memory probe hysteresis.** The active HTTP probe keeps
  per-game `consecutiveFailures` and `failurePublished` counters in a
  mutex-guarded map. State is non-persistent: a process restart that
  loses the counters re-establishes hysteresis from scratch, and
  state for a game that transitions through `stopped → running` is
  pruned at the start of every probe tick
  ([`workers.md` §5](workers.md)).
- **Pull policy fallbacks.** `RTMANAGER_IMAGE_PULL_POLICY` accepts
  `if_missing` (default), `always`, and `never`. Image labels
  (`com.galaxy.cpu_quota`, `com.galaxy.memory`,
  `com.galaxy.pids_limit`) drive resource limits when present; the
  matching `RTMANAGER_DEFAULT_*` env vars supply the fallback when a
  label is absent or unparseable. Producers never pass limits. (A
  sketch of the fallback rule follows this list.)
- **State directory ownership.** RTM creates per-game state
  directories under `RTMANAGER_GAME_STATE_ROOT` with the configured
  mode and uid/gid, but **never deletes them**. Removing the directory
  is operator domain (backup tooling, a future Admin Service
  workflow). A cleanup that removes the container leaves the
  directory intact.

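The label-or-default rule is small enough to sketch. A minimal,
hypothetical illustration — `limitFromLabel` is not an actual
rtmanager helper, only the shape of the fallback described above:

```go
package main

import (
	"fmt"
	"strconv"
)

// limitFromLabel returns the parsed label value when it is present and
// valid, and the configured RTMANAGER_DEFAULT_* value otherwise.
func limitFromLabel(labels map[string]string, key string, fallback int64) int64 {
	raw, ok := labels[key]
	if !ok {
		return fallback
	}
	parsed, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || parsed <= 0 {
		return fallback // unparseable label: fall back, never fail the start
	}
	return parsed
}

func main() {
	labels := map[string]string{"com.galaxy.memory": "536870912"}
	fmt.Println(limitFromLabel(labels, "com.galaxy.memory", 268435456)) // label wins
	fmt.Println(limitFromLabel(labels, "com.galaxy.pids_limit", 256))   // default wins
}
```
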
@@ -0,0 +1,443 @@
# Lifecycle Services

This document explains the design of the five lifecycle services
(`startruntime`, `stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) under [`../internal/service/`](../internal/service)
plus the per-handler REST glue under
[`../internal/api/internalhttp/`](../internal/api/internalhttp).

The current-state behaviour (lifecycle steps, failure tables, the
per-game lease semantics, the wire contracts) lives in
[`../README.md`](../README.md), the OpenAPI spec at
[`../api/internal-openapi.yaml`](../api/internal-openapi.yaml), and the
AsyncAPI spec at
[`../api/runtime-jobs-asyncapi.yaml`](../api/runtime-jobs-asyncapi.yaml).
This file records the *why*.

## 1. Per-game lease lives at the service layer

Every lifecycle service acquires `rtmanager:game_lease:{game_id}` via
[`ports.GameLeaseStore`](../internal/ports/gamelease.go) before doing
any work, and releases it on the way out:

- the lease primitive serialises operations on a single game across
  every entry point (stream consumers and REST handlers);
- holding the lease at the service layer keeps the consumer / REST
  callers symmetric — neither acquires the lease itself, both call
  the service the same way;
- the Redis-backed adapter
  ([`../internal/adapters/redisstate/gamelease/store.go`](../internal/adapters/redisstate/gamelease/store.go))
  uses `SET NX PX` on acquire, Lua compare-and-delete on release; a
  release whose caller-supplied token no longer matches is a silent
  no-op.

The lease key shape is `rtmanager:game_lease:{base64url(game_id)}` so
opaque game ids may contain any characters without leaking through
the key syntax.

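A minimal sketch of the acquire / release mechanics with
`github.com/redis/go-redis/v9` — an illustration of the shape described
above, not the adapter's actual code; the unpadded base64url encoding
is an assumption:

```go
package gamelease

import (
	"context"
	"encoding/base64"
	"time"

	"github.com/redis/go-redis/v9"
)

// Compare-and-delete: only the holder whose token still matches may
// delete the key; anything else is a silent no-op.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0`)

func leaseKey(gameID string) string {
	return "rtmanager:game_lease:" + base64.RawURLEncoding.EncodeToString([]byte(gameID))
}

// Acquire takes the lease with SET NX PX semantics; false means another
// operation currently holds the game.
func Acquire(ctx context.Context, rdb *redis.Client, gameID, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, leaseKey(gameID), token, ttl).Result()
}

// Release deletes the key only when the stored token still matches.
func Release(ctx context.Context, rdb *redis.Client, gameID, token string) error {
	return releaseScript.Run(ctx, rdb, []string{leaseKey(gameID)}, token).Err()
}
```
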
The lease TTL is `RTMANAGER_GAME_LEASE_TTL_SECONDS` (default `60s`)
and is **not renewed mid-operation** in v1. A multi-GB image pull can
theoretically expire the lease before the start service finishes;
operators see this as a `reconcile_adopt` event later because the
container is created with the standard owner labels. A renewal helper
is deliberately deferred until a workload makes it necessary.

The reconciler ([`workers.md`](workers.md) §14) honours the same lease
around every drift mutation, which closes the
restart-vs-`reconcile_dispose` race documented in §6 below.

## 2. Health-events publisher lands with the start service

The start service publishes `container_started` after `docker run`
returns; the events listener intentionally does **not** duplicate
the event ([`workers.md`](workers.md) §1). Centralising the publisher
on the start service avoids a "who emits what" ambiguity and lets the
publisher be a thin port wrapper rather than a worker-specific
helper.

The publisher port lives next to the snapshot-upsert rule
([`adapters.md`](adapters.md) §8): one Publish call updates both
surfaces.

## 3. `Result`-shaped contract

`Service.Handle` returns `(Result, error)`. The Go-level `error` is
reserved for system-level / programmer faults (nil context, nil
service). All business outcomes flow through `Result`:

- `Outcome=success`, `ErrorCode=""` — fresh start succeeded;
- `Outcome=success`, `ErrorCode="replay_no_op"` — idempotent replay;
- `Outcome=failure`, `ErrorCode` set — business failure
  (`start_config_invalid` / `image_pull_failed` /
  `container_start_failed` / `conflict` / `service_unavailable` /
  `internal_error`).

The stream consumer uses `Outcome` and `ErrorCode` to populate
`runtime:job_results` directly; the REST handler maps `Outcome=failure`
plus `ErrorCode` to the matching HTTP status. Both callers are simpler
with this contract than with an `errors.Is`-driven sentinel taxonomy.

`ports.JobResult` and the two `JobOutcome*` string constants live in
the ports package next to `JobResultPublisher` so the wire shape is
defined exactly once. The constants are intentionally not aliases of
`operation.Outcome` — the audit-log enum is allowed to grow without
breaking the wire format.

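A rough sketch of how a caller branches on the contract — the field
names are assumptions drawn from the prose above, and the status
mapping mirrors the canonical table in §18:

```go
// Illustrative only: field names follow the description above, not the
// exact declarations in internal/service and internal/ports.
type Result struct {
	Outcome        string // "success" | "failure"
	ErrorCode      string // "", "replay_no_op", or a stable failure code
	ContainerID    string
	EngineEndpoint string
}

// A REST handler maps the pair onto an HTTP status; a stream consumer
// copies the fields into the job-result payload verbatim instead.
func httpStatusFor(result Result) int {
	if result.Outcome == "success" {
		return 200
	}
	switch result.ErrorCode {
	case "invalid_request", "start_config_invalid", "image_ref_not_semver":
		return 400
	case "not_found":
		return 404
	case "conflict", "semver_patch_only":
		return 409
	case "service_unavailable", "docker_unavailable":
		return 503
	default:
		return 500
	}
}
```
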
## 4. Start service failure-mode mapping

| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | `start_config_invalid` | `runtime.start_config_invalid` |
| Lease busy | `conflict` | — |
| Existing record running with a different image_ref | `conflict` | — |
| Get returns a non-NotFound transport error | `internal_error` | — |
| `image_ref` shape rejected by `distribution/reference` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns `ErrNetworkMissing` | `start_config_invalid` | `runtime.start_config_invalid` |
| `EnsureNetwork` returns any other error | `service_unavailable` | — |
| `PullImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `InspectImage` failure | `image_pull_failed` | `runtime.image_pull_failed` |
| `prepareStateDir` failure | `start_config_invalid` | `runtime.start_config_invalid` |
| `Run` failure | `container_start_failed` | `runtime.container_start_failed` |
| `Upsert` failure after successful Run | `container_start_failed` | `runtime.container_start_failed` |

Three error codes do **not** raise an admin notification: `conflict`,
`service_unavailable`, and `internal_error` are operational classes
(another caller is in flight, a dependency is down, an unclassified
fault) where the corrective action is not a configuration change. The
operator already sees them through telemetry and structured logs; an
email per occurrence would be noise.

## 5. Upsert-after-Run rollback

A `Run` that succeeded but whose `Upsert` failed leaves a running
container with no PG record. The service issues a best-effort
`docker.Remove(containerID)` in a fresh `context.Background()` (the
request context may already be cancelled) before recording the failure.
A Remove failure is logged but not propagated; the reconciler adopts
surviving orphans on its periodic pass.

The Docker adapter already removes the container when `Run` itself
returns an error after a successful `ContainerCreate` ([`adapters.md`](adapters.md) §3).
The service-layer rollback covers the additional post-`Run` Upsert
failure path.

## 6. Pre-existing record handling

Only `status=running` + same `image_ref` is a `replay_no_op`.
`running` + a different `image_ref` returns `failure / conflict` (use
`patch` to change the image of a running container).

Anything else (`stopped`, `removed`, missing record) proceeds with a
fresh start that ends in `Upsert`. `Upsert` overwrites verbatim and is
not bound by the transitions table, so installing a `running` record
over a `removed` row is permitted — the `removed` terminus rule lives
in `runtime.AllowedTransitions` (which guards `UpdateStatus`), not in
`Upsert`.

`created_at` is preserved across re-starts: the start service reuses
`existing.CreatedAt` when the record was found, so the
"first time RTM saw the game" semantics from
[`postgres-migration.md`](postgres-migration.md) §9 hold even when the
start path goes through `Upsert` rather than through the runtime
adapter's `INSERT ... ON CONFLICT DO UPDATE` EXCLUDED list.

A residual `galaxy-game-{game_id}` container left over from a previous
start that was stopped but never cleaned up will fail at `docker run`
with a name conflict. The service surfaces that as
`container_start_failed`; cleanup plus the reconciler is the standard
remedy. A pre-emptive Remove inside the start service was rejected
because it would silently remove stopped containers that an operator
may still be inspecting.

## 7. `LobbyInternalClient.GetGame` is best-effort

The fetch happens after the lease is acquired and before the Docker
work, with the configured `RTMANAGER_LOBBY_INTERNAL_TIMEOUT`.
`ErrLobbyUnavailable` and `ErrLobbyGameNotFound` are logged at
`debug`; the start operation continues either way. The fetched
`Status` and `TargetEngineVersion` enrich logs only — the start
envelope already carries the only required field (`image_ref`), and
the port docstring fixes the recoverable-failure contract.

## 8. `image_ref` validation

Validation uses `github.com/distribution/reference.ParseNormalizedNamed`
before any Docker round-trip. Rejected shapes surface as
`start_config_invalid` plus a `runtime.start_config_invalid` intent.
Daemon-side rejections after a valid parse (manifest unknown,
authentication required) surface as `image_pull_failed` plus a
`runtime.image_pull_failed` intent. The split keeps operator-actionable
configuration mistakes distinct from registry-side failures.

## 9. State-directory preparer is overrideable

`Dependencies.PrepareStateDir` is a `func(gameID string) (string, error)`
injection point that defaults to `os.MkdirAll` + `os.Chmod` +
`os.Chown` against `RTMANAGER_GAME_STATE_ROOT`. Tests override it to
point at a `t.TempDir()`-style fake without exercising the real
filesystem permissions (which require either matching uid/gid or
root). This is a deliberate non-port abstraction: the start service
does no other filesystem work and the cost of a new port for one
helper is not worth the indirection.

## 10. Container env: both `GAME_STATE_PATH` and `STORAGE_PATH`

Both names are accepted by the v1 engine. The start service always
sets both; the configured `RTMANAGER_ENGINE_STATE_ENV_NAME` controls
the primary. When the operator overrides the primary to `STORAGE_PATH`,
the deduplicating map collapses the two entries into one.

## 11. Wiring layer construction

`internal/app/wiring.go` is the single point that builds every
production store, adapter, and service from `config.Config`. The
struct exposes typed fields so handlers and workers can grab the
singletons without re-wiring; an `addCloser` slice releases adapter
resources (currently the Lobby HTTP client's idle-connection pool) at
runtime shutdown. The `runtimeRecordsProbe` adapter installed during
construction registers the `rtmanager.runtime_records_by_status`
gauge documented in [`../README.md` §Observability](../README.md).

The persistence-only `CountByStatus` method on the `runtimerecordstore`
adapter is **not** part of `ports.RuntimeRecordStore` because it is
only used by the gauge probe; widening the port for one caller would
force every adapter and test fake to grow with no benefit. The adapter
exposes it directly and the wiring composes a concrete-typed wrapper.

## 12. Shared lease across composed operations (restart, patch)

Restart and patch must hold the lease across the inner
`stop → docker rm → start` sequence, otherwise a concurrent stop or
restart could observe a half-recreated runtime.

`startruntime.Service` and `stopruntime.Service` therefore expose a
second public method:

```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```

`Handle` acquires the lease, defers its release, and calls `Run`.
Restart and patch acquire the outer lease themselves and call `Run`
on the inner services. The inner services record their own
`operation_log` entries, telemetry counters, health events, and admin
notification intents identically to a top-level `Handle`.

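A minimal sketch of the `Handle`-wraps-`Run` convention. The
scaffolding types are illustrative — the real declarations carry many
more fields — and only the lease choreography is shown:

```go
package startruntime

import (
	"context"
	"crypto/rand"
	"encoding/base64"
	"time"
)

// Illustrative scaffolding; the real Input, Result, and lease port are richer.
type Input struct{ GameID string }
type Result struct{ Outcome, ErrorCode string }

type leaseStore interface {
	Acquire(ctx context.Context, gameID, token string, ttl time.Duration) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type Service struct {
	leases   leaseStore
	leaseTTL time.Duration
}

// NewToken produces a 32-byte base64url token (also reused as the
// correlation id fallback described in §13).
func NewToken() string {
	buf := make([]byte, 32)
	_, _ = rand.Read(buf)
	return base64.RawURLEncoding.EncodeToString(buf)
}

// Handle owns the lease; Run assumes the caller already holds it.
func (service *Service) Handle(ctx context.Context, input Input) (Result, error) {
	token := NewToken()
	acquired, err := service.leases.Acquire(ctx, input.GameID, token, service.leaseTTL)
	if err != nil {
		return Result{Outcome: "failure", ErrorCode: "service_unavailable"}, nil
	}
	if !acquired {
		return Result{Outcome: "failure", ErrorCode: "conflict"}, nil
	}
	defer service.leases.Release(ctx, input.GameID, token) // best-effort release
	return service.Run(ctx, input)
}

func (service *Service) Run(ctx context.Context, input Input) (Result, error) {
	// ... the actual lifecycle steps ...
	return Result{Outcome: "success"}, nil
}
```
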
A typed `LeaseTicket` parameter (a small internal-package zero-size
struct that only the lease store can construct) was considered and
rejected for v1: only sister services in `internal/service/` ever call
`Run`, the docstring is loud about the precondition, and the pattern
can be tightened later without breaking the public surface that
consumers and handlers consume.

## 13. Correlation id on `source_ref`

The outer restart and patch services reuse the existing
`Input.SourceRef` as a correlation key:

- when `Input.SourceRef` is non-empty (REST request id, stream entry
  id), all three entries — outer restart / patch + inner stop +
  inner start — share that value;
- when empty, the outer service generates a 32-byte base64url string
  via the same `NewToken` generator that produces lease tokens, and
  uses it as the correlation key for all three entries.

The outer entry's `source_ref` keeps its dual semantics: actor ref
when the caller supplied one, generated correlation id otherwise. Pure
top-level operations (caller invokes start, stop, or cleanup directly)
keep the original meaning. Composed operations (restart, patch) use
the same value in three places to make audit queries trivial.

This is not the cleanest end-state — a dedicated `correlation_id`
column would carry the link without ambiguity — but it is the smallest
change that does not touch the schema. A future stage that adds the
column can rename the field and clear up the dual role in one move.

## 14. Semver validation for patch

`internal/service/patchruntime/semver.go` enforces the
patch-precondition (current and new `image_ref` parse as semver, share
major and minor):

- `extractSemverTag(imageRef)` parses with
  `github.com/distribution/reference.ParseNormalizedNamed`, casts to
  `reference.NamedTagged`, then validates the tag with
  `golang.org/x/mod/semver.IsValid` (after prepending `v` when the tag
  omits it). Failures map to `image_ref_not_semver`;
- `samePatchSeries(currentSemver, newSemver)` compares
  `semver.MajorMinor` of the two canonical strings; mismatch maps to
  `semver_patch_only`.

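A sketch of how the two helpers could look, assuming the libraries
named above; the mapping to error codes happens in the caller and is
omitted here:

```go
package patchruntime

import (
	"errors"
	"strings"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var errNotSemver = errors.New("image_ref tag is not a semver version")

// extractSemverTag returns the canonical "vX.Y.Z" form of the image tag,
// or errNotSemver when the ref has no tag or the tag is not valid semver.
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", errNotSemver
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemver
	}
	tag := tagged.Tag()
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag // x/mod/semver requires the leading "v"
	}
	if !semver.IsValid(tag) {
		return "", errNotSemver
	}
	return tag, nil
}

// samePatchSeries reports whether the two canonical versions share
// major and minor, i.e. the patch precondition holds.
func samePatchSeries(currentSemver, newSemver string) bool {
	return semver.MajorMinor(currentSemver) == semver.MajorMinor(newSemver)
}
```
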
`golang.org/x/mod` is a direct require to avoid a transitive-version
surprise. `github.com/Masterminds/semver/v3` (also in the module
graph) was rejected to avoid two semver libraries on disk for the
same job; `x/mod/semver` already covers Lobby. A hand-rolled
`vMajor.Minor.Patch` parser was rejected as premature.

Pre-checks run before any inner stop or `docker rm`: a rejected patch
never disturbs the running runtime. Patch with
`new_image_ref == current_image_ref` proceeds through the recreate
flow unchanged (not `replay_no_op`: the inner start still runs); the
outer `op_kind=patch` entry records the no-op patch for audit.

## 15. `StopReason` placement

The reason enum mirrors `lobby/internal/ports/runtimemanager.go`
verbatim and lives at `internal/service/stopruntime/stopreason.go`.
The stream consumer and the REST handler import `stopruntime` for
the same enum the service requires.

Inner stop calls from restart and patch always pass
`StopReasonAdminRequest`. Restart and patch are platform-internal
recreate flows; `admin_request` is the closest semantic match in the
five-value vocabulary. The actor that originated the recreate (REST
request id, admin user id) flows through the `op_source` /
`source_ref` pair, not through the stop reason.

## 16. Error code centralisation

`internal/service/startruntime/errors.go` is the canonical home for
the stable error codes returned in `Result.ErrorCode`. The other four
services (`stopruntime`, `restartruntime`, `patchruntime`,
`cleanupcontainer`) import the constants from `startruntime` rather
than redeclaring them. The package comment of `errors.go` flags the
shared usage so future readers do not chase per-service declarations.

`start_config_invalid` is reserved for start because every start
validation failure also raises an admin notification intent. The
other services use the more general `invalid_request` for input
validation failures.

## 17. Stop / restart / patch / cleanup failure tables

### `stopruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | No notification intent. |
| Lease busy | `conflict` | Lease release skipped because acquire returned false. |
| Lease error | `service_unavailable` | Redis unreachable. |
| Record missing | `not_found` | |
| Status `stopped` / `removed` | success / `replay_no_op` | Idempotent re-stop. |
| `docker.Stop` returns `ErrContainerNotFound` | success | Record transitions `running → removed`, `container_disappeared` health event published. |
| `docker.Stop` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` (CAS race) | success / `replay_no_op` | The desired state was reached by another path (reconciler / restart). |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | Record vanished mid-stop. |
| `UpdateStatus` other error | `internal_error` | |

### `restartruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | Same as stop. |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | `image_ref` may be empty; restart cannot proceed. |
| Inner stop fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner stop failed: ". |
| `docker.Remove` fails | `service_unavailable` | Inner stop already moved record to `stopped`; runtime stays in `stopped`. Admin must call `cleanup_container` before retrying restart. |
| Inner start fails | inner `ErrorCode` | Outer `ErrorMessage` prefixes "inner start failed: ". |

The post-stop `docker rm` failure is the only path that leaves the
runtime in a state from which the same operation cannot recover by
itself: a residual `galaxy-game-{game_id}` container blocks a fresh
inner start (the start service surfaces this as
`container_start_failed`). The runbook entry — "call cleanup, then
restart again" — is the standard remedy.

### `patchruntime`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | `conflict` | |
| Current `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| New `image_ref` not parseable as semver tag | `image_ref_not_semver` | Pre-check; no inner ops fired. |
| Major / minor mismatch | `semver_patch_only` | Pre-check; no inner ops fired. |
| Inner stop / `docker rm` / inner start fails | inherits inner code | Same propagation as restart. |

### `cleanupcontainer`

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | `invalid_request` | |
| Lease busy / lease error | `conflict` / `service_unavailable` | |
| Record missing | `not_found` | |
| Status `removed` | success / `replay_no_op` | |
| Status `running` | `conflict` | Error message: "stop the runtime first". |
| Status `stopped` | proceed | |
| `docker.Remove` returns `ErrContainerNotFound` | success | Adapter swallows not-found into nil. |
| `docker.Remove` other error | `service_unavailable` | Record untouched; caller may retry. |
| `UpdateStatus` returns `ErrConflict` | success / `replay_no_op` | Race with reconciler dispose. |
| `UpdateStatus` returns `ErrNotFound` | `not_found` | |
| `UpdateStatus` other error | `internal_error` | |

## 18. REST handler conventions

The internal HTTP handlers under
[`../internal/api/internalhttp/handlers/`](../internal/api/internalhttp/handlers)
follow these rules:

- **`X-Galaxy-Caller` header.** The optional header carries the
  calling service identity (`gm` / `admin`); the handler records the
  value as `op_source` in `operation_log` (`gm_rest` / `admin_rest`).
  Missing or unknown values default to `admin_rest` because every
  audit-log query already filters on the cleanup endpoint
  (`op_source ∈ {auto_ttl, admin_rest}`); making the default match
  the most-restricted surface keeps existing dashboards correct when
  an unconfigured client hits the listener. The header is declared as
  a reusable parameter (`components.parameters.XGalaxyCallerHeader`)
  in the OpenAPI spec and is referenced from each runtime operation
  but not from `/healthz` and `/readyz`.
- **Error code → HTTP status mapping.** One canonical table in
  `handlers/common.go`:

  | ErrorCode | HTTP status |
  | --- | ---: |
  | (success, including `replay_no_op`) | 200 |
  | `invalid_request`, `start_config_invalid`, `image_ref_not_semver` | 400 |
  | `not_found` | 404 |
  | `conflict`, `semver_patch_only` | 409 |
  | `service_unavailable`, `docker_unavailable` | 503 |
  | `internal_error`, `image_pull_failed`, `container_start_failed` | 500 |

  `image_pull_failed` and `container_start_failed` are operational
  failures that originate inside RTM (registry / daemon problems),
  not client-side validation issues; they map to `500` so callers
  retry through their normal resilience paths instead of treating
  the call as a 4xx that must be fixed at the source.
  `docker_unavailable` is reserved for future producers; today the
  start service emits `service_unavailable` for Docker-daemon
  failures. Unknown error codes default to `500`.
- **List and Get bypass the service layer.** `internalListRuntimes`
  and `internalGetRuntime` read directly from
  `ports.RuntimeRecordStore`. Reads do not produce `operation_log`
  rows, do not change Docker state, do not need the per-game lease,
  and do not have a stream-side counterpart — none of the lifecycle
  service machinery is justified.
- **`RuntimeRecordStore.List(ctx)` returns every record regardless
  of status.** A single SELECT ordered by
  `(last_op_at DESC, game_id ASC)` — the same direction the
  `runtime_records_status_last_op_idx` index supports, so freshly
  active games surface first. Pagination is intentionally not
  modelled in v1; the working set is bounded by the games tracked
  by Lobby.
- **Per-handler service ports use `mockgen`.** The handler layer
  depends on five narrow interfaces — one per lifecycle service —
  declared in `handlers/services.go`. Production wiring passes the
  concrete `*<lifecycle>.Service` pointers (each satisfies the
  matching interface implicitly); tests pass the mockgen-generated
  mocks under `handlers/mocks/`.
- **Conformance test scope.** `internalhttp/conformance_test.go`
  drives every documented runtime operation against a real
  `internalhttp.Server` whose service deps are deterministic stubs.
  The test uses `kin-openapi/routers/legacy.NewRouter`, calls
  `openapi3filter.ValidateRequest` and
  `openapi3filter.ValidateResponse` so both directions match the
  contract. The scope is happy-path only; the failure-path response
  shapes are validated by the per-handler tests.

@@ -0,0 +1,412 @@
# Background Workers

This document explains the design of the seven background workers
under [`../internal/worker/`](../internal/worker):

- [`startjobsconsumer`](../internal/worker/startjobsconsumer) and
  [`stopjobsconsumer`](../internal/worker/stopjobsconsumer) — async
  consumers driven by `runtime:start_jobs` / `runtime:stop_jobs`;
- [`dockerevents`](../internal/worker/dockerevents) — Docker `/events`
  subscription;
- [`dockerinspect`](../internal/worker/dockerinspect) — periodic
  `InspectContainer` worker;
- [`healthprobe`](../internal/worker/healthprobe) — active HTTP
  `/healthz` probe;
- [`reconcile`](../internal/worker/reconcile) — startup + periodic
  drift reconciliation;
- [`containercleanup`](../internal/worker/containercleanup) —
  periodic TTL cleanup.

The current-state behaviour and configuration surface live in
[`../README.md`](../README.md) (§Runtime Surface, §Health Monitoring,
§Reconciliation), and operational notes are in
[`runtime.md`](runtime.md), [`flows.md`](flows.md), and
[`runbook.md`](runbook.md). This file records the rationale.

## 1. Single ownership per `event_type`

The `runtime:health_events` vocabulary is shared across four sources;
each event type is owned by exactly one of them.

| `event_type` | Owner |
| --- | --- |
| `container_started` | `internal/service/startruntime` |
| `container_exited` | `internal/worker/dockerevents` |
| `container_oom` | `internal/worker/dockerevents` |
| `container_disappeared` | `internal/worker/dockerevents` (external destroy) and `internal/worker/reconcile` (PG-drift) |
| `inspect_unhealthy` | `internal/worker/dockerinspect` |
| `probe_failed` | `internal/worker/healthprobe` |
| `probe_recovered` | `internal/worker/healthprobe` |

`container_started` is intentionally not duplicated by the events
listener, even though Docker emits a `start` action whenever the start
service runs the container. The start service already publishes the
event with the same wire shape; observing the action in the listener
would produce two entries per real start.

## 2. `container_disappeared` is conditional on PG state

The Docker events listener inspects the runtime record before emitting
`container_disappeared` for a `destroy` action. Three suppression rules
apply:

- record missing → suppress (the destroyed container was never owned
  by RTM as a tracked runtime, so no consumer cares);
- record `status != running` → suppress (RTM already finished a stop
  or cleanup; the destroy is the expected tail of that operation);
- record `current_container_id != event.ContainerID` → suppress (RTM
  swapped to a new container through restart or patch; the destroy is
  the expected removal of the prior container id).

Only a destroy that arrives for a `running` record whose
`current_container_id` still equals the event id is treated as
unexpected. This is the wire-side analogue of the reconciler's
PG-drift check: the reconciler observes "PG=running, no Docker
container" while the events listener observes "Docker says destroy,
PG still says running pointing at this container". Together they cover
both directions of drift.

A read failure against `runtime_records` is treated conservatively as
"suppress" — the listener cannot tell whether the destroy was external
or RTM-initiated, and over-emitting `container_disappeared` would lead
to a real consumer (`Game Master`) escalating a false positive.

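Condensed into a single decision helper, the three rules look roughly
like this (types and names are illustrative, not the listener's actual
code):

```go
package dockerevents

// Illustrative types — the real listener reads ports.RuntimeRecordStore
// and the runtime status enum.
type runtimeRecord struct {
	Status             string
	CurrentContainerID string
}

const statusRunning = "running"

// shouldEmitDisappeared applies the three suppression rules: only a
// destroy for a running record still pointing at the destroyed container
// is treated as unexpected.
func shouldEmitDisappeared(record *runtimeRecord, found bool, destroyedContainerID string) bool {
	if !found {
		return false // never tracked by RTM as a runtime
	}
	if record.Status != statusRunning {
		return false // expected tail of an RTM stop or cleanup
	}
	if record.CurrentContainerID != destroyedContainerID {
		return false // restart / patch already swapped to a new container
	}
	return true
}
```
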
## 3. `die` with exit code `0` is suppressed

`docker stop` (and graceful shutdowns via SIGTERM) produces a `die`
event with exit code `0`. The `container_exited` contract guarantees a
non-zero exit; emitting on exit `0` would shower consumers with
normal-stop noise. The listener silently drops the event; the
operation log already records the stop on the caller side.

## 4. Inspect worker leaves `container_disappeared` to the reconciler

When `dockerinspect` calls `InspectContainer` and the daemon returns
`ports.ErrContainerNotFound`, the worker logs at `Debug` and skips:

- the reconciler is the single authority for PG-drift reconciliation.
  Adding a third source for `container_disappeared` would risk double
  emission and complicate the consumer story;
- inspect ticks every 30 seconds; the reconciler ticks every 5
  minutes. The latency window for "Docker drops the container, RTM
  notices" is therefore at most 5 minutes in v1, which is acceptable
  for the kinds of drift the reconciler covers (manual `docker rm`
  outside RTM, daemon restart with stale records). If a future
  requirement tightens the window, promoting the inspect-side
  observation to a real `container_disappeared` is a one-line change.

## 5. Probe hysteresis is in-memory and pruned per tick

The active probe worker keeps per-game state in a
`map[string]*probeState` guarded by a mutex. Two counters live there:

- `consecutiveFailures` — incremented on every failed probe, reset on
  every success;
- `failurePublished` — prevents repeated `probe_failed` emission while
  the failure persists, and triggers a single `probe_recovered` on the
  first success after the threshold was crossed.

The state is non-persistent. RTM is single-instance in v1, and a
process restart that loses the counters merely re-establishes the
hysteresis from scratch — the only consequence is that a probe failure
already in progress at the moment of restart needs another full
threshold of failures to surface. Making the state durable would add a
Redis round-trip to every probe attempt without buying anything that
operators or downstream consumers depend on.

State pruning happens at the start of every tick. The worker reads the
current running list and removes any state entry whose `game_id` is
not in the list. A game that transitions through `stopped → running`
again starts fresh; previously-accumulated counters do not bleed into
the new lifecycle.

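Reduced to its core, the hysteresis state machine looks roughly like
this; pruning against the running list is omitted, and the names
follow the prose above rather than the worker's exact code:

```go
package healthprobe

import "sync"

type probeState struct {
	consecutiveFailures int
	failurePublished    bool
}

type tracker struct {
	mu        sync.Mutex
	threshold int // RTMANAGER_PROBE_FAILURES_THRESHOLD
	states    map[string]*probeState
}

// observe returns "probe_failed" when the threshold is crossed for the
// first time, "probe_recovered" on the first success afterwards, and
// "" otherwise.
func (t *tracker) observe(gameID string, healthy bool) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.states == nil {
		t.states = map[string]*probeState{}
	}
	state, ok := t.states[gameID]
	if !ok {
		state = &probeState{}
		t.states[gameID] = state
	}
	if healthy {
		state.consecutiveFailures = 0
		if state.failurePublished {
			state.failurePublished = false
			return "probe_recovered"
		}
		return ""
	}
	state.consecutiveFailures++
	if state.consecutiveFailures >= t.threshold && !state.failurePublished {
		state.failurePublished = true
		return "probe_failed"
	}
	return ""
}
```
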
## 6. Probe concurrency is bounded by a fixed cap

Probes inside one tick run in parallel through a buffered-channel
semaphore (`defaultMaxConcurrency = 16`). Three reasons:

- A single slow engine cannot delay the entire cohort. Sequential
  per-game probing would multiply the worst case by `len(records)`,
  which is the wrong shape for what is fundamentally a fan-out
  observation pattern.
- An unbounded fan-out (one goroutine per record per tick without a
  cap) was rejected to avoid pathological CPU and connection bursts
  if the running list ever grows beyond what RTM was sized for. 16
  in-flight probes at the default 2s timeout fit a single RTM
  instance well within typical OS file-descriptor and TCP
  ephemeral-port limits.
- The cap is a constant rather than an env var because RTM v1 is
  single-instance and the active-game count is bounded by Lobby; a
  configurable cap is something we promote to env if a real workload
  demands it.

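The buffered-channel semaphore is the standard Go idiom; a
self-contained sketch of the fan-out, with an illustrative probe
callback standing in for the worker's real probe:

```go
package healthprobe

import (
	"context"
	"sync"
)

const defaultMaxConcurrency = 16

// probeAll fans out one probe per running game, with at most sixteen
// probes in flight thanks to the buffered-channel semaphore.
func probeAll(ctx context.Context, gameIDs []string, probeOne func(context.Context, string)) {
	sem := make(chan struct{}, defaultMaxConcurrency)
	var wg sync.WaitGroup
	for _, id := range gameIDs {
		wg.Add(1)
		sem <- struct{}{} // blocks while the cap is reached
		go func(gameID string) {
			defer wg.Done()
			defer func() { <-sem }()
			probeOne(ctx, gameID)
		}(id)
	}
	wg.Wait()
}
```
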
The same reasoning argues against parallelism in the inspect worker:
inspect calls are cheap (sub-ms in the local Docker socket case) and
serial execution avoids unnecessary concurrency on the daemon socket.

## 7. Events listener reconnects with fixed backoff

The Docker daemon's events stream is a long-lived subscription; the
SDK channel terminates on any transport error (daemon restart, socket
hiccup, connection reset). The listener's outer loop handles this by
re-subscribing after a fixed `defaultReconnectBackoff = 5s` wait,
indefinitely while ctx is alive.

Crashing the process on a transport error was rejected because losing
a few seconds of health observations is a much smaller blast radius
than losing the entire RTM process while the start/stop pipelines are
running. The save-offset case is different: a lost offset replays the
entire backlog and breaks correctness, while a missed health event is
observation-only.

A subscription error is logged at `Warn` so operators can see the
reconnect activity without it dominating the log volume.

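The outer loop reduces to a subscribe / drain / back-off cycle. A
sketch with the subscription abstracted behind a function type so it
stays self-contained — the real worker goes through the Docker SDK:

```go
package dockerevents

import (
	"context"
	"time"
)

const defaultReconnectBackoff = 5 * time.Second

// subscribeFunc mirrors the shape of an events subscription: a message
// channel plus an error channel that fires on transport failure.
type subscribeFunc func(ctx context.Context) (<-chan string, <-chan error)

func runEventsLoop(ctx context.Context, subscribe subscribeFunc, handle func(string)) {
	for ctx.Err() == nil {
		msgs, errs := subscribe(ctx)
	drain:
		for {
			select {
			case <-ctx.Done():
				return
			case msg, ok := <-msgs:
				if !ok {
					break drain
				}
				handle(msg)
			case <-errs:
				// transport error: log at Warn, then fall through to the backoff
				break drain
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(defaultReconnectBackoff):
		}
	}
}
```
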
## 8. Health publisher remains best-effort

Every emission goes through `ports.HealthEventPublisher.Publish`, the
same surface the start service already uses
([`adapters.md`](adapters.md) §8). A publish failure logs at `Error`
and proceeds; the worker does not retry, does not adjust its in-memory
hysteresis, and does not surface the failure to the caller. The
operation log is the source of truth for runtime state; the event
stream is a best-effort notification surface to consumers.

## 9. Stream offset labels are stable identifiers

Both consumers persist their progress through
`ports.StreamOffsetStore` under fixed labels — `startjobs` for the
start-jobs consumer and `stopjobs` for the stop-jobs consumer. The
labels match `rtmanager:stream_offsets:{label}` and stay stable when
the underlying stream key is renamed via
`RTMANAGER_REDIS_START_JOBS_STREAM` /
`RTMANAGER_REDIS_STOP_JOBS_STREAM`, so an operator who points the
consumer at a different stream key does not lose the persisted offset.

## 10. `OpSource` and `SourceRef` originate at the consumer boundary

Every consumed envelope is translated into a `Service.Handle` call
with `OpSource = operation.OpSourceLobbyStream`. The opaque per-source
`SourceRef` is the Redis Stream entry id (`message.ID`); the
`operation_log` rows therefore record the originating envelope id, and
restart / patch correlation logic ([`services.md`](services.md) §13)
keeps working when those services are invoked indirectly.

## 11. Replay-no-op detection lives in the service layer

The consumer does not detect replays itself. `startruntime.Service`
returns `Outcome=success, ErrorCode=replay_no_op` when the existing
record is already `running` with the same `image_ref`;
`stopruntime.Service` does the same for an already-stopped or
already-removed record. The consumer copies the result fields into
the `RuntimeJobResult` payload verbatim and lets Lobby observe the
replay through `error_code`.

The wire-shape consequences:

- `success` + empty `error_code` → fresh start / fresh stop;
- `success` + `error_code=replay_no_op` → idempotent replay. For
  start, the existing record carries `container_id` and
  `engine_endpoint`; for stop on `status=removed`, both fields are
  empty strings (the record was nulled by an earlier cleanup) — the
  AsyncAPI contract permits empty strings on these required fields;
- `failure` + non-empty `error_code` → the start / stop service
  returned a zero `Record`; the consumer publishes empty
  `container_id` and `engine_endpoint`.

## 12. Per-message errors are absorbed; the offset always advances

The consumer run loop logs and absorbs any decode error, any Go-level
service error, and any publish failure; `streamOffsetStore.Save` runs
unconditionally after each handled message. Pinning the offset on a
single transient publish failure was rejected because the durable side
effect (operation_log row, runtime_records mutation, Docker state) has
already happened on the first pass; pinning the offset to retry the
publish would duplicate audit rows for hours until the operator
intervened.

The exception is `streamOffsetStore.Save` itself: a save failure
returns a wrapped error from `Run`. The component supervisor in
`internal/app/app.go` then exits the process and lets the operator
escalate, because losing the offset would cause every subsequent
restart to re-process every prior envelope.

## 13. `requested_at_ms` is logged-only

The AsyncAPI envelopes carry `requested_at_ms` from Lobby. The
consumer parses it (rejecting unparseable values) but only includes
the value in structured logs — the field is "used for diagnostics, not
authoritative" per the contract. The service layer ignores it; the
operation_log uses `service.clock()` for `started_at` / `finished_at`
so Lobby's wall-clock skew never bleeds into RTM persistence.

## 14. Reconciler: per-game lease around every write

A `running → removed` mutation that races a restart's inner stop
would clobber the restart's freshly-installed `running` record without
any other guard. The reconciler honours the same per-game lease that
the lifecycle services hold ([`services.md`](services.md) §1).

The reconciler splits its work into two phases:

- **Read pass — lockless.**
  `docker.List({com.galaxy.owner=rtmanager})` followed by
  `RuntimeRecords.ListByStatus(running)`. No lease is taken; both
  reads are point-in-time observations of independent systems and a
  stale view here only delays a mutation by one tick.
- **Write pass — lease-guarded.** Every drift mutation
  (`adoptOne` / `disposeOne` / `observedExitedOne`) acquires the
  per-game lease, re-reads the record under the lease, and then
  either applies the mutation or returns when state has changed.
  A lease conflict (`acquired=false`) is logged at `info` and the
  game is silently skipped — the next tick will retry. A lease-store
  error is logged at `warn`; the rest of the pass continues.

The re-read after lease acquisition is intentional: the read pass is
lockless, so by the time the lease is held the runtime record may
have moved. `UpdateStatus` already provides CAS via
`ExpectedFrom + ExpectedContainerID`, but `Upsert` (used for adopt)
does not, so the explicit re-read keeps the three paths uniform and
makes the skip condition obvious in code review.

## 15. Three drift kinds covered by the reconciler

- `adopt` — Docker reports a container labelled
  `com.galaxy.owner=rtmanager` for which RTM has no record; insert a
  fresh `runtime_records` row with `op_kind=reconcile_adopt` and never
  stop or remove the container (operators may have started it
  manually for diagnostics).
- `dispose` — RTM has a `running` record whose container is missing
  in Docker; mark `status=removed`, publish
  `container_disappeared`, append `op_kind=reconcile_dispose`.
- `observed_exited` — RTM has a `running` record whose container
  exists but is in `exited`; mark `status=stopped`, publish
  `container_exited` with the observed exit code. This third path
  exists because the events listener sees only live events; a
  container that died while RTM was offline would otherwise stay
  `running` indefinitely. The drift is exposed through
  `rtmanager.reconcile_drift{kind=observed_exited}` and through the
  `container_exited` health event; no `operation_log` entry is
  written because the audit log records explicit RTM operations, not
  passive observations of Docker state.

## 16. `stopped_at = now` (reconciler observation time)

The `observed_exited` path writes `stopped_at = now`, where `now` is
the reconciler's observation time. The persistence adapter
([`postgres-migration.md`](postgres-migration.md) §8) hard-codes
`stopped_at = now` for the `stopped` destination — there is no
port-level knob for an explicit timestamp, and the reconciler does not
read `State.FinishedAt` from Docker.

The trade-off: `stopped_at` diverges from the daemon's
`State.FinishedAt` by at most one tick interval (default 5 minutes).
If a downstream consumer ever needs the daemon-observed exit
timestamp, the upgrade path is a one-call extension of
`UpdateStatusInput` with an optional `StoppedAt *time.Time` field;
that change is deferred until a consumer materialises.

## 17. Synchronous initial pass + periodic Component

`README §Startup dependencies` step 6 demands "Reconciler runs once
and blocks until done" before background workers start, but
`app.App.Run` starts every registered `Component` concurrently —
component ordering does not translate into start ordering.

The reconciler exposes a public `ReconcileNow(ctx)` method that the
runtime calls synchronously between `newWiring` and `app.New`. The
same `*Reconciler` is then registered as a `Component`; its `Run`
only ticks (no immediate pass) so the startup work is not duplicated.
The cost is one public method on the worker; the benefit is that the
README invariant holds verbatim and the periodic loop is a textbook
`Component`.

## 18. Adopt through `Upsert`, race with start is benign

The adopt path constructs a fresh `runtime.RuntimeRecord` (status
running, container id and image_ref from labels, `started_at` from
`com.galaxy.started_at_ms` or inspect, state path and docker network
from configuration, engine endpoint from the
`http://galaxy-game-{game_id}:8080` rule) and calls
`RuntimeRecords.Upsert`.

Race scenario: the start service has called `docker.Run` but has not
yet finished its own `Upsert` when the reconciler observes the
container without a record. Both writers eventually arrive at PG with
the same key data — the start service knows the canonical
`image_ref`, but the reconciler reads it from the
`com.galaxy.engine_image_ref` label that the start service itself
wrote. The CAS-free overwrite is therefore benign:

- `created_at` is preserved across upserts by the
  `ON CONFLICT DO UPDATE` clause, so the "first time RTM saw this
  game" timestamp stays stable regardless of which writer lands last;
- all other fields in this race carry identical values (same
  container, same image, same hostname, same state path).

Under the per-game lease this is doubly safe: the reconciler only
issues `Upsert` while holding the lease, and only after re-reading
the record finds it absent. Concurrent start would block on the same
lease; concurrent stop / restart would have moved the record out of
"absent" by the time the reconciler re-reads.

## 19. Cleanup worker delegates to the service

The TTL-cleanup worker is intentionally tiny: it lists
`runtime_records.status='stopped'`, filters in process by
`record.LastOpAt.Before(now - cfg.Container.Retention)`, and calls
`cleanupcontainer.Service.Handle` with `OpSource=auto_ttl` for each
candidate. The service already owns:

- the per-game lease around the Docker `Remove` call;
- the `stopped → removed` CAS via `UpdateStatus`;
- the operation_log entry (`op_kind=cleanup_container`,
  `op_source=auto_ttl`);
- the telemetry counter and structured log fields.

In-memory filtering is acceptable in v1 because the cardinality of
`status=stopped` rows is bounded by Lobby's active-game count plus
retention period. The dedicated `(status, last_op_at)` index drives
the underlying `ListByStatus(stopped)` query so the database does
the heavy lifting; the Go-side filter is microseconds-per-row.

The worker uses a small `Cleaner` interface in its own package rather
than depending on `*cleanupcontainer.Service` directly. This keeps
the worker's tests light — no need to construct Docker, lease,
operation-log, and telemetry doubles just to verify TTL math — while
the production wiring still binds the real service via a compile-time
interface assertion in `internal/app/wiring.go`.

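A reduced sketch of the worker's tick — the `Cleaner` seam plus the
retention filter. The signature and field names are illustrative, not
the exact production declarations:

```go
package containercleanup

import (
	"context"
	"time"
)

// Cleaner is the narrow seam the worker depends on; production wiring
// binds *cleanupcontainer.Service behind it.
type Cleaner interface {
	Cleanup(ctx context.Context, gameID string, opSource string) error
}

type stoppedRecord struct {
	GameID   string
	LastOpAt time.Time
}

// cleanupTick applies the retention filter in process and delegates each
// expired game to the service, which owns lease, CAS, audit, and telemetry.
func cleanupTick(ctx context.Context, stopped []stoppedRecord, retention time.Duration, cleaner Cleaner) {
	cutoff := time.Now().Add(-retention)
	for _, record := range stopped {
		if !record.LastOpAt.Before(cutoff) {
			continue // still inside the retention window
		}
		if err := cleaner.Cleanup(ctx, record.GameID, "auto_ttl"); err != nil {
			continue // failure is absorbed; the next tick retries
		}
	}
}
```
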
## 20. Sequential per-game work in reconciler and cleanup

Both workers process games sequentially within a tick. The
reconciler's mutations are dominated by `Get` + `Upsert` /
`UpdateStatus` round-trips against PG plus an occasional Docker
`InspectContainer`; the cleanup worker's mutations are dominated by
the cleanup service's `docker.Remove` call. Parallelising either
would multiply the load on the Docker daemon socket and the PG pool
without buying anything that v1 cardinality demands.

## 21. Cross-module test boundary for the consumer integration test

[`../internal/worker/startjobsconsumer/integration_test.go`](../internal/worker/startjobsconsumer/integration_test.go)
covers the contract roundtrip without importing
`lobby/internal/...`:

- it XADDs a start envelope in the AsyncAPI wire shape (the same
  shape Lobby's `runtimemanager.Publisher` writes);
- it runs the real `startruntime.Service` against in-memory fakes for
  the persistence stores, the lease, and the notification / health
  publishers, plus a gomock-backed `ports.DockerClient`;
- it lets the real `jobresultspublisher.Publisher` write to
  `runtime:job_results`;
- it reads the resulting entry and asserts the symmetric wire shape;
- it then XADDs the same envelope a second time and asserts the
  `error_code=replay_no_op` outcome with no further Docker calls.

The cross-module integration that runs both the real Lobby publisher
and the real Lobby consumer alongside RTM lives at
`integration/lobbyrtm/`, which is the home for inter-service
fixtures. Keeping the in-package test free of `lobby/...` imports
avoids module-internal coupling and keeps `rtmanager`'s test suite
buildable on its own.