# Runtime Manager Implementation Plan

This plan has already been implemented and is kept here for historical reasons.
It should NOT be treated as the source of truth for service functionality.

## Summary

This plan delivers `Runtime Manager` (RTM), the only Galaxy service with direct Docker access. It owns container lifecycle (start, stop, restart, patch, cleanup), three-source health monitoring, and a synchronous internal REST surface used by `Game Master` and `Admin Service`. `Game Lobby` continues to drive RTM asynchronously through Redis Streams.

The plan also delivers the upstream changes that RTM depends on: a new `image_ref` field in the start envelope and a `reason` field in the stop envelope produced by Lobby; a `/healthz` endpoint, `Dockerfile`, and `STORAGE_PATH` / `GAME_STATE_PATH` contract on `galaxy/game`; new admin-only notification types in the catalog plus matching constructors in `galaxy/notificationintent`.

The architectural rules behind every decision are recorded in [`./README.md`](./README.md). This file describes the order in which the implementation lands.

## Global Rules

- Documentation always lands before contracts; contracts before code.
- Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
- Existing-service refactors (Lobby publisher, Notification catalog, Game engine) are full-fledged stages of this plan; they precede every RTM stage that depends on them.
- RTM never resolves engine versions. Producer supplies `image_ref`. RTM never deletes the host state directory. RTM never kills containers it does not own a record for.
- Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 04 onward.
- All code, docs, and identifiers are written in English.

## Suggested Module Structure

```text
rtmanager/
├── cmd/
│   ├── rtmanager/
│   │   └── main.go
│   └── jetgen/
│       └── main.go
│
├── internal/
│   ├── app/
│   │   ├── app.go
│   │   ├── runtime.go
│   │   ├── wiring.go
│   │   └── bootstrap.go
│   │
│   ├── config/
│   │   ├── config.go
│   │   ├── env.go
│   │   └── validation.go
│   │
│   ├── logging/
│   │   ├── logger.go
│   │   └── context.go
│   │
│   ├── telemetry/
│   │   └── runtime.go
│   │
│   ├── domain/
│   │   ├── runtime/
│   │   │   ├── model.go
│   │   │   └── transitions.go
│   │   ├── operation/
│   │   │   └── log.go
│   │   └── health/
│   │       └── snapshot.go
│   │
│   ├── ports/
│   │   ├── runtimerecordstore.go
│   │   ├── operationlogstore.go
│   │   ├── healthsnapshotstore.go
│   │   ├── streamoffsetstore.go
│   │   ├── dockerclient.go
│   │   ├── lobbyinternal.go
│   │   └── notificationintents.go
│   │
│   ├── adapters/
│   │   ├── postgres/
│   │   │   ├── migrations/
│   │   │   ├── jet/
│   │   │   ├── runtimerecordstore/
│   │   │   ├── operationlogstore/
│   │   │   └── healthsnapshotstore/
│   │   ├── redisstate/
│   │   │   └── streamoffsets/
│   │   ├── docker/
│   │   │   ├── client.go
│   │   │   └── mocks/
│   │   ├── lobbyclient/
│   │   ├── notificationpublisher/
│   │   ├── jobresultspublisher/
│   │   └── healtheventspublisher/
│   │
│   ├── service/
│   │   ├── startruntime/
│   │   ├── stopruntime/
│   │   ├── restartruntime/
│   │   ├── patchruntime/
│   │   └── cleanupcontainer/
│   │
│   ├── worker/
│   │   ├── startjobsconsumer/
│   │   ├── stopjobsconsumer/
│   │   ├── dockerevents/
│   │   ├── healthprobe/
│   │   ├── dockerinspect/
│   │   ├── reconcile/
│   │   └── containercleanup/
│   │
│   └── api/
│       └── internalhttp/
│           ├── server.go
│           └── handlers/
│
├── api/
│   ├── internal-openapi.yaml
│   ├── runtime-jobs-asyncapi.yaml
│   └── runtime-health-asyncapi.yaml
│
├── integration/
│   ├── harness/
│   ├── lifecycle_test.go
│   ├── replay_test.go
│   ├── health_test.go
│   └── notification_test.go
│
├── docs/
│   ├── README.md
│   ├── runtime.md
│   ├── flows.md
│   ├── runbook.md
│   ├── examples.md
│   └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod
```

## ~~Stage 01.~~ Update `ARCHITECTURE.md`

Status: implemented.


Goal:

- align the project-wide source of truth with every decision recorded in [`./README.md`](./README.md) before any code change touches it.

Tasks:

- Expand `ARCHITECTURE.md` §9 (Runtime Manager) with subsections: container model (`galaxy-game-{game_id}` DNS naming, bind-mount ABI, network prerequisite), image policy (producer-supplied `image_ref`), state ownership rule (RTM never deletes the host state directory), reconcile policy (adopt unrecorded containers, never kill them).
- Update §«Fixed asynchronous interactions»: note the `image_ref` field on `Lobby → RTM`, add the `runtime:health_events` outbound stream, add `Runtime Manager → Notification Service` for admin alerts.
- Update §«Fixed synchronous interactions»: add `Game Master → Runtime Manager` and `Admin Service → Runtime Manager` for REST inspect / restart / patch / stop / cleanup, and remove the corresponding async entries.
- Update §«Persistence Backends»: add `rtmanager` schema to the schema-per-service list and to PG-backed services.
- Update §«Configuration»: add `RTMANAGER` to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
- Update §«Recommended Order of Service Implementation» entry 7 with the now-fixed scope (start, stop, restart, patch, inspect, health monitoring).

Files touched:

- `ARCHITECTURE.md`.

Exit criteria:

- every later RTM, Lobby, Notification, or Game stage can quote its rules from `ARCHITECTURE.md` without re-deciding them.

## ~~Stage 02.~~ Freeze RTM `README.md`

Status: implemented as part of this planning task; see [`./README.md`](./README.md).

Goal:

- publish the complete service description so contracts and code can reference one source.


Tasks:

- Write `rtmanager/README.md` covering Purpose, Scope, Non-Goals, Position in the System, Responsibility Boundaries, Container Model, Runtime Surface, Lifecycles, Health Monitoring, Reconciliation, Trusted Surfaces, Async Stream Contracts, Notification Contracts, Persistence Layout, Error Model, Configuration, Observability, Verification.

Exit criteria:

- a reviewer can answer any «what does RTM do when X» question by reading the README alone.

## ~~Stage 03.~~ Sync existing-service docs (Lobby, Notification, Game)

Status: implemented.

Goal:

- bring the READMEs of every touched service into agreement with the RTM contract before any code in those services changes.

Tasks:

- `lobby/README.md`: update Game Start Flow; the start envelope is now `{game_id, image_ref, requested_at_ms}`. Add `LOBBY_ENGINE_IMAGE_TEMPLATE` to the Configuration section. Document the new stop envelope `reason` enum (`orphan_cleanup | cancelled | finished | admin_request | timeout`). Note that the Lobby ↔ RTM transport stays asynchronous indefinitely.
- `lobby/PLAN.md`: append a single closing note that runtime-job envelope changes belong to the Runtime Manager plan; no new stages added there.
- `notification/README.md`: add three admin notification types to the catalog (`runtime.image_pull_failed`, `runtime.container_start_failed`, `runtime.start_config_invalid`), each `email`-only with audience admin in v1.
- `notification/PLAN.md`: append a closing note pointing at the Runtime Manager plan for the catalog extension.
- `game/README.md` (create if absent): document the new `/healthz` endpoint, the `STORAGE_PATH` / `GAME_STATE_PATH` env contract, and the new `Dockerfile` location.

Files touched:

- `lobby/README.md`, `lobby/PLAN.md`, `notification/README.md`, `notification/PLAN.md`, `game/README.md`.

Exit criteria:

- every doc in the repo agrees on the post-RTM contract; no contradiction remains between any two READMEs.

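
The two envelopes named in this stage can be pictured as plain Go types. This is a minimal sketch, not the Lobby or RTM code: the names `StartJob`, `StopJob`, `ParseStopReason`, and `Fields` are hypothetical, and the flat string-map wire shape is an assumption made for illustration. Only the field names and the `reason` values come from the text above.

```go
package main

import (
	"fmt"
	"strconv"
)

// StopReason mirrors the stop-envelope `reason` enum listed above.
type StopReason string

const (
	StopReasonOrphanCleanup StopReason = "orphan_cleanup"
	StopReasonCancelled     StopReason = "cancelled"
	StopReasonFinished      StopReason = "finished"
	StopReasonAdminRequest  StopReason = "admin_request"
	StopReasonTimeout       StopReason = "timeout"
)

// ParseStopReason (hypothetical helper) rejects values outside the enum.
func ParseStopReason(raw string) (StopReason, error) {
	switch r := StopReason(raw); r {
	case StopReasonOrphanCleanup, StopReasonCancelled, StopReasonFinished,
		StopReasonAdminRequest, StopReasonTimeout:
		return r, nil
	}
	return "", fmt.Errorf("unknown stop reason %q", raw)
}

// StartJob carries the start envelope {game_id, image_ref, requested_at_ms}.
type StartJob struct {
	GameID        string
	ImageRef      string
	RequestedAtMs int64
}

// StopJob carries the stop envelope {game_id, reason, requested_at_ms}.
type StopJob struct {
	GameID        string
	Reason        StopReason
	RequestedAtMs int64
}

// Fields flattens the start envelope into string fields.
// Assumption: the stream entry is a flat field/value map.
func (j StartJob) Fields() map[string]string {
	return map[string]string{
		"game_id":         j.GameID,
		"image_ref":       j.ImageRef,
		"requested_at_ms": strconv.FormatInt(j.RequestedAtMs, 10),
	}
}

func main() {
	job := StartJob{GameID: "g-1", ImageRef: "galaxy/game:1.4.2", RequestedAtMs: 1700000000000}
	fmt.Println(job.Fields()["image_ref"])

	if _, err := ParseStopReason("bogus"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Keeping the enum as a typed string means an unknown `reason` is rejected at decode time, not deep inside the stop path.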

## ~~Stage 04.~~ RTM contract files and contract tests

Status: implemented.

Goal:

- ship machine-readable contracts before any RTM handler is written, so the implementation has a target spec.

Tasks:

- `rtmanager/api/internal-openapi.yaml`: every internal REST endpoint with request and response schemas; error envelope `{ "error": { "code", "message" } }` identical to Lobby. Operation IDs: `internalListRuntimes`, `internalGetRuntime`, `internalStartRuntime`, `internalStopRuntime`, `internalRestartRuntime`, `internalPatchRuntime`, `internalCleanupRuntimeContainer`, `internalHealthz`, `internalReadyz`.
- `rtmanager/api/runtime-jobs-asyncapi.yaml`: AsyncAPI 2.6.0 spec for `runtime:start_jobs`, `runtime:stop_jobs`, `runtime:job_results`. Frozen field set per message.
- `rtmanager/api/runtime-health-asyncapi.yaml`: AsyncAPI 2.6.0 spec for `runtime:health_events` with the `event_type` enum and `details` polymorphic schema (`oneOf` per type).
- `rtmanager/contract_openapi_test.go` and `rtmanager/contract_asyncapi_test.go`: load specs via `kin-openapi` (and the AsyncAPI loader pattern from `notification/contract_asyncapi_test.go`), assert operation IDs / message names / field presence.

Files new:

- the five files above.

Exit criteria:

- all three specs validate; contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.

## ~~Stage 05.~~ Game engine `/healthz`, `Dockerfile`, `STORAGE_PATH`

Status: implemented.

Goal:

- make `galaxy/game` runnable as the test engine image RTM uses in integration tests.

Tasks:

- Add `GET /healthz` to `game/internal/router` returning `{"status":"ok"}` (200) when the engine process is up, irrespective of whether a game has been initialised. The existing `/api/v1/status` keeps its current `501` behaviour for an uninitialised engine.
- Make the engine read its storage path from the `STORAGE_PATH` env, falling back to `GAME_STATE_PATH` when set. Both names are accepted; `GAME_STATE_PATH` is the contract RTM writes.
- Update `game/cmd/http/main.go` to bind the env.
- Add `galaxy/game/Dockerfile`: multi-stage (golang builder + small runtime base). Exposes `:8080`. Default `STORAGE_PATH=/var/lib/galaxy-game`. Copies the binary. Runs as non-root user.
- Add image labels to the `Dockerfile`: `com.galaxy.cpu_quota=1.0`, `com.galaxy.memory=512m`, `com.galaxy.pids_limit=512`, `org.opencontainers.image.title=galaxy-game-engine`.
- Update `game/openapi.yaml` to document `/healthz`.
- Update `game/openapi_contract_test.go` to assert `/healthz` presence.

Files new:

- `galaxy/game/Dockerfile`.

Files touched:

- `galaxy/game/internal/router/*.go`, `galaxy/game/cmd/http/main.go`, `galaxy/game/openapi.yaml`, `galaxy/game/openapi_contract_test.go`.

Exit criteria:

- `docker build -t galaxy/game:test -f game/Dockerfile .` (run from the workspace root) succeeds. The build context is the workspace root because `game/` resolves `galaxy/{model,error,util,...}` through `go.work` `replace` directives; see `rtmanager/docs/game-dockerfile-build-context.md`.
- `docker run --rm -e STORAGE_PATH=/tmp/x -p 8080:8080 galaxy/game:test` answers `/healthz` with `200`.
- `go test ./game/...` passes.

## ~~Stage 06.~~ Lobby publisher refactor

Status: implemented.

Goal:

- ship the new `runtime:start_jobs` and `runtime:stop_jobs` envelopes from Lobby. After this stage Lobby is RTM-ready; the real RTM appears in Stage 13 onwards.

Tasks:

- Add `LOBBY_ENGINE_IMAGE_TEMPLATE` (default `galaxy/game:{engine_version}`) and validation to `lobby/internal/config/config.go` and `env.go`.
- Build `lobby/internal/domain/engineimage/resolver.go` that turns `(template, target_engine_version)` into `image_ref`, validating both inputs. Reject templates without `{engine_version}`; reject empty engine versions.
- `lobby/internal/ports/runtimemanager.go`: change interface to `PublishStartJob(ctx, gameID, imageRef string) error` and `PublishStopJob(ctx, gameID string, reason StopReason) error` with a `StopReason` enum (`orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`) declared in the same package.
- `lobby/internal/adapters/runtimemanager/publisher.go`: write the new fields into the `XADD` payload.
- Update callers:
  - `lobby/internal/service/startgame/`: resolve `image_ref` from the loaded game record, pass to `PublishStartJob`.
  - `lobby/internal/worker/runtimejobresult/consumer.go`: pass `reason=orphan_cleanup` to `PublishStopJob` from the orphan-container path.
- Update Lobby unit tests (publisher, services) and contract tests (if Lobby has any describing the runtime envelopes; otherwise add `TestPublisherStartJobIncludesImageRef` and `TestPublisherStopJobIncludesReason`).

Files new:

- `lobby/internal/domain/engineimage/resolver.go` and its test file.

Files touched:

- the Lobby files listed above.

Exit criteria:

- `go test ./lobby/...` passes.
- An `XADD` against the start stream contains the `image_ref` field; an `XADD` against the stop stream contains the `reason` field.

## ~~Stage 07.~~ Notification intent constructors and catalog extension

Status: implemented.

Goal:

- expose three admin-only notification types so RTM (Stage 13 onwards) can publish them without later cross-cutting refactors.

Tasks:

- Add constructors and payload structs to `galaxy/notificationintent/`:
  - `NewRuntimeImagePullFailedIntent(meta, payload)`,
  - `NewRuntimeContainerStartFailedIntent(meta, payload)`,
  - `NewRuntimeStartConfigInvalidIntent(meta, payload)`.

  Each payload includes `game_id`, `image_ref`, `error_code`, `error_message`, `attempted_at_ms`.
- Extend `notification/api/intents-asyncapi.yaml` with the three new payload schemas and add them to the catalog.
- Extend the notification routing tables (data only, no service code) so the existing routing rules cover the new types: delivery decision `email`-only, audience admin.
- Extend `notification/contract_asyncapi_test.go` to freeze the new message names and payload required fields.

Files touched:

- `galaxy/notificationintent/*.go`,
- `notification/api/intents-asyncapi.yaml`,
- notification catalog data tables (locations defined inside `notification/internal/...`),
- `notification/contract_asyncapi_test.go`.

Exit criteria:

- unit tests for the new constructors pass.
- AsyncAPI validates.
- Notification's existing integration suites still pass with the new types added.

## ~~Stage 08.~~ RTM module skeleton

Status: implemented.

Goal:

- create a buildable `rtmanager` binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.

Tasks:

- `rtmanager/cmd/rtmanager/main.go` mirroring `lobby/cmd/lobby/main.go`.
- `rtmanager/internal/config/{config.go, env.go, validation.go}` with env prefix `RTMANAGER` and groups Listener, Docker, Postgres, Redis, Streams, Container defaults, Health, Cleanup, Coordination, Lobby internal client, Logging, Lifecycle, Telemetry. Required variables fail-fast.
- `rtmanager/internal/logging/{logger.go, context.go}` copied from lobby/notification.
- `rtmanager/internal/telemetry/runtime.go` registering the metrics named in `README.md §Observability`.
- `rtmanager/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go}`: empty wiring with PostgreSQL open, Redis open, Docker client open (ping only), telemetry open, probe listener open.
- `rtmanager/internal/api/internalhttp/server.go`: listener with `/healthz` and `/readyz` only.
- `rtmanager/Makefile` with the `jet` target (real generation lands in Stage 09).
- `rtmanager/go.mod` and `go.sum` with dependencies: `github.com/docker/docker`, `github.com/redis/go-redis/v9`, `github.com/jackc/pgx/v5`, `github.com/go-jet/jet/v2`, `github.com/pressly/goose/v3`, `github.com/stretchr/testify`, the testcontainers modules for postgres / redis / docker, and the OpenTelemetry stack identical to lobby.
- Update repo-level `go.work` to include `./rtmanager`.

Files new:

- the entire skeleton tree.

Exit criteria:

- `go build ./rtmanager/cmd/rtmanager` succeeds.
- Running with valid env brings `/healthz` and `/readyz` up.
- `SIGTERM` returns within `RTMANAGER_SHUTDOWN_TIMEOUT`.

## ~~Stage 09.~~ PostgreSQL schema, migrations, jet

Status: implemented.

Goal:

- finalise the persistence schema and the code-generation pipeline.

Tasks:

- `internal/adapters/postgres/migrations/00001_init.sql`: `CREATE SCHEMA IF NOT EXISTS rtmanager;` plus the three tables and indexes from `README.md §Persistence Layout`.
- `internal/adapters/postgres/migrations/migrations.go`: `//go:embed *.sql` and `FS()` exporter, identical pattern to lobby.
- `cmd/jetgen/main.go`: testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors `lobby/cmd/jetgen/main.go`.
- Generated `internal/adapters/postgres/jet/...` committed to the repo.
- Wire goose migrations into `internal/app/runtime.go` startup so they apply before any listener opens; non-zero exit on failure (matches `pkg/postgres` policy).

Files new:

- as above.

Exit criteria:

- `make -C rtmanager jet` regenerates the jet code with no diff after a clean run.
- Service start applies migrations to a fresh database and exits zero if migrations are already applied.

## ~~Stage 10.~~ Domain layer and ports

Status: implemented.

Goal:

- lock the in-memory domain model and the port interfaces for adapters.

Tasks:

- `internal/domain/runtime/model.go`: `RuntimeRecord` struct, status enum (`StatusRunning`, `StatusStopped`, `StatusRemoved`), error sentinels.
- `internal/domain/runtime/transitions.go`: allowed transitions table and a CAS-friendly validator.
- `internal/domain/operation/log.go`: `OpKind`, `OpSource`, `Outcome` enums plus the `OperationEntry` struct.
- `internal/domain/health/snapshot.go`: `HealthEventType` enum, `HealthSnapshot` struct.
- `internal/ports/`:
  - `runtimerecordstore.go`: `Get`, `Upsert`, `UpdateStatus` (CAS by `current_container_id`), `ListByStatus`.
  - `operationlogstore.go`: `Append`, `ListByGame`.
  - `healthsnapshotstore.go`: `Upsert`, `Get`.
  - `streamoffsetstore.go`: `Load`, `Save` (Redis offset persistence per consumer label).
  - `dockerclient.go`: the narrow surface RTM uses (`EnsureNetwork`, `PullImage`, `Inspect`, `Run`, `Stop`, `Remove`, `List`, `EventsListen`). `Logs` is reserved; not in v1.
  - `lobbyinternal.go`: `GetGame(ctx, gameID) (LobbyGameRecord, error)`.
  - `notificationintents.go`: `Publish(ctx, intent) error`.

Files new:

- as above.

Exit criteria:

- the package compiles.
- every interface has a `_ ports.X = (*Y)(nil)` assertion slot ready for the adapters that follow.

## ~~Stage 11.~~ Persistence adapters

Status: implemented. Decision record: [`docs/stage11-persistence-adapters.md`](docs/stage11-persistence-adapters.md).

Goal:

- implement the three PostgreSQL stores and the Redis offset store.

Tasks:

- `internal/adapters/postgres/runtimerecordstore/store.go` using jet.
- `internal/adapters/postgres/operationlogstore/store.go`.
- `internal/adapters/postgres/healthsnapshotstore/store.go`.
- `internal/adapters/redisstate/streamoffsets/store.go` (mirror Lobby's `redisstate/streamoffsets`).
- For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on `runtime_records.UpdateStatus` are verified by an explicit concurrent-update test (only one of two callers wins).

Files new:

- as above and per-package `_test.go`.

Exit criteria:

- store tests pass on a CI runner with Docker available.


## ~~Stage 12.~~ Docker adapter and external clients

Status: implemented. Decision record: [`docs/stage12-docker-and-clients.md`](docs/stage12-docker-and-clients.md).

Goal:

- ship the Docker SDK adapter and the external HTTP clients for Lobby internal API and notification publishing.

Tasks:

- `internal/adapters/docker/client.go`: implements `ports.DockerClient` over `github.com/docker/docker/client`. Behaviour:
  - `EnsureNetwork` validates the configured network's presence (no creation).
  - `PullImage` honours the configured pull policy.
  - `Inspect` returns image and container metadata in domain-friendly shape.
  - `Run` builds the create + start sequence with labels, env (`GAME_STATE_PATH`, `STORAGE_PATH`), bind mount, log driver, resource limits read from image labels with config fallback.
  - `Stop` calls `ContainerStop` with the configured timeout.
  - `Remove` calls `ContainerRemove`.
  - `List` filters by `label=com.galaxy.owner=rtmanager`.
  - `EventsListen` returns a typed channel of decoded events.
- `internal/adapters/docker/mocks/`: `mockgen`-generated mock for `ports.DockerClient`, used by service tests.
- `internal/adapters/lobbyclient/client.go`: REST client over an `otelhttp`-wrapped `http.Client` for `GET /api/v1/internal/games/{game_id}`. Returns `LobbyGameRecord`.
- `internal/adapters/notificationpublisher/publisher.go`: wraps `galaxy/notificationintent` plus `redis.XAdd` against `notification:intents`.
- Per-adapter unit tests with mocks. A small testcontainers Docker smoke test guarded by build tag `rtmanager_docker_smoke` until Stage 19 promotes it to default.

Files new:

- as above.

Exit criteria:

- mocks regenerate cleanly via `go generate`.
- unit tests pass.
- the smoke test passes on a runner with Docker available.

## ~~Stage 13.~~ Service: start

Status: implemented. Decision record: [`docs/stage13-start-service.md`](docs/stage13-start-service.md).


Goal:

- end-to-end `start` operation in the service layer, callable from both the async consumer and the REST handler in later stages.

Tasks:

- `internal/service/startruntime/service.go` orchestrator:
  1. Acquire game-id lease (Redis).
  2. Read `runtime_records`. If `running` with same `image_ref`, return idempotent success with `error_code=replay_no_op`.
  3. Optionally fetch `LobbyGameRecord` for ancillary fields; in v1 only `image_ref` is required, so this fetch is a no-op except for diagnostics.
  4. Pull image (per policy), inspect labels for resource limits.
  5. Ensure the per-game state directory exists with the configured mode and ownership.
  6. `docker run` with the configured network, hostname, labels, env, bind mount, log driver, resource limits.
  7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`, `current_image_ref`, `started_at`, `last_op_at`).
  8. Append `operation_log` entry (`op_kind=start`, `outcome=success`, `op_source` from caller).
  9. Publish `runtime:health_events` `container_started`.
  10. Return success outcome to caller (consumer publishes `job_result`, REST returns 200).
- Failure paths in the table from `README.md §Lifecycles → Start`. Each failure path:
  - rolls back any partially created Docker resource;
  - publishes the matching admin-only notification intent;
  - records `operation_log` with `outcome=failure` and the stable error code;
  - returns failure to the caller.
- Unit tests cover happy path, idempotent re-start, each failure mode, lease conflict, and partial-rollback paths.

Files new:

- `service/startruntime/{service.go, service_test.go, errors.go}`.

Exit criteria:

- service-level tests pass.

## ~~Stage 14.~~ Service: stop, restart, patch, cleanup

Status: implemented. Decision record: [`docs/stage14-stop-restart-patch-cleanup.md`](docs/stage14-stop-restart-patch-cleanup.md).

Goal:

- the remaining four lifecycle operations, sharing helpers with `start`.


Tasks:

- `internal/service/stopruntime/service.go`: graceful `docker stop` with timeout, record `stopped` state. Idempotent re-stop returns success no-op.
- `internal/service/restartruntime/service.go`: orchestrate `stopruntime` then `startruntime` with the current `image_ref`. Same Redis lease shared across both inner operations. Records a single `operation_log` entry with `op_kind=restart` plus a correlation id linking it to the implicit start/stop entries.
- `internal/service/patchruntime/service.go`: restart with a new `image_ref`. Validates the semver-patch-only rule (major and minor must equal current version; otherwise return `semver_patch_only` failure). If the engine version is not parseable as semver, return `image_ref_not_semver`.
- `internal/service/cleanupcontainer/service.go`: `docker rm` for an already-stopped container; refuses if `status=running`. Sets `runtime_records.status=removed`.
- The Redis lease covers each operation end-to-end; restart and patch hold the lease across the inner stop+start to prevent races.
- Unit tests for each service. Cross-operation race tests assert that concurrent start vs. stop on the same `game_id` either succeed in some order or both observe the lease and one returns conflict.

Files new:

- `service/{stopruntime, restartruntime, patchruntime, cleanupcontainer}/...`.

Exit criteria:

- service-level tests pass.

## ~~Stage 15.~~ Async consumers and `runtime:job_results` publisher

Status: implemented. Decision record: [`docs/stage15-async-consumers.md`](docs/stage15-async-consumers.md).

Goal:

- wire the Lobby-side stream contract into the freshly built service layer.

Tasks:

- `internal/worker/startjobsconsumer/consumer.go`: XREAD over `runtime:start_jobs`, decodes envelope `{game_id, image_ref, requested_at_ms}`, calls `startruntime` service, publishes `runtime:job_results` with the canonical schema, advances the Redis offset. Mirrors patterns from `lobby/internal/worker/runtimejobresult/consumer.go`.
- `internal/worker/stopjobsconsumer/consumer.go`: XREAD over `runtime:stop_jobs`, decodes `{game_id, reason, requested_at_ms}`, calls `stopruntime`.
- `internal/adapters/jobresultspublisher/publisher.go`: small XADD wrapper for `runtime:job_results`.
- Replay safety: deterministic «already running» / «already stopped» idempotent outcomes surface as `outcome=success` with `error_code=replay_no_op`.
- Tests use `miniredis` and a fake `ports.DockerClient`. A consumer integration test drives a full Lobby → RTM → Lobby roundtrip end-to-end.

Files new:

- as above + tests.

Exit criteria:

- consumer integration test passes.

## ~~Stage 16.~~ Internal REST handlers

Status: implemented. Decision record: [`docs/stage16-internal-rest-handlers.md`](docs/stage16-internal-rest-handlers.md).

Goal:

- ship the GM/Admin-facing REST surface backed by the service layer.

Tasks:

- `internal/api/internalhttp/handlers/{list, get, start, stop, restart, patch, cleanup}.go`: one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields rejected with `invalid_request`.
- Error envelope identical to lobby: `{ "error": { "code", "message" } }`. Stable codes: `invalid_request`, `not_found`, `conflict`, `service_unavailable`, `internal_error`, `image_ref_not_semver`, `semver_patch_only`, `image_pull_failed`, `container_start_failed`, `start_config_invalid`, `docker_unavailable`.
- Wiring under the existing internal HTTP listener; route registration in `internal/app/wiring.go`.
- Handler-level table-driven tests; OpenAPI conformance test that loads `api/internal-openapi.yaml` and asserts every defined operation is reachable and matches its declared response.

Files new:

- handlers + tests.

Exit criteria:

- OpenAPI conformance test passes for every endpoint.
- Handlers reject unknown JSON fields.

## ~~Stage 17.~~ Health monitoring

Status: implemented. Decision record: [`docs/stage17-health-monitoring.md`](docs/stage17-health-monitoring.md).


Goal:

- observability of running containers via the three sources from `README.md §Health Monitoring`.

Tasks:

- `internal/worker/dockerevents/listener.go`: subscribes to Docker events with the `com.galaxy.owner=rtmanager` label filter, looks up `runtime_records` by labels, emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. `container_started` is emitted directly by the start service (Stage 13) when it runs the container.
- `internal/worker/healthprobe/worker.go`: periodic worker iterating `runtime_records.status=running`. Calls `GET {engine_endpoint}/healthz` with the configured timeout, applies the `RTMANAGER_PROBE_FAILURES_THRESHOLD` hysteresis, emits `probe_failed` / `probe_recovered`. Uses `otelhttp` client.
- `internal/worker/dockerinspect/worker.go`: periodic full inspect; emits `inspect_unhealthy` on observed `RestartCount` growth or unexpected status.
- `internal/adapters/healtheventspublisher/publisher.go`: XADD wrapper for `runtime:health_events`. Always also upserts the latest snapshot into `health_snapshots`.

Files new:

- as above + tests.

Exit criteria:

- worker tests use a Docker mock that programmatically emits events and asserts the published stream entries match the AsyncAPI spec.

## ~~Stage 18.~~ Reconciler and container cleanup

Status: implemented. Decision record: [`docs/stage18-reconcile-and-cleanup.md`](docs/stage18-reconcile-and-cleanup.md).

Goal:

- drift management and TTL-based cleanup.

Tasks:

- `internal/worker/reconcile/reconciler.go`: runs at startup (blocking before workers start) and periodically (`RTMANAGER_RECONCILE_INTERVAL`). Implements the rules from `README.md §Reconciliation`:
  - record running containers without a PG record, never kill them (`op_kind=reconcile_adopt`);
  - mark `runtime_records.status=running` rows whose container is missing as `removed`, publish `container_disappeared` (`op_kind=reconcile_dispose`).
- `internal/worker/containercleanup/worker.go`: periodic worker (`RTMANAGER_CLEANUP_INTERVAL`) that lists `runtime_records` with `status=stopped` and `last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS`, calls `cleanupcontainer` service for each.
- Both workers are registered as `app.Component`s in `internal/app/wiring.go`.

Files new:

- as above + tests.

Exit criteria:

- reconciler test using mocked Docker proves both adopt and dispose paths.
- cleanup test proves TTL math with a fake clock.

## ~~Stage 19.~~ Service-local integration suite

Status: implemented. Decision record: [`docs/stage19-integration.md`](docs/stage19-integration.md).

Goal:

- end-to-end suite running against testcontainers PostgreSQL + Redis + the real Docker daemon, using the freshly-built `galaxy/game` test image.

Tasks:

- `rtmanager/integration/harness/`: set up PostgreSQL with goose-applied migrations; Redis (miniredis is sufficient for stream-only suites; testcontainers Redis for coordination suites that exercise leases); ensure the Docker bridge network exists; build `galaxy/game` test image once per package run with `sync.Once`; tear everything down via `t.Cleanup`.
- `rtmanager/integration/lifecycle_test.go`: start → inspect → stop → restart → patch → cleanup against the real engine; assert each step's PG, Redis-stream, and Docker side-effects. Engine state directories are created via `t.ArtifactDir()`.
- `rtmanager/integration/replay_test.go`: duplicate start/stop messages are no-ops with `error_code=replay_no_op`.
- `rtmanager/integration/health_test.go`: kill the engine container externally; assert `container_disappeared` event publishes within timeout. Bring it back with a manual `docker run`; assert the reconciler adopts it.
- `rtmanager/integration/notification_test.go`: drive a start with an unresolvable image ref; assert RTM publishes the `runtime.image_pull_failed` notification intent and a `failure` job_result.

Files new:

- as above.


Exit criteria:

- `go test ./rtmanager/integration/...` passes locally with Docker available.
- CI runs the suite under a profile that exposes the Docker socket.

## ~~Stage 20.~~ Inter-service test: Lobby ↔ RTM

Status: implemented. Decision record: [`docs/stage20-lobbyrtm.md`](docs/stage20-lobbyrtm.md).

Goal:

- satisfy the `TESTING.md §7` inter-service requirement with real Lobby + real RTM.

Tasks:

- `integration/lobbyrtm/` (top-level integration directory, mirroring existing `integration/notificationgateway`, etc.): runs real Lobby, real RTM, real PostgreSQL, real Redis, and the `galaxy/game` test engine container.
- Scenarios:
  - Lobby creates a game, publishes a start_job with `image_ref`, RTM starts the engine, publishes `job_result`, Lobby transitions the game to `running`. The engine answers `/healthz`.
  - Lobby transitions a game to `cancelled`, publishes `stop_job` with `reason=cancelled`, RTM stops the engine. RTM `operation_log` records the transition.
  - Failure path: `image_ref` points at a missing image. RTM publishes a `failure` `job_result` and the matching notification intent. Lobby transitions the game to `start_failed`.

Files new:

- as above.

Exit criteria:

- all scenarios pass in CI when the Docker socket is available.

## ~~Stage 21.~~ Service-local docs

Status: implemented.

Goal:

- drop per-stage decisions captured during this plan into discoverable service-local documentation, mirroring `lobby/docs/`.

Tasks:

- `docs/README.md`: index pointing at the four content docs and the postgres-migration record.
- `docs/runtime.md`: components, processes, in-memory state of each worker.
- `docs/flows.md`: mermaid diagrams for start happy path, start failure (image pull), start failure (orphan), stop, restart, patch, cleanup TTL, reconcile drift adopt, health probe hysteresis.
- `docs/runbook.md`: operator scenarios «engine container died», «patch upgrade», «manual cleanup», «reconcile drift after Docker daemon restart», «testing locally».
- `docs/examples.md` — env-var examples per environment (dev / test / prod
  skeletons), example payloads for each stream and each REST endpoint.
- `docs/postgres-migration.md` — decision record for the schema (mirrors
  `notification/docs/postgres-migration.md` style).

Files new:

- all six.

Exit criteria:

- the README of RTM links to `docs/README.md`.
- a reviewer can find any operational how-to within two clicks.

## ~~Stage 22.~~ Migrate hand-rolled stubs to `mockgen`

Status: implemented. Decision record:
[`docs/stage22-stub-migration.md`](docs/stage22-stub-migration.md).

Goal:

- unify the test-double style across the repository on the `mockgen`
  pipeline introduced for the RTM Docker port in Stage 12. Today every
  Galaxy service except RTM hand-rolls `*stub` packages; mixing styles
  raises onboarding cost and makes port-signature drift easier to miss.

Tasks (high-level only — each package gets its own decision when this
stage is opened):

- Replace the stubs under `lobby/internal/adapters/` with
  `mockgen`-generated mocks.
  Affected packages today (one per port):
  [`runtimemanagerstub`](../lobby/internal/adapters/runtimemanagerstub),
  [`intentpubstub`](../lobby/internal/adapters/intentpubstub),
  [`gmclientstub`](../lobby/internal/adapters/gmclientstub),
  [`userservicestub`](../lobby/internal/adapters/userservicestub),
  [`gameturnstatsstub`](../lobby/internal/adapters/gameturnstatsstub),
  [`streamoffsetstub`](../lobby/internal/adapters/streamoffsetstub),
  [`membershipstub`](../lobby/internal/adapters/membershipstub),
  [`evaluationguardstub`](../lobby/internal/adapters/evaluationguardstub),
  [`streamlagprobestub`](../lobby/internal/adapters/streamlagprobestub),
  [`userlifecyclestub`](../lobby/internal/adapters/userlifecyclestub),
  [`invitestub`](../lobby/internal/adapters/invitestub),
  [`racenamestub`](../lobby/internal/adapters/racenamestub),
  [`gapactivationstub`](../lobby/internal/adapters/gapactivationstub),
  [`gamestub`](../lobby/internal/adapters/gamestub),
  [`applicationstub`](../lobby/internal/adapters/applicationstub).
- Add `//go:generate mockgen ...` directives next to each port declaration
  under [`lobby/internal/ports/`](../lobby/internal/ports) and a `mocks`
  target to `lobby/Makefile`, mirroring the
  [`rtmanager/Makefile`](./Makefile) shape.
- Audit the rest of the workspace for similar hand-rolls before touching
  Lobby. Not every `*stub`-style package is in scope:
  - [`mail/internal/adapters/stubprovider`](../mail/internal/adapters/stubprovider)
    is a production/local-mode provider, not a test fixture — keep it.
  - [`authsession/internal/adapters/contracttest`](../authsession/internal/adapters/contracttest)
    is a port-conformance suite, not a stub — keep it.
  - [`authsession/internal/adapters/local`](../authsession/internal/adapters/local)
    is local-mode runtime — keep it.
- Documentation sweep — these documents reference the hand-rolled
  convention and must be updated alongside the code:
  - [`rtmanager/docs/stage12-docker-and-clients.md §1`](./docs/stage12-docker-and-clients.md)
    currently frames `mockgen` as a one-time deviation; rephrase as the
    repo-wide convention.
  - [`lobby/docs/`](../lobby/docs/) — any decision record that named a
    `*stub` package by path needs the new `mocks/` target referenced in its
    place.
  - Top-level [`AGENTS.md`](../AGENTS.md) and any service-level
    `CLAUDE.md` / `README.md` touching test conventions.
- Cross-cutting test impact: each stub today often carries hand-curated
  helper methods (e.g. seeded fixtures, deterministic ID generators) that
  pure `mockgen` mocks do not provide. Where a stub is more than a
  method-table, the migration extracts the helper into a small test-data
  builder and keeps the mock as the port surface.

Files new:

- one `mocks/` directory under each affected adapter group, plus a
  `lobby/Makefile` `mocks` target (and equivalents for any other service
  the audit identifies).

Files touched:

- every `*stub` package listed above plus its consumers.
- `lobby/Makefile`, `lobby/internal/ports/*.go` (for `//go:generate`
  directives).
- the documentation listed above.

Exit criteria:

- `*stub` packages are gone from `lobby/internal/adapters/` and the
  `mocks/` packages compile against the current ports.
- `make -C lobby mocks` regenerates with no diff after a clean run.
- `go test ./lobby/...` is green.
- Documentation across `rtmanager/docs/`, `lobby/docs/`, top-level
  `AGENTS.md`, and any affected `README.md` references the unified
  convention.

## Final Acceptance Criteria

- `go build ./...` from the repository root succeeds.
- `go test ./...` from the repository root passes.
- `go test -tags=integration ./rtmanager/integration/...` passes when
  Docker is available.
- `go test ./integration/lobbyrtm/...` passes when Docker is available.
- `make -C rtmanager jet` regenerates jet code with no diff after a clean
  run.
- Manual smoke: bring Lobby + RTM + the rest of the stack up via the
  existing dev compose; create a game; observe a real
  `galaxy-game-{game_id}` container;
  `curl http://galaxy-game-{game_id}:8080/healthz` returns `200`; stop the
  game; the container moves to `exited`; the admin cleanup endpoint removes
  it.
- Documentation across `ARCHITECTURE.md`, `lobby`, `notification`, `game`,
  and `rtmanager` is internally consistent.

## Out of Scope

- Multi-instance Runtime Manager with Redis Streams consumer groups
  (`XREADGROUP` / `XCLAIM`).
- Engine version registry inside `Game Master`. Producer-supplied
  `image_ref` decouples this work from RTM.
- TLS / mTLS on the internal listener.
- Engine in-place upgrades driven by an engine API. Patch is always
  recreate.
- Backup, archival, or cleanup of host state directories.
- Kubernetes, Docker Swarm, or any non-Docker orchestrator.
- Consumption of `runtime:health_events` by Game Master, Game Lobby, or
  Notification Service. Those are next-stage concerns of those services.

## Risks and Notes

- CI must expose a Docker socket (or run a rootless equivalent) to execute
  the integration suites. Without Docker the integration tests are skipped
  through a build-tag guard.
- The `reason` enum on `runtime:stop_jobs` is fixed in this plan
  (`{orphan_cleanup, cancelled, finished, admin_request, timeout}`). Adding
  a new value requires a contract bump in `runtime-jobs-asyncapi.yaml` and
  a Lobby publisher change. Keep the enum small.
- Lobby's existing `runtimejobresult` worker only reacts to start outcomes
  today. Stop outcomes are observable in RTM `operation_log` but Lobby does
  not yet update game status from them. Adding a stop-result consumer to
  Lobby is a future Lobby stage and is explicitly out of scope here.
- Pre-launch single-init policy applies to RTM exactly as documented in
  `ARCHITECTURE.md §Persistence Backends`: the schema evolves by editing
  `00001_init.sql` until the first production deploy.
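
As a rough illustration of the fixed `reason` enum on `runtime:stop_jobs` noted above, a consumer could reject unknown values at the boundary. This is a sketch under assumptions: the `StopReason` type, the constant names, and `ParseStopReason` are hypothetical; only the five string values come from this plan.

```go
package main

import "fmt"

// StopReason mirrors the enum fixed in this plan for runtime:stop_jobs.
// The Go type and constant names are illustrative, not the service's API.
type StopReason string

const (
	StopReasonOrphanCleanup StopReason = "orphan_cleanup"
	StopReasonCancelled     StopReason = "cancelled"
	StopReasonFinished      StopReason = "finished"
	StopReasonAdminRequest  StopReason = "admin_request"
	StopReasonTimeout       StopReason = "timeout"
)

// ParseStopReason accepts only the five values listed in the plan and
// returns an error for anything else, so a new value cannot slip in
// without a matching code change.
func ParseStopReason(raw string) (StopReason, error) {
	switch r := StopReason(raw); r {
	case StopReasonOrphanCleanup,
		StopReasonCancelled,
		StopReasonFinished,
		StopReasonAdminRequest,
		StopReasonTimeout:
		return r, nil
	default:
		return "", fmt.Errorf("unknown stop reason %q", raw)
	}
}

func main() {
	// "paused" is a made-up value used to show the rejection path.
	for _, raw := range []string{"cancelled", "paused"} {
		reason, err := ParseStopReason(raw)
		fmt.Printf("%q -> reason=%q err=%v\n", raw, reason, err)
	}
}
```

Keeping the accepted values in a single `switch` means a contract bump in `runtime-jobs-asyncapi.yaml` shows up as a small, reviewable diff on the consumer side as well.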