Runtime Manager Implementation Plan

This plan has already been implemented and remains here for historical reasons.

It should NOT be treated as a source of truth for service functionality.

Summary

This plan delivers Runtime Manager (RTM), the only Galaxy service with direct Docker access. It owns container lifecycle (start, stop, restart, patch, cleanup), three-source health monitoring, and a synchronous internal REST surface used by Game Master and Admin Service. Game Lobby continues to drive RTM asynchronously through Redis Streams.

The plan also delivers the upstream changes that RTM depends on: a new image_ref field in the start envelope and a reason field in the stop envelope produced by Lobby; a /healthz endpoint, Dockerfile, and STORAGE_PATH / GAME_STATE_PATH contract on galaxy/game; new admin-only notification types in the catalog plus matching constructors in galaxy/notificationintent.
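
For reference, the two Lobby → RTM envelopes as they stand after this plan, written out as Go structs. The field names are the ones fixed by the plan; the struct names themselves are illustrative.

```go
// Illustrative shapes of the runtime:start_jobs and runtime:stop_jobs
// payloads; field names follow the plan, struct names are hypothetical.
package contracts

type StartJob struct {
	GameID        string `json:"game_id"`
	ImageRef      string `json:"image_ref"` // producer-supplied, e.g. galaxy/game:1.4.2
	RequestedAtMs int64  `json:"requested_at_ms"`
}

type StopJob struct {
	GameID        string `json:"game_id"`
	Reason        string `json:"reason"` // orphan_cleanup | cancelled | finished | admin_request | timeout
	RequestedAtMs int64  `json:"requested_at_ms"`
}
```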

The architectural rules behind every decision are recorded in ./README.md. This file describes the order in which the implementation lands.

Global Rules

  • Documentation always lands before contracts; contracts before code.
  • Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
  • Existing-service refactors (Lobby publisher, Notification catalog, Game engine) are full-fledged stages of this plan; they precede every RTM stage that depends on them.
  • RTM never resolves engine versions. Producer supplies image_ref. RTM never deletes the host state directory. RTM never kills containers it does not own a record for.
  • Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 04 onward.
  • All code, docs, and identifiers are written in English.

Suggested Module Structure

rtmanager/
├── cmd/
│   ├── rtmanager/
│   │   └── main.go
│   └── jetgen/
│       └── main.go
│
├── internal/
│   ├── app/
│   │   ├── app.go
│   │   ├── runtime.go
│   │   ├── wiring.go
│   │   └── bootstrap.go
│   │
│   ├── config/
│   │   ├── config.go
│   │   ├── env.go
│   │   └── validation.go
│   │
│   ├── logging/
│   │   ├── logger.go
│   │   └── context.go
│   │
│   ├── telemetry/
│   │   └── runtime.go
│   │
│   ├── domain/
│   │   ├── runtime/
│   │   │   ├── model.go
│   │   │   └── transitions.go
│   │   ├── operation/
│   │   │   └── log.go
│   │   └── health/
│   │       └── snapshot.go
│   │
│   ├── ports/
│   │   ├── runtimerecordstore.go
│   │   ├── operationlogstore.go
│   │   ├── healthsnapshotstore.go
│   │   ├── streamoffsetstore.go
│   │   ├── dockerclient.go
│   │   ├── lobbyinternal.go
│   │   └── notificationintents.go
│   │
│   ├── adapters/
│   │   ├── postgres/
│   │   │   ├── migrations/
│   │   │   ├── jet/
│   │   │   ├── runtimerecordstore/
│   │   │   ├── operationlogstore/
│   │   │   └── healthsnapshotstore/
│   │   ├── redisstate/
│   │   │   └── streamoffsets/
│   │   ├── docker/
│   │   │   ├── client.go
│   │   │   └── mocks/
│   │   ├── lobbyclient/
│   │   ├── notificationpublisher/
│   │   ├── jobresultspublisher/
│   │   └── healtheventspublisher/
│   │
│   ├── service/
│   │   ├── startruntime/
│   │   ├── stopruntime/
│   │   ├── restartruntime/
│   │   ├── patchruntime/
│   │   └── cleanupcontainer/
│   │
│   ├── worker/
│   │   ├── startjobsconsumer/
│   │   ├── stopjobsconsumer/
│   │   ├── dockerevents/
│   │   ├── healthprobe/
│   │   ├── dockerinspect/
│   │   ├── reconcile/
│   │   └── containercleanup/
│   │
│   └── api/
│       └── internalhttp/
│           ├── server.go
│           └── handlers/
│
├── api/
│   ├── internal-openapi.yaml
│   ├── runtime-jobs-asyncapi.yaml
│   └── runtime-health-asyncapi.yaml
│
├── integration/
│   ├── harness/
│   ├── lifecycle_test.go
│   ├── replay_test.go
│   ├── health_test.go
│   └── notification_test.go
│
├── docs/
│   ├── README.md
│   ├── runtime.md
│   ├── flows.md
│   ├── runbook.md
│   ├── examples.md
│   └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod

Stage 01. Update ARCHITECTURE.md

Status: implemented.

Goal:

  • align the project-wide source of truth with every decision recorded in ./README.md before any code change touches it.

Tasks:

  • Expand ARCHITECTURE.md §9 (Runtime Manager) with subsections: container model (galaxy-game-{game_id} DNS naming, bind-mount ABI, network prerequisite), image policy (producer-supplied image_ref), state ownership rule (RTM never deletes the host state directory), reconcile policy (adopt unrecorded containers, never kill them).
  • Update §«Fixed asynchronous interactions»: note the image_ref field on Lobby → RTM, add the runtime:health_events outbound stream, add Runtime Manager → Notification Service for admin alerts.
  • Update §«Fixed synchronous interactions»: add Game Master → Runtime Manager and Admin Service → Runtime Manager for REST inspect / restart / patch / stop / cleanup, and remove the corresponding async entries.
  • Update §«Persistence Backends»: add rtmanager schema to the schema-per-service list and to PG-backed services.
  • Update §«Configuration»: add RTMANAGER to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
  • Update §«Recommended Order of Service Implementation» entry 7 with the now-fixed scope (start, stop, restart, patch, inspect, health monitoring).

Files touched:

  • ARCHITECTURE.md.

Exit criteria:

  • every later RTM, Lobby, Notification, or Game stage can quote its rules from ARCHITECTURE.md without re-deciding them.

Stage 02. Freeze RTM README.md

Status: implemented as part of this planning task — see ./README.md.

Goal:

  • publish the complete service description so contracts and code can reference one source.

Tasks:

  • Write rtmanager/README.md covering Purpose, Scope, Non-Goals, Position in the System, Responsibility Boundaries, Container Model, Runtime Surface, Lifecycles, Health Monitoring, Reconciliation, Trusted Surfaces, Async Stream Contracts, Notification Contracts, Persistence Layout, Error Model, Configuration, Observability, Verification.

Exit criteria:

  • a reviewer can answer any «what does RTM do when X» question by reading the README alone.

Stage 03. Sync existing-service docs (Lobby, Notification, Game)

Status: implemented.

Goal:

  • bring the READMEs of every touched service into agreement with the RTM contract before any code in those services changes.

Tasks:

  • lobby/README.md: update Game Start Flow — start envelope is now {game_id, image_ref, requested_at_ms}. Add LOBBY_ENGINE_IMAGE_TEMPLATE to the Configuration section. Document the new stop envelope reason enum (orphan_cleanup | cancelled | finished | admin_request | timeout). Note that the Lobby ↔ RTM transport stays asynchronous indefinitely.
  • lobby/PLAN.md: append a single closing note that runtime-job envelope changes belong to the Runtime Manager plan; no new stages added there.
  • notification/README.md: add three admin notification types to the catalog (runtime.image_pull_failed, runtime.container_start_failed, runtime.start_config_invalid), each email-only with audience admin in v1.
  • notification/PLAN.md: append a closing note pointing at the Runtime Manager plan for the catalog extension.
  • game/README.md (create if absent): document the new /healthz endpoint, the STORAGE_PATH / GAME_STATE_PATH env contract, and the new Dockerfile location.

Files touched:

  • lobby/README.md, lobby/PLAN.md, notification/README.md, notification/PLAN.md, game/README.md.

Exit criteria:

  • every doc in the repo agrees on the post-RTM contract; no contradiction remains between any two READMEs.

Stage 04. RTM contract files and contract tests

Status: implemented.

Goal:

  • ship machine-readable contracts before any RTM handler is written, so the implementation has a target spec.

Tasks:

  • rtmanager/api/internal-openapi.yaml: every internal REST endpoint with request and response schemas; error envelope { "error": { "code", "message" } } identical to Lobby. Operation IDs: internalListRuntimes, internalGetRuntime, internalStartRuntime, internalStopRuntime, internalRestartRuntime, internalPatchRuntime, internalCleanupRuntimeContainer, internalHealthz, internalReadyz.
  • rtmanager/api/runtime-jobs-asyncapi.yaml: AsyncAPI 2.6.0 spec for runtime:start_jobs, runtime:stop_jobs, runtime:job_results. Frozen field set per-message.
  • rtmanager/api/runtime-health-asyncapi.yaml: AsyncAPI 2.6.0 spec for runtime:health_events with the event_type enum and details polymorphic schema (oneOf per type).
  • rtmanager/contract_openapi_test.go and rtmanager/contract_asyncapi_test.go: load specs via kin-openapi (and the AsyncAPI loader pattern from notification/contract_asyncapi_test.go), assert operation IDs / message names / field presence.
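
A minimal sketch of the operation-ID freeze described in the last task, assuming a recent github.com/getkin/kin-openapi (the Paths accessor gained a Map() method in later versions; older ones expose a plain map):

```go
// contract_openapi_test.go: sketch of the operation-ID freeze.
package rtmanager_test

import (
	"testing"

	"github.com/getkin/kin-openapi/openapi3"
)

func TestInternalOpenAPIOperationIDs(t *testing.T) {
	loader := openapi3.NewLoader()
	doc, err := loader.LoadFromFile("api/internal-openapi.yaml")
	if err != nil {
		t.Fatalf("load spec: %v", err)
	}
	if err := doc.Validate(loader.Context); err != nil {
		t.Fatalf("validate spec: %v", err)
	}

	// Collect every operationId declared in the spec.
	found := map[string]bool{}
	for _, item := range doc.Paths.Map() {
		for _, op := range item.Operations() {
			found[op.OperationID] = true
		}
	}

	// The frozen set: removing any ID from the spec fails this test.
	for _, id := range []string{
		"internalListRuntimes", "internalGetRuntime", "internalStartRuntime",
		"internalStopRuntime", "internalRestartRuntime", "internalPatchRuntime",
		"internalCleanupRuntimeContainer", "internalHealthz", "internalReadyz",
	} {
		if !found[id] {
			t.Errorf("operation ID %q missing from spec", id)
		}
	}
}
```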

Files new:

  • the four files above.

Exit criteria:

  • all three specs validate; contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.

Stage 05. Game engine /healthz, Dockerfile, STORAGE_PATH

Status: implemented.

Goal:

  • make galaxy/game runnable as the test engine image RTM uses in integration tests.

Tasks:

  • Add GET /healthz to game/internal/router returning {"status":"ok"} (200) when the engine process is up, irrespective of whether a game has been initialised. The existing /api/v1/status keeps its current 501 behaviour for an uninitialised engine.
  • Make the engine read its storage path from the STORAGE_PATH env var, falling back to GAME_STATE_PATH when set. Both names are accepted; GAME_STATE_PATH is the contract RTM writes.
  • Update game/cmd/http/main.go to bind the env.
  • Add galaxy/game/Dockerfile: multi-stage (golang builder + small runtime base). Exposes :8080. Default STORAGE_PATH=/var/lib/galaxy-game. Copies the binary. Runs as non-root user.
  • Add image labels to the Dockerfile: com.galaxy.cpu_quota=1.0, com.galaxy.memory=512m, com.galaxy.pids_limit=512, org.opencontainers.image.title=galaxy-game-engine.
  • Update game/openapi.yaml to document /healthz.
  • Update game/openapi_contract_test.go to assert /healthz presence.
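
A minimal sketch of the two engine-side behaviours described above, using only the standard library; the router wiring around these functions is illustrative:

```go
package router

import (
	"net/http"
	"os"
)

// Healthz answers 200 whenever the engine process is up, even before
// any game has been initialised.
func Healthz(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte(`{"status":"ok"}`))
}

// StoragePath prefers STORAGE_PATH and falls back to GAME_STATE_PATH,
// the name RTM writes into the container environment.
func StoragePath() string {
	if p := os.Getenv("STORAGE_PATH"); p != "" {
		return p
	}
	return os.Getenv("GAME_STATE_PATH")
}
```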

Files new:

  • galaxy/game/Dockerfile.

Files touched:

  • galaxy/game/internal/router/*.go, galaxy/game/cmd/http/main.go, galaxy/game/openapi.yaml, galaxy/game/openapi_contract_test.go.

Exit criteria:

  • docker build -t galaxy/game:test -f game/Dockerfile . (run from the workspace root) succeeds. The build context is the workspace root because game/ resolves galaxy/{model,error,util,...} through go.work replace directives; see rtmanager/docs/game-dockerfile-build-context.md.
  • docker run --rm -e STORAGE_PATH=/tmp/x -p 8080:8080 galaxy/game:test answers /healthz with 200.
  • go test ./game/... passes.

Stage 06. Lobby publisher refactor

Status: implemented.

Goal:

  • ship the new runtime:start_jobs and runtime:stop_jobs envelopes from Lobby. After this stage Lobby is RTM-ready; the real RTM arrives from Stage 13 onward.

Tasks:

  • Add LOBBY_ENGINE_IMAGE_TEMPLATE (default galaxy/game:{engine_version}) and validation to lobby/internal/config/config.go and env.go.
  • Build lobby/internal/domain/engineimage/resolver.go that turns (template, target_engine_version) into image_ref, validating both inputs. Reject templates without {engine_version}; reject empty engine versions.
  • lobby/internal/ports/runtimemanager.go: change interface to PublishStartJob(ctx, gameID, imageRef string) error and PublishStopJob(ctx, gameID string, reason StopReason) error with a StopReason enum (orphan_cleanup, cancelled, finished, admin_request, timeout) declared in the same package.
  • lobby/internal/adapters/runtimemanager/publisher.go: write the new fields into the XADD payload.
  • Update callers:
    • lobby/internal/service/startgame/: resolve image_ref from the loaded game record, pass to PublishStartJob.
    • lobby/internal/worker/runtimejobresult/consumer.go: pass reason=orphan_cleanup to PublishStopJob from the orphan-container path.
  • Update Lobby unit tests (publisher, services) and contract tests (if Lobby has any describing the runtime envelopes; otherwise add TestPublisherStartJobIncludesImageRef and TestPublisherStopJobIncludesReason).
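
A sketch of the resolver rule from the second task; the package and function names follow the plan, the exact validation messages are assumptions:

```go
// engineimage/resolver.go: turns (template, target_engine_version)
// into an image_ref, rejecting malformed inputs.
package engineimage

import (
	"errors"
	"strings"
)

const placeholder = "{engine_version}"

func Resolve(template, engineVersion string) (string, error) {
	if !strings.Contains(template, placeholder) {
		return "", errors.New("template must contain {engine_version}")
	}
	if engineVersion == "" {
		return "", errors.New("engine version must not be empty")
	}
	return strings.ReplaceAll(template, placeholder, engineVersion), nil
}
```

With the default template, Resolve("galaxy/game:{engine_version}", "1.4.2") yields galaxy/game:1.4.2.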

Files new:

  • lobby/internal/domain/engineimage/resolver.go and its test file.

Files touched:

  • the Lobby files listed above.

Exit criteria:

  • go test ./lobby/... passes.
  • An XADD against the start stream contains the image_ref field; an XADD against the stop stream contains the reason field.

Stage 07. Notification intent constructors and catalog extension

Status: implemented.

Goal:

  • expose three admin-only notification types so RTM (Stage 13 onwards) can publish them without later cross-cutting refactors.

Tasks:

  • Add constructors and payload structs to galaxy/notificationintent/:
    • NewRuntimeImagePullFailedIntent(meta, payload),
    • NewRuntimeContainerStartFailedIntent(meta, payload),
    • NewRuntimeStartConfigInvalidIntent(meta, payload). Each payload includes game_id, image_ref, error_code, error_message, attempted_at_ms.
  • Extend notification/api/intents-asyncapi.yaml with the three new payload schemas and add them to the catalog.
  • Extend the notification routing tables (data only — no service code) so the existing routing rules cover the new types: delivery decision email-only, audience admin.
  • Extend notification/contract_asyncapi_test.go to freeze the new message names and payload required fields.
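
The shape of one new constructor, following the existing galaxy/notificationintent pattern. The payload fields are fixed by this plan; Meta and Intent stand in for whatever the real package already defines:

```go
package notificationintent

// Meta and Intent are stand-ins for the package's real types.
type Meta map[string]string

type Intent struct {
	Type    string
	Meta    Meta
	Payload any
}

type RuntimeImagePullFailedPayload struct {
	GameID        string `json:"game_id"`
	ImageRef      string `json:"image_ref"`
	ErrorCode     string `json:"error_code"`
	ErrorMessage  string `json:"error_message"`
	AttemptedAtMs int64  `json:"attempted_at_ms"`
}

func NewRuntimeImagePullFailedIntent(meta Meta, p RuntimeImagePullFailedPayload) Intent {
	return Intent{Type: "runtime.image_pull_failed", Meta: meta, Payload: p}
}
```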

Files touched:

  • galaxy/notificationintent/*.go,
  • notification/api/intents-asyncapi.yaml,
  • notification catalog data tables (locations defined inside notification/internal/...),
  • notification/contract_asyncapi_test.go.

Exit criteria:

  • unit tests for the new constructors pass.
  • AsyncAPI validates.
  • Notification's existing integration suites still pass with the new types added.

Stage 08. RTM module skeleton

Status: implemented.

Goal:

  • create a buildable rtmanager binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.

Tasks:

  • rtmanager/cmd/rtmanager/main.go mirroring lobby/cmd/lobby/main.go.
  • rtmanager/internal/config/{config.go, env.go, validation.go} with env prefix RTMANAGER and groups Listener, Docker, Postgres, Redis, Streams, Container defaults, Health, Cleanup, Coordination, Lobby internal client, Logging, Lifecycle, Telemetry. Required variables fail-fast.
  • rtmanager/internal/logging/{logger.go, context.go} copied from lobby/notification.
  • rtmanager/internal/telemetry/runtime.go registering the metrics named in README.md §Observability.
  • rtmanager/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go} — empty wiring with PostgreSQL open, Redis open, Docker client open (ping only), telemetry open, probe listener open.
  • rtmanager/internal/api/internalhttp/server.go — listener with /healthz and /readyz only.
  • rtmanager/Makefile with the jet target (real generation lands in Stage 09).
  • rtmanager/go.mod and go.sum with dependencies: github.com/docker/docker, github.com/redis/go-redis/v9, github.com/jackc/pgx/v5, github.com/go-jet/jet/v2, github.com/pressly/goose/v3, github.com/stretchr/testify, the testcontainers modules for postgres / redis / docker, and the OpenTelemetry stack identical to lobby.
  • Update repo-level go.work to include ./rtmanager.
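
A sketch of the fail-fast env loading described above. The RTMANAGER prefix is the plan's rule; the specific variable names and the config groups shown are illustrative:

```go
package config

import (
	"fmt"
	"os"
)

func requireEnv(name string) (string, error) {
	v, ok := os.LookupEnv(name)
	if !ok || v == "" {
		return "", fmt.Errorf("required env %s is missing", name)
	}
	return v, nil
}

type Config struct {
	ListenAddr  string
	PostgresDSN string
	RedisAddr   string
}

// Load fails fast on the first missing required variable.
func Load() (Config, error) {
	var c Config
	var err error
	if c.ListenAddr, err = requireEnv("RTMANAGER_LISTEN_ADDR"); err != nil {
		return c, err
	}
	if c.PostgresDSN, err = requireEnv("RTMANAGER_POSTGRES_DSN"); err != nil {
		return c, err
	}
	if c.RedisAddr, err = requireEnv("RTMANAGER_REDIS_ADDR"); err != nil {
		return c, err
	}
	return c, nil
}
```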

Files new:

  • the entire skeleton tree.

Exit criteria:

  • go build ./rtmanager/cmd/rtmanager succeeds.
  • Running with valid env brings /healthz and /readyz up.
  • SIGTERM returns within RTMANAGER_SHUTDOWN_TIMEOUT.

Stage 09. PostgreSQL schema, migrations, jet

Status: implemented.

Goal:

  • finalise the persistence schema and the code-generation pipeline.

Tasks:

  • internal/adapters/postgres/migrations/00001_init.sql — CREATE SCHEMA IF NOT EXISTS rtmanager; plus the three tables and indexes from README.md §Persistence Layout.
  • internal/adapters/postgres/migrations/migrations.go — //go:embed *.sql and FS() exporter, identical pattern to lobby.
  • cmd/jetgen/main.go — testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors lobby/cmd/jetgen/main.go.
  • Generated internal/adapters/postgres/jet/... committed to the repo.
  • Wire goose migrations into internal/app/runtime.go startup so they apply before any listener opens; non-zero exit on failure (matches pkg/postgres policy).
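
The embed/exporter pattern referenced above is small enough to sketch in full:

```go
package migrations

import "embed"

//go:embed *.sql
var files embed.FS

// FS exposes the embedded migration files so startup code can hand
// them to goose before any listener opens.
func FS() embed.FS { return files }
```

At startup the app can point goose at this filesystem (goose.SetBaseFS in pressly/goose/v3) before running the migrations.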

Files new:

  • as above.

Exit criteria:

  • make -C rtmanager jet regenerates the jet code with no diff after a clean run.
  • Service start applies migrations to a fresh database and exits zero if migrations are already applied.

Stage 10. Domain layer and ports

Status: implemented.

Goal:

  • lock the in-memory domain model and the port interfaces for adapters.

Tasks:

  • internal/domain/runtime/model.go — RuntimeRecord struct, status enum (StatusRunning, StatusStopped, StatusRemoved), error sentinels.
  • internal/domain/runtime/transitions.go — allowed transitions table and a CAS-friendly validator.
  • internal/domain/operation/log.go — OpKind, OpSource, Outcome enums plus the OperationEntry struct.
  • internal/domain/health/snapshot.go — HealthEventType enum, HealthSnapshot struct.
  • internal/ports/:
    • runtimerecordstore.go — Get, Upsert, UpdateStatus (CAS by current_container_id), ListByStatus.
    • operationlogstore.go — Append, ListByGame.
    • healthsnapshotstore.go — Upsert, Get.
    • streamoffsetstore.go — Load, Save (Redis offset persistence per consumer label).
    • dockerclient.go — narrow surface RTM uses: EnsureNetwork, PullImage, Inspect, Run, Stop, Remove, List, EventsListen. (Logs reserved; not in v1.)
    • lobbyinternal.go — GetGame(ctx, gameID) (LobbyGameRecord, error).
    • notificationintents.go — Publish(ctx, intent) error.
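
One port from the list above, sketched with a stand-in record type; the exact method signatures (in particular the CAS return shape) are assumptions until the real file lands:

```go
package ports

import "context"

// RuntimeRecord is a stand-in for the Stage 10 domain struct.
type RuntimeRecord struct {
	GameID             string
	Status             string
	CurrentContainerID string
	CurrentImageRef    string
}

type RuntimeRecordStore interface {
	Get(ctx context.Context, gameID string) (RuntimeRecord, error)
	Upsert(ctx context.Context, rec RuntimeRecord) error
	// UpdateStatus applies a CAS keyed on current_container_id: the
	// update lands only if the stored container id still matches.
	UpdateStatus(ctx context.Context, gameID, currentContainerID, newStatus string) (bool, error)
	ListByStatus(ctx context.Context, status string) ([]RuntimeRecord, error)
}
```

Each adapter then carries the compile-time conformance line, e.g. var _ ports.RuntimeRecordStore = (*Store)(nil).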

Files new:

  • as above.

Exit criteria:

  • the package compiles.
  • every interface has a var _ ports.X = (*Y)(nil) assertion slot ready for the adapters that follow.

Stage 11. Persistence adapters

Status: implemented. Decision record: docs/stage11-persistence-adapters.md.

Goal:

  • implement the three PostgreSQL stores and the Redis offset store.

Tasks:

  • internal/adapters/postgres/runtimerecordstore/store.go using jet.
  • internal/adapters/postgres/operationlogstore/store.go.
  • internal/adapters/postgres/healthsnapshotstore/store.go.
  • internal/adapters/redisstate/streamoffsets/store.go (mirror Lobby's redisstate/streamoffsets).
  • For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on runtime_records.UpdateStatus are verified by an explicit concurrent-update test (only one of two callers wins).
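
A sketch of the CAS rule that test enforces, reusing the assumed (bool, error) shape from the Stage 10 port sketch; the harness helpers here are hypothetical:

```go
func TestUpdateStatusIsCASOnContainerID(t *testing.T) {
	ctx := context.Background()
	store := newStore(t) // assumed harness against testcontainers PostgreSQL

	mustSeed(t, store, ports.RuntimeRecord{
		GameID: "game-1", Status: "running", CurrentContainerID: "cnt-A",
	})

	// Matching container id: the update applies.
	ok, err := store.UpdateStatus(ctx, "game-1", "cnt-A", "stopped")
	if err != nil || !ok {
		t.Fatalf("expected winning update, got ok=%v err=%v", ok, err)
	}

	// Stale container id: the CAS must refuse to touch the row, which
	// is what makes only one of two racing callers win.
	ok, err = store.UpdateStatus(ctx, "game-1", "cnt-STALE", "removed")
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if ok {
		t.Fatal("stale container id must lose the CAS")
	}
}
```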

Files new:

  • as above and per-package _test.go.

Exit criteria:

  • store tests pass on a CI runner with Docker available.

Stage 12. Docker adapter and external clients

Status: implemented. Decision record: docs/stage12-docker-and-clients.md.

Goal:

  • ship the Docker SDK adapter and the external HTTP clients for Lobby internal API and notification publishing.

Tasks:

  • internal/adapters/docker/client.go — implements ports.DockerClient over github.com/docker/docker/client. Behaviour:
    • EnsureNetwork validates the configured network's presence (no creation).
    • PullImage honours the configured pull policy.
    • Inspect returns image and container metadata in domain-friendly shape.
    • Run builds the create + start sequence with labels, env (GAME_STATE_PATH, STORAGE_PATH), bind mount, log driver, resource limits read from image labels with config fallback.
    • Stop calls ContainerStop with the configured timeout.
    • Remove calls ContainerRemove.
    • List filters by label=com.galaxy.owner=rtmanager.
    • EventsListen returns a typed channel of decoded events.
  • internal/adapters/docker/mocks/ — mockgen-generated mock for ports.DockerClient, used by service tests.
  • internal/adapters/lobbyclient/client.go — REST client over an otelhttp-wrapped http.Client for GET /api/v1/internal/games/{game_id}. Returns LobbyGameRecord.
  • internal/adapters/notificationpublisher/publisher.go — wraps galaxy/notificationintent plus redis.XAdd against notification:intents.
  • Per-adapter unit tests with mocks. A small testcontainers Docker smoke test guarded by build tag rtmanager_docker_smoke until Stage 19 promotes it to default.
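
A sketch of the ownership-label filter behind List, written against a recent github.com/docker/docker SDK (option and result type names have shuffled between SDK majors, so adjust to the pinned version):

```go
package docker

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

type Client struct {
	cli *client.Client
}

func New() (*Client, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return nil, err
	}
	return &Client{cli: cli}, nil
}

// List returns only containers carrying the RTM ownership label, so
// reconcile and cleanup can never touch a container RTM does not own.
func (c *Client) List(ctx context.Context) ([]types.Container, error) {
	return c.cli.ContainerList(ctx, container.ListOptions{
		All:     true, // include stopped containers for the cleanup worker
		Filters: filters.NewArgs(filters.Arg("label", "com.galaxy.owner=rtmanager")),
	})
}
```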

Files new:

  • as above.

Exit criteria:

  • mocks regenerate cleanly via go generate.
  • unit tests pass.
  • the smoke test passes on a runner with Docker available.

Stage 13. Service: start

Status: implemented. Decision record: docs/stage13-start-service.md.

Goal:

  • end-to-end start operation in the service layer, callable from both the async consumer and the REST handler in later stages.

Tasks:

  • internal/service/startruntime/service.go orchestrator:
    1. Acquire game-id lease (Redis).
    2. Read runtime_records. If running with same image_ref, return idempotent success with error_code=replay_no_op.
    3. Optionally fetch LobbyGameRecord for ancillary fields; in v1 only image_ref is required, so this fetch is a no-op except for diagnostics.
    4. Pull image (per policy), inspect labels for resource limits.
    5. Ensure the per-game state directory exists with the configured mode and ownership.
    6. docker run with the configured network, hostname, labels, env, bind mount, log driver, resource limits.
    7. Upsert runtime_records (status=running, current_container_id, engine_endpoint, current_image_ref, started_at, last_op_at).
    8. Append operation_log entry (op_kind=start, outcome=success, op_source from caller).
    9. Publish runtime:health_events container_started.
    10. Return success outcome to caller (consumer publishes job_result, REST returns 200).
  • Failure paths in the table from README.md §Lifecycles → Start. Each failure path:
    • rolls back any partially created Docker resource;
    • publishes the matching admin-only notification intent;
    • records operation_log with outcome=failure and the stable error code;
    • returns failure to the caller.
  • Unit tests cover happy path, idempotent re-start, each failure mode, lease conflict, and partial-rollback paths.
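
Step 2's replay guard in isolation, sketched with stub types; the real service threads its own domain packages, and the result shape here is an assumption:

```go
package startruntime

import (
	"context"
	"errors"
)

type record struct {
	Status          string
	CurrentImageRef string
}

var errNotFound = errors.New("runtime record not found")

type recordStore interface {
	Get(ctx context.Context, gameID string) (record, error)
}

type result struct {
	Outcome   string // "success" | "failure"
	ErrorCode string // "" or a stable code such as "replay_no_op"
}

// replayGuard returns a non-nil result when the start is a duplicate
// delivery and no Docker work should happen.
func replayGuard(ctx context.Context, store recordStore, gameID, imageRef string) (*result, error) {
	rec, err := store.Get(ctx, gameID)
	if errors.Is(err, errNotFound) {
		return nil, nil // no record: proceed with a real start
	}
	if err != nil {
		return nil, err
	}
	if rec.Status == "running" && rec.CurrentImageRef == imageRef {
		return &result{Outcome: "success", ErrorCode: "replay_no_op"}, nil
	}
	return nil, nil
}
```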

Files new:

  • service/startruntime/{service.go, service_test.go, errors.go}.

Exit criteria:

  • service-level tests pass.

Stage 14. Service: stop, restart, patch, cleanup

Status: implemented. Decision record: docs/stage14-stop-restart-patch-cleanup.md.

Goal:

  • the remaining four lifecycle operations, sharing helpers with start.

Tasks:

  • internal/service/stopruntime/service.go — graceful docker stop with timeout, record stopped state. Idempotent re-stop returns success no-op.
  • internal/service/restartruntime/service.go — orchestrate stopruntime then startruntime with the current image_ref. Same Redis lease shared across both inner operations. Records a single operation_log entry with op_kind=restart plus a correlation id linking it to the implicit start/stop entries.
  • internal/service/patchruntime/service.go — restart with a new image_ref. Validates the semver-patch-only rule (major and minor must equal current version; otherwise return semver_patch_only failure). If the engine version is not parseable as semver, return image_ref_not_semver.
  • internal/service/cleanupcontainer/service.go — docker rm for an already-stopped container; refuses if status=running. Sets runtime_records.status=removed.
  • The Redis lease covers each operation end-to-end; restart and patch hold the lease across the inner stop+start to prevent races.
  • Unit tests for each service. Cross-operation race tests assert that concurrent start vs. stop on the same game_id either succeed in some order or both observe the lease and one returns conflict.
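
The semver-patch-only rule from patchruntime, sketched with plain parsing; the real service may use a semver library instead:

```go
package patchruntime

import (
	"errors"
	"fmt"
	"strings"
)

var (
	ErrNotSemver       = errors.New("image_ref_not_semver")
	ErrSemverPatchOnly = errors.New("semver_patch_only")
)

// parse reads major.minor.patch; pre-release suffixes are ignored in
// this sketch.
func parse(version string) (major, minor, patch int, err error) {
	if _, err = fmt.Sscanf(strings.TrimPrefix(version, "v"), "%d.%d.%d", &major, &minor, &patch); err != nil {
		return 0, 0, 0, ErrNotSemver
	}
	return major, minor, patch, nil
}

// ValidatePatch accepts the new version only when major and minor
// both match the currently running version.
func ValidatePatch(current, next string) error {
	cMaj, cMin, _, err := parse(current)
	if err != nil {
		return err
	}
	nMaj, nMin, _, err := parse(next)
	if err != nil {
		return err
	}
	if cMaj != nMaj || cMin != nMin {
		return ErrSemverPatchOnly
	}
	return nil
}
```

So 1.4.2 → 1.4.3 passes, 1.4.2 → 1.5.0 returns semver_patch_only, and 1.4.x → latest returns image_ref_not_semver.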

Files new:

  • service/{stopruntime, restartruntime, patchruntime, cleanupcontainer}/....

Exit criteria:

  • service-level tests pass.

Stage 15. Async consumers and runtime:job_results publisher

Status: implemented. Decision record: docs/stage15-async-consumers.md.

Goal:

  • wire the Lobby-side stream contract into the freshly built service layer.

Tasks:

  • internal/worker/startjobsconsumer/consumer.go — XREAD over runtime:start_jobs, decodes envelope {game_id, image_ref, requested_at_ms}, calls startruntime service, publishes runtime:job_results with the canonical schema, advances the Redis offset. Mirrors patterns from lobby/internal/worker/runtimejobresult/consumer.go.
  • internal/worker/stopjobsconsumer/consumer.go — XREAD over runtime:stop_jobs, decodes {game_id, reason, requested_at_ms}, calls stopruntime.
  • internal/adapters/jobresultspublisher/publisher.go — small XADD wrapper for runtime:job_results.
  • Replay safety: deterministic «already running» / «already stopped» idempotent outcomes surface as outcome=success with error_code=replay_no_op.
  • Tests use miniredis and a fake ports.DockerClient. A consumer integration test drives a full Lobby → RTM → Lobby roundtrip end-to-end.
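
The consumer loop shape, assuming github.com/redis/go-redis/v9; the start service and offset store are reduced to stubs, and the real consumer publishes a failure job_result instead of returning on service errors:

```go
package startjobsconsumer

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

type offsets interface {
	Load(ctx context.Context, label string) (string, error)
	Save(ctx context.Context, label, id string) error
}

type starter interface {
	Start(ctx context.Context, gameID, imageRef string) error
}

func Run(ctx context.Context, rdb *redis.Client, off offsets, svc starter) error {
	lastID, err := off.Load(ctx, "startjobsconsumer")
	if err != nil {
		return err
	}
	if lastID == "" {
		lastID = "0" // first run: start from the beginning of the stream
	}
	for {
		streams, err := rdb.XRead(ctx, &redis.XReadArgs{
			Streams: []string{"runtime:start_jobs", lastID},
			Block:   5 * time.Second,
		}).Result()
		if errors.Is(err, redis.Nil) {
			continue // block window elapsed with no new entries
		}
		if err != nil {
			return err
		}
		for _, msg := range streams[0].Messages {
			gameID, _ := msg.Values["game_id"].(string)
			imageRef, _ := msg.Values["image_ref"].(string)
			if err := svc.Start(ctx, gameID, imageRef); err != nil {
				return err
			}
			// Advance the offset only after the job is fully handled.
			lastID = msg.ID
			if err := off.Save(ctx, "startjobsconsumer", lastID); err != nil {
				return err
			}
		}
	}
}
```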

Files new:

  • as above + tests.

Exit criteria:

  • consumer integration test passes.

Stage 16. Internal REST handlers

Status: implemented. Decision record: docs/stage16-internal-rest-handlers.md.

Goal:

  • ship the GM/Admin-facing REST surface backed by the service layer.

Tasks:

  • internal/api/internalhttp/handlers/{list, get, start, stop, restart, patch, cleanup}.go — one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields rejected with invalid_request.
  • Error envelope identical to lobby: { "error": { "code", "message" } }. Stable codes: invalid_request, not_found, conflict, service_unavailable, internal_error, image_ref_not_semver, semver_patch_only, image_pull_failed, container_start_failed, start_config_invalid, docker_unavailable.
  • Wiring under the existing internal HTTP listener; route registration in internal/app/wiring.go.
  • Handler-level table-driven tests; OpenAPI conformance test that loads api/internal-openapi.yaml and asserts every defined operation is reachable and matches its declared response.
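
Unknown-field rejection and the shared error envelope, sketched with the standard library; handler and payload names are illustrative:

```go
package handlers

import (
	"encoding/json"
	"net/http"
)

type errorEnvelope struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

func writeError(w http.ResponseWriter, status int, code, msg string) {
	var env errorEnvelope
	env.Error.Code = code
	env.Error.Message = msg
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(env)
}

// decodeStrict turns any unknown JSON field into a decode error.
func decodeStrict(r *http.Request, dst any) error {
	dec := json.NewDecoder(r.Body)
	dec.DisallowUnknownFields()
	return dec.Decode(dst)
}

type patchRequest struct {
	ImageRef string `json:"image_ref"`
}

func PatchRuntime(w http.ResponseWriter, r *http.Request) {
	var req patchRequest
	if err := decodeStrict(r, &req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid_request", err.Error())
		return
	}
	// ... delegate to the patchruntime service ...
}
```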

Files new:

  • handlers + tests.

Exit criteria:

  • OpenAPI conformance test passes for every endpoint.
  • Handlers reject unknown JSON fields.

Stage 17. Health monitoring

Status: implemented. Decision record: docs/stage17-health-monitoring.md.

Goal:

  • observability of running containers via the three sources from README.md §Health Monitoring.

Tasks:

  • internal/worker/dockerevents/listener.go — subscribes to Docker events with the com.galaxy.owner=rtmanager label filter, looks up runtime_records by labels, emits runtime:health_events for container_exited, container_oom, container_disappeared. container_started is emitted directly by the start service (Stage 13) when it runs the container.
  • internal/worker/healthprobe/worker.go — periodic worker iterating runtime_records.status=running. Calls GET {engine_endpoint}/healthz with the configured timeout, applies the RTMANAGER_PROBE_FAILURES_THRESHOLD hysteresis, emits probe_failed / probe_recovered. Uses otelhttp client.
  • internal/worker/dockerinspect/worker.go — periodic full inspect; emits inspect_unhealthy on observed RestartCount growth or unexpected status.
  • internal/adapters/healtheventspublisher/publisher.go — XADD wrapper for runtime:health_events. Always also upserts the latest snapshot into health_snapshots.
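
The probe hysteresis in isolation: consecutive failures must reach the configured threshold before probe_failed is emitted, and a single success emits probe_recovered. Event names are from the plan; the tracker shape is an assumption:

```go
package healthprobe

type tracker struct {
	threshold int
	failures  map[string]int // game_id → consecutive probe failures
	unhealthy map[string]bool
}

func newTracker(threshold int) *tracker {
	return &tracker{
		threshold: threshold,
		failures:  map[string]int{},
		unhealthy: map[string]bool{},
	}
}

// observe returns the health event to emit, or "" for no state change.
func (t *tracker) observe(gameID string, ok bool) string {
	if ok {
		t.failures[gameID] = 0
		if t.unhealthy[gameID] {
			t.unhealthy[gameID] = false
			return "probe_recovered"
		}
		return ""
	}
	t.failures[gameID]++
	if t.failures[gameID] >= t.threshold && !t.unhealthy[gameID] {
		t.unhealthy[gameID] = true
		return "probe_failed"
	}
	return ""
}
```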

Files new:

  • as above + tests.

Exit criteria:

  • worker tests use a Docker mock that programmatically emits events and asserts the published stream entries match the AsyncAPI spec.

Stage 18. Reconciler and container cleanup

Status: implemented. Decision record: docs/stage18-reconcile-and-cleanup.md.

Goal:

  • drift management and TTL-based cleanup.

Tasks:

  • internal/worker/reconcile/reconciler.go — runs at startup (blocking before workers start) and periodically (RTMANAGER_RECONCILE_INTERVAL). Implements the rules from README.md §Reconciliation:
    • record running containers without a PG record, never kill them (op_kind=reconcile_adopt);
    • mark runtime_records.status=running rows whose container is missing as removed, publish container_disappeared (op_kind=reconcile_dispose).
  • internal/worker/containercleanup/worker.go — periodic worker (RTMANAGER_CLEANUP_INTERVAL) that lists runtime_records with status=stopped and last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS, calls cleanupcontainer service for each.
  • Both workers are registered as app.Components in internal/app/wiring.go.
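
The TTL predicate the cleanup worker applies, written against an injectable clock so the fake-clock test from the exit criteria can pin Now():

```go
package containercleanup

import "time"

type clock interface{ Now() time.Time }

// eligible reports whether a stopped runtime is old enough to remove:
// last_op_at < now - retention.
func eligible(lastOpAt time.Time, retentionDays int, c clock) bool {
	cutoff := c.Now().AddDate(0, 0, -retentionDays)
	return lastOpAt.Before(cutoff)
}
```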

Files new:

  • as above + tests.

Exit criteria:

  • reconciler test using mocked Docker proves both adopt and dispose paths.
  • cleanup test proves TTL math with a fake clock.

Stage 19. Service-local integration suite

Status: implemented. Decision record: docs/stage19-integration.md.

Goal:

  • end-to-end suite running against testcontainers PostgreSQL + Redis + the real Docker daemon, using the freshly-built galaxy/game test image.

Tasks:

  • rtmanager/integration/harness/ — set up PostgreSQL with goose-applied migrations; Redis (miniredis is sufficient for stream-only suites; testcontainers Redis for coordination suites that exercise leases); ensure the Docker bridge network exists; build galaxy/game test image once per package run with sync.Once; tear everything down via t.Cleanup.
  • rtmanager/integration/lifecycle_test.go — start → inspect → stop → restart → patch → cleanup against the real engine; assert each step's PG, Redis-stream, and Docker side-effects. Engine state directories are created via t.ArtifactDir().
  • rtmanager/integration/replay_test.go — duplicate start/stop messages are no-ops with error_code=replay_no_op.
  • rtmanager/integration/health_test.go — kill the engine container externally; assert container_disappeared event publishes within timeout. Bring it back with a manual docker run; assert the reconciler adopts it.
  • rtmanager/integration/notification_test.go — drive a start with an unresolvable image ref; assert RTM publishes the runtime.image_pull_failed notification intent and a failure job_result.

Files new:

  • as above.

Exit criteria:

  • go test ./rtmanager/integration/... passes locally with Docker available.
  • CI runs the suite under a profile that exposes the Docker socket.

Stage 20. Inter-service test: Lobby ↔ RTM

Status: implemented. Decision record: docs/stage20-lobbyrtm.md.

Goal:

  • satisfy the TESTING.md §7 inter-service requirement with real Lobby + real RTM.

Tasks:

  • integration/lobbyrtm/ (top-level integration directory, mirroring existing integration/notificationgateway, etc.): runs real Lobby, real RTM, real PostgreSQL, real Redis, and the galaxy/game test engine container.
  • Scenarios:
    • Lobby creates a game, publishes a start_job with image_ref, RTM starts the engine, publishes job_result, Lobby transitions the game to running. The engine answers /healthz.
    • Lobby transitions a game to cancelled, publishes stop_job with reason=cancelled, RTM stops the engine. RTM operation_log records the transition.
    • Failure path: image_ref points at a missing image. RTM publishes a failure job_result and the matching notification intent. Lobby transitions the game to start_failed.

Files new:

  • as above.

Exit criteria:

  • all scenarios pass in CI when the Docker socket is available.

Stage 21. Service-local docs

Status: implemented.

Goal:

  • drop per-stage decisions captured during this plan into discoverable service-local documentation, mirroring lobby/docs/.

Tasks:

  • docs/README.md — index pointing at the four content docs and the postgres-migration record.
  • docs/runtime.md — components, processes, in-memory state of each worker.
  • docs/flows.md — mermaid diagrams for: start happy path, start failure (image pull), start failure (orphan), stop, restart, patch, cleanup TTL, reconcile drift adopt, health probe hysteresis.
  • docs/runbook.md — operator scenarios: «engine container died», «patch upgrade», «manual cleanup», «reconcile drift after Docker daemon restart», «testing locally».
  • docs/examples.md — env-var examples per environment (dev / test / prod skeletons), example payloads for each stream and each REST endpoint.
  • docs/postgres-migration.md — decision record for the schema (mirrors notification/docs/postgres-migration.md style).

Files new:

  • all six.

Exit criteria:

  • the README of RTM links to docs/README.md.
  • a reviewer can find any operational how-to within two clicks.

Stage 22. Migrate hand-rolled stubs to mockgen

Status: implemented. Decision record: docs/stage22-stub-migration.md.

Goal:

  • unify the test-double style across the repository on the mockgen pipeline introduced for the RTM Docker port in Stage 12. Today every Galaxy service except RTM hand-rolls *stub packages; mixing styles raises onboarding cost and makes port-signature drift easier to miss.

Tasks (high-level only — each package gets its own decision when this stage is opened):

Files new:

  • one mocks/ directory under each affected adapter group, plus a lobby/Makefile mocks target (and equivalents for any other service the audit identifies).

Files touched:

  • every *stub package listed above plus its consumers.
  • lobby/Makefile, lobby/internal/ports/*.go (for //go:generate directives).
  • the documentation listed above.

Exit criteria:

  • *stub packages are gone from lobby/internal/adapters/ and the mocks/ packages compile against the current ports.
  • make -C lobby mocks regenerates with no diff after a clean run.
  • go test ./lobby/... is green.
  • Documentation across rtmanager/docs/, lobby/docs/, top-level AGENTS.md, and any affected README.md references the unified convention.

Final Acceptance Criteria

  • go build ./... from the repository root succeeds.
  • go test ./... from the repository root passes.
  • go test -tags=integration ./rtmanager/integration/... passes when Docker is available.
  • go test ./integration/lobbyrtm/... passes when Docker is available.
  • make -C rtmanager jet regenerates jet code with no diff after a clean run.
  • Manual smoke: bring Lobby + RTM + the rest of the stack up via the existing dev compose; create a game; observe a real galaxy-game-{game_id} container; curl http://galaxy-game-{game_id}:8080/healthz returns 200; stop the game; the container moves to exited; the admin cleanup endpoint removes it.
  • Documentation across ARCHITECTURE.md, lobby, notification, game, and rtmanager is internally consistent.

Out of Scope

  • Multi-instance Runtime Manager with Redis Streams consumer groups (XREADGROUP / XCLAIM).
  • Engine version registry inside Game Master. Producer-supplied image_ref decouples this work from RTM.
  • TLS / mTLS on the internal listener.
  • Engine in-place upgrades driven by an engine API. Patch is always recreate.
  • Backup, archival, or cleanup of host state directories.
  • Kubernetes, Docker Swarm, or any non-Docker orchestrator.
  • Consumption of runtime:health_events by Game Master, Game Lobby, or Notification Service. Those are next-stage concerns of those services.

Risks and Notes

  • CI must expose a Docker socket (or run rootless equivalent) to execute the integration suites. Without Docker the integration tests are skipped through a build-tag guard.
  • The reason enum on runtime:stop_jobs is fixed in this plan ({orphan_cleanup, cancelled, finished, admin_request, timeout}). Adding a new value requires a contract bump in runtime-jobs-asyncapi.yaml and a Lobby publisher change. Keep the enum small.
  • Lobby's existing runtimejobresult worker only reacts to start outcomes today. Stop outcomes are observable in RTM operation_log but Lobby does not yet update game status from them. Adding a stop-result consumer to Lobby is a future Lobby stage and is explicitly out of scope here.
  • Pre-launch single-init policy applies to RTM exactly as documented in ARCHITECTURE.md §Persistence Backends: schema evolves by editing 00001_init.sql until first production deploy.