Runtime Manager Implementation Plan
This plan has already been implemented and stays here for historical reasons.
It should NOT be treated as a source of truth for service functionality.
Summary
This plan delivers Runtime Manager (RTM), the only Galaxy service with direct Docker access.
It owns container lifecycle (start, stop, restart, patch, cleanup), three-source health
monitoring, and a synchronous internal REST surface used by Game Master and Admin Service.
Game Lobby continues to drive RTM asynchronously through Redis Streams.
The plan also delivers the upstream changes that RTM depends on: a new image_ref field in
the start envelope and a reason field in the stop envelope produced by Lobby; a /healthz
endpoint, Dockerfile, and STORAGE_PATH / GAME_STATE_PATH contract on galaxy/game; new
admin-only notification types in the catalog plus matching constructors in
galaxy/notificationintent.
The architectural rules behind every decision are recorded in
./README.md. This file describes the order in which the implementation
lands.
Global Rules
- Documentation always lands before contracts; contracts before code.
- Each stage leaves the repository in a buildable, test-green state. No stage relies on a later stage to fix a regression it introduced.
- Existing-service refactors (Lobby publisher, Notification catalog, Game engine) are full-fledged stages of this plan; they precede every RTM stage that depends on them.
- RTM never resolves engine versions: the producer supplies `image_ref`. RTM never deletes the host state directory. RTM never kills containers it does not own a record for.
- Every functional change ships its tests in the same stage. Contract tests freeze operation IDs and stream message names from Stage 04 onward.
- All code, docs, and identifiers are written in English.
Suggested Module Structure
rtmanager/
├── cmd/
│ ├── rtmanager/
│ │ └── main.go
│ └── jetgen/
│ └── main.go
│
├── internal/
│ ├── app/
│ │ ├── app.go
│ │ ├── runtime.go
│ │ ├── wiring.go
│ │ └── bootstrap.go
│ │
│ ├── config/
│ │ ├── config.go
│ │ ├── env.go
│ │ └── validation.go
│ │
│ ├── logging/
│ │ ├── logger.go
│ │ └── context.go
│ │
│ ├── telemetry/
│ │ └── runtime.go
│ │
│ ├── domain/
│ │ ├── runtime/
│ │ │ ├── model.go
│ │ │ └── transitions.go
│ │ ├── operation/
│ │ │ └── log.go
│ │ └── health/
│ │ └── snapshot.go
│ │
│ ├── ports/
│ │ ├── runtimerecordstore.go
│ │ ├── operationlogstore.go
│ │ ├── healthsnapshotstore.go
│ │ ├── streamoffsetstore.go
│ │ ├── dockerclient.go
│ │ ├── lobbyinternal.go
│ │ └── notificationintents.go
│ │
│ ├── adapters/
│ │ ├── postgres/
│ │ │ ├── migrations/
│ │ │ ├── jet/
│ │ │ ├── runtimerecordstore/
│ │ │ ├── operationlogstore/
│ │ │ └── healthsnapshotstore/
│ │ ├── redisstate/
│ │ │ └── streamoffsets/
│ │ ├── docker/
│ │ │ ├── client.go
│ │ │ └── mocks/
│ │ ├── lobbyclient/
│ │ ├── notificationpublisher/
│ │ ├── jobresultspublisher/
│ │ └── healtheventspublisher/
│ │
│ ├── service/
│ │ ├── startruntime/
│ │ ├── stopruntime/
│ │ ├── restartruntime/
│ │ ├── patchruntime/
│ │ └── cleanupcontainer/
│ │
│ ├── worker/
│ │ ├── startjobsconsumer/
│ │ ├── stopjobsconsumer/
│ │ ├── dockerevents/
│ │ ├── healthprobe/
│ │ ├── dockerinspect/
│ │ ├── reconcile/
│ │ └── containercleanup/
│ │
│ └── api/
│ └── internalhttp/
│ ├── server.go
│ └── handlers/
│
├── api/
│ ├── internal-openapi.yaml
│ ├── runtime-jobs-asyncapi.yaml
│ └── runtime-health-asyncapi.yaml
│
├── integration/
│ ├── harness/
│ ├── lifecycle_test.go
│ ├── replay_test.go
│ ├── health_test.go
│ └── notification_test.go
│
├── docs/
│ ├── README.md
│ ├── runtime.md
│ ├── flows.md
│ ├── runbook.md
│ ├── examples.md
│ └── postgres-migration.md
│
├── README.md
├── PLAN.md
├── Makefile
└── go.mod
Stage 01. Update ARCHITECTURE.md
Status: implemented.
Goal:
- align the project-wide source of truth with every decision recorded in `./README.md` before any code change touches it.
Tasks:
- Expand `ARCHITECTURE.md` §9 (Runtime Manager) with subsections: container model (`galaxy-game-{game_id}` DNS naming, bind-mount ABI, network prerequisite), image policy (producer-supplied `image_ref`), state ownership rule (RTM never deletes the host state directory), reconcile policy (adopt unrecorded containers, never kill them).
- Update §«Fixed asynchronous interactions»: note the `image_ref` field on Lobby → RTM, add the `runtime:health_events` outbound stream, add Runtime Manager → Notification Service for admin alerts.
- Update §«Fixed synchronous interactions»: add Game Master → Runtime Manager and Admin Service → Runtime Manager for REST inspect / restart / patch / stop / cleanup, and remove the corresponding async entries.
- Update §«Persistence Backends»: add the `rtmanager` schema to the schema-per-service list and to the PG-backed services.
- Update §«Configuration»: add `RTMANAGER` to the env-var prefix list with the same shape rules as other PG/Redis-backed services.
- Update §«Recommended Order of Service Implementation» entry 7 with the now-fixed scope (start, stop, restart, patch, inspect, health monitoring).
Files touched:
ARCHITECTURE.md.
Exit criteria:
- every later RTM, Lobby, Notification, or Game stage can quote its rules from `ARCHITECTURE.md` without re-deciding them.
Stage 02. Freeze RTM README.md
Status: implemented as part of this planning task — see ./README.md.
Goal:
- publish the complete service description so contracts and code can reference one source.
Tasks:
- Write `rtmanager/README.md` covering Purpose, Scope, Non-Goals, Position in the System, Responsibility Boundaries, Container Model, Runtime Surface, Lifecycles, Health Monitoring, Reconciliation, Trusted Surfaces, Async Stream Contracts, Notification Contracts, Persistence Layout, Error Model, Configuration, Observability, Verification.
Exit criteria:
- a reviewer can answer any «what does RTM do when X» question by reading the README alone.
Stage 03. Sync existing-service docs (Lobby, Notification, Game)
Status: implemented.
Goal:
- bring the READMEs of every touched service into agreement with the RTM contract before any code in those services changes.
Tasks:
- `lobby/README.md`: update Game Start Flow — the start envelope is now `{game_id, image_ref, requested_at_ms}`. Add `LOBBY_ENGINE_IMAGE_TEMPLATE` to the Configuration section. Document the new stop envelope `reason` enum (orphan_cleanup | cancelled | finished | admin_request | timeout). Note that the Lobby ↔ RTM transport stays asynchronous indefinitely.
- `lobby/PLAN.md`: append a single closing note that runtime-job envelope changes belong to the Runtime Manager plan; no new stages are added there.
- `notification/README.md`: add three admin notification types to the catalog (`runtime.image_pull_failed`, `runtime.container_start_failed`, `runtime.start_config_invalid`), each email-only with audience admin in v1.
- `notification/PLAN.md`: append a closing note pointing at the Runtime Manager plan for the catalog extension.
- `game/README.md` (create if absent): document the new `/healthz` endpoint, the `STORAGE_PATH` / `GAME_STATE_PATH` env contract, and the new `Dockerfile` location.
Files touched:
`lobby/README.md`, `lobby/PLAN.md`, `notification/README.md`, `notification/PLAN.md`, `game/README.md`.
Exit criteria:
- every doc in the repo agrees on the post-RTM contract; no contradiction remains between any two READMEs.
Stage 04. RTM contract files and contract tests
Status: implemented.
Goal:
- ship machine-readable contracts before any RTM handler is written, so the implementation has a target spec.
Tasks:
- `rtmanager/api/internal-openapi.yaml`: every internal REST endpoint with request and response schemas; error envelope `{ "error": { "code", "message" } }` identical to Lobby. Operation IDs: `internalListRuntimes`, `internalGetRuntime`, `internalStartRuntime`, `internalStopRuntime`, `internalRestartRuntime`, `internalPatchRuntime`, `internalCleanupRuntimeContainer`, `internalHealthz`, `internalReadyz`.
- `rtmanager/api/runtime-jobs-asyncapi.yaml`: AsyncAPI 2.6.0 spec for `runtime:start_jobs`, `runtime:stop_jobs`, `runtime:job_results`. Frozen field set per message.
- `rtmanager/api/runtime-health-asyncapi.yaml`: AsyncAPI 2.6.0 spec for `runtime:health_events` with the `event_type` enum and the polymorphic `details` schema (`oneOf` per type).
- `rtmanager/contract_openapi_test.go` and `rtmanager/contract_asyncapi_test.go`: load the specs via `kin-openapi` (and the AsyncAPI loader pattern from `notification/contract_asyncapi_test.go`), assert operation IDs / message names / field presence.
Files new:
- the four files above.
Exit criteria:
- all three specs validate; contract tests pass; tests fail loudly if any operation ID, message name, or required field disappears.
Stage 05. Game engine /healthz, Dockerfile, STORAGE_PATH
Status: implemented.
Goal:
- make `galaxy/game` runnable as the test engine image RTM uses in integration tests.
Tasks:
- Add `GET /healthz` to `game/internal/router` returning `{"status":"ok"}` (200) when the engine process is up, irrespective of whether a game has been initialised. The existing `/api/v1/status` keeps its current 501 behaviour for an uninitialised engine.
- Make the engine read the storage path from the `STORAGE_PATH` env var, falling back to `GAME_STATE_PATH` when set. Both names are accepted; `GAME_STATE_PATH` is the contract RTM writes.
- Update `game/cmd/http/main.go` to bind the env.
- Add `galaxy/game/Dockerfile`: multi-stage (golang builder + small runtime base). Exposes `:8080`. Default `STORAGE_PATH=/var/lib/galaxy-game`. Copies the binary. Runs as a non-root user.
- Add image labels to the `Dockerfile`: `com.galaxy.cpu_quota=1.0`, `com.galaxy.memory=512m`, `com.galaxy.pids_limit=512`, `org.opencontainers.image.title=galaxy-game-engine`.
- Update `game/openapi.yaml` to document `/healthz`.
- Update `game/openapi_contract_test.go` to assert `/healthz` presence.
Files new:
galaxy/game/Dockerfile.
Files touched:
`galaxy/game/internal/router/*.go`, `galaxy/game/cmd/http/main.go`, `galaxy/game/openapi.yaml`, `galaxy/game/openapi_contract_test.go`.
Exit criteria:
- `docker build -t galaxy/game:test -f game/Dockerfile .` (run from the workspace root) succeeds. The build context is the workspace root because `game/` resolves `galaxy/{model,error,util,...}` through `go.work` `replace` directives; see `rtmanager/docs/game-dockerfile-build-context.md`.
- `docker run --rm -e STORAGE_PATH=/tmp/x -p 8080:8080 galaxy/game:test` answers `/healthz` with 200.
- `go test ./game/...` passes.
Stage 06. Lobby publisher refactor
Status: implemented.
Goal:
- ship the new `runtime:start_jobs` and `runtime:stop_jobs` envelopes from Lobby. After this stage Lobby is RTM-ready; the real RTM appears from Stage 13 onwards.
Tasks:
- Add `LOBBY_ENGINE_IMAGE_TEMPLATE` (default `galaxy/game:{engine_version}`) and its validation to `lobby/internal/config/config.go` and `env.go`.
- Build `lobby/internal/domain/engineimage/resolver.go`, which turns `(template, target_engine_version)` into `image_ref`, validating both inputs. Reject templates without `{engine_version}`; reject empty engine versions.
- `lobby/internal/ports/runtimemanager.go`: change the interface to `PublishStartJob(ctx, gameID, imageRef string) error` and `PublishStopJob(ctx, gameID string, reason StopReason) error`, with a `StopReason` enum (`orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`) declared in the same package.
- `lobby/internal/adapters/runtimemanager/publisher.go`: write the new fields into the `XADD` payload.
- Update callers:
  - `lobby/internal/service/startgame/`: resolve `image_ref` from the loaded game record, pass it to `PublishStartJob`.
  - `lobby/internal/worker/runtimejobresult/consumer.go`: pass `reason=orphan_cleanup` to `PublishStopJob` from the orphan-container path.
- Update Lobby unit tests (publisher, services) and contract tests (if Lobby has any describing the runtime envelopes; otherwise add `TestPublisherStartJobIncludesImageRef` and `TestPublisherStopJobIncludesReason`).
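The resolver's behaviour can be sketched as below. This is an assumption-laden illustration: the function name `resolveImageRef` and the exact error wording are invented; only the rules (substitute `{engine_version}`, reject templates without the placeholder, reject empty versions) come from the plan.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolveImageRef substitutes {engine_version} in the configured template to
// produce the image_ref Lobby publishes in the start envelope.
func resolveImageRef(template, engineVersion string) (string, error) {
	const placeholder = "{engine_version}"
	if !strings.Contains(template, placeholder) {
		return "", fmt.Errorf("template %q lacks %s", template, placeholder)
	}
	if engineVersion == "" {
		return "", errors.New("empty engine version")
	}
	return strings.ReplaceAll(template, placeholder, engineVersion), nil
}

func main() {
	ref, err := resolveImageRef("galaxy/game:{engine_version}", "1.4.2")
	if err != nil || ref != "galaxy/game:1.4.2" {
		panic("unexpected resolver result")
	}
}
```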
Files new:
`lobby/internal/domain/engineimage/resolver.go` and its test file.
Files touched:
- the Lobby files listed above.
Exit criteria:
- `go test ./lobby/...` passes.
- An `XADD` against the start stream contains the `image_ref` field; an `XADD` against the stop stream contains the `reason` field.
Stage 07. Notification intent constructors and catalog extension
Status: implemented.
Goal:
- expose three admin-only notification types so RTM (Stage 13 onwards) can publish them without later cross-cutting refactors.
Tasks:
- Add constructors and payload structs to `galaxy/notificationintent/`: `NewRuntimeImagePullFailedIntent(meta, payload)`, `NewRuntimeContainerStartFailedIntent(meta, payload)`, `NewRuntimeStartConfigInvalidIntent(meta, payload)`. Each payload includes `game_id`, `image_ref`, `error_code`, `error_message`, `attempted_at_ms`.
- Extend `notification/api/intents-asyncapi.yaml` with the three new payload schemas and add them to the catalog.
- Extend the notification routing tables (data only — no service code) so the existing routing rules cover the new types: delivery decision email-only, audience admin.
- Extend `notification/contract_asyncapi_test.go` to freeze the new message names and payload required fields.
Files touched:
- `galaxy/notificationintent/*.go`, `notification/api/intents-asyncapi.yaml`,
- the notification catalog data tables (locations defined inside `notification/internal/...`),
- `notification/contract_asyncapi_test.go`.
Exit criteria:
- unit tests for the new constructors pass.
- AsyncAPI validates.
- Notification's existing integration suites still pass with the new types added.
Stage 08. RTM module skeleton
Status: implemented.
Goal:
- create a buildable `rtmanager` binary that loads config, opens dependencies, and exits cleanly on SIGTERM. It does no business work yet.
Tasks:
- `rtmanager/cmd/rtmanager/main.go` mirroring `lobby/cmd/lobby/main.go`.
- `rtmanager/internal/config/{config.go, env.go, validation.go}` with env prefix `RTMANAGER` and groups Listener, Docker, Postgres, Redis, Streams, Container defaults, Health, Cleanup, Coordination, Lobby internal client, Logging, Lifecycle, Telemetry. Required variables fail fast.
- `rtmanager/internal/logging/{logger.go, context.go}` copied from lobby/notification.
- `rtmanager/internal/telemetry/runtime.go` registering the metrics named in `README.md` §Observability.
- `rtmanager/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go}` — empty wiring with PostgreSQL open, Redis open, Docker client open (ping only), telemetry open, probe listener open.
- `rtmanager/internal/api/internalhttp/server.go` — listener with `/healthz` and `/readyz` only.
- `rtmanager/Makefile` with the `jet` target (real generation lands in Stage 09).
- `rtmanager/go.mod` and `go.sum` with dependencies: `github.com/docker/docker`, `github.com/redis/go-redis/v9`, `github.com/jackc/pgx/v5`, `github.com/go-jet/jet/v2`, `github.com/pressly/goose/v3`, `github.com/stretchr/testify`, the testcontainers modules for postgres / redis / docker, and the OpenTelemetry stack identical to lobby.
- Update the repo-level `go.work` to include `./rtmanager`.
Files new:
- the entire skeleton tree.
Exit criteria:
- `go build ./rtmanager/cmd/rtmanager` succeeds.
- Running with valid env brings `/healthz` and `/readyz` up.
- SIGTERM returns within `RTMANAGER_SHUTDOWN_TIMEOUT`.
Stage 09. PostgreSQL schema, migrations, jet
Status: implemented.
Goal:
- finalise the persistence schema and the code-generation pipeline.
Tasks:
- `internal/adapters/postgres/migrations/00001_init.sql` — `CREATE SCHEMA IF NOT EXISTS rtmanager;` plus the three tables and indexes from `README.md` §Persistence Layout.
- `internal/adapters/postgres/migrations/migrations.go` — `//go:embed *.sql` and an `FS()` exporter, identical pattern to lobby.
- `cmd/jetgen/main.go` — testcontainers PostgreSQL + goose up + jet generation against the resulting database. Mirrors `lobby/cmd/jetgen/main.go`.
- Generated `internal/adapters/postgres/jet/...` committed to the repo.
- Wire goose migrations into `internal/app/runtime.go` startup so they apply before any listener opens; non-zero exit on failure (matches the `pkg/postgres` policy).
Files new:
- as above.
Exit criteria:
- `make -C rtmanager jet` regenerates the jet code with no diff after a clean run.
- Service start applies migrations to a fresh database and exits zero if migrations are already applied.
Stage 10. Domain layer and ports
Status: implemented.
Goal:
- lock the in-memory domain model and the port interfaces for adapters.
Tasks:
- `internal/domain/runtime/model.go` — `RuntimeRecord` struct, status enum (`StatusRunning`, `StatusStopped`, `StatusRemoved`), error sentinels.
- `internal/domain/runtime/transitions.go` — allowed-transitions table and a CAS-friendly validator.
- `internal/domain/operation/log.go` — `OpKind`, `OpSource`, `Outcome` enums plus the `OperationEntry` struct.
- `internal/domain/health/snapshot.go` — `HealthEventType` enum, `HealthSnapshot` struct.
- `internal/ports/`:
  - `runtimerecordstore.go` — `Get`, `Upsert`, `UpdateStatus` (CAS by `current_container_id`), `ListByStatus`.
  - `operationlogstore.go` — `Append`, `ListByGame`.
  - `healthsnapshotstore.go` — `Upsert`, `Get`.
  - `streamoffsetstore.go` — `Load`, `Save` (Redis offset persistence per consumer label).
  - `dockerclient.go` — the narrow surface RTM uses: `EnsureNetwork`, `PullImage`, `Inspect`, `Run`, `Stop`, `Remove`, `List`, `EventsListen`. (`Logs` reserved; not in v1.)
  - `lobbyinternal.go` — `GetGame(ctx, gameID) (LobbyGameRecord, error)`.
  - `notificationintents.go` — `Publish(ctx, intent) error`.
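An allowed-transitions table of this shape could look like the sketch below. The concrete edge set is an assumption for illustration (the plan fixes only the three statuses, not the exact edges), and the names mirror the enum above:

```go
package main

// Status mirrors the runtime status enum from the domain model.
type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// allowed is an illustrative transition table: stop/disappear from running,
// restart or cleanup from stopped, and a fresh start after removal.
var allowed = map[Status][]Status{
	StatusRunning: {StatusStopped, StatusRemoved},
	StatusStopped: {StatusRunning, StatusRemoved},
	StatusRemoved: {StatusRunning},
}

// CanTransition reports whether the table permits from -> to; same-state
// idempotent no-ops are handled by the services, not by this validator.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	if !CanTransition(StatusRunning, StatusStopped) {
		panic("stop must be allowed from running")
	}
}
```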
Files new:
- as above.
Exit criteria:
- the package compiles.
- every interface has a `_ ports.X = (*Y)(nil)` assertion slot ready for the adapters that follow.
Stage 11. Persistence adapters
Status: implemented. Decision record:
docs/stage11-persistence-adapters.md.
Goal:
- implement the three PostgreSQL stores and the Redis offset store.
Tasks:
- `internal/adapters/postgres/runtimerecordstore/store.go` using jet.
- `internal/adapters/postgres/operationlogstore/store.go`.
- `internal/adapters/postgres/healthsnapshotstore/store.go`.
- `internal/adapters/redisstate/streamoffsets/store.go` (mirrors Lobby's `redisstate/streamoffsets`).
- For each adapter: store-level integration tests against testcontainers PostgreSQL or Redis. CAS semantics on `runtime_records.UpdateStatus` are verified by an explicit concurrent-update test (only one of two callers wins).
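The CAS rule that concurrent-update test verifies can be sketched in memory as below; the real store expresses the same guard as a `WHERE current_container_id = $n` clause on the `UPDATE` and checks rows affected. The types here are illustrative, including the assumption that a winning update releases the container id:

```go
package main

import "sync"

type record struct {
	status      string
	containerID string
}

type store struct {
	mu   sync.Mutex
	recs map[string]*record
}

// UpdateStatus succeeds only while the caller still holds the expected
// current_container_id, so of two racing callers exactly one wins.
func (s *store) UpdateStatus(gameID, expectContainerID, newStatus string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.recs[gameID]
	if !ok || r.containerID != expectContainerID {
		return false // CAS failed: another caller got there first
	}
	r.status = newStatus
	r.containerID = "" // illustrative: the container reference is released
	return true
}

func main() {
	s := &store{recs: map[string]*record{"g1": {status: "running", containerID: "c1"}}}
	first := s.UpdateStatus("g1", "c1", "stopped")
	second := s.UpdateStatus("g1", "c1", "stopped")
	if !first || second {
		panic("exactly one caller should win")
	}
}
```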
Files new:
- as above, plus per-package `_test.go` files.
Exit criteria:
- store tests pass on a CI runner with Docker available.
Stage 12. Docker adapter and external clients
Status: implemented. Decision record:
docs/stage12-docker-and-clients.md.
Goal:
- ship the Docker SDK adapter and the external HTTP clients for Lobby internal API and notification publishing.
Tasks:
- `internal/adapters/docker/client.go` — implements `ports.DockerClient` over `github.com/docker/docker/client`. Behaviour:
  - `EnsureNetwork` validates the configured network's presence (no creation).
  - `PullImage` honours the configured pull policy.
  - `Inspect` returns image and container metadata in domain-friendly shape.
  - `Run` builds the create + start sequence with labels, env (`GAME_STATE_PATH`, `STORAGE_PATH`), bind mount, log driver, and resource limits read from image labels with config fallback.
  - `Stop` calls `ContainerStop` with the configured timeout.
  - `Remove` calls `ContainerRemove`.
  - `List` filters by `label=com.galaxy.owner=rtmanager`.
  - `EventsListen` returns a typed channel of decoded events.
- `internal/adapters/docker/mocks/` — `mockgen`-generated mock for `ports.DockerClient`, used by service tests.
- `internal/adapters/lobbyclient/client.go` — REST client over an `otelhttp`-wrapped `http.Client` for `GET /api/v1/internal/games/{game_id}`. Returns `LobbyGameRecord`.
- `internal/adapters/notificationpublisher/publisher.go` — wraps `galaxy/notificationintent` plus `redis.XAdd` against `notification:intents`.
- Per-adapter unit tests with mocks. A small testcontainers Docker smoke test guarded by the build tag `rtmanager_docker_smoke` until Stage 19 promotes it to default.
Files new:
- as above.
Exit criteria:
- mocks regenerate cleanly via `go generate`.
- unit tests pass.
- the smoke test passes on a runner with Docker available.
Stage 13. Service: start
Status: implemented. Decision record:
docs/stage13-start-service.md.
Goal:
- the end-to-end `start` operation in the service layer, callable from both the async consumer and the REST handler in later stages.
Tasks:
- `internal/service/startruntime/service.go` orchestrator:
  - Acquire the game-id lease (Redis).
  - Read `runtime_records`. If `running` with the same `image_ref`, return idempotent success with `error_code=replay_no_op`.
  - Optionally fetch `LobbyGameRecord` for ancillary fields; in v1 only `image_ref` is required, so this fetch is a no-op except for diagnostics.
  - Pull the image (per policy), inspect labels for resource limits.
  - Ensure the per-game state directory exists with the configured mode and ownership.
  - `docker run` with the configured network, hostname, labels, env, bind mount, log driver, and resource limits.
  - Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`, `current_image_ref`, `started_at`, `last_op_at`).
  - Append an `operation_log` entry (`op_kind=start`, `outcome=success`, `op_source` from the caller).
  - Publish the `runtime:health_events` `container_started` event.
  - Return the success outcome to the caller (the consumer publishes `job_result`; REST returns 200).
- Failure paths follow the table in `README.md` §Lifecycles → Start. Each failure path:
  - rolls back any partially created Docker resource;
  - publishes the matching admin-only notification intent;
  - records `operation_log` with `outcome=failure` and the stable error code;
  - returns failure to the caller.
- Unit tests cover the happy path, idempotent re-start, each failure mode, lease conflict, and partial-rollback paths.
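The replay guard at the top of the orchestrator can be sketched as below; the `startDecision` helper and its string results are illustrative stand-ins for the real service's types:

```go
package main

// RuntimeRecord is a cut-down stand-in for the domain record; only the
// fields the replay guard consults are shown.
type RuntimeRecord struct {
	Status   string
	ImageRef string
}

// startDecision returns "replay_no_op" for an idempotent duplicate start
// (already running with the same image_ref) and "proceed" otherwise — no
// record, a stopped record, or a different image.
func startDecision(rec *RuntimeRecord, imageRef string) string {
	if rec != nil && rec.Status == "running" && rec.ImageRef == imageRef {
		return "replay_no_op"
	}
	return "proceed"
}

func main() {
	running := &RuntimeRecord{Status: "running", ImageRef: "galaxy/game:1.0.0"}
	if startDecision(running, "galaxy/game:1.0.0") != "replay_no_op" {
		panic("duplicate start must be a no-op")
	}
}
```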
Files new:
service/startruntime/{service.go, service_test.go, errors.go}.
Exit criteria:
- service-level tests pass.
Stage 14. Service: stop, restart, patch, cleanup
Status: implemented. Decision record:
docs/stage14-stop-restart-patch-cleanup.md.
Goal:
- the remaining four lifecycle operations, sharing helpers with
start.
Tasks:
- `internal/service/stopruntime/service.go` — graceful `docker stop` with timeout; records the `stopped` state. Idempotent re-stop returns a success no-op.
- `internal/service/restartruntime/service.go` — orchestrates `stopruntime` then `startruntime` with the current `image_ref`. The same Redis lease is shared across both inner operations. Records a single `operation_log` entry with `op_kind=restart` plus a correlation id linking it to the implicit start/stop entries.
- `internal/service/patchruntime/service.go` — restart with a new `image_ref`. Validates the semver-patch-only rule (major and minor must equal the current version; otherwise returns the `semver_patch_only` failure). If the engine version is not parseable as semver, returns `image_ref_not_semver`.
- `internal/service/cleanupcontainer/service.go` — `docker rm` for an already-stopped container; refuses if `status=running`. Sets `runtime_records.status=removed`.
- The Redis lease covers each operation end-to-end; restart and patch hold the lease across the inner stop+start to prevent races.
- Unit tests for each service. Cross-operation race tests assert that concurrent start vs. stop on the same `game_id` either succeed in some order or both observe the lease and one returns conflict.
Files new:
service/{stopruntime, restartruntime, patchruntime, cleanupcontainer}/....
Exit criteria:
- service-level tests pass.
Stage 15. Async consumers and runtime:job_results publisher
Status: implemented. Decision record:
docs/stage15-async-consumers.md.
Goal:
- wire the Lobby-side stream contract into the freshly built service layer.
Tasks:
- `internal/worker/startjobsconsumer/consumer.go` — XREAD over `runtime:start_jobs`; decodes the envelope `{game_id, image_ref, requested_at_ms}`, calls the `startruntime` service, publishes `runtime:job_results` with the canonical schema, advances the Redis offset. Mirrors patterns from `lobby/internal/worker/runtimejobresult/consumer.go`.
- `internal/worker/stopjobsconsumer/consumer.go` — XREAD over `runtime:stop_jobs`; decodes `{game_id, reason, requested_at_ms}`, calls `stopruntime`.
- `internal/adapters/jobresultspublisher/publisher.go` — small XADD wrapper for `runtime:job_results`.
- Replay safety: deterministic «already running» / «already stopped» idempotent outcomes surface as `outcome=success` with `error_code=replay_no_op`.
- Tests use `miniredis` and a fake `ports.DockerClient`. A consumer integration test drives a full Lobby → RTM → Lobby roundtrip end-to-end.
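Envelope decoding inside the consumer can be sketched as below; Redis stream entries surface field values as strings, so every required field is checked before the timestamp is parsed. The `decodeStartJob` helper is an illustrative stand-in, not the consumer's actual code:

```go
package main

import (
	"fmt"
	"strconv"
)

// StartJob mirrors the runtime:start_jobs envelope from the contract.
type StartJob struct {
	GameID        string
	ImageRef      string
	RequestedAtMS int64
}

func decodeStartJob(fields map[string]string) (StartJob, error) {
	var job StartJob
	for _, key := range []string{"game_id", "image_ref", "requested_at_ms"} {
		if fields[key] == "" {
			return job, fmt.Errorf("missing required field %q", key)
		}
	}
	ms, err := strconv.ParseInt(fields["requested_at_ms"], 10, 64)
	if err != nil {
		return job, fmt.Errorf("requested_at_ms not an integer: %w", err)
	}
	return StartJob{
		GameID:        fields["game_id"],
		ImageRef:      fields["image_ref"],
		RequestedAtMS: ms,
	}, nil
}

func main() {
	job, err := decodeStartJob(map[string]string{
		"game_id": "g1", "image_ref": "galaxy/game:1.0.0", "requested_at_ms": "1700000000000",
	})
	if err != nil || job.RequestedAtMS != 1700000000000 {
		panic("decode failed")
	}
}
```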
Files new:
- as above + tests.
Exit criteria:
- consumer integration test passes.
Stage 16. Internal REST handlers
Status: implemented. Decision record:
docs/stage16-internal-rest-handlers.md.
Goal:
- ship the GM/Admin-facing REST surface backed by the service layer.
Tasks:
- `internal/api/internalhttp/handlers/{list, get, start, stop, restart, patch, cleanup}.go` — one file per operation, each delegating to the corresponding service. JSON in / JSON out. Unknown JSON fields are rejected with `invalid_request`.
- Error envelope identical to Lobby: `{ "error": { "code", "message" } }`. Stable codes: `invalid_request`, `not_found`, `conflict`, `service_unavailable`, `internal_error`, `image_ref_not_semver`, `semver_patch_only`, `image_pull_failed`, `container_start_failed`, `start_config_invalid`, `docker_unavailable`.
- Wiring under the existing internal HTTP listener; route registration in `internal/app/wiring.go`.
- Handler-level table-driven tests; an OpenAPI conformance test that loads `api/internal-openapi.yaml` and asserts every defined operation is reachable and matches its declared response.
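Unknown-field rejection can be sketched with the standard library's `json.Decoder.DisallowUnknownFields`; the `patchRequest` type and `decodeStrict` helper are illustrative, and the handler would map the returned error to the `invalid_request` envelope:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// patchRequest is an illustrative request schema with a single known field.
type patchRequest struct {
	ImageRef string `json:"image_ref"`
}

// decodeStrict fails on any JSON field the destination type does not declare.
func decodeStrict(body []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	if err := dec.Decode(dst); err != nil {
		return fmt.Errorf("invalid_request: %w", err)
	}
	return nil
}

func main() {
	var req patchRequest
	if err := decodeStrict([]byte(`{"image_ref":"galaxy/game:1.0.1"}`), &req); err != nil {
		panic(err)
	}
	if err := decodeStrict([]byte(`{"image_ref":"x","bogus":true}`), &req); err == nil {
		panic("unknown field must be rejected")
	}
}
```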
Files new:
- handlers + tests.
Exit criteria:
- OpenAPI conformance test passes for every endpoint.
- Handlers reject unknown JSON fields.
Stage 17. Health monitoring
Status: implemented. Decision record:
docs/stage17-health-monitoring.md.
Goal:
- observability of running containers via the three sources from `README.md` §Health Monitoring.
Tasks:
- `internal/worker/dockerevents/listener.go` — subscribes to Docker events with the `com.galaxy.owner=rtmanager` label filter, looks up `runtime_records` by labels, emits `runtime:health_events` for `container_exited`, `container_oom`, `container_disappeared`. `container_started` is emitted directly by the start service (Stage 13) when it runs the container.
- `internal/worker/healthprobe/worker.go` — periodic worker iterating `runtime_records` rows with `status=running`. Calls `GET {engine_endpoint}/healthz` with the configured timeout, applies the `RTMANAGER_PROBE_FAILURES_THRESHOLD` hysteresis, emits `probe_failed` / `probe_recovered`. Uses an `otelhttp` client.
- `internal/worker/dockerinspect/worker.go` — periodic full inspect; emits `inspect_unhealthy` on observed `RestartCount` growth or unexpected status.
- `internal/adapters/healtheventspublisher/publisher.go` — XADD wrapper for `runtime:health_events`. Always also upserts the latest snapshot into `health_snapshots`.
Files new:
- as above + tests.
Exit criteria:
- worker tests use a Docker mock that programmatically emits events and asserts the published stream entries match the AsyncAPI spec.
Stage 18. Reconciler and container cleanup
Status: implemented. Decision record:
docs/stage18-reconcile-and-cleanup.md.
Goal:
- drift management and TTL-based cleanup.
Tasks:
- `internal/worker/reconcile/reconciler.go` — runs at startup (blocking before workers start) and periodically (`RTMANAGER_RECONCILE_INTERVAL`). Implements the rules from `README.md` §Reconciliation:
  - record running containers without a PG record, never kill them (`op_kind=reconcile_adopt`);
  - mark `runtime_records.status=running` rows whose container is missing as `removed`, publish `container_disappeared` (`op_kind=reconcile_dispose`).
- `internal/worker/containercleanup/worker.go` — periodic worker (`RTMANAGER_CLEANUP_INTERVAL`) that lists `runtime_records` with `status=stopped` and `last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS`, and calls the `cleanupcontainer` service for each.
- Both workers are registered as `app.Component`s in `internal/app/wiring.go`.
Files new:
- as above + tests.
Exit criteria:
- reconciler test using mocked Docker proves both adopt and dispose paths.
- cleanup test proves TTL math with a fake clock.
Stage 19. Service-local integration suite
Status: implemented. Decision record:
docs/stage19-integration.md.
Goal:
- end-to-end suite running against testcontainers PostgreSQL + Redis + the real Docker
daemon, using the freshly built `galaxy/game` test image.
Tasks:
- `rtmanager/integration/harness/` — set up PostgreSQL with goose-applied migrations; Redis (`miniredis` is sufficient for stream-only suites; testcontainers Redis for coordination suites that exercise leases); ensure the Docker bridge network exists; build the `galaxy/game` test image once per package run with `sync.Once`; tear everything down via `t.Cleanup`.
- `rtmanager/integration/lifecycle_test.go` — start → inspect → stop → restart → patch → cleanup against the real engine; assert each step's PG, Redis-stream, and Docker side effects. Engine state directories are created via `t.ArtifactDir()`.
- `rtmanager/integration/replay_test.go` — duplicate start/stop messages are no-ops with `error_code=replay_no_op`.
- `rtmanager/integration/health_test.go` — kill the engine container externally; assert a `container_disappeared` event publishes within the timeout. Bring it back with a manual `docker run`; assert the reconciler adopts it.
- `rtmanager/integration/notification_test.go` — drive a start with an unresolvable image ref; assert RTM publishes the `runtime.image_pull_failed` notification intent and a `failure` `job_result`.
Files new:
- as above.
Exit criteria:
- `go test ./rtmanager/integration/...` passes locally with Docker available.
- CI runs the suite under a profile that exposes the Docker socket.
Stage 20. Inter-service test: Lobby ↔ RTM
Status: implemented. Decision record:
docs/stage20-lobbyrtm.md.
Goal:
- satisfy the `TESTING.md` §7 inter-service requirement with real Lobby + real RTM.
Tasks:
- `integration/lobbyrtm/` (top-level integration directory, mirroring the existing `integration/notificationgateway`, etc.): runs real Lobby, real RTM, real PostgreSQL, real Redis, and the `galaxy/game` test engine container.
- Scenarios:
  - Lobby creates a game, publishes a start_job with `image_ref`, RTM starts the engine and publishes `job_result`, Lobby transitions the game to `running`. The engine answers `/healthz`.
  - Lobby transitions a game to `cancelled` and publishes a stop_job with `reason=cancelled`; RTM stops the engine. The RTM `operation_log` records the transition.
  - Failure path: `image_ref` points at a missing image. RTM publishes a `failure` `job_result` and the matching notification intent. Lobby transitions the game to `start_failed`.
Files new:
- as above.
Exit criteria:
- all scenarios pass in CI when the Docker socket is available.
Stage 21. Service-local docs
Status: implemented.
Goal:
- drop per-stage decisions captured during this plan into discoverable service-local
documentation, mirroring `lobby/docs/`.
Tasks:
- `docs/README.md` — index pointing at the four content docs and the postgres-migration record.
- `docs/runtime.md` — components, processes, and the in-memory state of each worker.
- `docs/flows.md` — mermaid diagrams for: start happy path, start failure (image pull), start failure (orphan), stop, restart, patch, cleanup TTL, reconcile drift adopt, health probe hysteresis.
- `docs/runbook.md` — operator scenarios: «engine container died», «patch upgrade», «manual cleanup», «reconcile drift after Docker daemon restart», «testing locally».
- `docs/examples.md` — env-var examples per environment (dev / test / prod skeletons), example payloads for each stream and each REST endpoint.
- `docs/postgres-migration.md` — decision record for the schema (mirrors the `notification/docs/postgres-migration.md` style).
Files new:
- all six.
Exit criteria:
- the RTM README links to `docs/README.md`.
- a reviewer can find any operational how-to within two clicks.
Stage 22. Migrate hand-rolled stubs to mockgen
Status: implemented. Decision record:
docs/stage22-stub-migration.md.
Goal:
- unify the test-double style across the repository on the `mockgen` pipeline introduced for the RTM Docker port in Stage 12. Today every Galaxy service except RTM hand-rolls `*stub` packages; mixing styles raises onboarding cost and makes port-signature drift easier to miss.
Tasks (high-level only — each package gets its own decision when this stage is opened):
- Replace the stubs under `lobby/internal/adapters/` with `mockgen`-generated mocks. Affected packages today (one per port): `runtimemanagerstub`, `intentpubstub`, `gmclientstub`, `userservicestub`, `gameturnstatsstub`, `streamoffsetstub`, `membershipstub`, `evaluationguardstub`, `streamlagprobestub`, `userlifecyclestub`, `invitestub`, `racenamestub`, `gapactivationstub`, `gamestub`, `applicationstub`.
- Add `//go:generate mockgen ...` directives next to each port declaration under `lobby/internal/ports/` and a `mocks` target to `lobby/Makefile`, mirroring the `rtmanager/Makefile` shape.
- Audit the rest of the workspace for similar hand-rolls before touching Lobby. Not every `*stub`-style package is in scope:
  - `mail/internal/adapters/stubprovider` is a production/local-mode provider, not a test fixture — keep it.
  - `authsession/internal/adapters/contracttest` is a port-conformance suite, not a stub — keep it.
  - `authsession/internal/adapters/local` is local-mode runtime — keep it.
- Documentation sweep — these documents reference the hand-rolled convention and must be updated alongside the code:
  - `rtmanager/docs/stage12-docker-and-clients.md` §1 currently frames `mockgen` as a one-time deviation; rephrase it as the repo-wide convention.
  - `lobby/docs/` — any decision record that named a `*stub` package by path needs the new `mocks/` target referenced in its place.
  - Top-level `AGENTS.md` and any service-level `CLAUDE.md` / `README.md` touching test conventions.
- Cross-cutting test impact: each stub today often carries hand-curated helper methods (e.g. seeded fixtures, deterministic ID generators) that pure `mockgen` mocks do not provide. Where a stub is more than a method table, the migration extracts the helpers into a small test-data builder and keeps the mock as the port surface.
Files new:
- one `mocks/` directory under each affected adapter group, plus a `lobby/Makefile` `mocks` target (and equivalents for any other service the audit identifies).
Files touched:
- every `*stub` package listed above plus its consumers.
- `lobby/Makefile`, `lobby/internal/ports/*.go` (for `//go:generate` directives).
- the documentation listed above.
Exit criteria:
- `*stub` packages are gone from `lobby/internal/adapters/` and the `mocks/` packages compile against the current ports.
- `make -C lobby mocks` regenerates with no diff after a clean run.
- `go test ./lobby/...` is green.
- Documentation across `rtmanager/docs/`, `lobby/docs/`, the top-level `AGENTS.md`, and any affected `README.md` references the unified convention.
Final Acceptance Criteria
- `go build ./...` from the repository root succeeds.
- `go test ./...` from the repository root passes.
- `go test -tags=integration ./rtmanager/integration/...` passes when Docker is available.
- `go test ./integration/lobbyrtm/...` passes when Docker is available.
- `make -C rtmanager jet` regenerates jet code with no diff after a clean run.
- Manual smoke: bring Lobby + RTM + the rest of the stack up via the existing dev compose; create a game; observe a real `galaxy-game-{game_id}` container; `curl http://galaxy-game-{game_id}:8080/healthz` returns 200; stop the game; the container moves to `exited`; the admin cleanup endpoint removes it.
- Documentation across `ARCHITECTURE.md`, `lobby`, `notification`, `game`, and `rtmanager` is internally consistent.
Out of Scope
- Multi-instance Runtime Manager with Redis Streams consumer groups (`XREADGROUP` / `XCLAIM`).
- Engine version registry inside Game Master. Producer-supplied `image_ref` decouples this work from RTM.
- TLS / mTLS on the internal listener.
- Engine in-place upgrades driven by an engine API. Patch is always recreate.
- Backup, archival, or cleanup of host state directories.
- Kubernetes, Docker Swarm, or any non-Docker orchestrator.
- Consumption of `runtime:health_events` by Game Master, Game Lobby, or Notification Service. Those are next-stage concerns of those services.
Risks and Notes
- CI must expose a Docker socket (or run rootless equivalent) to execute the integration suites. Without Docker the integration tests are skipped through a build-tag guard.
- The `reason` enum on `runtime:stop_jobs` is fixed in this plan (`{orphan_cleanup, cancelled, finished, admin_request, timeout}`). Adding a new value requires a contract bump in `runtime-jobs-asyncapi.yaml` and a Lobby publisher change. Keep the enum small.
- Lobby's existing `runtimejobresult` worker only reacts to start outcomes today. Stop outcomes are observable in the RTM `operation_log`, but Lobby does not yet update game status from them. Adding a stop-result consumer to Lobby is a future Lobby stage and is explicitly out of scope here.
- The pre-launch single-init policy applies to RTM exactly as documented in `ARCHITECTURE.md` §Persistence Backends: the schema evolves by editing `00001_init.sql` until the first production deploy.