# Runtime Manager Implementation Plan
This plan has already been implemented and remains here for historical reasons.
It should NOT be treated as a source of truth for service functionality.
## Summary
This plan delivers `Runtime Manager` (RTM), the only Galaxy service with direct Docker access.
It owns container lifecycle (start, stop, restart, patch, cleanup), three-source health
monitoring, and a synchronous internal REST surface used by `Game Master` and `Admin Service`.
`Game Lobby` continues to drive RTM asynchronously through Redis Streams.
The plan also delivers the upstream changes that RTM depends on: a new `image_ref` field in
the start envelope and a `reason` field in the stop envelope produced by Lobby; a `/healthz`
endpoint, `Dockerfile`, and `STORAGE_PATH` / `GAME_STATE_PATH` contract on `galaxy/game`; new
admin-only notification types in the catalog plus matching constructors in
`galaxy/notificationintent`.
The architectural rules behind every decision are recorded in
[`./README.md`](./README.md). This file describes the order in which the implementation
lands.
## Global Rules
- Documentation always lands before contracts; contracts before code.
- Each stage leaves the repository in a buildable, test-green state. No stage relies on a
later stage to fix a regression it introduced.
- Existing-service refactors (Lobby publisher, Notification catalog, Game engine) are
full-fledged stages of this plan; they precede every RTM stage that depends on them.
- RTM never resolves engine versions. Producer supplies `image_ref`. RTM never deletes the
host state directory. RTM never kills containers it does not own a record for.
- Every functional change ships its tests in the same stage. Contract tests freeze
operation IDs and stream message names from Stage 04 onward.
- All code, docs, and identifiers are written in English.
## Suggested Module Structure
```text
rtmanager/
├── cmd/
│   ├── rtmanager/
│   │   └── main.go
│   └── jetgen/
│       └── main.go
├── internal/
│   ├── app/
│   │   ├── app.go
│   │   ├── runtime.go
│   │   ├── wiring.go
│   │   └── bootstrap.go
│   │
│   ├── config/
│   │   ├── config.go
│   │   ├── env.go
│   │   └── validation.go
│   │
│   ├── logging/
│   │   ├── logger.go
│   │   └── context.go
│   │
│   ├── telemetry/
│   │   └── runtime.go
│   │
│   ├── domain/
│   │   ├── runtime/
│   │   │   ├── model.go
│   │   │   └── transitions.go
│   │   ├── operation/
│   │   │   └── log.go
│   │   └── health/
│   │       └── snapshot.go
│   │
│   ├── ports/
│   │   ├── runtimerecordstore.go
│   │   ├── operationlogstore.go
│   │   ├── healthsnapshotstore.go
│   │   ├── streamoffsetstore.go
│   │   ├── dockerclient.go
│   │   ├── lobbyinternal.go
│   │   └── notificationintents.go
│   │
│   ├── adapters/
│   │   ├── postgres/
│   │   │   ├── migrations/
│   │   │   ├── jet/
│   │   │   ├── runtimerecordstore/
│   │   │   ├── operationlogstore/
│   │   │   └── healthsnapshotstore/
│   │   ├── redisstate/
│   │   │   └── streamoffsets/
│   │   ├── docker/
│   │   │   ├── client.go
│   │   │   └── mocks/
│   │   ├── lobbyclient/
│   │   ├── notificationpublisher/
│   │   ├── jobresultspublisher/
│   │   └── healtheventspublisher/
│   │
│   ├── service/
│   │   ├── startruntime/
│   │   ├── stopruntime/
│   │   ├── restartruntime/
│   │   ├── patchruntime/
│   │   └── cleanupcontainer/
│   │
│   ├── worker/
│   │   ├── startjobsconsumer/
│   │   ├── stopjobsconsumer/
│   │   ├── dockerevents/
│   │   ├── healthprobe/
│   │   ├── dockerinspect/
│   │   ├── reconcile/
│   │   └── containercleanup/
│   │
│   └── api/
│       └── internalhttp/
│           ├── server.go
│           └── handlers/
├── api/
│   ├── internal-openapi.yaml
│   ├── runtime-jobs-asyncapi.yaml
│   └── runtime-health-asyncapi.yaml
├── integration/
│   ├── harness/
│   ├── lifecycle_test.go
│   ├── replay_test.go
│   ├── health_test.go
│   └── notification_test.go
├── docs/
│   ├── README.md
│   ├── runtime.md
│   ├── flows.md
│   ├── runbook.md
│   ├── examples.md
│   └── postgres-migration.md
├── README.md
├── PLAN.md
├── Makefile
└── go.mod
```
## ~~Stage 01.~~ Update `ARCHITECTURE.md`
Status: implemented.
Goal:
- align the project-wide source of truth with every decision recorded in
[`./README.md`](./README.md) before any code change touches it.
Tasks:
- Expand `ARCHITECTURE.md` §9 (Runtime Manager) with subsections: container model
(`galaxy-game-{game_id}` DNS naming, bind-mount ABI, network prerequisite), image policy
(producer-supplied `image_ref`), state ownership rule (RTM never deletes the host state
directory), reconcile policy (adopt unrecorded containers, never kill them).
- Update §«Fixed asynchronous interactions»: note the `image_ref` field on `Lobby → RTM`,
add the `runtime:health_events` outbound stream, add `Runtime Manager → Notification
Service` for admin alerts.
- Update §«Fixed synchronous interactions»: add `Game Master → Runtime Manager` and
`Admin Service → Runtime Manager` for REST inspect / restart / patch / stop / cleanup, and
remove the corresponding async entries.
- Update §«Persistence Backends»: add `rtmanager` schema to the schema-per-service list and
to PG-backed services.
- Update §«Configuration»: add `RTMANAGER` to the env-var prefix list with the same shape
rules as other PG/Redis-backed services.
- Update §«Recommended Order of Service Implementation» entry 7 with the now-fixed scope
(start, stop, restart, patch, inspect, health monitoring).
Files touched:
- `ARCHITECTURE.md`.
Exit criteria:
- every later RTM, Lobby, Notification, or Game stage can quote its rules from
`ARCHITECTURE.md` without re-deciding them.
## ~~Stage 02.~~ Freeze RTM `README.md`
Status: implemented as part of this planning task — see [`./README.md`](./README.md).
Goal:
- publish the complete service description so contracts and code can reference one source.
Tasks:
- Write `rtmanager/README.md` covering Purpose, Scope, Non-Goals, Position in the System,
Responsibility Boundaries, Container Model, Runtime Surface, Lifecycles, Health Monitoring,
Reconciliation, Trusted Surfaces, Async Stream Contracts, Notification Contracts,
Persistence Layout, Error Model, Configuration, Observability, Verification.
Exit criteria:
- a reviewer can answer any «what does RTM do when X» question by reading the README alone.
## ~~Stage 03.~~ Sync existing-service docs (Lobby, Notification, Game)
Status: implemented.
Goal:
- bring the READMEs of every touched service into agreement with the RTM contract before any
code in those services changes.
Tasks:
- `lobby/README.md`: update Game Start Flow — start envelope is now `{game_id, image_ref,
requested_at_ms}`. Add `LOBBY_ENGINE_IMAGE_TEMPLATE` to the Configuration section.
Document the new stop envelope `reason` enum
(`orphan_cleanup | cancelled | finished | admin_request | timeout`). Note that the
Lobby ↔ RTM transport stays asynchronous indefinitely.
- `lobby/PLAN.md`: append a single closing note that runtime-job envelope changes belong to
the Runtime Manager plan; no new stages added there.
- `notification/README.md`: add three admin notification types to the catalog
(`runtime.image_pull_failed`, `runtime.container_start_failed`,
`runtime.start_config_invalid`), each `email`-only with audience admin in v1.
- `notification/PLAN.md`: append a closing note pointing at the Runtime Manager plan for the
catalog extension.
- `game/README.md` (create if absent): document the new `/healthz` endpoint, the
`STORAGE_PATH` / `GAME_STATE_PATH` env contract, and the new `Dockerfile` location.
Files touched:
- `lobby/README.md`, `lobby/PLAN.md`, `notification/README.md`, `notification/PLAN.md`,
`game/README.md`.
Exit criteria:
- every doc in the repo agrees on the post-RTM contract; no contradiction remains between
any two READMEs.
## ~~Stage 04.~~ RTM contract files and contract tests
Status: implemented.
Goal:
- ship machine-readable contracts before any RTM handler is written, so the implementation
has a target spec.
Tasks:
- `rtmanager/api/internal-openapi.yaml`: every internal REST endpoint with request and
response schemas; error envelope `{ "error": { "code", "message" } }` identical to Lobby.
Operation IDs:
`internalListRuntimes`, `internalGetRuntime`, `internalStartRuntime`,
`internalStopRuntime`, `internalRestartRuntime`, `internalPatchRuntime`,
`internalCleanupRuntimeContainer`, `internalHealthz`, `internalReadyz`.
- `rtmanager/api/runtime-jobs-asyncapi.yaml`: AsyncAPI 2.6.0 spec for `runtime:start_jobs`,
`runtime:stop_jobs`, `runtime:job_results`. Frozen field set per-message.
- `rtmanager/api/runtime-health-asyncapi.yaml`: AsyncAPI 2.6.0 spec for
`runtime:health_events` with the `event_type` enum and `details` polymorphic schema
(`oneOf` per type).
- `rtmanager/contract_openapi_test.go` and `rtmanager/contract_asyncapi_test.go`: load specs
via `kin-openapi` (and the AsyncAPI loader pattern from `notification/contract_asyncapi_test.go`),
assert operation IDs / message names / field presence.
Files new:
- the four files above.
Exit criteria:
- all three specs validate; contract tests pass; tests fail loudly if any operation ID,
message name, or required field disappears.
## ~~Stage 05.~~ Game engine `/healthz`, `Dockerfile`, `STORAGE_PATH`
Status: implemented.
Goal:
- make `galaxy/game` runnable as the test engine image RTM uses in integration tests.
Tasks:
- Add `GET /healthz` to `game/internal/router` returning `{"status":"ok"}` (200) when the
engine process is up, irrespective of whether a game has been initialised. The existing
`/api/v1/status` keeps its current `501` behaviour for an uninitialised engine.
- Make engine read storage path from `STORAGE_PATH` env, falling back to `GAME_STATE_PATH`
when set. Both names are accepted; `GAME_STATE_PATH` is the contract RTM writes.
- Update `game/cmd/http/main.go` to bind the env.
- Add `galaxy/game/Dockerfile`: multi-stage (golang builder + small runtime base). Exposes
`:8080`. Default `STORAGE_PATH=/var/lib/galaxy-game`. Copies the binary. Runs as non-root
user.
- Add image labels to the `Dockerfile`: `com.galaxy.cpu_quota=1.0`, `com.galaxy.memory=512m`,
`com.galaxy.pids_limit=512`, `org.opencontainers.image.title=galaxy-game-engine`.
- Update `game/openapi.yaml` to document `/healthz`.
- Update `game/openapi_contract_test.go` to assert `/healthz` presence.
Files new:
- `galaxy/game/Dockerfile`.
Files touched:
- `galaxy/game/internal/router/*.go`, `galaxy/game/cmd/http/main.go`,
`galaxy/game/openapi.yaml`, `galaxy/game/openapi_contract_test.go`.
Exit criteria:
- `docker build -t galaxy/game:test -f game/Dockerfile .` (run from the workspace
root) succeeds. The build context is the workspace root because `game/` resolves
`galaxy/{model,error,util,...}` through `go.work` `replace` directives; see
`rtmanager/docs/game-dockerfile-build-context.md`.
- `docker run --rm -e STORAGE_PATH=/tmp/x -p 8080:8080 galaxy/game:test` answers
`/healthz` with `200`.
- `go test ./game/...` passes.
## ~~Stage 06.~~ Lobby publisher refactor
Status: implemented.
Goal:
- ship the new `runtime:start_jobs` and `runtime:stop_jobs` envelopes from Lobby. After this
stage Lobby is RTM-ready; the real RTM appears in Stage 13 onwards.
Tasks:
- Add `LOBBY_ENGINE_IMAGE_TEMPLATE` (default `galaxy/game:{engine_version}`) and validation
to `lobby/internal/config/config.go` and `env.go`.
- Build `lobby/internal/domain/engineimage/resolver.go` that turns
`(template, target_engine_version)` into `image_ref`, validating both inputs. Reject
templates without `{engine_version}`; reject empty engine versions.
- `lobby/internal/ports/runtimemanager.go`: change interface to
`PublishStartJob(ctx, gameID, imageRef string) error` and
`PublishStopJob(ctx, gameID string, reason StopReason) error` with a `StopReason` enum
(`orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`) declared in the
same package.
- `lobby/internal/adapters/runtimemanager/publisher.go`: write the new fields into the
`XADD` payload.
- Update callers:
- `lobby/internal/service/startgame/`: resolve `image_ref` from the loaded game record,
pass to `PublishStartJob`.
- `lobby/internal/worker/runtimejobresult/consumer.go`: pass
`reason=orphan_cleanup` to `PublishStopJob` from the orphan-container path.
- Update Lobby unit tests (publisher, services) and contract tests (if Lobby has any
describing the runtime envelopes; otherwise add `TestPublisherStartJobIncludesImageRef`
and `TestPublisherStopJobIncludesReason`).
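The resolver contract could look like this minimal sketch; the `resolveImageRef` signature is an assumption for illustration, with the real code landing in `lobby/internal/domain/engineimage/resolver.go`:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const placeholder = "{engine_version}"

// resolveImageRef expands an image template such as
// "galaxy/game:{engine_version}" with the game's target engine version.
// It rejects templates without the placeholder and empty versions.
func resolveImageRef(template, engineVersion string) (string, error) {
	if !strings.Contains(template, placeholder) {
		return "", errors.New("template missing {engine_version} placeholder")
	}
	if strings.TrimSpace(engineVersion) == "" {
		return "", errors.New("empty engine version")
	}
	return strings.ReplaceAll(template, placeholder, engineVersion), nil
}

func main() {
	ref, err := resolveImageRef("galaxy/game:{engine_version}", "1.4.2")
	fmt.Println(ref, err)
}
```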
Files new:
- `lobby/internal/domain/engineimage/resolver.go` and its test file.
Files touched:
- the Lobby files listed above.
Exit criteria:
- `go test ./lobby/...` passes.
- An `XADD` against the start stream contains the `image_ref` field; an `XADD` against the
stop stream contains the `reason` field.
## ~~Stage 07.~~ Notification intent constructors and catalog extension
Status: implemented.
Goal:
- expose three admin-only notification types so RTM (Stage 13 onwards) can publish them
without later cross-cutting refactors.
Tasks:
- Add constructors and payload structs to `galaxy/notificationintent/`:
- `NewRuntimeImagePullFailedIntent(meta, payload)`,
- `NewRuntimeContainerStartFailedIntent(meta, payload)`,
- `NewRuntimeStartConfigInvalidIntent(meta, payload)`.
Each payload includes `game_id`, `image_ref`, `error_code`, `error_message`,
`attempted_at_ms`.
- Extend `notification/api/intents-asyncapi.yaml` with the three new payload schemas and
add them to the catalog.
- Extend the notification routing tables (data only — no service code) so the existing
routing rules cover the new types: delivery decision `email`-only, audience admin.
- Extend `notification/contract_asyncapi_test.go` to freeze the new message names and
payload required fields.
Files touched:
- `galaxy/notificationintent/*.go`,
- `notification/api/intents-asyncapi.yaml`,
- notification catalog data tables (locations defined inside `notification/internal/...`),
- `notification/contract_asyncapi_test.go`.
Exit criteria:
- unit tests for the new constructors pass.
- AsyncAPI validates.
- Notification's existing integration suites still pass with the new types added.
## ~~Stage 08.~~ RTM module skeleton
Status: implemented.
Goal:
- create a buildable `rtmanager` binary that loads config, opens dependencies, and exits
cleanly on SIGTERM. It does no business work yet.
Tasks:
- `rtmanager/cmd/rtmanager/main.go` mirroring `lobby/cmd/lobby/main.go`.
- `rtmanager/internal/config/{config.go, env.go, validation.go}` with env prefix `RTMANAGER`
and groups Listener, Docker, Postgres, Redis, Streams, Container defaults, Health,
Cleanup, Coordination, Lobby internal client, Logging, Lifecycle, Telemetry. Required
variables fail-fast.
- `rtmanager/internal/logging/{logger.go, context.go}` copied from lobby/notification.
- `rtmanager/internal/telemetry/runtime.go` registering the metrics named in
`README.md §Observability`.
- `rtmanager/internal/app/{runtime.go, app.go, wiring.go, bootstrap.go}` — empty wiring with
PostgreSQL open, Redis open, Docker client open (ping only), telemetry open, probe
listener open.
- `rtmanager/internal/api/internalhttp/server.go` — listener with `/healthz` and `/readyz`
only.
- `rtmanager/Makefile` with the `jet` target (real generation lands in Stage 09).
- `rtmanager/go.mod` and `go.sum` with dependencies: `github.com/docker/docker`,
`github.com/redis/go-redis/v9`, `github.com/jackc/pgx/v5`, `github.com/go-jet/jet/v2`,
`github.com/pressly/goose/v3`, `github.com/stretchr/testify`, the testcontainers modules
for postgres / redis / docker, and the OpenTelemetry stack identical to lobby.
- Update repo-level `go.work` to include `./rtmanager`.
Files new:
- the entire skeleton tree.
Exit criteria:
- `go build ./rtmanager/cmd/rtmanager` succeeds.
- Running with valid env brings `/healthz` and `/readyz` up.
- `SIGTERM` returns within `RTMANAGER_SHUTDOWN_TIMEOUT`.
## ~~Stage 09.~~ PostgreSQL schema, migrations, jet
Status: implemented.
Goal:
- finalise the persistence schema and the code-generation pipeline.
Tasks:
- `internal/adapters/postgres/migrations/00001_init.sql` — `CREATE SCHEMA IF NOT EXISTS
rtmanager;` plus the three tables and indexes from `README.md §Persistence Layout`.
- `internal/adapters/postgres/migrations/migrations.go` — `//go:embed *.sql` and `FS()`
exporter, identical pattern to lobby.
- `cmd/jetgen/main.go` — testcontainers PostgreSQL + goose up + jet generation against the
resulting database. Mirrors `lobby/cmd/jetgen/main.go`.
- Generated `internal/adapters/postgres/jet/...` committed to the repo.
- Wire goose migrations into `internal/app/runtime.go` startup so they apply before any
listener opens; non-zero exit on failure (matches `pkg/postgres` policy).
Files new:
- as above.
Exit criteria:
- `make -C rtmanager jet` regenerates the jet code with no diff after a clean run.
- Service start applies migrations to a fresh database and exits zero if migrations are
already applied.
## ~~Stage 10.~~ Domain layer and ports
Status: implemented.
Goal:
- lock the in-memory domain model and the port interfaces for adapters.
Tasks:
- `internal/domain/runtime/model.go` — `RuntimeRecord` struct, status enum
(`StatusRunning`, `StatusStopped`, `StatusRemoved`), error sentinels.
- `internal/domain/runtime/transitions.go` — allowed transitions table and a CAS-friendly
validator.
- `internal/domain/operation/log.go` — `OpKind`, `OpSource`, `Outcome` enums plus the
`OperationEntry` struct.
- `internal/domain/health/snapshot.go` — `HealthEventType` enum, `HealthSnapshot` struct.
- `internal/ports/`:
- `runtimerecordstore.go` — `Get`, `Upsert`, `UpdateStatus` (CAS by
`current_container_id`), `ListByStatus`.
- `operationlogstore.go` — `Append`, `ListByGame`.
- `healthsnapshotstore.go` — `Upsert`, `Get`.
- `streamoffsetstore.go` — `Load`, `Save` (Redis offset persistence per consumer label).
- `dockerclient.go` — narrow surface RTM uses: `EnsureNetwork`, `PullImage`, `Inspect`,
`Run`, `Stop`, `Remove`, `List`, `EventsListen`. (`Logs` reserved; not in v1.)
- `lobbyinternal.go` — `GetGame(ctx, gameID) (LobbyGameRecord, error)`.
- `notificationintents.go` — `Publish(ctx, intent) error`.
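A minimal sketch of the transitions table and its validator follows. The exact allowed set shown here (for example, treating `removed` as terminal) is an assumption for illustration; `transitions.go` holds the authoritative table:

```go
package main

import "fmt"

type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// allowed is an assumed transitions table: running can stop, stopped can
// restart or be cleaned up, removed is terminal.
var allowed = map[Status][]Status{
	StatusRunning: {StatusStopped},
	StatusStopped: {StatusRunning, StatusRemoved},
}

// canTransition is the CAS-friendly validator: callers check the
// transition first, then update the record conditioned on the old state.
func canTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusRunning, StatusStopped))
}
```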
Files new:
- as above.
Exit criteria:
- the package compiles.
- every interface has a `_ ports.X = (*Y)(nil)` assertion slot ready for the adapters that
follow.
## ~~Stage 11.~~ Persistence adapters
Status: implemented. Decision record:
[`docs/stage11-persistence-adapters.md`](docs/stage11-persistence-adapters.md).
Goal:
- implement the three PostgreSQL stores and the Redis offset store.
Tasks:
- `internal/adapters/postgres/runtimerecordstore/store.go` using jet.
- `internal/adapters/postgres/operationlogstore/store.go`.
- `internal/adapters/postgres/healthsnapshotstore/store.go`.
- `internal/adapters/redisstate/streamoffsets/store.go` (mirror Lobby's
`redisstate/streamoffsets`).
- For each adapter: store-level integration tests against testcontainers PostgreSQL or
Redis. CAS semantics on `runtime_records.UpdateStatus` are verified by an explicit
concurrent-update test (only one of two callers wins).
Files new:
- as above and per-package `_test.go`.
Exit criteria:
- store tests pass on a CI runner with Docker available.
## ~~Stage 12.~~ Docker adapter and external clients
Status: implemented. Decision record:
[`docs/stage12-docker-and-clients.md`](docs/stage12-docker-and-clients.md).
Goal:
- ship the Docker SDK adapter and the external HTTP clients for Lobby internal API and
notification publishing.
Tasks:
- `internal/adapters/docker/client.go` — implements `ports.DockerClient` over
`github.com/docker/docker/client`. Behaviour:
- `EnsureNetwork` validates the configured network's presence (no creation).
- `PullImage` honours the configured pull policy.
- `Inspect` returns image and container metadata in domain-friendly shape.
- `Run` builds the create + start sequence with labels, env (`GAME_STATE_PATH`,
`STORAGE_PATH`), bind mount, log driver, resource limits read from image labels with
config fallback.
- `Stop` calls `ContainerStop` with the configured timeout.
- `Remove` calls `ContainerRemove`.
- `List` filters by `label=com.galaxy.owner=rtmanager`.
- `EventsListen` returns a typed channel of decoded events.
- `internal/adapters/docker/mocks/` — `mockgen`-generated mock for `ports.DockerClient`,
used by service tests.
- `internal/adapters/lobbyclient/client.go` — REST client over an `otelhttp`-wrapped
`http.Client` for `GET /api/v1/internal/games/{game_id}`. Returns `LobbyGameRecord`.
- `internal/adapters/notificationpublisher/publisher.go` — wraps
`galaxy/notificationintent` plus `redis.XAdd` against `notification:intents`.
- Per-adapter unit tests with mocks. A small testcontainers Docker smoke test guarded by
build tag `rtmanager_docker_smoke` until Stage 19 promotes it to default.
Files new:
- as above.
Exit criteria:
- mocks regenerate cleanly via `go generate`.
- unit tests pass.
- the smoke test passes on a runner with Docker available.
## ~~Stage 13.~~ Service: start
Status: implemented. Decision record:
[`docs/stage13-start-service.md`](docs/stage13-start-service.md).
Goal:
- end-to-end `start` operation in the service layer, callable from both the async consumer
and the REST handler in later stages.
Tasks:
- `internal/service/startruntime/service.go` orchestrator:
1. Acquire game-id lease (Redis).
2. Read `runtime_records`. If `running` with same `image_ref`, return idempotent success
with `error_code=replay_no_op`.
3. Optionally fetch `LobbyGameRecord` for ancillary fields; in v1 only `image_ref` is
required, so this fetch is a no-op except for diagnostics.
4. Pull image (per policy), inspect labels for resource limits.
5. Ensure the per-game state directory exists with the configured mode and ownership.
6. `docker run` with the configured network, hostname, labels, env, bind mount, log
driver, resource limits.
7. Upsert `runtime_records` (`status=running`, `current_container_id`, `engine_endpoint`,
`current_image_ref`, `started_at`, `last_op_at`).
8. Append `operation_log` entry (`op_kind=start`, `outcome=success`, `op_source` from
caller).
9. Publish `runtime:health_events` `container_started`.
10. Return success outcome to caller (consumer publishes `job_result`, REST returns 200).
- Failure paths in the table from `README.md §Lifecycles → Start`. Each failure path:
- rolls back any partially created Docker resource;
- publishes the matching admin-only notification intent;
- records `operation_log` with `outcome=failure` and the stable error code;
- returns failure to the caller.
- Unit tests cover happy path, idempotent re-start, each failure mode, lease conflict, and
partial-rollback paths.
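Step 2's idempotency rule can be illustrated in isolation; `isReplayNoOp` is a hypothetical helper name, not part of the actual service:

```go
package main

import "fmt"

// RuntimeRecord carries the two fields the replay check needs; the real
// domain struct has more.
type RuntimeRecord struct {
	Status          string
	CurrentImageRef string
}

// isReplayNoOp implements the check from step 2: a start request for a
// game already running the same image succeeds without touching Docker,
// surfacing error_code=replay_no_op to the caller.
func isReplayNoOp(rec *RuntimeRecord, imageRef string) bool {
	return rec != nil && rec.Status == "running" && rec.CurrentImageRef == imageRef
}

func main() {
	rec := &RuntimeRecord{Status: "running", CurrentImageRef: "galaxy/game:1.4.2"}
	fmt.Println(isReplayNoOp(rec, "galaxy/game:1.4.2"))
}
```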
Files new:
- `service/startruntime/{service.go, service_test.go, errors.go}`.
Exit criteria:
- service-level tests pass.
## ~~Stage 14.~~ Service: stop, restart, patch, cleanup
Status: implemented. Decision record:
[`docs/stage14-stop-restart-patch-cleanup.md`](docs/stage14-stop-restart-patch-cleanup.md).
Goal:
- the remaining four lifecycle operations, sharing helpers with `start`.
Tasks:
- `internal/service/stopruntime/service.go` — graceful `docker stop` with timeout, record
`stopped` state. Idempotent re-stop returns success no-op.
- `internal/service/restartruntime/service.go` — orchestrate `stopruntime` then
`startruntime` with the current `image_ref`. Same Redis lease shared across both inner
operations. Records a single `operation_log` entry with `op_kind=restart` plus a
correlation id linking it to the implicit start/stop entries.
- `internal/service/patchruntime/service.go` — restart with a new `image_ref`. Validates the
semver-patch-only rule (major and minor must equal current version; otherwise return
`semver_patch_only` failure). If the engine version is not parseable as semver, return
`image_ref_not_semver`.
- `internal/service/cleanupcontainer/service.go` — `docker rm` for an already-stopped
container; refuses if `status=running`. Sets `runtime_records.status=removed`.
- The Redis lease covers each operation end-to-end; restart and patch hold the lease across
the inner stop+start to prevent races.
- Unit tests for each service. Cross-operation race tests assert that concurrent start vs.
stop on the same `game_id` either succeed in some order or both observe the lease and
one returns conflict.
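The semver-patch-only rule could be sketched as follows, assuming the engine version is the image tag in `major.minor.patch` form; both function names are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseSemver extracts major/minor/patch from a tag such as "1.4.2";
// any parse failure maps to the image_ref_not_semver error code.
func parseSemver(tag string) (major, minor, patch int, err error) {
	parts := strings.Split(tag, ".")
	if len(parts) != 3 {
		return 0, 0, 0, errors.New("image_ref_not_semver")
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return 0, 0, 0, errors.New("image_ref_not_semver")
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

// validatePatch enforces the patch-only rule: major and minor of the
// new version must equal the currently running version.
func validatePatch(current, next string) error {
	cMaj, cMin, _, err := parseSemver(current)
	if err != nil {
		return err
	}
	nMaj, nMin, _, err := parseSemver(next)
	if err != nil {
		return err
	}
	if cMaj != nMaj || cMin != nMin {
		return errors.New("semver_patch_only")
	}
	return nil
}

func main() {
	fmt.Println(validatePatch("1.4.2", "1.4.3")) // patch bump: allowed
}
```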
Files new:
- `service/{stopruntime, restartruntime, patchruntime, cleanupcontainer}/...`.
Exit criteria:
- service-level tests pass.
## ~~Stage 15.~~ Async consumers and `runtime:job_results` publisher
Status: implemented. Decision record:
[`docs/stage15-async-consumers.md`](docs/stage15-async-consumers.md).
Goal:
- wire the Lobby-side stream contract into the freshly built service layer.
Tasks:
- `internal/worker/startjobsconsumer/consumer.go` — XREAD over `runtime:start_jobs`,
decodes envelope `{game_id, image_ref, requested_at_ms}`, calls `startruntime` service,
publishes `runtime:job_results` with the canonical schema, advances the Redis offset.
Mirrors patterns from `lobby/internal/worker/runtimejobresult/consumer.go`.
- `internal/worker/stopjobsconsumer/consumer.go` — XREAD over `runtime:stop_jobs`, decodes
`{game_id, reason, requested_at_ms}`, calls `stopruntime`.
- `internal/adapters/jobresultspublisher/publisher.go` — small XADD wrapper for
`runtime:job_results`.
- Replay safety: deterministic «already running» / «already stopped» idempotent outcomes
surface as `outcome=success` with `error_code=replay_no_op`.
- Tests use `miniredis` and a fake `ports.DockerClient`. A consumer integration test drives
a full Lobby → RTM → Lobby roundtrip end-to-end.
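The envelope validation the consumer performs before invoking the service layer might look like this stdlib-only sketch; `decodeStartJob` is a hypothetical name and the Redis transport is elided:

```go
package main

import (
	"fmt"
	"strconv"
)

// StartJob is the frozen start envelope decoded from a Redis stream
// entry's field map.
type StartJob struct {
	GameID        string
	ImageRef      string
	RequestedAtMs int64
}

// decodeStartJob validates the raw XREAD fields up front; a missing or
// malformed field fails the whole message.
func decodeStartJob(fields map[string]string) (StartJob, error) {
	var job StartJob
	job.GameID = fields["game_id"]
	job.ImageRef = fields["image_ref"]
	if job.GameID == "" || job.ImageRef == "" {
		return StartJob{}, fmt.Errorf("missing game_id or image_ref")
	}
	ms, err := strconv.ParseInt(fields["requested_at_ms"], 10, 64)
	if err != nil {
		return StartJob{}, fmt.Errorf("bad requested_at_ms: %w", err)
	}
	job.RequestedAtMs = ms
	return job, nil
}

func main() {
	job, err := decodeStartJob(map[string]string{
		"game_id":         "g-1",
		"image_ref":       "galaxy/game:1.4.2",
		"requested_at_ms": "1714300000000",
	})
	fmt.Println(job, err)
}
```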
Files new:
- as above + tests.
Exit criteria:
- consumer integration test passes.
## ~~Stage 16.~~ Internal REST handlers
Status: implemented. Decision record:
[`docs/stage16-internal-rest-handlers.md`](docs/stage16-internal-rest-handlers.md).
Goal:
- ship the GM/Admin-facing REST surface backed by the service layer.
Tasks:
- `internal/api/internalhttp/handlers/{list, get, start, stop, restart, patch, cleanup}.go`
— one file per operation, each delegating to the corresponding service. JSON in / JSON
out. Unknown JSON fields rejected with `invalid_request`.
- Error envelope identical to lobby: `{ "error": { "code", "message" } }`. Stable codes:
`invalid_request`, `not_found`, `conflict`, `service_unavailable`, `internal_error`,
`image_ref_not_semver`, `semver_patch_only`, `image_pull_failed`,
`container_start_failed`, `start_config_invalid`, `docker_unavailable`.
- Wiring under the existing internal HTTP listener; route registration in
`internal/app/wiring.go`.
- Handler-level table-driven tests; OpenAPI conformance test that loads
`api/internal-openapi.yaml` and asserts every defined operation is reachable and matches
its declared response.
Files new:
- handlers + tests.
Exit criteria:
- OpenAPI conformance test passes for every endpoint.
- Handlers reject unknown JSON fields.
## ~~Stage 17.~~ Health monitoring
Status: implemented. Decision record:
[`docs/stage17-health-monitoring.md`](docs/stage17-health-monitoring.md).
Goal:
- observability of running containers via the three sources from `README.md §Health
Monitoring`.
Tasks:
- `internal/worker/dockerevents/listener.go` — subscribes to Docker events with the
`com.galaxy.owner=rtmanager` label filter, looks up `runtime_records` by labels, emits
`runtime:health_events` for `container_exited`, `container_oom`,
`container_disappeared`. `container_started` is emitted directly by the start service
(Stage 13) when it runs the container.
- `internal/worker/healthprobe/worker.go` — periodic worker iterating
`runtime_records.status=running`. Calls `GET {engine_endpoint}/healthz` with the
configured timeout, applies the `RTMANAGER_PROBE_FAILURES_THRESHOLD` hysteresis, emits
`probe_failed` / `probe_recovered`. Uses `otelhttp` client.
- `internal/worker/dockerinspect/worker.go` — periodic full inspect; emits
`inspect_unhealthy` on observed `RestartCount` growth or unexpected status.
- `internal/adapters/healtheventspublisher/publisher.go` — XADD wrapper for
`runtime:health_events`. Always also upserts the latest snapshot into `health_snapshots`.
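The probe hysteresis can be captured by a small state machine; `probeTracker` is an illustrative name, and the exact reset semantics (a single success clears the failure counter) are an assumption:

```go
package main

import "fmt"

// probeTracker applies the RTMANAGER_PROBE_FAILURES_THRESHOLD
// hysteresis: probe_failed fires only after N consecutive failures,
// probe_recovered only on the first success while marked failed.
type probeTracker struct {
	threshold int
	failures  int
	failed    bool
}

// observe records one probe result and returns the event to emit,
// or "" when nothing should be published.
func (t *probeTracker) observe(ok bool) string {
	if ok {
		t.failures = 0
		if t.failed {
			t.failed = false
			return "probe_recovered"
		}
		return ""
	}
	t.failures++
	if !t.failed && t.failures >= t.threshold {
		t.failed = true
		return "probe_failed"
	}
	return ""
}

func main() {
	t := &probeTracker{threshold: 3}
	for _, ok := range []bool{false, false, false, true} {
		if ev := t.observe(ok); ev != "" {
			fmt.Println(ev)
		}
	}
}
```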
Files new:
- as above + tests.
Exit criteria:
- worker tests use a Docker mock that programmatically emits events and asserts the
published stream entries match the AsyncAPI spec.
## ~~Stage 18.~~ Reconciler and container cleanup
Status: implemented. Decision record:
[`docs/stage18-reconcile-and-cleanup.md`](docs/stage18-reconcile-and-cleanup.md).
Goal:
- drift management and TTL-based cleanup.
Tasks:
- `internal/worker/reconcile/reconciler.go` — runs at startup (blocking before workers
start) and periodically (`RTMANAGER_RECONCILE_INTERVAL`). Implements the rules from
`README.md §Reconciliation`:
- record running containers without a PG record, never kill them
(`op_kind=reconcile_adopt`);
- mark `runtime_records.status=running` rows whose container is missing as `removed`,
publish `container_disappeared` (`op_kind=reconcile_dispose`).
- `internal/worker/containercleanup/worker.go` — periodic worker
(`RTMANAGER_CLEANUP_INTERVAL`) that lists `runtime_records` with `status=stopped` and
`last_op_at < now - RTMANAGER_CONTAINER_RETENTION_DAYS`, calls
`cleanupcontainer` service for each.
- Both workers are registered as `app.Component`s in `internal/app/wiring.go`.
Files new:
- as above + tests.
Exit criteria:
- reconciler test using mocked Docker proves both adopt and dispose paths.
- cleanup test proves TTL math with a fake clock.
## ~~Stage 19.~~ Service-local integration suite
Status: implemented. Decision record:
[`docs/stage19-integration.md`](docs/stage19-integration.md).
Goal:
- end-to-end suite running against testcontainers PostgreSQL + Redis + the real Docker
daemon, using the freshly built `galaxy/game` test image.
Tasks:
- `rtmanager/integration/harness/` — set up PostgreSQL with goose-applied migrations;
Redis (miniredis is sufficient for stream-only suites; testcontainers Redis for
coordination suites that exercise leases); ensure the Docker bridge network exists; build
`galaxy/game` test image once per package run with `sync.Once`; tear everything down via
`t.Cleanup`.
- `rtmanager/integration/lifecycle_test.go` — start → inspect → stop → restart → patch →
cleanup against the real engine; assert each step's PG, Redis-stream, and Docker
side-effects. Engine state directories are created via `t.ArtifactDir()`.
- `rtmanager/integration/replay_test.go` — duplicate start/stop messages are no-ops with
`error_code=replay_no_op`.
- `rtmanager/integration/health_test.go` — kill the engine container externally; assert
`container_disappeared` event publishes within timeout. Bring it back with a manual
`docker run`; assert the reconciler adopts it.
- `rtmanager/integration/notification_test.go` — drive a start with an unresolvable image
ref; assert RTM publishes the `runtime.image_pull_failed` notification intent and a
`failure` job_result.
Files new:
- as above.
Exit criteria:
- `go test ./rtmanager/integration/...` passes locally with Docker available.
- CI runs the suite under a profile that exposes the Docker socket.
## ~~Stage 20.~~ Inter-service test: Lobby ↔ RTM
Status: implemented. Decision record:
[`docs/stage20-lobbyrtm.md`](docs/stage20-lobbyrtm.md).
Goal:
- satisfy the `TESTING.md §7` inter-service requirement with real Lobby + real RTM.
Tasks:
- `integration/lobbyrtm/` (top-level integration directory, mirroring existing
`integration/notificationgateway`, etc.): runs real Lobby, real RTM, real PostgreSQL,
real Redis, and the `galaxy/game` test engine container.
- Scenarios:
- Lobby creates a game, publishes a start_job with `image_ref`, RTM starts the engine,
publishes `job_result`, Lobby transitions the game to `running`. The engine answers
`/healthz`.
- Lobby transitions a game to `cancelled`, publishes `stop_job` with `reason=cancelled`,
RTM stops the engine. RTM `operation_log` records the transition.
- Failure path: `image_ref` points at a missing image. RTM publishes a `failure`
`job_result` and the matching notification intent. Lobby transitions the game to
`start_failed`.
Files new:
- as above.
Exit criteria:
- all scenarios pass in CI when the Docker socket is available.
## ~~Stage 21.~~ Service-local docs
Status: implemented.
Goal:
- drop per-stage decisions captured during this plan into discoverable service-local
documentation, mirroring `lobby/docs/`.
Tasks:
- `docs/README.md` — index pointing at the four content docs and the postgres-migration
record.
- `docs/runtime.md` — components, processes, in-memory state of each worker.
- `docs/flows.md` — mermaid diagrams for: start happy path, start failure (image pull),
start failure (orphan), stop, restart, patch, cleanup TTL, reconcile drift adopt, health
probe hysteresis.
- `docs/runbook.md` — operator scenarios: "engine container died", "patch upgrade",
  "manual cleanup", "reconcile drift after Docker daemon restart", "testing locally".

- `docs/examples.md` — env-var examples per environment (dev / test / prod skeletons),
example payloads for each stream and each REST endpoint.
- `docs/postgres-migration.md` — decision record for the schema (mirrors
`notification/docs/postgres-migration.md` style).
Files new:
- all six.
Exit criteria:
- the RTM README links to `docs/README.md`.
- a reviewer can find any operational how-to within two clicks.
## ~~Stage 22.~~ Migrate hand-rolled stubs to `mockgen`
Status: implemented. Decision record:
[`docs/stage22-stub-migration.md`](docs/stage22-stub-migration.md).
Goal:
- unify the test-double style across the repository on the `mockgen`
pipeline introduced for the RTM Docker port in Stage 12. Today every
Galaxy service except RTM hand-rolls `*stub` packages; mixing styles
raises onboarding cost and makes port-signature drift easier to miss.
Tasks (high-level only — each package gets its own decision when this
stage is opened):
- Replace the stubs under `lobby/internal/adapters/` with `mockgen`-generated
mocks. Affected packages today (one per port):
[`runtimemanagerstub`](../lobby/internal/adapters/runtimemanagerstub),
[`intentpubstub`](../lobby/internal/adapters/intentpubstub),
[`gmclientstub`](../lobby/internal/adapters/gmclientstub),
[`userservicestub`](../lobby/internal/adapters/userservicestub),
[`gameturnstatsstub`](../lobby/internal/adapters/gameturnstatsstub),
[`streamoffsetstub`](../lobby/internal/adapters/streamoffsetstub),
[`membershipstub`](../lobby/internal/adapters/membershipstub),
[`evaluationguardstub`](../lobby/internal/adapters/evaluationguardstub),
[`streamlagprobestub`](../lobby/internal/adapters/streamlagprobestub),
[`userlifecyclestub`](../lobby/internal/adapters/userlifecyclestub),
[`invitestub`](../lobby/internal/adapters/invitestub),
[`racenamestub`](../lobby/internal/adapters/racenamestub),
[`gapactivationstub`](../lobby/internal/adapters/gapactivationstub),
[`gamestub`](../lobby/internal/adapters/gamestub),
[`applicationstub`](../lobby/internal/adapters/applicationstub).
- Add `//go:generate mockgen ...` directives next to each port
declaration under [`lobby/internal/ports/`](../lobby/internal/ports)
and a `mocks` target to `lobby/Makefile`, mirroring the
[`rtmanager/Makefile`](./Makefile) shape.
- Audit the rest of the workspace for similar hand-rolls before touching
Lobby. Not every `*stub`-style package is in scope:
- [`mail/internal/adapters/stubprovider`](../mail/internal/adapters/stubprovider)
is a production/local-mode provider, not a test fixture — keep it.
- [`authsession/internal/adapters/contracttest`](../authsession/internal/adapters/contracttest)
is a port-conformance suite, not a stub — keep it.
- [`authsession/internal/adapters/local`](../authsession/internal/adapters/local)
is local-mode runtime — keep it.
- Documentation sweep — these documents reference the hand-rolled
convention and must be updated alongside the code:
- [`rtmanager/docs/stage12-docker-and-clients.md §1`](./docs/stage12-docker-and-clients.md)
currently frames `mockgen` as a one-time deviation; rephrase as the
repo-wide convention.
- [`lobby/docs/`](../lobby/docs/) — any decision record that named a
`*stub` package by path needs the new `mocks/` target referenced in
its place.
- Top-level [`AGENTS.md`](../AGENTS.md) and any service-level
`CLAUDE.md` / `README.md` touching test conventions.
- Cross-cutting test impact: each stub today often carries hand-curated
helper methods (e.g. seeded fixtures, deterministic ID generators)
that pure `mockgen` mocks do not provide. Where a stub is more than
a method table, the migration extracts the helper into a small
test-data builder and keeps the mock as the port surface.
Files new:
- one `mocks/` directory under each affected adapter group, plus a
`lobby/Makefile` `mocks` target (and equivalents for any other
service the audit identifies).
Files touched:
- every `*stub` package listed above plus its consumers.
- `lobby/Makefile`, `lobby/internal/ports/*.go` (for `//go:generate`
directives).
- the documentation listed above.
Exit criteria:
- `*stub` packages are gone from `lobby/internal/adapters/` and the
`mocks/` packages compile against the current ports.
- `make -C lobby mocks` regenerates with no diff after a clean run.
- `go test ./lobby/...` is green.
- Documentation across `rtmanager/docs/`, `lobby/docs/`, top-level
`AGENTS.md`, and any affected `README.md` references the unified
convention.
## Final Acceptance Criteria
- `go build ./...` from the repository root succeeds.
- `go test ./...` from the repository root passes.
- `go test -tags=integration ./rtmanager/integration/...` passes when Docker is available.
- `go test ./integration/lobbyrtm/...` passes when Docker is available.
- `make -C rtmanager jet` regenerates jet code with no diff after a clean run.
- Manual smoke: bring Lobby + RTM + the rest of the stack up via the existing dev compose;
create a game; observe a real `galaxy-game-{game_id}` container; `curl
http://galaxy-game-{game_id}:8080/healthz` returns `200`; stop the game; the container
moves to `exited`; the admin cleanup endpoint removes it.
- Documentation across `ARCHITECTURE.md`, `lobby`, `notification`, `game`, and `rtmanager`
is internally consistent.
## Out of Scope
- Multi-instance Runtime Manager with Redis Streams consumer groups (`XREADGROUP` /
`XCLAIM`).
- Engine version registry inside `Game Master`. Producer-supplied `image_ref` decouples
this work from RTM.
- TLS / mTLS on the internal listener.
- Engine in-place upgrades driven by an engine API. Patch is always recreate.
- Backup, archival, or cleanup of host state directories.
- Kubernetes, Docker Swarm, or any non-Docker orchestrator.
- Consumption of `runtime:health_events` by Game Master, Game Lobby, or Notification
Service. Those are next-stage concerns of those services.
## Risks and Notes
- CI must expose a Docker socket (or a rootless equivalent) to execute the integration
  suites. Without Docker the integration tests are compiled out through a build-tag guard.
- The `reason` enum on `runtime:stop_jobs` is fixed in this plan
(`{orphan_cleanup, cancelled, finished, admin_request, timeout}`). Adding a new value
requires a contract bump in `runtime-jobs-asyncapi.yaml` and a Lobby publisher change.
Keep the enum small.
- Lobby's existing `runtimejobresult` worker only reacts to start outcomes today. Stop
outcomes are observable in RTM `operation_log` but Lobby does not yet update game status
from them. Adding a stop-result consumer to Lobby is a future Lobby stage and is
explicitly out of scope here.
- Pre-launch single-init policy applies to RTM exactly as documented in
`ARCHITECTURE.md §Persistence Backends`: schema evolves by editing `00001_init.sql`
until first production deploy.