Lifecycle Services

This document explains the design of the five lifecycle services (startruntime, stopruntime, restartruntime, patchruntime, cleanupcontainer) under ../internal/service/ plus the per-handler REST glue under ../internal/api/internalhttp/.

The current-state behaviour (lifecycle steps, failure tables, the per-game lease semantics, the wire contracts) lives in ../README.md, the OpenAPI spec at ../api/internal-openapi.yaml, and the AsyncAPI spec at ../api/runtime-jobs-asyncapi.yaml. This file records the why.

1. Per-game lease lives at the service layer

Every lifecycle service acquires rtmanager:game_lease:{game_id} via ports.GameLeaseStore before doing any work, and releases it on the way out:

  • the lease primitive serialises operations on a single game across every entry point (stream consumers and REST handlers);
  • holding the lease at the service layer keeps the consumer / REST callers symmetric — neither acquires the lease itself, both call the service the same way;
  • the Redis-backed adapter (../internal/adapters/redisstate/gamelease/store.go) uses SET NX PX on acquire, Lua compare-and-delete on release; a release whose caller-supplied token no longer matches is a silent no-op.

The lease key shape is rtmanager:game_lease:{base64url(game_id)} so opaque game ids may contain any characters without leaking through the key syntax.
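A minimal sketch of the acquire / release pair under these rules, assuming the go-redis v9 client; the helper names, the unpadded encoding, and the free-function shape are illustrative rather than the adapter's real surface.

```go
package gamelease

import (
	"context"
	"encoding/base64"
	"time"

	"github.com/redis/go-redis/v9"
)

// releaseScript deletes the lease only when the stored token still matches the
// caller-supplied one (compare-and-delete); a stale token is a silent no-op.
var releaseScript = redis.NewScript(`
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
end
return 0
`)

// leaseKey encodes the opaque game id with base64url so any byte sequence is
// safe inside the key syntax.
func leaseKey(gameID string) string {
	return "rtmanager:game_lease:" + base64.RawURLEncoding.EncodeToString([]byte(gameID))
}

// Acquire reports whether SET NX PX installed the token for the TTL.
func Acquire(ctx context.Context, rdb *redis.Client, gameID, token string, ttl time.Duration) (bool, error) {
	return rdb.SetNX(ctx, leaseKey(gameID), token, ttl).Result()
}

// Release is best-effort: it succeeds silently when the token no longer matches.
func Release(ctx context.Context, rdb *redis.Client, gameID, token string) error {
	return releaseScript.Run(ctx, rdb, []string{leaseKey(gameID)}, token).Err()
}
```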

The lease TTL is RTMANAGER_GAME_LEASE_TTL_SECONDS (default 60s) and is not renewed mid-operation in v1. A multi-GB image pull can theoretically expire the lease before the start service finishes; operators see this as a reconcile_adopt event later because the container is created with the standard owner labels. A renewal helper is deliberately deferred until a workload makes it necessary.

The reconciler (workers.md §4) honours the same lease around every drift mutation, which closes the restart-vs-reconcile_dispose race documented in §6 below.

2. Health-events publisher lands with the start service

The start service publishes container_started after docker run returns; the events listener intentionally does not duplicate the event (workers.md §1). Centralising the publisher on the start service avoids a "who emits what" ambiguity and lets the publisher be a thin port wrapper rather than a worker-specific helper.

The publisher port lives next to the snapshot-upsert rule (adapters.md §8): one Publish call updates both surfaces.

3. Result-shaped contract

Service.Handle returns (Result, error). The Go-level error is reserved for system-level / programmer faults (nil context, nil service). All business outcomes flow through Result:

  • Outcome=success, ErrorCode="" — fresh start succeeded;
  • Outcome=success, ErrorCode="replay_no_op" — idempotent replay;
  • Outcome=failure, ErrorCode set — business failure (start_config_invalid / image_pull_failed / container_start_failed / conflict / service_unavailable / internal_error).

The stream consumer uses Outcome and ErrorCode to populate runtime:job_results directly; the REST handler maps Outcome=failure plus ErrorCode to the matching HTTP status. Both callers are simpler with this contract than with an errors.Is-driven sentinel taxonomy.

ports.JobResult and the two JobOutcome* string constants live in the ports package next to JobResultPublisher so the wire shape is defined exactly once. The constants are intentionally not aliases of operation.Outcome — the audit-log enum is allowed to grow without breaking the wire format.
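A sketch of the wire shape this implies; the field list and constant spellings are assumptions drawn from the usage described above, not the real ports declarations.

```go
package ports

// JobOutcome* are the two wire-level outcome strings. They are deliberately
// not aliases of the audit-log operation.Outcome enum, which may grow
// independently of the wire format.
const (
	JobOutcomeSuccess = "success"
	JobOutcomeFailure = "failure"
)

// JobResult is defined once, next to JobResultPublisher, so the stream
// consumer and the REST handlers share a single wire shape.
type JobResult struct {
	Outcome      string // JobOutcomeSuccess or JobOutcomeFailure
	ErrorCode    string // "" on a fresh success, "replay_no_op" on a replay, a stable code on failure
	ErrorMessage string // human-readable detail for failures
}
```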

4. Start service failure-mode mapping

| Failure | Error code | Notification intent |
| --- | --- | --- |
| Invalid input (empty fields, unknown op_source) | start_config_invalid | runtime.start_config_invalid |
| Lease busy | conflict | (none) |
| Existing record running with a different image_ref | conflict | (none) |
| Get returns a non-NotFound transport error | internal_error | (none) |
| image_ref shape rejected by distribution/reference | start_config_invalid | runtime.start_config_invalid |
| EnsureNetwork returns ErrNetworkMissing | start_config_invalid | runtime.start_config_invalid |
| EnsureNetwork returns any other error | service_unavailable | (none) |
| PullImage failure | image_pull_failed | runtime.image_pull_failed |
| InspectImage failure | image_pull_failed | runtime.image_pull_failed |
| prepareStateDir failure | start_config_invalid | runtime.start_config_invalid |
| Run failure | container_start_failed | runtime.container_start_failed |
| Upsert failure after successful Run | container_start_failed | runtime.container_start_failed |

Three error codes do not raise an admin notification: conflict, service_unavailable, and internal_error are operational classes (another caller is in flight, a dependency is down, an unclassified fault) where the corrective action is not a configuration change. The operator already sees them through telemetry and structured logs; an email per occurrence would be noise.

5. Upsert-after-Run rollback

A Run that succeeded but whose Upsert failed leaves a running container with no PG record. The service issues a best-effort docker.Remove(containerID) in a fresh context.Background() (the request context may already be cancelled) before recording the failure. A Remove failure is logged but not propagated; the reconciler adopts surviving orphans on its periodic pass.
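A sketch of that rollback step, with illustrative port interfaces, record fields, and timeout standing in for the real ones.

```go
package startruntime

import (
	"context"
	"log/slog"
	"time"
)

// dockerPort and recordStore are stand-ins for the real ports.
type dockerPort interface {
	Remove(ctx context.Context, containerID string) error
}

type recordStore interface {
	Upsert(ctx context.Context, record RuntimeRecord) error
}

type RuntimeRecord struct {
	GameID      string
	ContainerID string
}

// persistOrRollback installs the record for a container that Run just created;
// on Upsert failure it removes the container on a fresh context so the
// request's cancellation cannot strand a recordless runtime.
func persistOrRollback(ctx context.Context, docker dockerPort, records recordStore, rec RuntimeRecord, log *slog.Logger) error {
	if err := records.Upsert(ctx, rec); err != nil {
		// Best-effort rollback: the caller's context may already be cancelled,
		// so detach from it for the Remove round-trip.
		rollbackCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if removeErr := docker.Remove(rollbackCtx, rec.ContainerID); removeErr != nil {
			// Logged but not propagated: the reconciler adopts surviving orphans.
			log.Warn("post-Run rollback remove failed", "container_id", rec.ContainerID, "error", removeErr)
		}
		return err
	}
	return nil
}
```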

The Docker adapter already removes the container when Run itself returns an error after a successful ContainerCreate (adapters.md §3). The service-layer rollback covers the additional post-Run Upsert failure path.

6. Pre-existing record handling

Only status=running + the same image_ref is a replay_no_op; running + a different image_ref returns failure / conflict (use patch to change the image of a running container).

Anything else (stopped, removed, missing record) proceeds with a fresh start that ends in Upsert. Upsert overwrites verbatim and is not bound by the transitions table, so installing a running record over a removed row is permitted — the removed terminus rule lives in runtime.AllowedTransitions (which guards UpdateStatus), not in Upsert.
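The three-way rule, condensed into a sketch; the decision type, field names, and plain-string statuses are illustrative.

```go
package startruntime

type decision int

const (
	decisionReplayNoOp decision = iota
	decisionConflict
	decisionFreshStart
)

type existingRecord struct {
	Status   string
	ImageRef string
}

// classifyExisting mirrors the rule above: only running + the same image_ref
// is a replay; running + a different image_ref is a conflict; anything else
// (stopped, removed, missing) proceeds to a fresh start that ends in Upsert.
func classifyExisting(found bool, existing existingRecord, requestedImageRef string) decision {
	switch {
	case found && existing.Status == "running" && existing.ImageRef == requestedImageRef:
		return decisionReplayNoOp
	case found && existing.Status == "running":
		return decisionConflict
	default:
		return decisionFreshStart
	}
}
```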

created_at is preserved across re-starts: the start service reuses existing.CreatedAt when the record was found, so the "first time RTM saw the game" semantics from postgres-migration.md §9 hold even when the start path goes through Upsert rather than through the runtime adapter's INSERT ... ON CONFLICT DO UPDATE EXCLUDED list.

A residual galaxy-game-{game_id} container left over from a previous start that was stopped but never cleaned up will fail at docker run with a name conflict. The service surfaces that as container_start_failed; cleanup plus the reconciler is the standard remedy. A pre-emptive Remove inside the start service was rejected because it would silently undo manual operator inspection on stopped containers.

7. LobbyInternalClient.GetGame is best-effort

The fetch happens after the lease is acquired and before the Docker work, with the configured RTMANAGER_LOBBY_INTERNAL_TIMEOUT. ErrLobbyUnavailable and ErrLobbyGameNotFound are logged at debug; the start operation continues either way. The fetched Status and TargetEngineVersion enrich logs only — the start envelope already carries the only required field (image_ref), and the port docstring fixes the recoverable-failure contract.

8. image_ref validation

Validation uses github.com/distribution/reference.ParseNormalizedNamed before any Docker round-trip. Rejected shapes surface as start_config_invalid plus a runtime.start_config_invalid intent. Daemon-side rejections after a valid parse (manifest unknown, authentication required) surface as image_pull_failed plus a runtime.image_pull_failed intent. The split keeps operator-actionable configuration mistakes distinct from registry-side failures.

9. State-directory preparer is overrideable

Dependencies.PrepareStateDir is a func(gameID string) (string, error) injection point that defaults to os.MkdirAll + os.Chmod + os.Chown against RTMANAGER_GAME_STATE_ROOT. Tests override it to point at a t.TempDir()-style fake without exercising the real filesystem permissions (which require either matching uid/gid or root). This is a deliberate non-port abstraction: the start service does no other filesystem work and the cost of a new port for one helper is not worth the indirection.
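A sketch of what the default might look like; the mode bits and the uid/gid plumbing are placeholders, since the real values come from configuration.

```go
package startruntime

import (
	"os"
	"path/filepath"
)

// defaultPrepareStateDir builds the injected default: create the per-game
// state directory under the configured root (RTMANAGER_GAME_STATE_ROOT) and
// hand ownership to the engine uid/gid.
func defaultPrepareStateDir(stateRoot string, uid, gid int) func(gameID string) (string, error) {
	return func(gameID string) (string, error) {
		dir := filepath.Join(stateRoot, gameID)
		if err := os.MkdirAll(dir, 0o750); err != nil {
			return "", err
		}
		if err := os.Chmod(dir, 0o750); err != nil {
			return "", err
		}
		if err := os.Chown(dir, uid, gid); err != nil {
			return "", err
		}
		return dir, nil
	}
}
```

A test override can then be as small as `func(string) (string, error) { return t.TempDir(), nil }`, which never reaches Chown.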

10. Container env: both GAME_STATE_PATH and STORAGE_PATH

Both names are accepted by the v1 engine. The start service always sets both; the configured RTMANAGER_ENGINE_STATE_ENV_NAME controls the primary. When the operator overrides the primary to STORAGE_PATH, the deduplicating map collapses the two entries into one.
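One plausible shape of that dedup, assuming the secondary name is hard-coded and the primary comes from configuration; the helper name and which of the two names is hard-coded are guesses.

```go
package startruntime

// engineEnv sketches the dedup: the configured primary and the always-set
// secondary both point at the state path, and the map collapses them into a
// single entry when the operator sets the primary to STORAGE_PATH.
func engineEnv(primaryName, statePath string) []string {
	env := map[string]string{
		primaryName:    statePath, // RTMANAGER_ENGINE_STATE_ENV_NAME, GAME_STATE_PATH by default
		"STORAGE_PATH": statePath, // secondary, also accepted by the v1 engine
	}
	vars := make([]string, 0, len(env))
	for name, value := range env {
		vars = append(vars, name+"="+value)
	}
	return vars
}
```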

11. Wiring layer construction

internal/app/wiring.go is the single point that builds every production store, adapter, and service from config.Config. The struct exposes typed fields so handlers and workers can grab the singletons without re-wiring; an addCloser slice releases adapter resources (currently the Lobby HTTP client's idle-connection pool) at runtime shutdown. The runtimeRecordsProbe adapter installed during construction registers the rtmanager.runtime_records_by_status gauge documented in ../README.md §Observability.

The persistence-only CountByStatus method on the runtimerecordstore adapter is not part of ports.RuntimeRecordStore because it is only used by the gauge probe; widening the port for one caller would force every adapter and test fake to grow with no benefit. The adapter exposes it directly and the wiring composes a concrete-typed wrapper.

12. Shared lease across composed operations (restart, patch)

Restart and patch must hold the lease across the inner stop → docker rm → start sequence, otherwise a concurrent stop or restart could observe a half-recreated runtime.

startruntime.Service and stopruntime.Service therefore expose a second public method:

```go
// Run executes the lifecycle assuming the per-game lease is already
// held by the caller. Reserved for orchestrator services that compose
// stop or start with another operation under a single outer lease.
// External callers must use Handle.
func (service *Service) Run(ctx context.Context, input Input) (Result, error)
```

Handle acquires the lease, defers its release, and calls Run. Restart and patch acquire the outer lease themselves and call Run on the inner services. The inner services record their own operation_log entries, telemetry counters, health events, and admin notification intents identically to a top-level Handle.
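A condensed sketch of the outer Handle, with the Result contract from §3 collapsed to plain errors and the dependencies reduced to stand-in interfaces; only the lease-around-the-whole-sequence shape is the point.

```go
package restartruntime

import (
	"context"
	"errors"
)

// Stand-ins for the real lease store port and the inner services' Run methods.
type leaseStore interface {
	Acquire(ctx context.Context, gameID, token string) (bool, error)
	Release(ctx context.Context, gameID, token string) error
}

type innerRunner interface {
	Run(ctx context.Context, gameID string) error // assumes the lease is already held
}

var errLeaseBusy = errors.New("another operation holds the game lease")

type Service struct {
	lease       leaseStore
	stop, start innerRunner
	removeCtr   func(ctx context.Context, gameID string) error // the docker rm step
	newToken    func() string
}

// Handle holds one outer lease across the whole stop → docker rm → start
// sequence, so a concurrent stop or restart can never observe a
// half-recreated runtime, then delegates to the inner services' Run methods.
func (s *Service) Handle(ctx context.Context, gameID string) error {
	token := s.newToken()
	held, err := s.lease.Acquire(ctx, gameID, token)
	if err != nil {
		return err // surfaces as service_unavailable
	}
	if !held {
		return errLeaseBusy // surfaces as conflict
	}
	// Release on a fresh context so caller cancellation cannot leak the lease
	// (an assumption here, not a documented rule).
	defer s.lease.Release(context.Background(), gameID, token)

	if err := s.stop.Run(ctx, gameID); err != nil {
		return err
	}
	if err := s.removeCtr(ctx, gameID); err != nil {
		return err
	}
	return s.start.Run(ctx, gameID)
}
```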

A typed LeaseTicket parameter (a small internal-package zero-size struct that only the lease store can construct) was considered and rejected for v1: only sister services in internal/service/ ever call Run, the docstring is loud about the precondition, and the pattern can be tightened later without breaking the public surface that consumers and handlers rely on.

13. Correlation id on source_ref

The outer restart and patch services reuse the existing Input.SourceRef as a correlation key:

  • when Input.SourceRef is non-empty (REST request id, stream entry id), all three entries — outer restart / patch + inner stop + inner start — share that value;
  • when empty, the outer service generates a 32-byte base64url string via the same NewToken generator that produces lease tokens, and uses it as the correlation key for all three entries.
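A sketch of the generator and the selection rule, reading "32-byte base64url string" as 32 random bytes before encoding; the function placement and the unpadded encoding are assumptions.

```go
package restartruntime

import (
	"crypto/rand"
	"encoding/base64"
)

// NewToken sketches the shared generator: 32 random bytes, base64url encoded.
// The same helper produces lease tokens and, when the caller supplied no
// SourceRef, the correlation key shared by the outer and inner audit entries.
func NewToken() string {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		panic(err) // crypto/rand failure is unrecoverable
	}
	return base64.RawURLEncoding.EncodeToString(buf)
}

// correlationRef picks the audit correlation key for a composed operation.
func correlationRef(sourceRef string) string {
	if sourceRef != "" {
		return sourceRef // REST request id or stream entry id
	}
	return NewToken()
}
```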

The outer entry's source_ref keeps its dual semantics: actor ref when the caller supplied one, generated correlation id otherwise. Pure top-level operations (caller invokes start, stop, or cleanup directly) keep the original meaning. Composed operations (restart, patch) use the same value in three places to make audit queries trivial.

This is not the cleanest end-state — a dedicated correlation_id column would carry the link without ambiguity — but it is the smallest change that does not touch the schema. A future stage that adds the column can rename the field and clear up the dual role in one move.

14. Semver validation for patch

internal/service/patchruntime/semver.go enforces the patch precondition (the current and the new image_ref both parse as semver and share major and minor); a sketch of the two helpers follows the list:

  • extractSemverTag(imageRef) parses with github.com/distribution/reference.ParseNormalizedNamed, casts to reference.NamedTagged, then validates the tag with golang.org/x/mod/semver.IsValid (after prepending v when the tag omits it). Failures map to image_ref_not_semver;
  • samePatchSeries(currentSemver, newSemver) compares semver.MajorMinor of the two canonical strings; mismatch maps to semver_patch_only.
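A sketch consistent with that description; the error value and exact signatures are assumptions.

```go
package patchruntime

import (
	"errors"
	"strings"

	"github.com/distribution/reference"
	"golang.org/x/mod/semver"
)

var errNotSemverTag = errors.New("image_ref tag is not a semver version")

// extractSemverTag parses the image_ref, requires a tagged reference, and
// returns the tag in canonical "vX.Y.Z" form (prepending "v" when the tag
// omits it). Failures map to the image_ref_not_semver error code.
func extractSemverTag(imageRef string) (string, error) {
	named, err := reference.ParseNormalizedNamed(imageRef)
	if err != nil {
		return "", err
	}
	tagged, ok := named.(reference.NamedTagged)
	if !ok {
		return "", errNotSemverTag
	}
	tag := tagged.Tag()
	if !strings.HasPrefix(tag, "v") {
		tag = "v" + tag
	}
	if !semver.IsValid(tag) {
		return "", errNotSemverTag
	}
	return tag, nil
}

// samePatchSeries reports whether the two canonical versions share major and
// minor; a mismatch maps to the semver_patch_only error code.
func samePatchSeries(currentSemver, newSemver string) bool {
	return semver.MajorMinor(currentSemver) == semver.MajorMinor(newSemver)
}
```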

golang.org/x/mod is a direct require to avoid a transitive-version surprise. github.com/Masterminds/semver/v3 (also in the module graph) was rejected to avoid two semver libraries on disk for the same job; x/mod/semver already covers Lobby. A hand-rolled vMajor.Minor.Patch parser was rejected as premature.

Pre-checks run before any inner stop or docker rm: a rejected patch never disturbs the running runtime. Patch with new_image_ref == current_image_ref proceeds through the recreate flow unchanged (not replay_no_op: the inner start still runs); the outer op_kind=patch entry records the no-op patch for audit.

15. StopReason placement

The reason enum mirrors lobby/internal/ports/runtimemanager.go verbatim and lives at internal/service/stopruntime/stopreason.go. The stream consumer and the REST handler import stopruntime for the same enum the service requires.

Inner stop calls from restart and patch always pass StopReasonAdminRequest. Restart and patch are platform-internal recreate flows; admin_request is the closest semantic match in the five-value vocabulary. The actor that originated the recreate (REST request id, admin user id) flows through the op_source / source_ref pair, not through the stop reason.

16. Error code centralisation

internal/service/startruntime/errors.go is the canonical home for the stable error codes returned in Result.ErrorCode. The other four services (stopruntime, restartruntime, patchruntime, cleanupcontainer) import the constants from startruntime rather than redeclaring them. The package comment of errors.go flags the shared usage so future readers do not chase per-service declarations.

start_config_invalid is reserved for start because every start validation failure also raises an admin notification intent. The other services use the more general invalid_request for input validation failures.

17. Stop / restart / patch / cleanup failure tables

stopruntime

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | invalid_request | No notification intent. |
| Lease busy | conflict | Lease release skipped because acquire returned false. |
| Lease error | service_unavailable | Redis unreachable. |
| Record missing | not_found | |
| Status stopped / removed | success / replay_no_op | Idempotent re-stop. |
| docker.Stop returns ErrContainerNotFound | success | Record transitions running → removed; container_disappeared health event published. |
| docker.Stop other error | service_unavailable | Record untouched; caller may retry. |
| UpdateStatus returns ErrConflict (CAS race) | success / replay_no_op | The desired state was reached by another path (reconciler / restart). |
| UpdateStatus returns ErrNotFound | not_found | Record vanished mid-stop. |
| UpdateStatus other error | internal_error | |

restartruntime

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | invalid_request | |
| Lease busy / lease error | conflict / service_unavailable | Same as stop. |
| Record missing | not_found | |
| Status removed | conflict | image_ref may be empty; restart cannot proceed. |
| Inner stop fails | inner ErrorCode | Outer ErrorMessage prefixes "inner stop failed: ". |
| docker.Remove fails | service_unavailable | Inner stop already moved the record to stopped; the runtime stays in stopped. Admin must call cleanup_container before retrying restart. |
| Inner start fails | inner ErrorCode | Outer ErrorMessage prefixes "inner start failed: ". |

The post-stop docker rm failure is the only path that leaves the runtime in a state from which the same operation cannot recover by itself: a residual galaxy-game-{game_id} container blocks a fresh inner start (the start service surfaces this as container_start_failed). The runbook entry — "call cleanup, then restart again" — is the standard remedy.

patchruntime

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | invalid_request | |
| Lease busy / lease error | conflict / service_unavailable | |
| Record missing | not_found | |
| Status removed | conflict | |
| Current image_ref not parseable as semver tag | image_ref_not_semver | Pre-check; no inner ops fired. |
| New image_ref not parseable as semver tag | image_ref_not_semver | Pre-check; no inner ops fired. |
| Major / minor mismatch | semver_patch_only | Pre-check; no inner ops fired. |
| Inner stop / docker rm / inner start fails | inherits inner code | Same propagation as restart. |

cleanupcontainer

| Failure | Error code | Notes |
| --- | --- | --- |
| Invalid input | invalid_request | |
| Lease busy / lease error | conflict / service_unavailable | |
| Record missing | not_found | |
| Status removed | success / replay_no_op | |
| Status running | conflict | Error message: "stop the runtime first". |
| Status stopped | (proceeds) | |
| docker.Remove returns ErrContainerNotFound | success | Adapter swallows not-found into nil. |
| docker.Remove other error | service_unavailable | Record untouched; caller may retry. |
| UpdateStatus returns ErrConflict | success / replay_no_op | Race with reconciler dispose. |
| UpdateStatus returns ErrNotFound | not_found | |
| UpdateStatus other error | internal_error | |

18. REST handler conventions

The internal HTTP handlers under ../internal/api/internalhttp/handlers/ follow these rules:

  • X-Galaxy-Caller header. The optional header carries the calling service identity (gm / admin); the handler records the value as op_source in operation_log (gm_rest / admin_rest). Missing or unknown values default to admin_rest because every audit-log query already filters on the cleanup endpoint (op_source ∈ {auto_ttl, admin_rest}); making the default match the most-restricted surface keeps existing dashboards correct when an unconfigured client hits the listener. The header is declared as a reusable parameter (components.parameters.XGalaxyCallerHeader) in the OpenAPI spec and is referenced from each runtime operation but not from /healthz and /readyz.

  • Error code → HTTP status mapping. One canonical table in handlers/common.go:

    | ErrorCode | HTTP status |
    | --- | --- |
    | (success, including replay_no_op) | 200 |
    | invalid_request, start_config_invalid, image_ref_not_semver | 400 |
    | not_found | 404 |
    | conflict, semver_patch_only | 409 |
    | service_unavailable, docker_unavailable | 503 |
    | internal_error, image_pull_failed, container_start_failed | 500 |

    image_pull_failed and container_start_failed are operational failures that originate inside RTM (registry / daemon problems), not client-side validation issues; they map to 500 so callers retry through their normal resilience paths instead of treating the call as a 4xx that must be fixed at the source. docker_unavailable is reserved for future producers; today the start service emits service_unavailable for Docker-daemon failures. Unknown error codes default to 500. A Go sketch of this mapping closes this document.

  • List and Get bypass the service layer. internalListRuntimes and internalGetRuntime read directly from ports.RuntimeRecordStore. Reads do not produce operation_log rows, do not change Docker state, do not need the per-game lease, and do not have a stream-side counterpart — none of the lifecycle service machinery is justified.

  • RuntimeRecordStore.List(ctx) returns every record regardless of status. A single SELECT ordered by (last_op_at DESC, game_id ASC) — the same direction the runtime_records_status_last_op_idx index supports, so freshly active games surface first. Pagination is intentionally not modelled in v1; the working set is bounded by the games tracked by Lobby.

  • Per-handler service ports use mockgen. The handler layer depends on five narrow interfaces — one per lifecycle service — declared in handlers/services.go. Production wiring passes the concrete *<lifecycle>.Service pointers (each satisfies the matching interface implicitly); tests pass the mockgen-generated mocks under handlers/mocks/.

  • Conformance test scope. internalhttp/conformance_test.go drives every documented runtime operation against a real internalhttp.Server whose service deps are deterministic stubs. The test uses kin-openapi/routers/legacy.NewRouter, calls openapi3filter.ValidateRequest and openapi3filter.ValidateResponse so both directions match the contract. The scope is happy-path only; the failure-path response shapes are validated by the per-handler tests.
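A sketch of the canonical error-code to HTTP status mapping as a Go switch; the function name is illustrative, and the real table lives in handlers/common.go.

```go
package handlers

import "net/http"

// httpStatusFor maps a Result.ErrorCode to the HTTP status returned by the
// internal REST handlers. Unknown codes default to 500.
func httpStatusFor(errorCode string) int {
	switch errorCode {
	case "", "replay_no_op":
		return http.StatusOK
	case "invalid_request", "start_config_invalid", "image_ref_not_semver":
		return http.StatusBadRequest
	case "not_found":
		return http.StatusNotFound
	case "conflict", "semver_patch_only":
		return http.StatusConflict
	case "service_unavailable", "docker_unavailable":
		return http.StatusServiceUnavailable
	case "internal_error", "image_pull_failed", "container_start_failed":
		return http.StatusInternalServerError
	default:
		return http.StatusInternalServerError
	}
}
```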