feat: runtime manager

Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
@@ -0,0 +1,167 @@
# Domain and Ports
This document explains why the `rtmanager` domain layer
([`../internal/domain/`](../internal/domain)) and the port interfaces
([`../internal/ports/`](../internal/ports)) are shaped the way they are.
The current-state types and method signatures are the source of truth in
the code; this file records the rationale so future readers do not
re-litigate the same trade-offs.
For the surrounding behaviour see
[`../README.md`](../README.md), the SQL CHECK constraints in
[`../internal/adapters/postgres/migrations/00001_init.sql`](../internal/adapters/postgres/migrations/00001_init.sql),
the wire contracts under [`../api/`](../api), and
[`postgres-migration.md`](postgres-migration.md) for the persistence
layer.
## 1. String-typed status enums
`runtime.Status`, `operation.OpKind`, `operation.OpSource`,
`operation.Outcome`, `health.EventType`, `health.SnapshotStatus`, and
`health.SnapshotSource` are all `type X string`.
The string approach wins on three counts:
- the columns guarded by the SQL CHECK constraints already store the
values as `text`, so a string domain type maps one-to-one with no codec layer;
- it matches Lobby (`game.Status`, `membership.Status`,
`application.Status`), so reviewers do not switch encoding mental
models when crossing service boundaries;
- `IsKnown` keeps the invariant cheap (a single switch); a `type X uint8`
with stringer-generated names would need a name lookup on every
conversion and would make raw SQL columns harder to read in diagnostics.
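
The pattern above can be sketched as follows. This is a minimal illustration, not the code in `internal/domain`: it assumes only the three runtime states named later in this document (`running`, `stopped`, `removed`), and the real `runtime.Status` may declare more.

```go
package main

import "fmt"

// Status is a string-typed enum whose values match the text stored in SQL.
type Status string

// Assumed value set: the three states this document names in section 8.
const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// IsKnown reports whether s is one of the declared values.
func (s Status) IsKnown() bool {
	switch s {
	case StatusRunning, StatusStopped, StatusRemoved:
		return true
	default:
		return false
	}
}

func main() {
	for _, raw := range []string{"running", "paused"} {
		s := Status(raw) // plain conversion from the column value, no codec
		fmt.Printf("%q known=%t\n", raw, s.IsKnown())
	}
}
```

The conversion `Status(raw)` is the whole mapping between the column and the domain type; `IsKnown` is the only place the value set has to be repeated.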
## 2. Plain `string` for `CurrentContainerID` and `CurrentImageRef`
The PostgreSQL columns are nullable. The domain model uses plain
`string` with empty == NULL and bridges the SQL nullability inside the
adapter. Pointer fields would force every consumer to dereference
defensively even though business logic rarely cares about the
NULL/empty distinction (removed records may legitimately carry either
form depending on whether the record passed through `stopped` first).
The adapter's job is to translate between `sql.NullString` and `string`;
the rest of the codebase reads the field as a regular value.
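
A minimal sketch of that bridge, assuming two hypothetical helper names (`fromNullable`, `toNullable`); the actual adapter may structure the conversion differently.

```go
package main

import (
	"database/sql"
	"fmt"
)

// fromNullable maps SQL NULL to the empty string (hypothetical helper).
func fromNullable(ns sql.NullString) string {
	if !ns.Valid {
		return ""
	}
	return ns.String
}

// toNullable maps the empty string back to SQL NULL (hypothetical helper).
func toNullable(s string) sql.NullString {
	return sql.NullString{String: s, Valid: s != ""}
}

func main() {
	fmt.Printf("%q\n", fromNullable(sql.NullString{})) // NULL column
	fmt.Printf("%+v\n", toNullable("abc123"))          // set column
	fmt.Printf("%+v\n", toNullable(""))                // written back as NULL
}
```

Under this convention an empty-but-valid string read from the database is normalised to NULL on the next write, which is consistent with treating empty and NULL as the same thing.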
## 3. `*time.Time` for nullable timestamps
`StartedAt`, `StoppedAt`, and `RemovedAt` retain pointer types. `time.Time{}`
is a real, comparable value in Go (`IsZero` reports true only for that
zero instant), so encoding "missing" as the zero value of a plain
`time.Time` would make it indistinguishable from "set to UTC zero" and
would invite bugs. The jet-generated `model.RuntimeRecords`
already declares the same fields as `*time.Time`, so the domain type
aligns with the persistence type and the adapter does not re-shape
pointers.
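
The distinction can be illustrated with a cut-down, hypothetical `record` type (not the real domain struct): a nil pointer means the column is NULL, while a pointer to `time.Time{}` is still a concrete timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// record is a hypothetical stand-in with a single nullable timestamp.
type record struct {
	StartedAt *time.Time // nil means the column is NULL
}

// describe separates "never started" from any concrete timestamp,
// including the zero instant.
func describe(r record) string {
	if r.StartedAt == nil {
		return "never started"
	}
	return "started at " + r.StartedAt.UTC().Format(time.RFC3339)
}

func main() {
	zero := time.Time{}
	at := time.Date(2026, 4, 28, 18, 39, 18, 0, time.UTC)

	fmt.Println(describe(record{}))
	fmt.Println(describe(record{StartedAt: &zero}))
	fmt.Println(describe(record{StartedAt: &at}))
}
```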
## 4. `EventType` and `SnapshotStatus` are deliberately distinct
`runtime-health-asyncapi.yaml.EventType` enumerates seven values; the
SQL CHECK on `health_snapshots.status` enumerates six. The two sets
overlap but are not identical:
- `container_started` is an *event*; the snapshot collapses it to
`healthy` (a successful start is observed as the container being
live, not as an ongoing event);
- `probe_recovered` is an *event*; it does not become a snapshot row of
its own — the next inspect/probe overwrites the prior `probe_failed`
with `healthy`.

Modelling them as one shared enum would require a separate "event vs
snapshot" boolean and invite accidental mismatches. Two distinct types
with explicit `IsKnown` matrices keep each surface honest at compile
time.
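
A partial sketch of the two types, using only the values this section mentions (the full sets of seven and six are not reproduced here). The helper `snapshotFor` is hypothetical and exists only to show the two cases described above.

```go
package main

import "fmt"

// EventType and SnapshotStatus are distinct types, so one cannot be
// passed where the other is expected without an explicit conversion.
type EventType string
type SnapshotStatus string

// Only the values named in this section; the real enums are larger.
const (
	EventContainerStarted EventType = "container_started"
	EventProbeRecovered   EventType = "probe_recovered"

	SnapshotHealthy     SnapshotStatus = "healthy"
	SnapshotProbeFailed SnapshotStatus = "probe_failed"
)

// snapshotFor is a hypothetical helper: it returns the snapshot status an
// event collapses to, or false when the event writes no snapshot row.
func snapshotFor(e EventType) (SnapshotStatus, bool) {
	switch e {
	case EventContainerStarted:
		return SnapshotHealthy, true
	default:
		// probe_recovered (and anything not modelled here) produces no row.
		return "", false
	}
}

func main() {
	for _, e := range []EventType{EventContainerStarted, EventProbeRecovered} {
		s, ok := snapshotFor(e)
		fmt.Printf("%s -> %q (row written: %t)\n", e, s, ok)
	}
}
```

With a single shared enum, `snapshotFor` would have no type-level way to say that its result is a snapshot status rather than an event.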
## 5. `Inspect` split into `InspectImage` + `InspectContainer`
Two narrow methods replace a single polymorphic `Inspect`. The surface
RTM exercises has two shapes:
- the start service inspects the *image* by reference to read resource
limits from labels;
- the periodic inspect worker, the reconciler, and the events listener
inspect *containers* by id to read state, health, restart count, and
exit code.

The inputs differ (ref vs id), and the result types differ
(`ImageInspect.Labels` is the only field used at start time, while
`ContainerInspect` carries a dozen state fields). One polymorphic
method would either split internally on input type or return a tagged
union; either is messier than two narrow methods.
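
The shape of the split can be sketched as below. The interface name `DockerClient`, the `ContainerInspect` field names, the label key, and the fake implementation are all assumptions made for illustration; only `InspectImage`, `InspectContainer`, and `ImageInspect.Labels` come from the text.

```go
package main

import (
	"context"
	"fmt"
)

// ImageInspect carries what the start service reads.
type ImageInspect struct {
	Labels map[string]string
}

// ContainerInspect is a cut-down stand-in; field names are assumptions.
type ContainerInspect struct {
	Status       string
	Health       string
	RestartCount int
	ExitCode     int
}

// DockerClient exposes two narrow methods instead of one polymorphic Inspect.
type DockerClient interface {
	InspectImage(ctx context.Context, ref string) (ImageInspect, error)
	InspectContainer(ctx context.Context, id string) (ContainerInspect, error)
}

// fakeDocker is a hypothetical in-memory implementation for the example.
type fakeDocker struct{}

func (fakeDocker) InspectImage(_ context.Context, ref string) (ImageInspect, error) {
	return ImageInspect{Labels: map[string]string{"example.memory": "512m"}}, nil
}

func (fakeDocker) InspectContainer(_ context.Context, id string) (ContainerInspect, error) {
	return ContainerInspect{Status: "exited", RestartCount: 1, ExitCode: 137}, nil
}

func main() {
	ctx := context.Background()
	var c DockerClient = fakeDocker{}

	img, _ := c.InspectImage(ctx, "registry.example/engine:1.0")
	ctr, _ := c.InspectContainer(ctx, "abc123")
	fmt.Println(img.Labels["example.memory"], ctr.Status, ctr.ExitCode)
}
```

Each caller sees exactly the result type it needs, with no type switch on the input and no tagged union on the output.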
## 6. `LobbyGameRecord` is intentionally minimal
`LobbyInternalClient.GetGame` returns `GameID`, `Status`, and
`TargetEngineVersion`. The fetch is classified as ancillary diagnostics
because the start envelope already carries the only required field
(`image_ref`).
Anything more would invite RTM consumers to depend on Lobby's schema in
ways that violate the "RTM never resolves engine versions" rule.
Future fields are additive: each new field is opt-in for the consumer
and does not break existing call sites. The minimalism is also a hedge
against schema drift — Lobby's `GameRecord` is large and changes more
often than RTM needs to track.
## 7. `NotificationIntentPublisher.Publish` returns `error`, not `(string, error)`
Lobby's `IntentPublisher.Publish` returns the Redis Stream entry id so
business workflows that key on it (idempotency keys, audit
correlation) can capture it. RTM publishes admin-only failure intents
where the entry id has no consumer — failing starts do not loop back
to RTM, and notification routing keys on the producer-supplied
`idempotency_key` rather than the stream id. The adapter wraps
`pkg/notificationintent.Publisher` and discards the entry id at the
wrapper boundary.
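
A sketch of such a wrapper follows. `Intent`, `streamPublisher`, and `fakeStream` are hypothetical stand-ins; the real signature of `pkg/notificationintent.Publisher` is not shown in this document and may differ.

```go
package main

import (
	"context"
	"fmt"
)

// Intent is a hypothetical payload carrying the producer-supplied key.
type Intent struct {
	IdempotencyKey string
}

// streamPublisher stands in for the shared publisher, which is assumed
// to return the stream entry id alongside the error.
type streamPublisher interface {
	Publish(ctx context.Context, in Intent) (string, error)
}

// Publisher is the narrow RTM-facing wrapper: it returns error only.
type Publisher struct {
	inner streamPublisher
}

// Publish forwards the intent and drops the entry id at this boundary.
func (p Publisher) Publish(ctx context.Context, in Intent) error {
	_, err := p.inner.Publish(ctx, in)
	return err
}

// fakeStream records intents and fabricates an entry id for the example.
type fakeStream struct {
	published []Intent
}

func (f *fakeStream) Publish(_ context.Context, in Intent) (string, error) {
	f.published = append(f.published, in)
	return fmt.Sprintf("%d-0", len(f.published)), nil
}

func main() {
	stream := &fakeStream{}
	p := Publisher{inner: stream}
	err := p.Publish(context.Background(), Intent{IdempotencyKey: "start-failed:game-1"})
	fmt.Println(err, len(stream.published))
}
```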
## 8. Exactly four allowed runtime transitions
`runtime.AllowedTransitions` covers:
- `running → stopped`: graceful stop, observed exit, or the reconciler
observing an exited container;
- `running → removed`: `reconcile_dispose` when the container
vanished;
- `stopped → running`: the inner start of a restart or patch;
- `stopped → removed`: cleanup TTL or admin DELETE.

Other pairs are intentionally rejected:
- `running → running` and `stopped → stopped` would mean Upsert
overwrote state without a CAS guard. Idempotent re-start / re-stop
never transitions; the service layer returns `replay_no_op` and the
record is left untouched.
- `removed → *` is forbidden because `removed` is terminal. The
reconciler creates fresh records with `reconcile_adopt` rather than
resurrecting old ones.

Encoding the table this way means a future bug where a service tries
to revive a removed record is rejected at the domain layer rather than
the adapter, which keeps the failure mode close to the offending code.
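
The table above can be sketched as a lookup keyed by `(from, to)` pairs. The names `transition`, `allowedTransitions`, and `CanTransition` are illustrative and not necessarily those used in `runtime.AllowedTransitions`.

```go
package main

import "fmt"

type Status string

const (
	StatusRunning Status = "running"
	StatusStopped Status = "stopped"
	StatusRemoved Status = "removed"
)

// transition is a (from, to) pair used as a map key.
type transition struct {
	From, To Status
}

// allowedTransitions lists exactly the four pairs from the table above.
var allowedTransitions = map[transition]bool{
	{StatusRunning, StatusStopped}: true,
	{StatusRunning, StatusRemoved}: true,
	{StatusStopped, StatusRunning}: true,
	{StatusStopped, StatusRemoved}: true,
}

// CanTransition reports whether from -> to is one of the allowed pairs.
// Self-transitions and anything leaving removed are absent from the map.
func CanTransition(from, to Status) bool {
	return allowedTransitions[transition{from, to}]
}

func main() {
	fmt.Println(CanTransition(StatusRunning, StatusStopped)) // allowed
	fmt.Println(CanTransition(StatusRunning, StatusRunning)) // self-transition
	fmt.Println(CanTransition(StatusRemoved, StatusRunning)) // removed is terminal
}
```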
## 9. `PullPolicy` re-declared inside `ports/dockerclient.go`
The same enum exists as `config.ImagePullPolicy`. Importing
`internal/config` from the ports package would couple two unrelated
layers and risk an import cycle once the wiring layer pulls both in.
The runtime/wiring layer (in `internal/app`) is the single point that
translates between the two types: both are `string`-typed, the
value sets are identical, and the validation lives on each side
independently.
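
A single-file sketch of that translation is shown below. The two types stand in for `config.ImagePullPolicy` and the port-side `PullPolicy`; the value names (`if_missing`, `always`) and the helper `toPortPullPolicy` are assumptions, since this document does not list the actual values.

```go
package main

import "fmt"

// ImagePullPolicy stands in for config.ImagePullPolicy.
type ImagePullPolicy string

// PullPolicy stands in for the copy declared in ports/dockerclient.go.
type PullPolicy string

// Hypothetical value set, declared once per side.
const (
	ConfigPullIfMissing ImagePullPolicy = "if_missing"
	ConfigPullAlways    ImagePullPolicy = "always"

	PortPullIfMissing PullPolicy = "if_missing"
	PortPullAlways    PullPolicy = "always"
)

// IsKnown is the port-side validation, independent of the config package.
func (p PullPolicy) IsKnown() bool {
	switch p {
	case PortPullIfMissing, PortPullAlways:
		return true
	default:
		return false
	}
}

// toPortPullPolicy is the kind of translation the wiring layer would own:
// a plain conversion followed by the port-side check.
func toPortPullPolicy(c ImagePullPolicy) (PullPolicy, error) {
	p := PullPolicy(c)
	if !p.IsKnown() {
		return "", fmt.Errorf("unknown pull policy %q", string(c))
	}
	return p, nil
}

func main() {
	p, err := toPortPullPolicy(ConfigPullAlways)
	fmt.Println(p, err)
	_, err = toPortPullPolicy("bogus")
	fmt.Println(err)
}
```

Because both types share the underlying type `string`, the conversion itself is free; the duplication costs one constant block per side and keeps the ports package free of a `config` import.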
## 10. Compile-time interface assertions live with adapters
Every interface has a `var _ ports.X = (*Y)(nil)` assertion, but the
assertion lives in the adapter package (e.g.
`var _ ports.RuntimeRecordStore = (*Store)(nil)` inside
`internal/adapters/postgres/runtimerecordstore`). Putting the
assertions in the port package would force the port package to import
its own implementations and create an obvious import cycle.
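
The assertion idiom in isolation, collapsed into one file for illustration. The interface method `Get` is hypothetical; in the real layout the interface lives in `ports` and the assertion sits next to `Store` in the adapter package.

```go
package main

import "fmt"

// RuntimeRecordStore stands in for ports.RuntimeRecordStore.
// The Get method is a hypothetical example method.
type RuntimeRecordStore interface {
	Get(gameID string) (string, error)
}

// Store stands in for the Postgres adapter.
type Store struct{}

func (s *Store) Get(gameID string) (string, error) {
	return "running", nil
}

// Compile-time assertion: the build fails here if *Store stops
// satisfying the interface. No value is allocated at run time.
var _ RuntimeRecordStore = (*Store)(nil)

func main() {
	var s RuntimeRecordStore = &Store{}
	status, err := s.Get("game-1")
	fmt.Println(status, err)
}
```

Since the adapter already imports the port package to name the interface, keeping the assertion there adds no new dependency edge.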
## 11. `RunSpec.Validate` lives on the request type
The Docker port carries a non-trivial request type (`RunSpec`) with
eight required fields and per-mount invariants. Putting `Validate` on
the request struct keeps the rule next to the type definition, mirrors
the pattern used by `lobby/internal/ports/gmclient.go`
(`RegisterGameRequest.Validate`), and lets the adapter call it as the
first defensive check before invoking the Docker SDK.
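
A reduced sketch of the pattern follows. The real `RunSpec` has eight required fields that this document does not enumerate; the three fields and the mount rules below are assumptions chosen only to show where the checks live.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Mount is a hypothetical bind-mount description.
type Mount struct {
	Source string
	Target string
}

// RunSpec is a cut-down stand-in for the Docker port request type.
type RunSpec struct {
	Name     string
	ImageRef string
	Mounts   []Mount
}

// Validate checks required fields and per-mount invariants.
func (s RunSpec) Validate() error {
	if strings.TrimSpace(s.Name) == "" {
		return errors.New("run spec: name is required")
	}
	if strings.TrimSpace(s.ImageRef) == "" {
		return errors.New("run spec: image ref is required")
	}
	for i, m := range s.Mounts {
		if m.Source == "" || m.Target == "" {
			return fmt.Errorf("run spec: mount %d: source and target are required", i)
		}
		if !strings.HasPrefix(m.Target, "/") {
			return fmt.Errorf("run spec: mount %d: target %q must be absolute", i, m.Target)
		}
	}
	return nil
}

func main() {
	ok := RunSpec{
		Name:     "game-1",
		ImageRef: "registry.example/engine:1.0",
		Mounts:   []Mount{{Source: "/srv/game-1", Target: "/data"}},
	}
	fmt.Println(ok.Validate())

	bad := RunSpec{Name: "game-1"}
	fmt.Println(bad.Validate())
}
```

An adapter following this pattern would call `spec.Validate()` first and return early on error, before any Docker SDK call is made.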