feat: runtime manager

Ilia Denisov
2026-04-28 20:39:18 +02:00
committed by GitHub
parent e0a99b346b
commit a7cee15115
289 changed files with 45660 additions and 2207 deletions
+119 -7
@@ -344,7 +344,7 @@ On success:
### Application state machine
```
```text
submitted → approved
submitted → rejected
```
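The two transitions above can be encoded as a small lookup table. This is a minimal sketch, assuming the hypothetical names `ApplicationStatus`, `allowed`, and `CanTransition`; none of them appear in the source, and the real Lobby domain model may look different.

```go
package main

import "fmt"

// ApplicationStatus is a hypothetical type name used for illustration only.
type ApplicationStatus string

const (
	StatusSubmitted ApplicationStatus = "submitted"
	StatusApproved  ApplicationStatus = "approved"
	StatusRejected  ApplicationStatus = "rejected"
)

// allowed lists the two transitions shown in the state machine. The
// approved and rejected states have no outgoing edges in that listing.
var allowed = map[ApplicationStatus][]ApplicationStatus{
	StatusSubmitted: {StatusApproved, StatusRejected},
}

// CanTransition reports whether from -> to is one of the listed transitions.
func CanTransition(from, to ApplicationStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StatusSubmitted, StatusApproved)) // true
	fmt.Println(CanTransition(StatusApproved, StatusRejected))  // false
}
```

Keeping the table as data rather than a chain of conditionals makes it easy to compare against the documented diagram line by line.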
@@ -453,7 +453,7 @@ with payload: `game_id`, `game_name`, `invitee_user_id`, `invitee_name`.
### Invite state machine
```
```text
created → redeemed
created → declined
created → revoked
@@ -591,9 +591,11 @@ Sentinel errors: `ErrNameTaken`, `ErrInvalidName`, `ErrPendingMissing`,
`pg_advisory_xact_lock(hashtextextended(canonical_key, 0))`. See
`docs/postgres-migration.md` §6B for the full schema and decision
record.
- **Stub** (`lobby/internal/adapters/racenamestub/directory.go`) — in-process
implementation for unit tests that do not need PostgreSQL. Chosen by
`LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub`.
- **In-memory** (`lobby/internal/adapters/racenameinmem/directory.go`) —
in-process implementation used by unit tests that do not need
PostgreSQL and by deployments that select the in-memory backend with
`LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub` (the `stub` value is kept
unchanged for backward compatibility).
A future dedicated `Race Name Service` replaces the adapter without changing
the domain or service layer.
@@ -737,7 +739,7 @@ sequenceDiagram
- If the container starts but `Lobby` cannot persist the runtime binding metadata,
the start is a full failure: `Lobby` must issue a stop job to `Runtime Manager`
before setting `start_failed`.
with `reason=orphan_cleanup` before setting `start_failed`.
- If metadata is persisted but `Game Master` is unavailable, the game must be
placed in `paused`, not in `start_failed`. The container is alive; only the
platform tracking is incomplete.
@@ -745,6 +747,96 @@ sequenceDiagram
- Concurrent start attempts for the same game must be serialized; the second
attempt must fail if the first already moved the game to `starting`.
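The serialization rule in the last bullet can be sketched as a compare-and-swap on the game status. This is an in-memory illustration only: the `store` type, the `ErrConflict` sentinel, the `raceStarts` helper, and the `ready_to_start` pre-start status name are all assumptions, not names taken from the Lobby code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict is a hypothetical sentinel returned when the stored status
// no longer matches the status the caller expected.
var ErrConflict = errors.New("status conflict")

// store is a hypothetical in-memory game-status store.
type store struct {
	mu     sync.Mutex
	status map[string]string
}

// CompareAndSwapStatus moves gameID from `from` to `to` only if the stored
// status still equals `from`; otherwise it leaves the record untouched.
func (s *store) CompareAndSwapStatus(gameID, from, to string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != from {
		return ErrConflict
	}
	s.status[gameID] = to
	return nil
}

// raceStarts fires n concurrent start attempts and counts winners and losers.
func raceStarts(s *store, gameID string, n int) (won, lost int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// "ready_to_start" is an assumed pre-start status name.
			err := s.CompareAndSwapStatus(gameID, "ready_to_start", "starting")
			mu.Lock()
			defer mu.Unlock()
			if err == nil {
				won++
			} else {
				lost++
			}
		}()
	}
	wg.Wait()
	return won, lost
}

func main() {
	s := &store{status: map[string]string{"g1": "ready_to_start"}}
	won, lost := raceStarts(s, "g1", 8)
	fmt.Println(won, lost) // 1 7
}
```

Exactly one attempt observes the expected status and moves the game to `starting`; every later attempt sees `starting` and fails, which is the behaviour the bullet requires.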
### Runtime Manager envelopes
`Lobby` is the producer for both `runtime:start_jobs` and `runtime:stop_jobs`.
The `Lobby ↔ Runtime Manager` transport stays asynchronous indefinitely: no
synchronous Lobby→Runtime Manager REST call exists in v1, and none is planned for v2.
`runtime:start_jobs` envelope:
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | Lobby `game_id`. |
| `image_ref` | string | Docker reference resolved from `target_engine_version` via `LOBBY_ENGINE_IMAGE_TEMPLATE`. |
| `requested_at_ms` | int64 | UTC milliseconds; diagnostics only. |
`runtime:stop_jobs` envelope:
| Field | Type | Notes |
| --- | --- | --- |
| `game_id` | string | |
| `reason` | enum | `orphan_cleanup`, `cancelled`, `finished`, `admin_request`, `timeout`. |
| `requested_at_ms` | int64 | UTC milliseconds. |
`reason` semantics (Lobby producer side):
- `orphan_cleanup` — used by Lobby's runtime-job-result consumer to release a
container whose metadata persistence failed after a successful container
start.
- `cancelled` — used by the user-lifecycle cascade and by explicit cancel paths
for in-flight games.
- `finished` — reserved; not produced by Lobby in v1 because `game_finished`
is engine-driven and stop jobs after finish are an Admin/GM concern.
- `admin_request` — reserved for future admin-initiated stop paths through
Lobby; not produced in v1.
- `timeout` — reserved for future enrollment-timeout-driven stop paths; not
produced in v1.
### Design rationale: StopReason placement
The `StopReason` enum is declared in
`lobby/internal/ports/runtimemanager.go` alongside the `RuntimeManager`
interface that consumes it. The enum is publisher-side protocol: it
mirrors the AsyncAPI discriminator on `runtime:stop_jobs`, has no
behaviour beyond `Validate`, and co-locating it with the interface keeps
the AsyncAPI ↔ Go mapping visible in one file.
Alternatives considered and rejected:
- a dedicated `lobby/internal/domain/runtimejob` package — manufactures
a domain layer for a single string enum that exists only to be
serialised onto a Redis Stream;
- placing the enum in the publisher adapter package
(`lobby/internal/adapters/runtimemanager`) — the callers (start-game
service, runtime-job-result worker, user-lifecycle worker) live
outside that package and would have to depend on a concrete adapter
for an enum value.
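The enum and the `runtime:stop_jobs` envelope described above could be modelled along these lines. This is a sketch only: the `StopReason` name and its `Validate` method come from the text, and the string values come from the envelope table, but the constant names, the `StopJob` struct, the `encodeStopJob` helper, and the JSON encoding are assumptions made for illustration. The actual encoding used on the stream is not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StopReason mirrors the `reason` enum of the runtime:stop_jobs envelope.
type StopReason string

// Constant names are assumed; the string values come from the envelope table.
const (
	StopReasonOrphanCleanup StopReason = "orphan_cleanup"
	StopReasonCancelled     StopReason = "cancelled"
	StopReasonFinished      StopReason = "finished"
	StopReasonAdminRequest  StopReason = "admin_request"
	StopReasonTimeout       StopReason = "timeout"
)

// Validate rejects any value outside the documented enum.
func (r StopReason) Validate() error {
	switch r {
	case StopReasonOrphanCleanup, StopReasonCancelled, StopReasonFinished,
		StopReasonAdminRequest, StopReasonTimeout:
		return nil
	}
	return fmt.Errorf("unknown stop reason %q", string(r))
}

// StopJob is a hypothetical Go shape for the three envelope fields.
type StopJob struct {
	GameID        string     `json:"game_id"`
	Reason        StopReason `json:"reason"`
	RequestedAtMs int64      `json:"requested_at_ms"`
}

// encodeStopJob validates the reason before serialising the envelope.
func encodeStopJob(job StopJob) ([]byte, error) {
	if err := job.Reason.Validate(); err != nil {
		return nil, err
	}
	return json.Marshal(job)
}

func main() {
	out, err := encodeStopJob(StopJob{
		GameID:        "g1",
		Reason:        StopReasonOrphanCleanup,
		RequestedAtMs: 1700000000000,
	})
	fmt.Println(string(out), err)

	// A value outside the enum is refused before anything is serialised.
	_, err = encodeStopJob(StopJob{GameID: "g1", Reason: "bogus"})
	fmt.Println(err)
}
```

Because the type carries nothing but `Validate`, callers outside the publisher adapter can reference the constants without importing any transport code, which matches the placement argument above.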
### Design rationale: `engineimage.Resolver` validates the template at construction
`engineimage.Resolver` stores the validated template; the per-game
`Resolve(version)` call is therefore a pure string substitution that
cannot fail except on an empty or whitespace-only `version`.
`LOBBY_ENGINE_IMAGE_TEMPLATE` is loaded at startup. A malformed value
(missing `{engine_version}` placeholder, empty string) is an
operational misconfiguration that fails fast before any traffic arrives
— not on the first start-game request hours later. The synchronous
start handler then incurs no per-call template-shape recheck.
A stateless free function `engineimage.Resolve(template, version)` was
rejected: the only useful checkpoint for the template literal is at
startup; a free function would either re-validate on every call (waste)
or skip validation (regression).
The resolver only guards against an empty/whitespace `version`. Semver
validation lives in `lobby/internal/domain/game/model.go:validateSemver`
and runs at game-record construction time. Re-running it inside the
resolver would either duplicate the rule (drift risk) or import the
validator across package boundaries for no behavioural gain. Keeping the
resolver narrow leaves it reusable from a future producer (for example
`Game Master`, when it takes over `image_ref` resolution) without
dragging Lobby's domain rules along.
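The construction-time validation described above can be sketched as follows. The `Resolver` type, the `Resolve(version)` method, and the `{engine_version}` placeholder come from the text; the constructor name `NewResolver`, the field layout, and the error messages are assumptions for illustration and may not match the real `engineimage` package.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// placeholder is the literal token the template must contain.
const placeholder = "{engine_version}"

// Resolver holds a template that was validated once, at construction.
type Resolver struct {
	template string
}

// NewResolver fails fast on an empty template or a missing placeholder.
// The constructor name is an assumption; the text names only the type.
func NewResolver(template string) (*Resolver, error) {
	if strings.TrimSpace(template) == "" {
		return nil, errors.New("engine image template is empty")
	}
	if !strings.Contains(template, placeholder) {
		return nil, fmt.Errorf("engine image template %q lacks %s", template, placeholder)
	}
	return &Resolver{template: template}, nil
}

// Resolve substitutes the version into the template. It rejects only an
// empty or whitespace-only version and performs no semver validation.
func (r *Resolver) Resolve(version string) (string, error) {
	if strings.TrimSpace(version) == "" {
		return "", errors.New("engine version is empty")
	}
	return strings.ReplaceAll(r.template, placeholder, version), nil
}

func main() {
	r, err := NewResolver("galaxy/game:{engine_version}")
	if err != nil {
		panic(err)
	}
	ref, _ := r.Resolve("1.4.2")
	fmt.Println(ref) // galaxy/game:1.4.2

	// A template without the placeholder is refused at construction.
	_, err = NewResolver("galaxy/game:latest")
	fmt.Println(err != nil) // true
}
```

With the template checked once in the constructor, the per-game call has a single failure mode, so a misconfigured template surfaces at startup rather than on a start-game request.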
The defensive error return wrapped as `start game: resolve image ref: %w` in
`startgame.Service.Handle` guards against a future invariant
violation. The service-level test suite does not exercise it, because
the only resolver failure mode (empty `version`) requires bypassing
`game.Validate`, which `gameinmem.Save` always runs. Adding test
scaffolding to skip validation would teach the test suite a back door
that the production code path does not have.
## Paused State
`Lobby.paused` is a platform-level pause, distinct from `Game Master` runtime
@@ -1135,6 +1227,14 @@ Stream names:
- `LOBBY_RUNTIME_JOB_RESULTS_READ_BLOCK_TIMEOUT` with default `2s`
- `LOBBY_NOTIFICATION_INTENTS_STREAM` with default `notification:intents`
Runtime Manager integration:
- `LOBBY_ENGINE_IMAGE_TEMPLATE` with default `galaxy/game:{engine_version}` —
template string applied to a game's `target_engine_version` to resolve
the Docker `image_ref` published on `runtime:start_jobs`. The template
must contain the literal placeholder `{engine_version}`; Lobby fails
fast at startup otherwise.
Upstream clients:
- `LOBBY_USER_SERVICE_TIMEOUT` with default `1s`
@@ -1264,6 +1364,18 @@ Key operations emit structured logs with these stable field names where applicab
## Verification
Test doubles are split between two styles. Wide-surface ports with no
production state (`RuntimeManager`, `IntentPublisher`, `GMClient`,
`UserService`) use `gomock`-generated mocks under
`internal/adapters/mocks/`; regenerate with `make -C lobby mocks`.
Stateful behavioural fakes that mirror the production adapter
contract (`gameinmem`, `applicationinmem`, `inviteinmem`,
`membershipinmem`, `gameturnstatsinmem`, `racenameinmem`,
`evaluationguardinmem`, `gapactivationinmem`, `streamoffsetinmem`)
live as in-memory adapters under `internal/adapters/<name>inmem/`
and stay hand-rolled because tests rely on their CAS, status-transition,
and invariant-tracking behaviour.
Focused service-local coverage verifies:
- configuration loading and validation for all env var groups
@@ -1274,7 +1386,7 @@ Focused service-local coverage verifies:
- application flow: submit (eligibility check, race name check), approve, reject
- invite flow: create, redeem (auto-membership), decline, revoke, expire on enrollment close
- membership model: activate, remove, block with correct before/after-start semantics
- Race Name Directory (redis + stub adapters against the same suite):
- Race Name Directory (PostgreSQL + in-memory adapters against the same suite):
canonicalization + confusable-pair policy, `Reserve`/`ReleaseReservation`
per-game semantics, `MarkPendingRegistration`/`ExpirePendingRegistrations`
window, `Register` idempotency + quota, `ReleaseAllByUser` cascade