| stage | title |
|---|---|
| 03 | Existing-service docs sync (Lobby, Notification, Game, RTM) |
Stage 03 — Existing-service docs sync
This decision record captures the non-obvious choices made while
synchronising every touched-service README with the post-Game-Master
contract before any code change lands. The mechanical edits
(strikethrough renames, drop of ships_built, replacement of the
engineimage.Resolver block) are not enumerated here — they are direct
consequences of the rules already recorded in
../README.md and
../../ARCHITECTURE.md.
Context
Stage 03 had to reach a state where every README in the repository agreed on three new contractual rules before any service-level code landed:
- image_ref is resolved synchronously from Game Master's engine version registry, not from a Go-template held by Game Lobby.
- A new outgoing POST /api/v1/internal/games/{game_id}/memberships/invalidate hook from Game Lobby into Game Master fires post-commit on every roster mutation.
- The engine container splits its REST surface into /api/v1/admin/* (GM-only) and /api/v1/{command,order,report} (player), and StateResponse carries a new boolean finished field that GM uses as the sole finish signal.
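
The third rule, the split of the engine REST surface, can be illustrated with a small path classifier. This is a minimal sketch, not the engine's actual router: the Surface type, its constants, and ClassifyPath are hypothetical names introduced here, and the example paths in main are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// Surface names the audience of an engine route (hypothetical type).
type Surface string

const (
	SurfaceAdmin   Surface = "admin"   // /api/v1/admin/*, GM-only
	SurfacePlayer  Surface = "player"  // /api/v1/{command,order,report}
	SurfaceUnknown Surface = "unknown" // anything outside the two groups
)

// ClassifyPath maps a request path onto the split REST surface.
func ClassifyPath(path string) Surface {
	const prefix = "/api/v1/"
	if !strings.HasPrefix(path, prefix) {
		return SurfaceUnknown
	}
	rest := strings.TrimPrefix(path, prefix)
	// Admin routes must live below the admin/ segment.
	if strings.HasPrefix(rest, "admin/") {
		return SurfaceAdmin
	}
	// Player routes are identified by their first path segment.
	first, _, _ := strings.Cut(rest, "/")
	switch first {
	case "command", "order", "report":
		return SurfacePlayer
	}
	return SurfaceUnknown
}

func main() {
	for _, p := range []string{
		"/api/v1/admin/race/banish",
		"/api/v1/order",
		"/api/v1/report",
		"/healthz",
	} {
		fmt.Printf("%-28s %s\n", p, ClassifyPath(p))
	}
}
```

A classifier like this could back a middleware that rejects non-GM callers on the admin surface; how callers are authenticated is outside the scope of this record.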
Three decisions were not derivable from the GM README and required a
deliberate choice while editing lobby/README.md, game/README.md,
and rtmanager/README.md.
Decision 1 — lobby.game.start failure modes for GM-driven image resolve
Game Lobby now calls
GET /api/v1/internal/engine-versions/{version}/image-ref synchronously
before publishing runtime:start_jobs. The contract defines two new
failure modes for the lobby.game.start command:
- GM unreachable (network error, timeout, 5xx) ⇒ lobby.game.start returns service_unavailable; the game stays in ready_to_start. No container is created, no envelope is published.
- GM reports the version is missing or deprecated (404 or engine_version_not_found payload) ⇒ lobby.game.start returns engine_version_not_found; the game stays in ready_to_start.
Both error codes were added to the stable error code list in
lobby/README.md. They are deliberately distinct from the existing
GM-unavailable-after-container-start path, which transitions the game to
paused (the container is alive; only platform tracking is missing).
Conflating the two would force operators to inspect the paused set
for misconfigurations that never produced a container.
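
The mapping from the image-ref call outcome to the two error codes can be sketched as follows. This is an illustration under assumptions, not Lobby's actual code: StartError, ClassifyResolve, CodeOf, and errSimulatedTimeout are hypothetical names, and the handling of statuses the contract does not mention (the default branch) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stable error codes named in the contract.
const (
	CodeServiceUnavailable    = "service_unavailable"
	CodeEngineVersionNotFound = "engine_version_not_found"
)

// StartError is a hypothetical error type carrying a stable code.
type StartError struct{ Code string }

func (e *StartError) Error() string { return e.Code }

// errSimulatedTimeout stands in for a transport failure in the demo.
var errSimulatedTimeout = errors.New("simulated timeout")

// ClassifyResolve maps the outcome of the image-ref call onto the failure
// modes. transportErr is non-nil for network errors and timeouts; status and
// payloadCode describe the response when one arrived.
func ClassifyResolve(transportErr error, status int, payloadCode string) error {
	switch {
	case transportErr != nil, status >= 500:
		return &StartError{Code: CodeServiceUnavailable}
	case status == http.StatusNotFound, payloadCode == CodeEngineVersionNotFound:
		return &StartError{Code: CodeEngineVersionNotFound}
	case status >= 200 && status < 300:
		return nil // resolved; the caller may publish runtime:start_jobs
	default:
		// Not covered by the contract; treating it as unavailable is an assumption.
		return &StartError{Code: CodeServiceUnavailable}
	}
}

// CodeOf extracts the stable code, or "" when err is nil or of another type.
func CodeOf(err error) string {
	var se *StartError
	if errors.As(err, &se) {
		return se.Code
	}
	return ""
}

func main() {
	cases := []struct {
		name   string
		err    error
		status int
	}{
		{"timeout", errSimulatedTimeout, 0},
		{"gm 503", nil, http.StatusServiceUnavailable},
		{"gm 404", nil, http.StatusNotFound},
		{"ok", nil, http.StatusOK},
	}
	for _, c := range cases {
		code := CodeOf(ClassifyResolve(c.err, c.status, ""))
		if code == "" {
			code = "(no error)"
		}
		fmt.Printf("%-8s -> %s\n", c.name, code)
	}
}
```

In both failure branches the caller would leave the game in ready_to_start and publish nothing, which is what keeps these cases out of the paused set.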
Alternatives considered and rejected:
- treat GM-unavailable at resolve time as paused for symmetry with the later path: rejected because no container exists, so the lobby.runtime_paused_after_start admin notification (which announces a stranded container) would be a lie;
- silently fall back to a Go-template default when GM is unreachable: rejected because it brings back the very coupling the stage is retiring and lets a misconfigured registry slip through unnoticed.
Decision 2 — Membership invalidate hook is fail-open
The new outgoing
POST /api/v1/internal/games/{game_id}/memberships/invalidate call from
approveapplication, rejectapplication, redeeminvite,
removemember, blockmember, and the user-lifecycle cascade worker is
documented as fail-open: a non-2xx response is logged and metered
but never rolls back the Lobby commit. GM's TTL safety net catches
stale data within the next cache TTL window.
This matches the architectural rule that a failed cross-service hook
must not invalidate an already committed business state. The TTL on
GM's in-process membership cache (default 30s) bounds the staleness
window; the explicit hook only optimises for the time between commit
and TTL expiry.
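
The fail-open behaviour can be sketched as a commit step followed by a hook whose failure is only logged and counted. This is a minimal illustration, assuming hypothetical names throughout (InvalidateMemberships, CommitThenInvalidate, errSimulated, the failures counter standing in for a metric); only the endpoint path is taken from the contract above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// errSimulated stands in for a failing commit in the demo.
var errSimulated = errors.New("simulated failure")

// InvalidateMemberships posts the hook and returns an error on any
// transport failure or non-2xx status.
func InvalidateMemberships(ctx context.Context, c *http.Client, baseURL, gameID string) error {
	endpoint := fmt.Sprintf("%s/api/v1/internal/games/%s/memberships/invalidate",
		baseURL, url.PathEscape(gameID))
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("invalidate hook: unexpected status %d", resp.StatusCode)
	}
	return nil
}

// CommitThenInvalidate runs commit, then fires the hook. A hook failure is
// logged and counted but never returned, so it cannot roll back the commit.
func CommitThenInvalidate(commit, hook func() error, failures *int) error {
	if err := commit(); err != nil {
		return err // nothing was committed, so the hook is not fired
	}
	if err := hook(); err != nil {
		*failures++ // stand-in for a metric increment
		log.Printf("membership invalidate failed (fail-open): %v", err)
	}
	return nil
}

func main() {
	// Fake GM that always answers 503, to show the commit still succeeds.
	gm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer gm.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	hook := func() error {
		return InvalidateMemberships(context.Background(), client, gm.URL, "game-42")
	}

	failures := 0
	err := CommitThenInvalidate(func() error { return nil }, hook, &failures)
	fmt.Println("commit ok, GM down:", err, "hook failures:", failures)

	err = CommitThenInvalidate(func() error { return errSimulated }, hook, &failures)
	fmt.Println("commit failed:", err, "hook failures:", failures)
}
```

The hook runs strictly after the commit returns, so the worst case is a stale membership cache in GM until its TTL expires.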
Alternatives considered and rejected:
- two-phase commit across Lobby and GM — rejected: GM is allowed to be unavailable without rolling back Lobby's roster mutation;
- queue the invalidation on a Redis Stream and let GM consume it asynchronously — rejected for v1 because it introduces a new stream contract for a rare event, and the synchronous post-commit call is cheap enough that the staleness reduction beats the operational cost.
Decision 3 — Keep runtime:start_jobs envelope shape unchanged
The runtime:start_jobs envelope continues to carry image_ref as a
top-level string field. Only the source of that string changes (from a
Lobby-side template substitution to a Lobby-side synchronous call into
GM). Runtime Manager does not need a contract change in this stage
and does not learn about engine versions — it still receives a
ready-to-pull Docker reference.
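
A sketch of what the unchanged envelope implies for the producer side: the image reference is resolved before the envelope is built, and no engine version travels with it. Only the top-level image_ref string comes from the contract; the StartJob type, the game_id field, BuildStartJob, and the sample registry URL are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StartJob sketches a runtime:start_jobs envelope.
type StartJob struct {
	GameID   string `json:"game_id"`   // hypothetical field, not confirmed by the contract
	ImageRef string `json:"image_ref"` // top-level, ready-to-pull Docker reference
}

// BuildStartJob takes an already resolved image reference. The engine
// version never enters the envelope, so RTM stays unaware of the registry.
func BuildStartJob(gameID, imageRef string) ([]byte, error) {
	return json.Marshal(StartJob{GameID: gameID, ImageRef: imageRef})
}

func main() {
	b, err := BuildStartJob("game-42", "registry.example.com/engine:1.4.2")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```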
Alternatives considered and rejected:
- replace image_ref with engine_version and have RTM resolve the image: rejected because it would force RTM to call GM, which violates the rule that RTM has no upstream service dependencies for runtime operations;
- attach the resolved version metadata to the envelope alongside image_ref: rejected because RTM has no consumer for the metadata and carrying it would invite divergence between Lobby and RTM views of the engine version registry.
References
- ../PLAN.md Stage 03
- ../README.md: Game Master service description.
- ../../lobby/README.md: updated Game Start Flow, internal trusted REST, configuration, and error codes.
- ../../game/README.md: admin path layout, StateResponse.finished, /admin/race/banish shape.
- ../../rtmanager/README.md: runtime:health_events consumer note.
- ../../notification/README.md: GM as the producer of the three game.* notification types.