---
stage: 03
title: Existing-service docs sync (Lobby, Notification, Game, RTM)
---

# Stage 03 — Existing-service docs sync

This decision record captures the non-obvious choices made while synchronising every touched-service README with the post-Game-Master contract before any code change lands. The mechanical edits (strikethrough renames, drop of `ships_built`, replacement of the `engineimage.Resolver` block) are not enumerated here — they are direct consequences of the rules already recorded in [`../README.md`](../README.md) and [`../../ARCHITECTURE.md`](../../ARCHITECTURE.md).

## Context

Stage 03 had to reach a state where every README in the repository agreed on three new contractual rules before any service-level code landed:

- `image_ref` is resolved synchronously from `Game Master`'s engine version registry, not from a Go-template held by `Game Lobby`.
- A new outgoing `POST /api/v1/internal/games/{game_id}/memberships/invalidate` hook from `Game Lobby` into `Game Master` fires post-commit on every roster mutation.
- The engine container splits its REST surface into `/api/v1/admin/*` (GM-only) and `/api/v1/{command,order,report}` (player), and `StateResponse` carries a new boolean `finished` field that GM uses as the sole finish signal.

Three decisions were not derivable from the GM README and required a deliberate choice while editing `lobby/README.md`, `game/README.md`, and `rtmanager/README.md`.

## Decision 1 — `lobby.game.start` failure modes for GM-driven image resolve

`Game Lobby` now calls `GET /api/v1/internal/engine-versions/{version}/image-ref` synchronously before publishing `runtime:start_jobs`. The contract defines two new failure modes for the `lobby.game.start` command:

- GM unreachable (network error, timeout, `5xx`) ⇒ `lobby.game.start` returns `service_unavailable`; the game stays in `ready_to_start`. No container is created, no envelope is published.
- GM reports the version is missing or deprecated (`404` or `engine_version_not_found` payload) ⇒ `lobby.game.start` returns `engine_version_not_found`; the game stays in `ready_to_start`.

Both error codes were added to the stable error code list in `lobby/README.md`. They are deliberately distinct from the existing GM-unavailable-after-container-start path, which transitions the game to `paused` (the container is alive; only platform tracking is missing). Conflating the two would force operators to inspect the `paused` set for misconfigurations that never produced a container.

Alternatives considered and rejected:

- treat GM-unavailable at resolve time as `paused` for symmetry with the later path — rejected because no container exists, so the `lobby.runtime_paused_after_start` admin notification (which announces a stranded container) would be a lie;
- silently fall back to a Go-template default when GM is unreachable — rejected because it brings back the very coupling the stage is retiring and lets a misconfigured registry slip through unnoticed.

## Decision 2 — Membership invalidate hook is fail-open

The new outgoing `POST /api/v1/internal/games/{game_id}/memberships/invalidate` call from `approveapplication`, `rejectapplication`, `redeeminvite`, `removemember`, `blockmember`, and the user-lifecycle cascade worker is documented as **fail-open**: a non-2xx response is logged and metered but never rolls back the Lobby commit. GM's TTL safety net catches stale data within the next cache TTL window.

This matches the architectural rule that a failed cross-service hook must not invalidate an already committed business state. The TTL on GM's in-process membership cache (default `30s`) bounds the staleness window; the explicit hook only optimises for the time between commit and TTL expiry.

Alternatives considered and rejected:

- two-phase commit across Lobby and GM — rejected: GM is allowed to be unavailable without rolling back Lobby's roster mutation;
- queue the invalidation on a Redis Stream and let GM consume it asynchronously — rejected for v1 because it introduces a new stream contract for a rare event, and the synchronous post-commit call is cheap enough that the staleness reduction beats the operational cost.

## Decision 3 — Keep `runtime:start_jobs` envelope shape unchanged

The `runtime:start_jobs` envelope continues to carry `image_ref` as a top-level string field. Only the source of that string changes (from a Lobby-side template substitution to a Lobby-side synchronous call into GM). `Runtime Manager` does not need a contract change in this stage and does not learn about engine versions — it still receives a ready-to-pull Docker reference.

Alternatives considered and rejected:

- replace `image_ref` with `engine_version` and have RTM resolve the image — rejected: it would force RTM to call GM, which violates the rule that RTM has no upstream service dependencies for runtime operations;
- attach the resolved version metadata to the envelope alongside `image_ref` — rejected: RTM has no consumer for the metadata and carrying it would invite divergence between Lobby and RTM views of the engine version registry.

## References

- [`../PLAN.md` Stage 03](../PLAN.md)
- [`../README.md`](../README.md) — Game Master service description.
- [`../../lobby/README.md`](../../lobby/README.md) — updated Game Start Flow, internal trusted REST, configuration, and error codes.
- [`../../game/README.md`](../../game/README.md) — admin path layout, `StateResponse.finished`, `/admin/race/banish` shape.
- [`../../rtmanager/README.md`](../../rtmanager/README.md) — `runtime:health_events` consumer note.
- [`../../notification/README.md`](../../notification/README.md) — GM as the producer of the three `game.*` notification types.