galaxy-game/gamemaster/docs/stage03-existing-service-docs-sync.md
2026-05-03 07:59:03 +02:00


stage: 03
title: Existing-service docs sync (Lobby, Notification, Game, RTM)

Stage 03 — Existing-service docs sync

This decision record captures the non-obvious choices made while synchronising the README of every touched service with the post-Game-Master contract before any code change lands. The mechanical edits (strikethrough renames, removal of ships_built, replacement of the engineimage.Resolver block) are not enumerated here; they are direct consequences of the rules already recorded in ../README.md and ../../ARCHITECTURE.md.

Context

Stage 03 had to reach a state where every README in the repository agreed on three new contractual rules before any service-level code landed:

  • image_ref is resolved synchronously from Game Master's engine version registry, not from a Go-template held by Game Lobby.
  • A new outgoing POST /api/v1/internal/games/{game_id}/memberships/invalidate hook from Game Lobby into Game Master fires post-commit on every roster mutation.
  • The engine container splits its REST surface into /api/v1/admin/* (GM-only) and /api/v1/{command,order,report} (player), and StateResponse carries a new boolean finished field that GM uses as the sole finish signal.
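As a rough illustration of the third rule, the following Go sketch shows how a caller might read the finished flag and tell the two REST surfaces apart. Only the finished field and the path prefixes come from the text above; the function names, the exact-match treatment of the player paths, and the sample turn field and /api/v1/admin/state path are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// StateResponse models only the field named in this record. Any other
// fields of the real payload are ignored by the decoder.
type StateResponse struct {
	Finished bool `json:"finished"`
}

// gameFinished decodes a state payload and returns the finish signal.
func gameFinished(body []byte) (bool, error) {
	var s StateResponse
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return s.Finished, nil
}

// isAdminRoute reports whether a path belongs to the GM-only surface.
func isAdminRoute(path string) bool {
	return strings.HasPrefix(path, "/api/v1/admin/")
}

// isPlayerRoute reports whether a path is one of the three player endpoints.
// Assumption: the endpoints are matched exactly, without sub-paths.
func isPlayerRoute(path string) bool {
	switch path {
	case "/api/v1/command", "/api/v1/order", "/api/v1/report":
		return true
	}
	return false
}

func main() {
	// "turn" is a made-up extra field to show that unknown keys are ignored.
	done, err := gameFinished([]byte(`{"turn":12,"finished":true}`))
	fmt.Println(done, err)
	fmt.Println(isAdminRoute("/api/v1/admin/state"), isPlayerRoute("/api/v1/order"))
}
```

A payload without the field decodes to finished=false, so an engine that predates the field would simply never signal a finish in this sketch.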

Three decisions were not derivable from the GM README and required a deliberate choice while editing lobby/README.md, game/README.md, and rtmanager/README.md.

Decision 1 — lobby.game.start failure modes for GM-driven image resolve

Game Lobby now calls GET /api/v1/internal/engine-versions/{version}/image-ref synchronously before publishing runtime:start_jobs. The contract defines two new failure modes for the lobby.game.start command:

  • GM unreachable (network error, timeout, 5xx) ⇒ lobby.game.start returns service_unavailable; the game stays in ready_to_start. No container is created, no envelope is published.
  • GM reports the version is missing or deprecated (404 or engine_version_not_found payload) ⇒ lobby.game.start returns engine_version_not_found; the game stays in ready_to_start.

Both error codes were added to the stable error code list in lobby/README.md. They are deliberately distinct from the existing GM-unavailable-after-container-start path, which transitions the game to paused (the container is alive; only platform tracking is missing). Conflating the two would force operators to inspect the paused set for misconfigurations that never produced a container.
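The mapping from resolve outcome to error code can be sketched in Go as follows. This is a minimal illustration, not the Lobby implementation: classifyResolve, probeImageRef, the 2-second timeout, and the choice to treat unexpected 4xx statuses as service_unavailable are all assumptions, and parsing of the response body is left out.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

const (
	codeServiceUnavailable    = "service_unavailable"
	codeEngineVersionNotFound = "engine_version_not_found"
)

// classifyResolve maps the outcome of the image-ref call to a
// lobby.game.start error code. An empty string means success.
func classifyResolve(status int, payloadCode string, transportErr error) string {
	switch {
	case transportErr != nil: // network error or timeout
		return codeServiceUnavailable
	case status == http.StatusNotFound || payloadCode == codeEngineVersionNotFound:
		return codeEngineVersionNotFound
	case status >= 500:
		return codeServiceUnavailable
	case status >= 200 && status <= 299:
		return ""
	default:
		// Assumption: any other status is reported as unavailable.
		return codeServiceUnavailable
	}
}

// probeImageRef calls the GM endpoint and classifies the result.
// Body parsing is omitted, so the payload code is always empty here.
func probeImageRef(ctx context.Context, c *http.Client, gmBaseURL, version string) string {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // assumed timeout
	defer cancel()
	u := gmBaseURL + "/api/v1/internal/engine-versions/" + url.PathEscape(version) + "/image-ref"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return classifyResolve(0, "", err)
	}
	resp, err := c.Do(req)
	if err != nil {
		return classifyResolve(0, "", err)
	}
	defer resp.Body.Close()
	return classifyResolve(resp.StatusCode, "", nil)
}

func main() {
	// Stand-in GM: one unknown version answers 404, everything else 502.
	gm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/api/v1/internal/engine-versions/9.9.9/image-ref" {
			w.WriteHeader(http.StatusNotFound)
			return
		}
		w.WriteHeader(http.StatusBadGateway)
	}))
	defer gm.Close()

	fmt.Println(probeImageRef(context.Background(), gm.Client(), gm.URL, "9.9.9"))
	fmt.Println(probeImageRef(context.Background(), gm.Client(), gm.URL, "1.0.0"))
}
```

In both failure branches the caller would leave the game in ready_to_start and publish nothing, which is what separates this path from the paused transition.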

Alternatives considered and rejected:

  • treat GM-unavailable at resolve time as paused for symmetry with the later path — rejected because no container exists, so the lobby.runtime_paused_after_start admin notification (which announces a stranded container) would be a lie;
  • silently fall back to a Go-template default when GM is unreachable — rejected because it brings back the very coupling the stage is retiring and lets a misconfigured registry slip through unnoticed.

Decision 2 — Membership invalidate hook is fail-open

The new outgoing POST /api/v1/internal/games/{game_id}/memberships/invalidate call from approveapplication, rejectapplication, redeeminvite, removemember, blockmember, and the user-lifecycle cascade worker is documented as fail-open: a non-2xx response is logged and metered but never rolls back the Lobby commit. If the hook fails, GM's membership cache serves stale data only until the entry's TTL expires.
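A fail-open post-commit hook of this kind might look like the Go sketch below. The names mutateRoster, invalidateMemberships, hookError, and hookFailures are hypothetical, as is the 2-second timeout; the counter stands in for whatever metric the service really exposes.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
	"sync/atomic"
	"time"
)

// hookFailures stands in for a real metrics counter (hypothetical name).
var hookFailures atomic.Int64

// hookError reports a non-2xx answer from GM.
type hookError struct{ status int }

func (e *hookError) Error() string {
	return fmt.Sprintf("invalidate hook: status %d", e.status)
}

// invalidateMemberships posts to the GM invalidate endpoint for one game.
func invalidateMemberships(ctx context.Context, c *http.Client, gmBaseURL, gameID string) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // assumed timeout
	defer cancel()
	u := gmBaseURL + "/api/v1/internal/games/" + url.PathEscape(gameID) + "/memberships/invalidate"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, u, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return &hookError{status: resp.StatusCode}
	}
	return nil
}

// mutateRoster commits first, then fires the hook. Only a commit failure is
// returned; a hook failure is logged and counted, never propagated.
func mutateRoster(commit func() error, hook func() error) error {
	if err := commit(); err != nil {
		return err // nothing was committed, so the hook is not called
	}
	if err := hook(); err != nil {
		hookFailures.Add(1)
		log.Printf("membership invalidate failed (ignored): %v", err)
	}
	return nil
}

func main() {
	// Stand-in GM that always answers 503.
	gm := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer gm.Close()

	err := mutateRoster(
		func() error { return nil }, // pretend the roster commit succeeded
		func() error {
			return invalidateMemberships(context.Background(), gm.Client(), gm.URL, "game-1")
		},
	)
	fmt.Println("commit error:", err, "hook failures:", hookFailures.Load())
}
```

The point of the shape is that the hook's error value never leaves mutateRoster, so no caller can accidentally turn a GM outage into a failed roster mutation.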

This matches the architectural rule that a failed cross-service hook must not invalidate an already committed business state. The TTL on GM's in-process membership cache (default 30s) bounds the staleness window; the explicit hook only narrows that window in the common case where the call succeeds.
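The interplay between the TTL bound and the explicit invalidation can be sketched with a small in-process cache. This is an illustrative Go model, not GM's actual cache: membershipCache, fakeClock, and the loader callback are invented for the example, and only the 30s default comes from the text.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultTTL = 30 * time.Second // default named in this record

type entry struct {
	members  []string
	loadedAt time.Time
}

// membershipCache is a hypothetical TTL cache keyed by game ID.
type membershipCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	now  func() time.Time
	load func(gameID string) []string
	data map[string]entry
}

func newMembershipCache(ttl time.Duration, now func() time.Time, load func(string) []string) *membershipCache {
	return &membershipCache{ttl: ttl, now: now, load: load, data: map[string]entry{}}
}

// Get returns the cached roster while it is younger than the TTL and
// reloads it otherwise. The loader runs under the lock for simplicity.
func (c *membershipCache) Get(gameID string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.data[gameID]; ok && c.now().Sub(e.loadedAt) < c.ttl {
		return e.members
	}
	m := c.load(gameID)
	c.data[gameID] = entry{members: m, loadedAt: c.now()}
	return m
}

// Invalidate drops one game's entry; this is what the hook would trigger.
func (c *membershipCache) Invalidate(gameID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.data, gameID)
}

// fakeClock lets the example move time forward deterministically.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time      { return f.t }
func (f *fakeClock) Advance(seconds int) { f.t = f.t.Add(time.Duration(seconds) * time.Second) }

func main() {
	clock := &fakeClock{}
	roster := []string{"alice"}
	cache := newMembershipCache(defaultTTL, clock.Now, func(string) []string { return roster })

	fmt.Println(cache.Get("game-1")) // first load
	roster = []string{"alice", "bob"}
	fmt.Println(cache.Get("game-1")) // hook missed: still the old roster
	clock.Advance(31)
	fmt.Println(cache.Get("game-1")) // TTL expired: reloaded
}
```

When the hook does arrive, a call to Invalidate forces the next Get to reload immediately instead of waiting for expiry.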

Alternatives considered and rejected:

  • two-phase commit across Lobby and GM — rejected: GM is allowed to be unavailable without rolling back Lobby's roster mutation;
  • queue the invalidation on a Redis Stream and let GM consume it asynchronously — rejected for v1 because it introduces a new stream contract for a rare event, while the synchronous post-commit call is cheap enough that the reduced staleness is worth its operational cost.

Decision 3 — Keep runtime:start_jobs envelope shape unchanged

The runtime:start_jobs envelope continues to carry image_ref as a top-level string field. Only the source of that string changes (from a Lobby-side template substitution to a Lobby-side synchronous call into GM). Runtime Manager (RTM) does not need a contract change in this stage and does not learn about engine versions; it still receives a ready-to-pull Docker reference.
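To make the "same shape, different source" point concrete, here is a small Go sketch in which the resolver is pluggable while the serialized envelope stays the same. Only the image_ref field is taken from this record; the game_id field, the type and function names, and the sample registry reference are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// startJobEnvelope sketches a runtime:start_jobs message.
type startJobEnvelope struct {
	GameID   string `json:"game_id"`   // assumed field, not confirmed by this record
	ImageRef string `json:"image_ref"` // top-level string, unchanged in this stage
}

// imageResolver abstracts where the image reference comes from:
// formerly a Lobby-side template, now a synchronous call into GM.
type imageResolver func(engineVersion string) (string, error)

// buildStartJob resolves the image and serializes the envelope.
// The consumer sees the same JSON whichever resolver is used.
func buildStartJob(gameID, engineVersion string, resolve imageResolver) ([]byte, error) {
	ref, err := resolve(engineVersion)
	if err != nil {
		return nil, err // nothing is published when resolution fails
	}
	return json.Marshal(startJobEnvelope{GameID: gameID, ImageRef: ref})
}

func main() {
	// Stand-in for the GM lookup; the registry host is a placeholder.
	fromGM := func(v string) (string, error) {
		return "registry.example/engine:" + v, nil
	}
	b, err := buildStartJob("game-1", "1.4.2", fromGM)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(b))
}
```

Because the engine version never appears in the envelope, the consumer in this sketch has nothing to compare against the registry, which mirrors the reasoning behind the rejected alternatives.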

Alternatives considered and rejected:

  • replace image_ref with engine_version and have RTM resolve the image — rejected: it would force RTM to call GM, which violates the rule that RTM has no upstream service dependencies for runtime operations;
  • attach the resolved version metadata to the envelope alongside image_ref — rejected: RTM has no consumer for the metadata and carrying it would invite divergence between Lobby and RTM views of the engine version registry.

References