2026-05-03 07:59:03 +02:00

Stage 16 — Hot-path services and membership cache

This decision record captures the non-obvious choices made while implementing the gateway-facing trio of player services (commandexecute, orderput, reportget) and the in-process membership cache that authorises every hot-path call. It is the last service-layer stage before Stage 17 (admin operations) and Stage 19 (REST handlers and wiring).

Context

../PLAN.md Stage 16 ships four components that together make the player surface usable:

  1. service/membership — concurrent in-process LRU cache holding the per-game user_id → status projection from Lobby /api/v1/internal/games/{game_id}/memberships. TTL is the safety net; the explicit invalidation hook from Lobby is the primary staleness control.
  2. service/commandexecute — orchestrator behind POST /api/v1/internal/games/{game_id}/commands. Authorises the caller, resolves actor=race_name, reshapes the JSON envelope, and forwards PUT /api/v1/command to the engine.
  3. service/orderput — same shape as commandexecute, targeting the engine PUT /api/v1/order.
  4. service/reportget — orchestrator behind GET /api/v1/internal/games/{game_id}/reports/{turn}. Authorises the caller, resolves race_name, and forwards GET /api/v1/report?player=<race>&turn=<turn> to the engine.

The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle, plus a private classifyEngineError helper) is Stage 15's service/turngeneration. Six decisions deviate from a literal reading of the README, the OpenAPI surface, or the turngeneration precedent. Each is recorded below.

Decisions

D1. reportget does not require runtime_records.status = running

Decision. service/reportget accepts any non-deleted runtime row and forwards the read to the engine. runtime_not_running is not part of reportget's error vocabulary (errors.go). commandexecute and orderput, by contrast, reject anything other than StatusRunning with runtime_not_running.

Why. Three signals point at the same conclusion:

  • The OpenAPI surface for internalGetReport (api/internal-openapi.yaml lines 546–575) lists only 403 / 404 / 502 / 500 responses; there is no 409 / runtime_not_running on the report path. The corresponding error responses on commands and orders (lines 502, 540) do include 409.
  • The README §Reports flow (../README.md lines 508–520) lists only authorisation, race-name resolution, and engine forwarding. The preceding §Player commands and orders block (lines 492–506) lists the status=running precondition explicitly. The two sections are worded differently by design.
  • A finished or stopped runtime is a normal target for a post-mortem read of older turns. Refusing the read forces operators to use ad-hoc database access for the same data the engine already exposes.

When the engine container is genuinely gone (e.g., the runtime is in engine_unreachable status), the forwarded read fails and engine_unreachable remains the natural outcome; no extra branch is required.
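The difference between the two gates can be illustrated with a small sketch. The function names `gateWrite` and `gateRead` and the status strings other than "running" are hypothetical; only the runtime_not_running code and the running precondition come from the text above.

```go
package main

import "fmt"

const (
	statusRunning        = "running"
	codeRuntimeNotRunning = "runtime_not_running"
)

// gateWrite mirrors the commandexecute / orderput rule: anything other
// than a running runtime is rejected with runtime_not_running.
func gateWrite(status string) (errorCode string) {
	if status != statusRunning {
		return codeRuntimeNotRunning
	}
	return ""
}

// gateRead mirrors the reportget rule: once the runtime row exists,
// its status is not a reason to refuse the read.
func gateRead(status string) (errorCode string) {
	return ""
}

func main() {
	// "stopped" and "finished" are illustrative status values (assumption).
	for _, s := range []string{"running", "stopped", "finished"} {
		fmt.Printf("%-9s write=%q read=%q\n", s, gateWrite(s), gateRead(s))
	}
}
```

If the engine behind a non-running runtime cannot be reached, the read fails later at the forwarding step rather than at this gate.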

This decision was confirmed with the user during plan-mode review.

D2. GM rewrites the engine envelope (commands → cmd, inject actor)

Decision. commandexecute.rewriteCommandPayload and the parallel orderput.rewriteOrderPayload unmarshal the GM ExecuteCommandsRequest / PutOrdersRequest body as map[string]json.RawMessage, take the commands field, and emit a fresh JSON object containing only actor (set to the resolved race name) and cmd (carrying the original array). Every other top-level key is dropped. The OpenAPI descriptions for ExecuteCommandsRequest and PutOrdersRequest were updated in the same patch to document the rewrite.
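The rewrite described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: the error handling, the output struct, and the sample command shape (`{"type":"move"}`) are hypothetical; only the map[string]json.RawMessage decoding, the commands → cmd rename, and the injected actor come from the decision text.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rewriteCommandPayload reshapes a GM request body into the engine envelope.
// It keeps only the "commands" array (re-keyed as "cmd") and sets "actor"
// from the resolved race name; every other top-level key is dropped.
func rewriteCommandPayload(body []byte, raceName string) ([]byte, error) {
	var in map[string]json.RawMessage
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	commands, ok := in["commands"]
	if !ok {
		// How the real service reports this case is not stated in the source.
		return nil, errors.New("request has no commands field")
	}
	out := struct {
		Actor string          `json:"actor"`
		Cmd   json.RawMessage `json:"cmd"`
	}{Actor: raceName, Cmd: commands}
	return json.Marshal(out)
}

func main() {
	// A caller-supplied "actor" is ignored: identity comes from the GM side.
	body := []byte(`{"actor":"spoofed","commands":[{"type":"move"}],"extra":1}`)
	out, err := rewriteCommandPayload(body, "Zorg")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"actor":"Zorg","cmd":[{"type":"move"}]}
}
```

Because the commands array travels as json.RawMessage, GM does not need to understand individual command shapes to forward them.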

Why. The literal "forwarded verbatim" wording in the original Stage 06 OpenAPI description conflicted with two upstream constraints:

  • The engine CommandRequest schema in game/openapi.yaml lines 345–364 declares actor and cmd as required, with no top-level commands.
  • The README §Hot Path rule "GM never trusts a payload field for actor identification" (../README.md lines 487–490) requires GM to set actor from the authenticated user identity.

Two alternatives were rejected:

  • Move the rewrite into engineclient. The adapter's role is thin transport; injecting actor (an authorisation concern) into transport would muddle the boundary and make the adapter test harness authorisation-aware. The service is the right home.
  • Inject actor only and keep the commands key. The engine schema requires cmd; this would require an engine contract change outside the Stage 16 scope and break Stage 05's frozen path.

The transform is duplicated across the two services rather than extracted to a shared package. Each implementation is twelve lines and each service is otherwise independent; a shared package would add import-edge surface for marginal savings, and the project convention is to prefer the minimal diff (CLAUDE.md §Priorities). The duplication is explicitly documented in both file-level comments.

This decision was confirmed with the user during plan-mode review.

D3. Hot-path services do not append to operation_log

Decision. None of the three services emit an operation_log entry. The Input shape carries no OpSource/SourceRef fields. Telemetry counters (gamemaster.command_execute.outcomes, gamemaster.order_put.outcomes, gamemaster.report_get.outcomes) are the only audit surface.

Why. The operation.OpKind enum (internal/domain/operation/log.go) intentionally has no value for command, order, or report; it stops at admin and lifecycle operations. Logging every hot-path call would multiply audit volume by the order rate without adding investigative value: the telemetry counter already exposes the outcome distribution, and the engine itself is the source of truth for per-command results. Adding three new OpKind values would also bloat the SQL CHECK on operation_log with no operational consumer.

D4. Membership cache uses a hand-rolled per-game inflight tracker

Decision. Cache.fetch coordinates concurrent misses on the same game_id through a tiny map[gameID]*flight plus a per-flight done channel. Joiners block on select { case <-existing.done: case <-ctx.Done(): }. The leader populates members (or err) on the flight before closing the channel.

Why. golang.org/x/sync/singleflight would be a sharper tool, but it is currently only an indirect (transitive) dependency of other modules in the workspace, and promoting it to a direct dependency would have to clear the "justification for direct deps" bar set by CLAUDE.md §Dependencies. The cache is the only consumer in gamemaster, the implementation is ~30 lines, and the context-cancellable wait costs one select statement that we would have to wrap around singleflight.Do anyway. The cache-internal helper is the cheaper choice.

D5. Cache returns the raw status string

Decision. Cache.Resolve returns (status string, err error) where the status is the verbatim Lobby vocabulary ("active", "removed", "blocked") plus the empty string when the user is not in the roster. Callers compare against membershipStatusActive = "active" directly. There is no typed wrapper.

Why. ports.Membership.Status is already string (internal/ports/lobbyclient.go line 56); introducing a MembershipStatus domain type purely to be passed through would add boilerplate without enforcing any invariant that Go's type system can check. The hot-path services need only a single equality check, so a typed enum buys nothing; it would also need a defensive fallback for unknown vocabulary in case Lobby adds statuses later, which is more decision surface than the cache should own.

D6. Empty roster slot surfaces as forbidden

Decision. Two distinct underlying conditions both surface as ErrorCodeForbidden from the three services:

  • The membership cache returns the empty string for the requested (gameID, userID): the user is not present in the Lobby roster.
  • The membership cache returns "active" but playermappingstore.Get(gameID, userID) returns playermapping.ErrNotFound: the user is an active platform member but has no engine roster slot.

The second condition is an internal inconsistency (register-runtime should have installed the row), but the user-visible semantics — "you are not authorised to act on this game" — are identical to the first. The structured log captures the underlying cause.

Why. Surfacing the second condition as internal_error would return a 500 for what is, from the caller's side, a routine "user not part of the engine roster" case, and would obscure the actual outcome from the gateway and the user. The inconsistency, if it ever materialises, is an operator concern visible in the warn-level log and the forbidden metric attribution; treating it as a 5xx would help neither operators (who would learn to ignore the false alarm) nor users (who only care that they cannot act).

Files landed

Created:

Modified:

  • ../api/internal-openapi.yaml — rewrote the description fields of ExecuteCommandsRequest and PutOrdersRequest to document the GM-side envelope rewrite.

Reused (not modified):

  • internal/ports/{engineclient.go, lobbyclient.go, playermappingstore.go, runtimerecordstore.go} — every interface and sentinel was already present.
  • internal/domain/runtime/model.go — StatusRunning constant + the whole status vocabulary.
  • internal/domain/playermapping/model.go — PlayerMapping and ErrNotFound.
  • internal/domain/operation/log.go — Outcome enum.
  • internal/config/config.go — MembershipCacheConfig.{TTL, MaxGames} with defaults 30s / 4096.
  • internal/telemetry/runtime.go — RecordCommandExecuteOutcome, RecordOrderPutOutcome, RecordReportGetOutcome, RecordMembershipCacheResult, RecordEngineCall (already wired in Stage 08).

Verification

cd gamemaster

# Membership cache (race-clean concurrency).
go test -race ./internal/service/membership/...

# Each new player service.
go test ./internal/service/commandexecute/...
go test ./internal/service/orderput/...
go test ./internal/service/reportget/...

# Module-wide build + suite.
go build ./...
go test ./...

Out of scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).