---
stage: 16
title: Hot-path services and membership cache
---

# Stage 16: Hot-path services and membership cache

This decision record captures the non-obvious choices made while implementing the gateway-facing trio of player services (`commandexecute`, `orderput`, `reportget`) and the in-process membership cache that authorises every hot-path call. It is the last service-layer stage before Stage 17 (admin operations) and Stage 19 (REST handlers and wiring).

## Context

[`../PLAN.md` Stage 16](../PLAN.md) ships four components that together make the player surface usable:

1. `service/membership`: concurrent in-process LRU cache holding the per-game `user_id → status` projection from `Lobby /api/v1/internal/games/{game_id}/memberships`. TTL is the safety net; the explicit invalidation hook from Lobby is the primary staleness control.
2. `service/commandexecute`: orchestrator behind `POST /api/v1/internal/games/{game_id}/commands`. Authorises the caller, resolves `actor=race_name`, reshapes the JSON envelope, and forwards `PUT /api/v1/command` to the engine.
3. `service/orderput`: same shape as `commandexecute`, targeting the engine `PUT /api/v1/order`.
4. `service/reportget`: orchestrator behind `GET /api/v1/internal/games/{game_id}/reports/{turn}`. Authorises the caller, resolves `race_name`, and forwards `GET /api/v1/report?player=&turn=` to the engine.

The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle, plus a private `classifyEngineError` helper) is Stage 15's `service/turngeneration`.

Six decisions deviate from a literal reading of the README, the OpenAPI surface, or the turngeneration precedent. Each is recorded below.

## Decisions

### D1. `reportget` does not require `runtime_records.status = running`

**Decision.** [`service/reportget`](../internal/service/reportget/service.go) accepts any non-deleted runtime row and forwards the read to the engine.
`runtime_not_running` is **not** part of `reportget`'s error vocabulary ([`errors.go`](../internal/service/reportget/errors.go)). `commandexecute` and `orderput`, by contrast, reject anything other than `StatusRunning` with `runtime_not_running`.

**Why.** Three signals point at the same conclusion:

- The OpenAPI surface for `internalGetReport` (`api/internal-openapi.yaml` lines 546–575) lists only `403 / 404 / 502 / 500` responses; there is no 409 / `runtime_not_running` on the report path. The matching error response on commands and orders (lines 502, 540) does include 409.
- The README §Reports flow (`../README.md` lines 508–520) lists only authorisation, race-name resolution, and engine forwarding. The preceding §Player commands and orders block (lines 492–506) lists the `status=running` precondition explicitly. The two sections are separately worded by design.
- A finished or stopped runtime is a normal target for a post-mortem read of older turns. Refusing the read forces operators to use ad-hoc database access for the same data the engine already exposes.

The `engine_unreachable` outcome remains the natural failure mode when the engine container is genuinely gone (e.g., when the runtime status is `engine_unreachable`); no extra branch is required.

This decision was confirmed with the user during plan-mode review.

### D2. GM rewrites the engine envelope (`commands` → `cmd`, inject `actor`)

**Decision.** [`commandexecute.rewriteCommandPayload`](../internal/service/commandexecute/service.go) and the parallel [`orderput.rewriteOrderPayload`](../internal/service/orderput/service.go) unmarshal the GM `ExecuteCommandsRequest` / `PutOrdersRequest` body as `map[string]json.RawMessage`, take the `commands` field, and emit a fresh JSON object containing only `actor` (set to the resolved race name) and `cmd` (carrying the original array). Every other top-level key is dropped.
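
The envelope rewrite described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `service.go`: the function name `rewritePayload`, the error returned for a missing `commands` key, and the sample payload are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rewritePayload is a hypothetical stand-in for rewriteCommandPayload /
// rewriteOrderPayload. It keeps only the `commands` array (renamed to `cmd`)
// and sets `actor` from the resolved race name, never from the request body.
func rewritePayload(body []byte, raceName string) ([]byte, error) {
	var in map[string]json.RawMessage
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, fmt.Errorf("decode request body: %w", err)
	}

	cmds, ok := in["commands"]
	if !ok {
		// Assumption: a missing `commands` key is treated as a bad request.
		return nil, errors.New("request body has no commands field")
	}

	actor, err := json.Marshal(raceName)
	if err != nil {
		return nil, fmt.Errorf("encode actor: %w", err)
	}

	// Build a fresh object so that every other top-level key is dropped,
	// including any caller-supplied `actor`.
	out := map[string]json.RawMessage{
		"actor": actor,
		"cmd":   cmds,
	}
	return json.Marshal(out)
}

func main() {
	body := []byte(`{"commands":[{"type":"move"}],"actor":"spoofed","extra":1}`)
	out, err := rewritePayload(body, "Zorg")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out)) // {"actor":"Zorg","cmd":[{"type":"move"}]}
}
```

Because the output object is built from scratch, a caller-supplied `actor` in the request body cannot reach the engine; only the race name resolved from the authenticated identity does.
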
The OpenAPI descriptions for `ExecuteCommandsRequest` and `PutOrdersRequest` were updated in the same patch to document the rewrite.

**Why.** The literal "forwarded verbatim" wording in the original Stage 06 OpenAPI description conflicted with two upstream constraints:

- The engine `CommandRequest` schema in `game/openapi.yaml` lines 345–364 declares `actor` and `cmd` as required, with no top-level `commands`.
- The README §Hot Path rule "GM never trusts a payload field for actor identification" (`../README.md` lines 487–490) requires GM to set `actor` from the authenticated user identity.

Two alternatives were rejected:

- **Move the rewrite into `engineclient`.** The adapter's role is thin transport; injecting actor (an authorisation concern) into transport would muddle the boundary and make the adapter test harness authorisation-aware. The service is the right home.
- **Inject `actor` only and keep the `commands` key.** The engine schema requires `cmd`; this would require an engine contract change outside the Stage 16 scope and break Stage 05's frozen path.

The transform is duplicated across the two services rather than extracted to a shared package. Each implementation is twelve lines and each service is otherwise independent; a shared package would add import-edge surface for marginal savings, and the project convention is to prefer the minimal diff (`CLAUDE.md §Priorities`). The duplication is explicitly documented in both file-level comments.

This decision was confirmed with the user during plan-mode review.

### D3. Hot-path services do not append to `operation_log`

**Decision.** None of the three services emit an `operation_log` entry. The `Input` shape carries no `OpSource`/`SourceRef` fields. Telemetry counters (`gamemaster.command_execute.outcomes`, `gamemaster.order_put.outcomes`, `gamemaster.report_get.outcomes`) are the only audit surface.


**Why.** The `operation.OpKind` enum (`internal/domain/operation/log.go`) intentionally has no value for command, order, or report; it stops at admin and lifecycle operations. Logging every hot-path call would multiply audit volume by the order rate without adding investigative value: the telemetry counter already exposes outcome distribution, and the engine itself is the source of truth for per-command results. Adding three new `OpKind` values would also bloat the SQL CHECK on `operation_log` with no operational consumer.

### D4. Membership cache uses a hand-rolled per-game inflight tracker

**Decision.** [`Cache.fetch`](../internal/service/membership/cache.go) coordinates concurrent misses on the same `game_id` through a tiny `map[gameID]*flight` plus a per-flight `done` channel. Joiners block on `select { case <-existing.done: case <-ctx.Done(): }`. The leader populates `members` (or `err`) on the flight before closing the channel.

**Why.** `golang.org/x/sync/singleflight` would be a sharper tool, but adding it as a *direct* dependency (it is currently only an indirect dependency, pulled in transitively by other modules in the workspace) requires clearing the "justification for direct deps" bar set by `CLAUDE.md §Dependencies`. The cache is the only consumer in `gamemaster`, the implementation is ~30 lines, and a context-cancellable wait is one extra `select` line we would otherwise have to wrap around `singleflight.Do` anyway. The cache-internal helper is the cheaper choice.

### D5. Cache returns the raw status string

**Decision.** [`Cache.Resolve`](../internal/service/membership/cache.go) returns `(status string, err error)` where the status is the verbatim Lobby vocabulary (`"active"`, `"removed"`, `"blocked"`) plus the empty string when the user is not in the roster. Callers compare against `membershipStatusActive = "active"` directly. There is no typed wrapper.


**Why.** `ports.Membership.Status` is already `string` (`internal/ports/lobbyclient.go` line 56); introducing a `MembershipStatus` domain type purely to be passed through would add boilerplate without enforcing any invariant Go's type system can check. The hot-path services need only a single equality check, so a typed enum buys nothing. It would also need a defensive "unknown vocabulary" fallback to cope with future Lobby additions, which is more decision surface than the cache should own.

### D6. Empty roster slot surfaces as `forbidden`

**Decision.** Two distinct underlying conditions both surface as `ErrorCodeForbidden` from the three services:

- The membership cache returns the empty string for the requested `(gameID, userID)`: the user is not present in the Lobby roster.
- The membership cache returns `"active"` but `playermappingstore.Get(gameID, userID)` returns `playermapping.ErrNotFound`: the user is an active platform member but has no engine roster slot.

The second condition is an internal inconsistency (register-runtime should have installed the row), but the user-visible semantics ("you are not authorised to act on this game") are identical to the first. The structured log captures the underlying cause.

**Why.** Surfacing the second condition as `internal_error` would return a 500 for a perfectly routine "user not part of the engine roster" case and obscure the actual outcome from the gateway and the user. The inconsistency, if it ever materialises, is an operator concern visible in the warn-level log and the `forbidden` metric attribution. Treating it as a 5xx would help neither operators (who would learn to ignore the false alarm) nor users (who only care that they cannot act).

## Files landed

**Created:**

- [`../internal/service/membership/{errors.go, cache.go, cache_test.go}`](../internal/service/membership/): concurrent LRU cache plus `ErrLobbyUnavailable` sentinel.
- [`../internal/service/commandexecute/{errors.go, service.go, service_test.go}`](../internal/service/commandexecute/): command-execute orchestrator and tests.
- [`../internal/service/orderput/{errors.go, service.go, service_test.go}`](../internal/service/orderput/): order-put orchestrator and tests.
- [`../internal/service/reportget/{errors.go, service.go, service_test.go}`](../internal/service/reportget/): report-get orchestrator and tests.
- This decision record.

**Modified:**

- [`../api/internal-openapi.yaml`](../api/internal-openapi.yaml): rewrote the description fields of `ExecuteCommandsRequest` and `PutOrdersRequest` to document the GM-side envelope rewrite.

**Reused (not modified):**

- `internal/ports/{engineclient.go, lobbyclient.go, playermappingstore.go, runtimerecordstore.go}`: every interface and sentinel was already present.
- `internal/domain/runtime/model.go`: `StatusRunning` constant + the whole status vocabulary.
- `internal/domain/playermapping/model.go`: `PlayerMapping` and `ErrNotFound`.
- `internal/domain/operation/log.go`: `Outcome` enum.
- `internal/config/config.go`: `MembershipCacheConfig.{TTL, MaxGames}` with defaults `30s` / `4096`.
- `internal/telemetry/runtime.go`: `RecordCommandExecuteOutcome`, `RecordOrderPutOutcome`, `RecordReportGetOutcome`, `RecordMembershipCacheResult`, `RecordEngineCall` (already wired in Stage 08).

## Verification

```sh
cd gamemaster

# Membership cache (race-clean concurrency).
go test -race ./internal/service/membership/...

# Each new player service.
go test ./internal/service/commandexecute/...
go test ./internal/service/orderput/...
go test ./internal/service/reportget/...

# Module-wide build + suite.
go build ./...
go test ./...
```

Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).