| stage | title |
|---|---|
| 16 | Hot-path services and membership cache |
Stage 16 — Hot-path services and membership cache
This decision record captures the non-obvious choices made while
implementing the gateway-facing trio of player services
(commandexecute, orderput, reportget) and the in-process membership
cache that authorises every hot-path call. It is the last service-layer
stage before Stage 17 (admin operations) and Stage 19 (REST handlers and
wiring).
Context
../PLAN.md Stage 16 ships four components that together
make the player surface usable:
- service/membership: concurrent in-process LRU cache holding the per-game user_id → status projection from Lobby /api/v1/internal/games/{game_id}/memberships. TTL is the safety net; the explicit invalidation hook from Lobby is the primary staleness control.
- service/commandexecute: orchestrator behind POST /api/v1/internal/games/{game_id}/commands. Authorises the caller, resolves actor=race_name, reshapes the JSON envelope, and forwards PUT /api/v1/command to the engine.
- service/orderput: same shape as commandexecute, targeting the engine PUT /api/v1/order.
- service/reportget: orchestrator behind GET /api/v1/internal/games/{game_id}/reports/{turn}. Authorises the caller, resolves race_name, and forwards GET /api/v1/report?player=<race>&turn=<turn> to the engine.
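As an illustration of how the TTL safety net and the explicit invalidation hook interact in the membership cache, here is a minimal sketch. It is not the code in service/membership: the type and method names (cache, put, lookup, invalidate), the injectable clock, and the use of container/list for LRU ordering are all assumptions made for the example.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// entry is one cached roster: user_id -> raw Lobby status, plus its expiry.
type entry struct {
	gameID  string
	members map[string]string
	expires time.Time
}

// cache is a hypothetical, simplified stand-in for the membership cache.
type cache struct {
	mu       sync.Mutex
	ttl      time.Duration
	maxGames int
	now      func() time.Time         // injectable clock (assumption, for testability)
	order    *list.List               // front = most recently used game
	byGame   map[string]*list.Element // game_id -> element whose Value is *entry
}

func newCache(ttl time.Duration, maxGames int, now func() time.Time) *cache {
	return &cache{ttl: ttl, maxGames: maxGames, now: now,
		order: list.New(), byGame: make(map[string]*list.Element)}
}

// put stores a freshly fetched roster and evicts the least recently used
// game once more than maxGames rosters are held.
func (c *cache) put(gameID string, members map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dropLocked(gameID)
	e := &entry{gameID: gameID, members: members, expires: c.now().Add(c.ttl)}
	c.byGame[gameID] = c.order.PushFront(e)
	if c.order.Len() > c.maxGames {
		c.dropLocked(c.order.Back().Value.(*entry).gameID)
	}
}

// lookup reports the cached status for userID. hit is false when the game is
// absent or its TTL has lapsed, in which case the caller refetches from Lobby.
func (c *cache) lookup(gameID, userID string) (status string, hit bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.byGame[gameID]
	if !ok {
		return "", false
	}
	e := el.Value.(*entry)
	if !c.now().Before(e.expires) {
		c.dropLocked(gameID) // TTL safety net
		return "", false
	}
	c.order.MoveToFront(el)
	return e.members[userID], true // "" when the user is not in the roster
}

// invalidate is the explicit hook: drop the roster as soon as Lobby reports a change.
func (c *cache) invalidate(gameID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dropLocked(gameID)
}

// dropLocked removes a game; the caller must hold c.mu.
func (c *cache) dropLocked(gameID string) {
	if el, ok := c.byGame[gameID]; ok {
		c.order.Remove(el)
		delete(c.byGame, gameID)
	}
}

// demo exercises a hit, a non-member lookup, an invalidation, and a TTL expiry.
func demo() string {
	clock := time.Unix(0, 0)
	c := newCache(30*time.Second, 2, func() time.Time { return clock })
	c.put("g1", map[string]string{"u1": "active", "u2": "blocked"})
	s1, hit1 := c.lookup("g1", "u1")
	s2, hit2 := c.lookup("g1", "u9")
	c.invalidate("g1")
	_, hit3 := c.lookup("g1", "u1")
	c.put("g1", map[string]string{"u1": "removed"})
	clock = clock.Add(31 * time.Second)
	_, hit4 := c.lookup("g1", "u1")
	return fmt.Sprintf("%q/%v %q/%v %v %v", s1, hit1, s2, hit2, hit3, hit4)
}

func main() { fmt.Println(demo()) }
```

In this sketch an invalidated or expired game behaves exactly like a game that was never cached, so the caller needs only one refetch path.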
The reference precedent for the orchestrator shape (Input / Result /
Dependencies / NewService / Handle, plus a private classifyEngineError
helper) is Stage 15's service/turngeneration. Six decisions deviate
from a literal reading of the README, the OpenAPI surface, or the
turngeneration precedent. Each is recorded below.
Decisions
D1. reportget does not require runtime_records.status = running
Decision.
service/reportget accepts
any non-deleted runtime row and forwards the read to the engine.
runtime_not_running is not part of reportget's error vocabulary
(errors.go).
commandexecute and orderput, by contrast, reject anything other than
StatusRunning with runtime_not_running.
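The difference between the two preconditions can be sketched as follows. This is a hedged illustration only: the runtimeRow type, the deleted flag, the status values other than running, and the runtime_not_found code are assumptions for the example, not names taken from the services.

```go
package main

import "fmt"

type runtimeStatus string

// Only "running" and "engine_unreachable" are named in this record;
// "stopped" is an assumed value used for illustration.
const (
	statusRunning           runtimeStatus = "running"
	statusStopped           runtimeStatus = "stopped"
	statusEngineUnreachable runtimeStatus = "engine_unreachable"
)

// runtimeRow models the slice of runtime_records that the checks read.
// How deletion is represented in the real table is an assumption here.
type runtimeRow struct {
	status  runtimeStatus
	deleted bool
}

// writePrecondition mirrors the commandexecute / orderput rule:
// anything other than running is rejected with runtime_not_running.
func writePrecondition(row *runtimeRow) string {
	if row == nil || row.deleted {
		return "runtime_not_found" // hypothetical code for the 404 case
	}
	if row.status != statusRunning {
		return "runtime_not_running"
	}
	return ""
}

// readPrecondition mirrors the reportget rule: any non-deleted row passes,
// and the status is not inspected at all.
func readPrecondition(row *runtimeRow) string {
	if row == nil || row.deleted {
		return "runtime_not_found" // hypothetical code for the 404 case
	}
	return ""
}

func main() {
	rows := []*runtimeRow{
		{status: statusRunning},
		{status: statusStopped},
		{status: statusEngineUnreachable},
	}
	for _, row := range rows {
		fmt.Printf("%s: write=%q read=%q\n",
			row.status, writePrecondition(row), readPrecondition(row))
	}
}
```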
Why. Three signals point at the same conclusion:
- The OpenAPI surface for internalGetReport (api/internal-openapi.yaml lines 546–575) lists only 403 / 404 / 502 / 500 responses; there is no 409 / runtime_not_running on the report path. The matching error response on commands and orders (lines 502, 540) does include 409.
- The README §Reports flow (../README.md lines 508–520) lists only authorisation, race-name resolution, and engine forwarding. The preceding §Player commands and orders block (lines 492–506) lists the status=running precondition explicitly. The two sections are separately worded by design.
- A finished or stopped runtime is a normal target for a post-mortem read of older turns. Refusing the read forces operators to use ad-hoc database access for the same data the engine already exposes.
The engine_unreachable outcome remains the natural failure mode when
the engine container is genuinely gone (for example, when the runtime
record is in engine_unreachable status); no extra branch is required.
This decision was confirmed with the user during plan-mode review.
D2. GM rewrites the engine envelope (commands → cmd, inject actor)
Decision.
commandexecute.rewriteCommandPayload
and the parallel
orderput.rewriteOrderPayload
unmarshal the GM ExecuteCommandsRequest / PutOrdersRequest body as
map[string]json.RawMessage, take the commands field, and emit a
fresh JSON object containing only actor (set to the resolved race
name) and cmd (carrying the original array). Every other top-level
key is dropped. The OpenAPI descriptions for ExecuteCommandsRequest
and PutOrdersRequest were updated in the same patch to document the
rewrite.
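The rewrite described above can be sketched in a few lines. The function name rewritePayload and its error messages are hypothetical; only the behaviour (decode as map[string]json.RawMessage, keep commands, emit actor and cmd, drop everything else) follows the description in this record.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// rewritePayload is a hypothetical stand-in for rewriteCommandPayload /
// rewriteOrderPayload. It never reads an actor supplied by the caller.
func rewritePayload(body []byte, raceName string) ([]byte, error) {
	var envelope map[string]json.RawMessage
	if err := json.Unmarshal(body, &envelope); err != nil {
		return nil, fmt.Errorf("decode request body: %w", err)
	}
	commands, ok := envelope["commands"]
	if !ok {
		return nil, errors.New("request body has no commands field")
	}
	// Marshal the race name so that quoting and escaping are handled by encoding/json.
	actor, err := json.Marshal(raceName)
	if err != nil {
		return nil, fmt.Errorf("encode actor: %w", err)
	}
	// Build a fresh object: every other top-level key is dropped by construction.
	return json.Marshal(map[string]json.RawMessage{
		"actor": actor,
		"cmd":   commands,
	})
}

func main() {
	in := []byte(`{"commands":[{"type":"move"}],"actor":"spoofed","extra":1}`)
	out, err := rewritePayload(in, "Zorg")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Because the commands array travels as json.RawMessage, the service does not need to know the shape of individual commands; the engine remains the only component that validates them.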
Why. The literal "forwarded verbatim" wording in the original Stage 06 OpenAPI description conflicted with two upstream constraints:
- The engine CommandRequest schema in game/openapi.yaml lines 345–364 declares actor and cmd as required, with no top-level commands.
- The README §Hot Path rule "GM never trusts a payload field for actor identification" (../README.md lines 487–490) requires GM to set actor from the authenticated user identity.
Two alternatives were rejected:
- Move the rewrite into engineclient. The adapter's role is thin transport; injecting actor (an authorisation concern) into transport would muddle the boundary and make the adapter test harness authorisation-aware. The service is the right home.
- Inject actor only and keep the commands key. The engine schema requires cmd; this would require an engine contract change outside the Stage 16 scope and break Stage 05's frozen path.
The transform is duplicated across the two services rather than
extracted to a shared package. Each implementation is twelve lines and
each service is otherwise independent; a shared package would add
import-edge surface for marginal savings, and the project convention is
to prefer the minimal diff (CLAUDE.md §Priorities). The duplication is
explicitly documented in both file-level comments.
This decision was confirmed with the user during plan-mode review.
D3. Hot-path services do not append to operation_log
Decision. None of the three services emit an operation_log entry.
The Input shape carries no OpSource/SourceRef fields. Telemetry
counters
(gamemaster.command_execute.outcomes,
gamemaster.order_put.outcomes, gamemaster.report_get.outcomes) are
the only audit surface.
Why. The operation.OpKind enum
(internal/domain/operation/log.go) intentionally has no value for
command, order, or report — it stops at admin and lifecycle operations.
Every hot-path call would multiply audit volume by the order rate
without adding investigative value: the telemetry counter already
exposes outcome distribution, and the engine itself is the source of
truth for per-command results. Adding three new OpKind values would
also bloat the SQL CHECK on operation_log with no operational
consumer.
D4. Membership cache uses a hand-rolled per-game inflight tracker
Decision.
Cache.fetch coordinates
concurrent misses on the same game_id through a tiny
map[gameID]*flight plus a per-flight done channel. Joiners block on
select { case <-existing.done: case <-ctx.Done(): }. The leader
populates members (or err) on the flight before closing the channel.
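A minimal sketch of such a per-game inflight tracker is shown below. It follows the description above (map of flights, done channel, joiners selecting on done or ctx.Done()), but the names tracker, loader, and demo are hypothetical, and the real Cache.fetch may differ in detail, for example in how it treats a leader whose own context is cancelled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// flight is one in-progress roster fetch; members and err are written by the
// leader before done is closed, so joiners may read them after <-done.
type flight struct {
	done    chan struct{}
	members map[string]string
	err     error
}

type loader func(ctx context.Context, gameID string) (map[string]string, error)

type tracker struct {
	mu       sync.Mutex
	inflight map[string]*flight
	load     loader
}

func newTracker(load loader) *tracker {
	return &tracker{inflight: make(map[string]*flight), load: load}
}

func (t *tracker) fetch(ctx context.Context, gameID string) (map[string]string, error) {
	t.mu.Lock()
	if existing, ok := t.inflight[gameID]; ok {
		t.mu.Unlock()
		select { // joiner: wait for the leader, but honour our own context
		case <-existing.done:
			return existing.members, existing.err
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	f := &flight{done: make(chan struct{})}
	t.inflight[gameID] = f
	t.mu.Unlock()

	f.members, f.err = t.load(ctx, gameID) // leader: the only upstream call

	t.mu.Lock()
	delete(t.inflight, gameID)
	t.mu.Unlock()
	close(f.done) // publishes members and err to every joiner
	return f.members, f.err
}

// demo starts a leader whose load is held open, then joins with an already
// cancelled context to show that the joiner returns without a second load.
func demo() (calls int, joinerCancelled bool, leaderStatus string) {
	started, release := make(chan struct{}), make(chan struct{})
	t := newTracker(func(ctx context.Context, gameID string) (map[string]string, error) {
		calls++
		close(started)
		<-release
		return map[string]string{"u1": "active"}, nil
	})
	leader := make(chan map[string]string, 1)
	go func() {
		members, _ := t.fetch(context.Background(), "g1")
		leader <- members
	}()
	<-started // the flight for g1 is registered and its load is blocked

	cancelled, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := t.fetch(cancelled, "g1")
	joinerCancelled = errors.Is(err, context.Canceled)

	close(release)
	members := <-leader
	return calls, joinerCancelled, members["u1"]
}

func main() {
	calls, joinerCancelled, status := demo()
	fmt.Println(calls, joinerCancelled, status)
}
```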
Why. golang.org/x/sync/singleflight would be a sharper tool, but
adding it as a direct dependency (it is currently only an indirect,
transitive dependency of other modules in the workspace) would have to
clear the "justification for direct deps" bar set by CLAUDE.md §Dependencies.
The cache is the only consumer in gamemaster, the implementation is
~30 lines, and a context-cancellable wait is one extra select line we
would otherwise have to wrap around singleflight.Do anyway. The
cache-internal helper is the cheaper choice.
D5. Cache returns the raw status string
Decision.
Cache.Resolve returns
(status string, err error) where the status is the verbatim Lobby
vocabulary ("active", "removed", "blocked") plus the empty string
when the user is not in the roster. Callers compare against
membershipStatusActive = "active" directly. There is no typed
wrapper.
Why. ports.Membership.Status is already string
(internal/ports/lobbyclient.go line 56); introducing a MembershipStatus
domain type purely to be passed through would add boilerplate without
enforcing any invariant Go's type system can check. The hot-path
services need only a single equality check, so a typed enum buys
nothing; it would also need an "unknown vocabulary" fallback to guard
against future Lobby additions, which is more decision surface than
the cache should own.
D6. Empty roster slot surfaces as forbidden
Decision. Two distinct underlying conditions both surface as
ErrorCodeForbidden from the three services:
- The membership cache returns the empty string for the requested (gameID, userID): the user is not present in the Lobby roster.
- The membership cache returns "active" but playermappingstore.Get(gameID, userID) returns playermapping.ErrNotFound: the user is an active platform member but has no engine roster slot.
The second condition is an internal inconsistency (register-runtime should have installed the row), but the user-visible semantics — "you are not authorised to act on this game" — are identical to the first. The structured log captures the underlying cause.
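The mapping from the two underlying conditions to one user-visible code might look like the sketch below. The authorise function, the errorCode type, the cause strings, and the treatment of non-active statuses other than the empty string are assumptions for illustration; errMappingNotFound stands in for playermapping.ErrNotFound.

```go
package main

import (
	"errors"
	"fmt"
)

// errMappingNotFound stands in for playermapping.ErrNotFound.
var errMappingNotFound = errors.New("player mapping not found")

const membershipStatusActive = "active"

type errorCode string

const (
	codeOK        errorCode = ""
	codeForbidden errorCode = "forbidden"
	codeInternal  errorCode = "internal_error"
)

// authorise returns the resolved race name, the user-visible code, and the
// underlying cause intended for the structured log. lookupRace is called only
// for active members. Treating every non-active status like the empty string
// is an assumption of this sketch.
func authorise(status string, lookupRace func() (string, error)) (race string, code errorCode, cause string) {
	if status != membershipStatusActive {
		return "", codeForbidden, "not_active_member"
	}
	race, err := lookupRace()
	switch {
	case errors.Is(err, errMappingNotFound):
		// Internal inconsistency, but the caller still just cannot act.
		return "", codeForbidden, "mapping_missing"
	case err != nil:
		return "", codeInternal, err.Error()
	}
	return race, codeOK, ""
}

func main() {
	missing := func() (string, error) { return "", errMappingNotFound }
	found := func() (string, error) { return "Zorg", nil }

	for _, tc := range []struct {
		status string
		lookup func() (string, error)
	}{
		{"", found},
		{"active", missing},
		{"active", found},
	} {
		race, code, cause := authorise(tc.status, tc.lookup)
		fmt.Printf("status=%q race=%q code=%q cause=%q\n", tc.status, race, code, cause)
	}
}
```

Keeping the cause separate from the code lets the log distinguish the two conditions while the gateway sees a single outcome.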
Why. Surfacing the second condition as internal_error would
return a 500 for a perfectly routine "user not part of the engine roster"
case and obscure the actual outcome from the gateway and the user. The
inconsistency, if it ever materialises, is an operator concern visible
in the warn-level log and the forbidden metric attribution; treating
it as a 5xx would help neither operators (who would learn to ignore the
false alarm) nor users (who only care that they cannot act).
Files landed
Created:
- ../internal/service/membership/{errors.go, cache.go, cache_test.go}: concurrent LRU cache plus ErrLobbyUnavailable sentinel.
- ../internal/service/commandexecute/{errors.go, service.go, service_test.go}: command-execute orchestrator and tests.
- ../internal/service/orderput/{errors.go, service.go, service_test.go}: order-put orchestrator and tests.
- ../internal/service/reportget/{errors.go, service.go, service_test.go}: report-get orchestrator and tests.
- This decision record.
Modified:
- ../api/internal-openapi.yaml: rewrote the description fields of ExecuteCommandsRequest and PutOrdersRequest to document the GM-side envelope rewrite.
Reused (not modified):
- internal/ports/{engineclient.go, lobbyclient.go, playermappingstore.go, runtimerecordstore.go}: every interface and sentinel was already present.
- internal/domain/runtime/model.go: StatusRunning constant + the whole status vocabulary.
- internal/domain/playermapping/model.go: PlayerMapping and ErrNotFound.
- internal/domain/operation/log.go: Outcome enum.
- internal/config/config.go: MembershipCacheConfig.{TTL, MaxGames} with defaults 30s / 4096.
- internal/telemetry/runtime.go: RecordCommandExecuteOutcome, RecordOrderPutOutcome, RecordReportGetOutcome, RecordMembershipCacheResult, RecordEngineCall (already wired in Stage 08).
Verification
cd gamemaster
# Membership cache (race-clean concurrency).
go test -race ./internal/service/membership/...
# Each new player service.
go test ./internal/service/commandexecute/...
go test ./internal/service/orderput/...
go test ./internal/service/reportget/...
# Module-wide build + suite.
go build ./...
go test ./...
Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).