| stage | title |
|---|---|
| 15 | Scheduler, turn generation, and snapshot publisher |
Stage 15 — Scheduler, turn generation, and snapshot publisher
This decision record captures the non-obvious choices made while
implementing the scheduler ticker, the turn-generation orchestrator,
and the publication of gm:lobby_events plus notification:intents
at PLAN Stage 15. It is the heart of Game Master: every running game
flows through this code path on every scheduled or admin-forced turn.
Context
../PLAN.md Stage 15 ships three components that
together drive a turn:
- service/turngeneration: the orchestrator that CAS's running → generation_in_progress, calls the engine /admin/turn, branches on finished, and publishes a runtime_snapshot_update / game_finished event plus the corresponding game.turn.ready / game.finished / game.generation_failed notification.
- service/scheduler: a thin, stateless wrapper around domain/schedule.Schedule.Next, reused by the turn-generation recompute step and (in Stage 17) by service/adminforce.
- worker/schedulerticker: the 1-second loop that scans runtime_records.ListDueRunning(now) and dispatches one turngeneration.Handle per due game.
The lifecycle the orchestrator drives is frozen by
../README.md §Lifecycles → Turn generation, and the
publication cadence by §Async Stream Contracts and
§Notification Contracts. The reference precedent for
the orchestrator shape (Input / Result / Dependencies / NewService /
Handle) is Stage 13's service/registerruntime.
Seven decisions deviate from a literal reading of either PLAN Stage 15, the README, or the Stage 13 precedent. Each is recorded below.
Decisions
D1. Resolve game_name synchronously from Lobby per notification
Decision. ports.LobbyClient
gains a GetGameSummary(ctx, gameID) (GameSummary, error) method plus
a narrow GameSummary{GameID, GameName, Status} type. The
HTTP-backed adapter at
internal/adapters/lobbyclient/client.go
issues a GET /api/v1/internal/games/{game_id} against the Lobby
internal listener, decodes the GameRecord shape (Lobby's frozen
contract), and wraps every non-success outcome with
ports.ErrLobbyUnavailable. The turngeneration service calls it
before publishing each notification:intents entry; on any error the
orchestrator falls back to using game_id as game_name and logs a
warn event with error_code=lobby_unavailable.
Why. notificationintent.GameTurnReadyPayload,
GameFinishedPayload, and GameGenerationFailedPayload all require a
game_name string, but Game Master does not own the platform name and
the register-runtime envelope does not carry it. Three alternatives
were considered and rejected:
- Extend the register-runtime contract with game_name and persist it on runtime_records. Cleanest architecturally, but requires editing the Stage 06 frozen OpenAPI spec, the contract test, the Stage 09 migration, the Stage 10 domain type, the Stage 11 store and tests, the Stage 13 register-runtime service and tests, and the regenerated jet code. Substantial cross-stage churn for a single denormalised string.
- Use game_id as the game_name placeholder unconditionally. Zero change cost, but every push notification a user receives carries the opaque platform identifier, which is a user-visible regression.
- Defer notification publication to Stage 16. Contradicts the PLAN Stage 15 task list, which explicitly enumerates game.turn.ready, game.finished, and game.generation_failed publication.
The chosen design adds one method and one return type to a port already established in Stage 12, with fail-soft fallback semantics that keep notification publication best-effort.
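The fail-soft lookup can be sketched roughly as follows. The GetGameSummary signature and the GameSummary fields come from the decision text; the resolveGameName helper, the stubLobby double, and the log wording are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// ErrLobbyUnavailable stands in for ports.ErrLobbyUnavailable.
var ErrLobbyUnavailable = errors.New("lobby unavailable")

// GameSummary mirrors the narrow type named in D1.
type GameSummary struct {
	GameID   string
	GameName string
	Status   string
}

// LobbyClient shows only the method that D1 adds to the port.
type LobbyClient interface {
	GetGameSummary(ctx context.Context, gameID string) (GameSummary, error)
}

// resolveGameName (hypothetical helper) returns the Lobby-owned name, or
// falls back to gameID on any error so publication stays best-effort.
func resolveGameName(ctx context.Context, lobby LobbyClient, gameID string) string {
	summary, err := lobby.GetGameSummary(ctx, gameID)
	if err != nil {
		log.Printf("warn: game_id=%s error_code=lobby_unavailable err=%v", gameID, err)
		return gameID
	}
	return summary.GameName
}

// stubLobby is a test double; it fails whenever fail is true.
type stubLobby struct {
	name string
	fail bool
}

func (s stubLobby) GetGameSummary(_ context.Context, gameID string) (GameSummary, error) {
	if s.fail {
		return GameSummary{}, fmt.Errorf("%w: simulated timeout", ErrLobbyUnavailable)
	}
	return GameSummary{GameID: gameID, GameName: s.name, Status: "running"}, nil
}

// demo resolves the name once against a healthy Lobby and once against a
// failing one.
func demo() (healthy, degraded string) {
	ctx := context.Background()
	healthy = resolveGameName(ctx, stubLobby{name: "Friday Skirmish"}, "g-1")
	degraded = resolveGameName(ctx, stubLobby{fail: true}, "g-1")
	return healthy, degraded
}

func main() {
	fmt.Println(demo()) // Friday Skirmish g-1
}
```

The degraded call still yields a usable game_name, so the notification is published either way.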
D2. Trigger parameter classifies telemetry, never logic
Decision. The plan's input shape {gameID, trigger ∈ {scheduler, force}} is preserved as turngeneration.Input.Trigger. The value
flows into the
gamemaster.turn_generation.outcomes counter as a
trigger label and into structured logs; it does not branch the
orchestrator's persistence path. The skip-tick mechanic is driven
exclusively by the runtime record's skip_next_tick column.
Why. ../README.md §Force-next-turn describes
adminforce as: "Run the turn-generation flow synchronously (the same
code path the scheduler uses). After success, set
runtime_records.skip_next_tick = true." Adminforce flips the flag
after the forced turn completes; the next scheduler-driven
generation consumes it. Forking the orchestrator on Trigger would
duplicate the recompute logic in two places and reopen the question
"what if a force fires while skip_next_tick is already true?".
Single-path makes the answer fall out of the existing rule (read the
flag at start, clear at recompute) without special cases.
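A minimal sketch of the single-path rule, assuming a simplified in-memory record. Input and Trigger follow the decision text; runtimeRecord, outcomeCounter, handle, and adminForce are illustrative stand-ins, and the CAS, engine, and publication steps are elided.

```go
package main

import "fmt"

type Trigger string

const (
	TriggerScheduler Trigger = "scheduler"
	TriggerForce     Trigger = "force"
)

type Input struct {
	GameID  string
	Trigger Trigger
}

// runtimeRecord keeps only the column relevant to the skip-tick rule.
type runtimeRecord struct{ SkipNextTick bool }

// outcomeCounter stands in for the outcomes counter, keyed by trigger label.
type outcomeCounter map[Trigger]int

// handle reads the flag at start and clears it at recompute. Trigger is
// used only as a telemetry label and never changes the path taken.
// It reports whether this generation consumed a skip.
func handle(in Input, rec *runtimeRecord, outcomes outcomeCounter) bool {
	skip := rec.SkipNextTick
	// CAS transitions, engine call, and publication are elided.
	rec.SkipNextTick = false
	outcomes[in.Trigger]++
	return skip
}

// adminForce follows the README rule: run the same path, then set the flag.
func adminForce(gameID string, rec *runtimeRecord, outcomes outcomeCounter) bool {
	skip := handle(Input{GameID: gameID, Trigger: TriggerForce}, rec, outcomes)
	rec.SkipNextTick = true
	return skip
}

// demo runs one forced turn followed by two scheduler turns.
func demo() (forced, firstTick, secondTick bool, counts outcomeCounter) {
	rec := &runtimeRecord{}
	counts = outcomeCounter{}
	forced = adminForce("g-1", rec, counts)
	firstTick = handle(Input{GameID: "g-1", Trigger: TriggerScheduler}, rec, counts)
	secondTick = handle(Input{GameID: "g-1", Trigger: TriggerScheduler}, rec, counts)
	return forced, firstTick, secondTick, counts
}

func main() {
	forced, first, second, counts := demo()
	fmt.Println(forced, first, second) // false true false
	fmt.Println(counts[TriggerForce], counts[TriggerScheduler])
}
```

Only the first scheduler-driven generation after the force consumes the skip. A force that fires while the flag is already true consumes it and sets it again, with no special case.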
D3. Two-CAS pattern with cleanup on engine failure
Decision. Persistence steps mirror Stage 13's CAS-then-rollback pattern with two CAS transitions per generation:
- running → generation_in_progress at the start. On runtime.ErrConflict (concurrent stop / external mutation) the orchestrator returns Result{ErrorCode: conflict} without publishing events; the external mutation is responsible for its own snapshot.
- After the engine call:
  - success + finished=true → generation_in_progress → finished;
  - success + finished=false → generation_in_progress → running;
  - engine error → generation_in_progress → generation_failed.
The post-engine CAS surfaces runtime.ErrConflict only when an
external mutation (typical cause: admin issued a stop while the engine
was generating) overtook the orchestrator. The engine call has
already mutated state, but the runtime row is owned by the new actor;
the orchestrator records the audit failure with conflict and exits.
Why. This keeps Stage 13's pattern intact: every CAS knows what
state the row should be in before the call, and a mismatch always
yields conflict. Mixing the two CAS guards with a single combined
status update (e.g., a transactional "running and not stopped") would
require the adapter to expose multi-status CAS predicates, breaking
the per-row CAS abstraction Stage 11 settled on.
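The two guarded transitions can be illustrated with an in-memory stand-in for the runtime record store. This is a sketch under stated assumptions: memStore, generate, and the engine callbacks are hypothetical, the "stopped" status name is assumed, and audit logging and event publication are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Status string

const (
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusFinished             Status = "finished"
	StatusGenerationFailed     Status = "generation_failed"
	StatusStopped              Status = "stopped" // assumed name for the admin-stop state
)

// ErrConflict stands in for runtime.ErrConflict.
var ErrConflict = errors.New("runtime: conflict")

// memStore is an in-memory stand-in for the runtime record store.
type memStore struct {
	mu     sync.Mutex
	status map[string]Status
}

func newStore(gameID string) *memStore {
	return &memStore{status: map[string]Status{gameID: StatusRunning}}
}

// CAS moves gameID from one status to another, or reports ErrConflict
// when the row is not in the expected state.
func (s *memStore) CAS(gameID string, from, to Status) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[gameID] != from {
		return ErrConflict
	}
	s.status[gameID] = to
	return nil
}

// set overwrites the status unconditionally, simulating an external actor.
func (s *memStore) set(gameID string, to Status) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[gameID] = to
}

// generate performs the two CAS transitions around one engine call.
func generate(s *memStore, gameID string, engine func() (bool, error)) (Status, error) {
	if err := s.CAS(gameID, StatusRunning, StatusGenerationInProgress); err != nil {
		return "", err // conflict at the start: nothing is published
	}
	finished, engineErr := engine()
	next := StatusRunning
	switch {
	case engineErr != nil:
		next = StatusGenerationFailed
	case finished:
		next = StatusFinished
	}
	if err := s.CAS(gameID, StatusGenerationInProgress, next); err != nil {
		return "", err // an external mutation overtook the orchestrator
	}
	return next, nil
}

func engineOK() (bool, error)       { return false, nil }
func engineFinished() (bool, error) { return true, nil }
func engineDown() (bool, error)     { return false, errors.New("engine unreachable") }

func main() {
	fmt.Println(generate(newStore("g-1"), "g-1", engineOK)) // running <nil>

	stopped := newStore("g-2")
	_, err := generate(stopped, "g-2", func() (bool, error) {
		stopped.set("g-2", StatusStopped) // admin stop lands mid-generation
		return false, nil
	})
	fmt.Println(errors.Is(err, ErrConflict)) // true
}
```

Each CAS names exactly one expected prior status, so the store never needs a multi-status predicate.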
D4. Snapshot cadence: one publication per outcome
Decision. The orchestrator publishes exactly one
runtime_snapshot_update or game_finished per turn-generation
call:
- success + not finished → PublishSnapshotUpdate with full player_turn_stats;
- success + finished → PublishGameFinished with full player_turn_stats;
- engine failure → PublishSnapshotUpdate with RuntimeStatus=generation_failed and empty player_turn_stats (no fresh engine payload).
The intermediate running → generation_in_progress transition is
not broadcast.
Why. The README cadence enumerates "transitioned" cases as
examples (running ↔ generation_in_progress), but PLAN Stage 15
explicitly anchors publication on the outcome side. Publishing twice
would double Lobby's processing cost without delivering new
information, because generation_in_progress carries no fresh engine
state and Lobby cannot act on the in-progress moment.
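The outcome-to-publication mapping is small enough to express as one pure function. In this sketch the publication struct and publicationFor are hypothetical; the event names come from the decision text, and the RuntimeStatus values for the two success cases are assumed to follow the D3 transitions.

```go
package main

import "fmt"

// publication describes the single event emitted for one
// turn-generation call.
type publication struct {
	Event         string // "runtime_snapshot_update" or "game_finished"
	RuntimeStatus string
	WithStats     bool // whether full player_turn_stats are attached
}

// publicationFor picks exactly one publication per outcome. The
// intermediate generation_in_progress state never produces an event.
func publicationFor(engineFailed, finished bool) publication {
	switch {
	case engineFailed:
		return publication{Event: "runtime_snapshot_update", RuntimeStatus: "generation_failed", WithStats: false}
	case finished:
		return publication{Event: "game_finished", RuntimeStatus: "finished", WithStats: true}
	default:
		return publication{Event: "runtime_snapshot_update", RuntimeStatus: "running", WithStats: true}
	}
}

func main() {
	fmt.Printf("%+v\n", publicationFor(false, false))
	fmt.Printf("%+v\n", publicationFor(false, true))
	fmt.Printf("%+v\n", publicationFor(true, false))
}
```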
D5. Notification recipients = playermappingstore.ListByGame
Decision. game.turn.ready and game.finished use
AudienceKindUser and need a sorted unique non-empty
recipient_user_ids list. The orchestrator derives it from
playermappingstore.ListByGame(gameID) projected to UserID values,
deduplicated and sorted ascending. An empty roster causes the
notification to be skipped with only a warn log; the runtime
mutation still persists.
Why. This is the only roster data Game Master owns until Stage 16
delivers the membership cache. After Stage 17 wires banish, the
player_mappings rows still represent the engine-known roster and
remain a correct conservative recipient set (banished members will be
filtered separately by Notification Service's user resolution if
absent in User Service). Adding a synchronous Lobby
GetMemberships call here would duplicate the work Stage 16 is
already on the hook to provide.
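The projection step can be sketched as below. PlayerMapping and recipientUserIDs are illustrative names, and dropping blank user IDs is an assumption that goes slightly beyond the text, which only requires the resulting list to be non-empty.

```go
package main

import (
	"fmt"
	"sort"
)

// PlayerMapping keeps only the fields the projection needs.
type PlayerMapping struct {
	GameID string
	UserID string
}

// recipientUserIDs projects mappings to unique user IDs, sorted ascending.
// Blank IDs are dropped (an assumption of this sketch).
func recipientUserIDs(mappings []PlayerMapping) []string {
	seen := make(map[string]struct{}, len(mappings))
	ids := make([]string, 0, len(mappings))
	for _, m := range mappings {
		if m.UserID == "" {
			continue
		}
		if _, dup := seen[m.UserID]; dup {
			continue
		}
		seen[m.UserID] = struct{}{}
		ids = append(ids, m.UserID)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	roster := []PlayerMapping{
		{GameID: "g-1", UserID: "u-7"},
		{GameID: "g-1", UserID: "u-1"},
		{GameID: "g-1", UserID: "u-7"},
		{GameID: "g-1", UserID: "u-2"},
	}
	ids := recipientUserIDs(roster)
	if len(ids) == 0 {
		fmt.Println("empty roster: skip the notification and log a warning")
		return
	}
	fmt.Println(ids) // [u-1 u-2 u-7]
}
```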
D6. Scheduler service is a stateless utility
Decision.
service/scheduler.Service
exposes a single ComputeNext(turnSchedule, after, skipNextTick) (time.Time, bool, error) method that wraps schedule.Parse(...).Next(after, skipNextTick). The service holds no dependencies and no clock; the
caller passes after. turngeneration injects a
*scheduler.Service and uses it during the post-success recompute;
Stage 17 will reuse the same instance from adminforce.
Why. Centralising the parse-then-next sequence keeps
the skip rule in one place and makes the future Stage 17 caller
trivial. Holding no state means tests are pure value tests against the
domain/schedule wrapper; no clock injection or dependency wiring is
required.
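A sketch of the stateless shape follows, with loud assumptions. The real domain/schedule wrapper is cron-based, whereas this stand-in Parse accepts a Go duration string such as "1h" and ticks on multiples of that interval. The meaning of the bool result is not stated in this record; here it is assumed to report whether a skip was applied.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule is a fixed-interval stand-in for the cron-based domain type.
type Schedule struct{ every time.Duration }

// Parse accepts a Go duration string (an assumption of this sketch).
func Parse(expr string) (Schedule, error) {
	d, err := time.ParseDuration(expr)
	if err != nil {
		return Schedule{}, fmt.Errorf("parse schedule %q: %w", expr, err)
	}
	if d <= 0 {
		return Schedule{}, fmt.Errorf("parse schedule %q: interval must be positive", expr)
	}
	return Schedule{every: d}, nil
}

// Next returns the first tick strictly after `after`, or the one after
// that when skipNextTick is set. The bool reports whether a skip was
// applied (assumed semantics).
func (s Schedule) Next(after time.Time, skipNextTick bool) (time.Time, bool) {
	next := after.Truncate(s.every).Add(s.every)
	if skipNextTick {
		return next.Add(s.every), true
	}
	return next, false
}

// Service holds no dependencies and no clock; the caller passes `after`.
type Service struct{}

// ComputeNext wraps the parse-then-next sequence in one place.
func (Service) ComputeNext(turnSchedule string, after time.Time, skipNextTick bool) (time.Time, bool, error) {
	sched, err := Parse(turnSchedule)
	if err != nil {
		return time.Time{}, false, err
	}
	next, skipped := sched.Next(after, skipNextTick)
	return next, skipped, nil
}

// demo computes the next tick for a fixed instant, as a pure value test.
func demo(skip bool) (string, bool) {
	after := time.Date(2024, time.January, 1, 10, 15, 0, 0, time.UTC)
	next, skipped, err := Service{}.ComputeNext("1h", after, skip)
	if err != nil {
		return err.Error(), false
	}
	return next.Format(time.RFC3339), skipped
}

func main() {
	fmt.Println(demo(false)) // 2024-01-01T11:00:00Z false
	fmt.Println(demo(true))  // 2024-01-01T12:00:00Z true
}
```

Because the service takes the reference instant as an argument, the demo needs neither a fake clock nor any wiring.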
D7. Per-game in-flight set on the scheduler ticker
Decision.
worker/schedulerticker.Worker
holds a sync.Map[gameID]struct{} of currently-dispatched games. At
each tick the worker scans RuntimeRecords.ListDueRunning(now) and
launches one goroutine per due game; if LoadOrStore reports the game
is already in-flight, the worker logs at debug and skips. The
goroutine releases the slot via defer w.inflight.Delete(gameID).
Why. A 1-second tick is shorter than typical engine call latency
plus PostgreSQL round-trips, so two ticks can observe the same due row
before the first completes. The CAS in turngeneration is the
authoritative protection (only one goroutine can flip running → generation_in_progress), but two goroutines doing the engine call and
discarding the loser as conflict would waste an engine call and
inflate engine_validation_error / engine_unreachable counters with
spurious entries. The in-flight set is a 4-line optimisation that
removes the spurious work.
Worker.Wait exposes the in-flight sync.WaitGroup so tests (and
Stage 19's wiring) can drive Tick deterministically and observe
completion. Run itself waits on the same group before returning so
context cancellation gracefully drains in-flight work.
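The in-flight guard can be sketched as follows. To stay self-contained, this Tick takes the due game IDs as an argument instead of scanning the store, and handle is a plain function field; both are simplifications of this example rather than the worker's actual shape.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

type Worker struct {
	inflight sync.Map       // gameID (string) -> struct{}
	wg       sync.WaitGroup // tracks dispatched goroutines
	handle   func(ctx context.Context, gameID string)
}

// Tick dispatches one goroutine per due game that is not already in flight.
func (w *Worker) Tick(ctx context.Context, due []string) {
	for _, gameID := range due {
		if _, loaded := w.inflight.LoadOrStore(gameID, struct{}{}); loaded {
			continue // already dispatched by an earlier tick
		}
		w.wg.Add(1)
		go func(id string) {
			defer w.wg.Done()
			defer w.inflight.Delete(id) // runs before Done, releasing the slot
			w.handle(ctx, id)
		}(gameID)
	}
}

// Wait blocks until every dispatched goroutine has returned.
func (w *Worker) Wait() { w.wg.Wait() }

// demo overlaps two ticks on one game, then ticks again after completion.
func demo() (afterOverlap, afterRetry int32) {
	var calls int32
	release := make(chan struct{})
	w := &Worker{handle: func(context.Context, string) {
		atomic.AddInt32(&calls, 1)
		<-release // simulate a slow engine call
	}}
	ctx := context.Background()

	w.Tick(ctx, []string{"g-1"})
	w.Tick(ctx, []string{"g-1"}) // skipped: g-1 is still in flight
	close(release)
	w.Wait()
	afterOverlap = atomic.LoadInt32(&calls)

	w.Tick(ctx, []string{"g-1"}) // slot was released, so this dispatches
	w.Wait()
	afterRetry = atomic.LoadInt32(&calls)
	return afterOverlap, afterRetry
}

func main() {
	fmt.Println(demo()) // 1 2
}
```

LoadOrStore runs synchronously inside Tick, so the second overlapping tick is rejected even if the first goroutine has not started yet.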
Files landed
Modified:
- ../internal/ports/lobbyclient.go: added GetGameSummary to the interface plus the GameSummary type.
- ../internal/adapters/lobbyclient/client.go: implemented GetGameSummary with the same ErrLobbyUnavailable wrapping precedent as GetMemberships.
- ../internal/adapters/lobbyclient/client_test.go: table-driven tests for happy path, 404, 5xx, malformed JSON, missing required fields, timeout, and bad input.
- ../internal/adapters/mocks/mock_lobbyclient.go: regenerated.
Created:
- ../internal/service/scheduler/service.go, ../internal/service/scheduler/service_test.go: stateless scheduler utility.
- ../internal/service/turngeneration/service.go, ../internal/service/turngeneration/errors.go, ../internal/service/turngeneration/service_test.go: turn-generation orchestrator and tests.
- ../internal/worker/schedulerticker/worker.go, ../internal/worker/schedulerticker/worker_test.go: scheduler ticker worker and tests.
- This decision record.
Reused (not modified):
- internal/domain/runtime/{model.go, transitions.go}: running → generation_in_progress, generation_in_progress → running, generation_in_progress → generation_failed, generation_in_progress → finished were all permitted by the Stage 10 transitions table.
- internal/domain/schedule/nexttick.go: the cron + skip wrapper.
- internal/domain/operation/log.go: the OpKindTurnGeneration enum value already in place.
- internal/ports/{runtimerecordstore.go, engineclient.go, playermappingstore.go, operationlog.go, notificationpublisher.go, lobbyeventspublisher.go}: every store and publisher used by the orchestrator was already present.
- internal/telemetry/runtime.go: RecordTurnGenerationOutcome, RecordLobbyEventPublished, RecordNotificationPublishAttempt.
- pkg/notificationintent.NewGameTurnReadyIntent, NewGameFinishedIntent, NewGameGenerationFailedIntent.
Verification
cd gamemaster
# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks
# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...
# Scheduler utility.
go test ./internal/service/scheduler/...
# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...
# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...
# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...
# Module-wide build remains green.
go test ./...
Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).