---
stage: 15
title: Scheduler, turn generation, and snapshot publisher
---

# Stage 15 — Scheduler, turn generation, and snapshot publisher

This decision record captures the non-obvious choices made while implementing the scheduler ticker, the turn-generation orchestrator, and the publication of `gm:lobby_events` plus `notification:intents` at PLAN Stage 15. It is the heart of Game Master: every running game flows through this code path on every scheduled or admin-forced turn.

## Context

[`../PLAN.md` Stage 15](../PLAN.md) ships three components that together drive a turn:

1. `service/turngeneration` — the orchestrator that performs the `running → generation_in_progress` CAS, calls the engine `/admin/turn`, branches on `finished`, and publishes a `runtime_snapshot_update` / `game_finished` event plus the corresponding `game.turn.ready` / `game.finished` / `game.generation_failed` notification.
2. `service/scheduler` — a thin, stateless wrapper around `domain/schedule.Schedule.Next`, reused by the turn-generation recompute step and (in Stage 17) by `service/adminforce`.
3. `worker/schedulerticker` — the 1-second loop that scans `runtime_records.ListDueRunning(now)` and dispatches one `turngeneration.Handle` per due game.

The lifecycle the orchestrator drives is frozen by [`../README.md` §Lifecycles → Turn generation](../README.md), and the publication cadence by [§Async Stream Contracts](../README.md) and [§Notification Contracts](../README.md). The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle) is Stage 13's `service/registerruntime`.

Seven decisions deviate from a literal reading of either PLAN Stage 15, the README, or the Stage 13 precedent. Each is recorded below.

## Decisions

### D1. Resolve `game_name` synchronously from Lobby per notification

**Decision.** [`ports.LobbyClient`](../internal/ports/lobbyclient.go) gains a `GetGameSummary(ctx, gameID) (GameSummary, error)` method plus a narrow `GameSummary{GameID, GameName, Status}` type. The HTTP-backed adapter at [`internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go) issues a `GET /api/v1/internal/games/{game_id}` against the Lobby internal listener, decodes the `GameRecord` shape (Lobby's frozen contract), and wraps every non-success outcome with `ports.ErrLobbyUnavailable`. The `turngeneration` service calls it before publishing each `notification:intents` entry; on any error the orchestrator falls back to using `game_id` as `game_name` and logs a `warn` event with `error_code=lobby_unavailable`.

**Why.** `notificationintent.GameTurnReadyPayload`, `GameFinishedPayload`, and `GameGenerationFailedPayload` all require a `game_name` string, but Game Master does not own the platform name and the `register-runtime` envelope does not carry it. Three alternatives were considered and rejected:

- **Extend the `register-runtime` contract with `game_name` and persist it on `runtime_records`.** Cleanest architecturally, but it requires editing the Stage 06 frozen OpenAPI spec, the contract test, the Stage 09 migration, the Stage 10 domain type, the Stage 11 store and tests, the Stage 13 register-runtime service and tests, and the regenerated jet code. Substantial cross-stage churn for a single denormalised string.
- **Use `game_id` as the `game_name` placeholder unconditionally.** Zero change cost, but every push notification a user receives would carry the opaque platform identifier — a user-visible regression.
- **Defer notification publication to Stage 16.** Contradicts the PLAN Stage 15 task list, which explicitly enumerates `game.turn.ready`, `game.finished`, and `game.generation_failed` publication.
The chosen design adds one method and one return type to a port already established in Stage 12, with fail-soft fallback semantics that keep notification publication best-effort.

### D2. `Trigger` parameter classifies telemetry, never logic

**Decision.** The plan's input shape `{gameID, trigger ∈ {scheduler, force}}` is preserved as `turngeneration.Input.Trigger`. The value flows into the `gamemaster.turn_generation.outcomes` counter as a `trigger` label and into structured logs; it does **not** branch the orchestrator's persistence path. The skip-tick mechanic is driven exclusively by the runtime record's `skip_next_tick` column.

**Why.** [`../README.md` §Force-next-turn](../README.md) describes adminforce as: "Run the turn-generation flow synchronously (the same code path the scheduler uses). After success, set `runtime_records.skip_next_tick = true`." Adminforce flips the flag *after* the forced turn completes; the *next* scheduler-driven generation consumes it. Forking the orchestrator on `Trigger` would duplicate the recompute logic in two places and reopen the question "what if a force fires while `skip_next_tick` is already true?". Single-path execution makes the answer fall out of the existing rule (read the flag at start, clear it at recompute) without special cases.

### D3. Two-CAS pattern with cleanup on engine failure

**Decision.** Persistence steps mirror Stage 13's CAS-then-rollback pattern with two CAS transitions per generation:

1. `running → generation_in_progress` at the start. On `runtime.ErrConflict` (concurrent stop / external mutation) the orchestrator returns `Result{ErrorCode: conflict}` without publishing events; the external mutation is responsible for its own snapshot.
2. After the engine call:
   - success + `finished=true` → `generation_in_progress → finished`;
   - success + `finished=false` → `generation_in_progress → running`;
   - engine error → `generation_in_progress → generation_failed`.
The post-engine CAS surfaces `runtime.ErrConflict` only when an external mutation (typical cause: admin issued a stop while the engine was generating) overtook the orchestrator. The engine call has already mutated state, but the runtime row is owned by the new actor; the orchestrator records the audit failure with `conflict` and exits.

**Why.** This keeps Stage 13's pattern intact: every CAS knows what state the row should be in before the call, and a mismatch always yields `conflict`. Mixing the two CAS guards into a single combined status update (e.g., a transactional "running and not stopped") would require the adapter to expose multi-status CAS predicates, breaking the per-row CAS abstraction Stage 11 settled on.

### D4. Snapshot cadence: one publication per outcome

**Decision.** The orchestrator publishes exactly one `runtime_snapshot_update` *or* `game_finished` per turn-generation call:

- success + not finished → `PublishSnapshotUpdate` with full `player_turn_stats`;
- success + finished → `PublishGameFinished` with full `player_turn_stats`;
- engine failure → `PublishSnapshotUpdate` with `RuntimeStatus=generation_failed` and empty `player_turn_stats` (no fresh engine payload).

The intermediate `running → generation_in_progress` transition is **not** broadcast.

**Why.** The README cadence enumerates "transitioned" cases as examples (`running ↔ generation_in_progress`), but PLAN Stage 15 explicitly anchors publication on the outcome side. Publishing twice would double Lobby's processing cost without delivering new information, because `generation_in_progress` carries no fresh engine state and Lobby cannot act on the in-progress moment.

### D5. Notification recipients = `playermappingstore.ListByGame`

**Decision.** `game.turn.ready` and `game.finished` use `AudienceKindUser` and need a sorted, unique, non-empty `recipient_user_ids` list.
The orchestrator derives it from `playermappingstore.ListByGame(gameID)` projected to `UserID` values, deduplicated and sorted ascending. Empty rosters cause the notification to be skipped (with a `warn` log); the runtime mutation persists.

**Why.** This is the only roster data Game Master owns until Stage 16 delivers the membership cache. After Stage 17 wires `banish`, the player_mappings rows still represent the engine-known roster and remain a correct conservative recipient set (banished members will be filtered separately by Notification Service's user resolution if absent in `User Service`). Adding a synchronous Lobby `GetMemberships` call here would duplicate the work Stage 16 is already on the hook to provide.

### D6. Scheduler service is a stateless utility

**Decision.** [`service/scheduler.Service`](../internal/service/scheduler/service.go) exposes a single `ComputeNext(turnSchedule, after, skipNextTick) (time.Time, bool, error)` method that wraps `schedule.Parse(...).Next(after, skipNextTick)`. The service holds no dependencies and no clock; the caller passes `after`. `turngeneration` injects a `*scheduler.Service` and uses it during the post-success recompute; Stage 17 will reuse the same instance from `adminforce`.

**Why.** Centralising the parse-then-next sequence keeps the skip rule in one place and makes the future Stage 17 caller trivial. Holding no state means tests are pure value tests against the `domain/schedule` wrapper; no clock injection or dependency wiring is required.

### D7. Per-game in-flight set on the scheduler ticker

**Decision.** [`worker/schedulerticker.Worker`](../internal/worker/schedulerticker/worker.go) holds a `sync.Map` used as a set of currently-dispatched game IDs. At each tick the worker scans `RuntimeRecords.ListDueRunning(now)` and launches one goroutine per due game; if `LoadOrStore` reports the game is already in-flight, the worker logs at `debug` and skips.
The goroutine releases the slot via `defer w.inflight.Delete(gameID)`.

**Why.** A 1-second tick is shorter than typical engine call latency plus PostgreSQL round-trips, so two ticks can observe the same due row before the first completes. The CAS in `turngeneration` is the authoritative protection (only one goroutine can flip `running → generation_in_progress`), but two goroutines doing the engine call and discarding the loser as `conflict` would waste an engine call and inflate `engine_validation_error` / `engine_unreachable` counters with spurious entries. The in-flight set is a 4-line optimisation that removes the spurious work.

`Worker.Wait` exposes the in-flight `sync.WaitGroup` so tests (and Stage 19's wiring) can drive `Tick` deterministically and observe completion. `Run` itself waits on the same group before returning, so context cancellation gracefully drains in-flight work.

## Files landed

**Modified:**

- [`../internal/ports/lobbyclient.go`](../internal/ports/lobbyclient.go) — added `GetGameSummary` to the interface plus the `GameSummary` type.
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go) — implemented `GetGameSummary` with the same `ErrLobbyUnavailable` wrapping precedent as `GetMemberships`.
- [`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go) — table-driven tests for happy path, 404, 5xx, malformed JSON, missing required fields, timeout, and bad input.
- [`../internal/adapters/mocks/mock_lobbyclient.go`](../internal/adapters/mocks/mock_lobbyclient.go) — regenerated.

**Created:**

- [`../internal/service/scheduler/service.go`](../internal/service/scheduler/service.go), [`../internal/service/scheduler/service_test.go`](../internal/service/scheduler/service_test.go) — stateless scheduler utility.
- [`../internal/service/turngeneration/service.go`](../internal/service/turngeneration/service.go), [`../internal/service/turngeneration/errors.go`](../internal/service/turngeneration/errors.go), [`../internal/service/turngeneration/service_test.go`](../internal/service/turngeneration/service_test.go) — turn-generation orchestrator and tests.
- [`../internal/worker/schedulerticker/worker.go`](../internal/worker/schedulerticker/worker.go), [`../internal/worker/schedulerticker/worker_test.go`](../internal/worker/schedulerticker/worker_test.go) — scheduler ticker worker and tests.
- This decision record.

**Reused (not modified):**

- `internal/domain/runtime/{model.go, transitions.go}` — `running → generation_in_progress`, `generation_in_progress → running`, `generation_in_progress → generation_failed`, and `generation_in_progress → finished` were all permitted by the Stage 10 transitions table.
- `internal/domain/schedule/nexttick.go` — the cron + skip wrapper.
- `internal/domain/operation/log.go` — the `OpKindTurnGeneration` enum value already in place.
- `internal/ports/{runtimerecordstore.go, engineclient.go, playermappingstore.go, operationlog.go, notificationpublisher.go, lobbyeventspublisher.go}` — every store and publisher used by the orchestrator was already present.
- `internal/telemetry/runtime.go` — `RecordTurnGenerationOutcome`, `RecordLobbyEventPublished`, `RecordNotificationPublishAttempt`.
- `pkg/notificationintent.NewGameTurnReadyIntent`, `NewGameFinishedIntent`, `NewGameGenerationFailedIntent`.

## Verification

```sh
cd gamemaster

# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks

# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Scheduler utility.
go test ./internal/service/scheduler/...

# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...

# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...
# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...

# Module-wide build remains green.
go test ./...
```

Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).