---
stage: 15
title: Scheduler, turn generation, and snapshot publisher
---

# Stage 15 — Scheduler, turn generation, and snapshot publisher

This decision record captures the non-obvious choices made while implementing the scheduler ticker, the turn-generation orchestrator, and the publication of `gm:lobby_events` plus `notification:intents` at PLAN Stage 15. It is the heart of Game Master: every running game flows through this code path on every scheduled or admin-forced turn.

## Context

[`../PLAN.md` Stage 15](../PLAN.md) ships three components that together drive a turn:

1. `service/turngeneration` — the orchestrator that performs the `running → generation_in_progress` CAS, calls the engine `/admin/turn`, branches on `finished`, and publishes a `runtime_snapshot_update` / `game_finished` event plus the corresponding `game.turn.ready` / `game.finished` / `game.generation_failed` notification.
2. `service/scheduler` — a thin, stateless wrapper around `domain/schedule.Schedule.Next`, reused by the turn-generation recompute step and (in Stage 17) by `service/adminforce`.
3. `worker/schedulerticker` — the 1-second loop that scans `runtime_records.ListDueRunning(now)` and dispatches one `turngeneration.Handle` per due game.

The lifecycle the orchestrator drives is frozen by [`../README.md` §Lifecycles → Turn generation](../README.md), and the publication cadence by [§Async Stream Contracts](../README.md) and [§Notification Contracts](../README.md). The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle) is Stage 13's `service/registerruntime`.

Seven decisions deviate from a literal reading of either PLAN Stage 15, the README, or the Stage 13 precedent. Each is recorded below.

## Decisions

### D1. Resolve `game_name` synchronously from Lobby per notification

**Decision.** [`ports.LobbyClient`](../internal/ports/lobbyclient.go) gains a `GetGameSummary(ctx, gameID) (GameSummary, error)` method plus a narrow `GameSummary{GameID, GameName, Status}` type. The HTTP-backed adapter at [`internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go) issues a `GET /api/v1/internal/games/{game_id}` against the Lobby internal listener, decodes the `GameRecord` shape (Lobby's frozen contract), and wraps every non-success outcome with `ports.ErrLobbyUnavailable`. The `turngeneration` service calls it before publishing each `notification:intents` entry; on any error the orchestrator falls back to using `game_id` as `game_name` and logs a `warn` event with `error_code=lobby_unavailable`.

**Why.** `notificationintent.GameTurnReadyPayload`, `GameFinishedPayload`, and `GameGenerationFailedPayload` all require a `game_name` string, but Game Master does not own the platform name and the `register-runtime` envelope does not carry it. Three alternatives were considered and rejected:

- **Extend the `register-runtime` contract with `game_name` and persist it on `runtime_records`.** Cleanest architecturally, but it requires editing the Stage 06 frozen OpenAPI spec, the contract test, the Stage 09 migration, the Stage 10 domain type, the Stage 11 store and tests, the Stage 13 register-runtime service and tests, and the regenerated jet code. Substantial cross-stage churn for a single denormalised string.
- **Use `game_id` as the `game_name` placeholder unconditionally.** Zero change cost, but every push notification a user receives would carry the opaque platform identifier — a user-visible regression.
- **Defer notification publication to Stage 16.** Contradicts the PLAN Stage 15 task list, which explicitly enumerates `game.turn.ready`, `game.finished`, and `game.generation_failed` publication.
The chosen design adds one method and one return type to a port already established in Stage 12, with fail-soft fallback semantics that keep notification publication best-effort.

### D2. `Trigger` parameter classifies telemetry, never logic

**Decision.** The plan's input shape `{gameID, trigger ∈ {scheduler, force}}` is preserved as `turngeneration.Input.Trigger`. The value flows into the `gamemaster.turn_generation.outcomes` counter as a `trigger` label and into structured logs; it does **not** branch the orchestrator's persistence path. The skip-tick mechanic is driven exclusively by the runtime record's `skip_next_tick` column.

**Why.** [`../README.md` §Force-next-turn](../README.md) describes adminforce as: "Run the turn-generation flow synchronously (the same code path the scheduler uses). After success, set `runtime_records.skip_next_tick = true`." Adminforce flips the flag *after* the forced turn completes; the *next* scheduler-driven generation consumes it. Forking the orchestrator on `Trigger` would duplicate the recompute logic in two places and reopen the question "what if a force fires while `skip_next_tick` is already true?". Single-path execution makes the answer fall out of the existing rule (read the flag at start, clear it at recompute) without special cases.

### D3. Two-CAS pattern with cleanup on engine failure

**Decision.** Persistence steps mirror Stage 13's CAS-then-rollback pattern with two CAS transitions per generation:

1. `running → generation_in_progress` at the start. On `runtime.ErrConflict` (concurrent stop / external mutation) the orchestrator returns `Result{ErrorCode: conflict}` without publishing events; the external mutation is responsible for its own snapshot.
2. After the engine call:
   - success + `finished=true` → `generation_in_progress → finished`;
   - success + `finished=false` → `generation_in_progress → running`;
   - engine error → `generation_in_progress → generation_failed`.
The post-engine CAS surfaces `runtime.ErrConflict` only when an external mutation (typical cause: admin issued a stop while the engine was generating) overtook the orchestrator. The engine call has already mutated state, but the runtime row is owned by the new actor; the orchestrator records the audit failure with `conflict` and exits.

**Why.** This keeps Stage 13's pattern intact: every CAS knows what state the row should be in before the call, and a mismatch always yields `conflict`. Mixing the two CAS guards into a single combined status update (e.g., a transactional "running and not stopped") would require the adapter to expose multi-status CAS predicates, breaking the per-row CAS abstraction Stage 11 settled on.

### D4. Snapshot cadence: one publication per outcome

**Decision.** The orchestrator publishes exactly one `runtime_snapshot_update` *or* `game_finished` per turn-generation call:

- success + not finished → `PublishSnapshotUpdate` with full `player_turn_stats`;
- success + finished → `PublishGameFinished` with full `player_turn_stats`;
- engine failure → `PublishSnapshotUpdate` with `RuntimeStatus=generation_failed` and empty `player_turn_stats` (no fresh engine payload).

The intermediate `running → generation_in_progress` transition is **not** broadcast.

**Why.** The README cadence enumerates "transitioned" cases as examples (`running ↔ generation_in_progress`), but PLAN Stage 15 explicitly anchors publication on the outcome side. Publishing twice would double Lobby's processing cost without delivering new information, because `generation_in_progress` carries no fresh engine state and Lobby cannot act on the in-progress moment.

### D5. Notification recipients = `playermappingstore.ListByGame`

**Decision.** `game.turn.ready` and `game.finished` use `AudienceKindUser` and need a sorted, unique, non-empty `recipient_user_ids` list.
The orchestrator derives it from `playermappingstore.ListByGame(gameID)` projected to `UserID` values, deduplicated and sorted ascending. Empty rosters cause the notification to be skipped (with a `warn` log); the runtime mutation persists.

**Why.** This is the only roster data Game Master owns until Stage 16 delivers the membership cache. After Stage 17 wires `banish`, the player_mappings rows still represent the engine-known roster and remain a correct conservative recipient set (banished members will be filtered separately by Notification Service's user resolution if absent in `User Service`). Adding a synchronous Lobby `GetMemberships` call here would duplicate the work Stage 16 is already on the hook to provide.

### D6. Scheduler service is a stateless utility

**Decision.** [`service/scheduler.Service`](../internal/service/scheduler/service.go) exposes a single `ComputeNext(turnSchedule, after, skipNextTick) (time.Time, bool, error)` method that wraps `schedule.Parse(...).Next(after, skipNextTick)`. The service holds no dependencies and no clock; the caller passes `after`. `turngeneration` injects a `*scheduler.Service` and uses it during the post-success recompute; Stage 17 will reuse the same instance from `adminforce`.

**Why.** Centralising the parse-then-next sequence keeps the skip rule in one place and makes the future Stage 17 caller trivial. Holding no state means tests are pure value tests against the `domain/schedule` wrapper; no clock injection or dependency wiring is required.

### D7. Per-game in-flight set on the scheduler ticker

**Decision.** [`worker/schedulerticker.Worker`](../internal/worker/schedulerticker/worker.go) holds a `sync.Map` used as a set of currently-dispatched game IDs. At each tick the worker scans `RuntimeRecords.ListDueRunning(now)` and launches one goroutine per due game; if `LoadOrStore` reports the game is already in-flight, the worker logs at `debug` and skips.
The goroutine releases the slot via `defer w.inflight.Delete(gameID)`.

**Why.** A 1-second tick is shorter than typical engine call latency plus PostgreSQL round-trips, so two ticks can observe the same due row before the first completes. The CAS in `turngeneration` is the authoritative protection (only one goroutine can flip `running → generation_in_progress`), but two goroutines doing the engine call and discarding the loser as `conflict` would waste an engine call and inflate `engine_validation_error` / `engine_unreachable` counters with spurious entries. The in-flight set is a 4-line optimisation that removes the spurious work.

`Worker.Wait` exposes the in-flight `sync.WaitGroup` so tests (and Stage 19's wiring) can drive `Tick` deterministically and observe completion. `Run` itself waits on the same group before returning, so context cancellation gracefully drains in-flight work.

## Files landed

**Modified:**

- [`../internal/ports/lobbyclient.go`](../internal/ports/lobbyclient.go) — added `GetGameSummary` to the interface plus the `GameSummary` type.
- [`../internal/adapters/lobbyclient/client.go`](../internal/adapters/lobbyclient/client.go) — implemented `GetGameSummary` with the same `ErrLobbyUnavailable` wrapping precedent as `GetMemberships`.
- [`../internal/adapters/lobbyclient/client_test.go`](../internal/adapters/lobbyclient/client_test.go) — table-driven tests for happy path, 404, 5xx, malformed JSON, missing required fields, timeout, and bad input.
- [`../internal/adapters/mocks/mock_lobbyclient.go`](../internal/adapters/mocks/mock_lobbyclient.go) — regenerated.

**Created:**

- [`../internal/service/scheduler/service.go`](../internal/service/scheduler/service.go), [`../internal/service/scheduler/service_test.go`](../internal/service/scheduler/service_test.go) — stateless scheduler utility.
- [`../internal/service/turngeneration/service.go`](../internal/service/turngeneration/service.go), [`../internal/service/turngeneration/errors.go`](../internal/service/turngeneration/errors.go), [`../internal/service/turngeneration/service_test.go`](../internal/service/turngeneration/service_test.go) — turn-generation orchestrator and tests.
- [`../internal/worker/schedulerticker/worker.go`](../internal/worker/schedulerticker/worker.go), [`../internal/worker/schedulerticker/worker_test.go`](../internal/worker/schedulerticker/worker_test.go) — scheduler ticker worker and tests.
- This decision record.

**Reused (not modified):**

- `internal/domain/runtime/{model.go, transitions.go}` — `running → generation_in_progress`, `generation_in_progress → running`, `generation_in_progress → generation_failed`, and `generation_in_progress → finished` were all permitted by the Stage 10 transitions table.
- `internal/domain/schedule/nexttick.go` — the cron + skip wrapper.
- `internal/domain/operation/log.go` — the `OpKindTurnGeneration` enum value already in place.
- `internal/ports/{runtimerecordstore.go, engineclient.go, playermappingstore.go, operationlog.go, notificationpublisher.go, lobbyeventspublisher.go}` — every store and publisher used by the orchestrator was already present.
- `internal/telemetry/runtime.go` — `RecordTurnGenerationOutcome`, `RecordLobbyEventPublished`, `RecordNotificationPublishAttempt`.
- `pkg/notificationintent.NewGameTurnReadyIntent`, `NewGameFinishedIntent`, `NewGameGenerationFailedIntent`.

## Verification

```sh
cd gamemaster

# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks

# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Scheduler utility.
go test ./internal/service/scheduler/...

# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...

# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...
# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...

# Module-wide build remains green.
go test ./...
```

Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).