Stage 15 — Scheduler, turn generation, and snapshot publisher

This decision record captures the non-obvious choices made while implementing the scheduler ticker, the turn-generation orchestrator, and the publication of gm:lobby_events plus notification:intents at PLAN Stage 15. It is the heart of Game Master: every running game flows through this code path on every scheduled or admin-forced turn.

Context

../PLAN.md Stage 15 ships three components that together drive a turn:

  1. service/turngeneration — the orchestrator that CASes the runtime record from running → generation_in_progress, calls the engine /admin/turn, branches on finished, and publishes a runtime_snapshot_update / game_finished event plus the corresponding game.turn.ready / game.finished / game.generation_failed notification.
  2. service/scheduler — a thin, stateless wrapper around domain/schedule.Schedule.Next reused by the turn-generation recompute step and (in Stage 17) by service/adminforce.
  3. worker/schedulerticker — the 1-second loop that scans runtime_records.ListDueRunning(now) and dispatches one turngeneration.Handle per due game.

The lifecycle the orchestrator drives is frozen by ../README.md §Lifecycles → Turn generation, and the publication cadence by §Async Stream Contracts and §Notification Contracts. The reference precedent for the orchestrator shape (Input / Result / Dependencies / NewService / Handle) is Stage 13's service/registerruntime.

Seven decisions deviate from a literal reading of either PLAN Stage 15, the README, or the Stage 13 precedent. Each is recorded below.

Decisions

D1. Resolve game_name synchronously from Lobby per notification

Decision. ports.LobbyClient gains a GetGameSummary(ctx, gameID) (GameSummary, error) method plus a narrow GameSummary{GameID, GameName, Status} type. The HTTP-backed adapter at internal/adapters/lobbyclient/client.go issues a GET /api/v1/internal/games/{game_id} against the Lobby internal listener, decodes the GameRecord shape (Lobby's frozen contract), and wraps every non-success outcome with ports.ErrLobbyUnavailable. The turngeneration service calls it before publishing each notification:intents entry; on any error the orchestrator falls back to using game_id as game_name and logs a warn event with error_code=lobby_unavailable.

Why. notificationintent.GameTurnReadyPayload, GameFinishedPayload, and GameGenerationFailedPayload all require a game_name string, but Game Master does not own the platform name and the register-runtime envelope does not carry it. Three alternatives were considered and rejected:

  • Extend the register-runtime contract with game_name and persist it on runtime_records. Cleanest architecturally, but requires editing the Stage 06 frozen OpenAPI spec, the contract test, the Stage 09 migration, the Stage 10 domain type, the Stage 11 store and tests, the Stage 13 register-runtime service and tests, and the regenerated jet code. Substantial cross-stage churn for a single denormalised string.
  • Use game_id as the game_name placeholder unconditionally. Zero change cost, but every push notification a user receives carries the opaque platform identifier — a user-visible regression.
  • Defer notification publication to Stage 16. Contradicts the PLAN Stage 15 task list, which explicitly enumerates game.turn.ready, game.finished, and game.generation_failed publication.

The chosen design adds one method and one return type to a port already established in Stage 12, with fail-soft fallback semantics that keep notification publication best-effort.
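
A minimal sketch of the fail-soft resolution under assumed shapes (the GameSummary fields and GetGameSummary signature follow D1; the lobbyClient interface, the resolveGameName helper, and the log wording are illustrative stand-ins, not the real port):

```go
package turngeneration

import (
	"context"
	"log/slog"
)

// GameSummary mirrors the narrow type D1 adds to the port.
type GameSummary struct {
	GameID   string
	GameName string
	Status   string
}

// lobbyClient is a cut-down stand-in for ports.LobbyClient.
type lobbyClient interface {
	GetGameSummary(ctx context.Context, gameID string) (GameSummary, error)
}

// resolveGameName applies the fail-soft rule: on any lobby error, fall back
// to game_id as the display name and log a warn event.
func resolveGameName(ctx context.Context, lobby lobbyClient, gameID string) string {
	summary, err := lobby.GetGameSummary(ctx, gameID)
	if err != nil {
		slog.WarnContext(ctx, "falling back to game_id as game_name",
			"error_code", "lobby_unavailable", "game_id", gameID, "error", err)
		return gameID // best-effort: notification publication proceeds regardless
	}
	return summary.GameName
}
```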

D2. Trigger parameter classifies telemetry, never logic

Decision. The plan's input shape {gameID, trigger ∈ {scheduler, force}} is preserved as turngeneration.Input.Trigger. The value flows into the gamemaster.turn_generation.outcomes counter as a trigger label and into structured logs; it does not branch the orchestrator's persistence path. The skip-tick mechanic is driven exclusively by the runtime record's skip_next_tick column.

Why. ../README.md §Force-next-turn describes adminforce as: "Run the turn-generation flow synchronously (the same code path the scheduler uses). After success, set runtime_records.skip_next_tick = true." Adminforce flips the flag after the forced turn completes; the next scheduler-driven generation consumes it. Forking the orchestrator on Trigger would duplicate the recompute logic in two places and reopen the question "what if a force fires while skip_next_tick is already true?". Single-path makes the answer fall out of the existing rule (read the flag at start, clear at recompute) without special cases.
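
A sketch of the single-path shape, with hypothetical Service, Result, and metrics stand-ins (only Input.Trigger and the trigger label on the outcomes counter come from the decision):

```go
package turngeneration

import "context"

type Trigger string

const (
	TriggerScheduler Trigger = "scheduler"
	TriggerForce     Trigger = "force"
)

type Input struct {
	GameID  string
	Trigger Trigger
}

type Result struct{ ErrorCode string }

type outcomeMetrics interface {
	RecordTurnGenerationOutcome(trigger, outcome string)
}

type Service struct {
	metrics outcomeMetrics
}

func (s *Service) Handle(ctx context.Context, in Input) Result {
	res := s.generate(ctx, in.GameID) // identical path for scheduler and force
	outcome := "success"
	if res.ErrorCode != "" {
		outcome = res.ErrorCode
	}
	// Trigger classifies telemetry; it never branches persistence.
	s.metrics.RecordTurnGenerationOutcome(string(in.Trigger), outcome)
	return res
}

func (s *Service) generate(ctx context.Context, gameID string) Result {
	// Orchestration elided; the skip rule reads skip_next_tick, not Trigger.
	return Result{}
}
```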

D3. Two-CAS pattern with cleanup on engine failure

Decision. Persistence steps mirror Stage 13's CAS-then-rollback pattern with two CAS transitions per generation:

  1. running → generation_in_progress at the start. On runtime.ErrConflict (concurrent stop / external mutation) the orchestrator returns Result{ErrorCode: conflict} without publishing events; the external mutation is responsible for its own snapshot.
  2. After the engine call:
    • success + finished=true: generation_in_progress → finished;
    • success + finished=false: generation_in_progress → running;
    • engine error: generation_in_progress → generation_failed.

The post-engine CAS surfaces runtime.ErrConflict only when an external mutation (typical cause: admin issued a stop while the engine was generating) overtook the orchestrator. The engine call has already mutated state, but the runtime row is owned by the new actor; the orchestrator records the audit failure with conflict and exits.

Why. This keeps Stage 13's pattern intact: every CAS knows what state the row should be in before the call, and a mismatch always yields conflict. Mixing the two CAS guards with a single combined status update (e.g., a transactional "running and not stopped") would require the adapter to expose multi-status CAS predicates, breaking the per-row CAS abstraction Stage 11 settled on.
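
A condensed sketch of the two-CAS choreography; CompareAndSwapStatus, GenerateTurn, and the store_unavailable code are illustrative stand-ins, while the transition targets mirror the table above:

```go
package turngeneration

import (
	"context"
	"errors"
)

type Status string

const (
	StatusRunning              Status = "running"
	StatusGenerationInProgress Status = "generation_in_progress"
	StatusGenerationFailed     Status = "generation_failed"
	StatusFinished             Status = "finished"
)

// ErrConflict stands in for runtime.ErrConflict.
var ErrConflict = errors.New("runtime: conflict")

type runtimeStore interface {
	CompareAndSwapStatus(ctx context.Context, gameID string, from, to Status) error
}

type engineClient interface {
	GenerateTurn(ctx context.Context, gameID string) (finished bool, err error)
}

func runGeneration(ctx context.Context, st runtimeStore, eng engineClient, gameID string) (errorCode string) {
	// CAS 1: running → generation_in_progress. A conflict means another actor
	// owns the row; exit without publishing anything.
	if err := st.CompareAndSwapStatus(ctx, gameID, StatusRunning, StatusGenerationInProgress); err != nil {
		if errors.Is(err, ErrConflict) {
			return "conflict"
		}
		return "store_unavailable" // illustrative catch-all for store errors
	}

	finished, engErr := eng.GenerateTurn(ctx, gameID)

	// CAS 2: pick the target status from the engine outcome.
	target := StatusRunning
	switch {
	case engErr != nil:
		target = StatusGenerationFailed
	case finished:
		target = StatusFinished
	}
	if err := st.CompareAndSwapStatus(ctx, gameID, StatusGenerationInProgress, target); err != nil {
		// An external mutation (typically an admin stop) overtook the
		// orchestrator mid-generation; the new actor owns the row.
		return "conflict"
	}
	if engErr != nil {
		return "engine_error" // illustrative; the real codes are engine_* variants
	}
	return ""
}
```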

D4. Snapshot cadence: one publication per outcome

Decision. The orchestrator publishes exactly one runtime_snapshot_update or game_finished per turn-generation call:

  • success + not finished → PublishSnapshotUpdate with full player_turn_stats;
  • success + finished → PublishGameFinished with full player_turn_stats;
  • engine failure → PublishSnapshotUpdate with RuntimeStatus=generation_failed and empty player_turn_stats (no fresh engine payload).

The intermediate running → generation_in_progress transition is not broadcast.

Why. The README cadence enumerates "transitioned" cases as examples (running ↔ generation_in_progress), but PLAN Stage 15 explicitly anchors publication on the outcome side. Publishing twice would double Lobby's processing cost without delivering new information, because generation_in_progress carries no fresh engine state and Lobby cannot act on the in-progress moment.
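
A sketch of the outcome branch under a trimmed publisher interface (the method names follow the prose above; the payload shapes are simplified assumptions):

```go
package turngeneration

import "context"

type PlayerTurnStat struct {
	PlayerID  string
	TurnState string
}

type lobbyEventsPublisher interface {
	PublishSnapshotUpdate(ctx context.Context, gameID, runtimeStatus string, stats []PlayerTurnStat) error
	PublishGameFinished(ctx context.Context, gameID string, stats []PlayerTurnStat) error
}

// publishOutcome emits exactly one event per generation outcome; the
// intermediate generation_in_progress transition is never broadcast.
func publishOutcome(ctx context.Context, pub lobbyEventsPublisher, gameID string, finished bool, engineErr error, stats []PlayerTurnStat) error {
	switch {
	case engineErr != nil:
		// No fresh engine payload: failed snapshot with empty stats.
		return pub.PublishSnapshotUpdate(ctx, gameID, "generation_failed", nil)
	case finished:
		return pub.PublishGameFinished(ctx, gameID, stats)
	default:
		return pub.PublishSnapshotUpdate(ctx, gameID, "running", stats)
	}
}
```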

D5. Notification recipients = playermappingstore.ListByGame

Decision. game.turn.ready and game.finished use AudienceKindUser and need a sorted, unique, non-empty recipient_user_ids list. The orchestrator derives it from playermappingstore.ListByGame(gameID) projected to UserID values, deduplicated and sorted ascending. Empty rosters cause the notification to be skipped with a warn log; the runtime mutation still persists.

Why. This is the only roster data Game Master owns until Stage 16 delivers the membership cache. After Stage 17 wires banish, the player_mappings rows still represent the engine-known roster and remain a correct conservative recipient set (banished members will be filtered separately by Notification Service's user resolution if absent in User Service). Adding a synchronous Lobby GetMemberships call here would duplicate the work Stage 16 is already on the hook to provide.
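
A sketch of the recipient derivation, assuming a simplified mapping row (only the project-dedupe-sort rule and the empty-roster skip are prescribed by the decision):

```go
package turngeneration

import "sort"

// PlayerMapping is a trimmed stand-in for the player_mappings row.
type PlayerMapping struct {
	GameID string
	UserID string
}

// recipientUserIDs projects mappings to a sorted, unique user-ID list.
// An empty result means the notification is skipped (with a warn log).
func recipientUserIDs(mappings []PlayerMapping) []string {
	seen := make(map[string]struct{}, len(mappings))
	ids := make([]string, 0, len(mappings))
	for _, m := range mappings {
		if _, dup := seen[m.UserID]; dup {
			continue
		}
		seen[m.UserID] = struct{}{}
		ids = append(ids, m.UserID)
	}
	sort.Strings(ids)
	return ids
}
```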

D6. Scheduler service is a stateless utility

Decision. service/scheduler.Service exposes a single ComputeNext(turnSchedule, after, skipNextTick) (time.Time, bool, error) method that wraps schedule.Parse(...).Next(after, skipNextTick). The service holds no dependencies and no clock; the caller passes after. turngeneration injects a *scheduler.Service and uses it during the post-success recompute; Stage 17 will reuse the same instance from adminforce.

Why. Centralising the parse-then-next sequence keeps the skip rule in one place and makes the future Stage 17 caller trivial. Holding no state means tests are pure value tests against the domain/schedule wrapper; no clock injection or dependency wiring is required.
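
A sketch of the wrapper shape; Schedule and parseSchedule are local stand-ins for the domain/schedule API so the snippet is self-contained (the real service calls schedule.Parse directly):

```go
package scheduler

import (
	"errors"
	"time"
)

// Schedule stands in for the domain/schedule wrapper around a cron expression.
type Schedule interface {
	Next(after time.Time, skipNextTick bool) (time.Time, bool)
}

// parseSchedule stands in for schedule.Parse.
func parseSchedule(expr string) (Schedule, error) {
	return nil, errors.New("stand-in: real parsing lives in domain/schedule")
}

// Service is stateless: no clock, no dependencies; callers pass `after`.
type Service struct{}

func NewService() *Service { return &Service{} }

func (*Service) ComputeNext(turnSchedule string, after time.Time, skipNextTick bool) (time.Time, bool, error) {
	sched, err := parseSchedule(turnSchedule)
	if err != nil {
		return time.Time{}, false, err
	}
	next, ok := sched.Next(after, skipNextTick)
	return next, ok, nil
}
```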

D7. Per-game in-flight set on the scheduler ticker

Decision. worker/schedulerticker.Worker holds a sync.Map[gameID]struct{} of currently-dispatched games. At each tick the worker scans RuntimeRecords.ListDueRunning(now) and launches one goroutine per due game; if LoadOrStore reports the game is already in-flight, the worker logs at debug and skips. The goroutine releases the slot via defer w.inflight.Delete(gameID).

Why. A 1-second tick is shorter than typical engine call latency plus PostgreSQL round-trips, so two ticks can observe the same due row before the first completes. The CAS in turngeneration is the authoritative protection (only one goroutine can flip running → generation_in_progress), but two goroutines doing the engine call and discarding the loser as conflict would waste an engine call and inflate engine_validation_error / engine_unreachable counters with spurious entries. The in-flight set is a 4-line optimisation that removes the spurious work.

Worker.Wait exposes the in-flight sync.WaitGroup so tests (and Stage 19's wiring) can drive Tick deterministically and observe completion. Run itself waits on the same group before returning so context cancellation gracefully drains in-flight work.
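
A sketch of the in-flight guard, with illustrative handle and listDue hooks standing in for turngeneration.Handle and RuntimeRecords.ListDueRunning:

```go
package schedulerticker

import (
	"context"
	"sync"
)

// Worker mirrors D7's shape; handle and listDue are injected stand-ins.
type Worker struct {
	inflight sync.Map       // gameID -> struct{} of dispatched games
	wg       sync.WaitGroup // drained by Wait and by Run on cancellation
	handle   func(ctx context.Context, gameID string)
	listDue  func(ctx context.Context) []string
}

// Tick dispatches one goroutine per due game; a game already in flight is
// skipped (the real worker logs this at debug).
func (w *Worker) Tick(ctx context.Context) {
	for _, gameID := range w.listDue(ctx) {
		if _, loaded := w.inflight.LoadOrStore(gameID, struct{}{}); loaded {
			continue // previous dispatch for this game has not finished yet
		}
		w.wg.Add(1)
		go func(id string) {
			defer w.wg.Done()
			defer w.inflight.Delete(id) // release the slot once Handle returns
			w.handle(ctx, id)
		}(gameID)
	}
}

// Wait blocks until every dispatched generation completes.
func (w *Worker) Wait() { w.wg.Wait() }
```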

Files landed

Modified:

  • internal/ports — LobbyClient gains GetGameSummary plus the GameSummary type (D1).
  • internal/adapters/lobbyclient/client.go — issues GET /api/v1/internal/games/{game_id} behind GetGameSummary.
  • internal/adapters/mocks — regenerated for the GetGameSummary additions.

Created:

  • internal/service/turngeneration — the turn-generation orchestrator.
  • internal/service/scheduler — the stateless ComputeNext utility.
  • internal/worker/schedulerticker — the 1-second scheduler tick loop.

Reused (not modified):

  • internal/domain/runtime/{model.go, transitions.go} — running → generation_in_progress, generation_in_progress → running, generation_in_progress → generation_failed, generation_in_progress → finished were all permitted by the Stage 10 transitions table.
  • internal/domain/schedule/nexttick.go — the cron + skip wrapper.
  • internal/domain/operation/log.go — the OpKindTurnGeneration enum value already in place.
  • internal/ports/{runtimerecordstore.go, engineclient.go, playermappingstore.go, operationlog.go, notificationpublisher.go, lobbyeventspublisher.go} — every store and publisher used by the orchestrator was already present.
  • internal/telemetry/runtime.go — RecordTurnGenerationOutcome, RecordLobbyEventPublished, RecordNotificationPublishAttempt.
  • pkg/notificationintent.NewGameTurnReadyIntent, NewGameFinishedIntent, NewGameGenerationFailedIntent.

Verification

cd gamemaster

# Mock regeneration must produce the GetGameSummary additions and
# nothing else.
make mocks
git diff --stat internal/adapters/mocks

# Domain + ports tests still pass.
go test ./internal/domain/... ./internal/ports/...

# Scheduler utility.
go test ./internal/service/scheduler/...

# Turn-generation orchestrator.
go test ./internal/service/turngeneration/...

# Scheduler ticker worker.
go test ./internal/worker/schedulerticker/...

# Updated lobby client adapter.
go test ./internal/adapters/lobbyclient/...

# Module-wide build remains green.
go test ./...

Out-of-scope for this stage: app wiring (Stage 19), service-local integration suite (Stage 21), cross-service Lobby ↔ GM tests (Stage 22).