feat: gamemaster

Ilia Denisov
2026-05-03 07:59:03 +02:00
committed by GitHub
parent a7cee15115
commit 3e2622757e
229 changed files with 41521 additions and 1098 deletions
+123 -28
@@ -417,9 +417,9 @@ It also stores a denormalized runtime snapshot for convenience, at least:
* `engine_health_summary`.
Additionally, `Game Lobby` aggregates per-member game statistics from
`player_turn_stats` carried on each `runtime_snapshot_update` event: current
and running-max of `planets`, `population`, and `ships_built`. The aggregate
is retained from game start until capability evaluation at `game_finished`.
`player_turn_stats` carried on each `runtime_snapshot_update` event:
current and running-max of `planets` and `population`. The aggregate is
retained from game start until capability evaluation at `game_finished`.
This keeps user-facing list/read flows from fanning out requests to `Game Master`.
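The per-member aggregate described above (current value and running maximum of `planets` and `population`) can be sketched as follows. This is a minimal in-memory illustration, not the Lobby implementation: the type names `PlayerTurnStats`, `StatAggregate`, `MemberAggregate`, and `GameAggregate` are hypothetical, and persistence is left out.

```go
package main

import "fmt"

// PlayerTurnStats mirrors one player_turn_stats entry (Go field names are assumed).
type PlayerTurnStats struct {
	UserID     string
	Planets    int
	Population int
}

// StatAggregate keeps the latest and the highest observed value of one metric.
type StatAggregate struct {
	Current int
	Max     int
}

// observe records a new per-turn value and raises the running maximum if needed.
func (a *StatAggregate) observe(v int) {
	a.Current = v
	if v > a.Max {
		a.Max = v
	}
}

// MemberAggregate is the per-user aggregate for one game.
type MemberAggregate struct {
	Planets    StatAggregate
	Population StatAggregate
}

// GameAggregate maps user_id to that member's aggregate (hypothetical container).
type GameAggregate map[string]*MemberAggregate

// Apply folds the stats of one runtime_snapshot_update event into the aggregate.
func (g GameAggregate) Apply(stats []PlayerTurnStats) {
	for _, s := range stats {
		m, ok := g[s.UserID]
		if !ok {
			m = &MemberAggregate{}
			g[s.UserID] = m
		}
		m.Planets.observe(s.Planets)
		m.Population.observe(s.Population)
	}
}

func main() {
	agg := GameAggregate{}
	agg.Apply([]PlayerTurnStats{{UserID: "u1", Planets: 3, Population: 900}})
	agg.Apply([]PlayerTurnStats{{UserID: "u1", Planets: 2, Population: 1200}})
	fmt.Printf("%+v\n", *agg["u1"])
}
```

After the two events, `u1` has a current planet count of 2 and a running maximum of 3, which is the shape of data that capability evaluation at `game_finished` would read.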
@@ -544,7 +544,7 @@ background worker.
`RND.ReleaseAllByUser(user_id)` atomically with membership/application/invite
cancellations for the affected user.
## 8. Game Master
## 8. [Game Master](gamemaster/README.md)
`Game Master` owns runtime and operational metadata of already running games.
@@ -561,6 +561,40 @@ It owns:
* engine version registry and version-specific engine options;
* runtime mapping `platform user_id -> engine player UUID` for each running game.
### Topology
`Game Master` runs as a single process in v1. The in-process scheduler is
authoritative; multi-instance with leader election is an explicit future
iteration. Every other service that interacts with `Game Master`
(`Edge Gateway`, `Game Lobby`, `Admin Service`, `Runtime Manager`) treats
GM as a singleton on the trusted network segment.
### Engine container contract
`Game Master` is the only platform component that talks to the engine. The
engine container exposes two route classes plus a liveness probe:
* admin paths under `/api/v1/admin/*`: `init`, `status`, `turn`, and
`race/banish`. They are unauthenticated and reachable only inside the
trusted network segment that connects GM to the engine container;
* player paths under `/api/v1/{command, order, report}` — invoked by GM on
behalf of an authenticated platform user; the actor field on each call
is set by GM from the verified user identity, never from the inbound
payload;
* `GET /healthz` — liveness probe used by `Runtime Manager` and operator
tooling.
Two engine-side fields are part of the contract:
* `StateResponse.finished:bool` — when `true` on a turn-generation
response, GM transitions the runtime to `finished`, publishes
`game_finished`, and dispatches the finish notification. The conditional
logic that flips the flag lives in the engine's domain code and is not
GM's concern;
* `POST /api/v1/admin/race/banish` with body `{race_name}` — invoked by GM
in response to the Lobby-driven banish flow after a permanent
platform-level membership removal. The engine returns `204` on success.
### Game Master status model
Minimum runtime-level status set:
@@ -571,8 +605,12 @@ Minimum runtime-level status set:
* `generation_failed`
* `stopped`
* `engine_unreachable`
* `finished`
`running` here means `running_accepting_commands`.
`running` here means `running_accepting_commands`. `finished` is terminal:
the runtime record stays in this state indefinitely; no further turn
generation, command, or order is accepted, and operator cleanup is the
only path out.
### Game command routing
@@ -599,14 +637,25 @@ Private-game owner can use the subset allowed for the owner of that game.
### Turn cutoff and scheduling
`Game Master` is the owner of authoritative platform time for turn cutoff decisions.
`Game Master` is the owner of authoritative platform time for turn cutoff
decisions.
Commands arriving exactly on the boundary of a new turn are considered stale and must not reach the engine.
The cutoff is enforced by a single status compare-and-swap: every player
command, order, and report read requires `runtime_status=running` at the
moment of the call, and turn generation begins by CAS-ing
`running → generation_in_progress`. There is no separately tracked shadow
window or grace period — the status transition itself is the boundary.
Commands arriving after the CAS are rejected with `runtime_not_running`.
The scheduler is a subsystem inside `Game Master`.
It triggers turn generation according to the game schedule.
The scheduler is a subsystem inside `Game Master`. It triggers turn
generation according to the game schedule.
If a manual force next turn is executed, the next scheduled turn slot must be skipped so that players still get at least one full normal schedule interval before the following generated turn.
If a manual `force next turn` is executed, the next scheduled turn slot
must be skipped so that players still get at least one full normal
schedule interval before the following generated turn. The skip is
recorded as `runtime_records.skip_next_tick=true`; the scheduler advances
`next_generation_at` by one extra cron step the next time it computes the
tick and clears the flag.
### Runtime snapshot publishing
@@ -615,16 +664,27 @@ consumed by `Game Lobby`. Events include:
* `runtime_snapshot_update` — carries the current `current_turn`,
`runtime_status`, `engine_health_summary`, and a `player_turn_stats` array
with one entry per active member (`user_id`, `planets`, `population`,
`ships_built`). `Game Lobby` maintains a per-game per-user stats aggregate
from these events for capability evaluation at game finish.
with one entry per active member (`user_id`, `planets`, `population`).
`Game Lobby` maintains a per-game per-user stats aggregate from these
events for capability evaluation at game finish.
* `game_finished` — carries the final snapshot values and triggers the
platform status transition plus Race Name Directory capability evaluation
inside `Game Lobby`.
`Game Master` does not retain the aggregate; it only publishes the per-turn
observation. `Game Lobby` is responsible for holding initial values and
running maxima across the lifetime of the game.
Publication cadence is event-driven. GM publishes a snapshot when:
* a turn was generated (success or failure);
* `runtime_status` transitioned (e.g.,
`running ↔ generation_in_progress`, `running → engine_unreachable`,
`* → finished`);
* `engine_health_summary` changed in response to a `runtime:health_events`
observation; consecutive observations with identical summaries are
debounced.
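The debounce rule for `engine_health_summary` in the list above can be sketched as a publisher that remembers the last summary it published per game. The names `SnapshotPublisher`, `NewSnapshotPublisher`, and `OnHealthObservation`, as well as the summary values `healthy` and `degraded`, are hypothetical; the actual summary vocabulary is not given in this section.

```go
package main

import "fmt"

// SnapshotPublisher publishes a snapshot only when the health summary changes.
type SnapshotPublisher struct {
	lastHealth map[string]string // game_id -> last published engine_health_summary
	publish    func(gameID, summary string)
}

// NewSnapshotPublisher wires the publisher to a sink (for example a stream writer).
func NewSnapshotPublisher(publish func(gameID, summary string)) *SnapshotPublisher {
	return &SnapshotPublisher{lastHealth: map[string]string{}, publish: publish}
}

// OnHealthObservation handles one health observation and reports whether a
// snapshot was published. Consecutive identical summaries are dropped.
func (p *SnapshotPublisher) OnHealthObservation(gameID, summary string) bool {
	if last, ok := p.lastHealth[gameID]; ok && last == summary {
		return false
	}
	p.lastHealth[gameID] = summary
	p.publish(gameID, summary)
	return true
}

func main() {
	var published []string
	p := NewSnapshotPublisher(func(gameID, summary string) {
		published = append(published, gameID+":"+summary)
	})
	for _, s := range []string{"healthy", "healthy", "degraded", "degraded", "healthy"} {
		p.OnHealthObservation("game-1", s)
	}
	fmt.Println(published)
	// prints "[game-1:healthy game-1:degraded game-1:healthy]"
}
```

Five observations produce three published snapshots here, one per actual change of the summary.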
There is no periodic heartbeat. `Game Master` does not retain the
aggregate; it only publishes the per-turn observation. `Game Lobby` is
responsible for holding initial values and running maxima across the
lifetime of the game.
### Runtime/engine finish flow
@@ -847,13 +907,17 @@ requests for no operational benefit.
* `Gateway -> Admin Service`
* `Gateway -> User Service`
* `Gateway -> Game Lobby`
* `Gateway -> Game Master`
* `Gateway -> Game Master` for verified player command, order, and report
calls;
* `Auth / Session Service -> User Service`
* `Auth / Session Service -> Mail Service`
* `Geo Profile Service -> Auth / Session Service`
* `Geo Profile Service -> User Service`
* `Game Lobby -> User Service`
* `Game Lobby -> Game Master` for critical registration/update calls
* `Game Lobby -> Game Master` for `register-runtime` after a successful
container start, engine-version `image-ref` resolve, membership
invalidation hook, banish, and the liveness reply consumed by Lobby's
resume flow;
* `Game Master -> Runtime Manager` for inspect, restart, patch, stop, and cleanup REST calls
* `Admin Service -> Runtime Manager` for operational inspect, restart, patch, stop, and cleanup REST calls
@@ -864,11 +928,15 @@ requests for no operational benefit.
* `Lobby -> Runtime Manager` runtime jobs through `runtime:start_jobs` (`{game_id, image_ref, requested_at_ms}`) and `runtime:stop_jobs` (`{game_id, reason, requested_at_ms}`);
* `Runtime Manager -> Lobby` job outcomes through `runtime:job_results`;
* `Runtime Manager -> Notification Service` admin-only failure intents (image pull, container start, start config) through `notification:intents`;
* `Runtime Manager` outbound technical health stream `runtime:health_events` consumed by `Game Master`; `Game Lobby` and `Admin Service` are reserved as future consumers;
* `Runtime Manager` outbound technical health stream `runtime:health_events`
consumed by `Game Master`; `Game Lobby` and `Admin Service` are reserved
as future consumers;
* all event-bus propagation;
* `Game Master -> Game Lobby` runtime snapshot updates (including
`player_turn_stats` for capability aggregation) and game-finish events
through a dedicated Redis Stream consumed by `Game Lobby`;
through the `gm:lobby_events` Redis Stream consumed by `Game Lobby`,
published event-only with no periodic heartbeat (turn generation,
status transition, or debounced engine-health summary change);
* `User Service -> Game Lobby` user lifecycle events
(`user.lifecycle.permanent_blocked`, `user.lifecycle.deleted`) through the
`user:lifecycle_events` Redis Stream, consumed by `Game Lobby` to cascade
@@ -908,6 +976,10 @@ PostgreSQL is the source of truth for table-shaped business state:
registry (registered/reservation/pending tiers);
* runtime manager runtime records (`game_id -> current_container_id`),
per-operation audit log, and latest health snapshot per game;
* game master runtime records (`game_id -> engine_endpoint`,
status/turn/scheduling), the engine version registry (`engine_versions`),
per-game player mappings (`game_id, user_id -> race_name,
engine_player_uuid`), and the GM operation log;
* idempotency records, expressed as `UNIQUE` constraints on the durable
table — not as a separate kv;
* retry scheduling state, expressed as a `next_attempt_at` column on the
@@ -931,9 +1003,9 @@ Redis is the source of truth for ephemeral and runtime-coordination state:
### Database topology
* Single PostgreSQL database `galaxy`.
* Schema per service: `user`, `mail`, `notification`, `lobby`, `rtmanager`.
Reserved for future use: `geoprofile`. Not allocated unless needed:
`gateway`, `authsession`.
* Schema per service: `user`, `mail`, `notification`, `lobby`, `rtmanager`,
`gamemaster`. Reserved for future use: `geoprofile`. Not allocated unless
needed: `gateway`, `authsession`.
* Each service connects with its own PostgreSQL role whose grants are
restricted to its own schema (defense-in-depth).
* Authentication is username + password only. `sslmode=disable`. No client
@@ -1012,7 +1084,8 @@ crossing the SQL boundary carry `time.UTC` as their location.
### Configuration
For each service `<S>` ∈ { `USERSERVICE`, `MAIL`, `NOTIFICATION`,
`LOBBY`, `RTMANAGER`, `GATEWAY`, `AUTHSESSION` }, the Redis connection accepts:
`LOBBY`, `RTMANAGER`, `GAMEMASTER`, `GATEWAY`, `AUTHSESSION` }, the Redis
connection accepts:
* `<S>_REDIS_MASTER_ADDR` (required)
* `<S>_REDIS_REPLICA_ADDRS` (optional, comma-separated)
@@ -1020,7 +1093,7 @@ For each service `<S>` ∈ { `USERSERVICE`, `MAIL`, `NOTIFICATION`,
* `<S>_REDIS_DB`, `<S>_REDIS_OPERATION_TIMEOUT`
For PG-backed services (`USERSERVICE`, `MAIL`, `NOTIFICATION`, `LOBBY`,
`RTMANAGER`) the Postgres connection accepts:
`RTMANAGER`, `GAMEMASTER`) the Postgres connection accepts:
* `<S>_POSTGRES_PRIMARY_DSN` (required;
`postgres://<role>:<pwd>@<host>:5432/galaxy?search_path=<schema>&sslmode=disable`)
@@ -1384,7 +1457,17 @@ Rules:
* upgrade during a running game is allowed only as a patch update within the same major/minor line;
* game-engine version management is manual in v1;
* each engine version may carry version-specific engine options;
* `Game Master` owns the engine version registry and its internal API.
* `Game Master` owns the engine version registry from v1 — `(version,
image_ref, options, status)` rows live in the `gamemaster` schema and
are managed exclusively through GM's internal REST surface;
* `Game Lobby` resolves `image_ref` synchronously through GM at game start
by calling `GET /api/v1/internal/engine-versions/{version}/image-ref`;
`LOBBY_ENGINE_IMAGE_TEMPLATE` and any Lobby-side template-based
resolution are removed without a backward-compat shim. If GM is
unavailable when Lobby attempts the resolve, the start fails with
`service_unavailable` and `runtime:start_jobs` is never published;
* `Runtime Manager` continues to receive a verbatim `image_ref` from the
start envelope and never resolves engine versions itself.
## Administrative Access Model
@@ -1457,7 +1540,7 @@ Recommended order for implementation is:
6. **Game Lobby Service** (implemented)
Platform game records, membership, invites, applications, approvals, schedules, user-facing lists, pre-start lifecycle.
7. **Runtime Manager**
7. **Runtime Manager** (implemented)
Dedicated Docker-control service for container lifecycle (start, stop,
restart, semver-patch, cleanup) and inspect/health monitoring through
Docker events, periodic inspect, and active HTTP probes. Driven
@@ -1466,7 +1549,19 @@ Recommended order for implementation is:
`Admin Service` via the trusted internal REST surface.
8. **Game Master**
Running-game orchestration, engine version registry, runtime state, turn scheduler, engine API mediation, operational controls.
Single-instance running-game orchestrator. Owns the runtime state
(`game_id → engine_endpoint`, status, current turn, scheduling, engine
health), the engine version registry consumed synchronously by
`Game Lobby` for `image_ref` resolution, and the platform mapping
`(user_id, race_name, engine_player_uuid)` per running game. Drives
the turn scheduler with the force-next-turn skip rule, mediates every
engine HTTP call (admin paths under `/api/v1/admin/*`, player paths
under `/api/v1/{command, order, report}`), and reacts to
`StateResponse.finished` by transitioning the runtime to `finished` and
publishing `game_finished`. Drives `Runtime Manager` synchronously over
REST for stop, restart, and patch; consumes `runtime:health_events`
from RTM; publishes `gm:lobby_events` (event-only, no heartbeat) and
`notification:intents`. Never opens the Docker SDK.
9. **Admin Service**
Admin UI backend that orchestrates trusted APIs of other services.