Services Architecture
Galaxy: Turn-based Strategy Game
Purpose
This document defines the high-level architecture of the Galaxy Game platform as a single source of truth for implementing all core microservices.
It describes:
- public and trusted service boundaries;
- ownership of main business entities and state;
- request routing and transport rules;
- interaction rules between services;
- runtime model for game containers;
- notification and event propagation model;
- recommended implementation order.
Detailed behavior of each concrete service belongs in its own README. This document fixes the system-level structure and the architectural rules that must remain stable across service implementations.
Scope
Galaxy Game is a multiplayer turn-based online strategy game platform.
Core product properties:
- many game sessions may exist simultaneously;
- one user may participate in multiple games at once;
- users authenticate by e-mail confirmation code;
- users have platform roles and tariff/entitlement state;
- games may be public or private;
- public games are managed by system administrators;
- private games are created and managed by eligible paid users;
- each running game is executed inside its own dedicated game engine container;
- each running game is bound to one concrete engine version;
- in-place upgrade of a running game is allowed only as a patch update within the same semver major/minor line;
- player commands are turn-bound and are accepted only before the next scheduled turn generation cutoff.
The platform stores durable business state in PostgreSQL (one shared database, schema per service) and uses Redis with Redis Streams for ephemeral state, caches, and the internal event bus. The backend split, library stack, and staged migration plan live in PG_PLAN.md and the Persistence Backends section below.
Main Principles
- The platform exposes a single external entry point: Edge Gateway.
- Public unauthenticated flows use REST/JSON.
- Authenticated user edge traffic uses signed gRPC over HTTP/2 with protobuf control envelopes and FlatBuffers payload bytes.
- Trusted synchronous inter-service traffic uses REST/JSON unless a service-specific contract states otherwise.
- For the direct Gateway -> User Service self-service boundary, the gateway keeps the external authenticated gRPC + FlatBuffers contract and performs REST/JSON transcoding toward User Service internally.
- The gateway handles only edge concerns: parsing, authentication, integrity checks, anti-replay, rate limiting, routing, and push delivery. Business authorization and domain rules remain in downstream services.
- Auth / Session Service is the source of truth for `device_session`, but it is not on the hot path of every authenticated request. Gateway authenticates steady-state traffic from session cache and lifecycle updates.
- Game Lobby owns platform-level metadata of game sessions.
- Game Master owns runtime and operational state of running games.
- Runtime Manager is the only service allowed to access the Docker API directly.
- Notification Service is the platform-level delivery/orchestration layer for push and most non-auth email notifications.
- Mail Service sends email; auth-code mail is sent directly by Auth / Session Service, while all other platform mail is initiated through Notification Service.
- Geo Profile Service is auxiliary and fail-open relative to gameplay; it never blocks the currently processed request and may affect only later requests.
- If a user-facing request must complete with a deterministic result in the same flow, the critical internal chain must be synchronous. If the interaction is propagation, notification, cache update, runtime job completion, telemetry, or denormalized read-model update, it should be asynchronous.
Security and Transport Model
The former standalone security model is part of the main architecture and is no longer treated as a separate subsystem.
Public and authenticated transport classes
The gateway already distinguishes:
- public REST/JSON for unauthenticated traffic such as health checks and public auth;
- authenticated gRPC over HTTP/2 for verified commands and push delivery.
For downstream business services, the current default trusted transport is strict REST/JSON. Gateway may therefore authenticate and verify one external FlatBuffers command, then transcode it to one trusted downstream REST call.
When forwarding an authenticated command to a downstream service, Edge Gateway
enriches the REST call with the X-User-ID header carrying the verified platform
user identifier. Downstream services derive the acting user identity exclusively
from this header and must never accept identity claims from request body fields.
The public auth contract is:
- `send-email-code(email) -> challenge_id`
- `confirm-email-code(challenge_id, code, client_public_key, time_zone) -> device_session_id`
The authenticated request contract is based on:
- `device_session_id`
- `message_type`
- `timestamp_ms`
- `request_id`
- `payload_hash`
- Ed25519 client signature over canonical envelope fields.
Server responses and push events are signed by the gateway so clients can verify server-originated messages. Push streams are bound to authenticated user_id and device_session_id, and session revoke closes only streams bound to the revoked session.
Verification boundary
Before routing an authenticated request, gateway must:
- validate envelope presence and protocol version;
- resolve session from session cache;
- reject unknown or revoked sessions;
- verify `payload_hash`;
- verify client signature;
- verify freshness window;
- verify anti-replay by `device_session_id + request_id`;
- apply edge rate limits and basic policy checks;
- build an authenticated internal command context and only then route downstream.
Downstream services must never receive unauthenticated external traffic.
High-Level System Diagram
flowchart LR
Client["Game Client\n(native / browser)"]
AdminUI["Admin UI"]
Gateway["Edge Gateway\nPublic REST\nAuthenticated gRPC\nAdmin REST"]
Auth["Auth / Session Service"]
User["User Service"]
Lobby["Game Lobby Service"]
GM["Game Master"]
Runtime["Runtime Manager"]
Notify["Notification Service"]
Mail["Mail Service"]
Geo["Geo Profile Service"]
Billing["Billing Service\nfuture"]
Redis["Redis\nCache, Streams, Leases"]
Postgres["PostgreSQL\nDurable Business State"]
Telemetry["Telemetry"]
Client --> Gateway
AdminUI --> Gateway
Gateway --> Auth
Gateway --> User
Gateway --> Lobby
Gateway --> GM
Gateway --> Geo
Auth --> User
Auth --> Mail
Auth --> Redis
User --> Redis
Lobby --> User
Lobby --> GM
Lobby --> Runtime
Lobby --> Redis
User --> Lobby
GM --> Lobby
GM --> Runtime
GM --> Redis
Geo --> Auth
Geo --> User
Geo --> Redis
Notify --> Gateway
Notify --> Mail
Notify --> Redis
Runtime --> Redis
Mail --> Redis
User --> Postgres
Mail --> Postgres
Notify --> Postgres
Lobby --> Postgres
Billing --> User
Telemetry --- Gateway
Telemetry --- Auth
Telemetry --- User
Telemetry --- Lobby
Telemetry --- GM
Telemetry --- Runtime
Telemetry --- Notify
Telemetry --- Geo
The baseline gateway/auth/session/pub-sub model above is consistent with the existing architecture and service READMEs.
Service List and Responsibility Boundaries
1. Edge Gateway
Edge Gateway is the only public entry point for all external traffic. It already owns transport parsing, session-cache-based authentication, signature verification, freshness/replay checks, edge rate limiting, routing, and push delivery. It must remain free of domain-specific business logic.
External surfaces:
- public REST:
  - health and readiness;
  - public auth commands;
  - browser/bootstrap and public route classes where needed.
- authenticated gRPC:
  - generic `ExecuteCommand`;
  - authenticated `SubscribeEvents`.
- admin REST:
  - separate public administrative surface for system administrators;
  - routed only for authenticated users with admin role.
The gateway does not directly access game engine containers.
For running games it routes to Game Master.
For pre-game platform flows it routes to Game Lobby.
For user-profile requests it routes to User Service.
For public auth it routes to Auth / Session Service.
2. Auth / Session Service
Auth / Session Service owns:
- challenge lifecycle;
- e-mail-code authentication;
- creation of `device_session`;
- registration of the client Ed25519 public key;
- revoke/logout/block state;
- trusted internal read/revoke/block API;
- projection of session lifecycle state into gateway-consumable Redis data.
It is the source of truth for:
- authentication challenges;
- `device_session`;
- revoke/block state.
Important architectural rules:
- public auth stays synchronous;
- `confirm-email-code` returns a ready `device_session_id`;
- no async “pending session provisioning” step exists;
- session source of truth and gateway-facing projection remain separate;
- active-session limits are configuration-driven;
- `send-email-code` stays success-shaped for existing, new, blocked, and throttled email flows.
When confirm-email-code reaches first successful completion for an e-mail
address that does not yet belong to a user, auth may pass create-only
registration context to User Service during the synchronous ensure/create
step.
Direct integrations:
- synchronous to User Service for user resolution/create/block decision;
- synchronous to Mail Service for auth-code delivery;
- asynchronous session lifecycle projection into Redis for gateway consumption.
3. User Service
User Service owns regular-user identity and profile as platform-level
business data.
It is the source of truth for:
- `user_id` of regular platform users;
- `user_name` — immutable auto-generated unique platform handle in `player-<suffix>` form; never used as a foreign key in other models;
- `display_name` — mutable free-text user-editable label validated through `pkg/util/string.go:ValidateTypeName`; not required to be unique; default empty for new accounts;
- editable user settings (`preferred_language`, `time_zone`);
- current tariff/entitlement state including `max_registered_race_names`;
- user-specific limits and platform sanctions (including `permanent_block` and `max_registered_race_names` override limits);
- latest effective `declared_country`;
- soft-delete state via `DeleteUser`.
User Service does not own in-game race_name values; those live in
Game Lobby Race Name Directory.
System-administrator identity remains outside this service and belongs to the
later Admin Service. Trusted administrative reads and mutations against
regular-user state do not make User Service the owner of administrator
identity.
It is directly reachable through gateway for selected user-facing operations such as:
- reading and editing allowed profile fields;
- viewing tariff and entitlement state;
- viewing user settings;
- viewing current restrictions and sanctions.
Not every profile mutation goes directly here. For example:
- email change must use a code-confirm flow;
- `declared_country` change remains under the admin approval flow via Geo Profile Service.
Architectural rules fixed for this service:
- User Service owns regular-user identity only; system-admin identity is out of scope.
- User Service stores only the current effective `declared_country`; review workflow and history belong to Geo Profile Service.
- User Service does not own in-game `race_name` values. All in-game name state (registered, reserved, pending registration) lives in the Game Lobby Race Name Directory. The only identity strings owned by User Service are `user_name` (immutable) and `display_name` (mutable, non-unique).
- `permanent_block` is a dedicated sanction code that collapses every `can_*` eligibility marker to false and triggers RND cascade release via the `user:lifecycle_events` stream.
- `DeleteUser` is a trusted internal endpoint that soft-deletes the account, rejects all subsequent operations with `subject_not_found`, and triggers the same RND cascade release.
- During the current auth-registration rollout, Auth / Session Service passes a preferred-language candidate derived from public `Accept-Language`, falling back to `en` when no supported value is available, plus the confirmed `time_zone`, into User Service.
Future billing does not become a direct dependency of other services. Billing Service will feed entitlement/payment outcomes into User Service, and the rest of the platform will continue to use User Service as the source of truth for current entitlements.
4. Mail Service
Mail Service is the internal email delivery service.
Split of responsibility:
- auth code emails: `Auth / Session Service -> Mail Service` directly;
- all other user/admin notification emails: `Notification Service -> Mail Service`.
Transport rules:
- `Auth / Session Service -> Mail Service` uses the dedicated synchronous trusted internal REST contract `POST /api/v1/internal/login-code-deliveries`;
- `Notification Service -> Mail Service` is an asynchronous internal command flow carried through a dedicated queue-backed handoff after durable route acceptance inside Notification Service.
This split is covered by integration tests: auth-code delivery bypasses
Notification Service, while notification-generated mail uses template-mode
commands whose template_id equals notification_type.
Mail Service may internally queue both flows.
Its trusted operator read and resend APIs are part of the v1 service surface,
not a later add-on.
For auth callers, a successful result means the request was durably accepted
into the mail-delivery pipeline or intentionally suppressed; it does not
require that the external SMTP exchange already completed before the response
is returned.
Stable service-local delivery rules, retry semantics, and storage details
(PostgreSQL for the durable delivery record, attempt history, dead letters,
and audit; Redis for the inbound mail:delivery_commands stream and its
consumer offset) belong in mail/README.md, not in the
root architecture document.
5. Geo Profile Service
Geo Profile Service is an internal trusted auxiliary service for country-level connection signals of authenticated users.
It integrates with:
- gateway as asynchronous ingest producer;
- User Service for the current effective `declared_country`;
- Auth / Session Service for suspicious session blocking;
- Notification Service for optional admin notifications.
It owns:
- observed country facts;
- per-session country aggregation;
- `usual_connection_country`;
- `country_review_recommended`;
- history of `declared_country` changes.
It does not block the request that triggered suspicion.
It can only request blocking of suspicious sessions, which takes effect for subsequent requests.
It does not call Mail Service directly; optional admin mail must flow
through Notification Service.
In this document, references to Edge Service in older geo documentation should be understood as Edge Gateway.
6. Admin Service
Admin Service is the external backend/orchestration layer for the administrative UI.
It is not a heavy domain owner. Its job is to:
- expose administrator-facing workflows;
- call trusted internal APIs of other services;
- aggregate administrative views where needed;
- enforce system-admin role checks at the gateway/admin boundary.
System administrators can view and operate on all games, including private ones.
7. Game Lobby Service
Game Lobby owns platform-level metadata and lifecycle of game sessions as platform entities.
It is the source of truth for:
- game records before and after runtime existence;
- public/private game type;
- owner of a private game;
- user-bound invitations and invite lifecycle;
- applications and approvals;
- membership and roster;
- blocked/removed participants at platform level;
- turn schedule configuration;
- target engine version for launch;
- user-facing lists of games;
- denormalized runtime snapshot imported from
Game Master.
Game Lobby is the source of truth for:
- party membership;
- invited / pending / active / finished / removed status of players relative to games;
- user-visible lists such as
active / finished / pending / invited games.
It also stores a denormalized runtime snapshot for convenience, at least:
- `current_turn`;
- `runtime_status`;
- `engine_health_summary`.
Additionally, Game Lobby aggregates per-member game statistics from
player_turn_stats carried on each runtime_snapshot_update event:
current and running-max of planets and population. The aggregate is
retained from game start until capability evaluation at game_finished.
This keeps user-facing list/read flows from fanning out requests into Game Master.
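The per-member aggregation Lobby keeps from these events can be sketched in Go. The field names mirror the `player_turn_stats` entries described above; struct and function names are illustrative:

```go
package main

import "fmt"

// TurnStat is one player_turn_stats entry from a runtime_snapshot_update event.
type TurnStat struct {
	UserID     string
	Planets    int
	Population int
}

// MemberAgg is the per-user aggregate Game Lobby retains from game start
// until capability evaluation at game_finished.
type MemberAgg struct {
	InitialPlanets, InitialPopulation int
	MaxPlanets, MaxPopulation         int
	CurrentPlanets, CurrentPopulation int
}

// Apply folds one snapshot's stats into the per-game aggregate map.
// The first observation for a user fixes that user's initial values.
func Apply(agg map[string]*MemberAgg, stats []TurnStat) {
	for _, s := range stats {
		a, ok := agg[s.UserID]
		if !ok {
			a = &MemberAgg{
				InitialPlanets: s.Planets, InitialPopulation: s.Population,
				MaxPlanets: s.Planets, MaxPopulation: s.Population,
			}
			agg[s.UserID] = a
		}
		a.CurrentPlanets, a.CurrentPopulation = s.Planets, s.Population
		if s.Planets > a.MaxPlanets {
			a.MaxPlanets = s.Planets
		}
		if s.Population > a.MaxPopulation {
			a.MaxPopulation = s.Population
		}
	}
}

func main() {
	agg := map[string]*MemberAgg{}
	Apply(agg, []TurnStat{{UserID: "u1", Planets: 1, Population: 100}})
	Apply(agg, []TurnStat{{UserID: "u1", Planets: 3, Population: 250}})
	Apply(agg, []TurnStat{{UserID: "u1", Planets: 2, Population: 240}})
	fmt.Printf("%+v\n", *agg["u1"])
}
```

Because GM publishes only per-turn observations, the running maxima live entirely on the Lobby side, exactly as the snapshot-publishing section below states.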
Lobby status model
Minimum platform-level status set:
- `draft`
- `enrollment_open`
- `ready_to_start`
- `starting`
- `start_failed`
- `running`
- `paused`
- `finished`
- `cancelled`
Lobby.paused is a business/platform pause, distinct from engine/runtime failure states.
start_failed indicates that the runtime container could not be started or that
metadata persistence failed after a successful container start.
From start_failed an admin or owner may retry (→ ready_to_start) or cancel (→ cancelled).
Enrollment rules
Each game stores the following enrollment configuration fields, set at creation:
- `min_players` — minimum approved participants required before the game may start.
- `max_players` — target roster size that activates the gap admission window.
- `start_gap_hours` — hours to keep enrollment open after `max_players` is reached.
- `start_gap_players` — additional players admitted during the gap window.
- `enrollment_ends_at` — UTC Unix timestamp at which enrollment closes automatically.
Transition from enrollment_open to ready_to_start occurs via one of three paths:
- Manual: an admin (public game) or owner (private game) issues a close-enrollment command when `approved_count >= min_players`.
- Deadline: `enrollment_ends_at` is reached and `approved_count >= min_players`.
- Gap exhaustion: `approved_count >= max_players` activates a gap window of `start_gap_hours` during which up to `start_gap_players` additional participants may join; the transition fires when the gap window expires or `approved_count >= max_players + start_gap_players`.
All pending invites transition to expired when the game moves to ready_to_start.
Membership rules
- User Service owns users of the platform as identities.
- Game Lobby owns membership in concrete games.
- The game engine does not own platform membership; Game Master may cache membership for runtime authorization, but Game Lobby remains the source of truth.
Public vs private game rules
Public games:
- created and controlled by system administrators;
- visible in public list;
- joining is based on application and manual admin approval in v1.
Private games:
- can be created only by eligible paid users;
- visible only to their owner and to invited users whose invitation is bound to a concrete `user_id` and later accepted;
- joining uses a user-bound invite; accepting the invite immediately creates active membership without a separate owner-approval step;
- invite lifecycle belongs entirely to Game Lobby.
Private-party owners get a limited owner-admin capability set, not full system admin power.
Race Name Directory
Race Name Directory (RND) is the platform source of truth for in-game player
names (race_name). It is owned by Game Lobby in v1 and is scheduled to move
to a dedicated Race Name Service later without changing the domain or
service-layer logic.
RND owns three levels of state per name:
- registered — platform-unique permanent names owned by one regular user. A registered name cannot be transferred, released, or renamed; the only path back to availability is `permanent_block` or `DeleteUser` on the owning account. The number of registered names a user can hold is bounded by the current tariff (`max_registered_race_names` in the User Service eligibility snapshot): `free`=1, `paid_monthly`=2, `paid_yearly`=6, `paid_lifetime`=unlimited. Tariff downgrade never revokes existing registrations; it only constrains new ones.
- reservation — per-game binding created when a participant joins a game through application approval or invite redeem. The reservation key is `(game_id, canonical_key)`. One user may hold the same name simultaneously across multiple active games. A reservation survives until the game finishes, then either becomes a `pending_registration` (see below) or is released.
- pending_registration — a reservation that survived a capable finish and is now waiting up to 30 days for the owner to upgrade it into a registered name via `lobby.race_name.register`. Expiration releases the binding.
Canonical key — RND uses a canonical key (lowercase + frozen
confusable-pair policy) to enforce uniqueness. A name is considered taken for
another user when any registered, active reservation, or
pending_registration with a different user_id exists under the same
canonical key. The confusable-pair policy lives in Lobby
(lobby/internal/domain/racename/policy.go).
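A minimal Go sketch of canonical-key folding. The pair set below is illustrative only; the frozen policy lives in lobby/internal/domain/racename/policy.go and may fold a different character set:

```go
package main

import (
	"fmt"
	"strings"
)

// confusablePairs is an illustrative stand-in for the frozen
// confusable-pair policy; the real pair set may differ.
var confusablePairs = map[rune]rune{
	'0': 'o',
	'1': 'l',
}

// CanonicalKey lowercases the name and folds confusable characters so
// that visually colliding names map to a single uniqueness key.
func CanonicalKey(name string) string {
	folded := []rune(strings.ToLower(name))
	for i, r := range folded {
		if repl, ok := confusablePairs[r]; ok {
			folded[i] = repl
		}
	}
	return string(folded)
}

func main() {
	fmt.Println(CanonicalKey("Zerg") == CanonicalKey("ZERG")) // true
	fmt.Println(CanonicalKey("B0b") == CanonicalKey("bob"))   // true
}
```

Uniqueness checks then compare canonical keys, not raw names: a name is taken when any registered, active reservation, or pending_registration entry with a different `user_id` shares the key.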
Capability gating — at game_finished Game Lobby evaluates per-member
capability: capable = max_planets > initial_planets AND max_population > initial_population, computed from the player_turn_stats stream published by
Game Master. Capable reservations transition to pending_registration with
eligible_until = finished_at + 30 days; non-capable reservations are
released immediately.
Registration — a user initiates registration via lobby.race_name.register
inside the 30-day window. Registration succeeds only when the user is still
eligible (no permanent_block, tariff slot available) and the pending entry
is still within its window. Expired pending entries are released by a
background worker.
Cascade release — User Service publishes
user.lifecycle.permanent_blocked and user.lifecycle.deleted events to
user:lifecycle_events. Game Lobby consumes this stream and calls
RND.ReleaseAllByUser(user_id) atomically with membership/application/invite
cancellations for the affected user.
8. Game Master
Game Master owns runtime and operational metadata of already running games.
It is the only trusted service allowed to communicate with game engine containers.
It owns:
- runtime mapping of running game to container endpoint/binding;
- current turn number;
- runtime status;
- generation status;
- engine health;
- patch state;
- engine version registry and version-specific engine options;
- runtime mapping `platform user_id -> engine player UUID` for each running game.
Topology
Game Master runs as a single process in v1. The in-process scheduler is
authoritative; multi-instance with leader election is an explicit future
iteration. Every other service that interacts with Game Master
(Edge Gateway, Game Lobby, Admin Service, Runtime Manager) treats
GM as a singleton on the trusted network segment.
Engine container contract
Game Master is the only platform component that talks to the engine. The engine container exposes two route classes plus a liveness probe:
- admin paths under `/api/v1/admin/*` — `init`, `status`, `turn`, and `race/banish`. They are unauthenticated and reachable only inside the trusted network segment that connects GM to the engine container;
- player paths under `/api/v1/{command, order, report}` — invoked by GM on behalf of an authenticated platform user; the actor field on each call is set by GM from the verified user identity, never from the inbound payload;
- `GET /healthz` — liveness probe used by Runtime Manager and operator tooling.
Two engine-side fields are part of the contract:
- `StateResponse.finished: bool` — when `true` on a turn-generation response, GM transitions the runtime to `finished`, publishes `game_finished`, and dispatches the finish notification. The conditional logic that flips the flag lives in the engine's domain code and is not GM's concern;
- `POST /api/v1/admin/race/banish` with body `{race_name}` — invoked by GM in response to the Lobby-driven banish flow after a permanent platform-level membership removal. The engine returns `204` on success.
Game Master status model
Minimum runtime-level status set:
- `starting`
- `running`
- `generation_in_progress`
- `generation_failed`
- `stopped`
- `engine_unreachable`
- `finished`
running here means running_accepting_commands. finished is terminal:
the runtime record stays in this state indefinitely; no further turn
generation, command, or order is accepted, and operator cleanup is the
only path out.
Game command routing
All game-related message_type include game_id.
Gateway enriches them with authenticated user_id and routes them to Game Master.
Game Master checks whether this user may access this running game, using membership data sourced from Game Lobby, then routes the command to the correct engine container using Game Engine's API.
The gateway never routes directly to game engine containers.
Runtime admin operations
For already running games, Game Master handles:
- stop game
- force next turn
- patch engine
- player deactivation/removal inside engine when required
- regular collection of game runtime metrics
System admin can use all of them. Private-game owner can use the subset allowed for the owner of that game.
Turn cutoff and scheduling
Game Master is the owner of authoritative platform time for turn cutoff
decisions.
The cutoff is enforced by a single status compare-and-swap: every player
command, order, and report read requires runtime_status=running at the
moment of the call, and turn generation begins by CAS-ing
running → generation_in_progress. There is no separately tracked shadow
window or grace period — the status transition itself is the boundary.
Commands arriving after the CAS are rejected with runtime_not_running.
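The status CAS can be sketched in a few lines of Go. Persistence and the full status set are elided, and the method names are illustrative; only the compare-and-swap discipline is the point:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Runtime guards the status word that doubles as the turn cutoff.
type Runtime struct {
	mu     sync.Mutex
	status string
}

// AcceptCommand admits a player command only while status is running.
func (r *Runtime) AcceptCommand() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status != "running" {
		return errors.New("runtime_not_running")
	}
	return nil
}

// BeginGeneration CAS-es running -> generation_in_progress; the transition
// itself is the cutoff boundary, with no shadow window or grace period.
func (r *Runtime) BeginGeneration() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status != "running" {
		return false
	}
	r.status = "generation_in_progress"
	return true
}

func main() {
	rt := &Runtime{status: "running"}
	fmt.Println(rt.AcceptCommand())   // <nil>: before the cutoff
	fmt.Println(rt.BeginGeneration()) // true: scheduler wins the CAS
	fmt.Println(rt.AcceptCommand())   // runtime_not_running: after the cutoff
	fmt.Println(rt.BeginGeneration()) // false: second CAS loses
}
```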
The scheduler is a subsystem inside Game Master. It triggers turn
generation according to the game schedule.
If a manual force next turn is executed, the next scheduled turn slot
must be skipped so that players still get at least one full normal
schedule interval before the following generated turn. The skip is
recorded as runtime_records.skip_next_tick=true; the scheduler advances
next_generation_at by one extra cron step the next time it computes the
tick and clears the flag.
Runtime snapshot publishing
Game Master publishes runtime updates to the gm:lobby_events Redis Stream
consumed by Game Lobby. Events include:
- `runtime_snapshot_update` — carries the current `current_turn`, `runtime_status`, `engine_health_summary`, and a `player_turn_stats` array with one entry per active member (`user_id`, `planets`, `population`). Game Lobby maintains a per-game per-user stats aggregate from these events for capability evaluation at game finish.
- `game_finished` — carries the final snapshot values and triggers the platform status transition plus Race Name Directory capability evaluation inside Game Lobby.
Publication cadence is event-driven. GM publishes a snapshot when:
- a turn was generated (success or failure);
- `runtime_status` transitioned (e.g., `running ↔ generation_in_progress`, `running → engine_unreachable`, `* → finished`);
- `engine_health_summary` changed in response to a `runtime:health_events` observation; consecutive observations with identical summaries are debounced.
There is no periodic heartbeat. Game Master does not retain the
aggregate; it only publishes the per-turn observation. Game Lobby is
responsible for holding initial values and running maxima across the
lifetime of the game.
Runtime/engine finish flow
When the engine determines that a game is finished:
- engine reports finish to Game Master;
- Game Master updates runtime state;
- Game Master notifies Game Lobby;
- Game Lobby updates the platform-level game record to `finished`.
Player removal after start
After a game has started, two different actions exist:
- temporary removal/block at platform level:
  - the player cannot send commands through gateway/platform;
  - the engine still keeps the player slot;
- final removal or account-level block:
  - Game Master must additionally send an admin command to the engine to deactivate/remove the player inside the game.
This distinction is architectural and must remain explicit.
9. Runtime Manager
Runtime Manager is the only internal service allowed to access Docker API directly.
It owns:
- starting game engine containers;
- stopping containers;
- restarting containers where allowed;
- patching/replacing containers (semver patch only) where allowed;
- technical runtime inspection/status;
- monitoring containers via Docker events, periodic inspect, and active HTTP probe;
- publishing technical runtime events (`runtime:job_results`, `runtime:health_events`);
- publishing admin-only notification intents for first-touch start failures.
It does not own platform metadata of games.
It does not own runtime business state of games.
It does not resolve engine versions; the producer (Game Lobby in v1, Game Master later) supplies image_ref.
It executes runtime jobs for Game Lobby and Game Master.
Container model
- one game = one container;
- one container = one game.
This is a hard invariant.
Each container is created with hostname galaxy-game-{game_id} and attached to the
single user-defined Docker bridge network configured by RTMANAGER_DOCKER_NETWORK.
The network is provisioned outside Runtime Manager (compose, Terraform, or operator
runbook); a missing network is a fail-fast condition at startup. The published
engine_endpoint is the stable URL http://galaxy-game-{game_id}:8080; restart and
patch keep the same DNS name even though current_container_id changes.
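The stable naming rule can be stated as a tiny Go helper. The `galaxy-game-{game_id}` hostname and port 8080 come from the contract above; the function name is illustrative:

```go
package main

import "fmt"

// engineEndpoint derives the stable DNS name and engine_endpoint URL for a
// game container. Restart and patch keep both values while the underlying
// current_container_id changes.
func engineEndpoint(gameID string) (hostname, url string) {
	hostname = fmt.Sprintf("galaxy-game-%s", gameID)
	url = fmt.Sprintf("http://%s:8080", hostname)
	return hostname, url
}

func main() {
	h, u := engineEndpoint("7f3a")
	fmt.Println(h) // galaxy-game-7f3a
	fmt.Println(u) // http://galaxy-game-7f3a:8080
}
```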
Image policy
Runtime Manager never resolves engine versions. The producer (Game Lobby in v1,
Game Master once implemented) computes image_ref from its own template and
hands it to Runtime Manager on the start envelope. Runtime Manager accepts the
reference verbatim, applies the configured pull policy
(RTMANAGER_IMAGE_PULL_POLICY), and reads container resource limits from labels
on the resolved image.
The producer-supplied image_ref rule decouples Runtime Manager from any
engine-version arbitration logic, lets the v1 launch ship without Game Master's
engine-version registry, and cleanly separates "which image to run" (Lobby/GM
concern) from "how to run it" (RTM concern). Two alternatives were rejected:
RTM holding its own image map (would need to consume upstream tariff or
compatibility signals that belong in the producers) and RTM resolving the
image at start time by querying GM (would create a circular dependency for
v1 and add a synchronous hop on the hot path).
Patch is restart with a new image_ref and is allowed only as a semver patch
within the same major/minor line; cross-major or cross-minor patch attempts fail
with semver_patch_only. Producers that need to change the major/minor line must
stop the game and start a new container.
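A Go sketch of the `semver_patch_only` check. It assumes tags shaped like "v1.4.2" or "1.4.2"; real `image_ref` parsing may be stricter, and the function name is illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// patchAllowed permits a patch only when the new tag shares the
// major/minor line with the currently running one.
func patchAllowed(currentTag, newTag string) error {
	cur := strings.SplitN(strings.TrimPrefix(currentTag, "v"), ".", 3)
	nxt := strings.SplitN(strings.TrimPrefix(newTag, "v"), ".", 3)
	if len(cur) < 3 || len(nxt) < 3 {
		return errors.New("malformed semver tag")
	}
	if cur[0] != nxt[0] || cur[1] != nxt[1] {
		return errors.New("semver_patch_only")
	}
	return nil
}

func main() {
	fmt.Println(patchAllowed("v1.4.2", "v1.4.3")) // <nil>: same major/minor line
	fmt.Println(patchAllowed("v1.4.2", "v1.5.0")) // semver_patch_only
	fmt.Println(patchAllowed("v1.4.2", "v2.0.0")) // semver_patch_only
}
```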
State ownership
Engine state lives on the host filesystem under the per-game directory
<RTMANAGER_GAME_STATE_ROOT>/{game_id} and is bind-mounted into the container at
RTMANAGER_ENGINE_STATE_MOUNT_PATH. The mount path is exposed to the engine through
GAME_STATE_PATH and, for backward compatibility, also as STORAGE_PATH. Both
names are accepted by galaxy/game in v1.
Runtime Manager never deletes the host state directory. Removing a container
through the cleanup endpoint or the retention TTL leaves the directory intact.
Backup, archival, and operator cleanup of state directories belong to operator
tooling or a future Admin Service workflow.
Reconcile policy
Runtime Manager reconciles its runtime_records with Docker reality at startup
(blocking, before workers start) and on a periodic interval
(RTMANAGER_RECONCILE_INTERVAL). Two rules apply unconditionally:
- unrecorded containers labelled `com.galaxy.owner=rtmanager` are adopted into `runtime_records` as `running`, never killed; operators may have launched one manually for diagnostics;
- recorded `running` rows whose container is missing in Docker are marked `removed`, with a `container_disappeared` event emitted on `runtime:health_events`.
10. Notification Service
Notification Service is the async delivery/orchestration layer for platform notifications.
It has a deliberately minimal role:
- consume normalized notification intents from services through the dedicated Redis Stream `notification:intents`;
- validate idempotency and persist durable notification route state;
- enrich user-targeted routes with `email` and `preferred_language` from User Service;
- decide whether a given notification type results in `push`, `email`, or both;
- send user-targeted `push` events toward gateway by `user_id`;
- send non-auth email asynchronous commands toward Mail Service.
It is not a source of truth for user preferences in v1 unless a later feature requires it.
For user-targeted intents, upstream producers publish the concrete recipient
`user_id` values. Notification Service resolves user email and locale from
User Service, uses configured administrator email lists per
`notification_type` for admin-only notifications, keeps
`template_id == notification_type` for notification-generated email, and
treats private-game invite flows in v1 as user-bound by internal `user_id`.
Go producers use the shared `galaxy/notificationintent` module to build and
append compatible intents into `notification:intents`; a failed append is a
notification degradation signal and must not roll back already committed source
business state.
Acceptance of a user-targeted notification intent is complete only after every
published recipient user_id resolves through User Service; unresolved user
ids are treated as producer input defects and are recorded as malformed
notification intents rather than deferred publication failures.
User-facing notifications use push+email unless a type explicitly opts out of
one channel. Administrator-facing notifications are email-only in v1.
All platform notifications except auth-code delivery flow through this service, including:
- game lifecycle notifications;
- invite/application updates;
- new turn notifications;
- operational/admin notifications where appropriate.
The current process surface exposes only one private probe HTTP listener with
`GET /healthz` and `GET /readyz`; that probe surface is documented in
`notification/openapi.yaml`. The canonical notification-intent stream contract
remains `notification/api/intents-asyncapi.yaml`.
It does not expose an operator REST API.
11. Billing Service (future)
Billing Service is not part of the first implementation wave.
When introduced, it will:
- process payment/billing events;
- calculate or validate payment outcomes;
- feed resulting entitlement changes into User Service.
User Service remains the source of truth for current entitlement used by the rest of the platform.
Billing-driven tariff changes alter only the headroom for new registered
race names: tariff downgrade never revokes already registered names. The
affected ceiling is materialized as `max_registered_race_names` in the
eligibility snapshot consumed by Game Lobby.
Data Ownership Summary
```mermaid
flowchart TD
    U["User Service"]
    A["Auth / Session Service"]
    L["Game Lobby"]
    G["Game Master"]
    R["Runtime Manager"]
    P["Geo Profile Service"]
    N["Notification Service"]
    M["Mail Service"]
    U -->|"regular users, user_name/display_name, settings, tariffs, limits, sanctions, declared_country, soft-delete"| X1["Platform user identity"]
    A -->|"challenges, device sessions, revoke/block state"| X2["Auth/session state"]
    L -->|"game metadata, invites, applications, membership, roster, race names (registered/reservations/pending)"| X3["Platform game records"]
    G -->|"runtime state, current turn, engine health, engine mapping, engine version registry"| X4["Running-game state"]
    R -->|"container execution and technical runtime control"| X5["Container runtime"]
    P -->|"observed country, usual_connection_country, review state, declared_country history"| X6["Geo state"]
    N -->|"notification routing only"| X7["Notification orchestration"]
    M -->|"email delivery only"| X8["Email transport"]
```
Internal Transport Semantics
The platform uses one simple rule:
- if the user-facing request must complete with a deterministic result in the same flow, the critical internal chain is synchronous;
- if the interaction is propagation, notification, cache invalidation, runtime job completion, telemetry, or denormalized read-model update, it is asynchronous.
The Lobby ↔ Runtime Manager transport is the canonical asynchronous case:
Lobby drives RTM exclusively through Redis Streams (`runtime:start_jobs`,
`runtime:stop_jobs`, `runtime:job_results`); there is no synchronous
Lobby→RTM REST call in v1, and no plan to add one. Synchronous coupling
would force Lobby to block on Docker pull/start latency, which is
unbounded in the worst case. Game Master and Admin Service, by contrast,
drive RTM synchronously over REST because they operate on already-running
containers and need deterministic per-request outcomes (for example,
"restart this game's container now"); routing those operations through
streams would force operators to correlate async results back to admin
requests for no operational benefit.
Fixed synchronous interactions
- `Gateway -> Auth / Session Service`
- `Gateway -> Admin Service`
- `Gateway -> User Service`
- `Gateway -> Game Lobby`
- `Gateway -> Game Master` for verified player command, order, and report calls;
- `Auth / Session Service -> User Service`
- `Auth / Session Service -> Mail Service`
- `Geo Profile Service -> Auth / Session Service`
- `Geo Profile Service -> User Service`
- `Game Lobby -> User Service`
- `Game Lobby -> Game Master` for `register-runtime` after a successful container start, engine-version `image_ref` resolve, membership invalidation hook, banish, and the liveness reply consumed by Lobby's resume flow;
- `Game Master -> Runtime Manager` for inspect, restart, patch, stop, and cleanup REST calls;
- `Admin Service -> Runtime Manager` for operational inspect, restart, patch, stop, and cleanup REST calls.
Fixed asynchronous interactions
- session lifecycle projection toward gateway cache;
- revoke propagation;
- `Lobby -> Runtime Manager` runtime jobs through `runtime:start_jobs` (`{game_id, image_ref, requested_at_ms}`) and `runtime:stop_jobs` (`{game_id, reason, requested_at_ms}`);
- `Runtime Manager -> Lobby` job outcomes through `runtime:job_results`;
- `Runtime Manager -> Notification Service` admin-only failure intents (image pull, container start, start config) through `notification:intents`;
- `Runtime Manager` outbound technical health stream `runtime:health_events` consumed by `Game Master`; `Game Lobby` and `Admin Service` are reserved as future consumers;
- all event-bus propagation;
- `Game Master -> Game Lobby` runtime snapshot updates (including `player_turn_stats` for capability aggregation) and game-finish events through the `gm:lobby_events` Redis Stream consumed by `Game Lobby`, published event-only with no periodic heartbeat (turn generation, status transition, or debounced engine-health summary change);
- `User Service -> Game Lobby` user lifecycle events (`user.lifecycle.permanent_blocked`, `user.lifecycle.deleted`) through the `user:lifecycle_events` Redis Stream, consumed by `Game Lobby` to cascade RND release and membership/application/invite cancellation;
- `Game Master -> Notification Service` notification intents through `notification:intents`;
- `Game Lobby -> Notification Service` notification intents through `notification:intents`;
- `Geo Profile Service -> Notification Service` notification intents through `notification:intents`;
- `Notification Service -> Gateway`;
- `Notification Service -> Mail Service`;
- geo auxiliary ingest from gateway to geo service;
- runtime health events from `Runtime Manager`.
Mixed interactions
Some service pairs may use both styles for different flows.
The main example is `Lobby -> Game Master`:
- synchronous for critical registration/update after successful start;
- asynchronous for secondary propagation and denormalized status fan-out.
Persistence Backends
The platform splits durable state across two backends.
PostgreSQL is the source of truth for table-shaped business state:
- user identity, profile settings, tariffs/entitlements, sanctions, limits, and the blocked-email registry;
- mail deliveries, attempt history, dead letters, payloads, and malformed-command audit;
- notification records, route materialisations, dead letters, and malformed-intent audit;
- lobby games, applications, invites, memberships, and the race-name registry (registered/reservation/pending tiers);
- runtime manager runtime records (`game_id -> current_container_id`), per-operation audit log, and latest health snapshot per game;
- game master runtime records (`game_id -> engine_endpoint`, status/turn/scheduling), the engine version registry (`engine_versions`), per-game player mappings (`game_id, user_id -> race_name, engine_player_uuid`), and the GM operation log;
- idempotency records, expressed as `UNIQUE` constraints on the durable table, not as a separate kv store;
- retry scheduling state, expressed as a `next_attempt_at` column on the durable table and worked off via `SELECT ... FOR UPDATE SKIP LOCKED`.
Redis is the source of truth for ephemeral and runtime-coordination state:
- the platform event bus implemented as Redis Streams (`user:domain_events`, `user:lifecycle_events`, `gm:lobby_events`, `runtime:start_jobs`, `runtime:stop_jobs`, `runtime:job_results`, `runtime:health_events`, `notification:intents`, `gateway:client-events`, `mail:delivery_commands`);
- stream consumer offsets;
- gateway session cache, replay reservations, rate-limit counters, and short-lived runtime locks/leases (e.g. notification `route_leases`, runtime manager per-game operation leases `rtmanager:game_lease:{game_id}`);
- `Auth / Session Service` challenges and active session tokens, which are TTL-bounded and where loss is recoverable by re-authentication;
- lobby per-game runtime aggregates that are deleted at game finish (`game_turn_stats`, `gap_activated_at`, capability evaluation marker).
Database topology
- Single PostgreSQL database `galaxy`.
- Schema per service: `user`, `mail`, `notification`, `lobby`, `rtmanager`, `gamemaster`. Reserved for future use: `geoprofile`. Not allocated unless needed: `gateway`, `authsession`.
- Each service connects with its own PostgreSQL role whose grants are restricted to its own schema (defense-in-depth).
- Authentication is username + password only. `sslmode=disable`. No client certificates and no SCRAM channel binding.
- Each service connects to one primary plus zero-or-more read-only replicas. Only the primary is used in this iteration; the replica pool is wired but receives no traffic. Future read-routing is a non-breaking change.
Redis topology
- Each service connects to one master plus zero-or-more replicas.
- All connections require a password. `USERNAME`/ACL is not used. TLS is off.
- Only the master is used in this iteration; the replica list is wired but unused. Failover/read routing is added later without a config break.
- The legacy env vars `*_REDIS_TLS_ENABLED` and `*_REDIS_USERNAME` are removed without a backward-compat shim.
Library stack and migration discipline
- Driver: `github.com/jackc/pgx/v5`, exposed as `*sql.DB` via `github.com/jackc/pgx/v5/stdlib` so it is consumable by query builders written against `database/sql`.
- Query layer: `github.com/go-jet/jet/v2` (PostgreSQL dialect). Generated code lives under each service's `internal/adapters/postgres/jet/`, regenerated by a per-service `make jet` target (testcontainers + goose + jet) and committed to the repo so consumers don't need Docker just to build.
- Migrations: `github.com/pressly/goose/v3` library API. Migration files are embedded via `//go:embed *.sql`, applied at service startup before any listener opens; the service exits non-zero on failure. Files are forward-only, sequence-numbered, and use the standard `-- +goose Up` / `-- +goose Down` markers.
- Single-init policy during pre-launch development: each PG-backed service ships exactly one migration file, `00001_init.sql`, that represents the full current schema. New tables, columns, and indexes are added by editing that file directly rather than by appending `00002_*.sql`, `00003_*.sql`, etc. The trade-off is intentional: schema clarity beats migration-history granularity while no production database exists. Once the platform reaches its first production deploy, future schema evolution switches to additive sequence-numbered migrations.
- Test infrastructure: `github.com/testcontainers/testcontainers-go` plus the `modules/postgres` submodule for unit tests and for `make jet`.
Per-service decision records that capture schema and adapter choices live
at `galaxy/<service>/docs/postgres-migration.md`.
Timestamp handling
Every time-valued column in every Galaxy schema is `timestamptz`. The
adapter layer is responsible for ensuring that all `time.Time` values
crossing the SQL boundary carry `time.UTC` as their location.
- Writes. Every `time.Time` parameter bound through `database/sql` (`ExecContext`, `QueryContext`, `QueryRowContext`) is normalised with `.UTC()` at the binding site. Optional `*time.Time` columns are bound through a shared helper (`nullableTime` or equivalent per adapter) that returns `value.UTC()` when non-nil and SQL `NULL` otherwise. Helper bindings of `cutoff`, `now`, etc. (retention, schedulers) follow the same rule even when the input was already produced via `clock.Now().UTC()`; defensive `.UTC()` calls are intentional and cheap.
- Reads. Every `time.Time` scanned out of PostgreSQL is re-wrapped with `.UTC()` (directly or via a small helper that mirrors `nullableTime` for the read path) before it leaves the adapter. The domain layer therefore never observes a `time.Time` whose location is anything other than `time.UTC`.
- Why. PostgreSQL stores `timestamptz` as UTC at rest, but the Go driver returns scanned values in `time.Local`. Mixing locations across the boundary produces inequalities in tests, drift in JSON output, and comparison bugs against pointer fields. The defensive `.UTC()` rule on both sides removes that class of bug entirely.
Configuration
For each service <S> ∈ { USERSERVICE, MAIL, NOTIFICATION,
LOBBY, RTMANAGER, GAMEMASTER, GATEWAY, AUTHSESSION }, the Redis
connection accepts:
- `<S>_REDIS_MASTER_ADDR` (required)
- `<S>_REDIS_REPLICA_ADDRS` (optional, comma-separated)
- `<S>_REDIS_PASSWORD` (required)
- `<S>_REDIS_DB`, `<S>_REDIS_OPERATION_TIMEOUT`
For PG-backed services (USERSERVICE, MAIL, NOTIFICATION, LOBBY,
RTMANAGER, GAMEMASTER) the Postgres connection accepts:
- `<S>_POSTGRES_PRIMARY_DSN` (required; `postgres://<role>:<pwd>@<host>:5432/galaxy?search_path=<schema>&sslmode=disable`)
- `<S>_POSTGRES_REPLICA_DSNS` (optional, comma-separated)
- `<S>_POSTGRES_OPERATION_TIMEOUT`, `<S>_POSTGRES_MAX_OPEN_CONNS`, `<S>_POSTGRES_MAX_IDLE_CONNS`, `<S>_POSTGRES_CONN_MAX_LIFETIME`
Stream- and key-shape env vars (`*_REDIS_DOMAIN_EVENTS_STREAM`,
`*_REDIS_LIFECYCLE_EVENTS_STREAM`, `*_REDIS_KEYSPACE_PREFIX`,
`MAIL_REDIS_COMMAND_STREAM`, `NOTIFICATION_INTENTS_STREAM`,
`RTMANAGER_REDIS_START_JOBS_STREAM`, `RTMANAGER_REDIS_STOP_JOBS_STREAM`,
`RTMANAGER_REDIS_JOB_RESULTS_STREAM`, `RTMANAGER_REDIS_HEALTH_EVENTS_STREAM`,
etc.) keep their current names and semantics: they describe stream/key
shapes, not connection topology.
Test and Contract Conventions
The repository follows a small set of cross-service rules for contract specifications and test doubles. Each rule is captured below with the rejected alternatives so future services do not re-litigate them.
AsyncAPI version: 3.1.0
Every AsyncAPI spec in the repository declares `asyncapi: 3.1.0`
(`notification/api/intents-asyncapi.yaml`,
`rtmanager/api/runtime-jobs-asyncapi.yaml`,
`rtmanager/api/runtime-health-asyncapi.yaml`). Operators read the same
shape across services: channel with address, separate operations
block, `action: send | receive` vocabulary.
Alternatives rejected:
- AsyncAPI 2.6.0: it would carry the same information under different field names (`publish`/`subscribe` blocks living inside the channel) and the shared YAML walker assertions would not transfer cleanly;
- adding a typed AsyncAPI parser library: no Galaxy service uses one today; introducing a new dependency for the existing specs would break the established pattern that all AsyncAPI freeze tests are pure YAML walkers using `gopkg.in/yaml.v3`.
The `oneOf`-based polymorphism on the `details` field in
`runtime-health-asyncapi.yaml` is plain JSON Schema and works
identically in 3.1.0; no AsyncAPI-version-specific feature is used. If
`notification/api/intents-asyncapi.yaml` ever moves to a newer major,
every downstream service moves with it as a cross-service contract bump.
Contract freeze tests
OpenAPI freeze tests use `github.com/getkin/kin-openapi/openapi3`. The
library is already a workspace-wide dependency
(`lobby/contract_openapi_test.go`, `game/openapi_contract_test.go`,
`rtmanager/contract_openapi_test.go`). It validates OpenAPI 3.0
syntactic correctness, exposes a typed AST, and lets assertions reach
operation IDs, schema references, required fields, and enum membership
without a hand-rolled parser.
AsyncAPI freeze tests use `gopkg.in/yaml.v3` plus a small set of
helpers (`getMapValue`, `getStringValue`, `getStringSlice`,
`getSliceValue`, `getBoolValue`). AsyncAPI 3.1.0 is itself a JSON
Schema document; the freeze tests only need to assert on field paths,
enum membership, required fields, and `$ref` targets, none of which
require type-aware parsing.
Both freeze tests live at the module root (`package <service>` next to
`go.mod`) for every service. A subpackage like `<service>/contracts/`
would have to import the service's domain types to share constants,
which would create the exact import cycle the freeze tests are meant
to prevent.
Test doubles: mockgen for narrow recorder ports, *inmem for behavioural fakes
Test doubles in the repository follow a three-track convention:
- Narrow recorder ports (interfaces whose implementation has no domain semantics: record calls, return injectable errors, expose accessor methods) use `go.uber.org/mock` mocks. Examples: `lobby/internal/ports/{RuntimeManager, IntentPublisher, GMClient, UserService}`, `rtmanager/internal/ports/DockerClient`, `rtmanager/internal/api/internalhttp/handlers/{Start, Stop, Restart, Patch, Cleanup}Service`. `//go:generate` directives live next to the interface declaration; generated mocks are committed under `<module>/internal/adapters/mocks/` (or `handlers/mocks/`); the `make -C <module> mocks` target regenerates them.
- Behavioural in-memory adapters (re-implement the production contract: CAS, domain transitions, monotonic invariants, two-tier invariants like the Race Name Directory) live under `<module>/internal/adapters/<thing>inmem/` and stay hand-rolled. Replacing them with `mockgen` would force every consumer site to script `EXPECT()` chains for behaviour the fake currently handles automatically, and would lose the cross-implementation parity guarantee.
- Dead test doubles with no consumers are deleted on sight.
Per-test recorder helpers (small structs holding captured slices and
per-test error injection) live inside the test files that use them
rather than in a shared `mockrec` / `testfixtures` package. A shared
package would re-create the retired `*stub` convention in a different
namespace; per-test recorders are easy to specialise without polluting
a shared surface.
`racenameinmem` is a special case: it is also one of two selectable
Race Name Directory backends chosen via
`LOBBY_RACE_NAME_DIRECTORY_BACKEND=stub` (the config token name is
preserved while the package name follows the `*inmem` convention; both
backends pass the shared conformance suite at
`lobby/internal/ports/racenamedirtest/`).
The maintained `go.uber.org/mock` fork is preferred over the archived
`github.com/golang/mock`.
Main End-to-End Flows
1. Public authentication flow
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant User
    participant Mail
    participant Redis
    Client->>Gateway: POST send-email-code
    Gateway->>Auth: send-email-code
    Auth->>User: resolve existing/creatable/blocked
    User-->>Auth: decision
    Auth->>Mail: send or suppress code
    Auth-->>Gateway: challenge_id
    Gateway-->>Client: challenge_id
    Client->>Gateway: POST confirm-email-code(time_zone)
    Gateway->>Auth: confirm-email-code(time_zone)
    Auth->>Auth: validate challenge/code/public key/time_zone
    Auth->>User: resolve/create/block with create-only registration context when needed
    User-->>Auth: user_id or deny
    Auth->>Auth: create device_session
    Auth->>Redis: write gateway session projection
    Auth->>Redis: publish session lifecycle update
    Auth-->>Gateway: device_session_id
    Gateway-->>Client: device_session_id
```
This preserves the existing gateway/auth contract and the rule that auth is not on the steady-state hot path.
2. Authenticated game/platform request flow
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Lobby
    participant GM as Game Master
    Client->>Gateway: ExecuteCommand(message_type, payload, signature)
    Gateway->>Gateway: verify session, signature, freshness, replay
    alt platform-level command
        Gateway->>Lobby: verified authenticated command
        Lobby-->>Gateway: response
    else running-game command
        Gateway->>GM: verified authenticated command with game_id
        GM-->>Gateway: response
    end
    Gateway-->>Client: signed response
```
3. Game creation and pre-start lifecycle
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant Lobby
    participant User
    Client->>Gateway: create/apply/invite/approve/start-preparation commands
    Gateway->>Lobby: verified platform command
    Lobby->>User: entitlement/limit checks when needed
    User-->>Lobby: allow/deny and user metadata
    Lobby->>Lobby: update game metadata, roster, schedule, target engine version
    Lobby-->>Gateway: response
    Gateway-->>Client: signed response
```
4. Game start flow
```mermaid
sequenceDiagram
    participant Owner as Admin or Private Owner
    participant Gateway
    participant Lobby
    participant Runtime
    participant GM as Game Master
    participant Engine as Game Engine Container
    participant Redis
    Owner->>Gateway: start game
    Gateway->>Lobby: verified start command
    Lobby->>Lobby: validate ready_to_start and roster
    Lobby->>Runtime: async start job
    Runtime-->>Redis: runtime job result event
    alt start failed
        Lobby->>Lobby: keep failure / starting error state
        Lobby-->>Gateway: failure or accepted-then-observed failure path
    else container started
        Lobby->>Lobby: persist game metadata and runtime binding
        Lobby->>GM: sync running-game registration
        GM->>Engine: initial engine setup API
        GM->>GM: initialize runtime state
        GM-->>Lobby: registration result
        Lobby->>Lobby: mark game running or paused
    end
```
Critical rule:
if the container starts but Lobby cannot persist metadata, the launch is considered a full failure and the container must be removed.
If metadata is persisted but Game Master is unavailable, the game is placed into paused and administrators are notified.
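The two critical-rule branches reduce to a small decision table. A sketch (the outcome struct and state strings are illustrative labels for the rules stated above):

```go
package main

import "fmt"

// StartOutcome captures the resolution of one game-start attempt.
type StartOutcome struct {
	RemoveContainer bool   // the started container must be torn down
	GameState       string // resulting platform game state
	NotifyAdmins    bool
}

// resolveStart applies the critical rule: a started container without
// persisted metadata is a full failure (container removed); persisted
// metadata without a reachable Game Master pauses the game and notifies
// administrators.
func resolveStart(containerStarted, metadataPersisted, gmRegistered bool) StartOutcome {
	switch {
	case !containerStarted:
		return StartOutcome{GameState: "start_failed"}
	case !metadataPersisted:
		return StartOutcome{RemoveContainer: true, GameState: "start_failed"}
	case !gmRegistered:
		return StartOutcome{GameState: "paused", NotifyAdmins: true}
	default:
		return StartOutcome{GameState: "running"}
	}
}

func main() {
	fmt.Println(resolveStart(true, false, false)) // {true start_failed false}
	fmt.Println(resolveStart(true, true, false))  // {false paused true}
	fmt.Println(resolveStart(true, true, true))   // {false running false}
}
```

The asymmetry is deliberate: metadata is the point of no return, so a metadata failure compensates by removing the container, while a GM failure after that point degrades to `paused` instead of rolling back.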
5. Running-game command flow
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant GM as Game Master
    participant Lobby
    participant Engine
    Client->>Gateway: game-related ExecuteCommand(game_id,...)
    Gateway->>GM: verified authenticated command
    GM->>GM: check runtime status
    GM->>Lobby: resolve/cached-check membership if needed
    Lobby-->>GM: membership / permissions
    GM->>Engine: game or runtime-admin API call
    Engine-->>GM: result
    GM-->>Gateway: response payload
    Gateway-->>Client: signed response
```
6. Scheduled turn generation flow
```mermaid
sequenceDiagram
    participant Scheduler as Game Master Scheduler
    participant GM as Game Master
    participant Engine
    participant Lobby
    participant Notify as Notification Service
    participant Gateway
    Scheduler->>GM: due turn slot reached
    GM->>GM: switch runtime_status to generation_in_progress
    GM->>Engine: generate next turn
    alt generation success
        Engine-->>GM: new turn result / maybe finished
        GM->>GM: update current_turn and runtime state
        GM->>Lobby: sync runtime snapshot
        GM->>Notify: publish new-turn intent
        Notify->>Gateway: client-facing push events
    else generation failed
        Engine-->>GM: error / timeout
        GM->>GM: mark generation_failed
        GM->>Lobby: sync runtime snapshot
        GM->>Notify: notify administrators only
    end
```
Players receive only a lightweight push notification that a new turn exists. They then request their own per-player game state separately.
If force next turn is used, the next scheduled slot is skipped so that the effective time between turns never becomes shorter than the schedule spacing.
7. Game finish flow
```mermaid
sequenceDiagram
    participant Engine
    participant GM as Game Master
    participant Lobby
    participant Notify as Notification Service
    participant Gateway
    Engine->>GM: game finished
    GM->>GM: update runtime state
    GM->>Lobby: mark platform game finished
    Lobby->>Lobby: finalize game record
    GM->>Notify: publish game-finished intent
    Notify->>Gateway: push user-facing/platform events
```
8. Geo profile auxiliary flow
```mermaid
sequenceDiagram
    participant Gateway
    participant Geo
    participant User
    participant Auth
    Gateway-->>Geo: async observation(user_id, device_session_id, ip_addr)
    Geo->>Geo: derive observed_country and aggregates
    alt suspicious multi-country pattern
        Geo->>Auth: sync block suspicious session(s)
    end
    alt declared_country admin change approved later
        Geo->>User: sync current declared_country update
    end
```
This flow is intentionally fail-open relative to gameplay.
Separation of Platform Metadata and Engine State
This distinction is fundamental.
Platform-level state
Owned by Game Lobby:
- who owns the game;
- who is invited;
- who applied;
- who was approved;
- who is currently a platform participant;
- what the schedule is;
- whether the game is public/private;
- whether the game is `draft`, `running`, `paused`, `finished`, etc. as a platform entity.
Runtime/operational state
Owned by Game Master:
- current turn;
- runtime status;
- generation state;
- engine reachability;
- patch state;
- mapping to engine player UUIDs;
- engine version registry;
- operational metadata of the running game.
Full game state
Owned only by the game engine container:
- actual per-player game state;
- internal mechanics and progression;
- player-visible game state snapshots;
- win/lose logic;
- domain truth of the game world.
The platform must not attempt to duplicate the full game state outside the engine.
Versioning of Game Engines
Every game runs on one specific game engine version.
Rules:
- active games stay on the version with which they were started;
- upgrade during a running game is allowed only as a patch update within the same major/minor line;
- game-engine version management is manual in v1;
- each engine version may carry version-specific engine options;
- `Game Master` owns the engine version registry from v1: `(version, image_ref, options, status)` rows live in the `gamemaster` schema and are managed exclusively through GM's internal REST surface;
- `Game Lobby` resolves `image_ref` synchronously through GM at game start by calling `GET /api/v1/internal/engine-versions/{version}/image-ref`;
- `LOBBY_ENGINE_IMAGE_TEMPLATE` and any Lobby-side template-based resolution are removed without a backward-compat shim. If GM is unavailable when Lobby attempts the resolve, the start fails with `service_unavailable` and `runtime:start_jobs` is never published;
- `Runtime Manager` continues to receive a verbatim `image_ref` from the start envelope and never resolves engine versions itself.
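The synchronous resolve can be sketched end-to-end with a fake GM server. Only the route shape comes from the rules above; the handler, the response body field name, and the client helper are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newGMServer fakes GM's registry endpoint
// GET /api/v1/internal/engine-versions/{version}/image-ref.
func newGMServer(registry map[string]string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p := strings.TrimPrefix(r.URL.Path, "/api/v1/internal/engine-versions/")
		version := strings.TrimSuffix(p, "/image-ref")
		ref, ok := registry[version]
		if !ok {
			http.NotFound(w, r)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"image_ref": ref})
	}))
}

// resolveImageRef is the Lobby-side synchronous resolve; any failure means
// the start fails with service_unavailable and runtime:start_jobs is never
// published.
func resolveImageRef(gmBaseURL, version string) (string, error) {
	resp, err := http.Get(gmBaseURL + "/api/v1/internal/engine-versions/" + version + "/image-ref")
	if err != nil {
		return "", fmt.Errorf("service_unavailable: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("resolve failed: %s", resp.Status)
	}
	var body map[string]string
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return "", err
	}
	return body["image_ref"], nil
}

func main() {
	gm := newGMServer(map[string]string{"1.4.2": "registry.example/galaxy-engine:1.4.2"})
	defer gm.Close()
	ref, err := resolveImageRef(gm.URL, "1.4.2")
	fmt.Println(ref, err) // registry.example/galaxy-engine:1.4.2 <nil>
}
```

Resolving before publishing keeps Runtime Manager dumb by design: the start envelope already carries a verbatim `image_ref`, so RTM never needs version knowledge.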
Administrative Access Model
Two distinct external admin modes exist.
System administrator
Uses a separate admin-facing REST surface via gateway and Admin Service.
System administrator can:
- manage public games;
- see and operate on all private games;
- inspect platform operational state;
- launch, stop, patch, pause, and monitor games;
- approve/reject participation in public games;
- perform user/game administrative actions.
Private-game owner
Uses the normal authenticated client protocol, not the separate system admin UI.
Allowed owner-admin actions are limited to the owner’s own private games and include at least:
- initiate enrollment;
- create and manage user-bound invites inside the system;
- approve/reject applicants;
- start game after enrollment;
- force next turn while running;
- stop game;
- temporarily or permanently remove/block players from that game according to allowed policy.
These operations use dedicated admin-related message_type values in the normal authenticated game/client protocol.
Non-Goals
The architecture intentionally does not try to solve all future concerns now.
Current non-goals:
- a separate policy engine;
- automatic billing integration in v1;
- automatic match balancing in v1;
- direct external access to internal services;
- pushing full per-player game state over notification channels;
- allowing game engine containers to be called directly by clients or by services other than `Game Master`;
- using `Auth / Session Service` as a hot synchronous dependency for all authenticated traffic;
- making `Notification Service` the source of truth for notification preferences in v1.
Recommended Order of Service Implementation
Recommended order for implementation is:
1. Edge Gateway Service (implemented)
   First public ingress, transport boundary, authentication boundary, signed request/response model, push delivery, session cache, replay protection.
2. Auth / Session Service (implemented)
   Public auth flow, `device_session`, revoke/block lifecycle, gateway session projection.
3. User Service (implemented)
   Regular-user identity, profile/settings, tariffs/entitlements, user limits, sanctions, and current `declared_country`.
4. Mail Service (implemented)
   Internal email delivery for auth codes and platform notification mail.
5. Notification Service (implemented)
   Unified async delivery of push and non-auth email notifications, with real Gateway and Mail Service boundary coverage.
6. Game Lobby Service (implemented)
   Platform game records, membership, invites, applications, approvals, schedules, user-facing lists, pre-start lifecycle.
7. Runtime Manager (implemented)
   Dedicated Docker-control service for container lifecycle (start, stop, restart, semver-patch, cleanup) and inspect/health monitoring through Docker events, periodic inspect, and active HTTP probes. Driven asynchronously from `Game Lobby` via `runtime:start_jobs`/`runtime:stop_jobs` and synchronously from `Game Master` and `Admin Service` via the trusted internal REST surface.
8. Game Master
   Single-instance running-game orchestrator. Owns the runtime state (`game_id → engine_endpoint`, status, current turn, scheduling, engine health), the engine version registry consumed synchronously by `Game Lobby` for `image_ref` resolution, and the platform mapping `(user_id, race_name, engine_player_uuid)` per running game. Drives the turn scheduler with the force-next-turn skip rule, mediates every engine HTTP call (admin paths under `/api/v1/admin/*`, player paths under `/api/v1/{command, order, report}`), and reacts to `StateResponse.finished` by transitioning the runtime to `finished` and publishing `game_finished`. Drives `Runtime Manager` synchronously over REST for stop, restart, and patch; consumes `runtime:health_events` from RTM; publishes `gm:lobby_events` (event-only, no heartbeat) and `notification:intents`. Never opens the Docker SDK.
9. Admin Service
   Admin UI backend that orchestrates trusted APIs of other services.
10. Geo Profile Service (planned)
    Auxiliary geo aggregation, review recommendation, suspicious-session blocking, declared-country workflow.
11. Billing Service
    Future payment and subscription source feeding entitlements into `User Service`.
This order gives the platform a usable public perimeter first, then identity/auth, then core gameplay lifecycle, then runtime orchestration, and only afterward secondary auxiliary services.