Ilia Denisov fe829285a6 feat: use postgres

2026-04-26 20:34:39 +02:00


Services Architecture

Galaxy: Turn-based Strategy Game

Purpose

This document defines the high-level architecture of the Galaxy Game platform as a single source of truth for implementing all core microservices.

It describes:

public and trusted service boundaries;
ownership of main business entities and state;
request routing and transport rules;
interaction rules between services;
runtime model for game containers;
notification and event propagation model;
recommended implementation order.

Detailed behavior of each concrete service belongs in its own README. This document fixes the system-level structure and the architectural rules that must remain stable across service implementations.

Scope

Galaxy Game is a multiplayer turn-based online strategy game platform.

Core product properties:

many game sessions may exist simultaneously;
one user may participate in multiple games at once;
users authenticate by e-mail confirmation code;
users have platform roles and tariff/entitlement state;
games may be public or private;
public games are managed by system administrators;
private games are created and managed by eligible paid users;
each running game is executed inside its own dedicated game engine container;
each running game is bound to one concrete engine version;
in-place upgrade of a running game is allowed only as a patch update within the same semver major/minor line;
player commands are turn-bound and are accepted only before the next scheduled turn generation cutoff.
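The patch-only upgrade rule can be illustrated with a small check. This is a minimal sketch, not the platform's implementation: the function names are hypothetical, versions are assumed to be plain MAJOR.MINOR.PATCH strings without pre-release or build metadata, and "upgrade" is assumed to mean a strictly higher patch number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a plain MAJOR.MINOR.PATCH triple. Pre-release and build
// metadata are out of scope for this sketch.
type version struct{ major, minor, patch int }

// parseVersion accepts only three non-negative numeric components.
func parseVersion(s string) (version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", s)
	}
	var n [3]int
	for i, p := range parts {
		v, err := strconv.Atoi(p)
		if err != nil || v < 0 {
			return version{}, fmt.Errorf("version %q: bad component %q", s, p)
		}
		n[i] = v
	}
	return version{n[0], n[1], n[2]}, nil
}

// canPatchInPlace reports whether a running game bound to `current` may be
// upgraded in place to `target`: same major and minor, higher patch
// (the "higher" part is an assumption of this sketch).
func canPatchInPlace(current, target string) (bool, error) {
	c, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	t, err := parseVersion(target)
	if err != nil {
		return false, err
	}
	return c.major == t.major && c.minor == t.minor && t.patch > c.patch, nil
}

func main() {
	for _, target := range []string{"1.4.3", "1.5.0", "1.4.1"} {
		ok, err := canPatchInPlace("1.4.2", target)
		fmt.Println(target, ok, err)
	}
}
```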

The platform stores durable business state in PostgreSQL (one shared database, schema per service) and uses Redis with Redis Streams for ephemeral state, caches, and the internal event bus. The backend split, library stack, and staged migration plan live in PG_PLAN.md and the Persistence Backends section below.

Main Principles

The platform exposes a single external entry point: Edge Gateway.
Public unauthenticated flows use REST/JSON.
Authenticated user edge traffic uses signed gRPC over HTTP/2 with protobuf control envelopes and FlatBuffers payload bytes.
Trusted synchronous inter-service traffic uses REST/JSON unless a service-specific contract states otherwise.
For the direct Gateway -> User self-service boundary, gateway keeps the external authenticated gRPC + FlatBuffers contract and performs REST/JSON transcoding toward User Service internally.
The gateway handles only edge concerns: parsing, authentication, integrity checks, anti-replay, rate limiting, routing, and push delivery. Business authorization and domain rules remain in downstream services.
Auth / Session Service is the source of truth for device_session, but it is not on the hot path of every authenticated request. Gateway authenticates steady-state traffic from session cache and lifecycle updates.
Game Lobby owns platform-level metadata of game sessions.
Game Master owns runtime and operational state of running games.
Runtime Manager is the only service allowed to access Docker API directly.
Notification Service is the platform-level delivery/orchestration layer for push and most non-auth email notifications.
Mail Service sends email; auth-code mail is sent directly by Auth / Session Service, while all other platform mail is initiated through Notification Service.
Geo Profile Service is auxiliary and fail-open relative to gameplay; it never blocks the currently processed request and may affect only later requests.
If a user-facing request must complete with a deterministic result in the same flow, the critical internal chain must be synchronous. If the interaction is propagation, notification, cache update, runtime job completion, telemetry, or denormalized read-model update, it should be asynchronous.

Security and Transport Model

The former standalone security model is part of the main architecture and is no longer treated as a separate subsystem.

Public and authenticated transport classes

The gateway already distinguishes:

public REST/JSON for unauthenticated traffic such as health checks and public auth;
authenticated gRPC over HTTP/2 for verified commands and push delivery.

For downstream business services, the current default trusted transport is strict REST/JSON. Gateway may therefore authenticate and verify one external FlatBuffers command, then transcode it to one trusted downstream REST call.

When forwarding an authenticated command to a downstream service, Edge Gateway enriches the REST call with the X-User-ID header carrying the verified platform user identifier. Downstream services derive the acting user identity exclusively from this header and must never accept identity claims from request body fields.

The public auth contract is:

send-email-code(email) -> challenge_id
confirm-email-code(challenge_id, code, client_public_key, time_zone) -> device_session_id

The authenticated request contract is based on:

device_session_id
message_type
timestamp_ms
request_id
payload_hash
Ed25519 client signature over canonical envelope fields.

Server responses and push events are signed by the gateway so clients can verify server-originated messages. Push streams are bound to authenticated user_id and device_session_id, and session revoke closes only streams bound to the revoked session.

Verification boundary

Before routing an authenticated request, gateway must:

validate envelope presence and protocol version;
resolve session from session cache;
reject unknown or revoked sessions;
verify payload_hash;
verify client signature;
verify freshness window;
verify anti-replay by device_session_id + request_id;
apply edge rate limits and basic policy checks;
build an authenticated internal command context and only then route downstream.

Downstream services must never receive unauthenticated external traffic.

High-Level System Diagram

flowchart LR
    Client["Game Client\n(native / browser)"]
    AdminUI["Admin UI"]
    Gateway["Edge Gateway\nPublic REST\nAuthenticated gRPC\nAdmin REST"]
    Auth["Auth / Session Service"]
    User["User Service"]
    Lobby["Game Lobby Service"]
    GM["Game Master"]
    Runtime["Runtime Manager"]
    Notify["Notification Service"]
    Mail["Mail Service"]
    Geo["Geo Profile Service"]
    Billing["Billing Service\nfuture"]
    Redis["Redis\nCache, Streams, Leases"]
    Postgres["PostgreSQL\nDurable Business State"]
    Telemetry["Telemetry"]

    Client --> Gateway
    AdminUI --> Gateway

    Gateway --> Auth
    Gateway --> User
    Gateway --> Lobby
    Gateway --> GM
    Gateway --> Geo

    Auth --> User
    Auth --> Mail
    Auth --> Redis

    User --> Redis

    Lobby --> User
    Lobby --> GM
    Lobby --> Runtime
    Lobby --> Redis

    User --> Lobby

    GM --> Lobby
    GM --> Runtime
    GM --> Redis

    Geo --> Auth
    Geo --> User
    Geo --> Redis

    Notify --> Gateway
    Notify --> Mail
    Notify --> Redis

    Runtime --> Redis

    Mail --> Redis
    User --> Postgres
    Mail --> Postgres
    Notify --> Postgres
    Lobby --> Postgres

    Billing --> User
    Telemetry --- Gateway
    Telemetry --- Auth
    Telemetry --- User
    Telemetry --- Lobby
    Telemetry --- GM
    Telemetry --- Runtime
    Telemetry --- Notify
    Telemetry --- Geo

The baseline gateway/auth/session/pub-sub model above is consistent with the existing architecture and service READMEs.

Service List and Responsibility Boundaries

1. Edge Gateway

Edge Gateway is the only public entry point for all external traffic. It already owns transport parsing, session-cache-based authentication, signature verification, freshness/replay checks, edge rate limiting, routing, and push delivery. It must remain free of domain-specific business logic.

External surfaces:

public REST:
- health and readiness;
- public auth commands;
- browser/bootstrap and public route classes where needed.
authenticated gRPC:
- generic ExecuteCommand;
- authenticated SubscribeEvents.
admin REST:
- separate public administrative surface for system administrators;
- routed only for authenticated users with admin role.

The gateway does not directly access game engine containers. For running games it routes to Game Master. For pre-game platform flows it routes to Game Lobby. For user-profile requests it routes to User Service. For public auth it routes to Auth / Session Service.

2. Auth / Session Service

Auth / Session Service owns:

challenge lifecycle;
e-mail-code authentication;
creation of device_session;
registration of the client Ed25519 public key;
revoke/logout/block state;
trusted internal read/revoke/block API;
projection of session lifecycle state into gateway-consumable Redis data.

It is the source of truth for:

authentication challenges;
device_session;
revoke/block state.

Important architectural rules:

public auth stays synchronous;
confirm-email-code returns a ready device_session_id;
no async “pending session provisioning” step exists;
session source of truth and gateway-facing projection remain separate;
active-session limits are configuration-driven;
send-email-code stays success-shaped for existing, new, blocked, and throttled email flows.

When confirm-email-code reaches first successful completion for an e-mail address that does not yet belong to a user, auth may pass create-only registration context to User Service during the synchronous ensure/create step.

Direct integrations:

synchronous to User Service for user resolution/create/block decision;
synchronous to Mail Service for auth-code delivery;
asynchronous session lifecycle projection into Redis for gateway consumption.

3. User Service

User Service owns regular-user identity and profile as platform-level business data.

It is the source of truth for:

user_id of regular platform users;
user_name — immutable auto-generated unique platform handle in player-<suffix> form; never used as foreign key in other models;
display_name — mutable free-text user-editable label validated through pkg/util/string.go:ValidateTypeName; not required to be unique; default empty for new accounts;
editable user settings (preferred_language, time_zone);
current tariff/entitlement state including max_registered_race_names;
user-specific limits and platform sanctions (including permanent_block and max_registered_race_names override limits);
latest effective declared_country;
soft-delete state via DeleteUser.

User Service does not own in-game race_name values; those live in Game Lobby Race Name Directory.

System-administrator identity remains outside this service and belongs to the later Admin Service. Trusted administrative reads and mutations against regular-user state do not make User Service the owner of administrator identity.

It is directly reachable through gateway for selected user-facing operations such as:

reading and editing allowed profile fields;
viewing tariff and entitlement state;
viewing user settings;
viewing current restrictions and sanctions.

Not every profile mutation goes directly here. For example:

email change must use a code-confirm flow;
declared_country change remains under admin approval flow via Geo Profile Service.

Architectural rules fixed for this service:

User Service owns regular-user identity only; system-admin identity is out of scope.
User Service stores only the current effective declared_country; review workflow and history belong to Geo Profile Service.
User Service does not own in-game race_name values. All in-game name state (registered, reserved, pending registration) lives in the Game Lobby Race Name Directory. The only identity strings owned by User Service are user_name (immutable) and display_name (mutable, non-unique).
permanent_block is a dedicated sanction code that collapses every can_* eligibility marker to false and triggers RND cascade release via the user:lifecycle_events stream.
DeleteUser is a trusted internal endpoint that soft-deletes the account, rejects all subsequent operations with subject_not_found, and triggers the same RND cascade release.
During the current auth-registration rollout, Auth / Session Service passes a preferred-language candidate derived from public Accept-Language, falling back to en when no supported value is available, plus the confirmed time_zone into User Service.

Future billing does not become a direct dependency of other services. Billing Service will feed entitlement/payment outcomes into User Service, and the rest of the platform will continue to use User Service as the source of truth for current entitlements.

4. Mail Service

Mail Service is the internal email delivery service.

Split of responsibility:

auth code emails: Auth / Session Service -> Mail Service directly;
all other user/admin notification emails: Notification Service -> Mail Service.

Transport rules:

Auth / Session Service -> Mail Service uses the dedicated synchronous trusted internal REST contract POST /api/v1/internal/login-code-deliveries;
Notification Service -> Mail Service is an asynchronous internal command flow carried through dedicated queue-backed handoff after durable route acceptance inside Notification Service.

This split is covered by integration tests: auth-code delivery bypasses Notification Service, while notification-generated mail uses template-mode commands whose template_id equals notification_type.

Mail Service may internally queue both flows. Its trusted operator read and resend APIs are part of the v1 service surface, not a later add-on. For auth callers, a successful result means the request was durably accepted into the mail-delivery pipeline or intentionally suppressed; it does not require that the external SMTP exchange already completed before the response is returned. Stable service-local delivery rules, retry semantics, and storage details (PostgreSQL for the durable delivery record, attempt history, dead letters, and audit; Redis for the inbound mail:delivery_commands stream and its consumer offset) belong in mail/README.md, not in the root architecture document.

5. Geo Profile Service

Geo Profile Service is an internal trusted auxiliary service for country-level connection signals of authenticated users.

It integrates with:

gateway as asynchronous ingest producer;
User Service for current effective declared_country;
Auth / Session Service for suspicious session blocking;
Notification Service for optional admin notifications.

It owns:

observed country facts;
per-session country aggregation;
usual_connection_country;
country_review_recommended;
history of declared_country changes.

It does not block the request that triggered suspicion. It can only request block of suspicious sessions for subsequent requests. It does not call Mail Service directly; optional admin mail must flow through Notification Service.

In this document, references to Edge Service in older geo documentation should be understood as Edge Gateway.

6. Admin Service

Admin Service is the external backend/orchestration layer for the administrative UI.

It is not a heavy domain owner. Its job is to:

expose administrator-facing workflows;
call trusted internal APIs of other services;
aggregate administrative views where needed;
enforce system-admin role checks at the gateway/admin boundary.

System administrators can view and operate on all games, including private ones.

7. Game Lobby Service

Game Lobby owns platform-level metadata and lifecycle of game sessions as platform entities.

It is the source of truth for:

game records before and after runtime existence;
public/private game type;
owner of a private game;
user-bound invitations and invite lifecycle;
applications and approvals;
membership and roster;
blocked/removed participants at platform level;
turn schedule configuration;
target engine version for launch;
user-facing lists of games;
denormalized runtime snapshot imported from Game Master.

In particular, Game Lobby is the source of truth for:

party membership;
invited / pending / active / finished / removed status of players relative to games;
user-visible lists such as active / finished / pending / invited games.

It also stores a denormalized runtime snapshot for convenience, at least:

current_turn;
runtime_status;
engine_health_summary.

Additionally, Game Lobby aggregates per-member game statistics from player_turn_stats carried on each runtime_snapshot_update event: current and running-max of planets, population, and ships_built. The aggregate is retained from game start until capability evaluation at game_finished.

This prevents user-facing list/read flows from fan-out requests into Game Master.

Lobby status model

Minimum platform-level status set:

draft
enrollment_open
ready_to_start
starting
start_failed
running
paused
finished
cancelled

Lobby.paused is a business/platform pause, distinct from engine/runtime failure states.

start_failed indicates that the runtime container could not be started or that metadata persistence failed after a successful container start. From start_failed an admin or owner may retry (→ ready_to_start) or cancel (→ cancelled).

Enrollment rules

Each game stores five enrollment configuration fields set at creation:

min_players — minimum approved participants required before the game may start.
max_players — target roster size that activates the gap admission window.
start_gap_hours — hours to keep enrollment open after max_players is reached.
start_gap_players — additional players admitted during the gap window.
enrollment_ends_at — UTC Unix timestamp at which enrollment closes automatically.

Transition from enrollment_open to ready_to_start occurs via one of three paths:

Manual: an admin (public game) or owner (private game) issues a close-enrollment command when approved_count >= min_players.
Deadline: enrollment_ends_at is reached and approved_count >= min_players.
Gap exhaustion: approved_count >= max_players activates a gap window of start_gap_hours during which up to start_gap_players additional participants may join; the transition fires when the gap window expires or approved_count >= max_players + start_gap_players.

All pending invites transition to expired when the game moves to ready_to_start.

Membership rules

User Service owns users of the platform as identities.
Game Lobby owns membership in concrete games.
The game engine does not own platform membership.
Game Master may cache membership for runtime authorization, but Game Lobby remains the source of truth.

Public vs private game rules

Public games:

created and controlled by system administrators;
visible in public list;
joining is based on application and manual admin approval in v1.

Private games:

can be created only by eligible paid users;
visible only to their owner and to invited users whose invitation is bound to a concrete user_id and later accepted;
joining uses a user-bound invite; accepting the invite immediately creates active membership without a separate owner-approval step;
invite lifecycle belongs entirely to Game Lobby.

Private-party owners get a limited owner-admin capability set, not full system admin power.

Race Name Directory

Race Name Directory (RND) is the platform source of truth for in-game player names (race_name). It is owned by Game Lobby in v1 and is scheduled to move to a dedicated Race Name Service later without changing the domain or service-layer logic.

RND owns three levels of state per name:

registered — platform-unique permanent names owned by one regular user. A registered name cannot be transferred, released, or renamed; the only path back to availability is permanent_block or DeleteUser on the owning account. The number of registered names a user can hold is bounded by the current tariff (max_registered_race_names in the User Service eligibility snapshot): free=1, paid_monthly=2, paid_yearly=6, paid_lifetime=unlimited. Tariff downgrade never revokes existing registrations; it only constrains new ones.
reservation — per-game binding created when a participant joins a game through application approval or invite redeem. The reservation key is (game_id, canonical_key). One user may hold the same name simultaneously across multiple active games. A reservation survives until the game finishes, then either becomes a pending_registration (see below) or is released.
pending_registration — a reservation that survived a capable finish and is now waiting up to 30 days for the owner to upgrade it into a registered name via lobby.race_name.register. Expiration releases the binding.

Canonical key — RND uses a canonical key (lowercase + frozen confusable-pair policy) to enforce uniqueness. A name is considered taken for another user when any registered, active reservation, or pending_registration with a different user_id exists under the same canonical key. The confusable-pair policy lives in Lobby (lobby/internal/domain/racename/policy.go).

Capability gating — at game_finished Game Lobby evaluates per-member capability: capable = max_planets > initial_planets AND max_population > initial_population, computed from the player_turn_stats stream published by Game Master. Capable reservations transition to pending_registration with eligible_until = finished_at + 30 days; non-capable reservations are released immediately.

Registration — a user initiates registration via lobby.race_name.register inside the 30-day window. Registration succeeds only when the user is still eligible (no permanent_block, tariff slot available) and the pending entry is still within its window. Expired pending entries are released by a background worker.

Cascade release — User Service publishes user.lifecycle.permanent_blocked and user.lifecycle.deleted events to user:lifecycle_events. Game Lobby consumes this stream and calls RND.ReleaseAllByUser(user_id) atomically with membership/application/invite cancellations for the affected user.

8. Game Master

Game Master owns runtime and operational metadata of already running games.

It is the only trusted service allowed to communicate with game engine containers.

It owns:

runtime mapping of running game to container endpoint/binding;
current turn number;
runtime status;
generation status;
engine health;
patch state;
engine version registry and version-specific engine options;
runtime mapping platform user_id -> engine player UUID for each running game.

Game Master status model

Minimum runtime-level status set:

starting
running
generation_in_progress
generation_failed
stopped
engine_unreachable

running here means running_accepting_commands.

Game command routing

All game-related message_type values include game_id.

Gateway enriches them with authenticated user_id and routes them to Game Master. Game Master checks whether this user may access this running game, using membership data sourced from Game Lobby, then routes the command to the correct engine container.

The gateway never routes directly to game engine containers.

Runtime admin operations

For already running games, Game Master handles:

stop game
force next turn
patch engine
admin/runtime status reads
player deactivation/removal inside engine when required
regular collection of game runtime metrics

System admin can use all of them. Private-game owner can use the subset allowed for the owner of that game.

Turn cutoff and scheduling

Game Master is the owner of authoritative platform time for turn cutoff decisions.

Commands arriving exactly on the boundary of a new turn are considered stale and must not reach the engine.

The scheduler is a subsystem inside Game Master. It triggers turn generation according to the game schedule.

If a manual “force next turn” is executed, the next scheduled turn slot must be skipped so that players still get at least one full normal schedule interval before the following generated turn.

Runtime snapshot publishing

Game Master publishes runtime updates to the gm:lobby_events Redis Stream consumed by Game Lobby. Events include:

runtime_snapshot_update — carries the current_turn, runtime_status, engine_health_summary, and a player_turn_stats array with one entry per active member (user_id, planets, population, ships_built). Game Lobby maintains a per-game per-user stats aggregate from these events for capability evaluation at game finish.
game_finished — carries the final snapshot values and triggers the platform status transition plus Race Name Directory capability evaluation inside Game Lobby.

Game Master does not retain the aggregate; it only publishes the per-turn observation. Game Lobby is responsible for holding initial values and running maxima across the lifetime of the game.
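On the Game Lobby side, the aggregate can be maintained by folding each player_turn_stats array into a per-user record. This is a sketch with hypothetical type names; treating the first observation for a user as the initial value is an assumption, since the source of initial values is not specified here.

```go
package main

import "fmt"

// playerTurnStats is one entry of the player_turn_stats array.
type playerTurnStats struct {
	UserID     string
	Planets    int64
	Population int64
	ShipsBuilt int64
}

// statsAggregate keeps initial, current, and running-max values per member.
type statsAggregate struct {
	InitialPlanets, InitialPopulation                    int64
	CurrentPlanets, CurrentPopulation, CurrentShipsBuilt int64
	MaxPlanets, MaxPopulation, MaxShipsBuilt             int64
}

// gameStats maps user_id to the aggregate for one game.
type gameStats map[string]*statsAggregate

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// apply folds one runtime_snapshot_update into the aggregate.
func (g gameStats) apply(stats []playerTurnStats) {
	for _, s := range stats {
		agg, ok := g[s.UserID]
		if !ok {
			// First observation is taken as the initial value (assumption).
			agg = &statsAggregate{
				InitialPlanets:    s.Planets,
				InitialPopulation: s.Population,
				MaxPlanets:        s.Planets,
				MaxPopulation:     s.Population,
				MaxShipsBuilt:     s.ShipsBuilt,
			}
			g[s.UserID] = agg
		}
		agg.CurrentPlanets = s.Planets
		agg.CurrentPopulation = s.Population
		agg.CurrentShipsBuilt = s.ShipsBuilt
		agg.MaxPlanets = maxInt64(agg.MaxPlanets, s.Planets)
		agg.MaxPopulation = maxInt64(agg.MaxPopulation, s.Population)
		agg.MaxShipsBuilt = maxInt64(agg.MaxShipsBuilt, s.ShipsBuilt)
	}
}

func main() {
	g := gameStats{}
	g.apply([]playerTurnStats{{UserID: "u1", Planets: 1, Population: 100}})
	g.apply([]playerTurnStats{{UserID: "u1", Planets: 4, Population: 700, ShipsBuilt: 3}})
	g.apply([]playerTurnStats{{UserID: "u1", Planets: 2, Population: 300, ShipsBuilt: 3}})
	fmt.Printf("%+v\n", *g["u1"])
}
```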

Runtime/engine finish flow

When the engine determines that a game is finished:

engine reports finish to Game Master;
Game Master updates runtime state;
Game Master notifies Game Lobby;
Game Lobby updates the platform-level game record to finished.

Player removal after start

After a game has started, two different actions exist:

temporary removal/block at platform level:
- the player cannot send commands through gateway/platform;
- the engine still keeps the player slot;
final removal or account-level block:
- Game Master must additionally send an admin command to the engine to deactivate/remove the player inside the game.

This distinction is architectural and must remain explicit.

9. Runtime Manager

Runtime Manager is the only internal service allowed to access Docker API directly.

It owns:

starting game engine containers;
stopping containers;
restarting containers where allowed;
patching/replacing containers where allowed;
technical runtime inspection/status;
monitoring containers and publishing technical health events.

It does not own platform metadata of games. It does not own runtime business state of games. It executes runtime jobs for Game Lobby and Game Master.

Container model

one game = one container;
one container = one game.

This is a hard invariant.

10. Notification Service

Notification Service is the async delivery/orchestration layer for platform notifications.

It has a deliberately minimal role:

consume normalized notification intents from services through dedicated Redis Stream notification:intents;
validate idempotency and persist durable notification route state;
enrich user-targeted routes with email and preferred_language from User Service;
decide whether a given notification type results in push, email, or both;
send user-targeted push events toward gateway by user_id;
send asynchronous non-auth email commands toward Mail Service.

It is not a source of truth for user preferences in v1 unless a later feature requires it.

For user-targeted intents, upstream producers publish the concrete recipient user_id values. Notification Service resolves user email and locale from User Service, uses configured administrator email lists per notification_type for admin-only notifications, keeps template_id == notification_type for notification-generated email, and treats private-game invite flows in v1 as user-bound by internal user_id. Go producers use the shared galaxy/notificationintent module to build and append compatible intents into notification:intents; a failed append is a notification degradation signal and must not roll back already committed source business state. Acceptance of a user-targeted notification intent is complete only after every published recipient user_id resolves through User Service; unresolved user ids are treated as producer input defects and are recorded as malformed notification intents rather than deferred publication failures.

User-facing notifications use push+email unless a type explicitly opts out of one channel. Administrator-facing notifications are email-only in v1.

All platform notifications except auth-code delivery flow through this service, including:

game lifecycle notifications;
invite/application updates;
new turn notifications;
operational/admin notifications where appropriate.

The current process surface exposes only one private probe HTTP listener with GET /healthz and GET /readyz; that probe surface is documented in notification/openapi.yaml. The canonical notification-intent stream contract remains notification/api/intents-asyncapi.yaml. Notification Service does not expose an operator REST API.

11. Billing Service (future)

Billing Service is not part of the first implementation wave.

When introduced, it will:

process payment/billing events;
calculate or validate payment outcomes;
feed resulting entitlement changes into User Service.

User Service remains the source of truth for current entitlement used by the rest of the platform.

Billing-driven tariff changes alter only the headroom for new registered race names: tariff downgrade never revokes already registered names. The affected ceiling is materialized as max_registered_race_names in the eligibility snapshot consumed by Game Lobby.

Data Ownership Summary

flowchart TD
    U["User Service"]
    A["Auth / Session Service"]
    L["Game Lobby"]
    G["Game Master"]
    R["Runtime Manager"]
    P["Geo Profile Service"]
    N["Notification Service"]
    M["Mail Service"]

    U -->|"regular users, user_name/display_name, settings, tariffs, limits, sanctions, declared_country, soft-delete"| X1["Platform user identity"]
    A -->|"challenges, device sessions, revoke/block state"| X2["Auth/session state"]
    L -->|"game metadata, invites, applications, membership, roster, race names (registered/reservations/pending)"| X3["Platform game records"]
    G -->|"runtime state, current turn, engine health, engine mapping, engine version registry"| X4["Running-game state"]
    R -->|"container execution and technical runtime control"| X5["Container runtime"]
    P -->|"observed country, usual_connection_country, review state, declared_country history"| X6["Geo state"]
    N -->|"notification routing only"| X7["Notification orchestration"]
    M -->|"email delivery only"| X8["Email transport"]

Internal Transport Semantics

The platform uses one simple rule:

if the user-facing request must complete with a deterministic result in the same flow, the critical internal chain is synchronous;
if the interaction is propagation, notification, cache invalidation, runtime job completion, telemetry, or denormalized read-model update, it is asynchronous.

Fixed synchronous interactions

Gateway -> Auth / Session Service
Gateway -> Admin Service
Gateway -> User Service
Gateway -> Game Lobby
Gateway -> Game Master
Auth / Session Service -> User Service
Auth / Session Service -> Mail Service
Geo Profile Service -> Auth / Session Service
Geo Profile Service -> User Service
Game Lobby -> User Service
Game Lobby -> Game Master for critical registration/update calls

Fixed asynchronous interactions

session lifecycle projection toward gateway cache;
revoke propagation;
Lobby -> Runtime Manager runtime jobs;
Game Master -> Runtime Manager runtime jobs;
all event-bus propagation;
Game Master -> Game Lobby runtime snapshot updates (including player_turn_stats for capability aggregation) and game-finish events through a dedicated Redis Stream consumed by Game Lobby;
User Service -> Game Lobby user lifecycle events (user.lifecycle.permanent_blocked, user.lifecycle.deleted) through the user:lifecycle_events Redis Stream, consumed by Game Lobby to cascade RND release and membership/application/invite cancellation;
Game Master -> Notification Service notification intents through notification:intents;
Game Lobby -> Notification Service notification intents through notification:intents;
Geo Profile Service -> Notification Service notification intents through notification:intents;
Notification Service -> Gateway;
Notification Service -> Mail Service;
geo auxiliary ingest from gateway to geo service;
runtime health events from Runtime Manager.

Mixed interactions

Some service pairs may use both styles for different flows. The main example is Lobby -> Game Master:

synchronous for critical registration/update after successful start;
asynchronous for secondary propagation and denormalized status fan-out.

Persistence Backends

The platform splits durable state across two backends.

PostgreSQL is the source of truth for table-shaped business state:

user identity, profile settings, tariffs/entitlements, sanctions, limits, and the blocked-email registry;
mail deliveries, attempt history, dead letters, payloads, and malformed-command audit;
notification records, route materialisations, dead letters, and malformed-intent audit;
lobby games, applications, invites, memberships, and the race-name registry (registered/reservation/pending tiers);
idempotency records, expressed as UNIQUE constraints on the durable table rather than as a separate key-value store;
retry scheduling state, expressed as a next_attempt_at column on the durable table and worked off via SELECT ... FOR UPDATE SKIP LOCKED.
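The last two items can be illustrated with a minimal sketch. It assumes a hypothetical deliveries table with a UNIQUE idempotency_key column and a next_attempt_at column; the table, column, and function names are illustrative only and are not taken from any service schema. It uses plain database/sql for brevity instead of the generated jet query layer.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// Hypothetical table (assumption, not a real service schema):
//   deliveries(delivery_id text, idempotency_key text UNIQUE, next_attempt_at timestamptz)

// insertOnceSQL relies on the UNIQUE constraint for idempotency:
// a duplicate key inserts nothing instead of failing.
const insertOnceSQL = `
INSERT INTO deliveries (delivery_id, idempotency_key, next_attempt_at)
VALUES ($1, $2, $3)
ON CONFLICT (idempotency_key) DO NOTHING`

// claimDueSQL locks due rows for this worker and skips rows that
// another worker has already locked.
const claimDueSQL = `
SELECT delivery_id
FROM deliveries
WHERE next_attempt_at <= $1
ORDER BY next_attempt_at
LIMIT $2
FOR UPDATE SKIP LOCKED`

// insertOnce reports whether the row was newly inserted (false means duplicate).
func insertOnce(ctx context.Context, db *sql.DB, id, key string, due time.Time) (bool, error) {
	res, err := db.ExecContext(ctx, insertOnceSQL, id, key, due.UTC())
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

// claimDue returns up to limit due delivery IDs. The row locks are held
// until tx commits or rolls back, so the caller processes the rows and
// updates next_attempt_at inside the same transaction.
func claimDue(ctx context.Context, tx *sql.Tx, now time.Time, limit int) ([]string, error) {
	rows, err := tx.QueryContext(ctx, claimDueSQL, now.UTC(), limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}

func main() {
	// No database is opened here; the statements are only printed.
	fmt.Println(insertOnceSQL)
	fmt.Println(claimDueSQL)
}
```

With this shape, several workers can poll the same table concurrently without a separate queue or lock service: each worker sees only the rows it managed to lock.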

Redis is the source of truth for ephemeral and runtime-coordination state:

the platform event bus implemented as Redis Streams (user:domain_events, user:lifecycle_events, gm:lobby_events, runtime:job_results, notification:intents, gateway:client-events, mail:delivery_commands);
stream consumer offsets;
gateway session cache, replay reservations, rate-limit counters, and short-lived runtime locks/leases (e.g. notification route_leases);
Auth / Session Service challenges and active session tokens, which are TTL-bounded and where loss is recoverable by re-authentication;
lobby per-game runtime aggregates that are deleted at game finish (game_turn_stats, gap_activated_at, capability evaluation marker).

Database topology

Single PostgreSQL database galaxy.
Schema per service: user, mail, notification, lobby. Reserved for future use: geoprofile. Not allocated unless needed: gateway, authsession.
Each service connects with its own PostgreSQL role whose grants are restricted to its own schema (defense-in-depth).
Authentication is username + password only. sslmode=disable. No client certificates and no SCRAM channel binding.
Each service connects to one primary plus zero-or-more read-only replicas. Only the primary is used in this iteration; the replica pool is wired but receives no traffic. Future read-routing is a non-breaking change.

Redis topology

Each service connects to one master plus zero-or-more replicas.
All connections require a password. USERNAME/ACL is not used. TLS is off.
Only the master is used in this iteration; the replica list is wired but unused. Failover/read routing is added later without a config break.
The legacy env vars *_REDIS_TLS_ENABLED and *_REDIS_USERNAME are removed without a backward-compat shim.

Library stack and migration discipline

Driver: github.com/jackc/pgx/v5, exposed as *sql.DB via github.com/jackc/pgx/v5/stdlib so it is consumable by query builders written against database/sql.
Query layer: github.com/go-jet/jet/v2 (PostgreSQL dialect). Generated code lives under each service internal/adapters/postgres/jet/, regenerated by a per-service make jet target (testcontainers + goose + jet) and committed to the repo so consumers don't need Docker just to build.
Migrations: github.com/pressly/goose/v3 library API. Migration files are embedded via //go:embed *.sql, applied at service startup before any listener opens; the service exits non-zero on failure. Files are forward-only, sequence-numbered, and use the standard -- +goose Up / -- +goose Down markers.
Single-init policy during pre-launch development: each PG-backed service ships exactly one migration file, 00001_init.sql, that represents the full current schema. New tables, columns, and indexes are added by editing that file directly rather than by appending 00002_*.sql, 00003_*.sql, etc. The trade-off is intentional — schema clarity beats migration-history granularity while no production database exists. Once the platform reaches its first production deploy, future schema evolution switches to additive sequence-numbered migrations.
Test infrastructure: github.com/testcontainers/testcontainers-go plus the modules/postgres submodule for unit tests and for make jet.

Per-service decision records that capture schema and adapter choices live at galaxy/<service>/docs/postgres-migration.md.

Timestamp handling

Every time-valued column in every Galaxy schema is timestamptz. The adapter layer is responsible for ensuring that all time.Time values crossing the SQL boundary carry time.UTC as their location.

Writes. Every time.Time parameter bound through database/sql (ExecContext, QueryContext, QueryRowContext) is normalised with .UTC() at the binding site. Optional *time.Time columns are bound through a shared helper (nullableTime or equivalent per adapter) that returns value.UTC() when non-nil and SQL NULL otherwise. Helper bindings of cutoff, now, etc. (retention, schedulers) follow the same rule even when the input was already produced via clock.Now().UTC() — defensive .UTC() calls are intentional and cheap.
Reads. Every time.Time scanned out of PostgreSQL is re-wrapped with .UTC() (directly or via a small helper that mirrors nullableTime for the read path) before it leaves the adapter. The domain layer therefore never observes a time.Time whose location is anything other than time.UTC.
Why. PostgreSQL stores timestamptz as UTC at rest, but the Go driver returns scanned values in time.Local. Mixing locations across the boundary produces inequalities in tests, drift in JSON output, and comparison bugs against pointer fields. The defensive .UTC() rule on both sides removes that class of bug entirely.
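A minimal sketch of the write-path and read-path helpers described above. nullableTime is the name used in this section; scannedTime, scannedNullableTime, and the sample value are hypothetical names chosen for illustration, and each adapter may shape its helpers differently.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// sample is an arbitrary instant expressed in a non-UTC zone (UTC+02:00).
var sample = time.Date(2026, 4, 26, 20, 34, 39, 0, time.FixedZone("UTC+2", 2*60*60))

// nullableTime prepares an optional timestamp for binding:
// nil becomes SQL NULL, anything else is the same instant in UTC.
func nullableTime(value *time.Time) any {
	if value == nil {
		return nil
	}
	return value.UTC()
}

// scannedTime re-wraps a non-null scanned value so the domain layer
// only ever sees time.UTC as the location.
func scannedTime(value time.Time) time.Time {
	return value.UTC()
}

// scannedNullableTime mirrors nullableTime on the read path.
func scannedNullableTime(value sql.NullTime) *time.Time {
	if !value.Valid {
		return nil
	}
	utc := value.Time.UTC()
	return &utc
}

func main() {
	fmt.Println(nullableTime(&sample))
	fmt.Println(nullableTime(nil))
	fmt.Println(scannedTime(sample))
	fmt.Println(scannedNullableTime(sql.NullTime{}))
}
```

Calling .UTC() changes only the location attached to the value, not the instant, so the helpers are safe to apply to values that are already in UTC.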

Configuration

For each service <S> ∈ { USERSERVICE, MAIL, NOTIFICATION, LOBBY, GATEWAY, AUTHSESSION }, the Redis connection accepts:

<S>_REDIS_MASTER_ADDR (required)
<S>_REDIS_REPLICA_ADDRS (optional, comma-separated)
<S>_REDIS_PASSWORD (required)
<S>_REDIS_DB, <S>_REDIS_OPERATION_TIMEOUT
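As a rough sketch, loading these variables for one service could look like the following. The RedisConfig type and the loadRedisConfig function are hypothetical and are not taken from any service; <S>_REDIS_DB and <S>_REDIS_OPERATION_TIMEOUT are left out for brevity.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// RedisConfig holds the connection-topology settings for one service.
type RedisConfig struct {
	MasterAddr   string
	ReplicaAddrs []string
	Password     string
}

// loadRedisConfig reads <prefix>_REDIS_* values through lookup
// (os.Getenv in a real process) and enforces the required ones.
func loadRedisConfig(prefix string, lookup func(string) string) (RedisConfig, error) {
	cfg := RedisConfig{
		MasterAddr: strings.TrimSpace(lookup(prefix + "_REDIS_MASTER_ADDR")),
		Password:   lookup(prefix + "_REDIS_PASSWORD"),
	}
	if cfg.MasterAddr == "" {
		return RedisConfig{}, fmt.Errorf("%s_REDIS_MASTER_ADDR is required", prefix)
	}
	if cfg.Password == "" {
		return RedisConfig{}, fmt.Errorf("%s_REDIS_PASSWORD is required", prefix)
	}
	// The replica list is optional; empty entries are ignored.
	for _, addr := range strings.Split(lookup(prefix+"_REDIS_REPLICA_ADDRS"), ",") {
		if addr = strings.TrimSpace(addr); addr != "" {
			cfg.ReplicaAddrs = append(cfg.ReplicaAddrs, addr)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := loadRedisConfig("LOBBY", os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("master=%s replicas=%d\n", cfg.MasterAddr, len(cfg.ReplicaAddrs))
}
```

Passing the lookup function as a parameter keeps the loader testable without touching the process environment.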

For PG-backed services (USERSERVICE, MAIL, NOTIFICATION, LOBBY) the Postgres connection accepts:

<S>_POSTGRES_PRIMARY_DSN (required; postgres://<role>:<pwd>@<host>:5432/galaxy?search_path=<schema>&sslmode=disable)
<S>_POSTGRES_REPLICA_DSNS (optional, comma-separated)
<S>_POSTGRES_OPERATION_TIMEOUT, <S>_POSTGRES_MAX_OPEN_CONNS, <S>_POSTGRES_MAX_IDLE_CONNS, <S>_POSTGRES_CONN_MAX_LIFETIME
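For illustration, a primary DSN of the shape shown above can be assembled with net/url so that special characters in the password are percent-encoded. The primaryDSN function and the role, host, and password values are hypothetical; only the DSN layout comes from this section.

```go
package main

import (
	"fmt"
	"net/url"
)

// primaryDSN builds
// postgres://<role>:<pwd>@<host>:5432/galaxy?search_path=<schema>&sslmode=disable
func primaryDSN(role, password, host, schema string) string {
	query := url.Values{}
	query.Set("search_path", schema)
	query.Set("sslmode", "disable")
	dsn := url.URL{
		Scheme:   "postgres",
		User:     url.UserPassword(role, password),
		Host:     host + ":5432",
		Path:     "/galaxy",
		RawQuery: query.Encode(), // keys are emitted in sorted order
	}
	return dsn.String()
}

func main() {
	fmt.Println(primaryDSN("lobby_svc", "s3cret", "pg-primary", "lobby"))
}
```

Setting search_path in the DSN is what binds a connection to the service's own schema inside the shared galaxy database.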

Stream- and key-shape env vars (*_REDIS_DOMAIN_EVENTS_STREAM, *_REDIS_LIFECYCLE_EVENTS_STREAM, *_REDIS_KEYSPACE_PREFIX, MAIL_REDIS_COMMAND_STREAM, NOTIFICATION_INTENTS_STREAM, etc.) keep their current names and semantics — they describe stream/key shapes, not connection topology.

Main End-to-End Flows

1. Public authentication flow

sequenceDiagram
    participant Client
    participant Gateway
    participant Auth
    participant User
    participant Mail
    participant Redis

    Client->>Gateway: POST send-email-code
    Gateway->>Auth: send-email-code
    Auth->>User: resolve existing/creatable/blocked
    User-->>Auth: decision
    Auth->>Mail: send or suppress code
    Auth-->>Gateway: challenge_id
    Gateway-->>Client: challenge_id

    Client->>Gateway: POST confirm-email-code(time_zone)
    Gateway->>Auth: confirm-email-code(time_zone)
    Auth->>Auth: validate challenge/code/public key/time_zone
    Auth->>User: resolve/create/block with create-only registration context when needed
    User-->>Auth: user_id or deny
    Auth->>Auth: create device_session
    Auth->>Redis: write gateway session projection
    Auth->>Redis: publish session lifecycle update
    Auth-->>Gateway: device_session_id
    Gateway-->>Client: device_session_id

This preserves the existing gateway/auth contract and the rule that auth is not on the steady-state hot path.

2. Authenticated game/platform request flow

sequenceDiagram
    participant Client
    participant Gateway
    participant Lobby
    participant GM as Game Master

    Client->>Gateway: ExecuteCommand(message_type, payload, signature)
    Gateway->>Gateway: verify session, signature, freshness, replay
    alt platform-level command
        Gateway->>Lobby: verified authenticated command
        Lobby-->>Gateway: response
    else running-game command
        Gateway->>GM: verified authenticated command with game_id
        GM-->>Gateway: response
    end
    Gateway-->>Client: signed response

3. Game creation and pre-start lifecycle

sequenceDiagram
    participant Client
    participant Gateway
    participant Lobby
    participant User

    Client->>Gateway: create/apply/invite/approve/start-preparation commands
    Gateway->>Lobby: verified platform command
    Lobby->>User: entitlement/limit checks when needed
    User-->>Lobby: allow/deny and user metadata
    Lobby->>Lobby: update game metadata, roster, schedule, target engine version
    Lobby-->>Gateway: response
    Gateway-->>Client: signed response

4. Game start flow

sequenceDiagram
    participant Owner as Admin or Private Owner
    participant Gateway
    participant Lobby
    participant Runtime
    participant GM as Game Master
    participant Engine as Game Engine Container
    participant Redis

    Owner->>Gateway: start game
    Gateway->>Lobby: verified start command
    Lobby->>Lobby: validate ready_to_start and roster
    Lobby->>Runtime: async start job
    Runtime-->>Redis: runtime job result event

    alt start failed
        Lobby->>Lobby: keep failure / starting error state
        Lobby-->>Gateway: failure or accepted-then-observed failure path
    else container started
        Lobby->>Lobby: persist game metadata and runtime binding
        Lobby->>GM: sync running-game registration
        GM->>Engine: initial engine setup API
        GM->>GM: initialize runtime state
        GM-->>Lobby: registration result
        Lobby->>Lobby: mark game running or paused
    end

Critical rule: if the container starts but Lobby cannot persist metadata, the launch is treated as a full failure and the container must be removed. If metadata is persisted but Game Master is unavailable, the game is placed into the paused status and administrators are notified.
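The outcome rules of the start flow can be sketched as a pure decision function. The StartFacts and StartDecision types and the start_failed status name are hypothetical; running and paused are the statuses named in the flow above.

```go
package main

import "fmt"

// StartFacts records what Lobby observed during a start attempt.
type StartFacts struct {
	ContainerStarted     bool
	MetadataPersisted    bool
	GameMasterRegistered bool
}

// StartDecision is the resulting platform status plus required side effects.
type StartDecision struct {
	Status          string
	RemoveContainer bool
	NotifyAdmins    bool
}

// decideStart applies the rules in order: no container, then no
// persisted metadata, then no Game Master registration.
func decideStart(f StartFacts) StartDecision {
	switch {
	case !f.ContainerStarted:
		return StartDecision{Status: "start_failed"}
	case !f.MetadataPersisted:
		// A started container without persisted metadata is a full failure.
		return StartDecision{Status: "start_failed", RemoveContainer: true}
	case !f.GameMasterRegistered:
		// Metadata exists but Game Master is unavailable.
		return StartDecision{Status: "paused", NotifyAdmins: true}
	default:
		return StartDecision{Status: "running"}
	}
}

func main() {
	fmt.Printf("%+v\n", decideStart(StartFacts{ContainerStarted: true, MetadataPersisted: true}))
}
```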

5. Running-game command flow

sequenceDiagram
    participant Client
    participant Gateway
    participant GM as Game Master
    participant Lobby
    participant Engine

    Client->>Gateway: game-related ExecuteCommand(game_id,...)
    Gateway->>GM: verified authenticated command
    GM->>GM: check runtime status
    GM->>Lobby: resolve/cached-check membership if needed
    Lobby-->>GM: membership / permissions
    GM->>Engine: game or runtime-admin API call
    Engine-->>GM: result
    GM-->>Gateway: response payload
    Gateway-->>Client: signed response

6. Scheduled turn generation flow

sequenceDiagram
    participant Scheduler as Game Master Scheduler
    participant GM as Game Master
    participant Engine
    participant Lobby
    participant Notify as Notification Service
    participant Gateway

    Scheduler->>GM: due turn slot reached
    GM->>GM: switch runtime_status to generation_in_progress
    GM->>Engine: generate next turn
    alt generation success
        Engine-->>GM: new turn result / maybe finished
        GM->>GM: update current_turn and runtime state
        GM->>Lobby: sync runtime snapshot
        GM->>Notify: publish new-turn intent
        Notify->>Gateway: client-facing push events
    else generation failed
        Engine-->>GM: error / timeout
        GM->>GM: mark generation_failed
        GM->>Lobby: sync runtime snapshot
        GM->>Notify: notify administrators only
    end

Players receive only a lightweight push notification that a new turn exists. They then request their own per-player game state separately.

If force next turn is used, the next scheduled slot is skipped so that the effective time between turns never becomes shorter than the schedule spacing.
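A small sketch of the skip rule, assuming a fixed-interval schedule defined by an anchor time and a spacing. That schedule model, the sample values, and the function names are assumptions for illustration; the real scheduler may represent slots differently.

```go
package main

import (
	"fmt"
	"time"
)

// Sample schedule (assumption): one slot every 24 hours from the anchor.
var anchor = time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

const spacing = 24 * time.Hour

// nextSlotAfter returns the first scheduled slot strictly after t.
// step must be positive.
func nextSlotAfter(start time.Time, step time.Duration, t time.Time) time.Time {
	if t.Before(start) {
		return start
	}
	n := t.Sub(start)/step + 1 // number of whole steps to the next slot
	return start.Add(n * step)
}

// nextGenerationAfterForce skips the slot that would have followed the
// forced turn, so the gap after a forced turn is at least one full step.
func nextGenerationAfterForce(start time.Time, step time.Duration, forcedAt time.Time) time.Time {
	return nextSlotAfter(start, step, forcedAt).Add(step)
}

func main() {
	forcedAt := anchor.Add(36 * time.Hour)
	next := nextGenerationAfterForce(anchor, spacing, forcedAt)
	fmt.Println("forced at:", forcedAt)
	fmt.Println("next generation:", next)
	fmt.Println("gap:", next.Sub(forcedAt))
}
```

Without the skip, a turn forced shortly before a slot would be followed by another generation almost immediately, leaving players little time to submit commands.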

7. Game finish flow

sequenceDiagram
    participant Engine
    participant GM as Game Master
    participant Lobby
    participant Notify as Notification Service
    participant Gateway

    Engine->>GM: game finished
    GM->>GM: update runtime state
    GM->>Lobby: mark platform game finished
    Lobby->>Lobby: finalize game record
    GM->>Notify: publish game-finished intent
    Notify->>Gateway: push user-facing/platform events

8. Geo profile auxiliary flow

sequenceDiagram
    participant Gateway
    participant Geo
    participant User
    participant Auth

    Gateway-->>Geo: async observation(user_id, device_session_id, ip_addr)
    Geo->>Geo: derive observed_country and aggregates
    alt suspicious multi-country pattern
        Geo->>Auth: sync block suspicious session(s)
    end
    alt declared_country admin change approved later
        Geo->>User: sync current declared_country update
    end

This flow is intentionally fail-open relative to gameplay.
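One way to picture fail-open ingest is a non-blocking enqueue: if the observation cannot be accepted immediately, it is dropped and the gameplay request proceeds. In this sketch a buffered in-process channel stands in for the real transport, which is an assumption made for illustration; the Observation type and offer function are hypothetical.

```go
package main

import "fmt"

// Observation carries the fields sent from gateway to the geo service.
type Observation struct {
	UserID          string
	DeviceSessionID string
	IPAddr          string
}

// offer tries to enqueue obs without blocking and reports whether it
// was accepted. A full queue drops the observation instead of delaying
// the caller.
func offer(queue chan<- Observation, obs Observation) bool {
	select {
	case queue <- obs:
		return true
	default:
		return false
	}
}

func main() {
	queue := make(chan Observation, 1)
	obs := Observation{UserID: "u-1", DeviceSessionID: "ds-1", IPAddr: "203.0.113.7"}
	fmt.Println(offer(queue, obs)) // accepted while capacity remains
	fmt.Println(offer(queue, obs)) // dropped once the queue is full
}
```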

Separation of Platform Metadata and Engine State

This distinction is fundamental.

Platform-level state

Owned by Game Lobby:

who owns the game;
who is invited;
who applied;
who was approved;
who is currently a platform participant;
what the schedule is;
whether the game is public/private;
whether the game is draft, running, paused, finished, etc. as a platform entity.

Runtime/operational state

Owned by Game Master:

current turn;
runtime status;
generation state;
engine reachability;
patch state;
mapping to engine player UUIDs;
engine version registry;
operational metadata of the running game.

Full game state

Owned only by the game engine container:

actual per-player game state;
internal mechanics and progression;
player-visible game state snapshots;
win/lose logic;
domain truth of the game world.

The platform must not attempt to duplicate the full game state outside the engine.

Versioning of Game Engines

Every game runs on one specific game engine version.

Rules:

active games stay on the version with which they were started;
upgrade during a running game is allowed only as a patch update within the same major/minor line;
game-engine version management is manual in v1;
each engine version may carry version-specific engine options;
Game Master owns the engine version registry and its internal API.
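The patch-only rule can be sketched as a version comparison. This sketch assumes plain MAJOR.MINOR.PATCH strings with an optional v prefix and no pre-release or build suffixes, and it treats only a strictly higher patch number as an upgrade; both choices are assumptions, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type version struct{ major, minor, patch int }

// parseVersion accepts "1.4.2" or "v1.4.2".
func parseVersion(s string) (version, error) {
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return version{}, fmt.Errorf("invalid version %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return version{}, fmt.Errorf("invalid version %q", s)
		}
		nums[i] = n
	}
	return version{nums[0], nums[1], nums[2]}, nil
}

// patchUpgradeAllowed reports whether target stays on the same
// major/minor line as current and has a higher patch number.
func patchUpgradeAllowed(current, target string) (bool, error) {
	cur, err := parseVersion(current)
	if err != nil {
		return false, err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return false, err
	}
	return cur.major == tgt.major && cur.minor == tgt.minor && tgt.patch > cur.patch, nil
}

func main() {
	ok, err := patchUpgradeAllowed("1.4.2", "1.4.5")
	fmt.Println(ok, err)
}
```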

Administrative Access Model

Two distinct external admin modes exist.

System administrator

Uses a separate admin-facing REST surface via gateway and Admin Service.

System administrator can:

manage public games;
see and operate on all private games;
inspect platform operational state;
launch, stop, patch, pause, and monitor games;
approve/reject participation in public games;
perform user/game administrative actions.

Private-game owner

Uses the normal authenticated client protocol, not the separate system admin UI.

Allowed owner-admin actions are limited to the owner’s own private games and include at least:

initiate enrollment;
create and manage user-bound invites inside the system;
approve/reject applicants;
start game after enrollment;
force next turn while running;
stop game;
temporarily or permanently remove/block players from that game according to allowed policy.

These operations use dedicated admin-related message_type values in the normal authenticated game/client protocol.

Non-Goals

The architecture intentionally does not try to solve all future concerns now.

Current non-goals:

a separate policy engine;
automatic billing integration in v1;
automatic match balancing in v1;
direct external access to internal services;
pushing full per-player game state over notification channels;
allowing game engine containers to be called directly by clients or by services other than Game Master;
using Auth / Session Service as a hot synchronous dependency for all authenticated traffic;
making Notification Service the source of truth for notification preferences in v1.

Recommended Order of Service Implementation

Recommended order for implementation is:

Edge Gateway Service (implemented)
First public ingress, transport boundary, authentication boundary, signed request/response model, push delivery, session cache, replay protection.
Auth / Session Service (implemented)
Public auth flow, device_session, revoke/block lifecycle, gateway session projection.
User Service (implemented)
Regular-user identity, profile/settings, tariffs/entitlements, user limits, sanctions, and current declared_country.
Mail Service (implemented)
Internal email delivery for auth codes and platform notification mail.
Notification Service (implemented)
Unified async delivery of push and non-auth email notifications, with real Gateway and Mail Service boundary coverage.
Game Lobby Service
Platform game records, membership, invites, applications, approvals, schedules, user-facing lists, pre-start lifecycle.
Runtime Manager
Dedicated Docker-control service for container start/stop/patch/status and technical runtime monitoring.
Game Master
Running-game orchestration, engine version registry, runtime state, turn scheduler, engine API mediation, operational controls.
Admin Service
Admin UI backend that orchestrates trusted APIs of other services.
Geo Profile Service (planned)
Auxiliary geo aggregation, review recommendation, suspicious-session blocking, declared-country workflow.
Billing Service
Future payment and subscription source feeding entitlements into User Service.

This order gives the platform a usable public perimeter first, then identity/auth, then core gameplay lifecycle, then runtime orchestration, and only afterward secondary auxiliary services.
