Galaxy Architecture

Galaxy is a turn-based strategy platform. This document is the source of truth for the platform architecture and supersedes ARCHITECTURE_deprecated.md. The previous design factored the platform into nine independently deployed services. This design consolidates all business logic into a single backend service alongside the existing gateway and game components.

1. Overview

The platform is composed of three executable units:

  • gateway — single public ingress. Owns transport security, request authentication via Ed25519-signed envelopes, anti-replay, response signing, and routing of authenticated traffic to backend. Stays as a separate process and is the only component reachable from the public internet.
  • backend — single internal service that owns every domain concern of the platform: identity, sessions, lobby, game runtime, mail, push and email notification delivery, geo signals, and administration. Talks to Postgres, the Docker daemon, an SMTP relay, and the GeoLite2 country database. The only consumer of backend over the network is gateway.
  • game — turn-engine container. One container per active game, managed exclusively by backend. The contract is the OpenAPI document shipped with the engine module; behaviour is unchanged by this architecture.

```mermaid
flowchart LR
  Client((Client)) -- TLS + Ed25519 envelopes --> Gateway
  Gateway -- REST/JSON, X-User-ID --> Backend
  Backend -- gRPC stream (push) --> Gateway
  Backend -- REST/JSON --> Engine[(Game Engine\ncontainer)]
  Backend -- pgx --> Postgres[(Postgres)]
  Backend -- Docker API --> Docker[(Docker daemon)]
  Backend -- SMTP --> Mail[(SMTP relay)]
  Backend -- GeoLite2 lookup --> GeoIP[(GeoLite2 DB)]
  Gateway -- anti-replay reservations --> Redis[(Redis)]
```

The MVP runs gateway and backend as single-instance processes inside a trusted network. Horizontal scaling, distributed coordination, and mTLS-secured east-west traffic are explicit future work and are called out in Deployment topology.

2. Component Boundaries

backend

  • Owns every persistent record of platform state in a Postgres schema named backend. No other process writes that schema.
  • Owns every Docker call to galaxy-game-{game_id} containers.
  • Owns the SMTP relationship and the durable email outbox.
  • Owns the in-memory caches that serve hot reads.
  • Exposes one HTTP listener and one gRPC listener. No public ingress.

gateway

  • Public ingress. Performs TLS termination, request signature verification, freshness window enforcement, anti-replay reservations, and rate limiting before any request is forwarded to backend.
  • Forwards authenticated requests to backend over HTTP/REST with the resolved user_id carried as the X-User-ID header. Forwards unauthenticated public traffic verbatim.
  • Subscribes to backend over a long-lived gRPC server stream to receive client push events and session-invalidation notices, signs them, and delivers them to active client subscriptions.
  • Stops everything that can be stopped at the edge. Any check that does not require backend state — bad signature, stale timestamp, replayed request_id, malformed envelope, blocked-session shortcut — is enforced by gateway so that backend is not loaded with invalid traffic.

game

  • A single game-engine instance per running game, packaged as a Docker container. Stateful only on its host bind-mounted state directory.
  • Reachable inside the trusted network at http://galaxy-game-{game_id}:8080.
  • Receives all administrative and player-action calls from backend only.

3. Backend API Surfaces

backend exposes one HTTP listener with four route groups distinguished by middleware. The full contract lives in backend/openapi.yaml.

| Prefix | Authentication | Audience |
| --- | --- | --- |
| /api/v1/public/* | none | unauthenticated registration |
| /api/v1/user/* | X-User-ID injected by gateway | authenticated end users |
| /api/v1/internal/* | none (network-trusted) | gateway-only server-to-server endpoints |
| /api/v1/admin/* | HTTP Basic Auth against admin_accounts | platform administrators |
| /healthz, /readyz | none | infrastructure probes |

backend derives user identity exclusively from the X-User-ID header on the user surface. Request bodies are never trusted to convey identity.

The admin surface is on the same listener as the user surface; isolation between admin and the public is provided by Basic Auth and by the trust boundary described in §15. The internal surface is part of that same trust boundary: it is network-locked rather than auth-locked, and only gateway is expected to call it. The internal surface is read-only with respect to device sessions — it carries the per-request lookup gateway needs to verify a signed envelope, and nothing else. Revocations are user-driven (through the user surface) or admin-driven (through in-process calls inside backend); see FUNCTIONAL.md §1.5.

JSON bodies use snake_case field names everywhere on the wire. Backend, gateway, and the shared pkg/model schemas are aligned on this convention; any future migration to camelCase must happen at the pkg/model boundary and propagate uniformly. Every error response follows the envelope {"error": {"code": "<machine-readable>", "message": "<human-readable>"}}. The closed set of code values is enumerated in components/schemas/ErrorBody of backend/openapi.yaml. 409 Conflict is the standard status when a request collides with existing state (duplicate admin username, duplicate (template_id, idempotency_key), resend on a sent mail delivery, lobby state-machine collisions).
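
A minimal sketch of that error envelope as it would look in a gin handler. The type and helper names (ErrorBody, ErrorEnvelope, abortWithError) and the example code value are illustrative; the authoritative field list and the closed set of codes live in backend/openapi.yaml.

```go
package server

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// ErrorBody mirrors components/schemas/ErrorBody: a machine-readable code from
// the closed set plus a human-readable message, snake_case on the wire.
type ErrorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

type ErrorEnvelope struct {
	Error ErrorBody `json:"error"`
}

// abortWithError writes the standard envelope and stops the handler chain.
func abortWithError(c *gin.Context, status int, code, message string) {
	c.AbortWithStatusJSON(status, ErrorEnvelope{Error: ErrorBody{Code: code, Message: message}})
}

// Example: rejecting a duplicate admin username with the standard 409 shape.
// The code value here is illustrative only.
func rejectDuplicateAdmin(c *gin.Context) {
	abortWithError(c, http.StatusConflict, "duplicate_admin_username", "an admin account with this username already exists")
}
```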

4. Backend Domain Modules

Each module is a Go package under backend/internal/. Modules are wired by direct struct references; interfaces are introduced only where a test seam or an external system boundary justifies them.

A few cross-module invariants survive consolidation and are surfaced here because they cross domain boundaries:

  • accounts.user_name is the immutable login handle assigned at first sign-in. Backend synthesises it as Player-XXXXXXXX (eight crypto/rand-backed alphanumerics, retried on UNIQUE collisions), so a fresh email always lands a unique account without a client-supplied name. The column is never overwritten on subsequent sign-ins.
  • accounts.permanent_block is the canonical permanent-block flag. When set, both auth.SendEmailCode and auth.ConfirmEmailCode reject with 400 invalid_request. The send-time check stops fresh challenges for already-blocked addresses; the confirm-time check (re-run after the verification code matches) catches admin blocks applied in the window between send and confirm. Every other branch on send — including a blocked_emails row, a throttled email, a fresh email — returns the opaque {challenge_id} shape so the endpoint cannot be used to enumerate accounts.
  • Public lobby games are admin-created through POST /api/v1/admin/games. The user-facing POST /api/v1/user/lobby/games always emits private games owned by X-User-ID. Public games carry owner_user_id IS NULL; the partial index on (owner_user_id) WHERE visibility = 'private' keeps the private-owner lookup efficient.
  • Authenticated lobby commands flow through the gateway envelope by message_type. The catalog is lobby.my.games.list, lobby.public.games.list, lobby.my.applications.list, lobby.my.invites.list, lobby.game.create, lobby.game.open-enrollment, lobby.application.submit, lobby.invite.redeem, and lobby.invite.decline. Each lands on a REST handler under /api/v1/user/lobby/*; the gateway forces visibility to private on lobby.game.create before forwarding, matching the user-surface invariant above.
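
As one concrete illustration of the last invariant, a sketch of how gateway could force visibility before forwarding lobby.game.create. The function name, the payload handling, and the backend address are assumptions; only the visibility override and the X-User-ID header come from the text above.

```go
package gateway

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// backendBaseURL is an assumed internal address; only gateway can reach it.
var backendBaseURL = "http://backend:8080"

// forwardLobbyGameCreate forwards a verified lobby.game.create payload to the
// user surface, overriding visibility so user-created games are always private.
func forwardLobbyGameCreate(ctx context.Context, userID string, payload []byte) (*http.Response, error) {
	var body map[string]any
	if err := json.Unmarshal(payload, &body); err != nil {
		return nil, fmt.Errorf("decode lobby.game.create payload: %w", err)
	}
	body["visibility"] = "private" // user-surface invariant, regardless of client input

	buf, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		backendBaseURL+"/api/v1/user/lobby/games", bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-User-ID", userID) // resolved from the device session, never from the body
	return http.DefaultClient.Do(req)
}
```
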
| Package | Responsibility |
| --- | --- |
| backend/internal/config | Environment-variable loader and validator. |
| backend/internal/server | gin engine, listeners, route groups, shared middleware (request id, panic recovery, metrics, tracing). |
| backend/internal/auth | Email-code challenges, device sessions, Ed25519 client public keys, send/confirm, user-driven revoke (single + revoke-all), admin-driven revoke (sanctions, soft-delete, in-process), durable revocation audit in session_revocations, internal session lookup endpoint for gateway. |
| backend/internal/user | User accounts, settings (preferred_language, time_zone, declared_country), entitlements, sanctions, limits, soft delete with in-process cascade. |
| backend/internal/lobby | Games, applications, invites, memberships, enrollment state machine, turn schedule, Race Name Directory. |
| backend/internal/runtime | Engine version registry, container lifecycle, turn scheduler, (user_id ↔ race_name ↔ engine_player_uuid) mapping per game, runtime snapshot publication into lobby. |
| backend/internal/mail | Postgres outbox, SMTP delivery worker, retry/backoff, dead letters, admin resend. |
| backend/internal/notification | Notification intent normalization, idempotency, per-route fan-out into push (gRPC) and email (outbox). |
| backend/internal/geo | Per-session country observation, (user_id, country) counter, declared_country initialisation at registration. |
| backend/internal/admin | admin_accounts table, env-driven bootstrap, Basic Auth verifier, admin-side operations across other modules. |
| backend/internal/push | gRPC server hosting the SubscribePush stream consumed by gateway. |
| backend/internal/engineclient | Thin REST client to running game engines. Reuses DTOs from pkg/model/{order,report,rest}. |
| backend/internal/dockerclient | Wrapper around github.com/docker/docker for container start, stop, restart, patch, inspect, reconcile. |
| backend/internal/postgres | pgx pool, embedded migrations, jet-generated query packages. |
| backend/internal/telemetry | OpenTelemetry runtime, zap logger factory, trace-field helpers. |

5. Persistence

  • A single Postgres database, schema backend. backend is the only writer. Every backend table lives in this schema.

  • Migrations are kept in backend/internal/postgres/migrations/, embedded into the binary, and applied via pressly/goose/v3 during startup before any listener opens. The DSN must include ?search_path=backend so unqualified reads and writes resolve to the service-owned schema.

  • Queries are written through go-jet/jet/v2. Generated code lives in backend/internal/postgres/jet/ and is regenerated by make jet.

  • Every domain identifier is a uuid primary key (device_session_id, user_id, game_id, application_id, invite_id, membership_id, delivery_id, notification_id, …). Identifiers that are not Postgres-side identities (email, user_name, canonical, template_id, idempotency_key, race_name) remain text.

  • Foreign keys are intra-domain only: accounts → entitlement_* / sanction_* / limit_*; games → applications / invites / memberships (with ON DELETE CASCADE); mail_payloads → mail_deliveries → mail_recipients / mail_attempts / mail_dead_letters; notifications → notification_routes / notification_dead_letters. Cross-domain references (memberships.user_id, games.owner_user_id, etc.) are kept as opaque uuid columns because each domain runs its own cleanup through the in-process cascade described in §7. Adding a database cascade would either duplicate that work or hide it behind opaque triggers.

  • created_at, updated_at, deleted_at are always timestamptz. UTC normalisation is applied on read and write.

  • Idempotency is enforced through UNIQUE indexes on durable tables (for example (template_id, idempotency_key) on mail_deliveries, race_name_canonical on registered race names, (game_id, user_id) on memberships). There is no separate idempotency table.

  • Worker pickup uses SELECT ... FOR UPDATE SKIP LOCKED ordered by next_attempt_at. This pattern serves the mail outbox, retry-able runtime jobs, and any future deferred work.

  • session_revocations is the append-only audit trail of every device session revocation, keyed by revocation_id (uuid) with device_session_id, user_id, actor_kind, the actor pair actor_user_id uuid + actor_username text (exactly one is non-NULL per row, enforced by a CHECK constraint), reason, and revoked_at. The row is inserted in the same transaction that flips device_sessions.status to 'revoked', so a successful revoke always leaves a matching audit row.

    The two-column actor pair is the canonical shape used by every audit-bearing table — accounts.deleted_actor_*, entitlement_records, entitlement_snapshots, sanction_records.actor_* + removed_by_*, and limit_records.actor_* + removed_by_* follow the same convention. actor_kind (or actor_type on the user-domain tables) values are user, admin, system. The Go layer hides the split behind user.ActorRef{Type, ID string}: Type=="user" requires ID to be a UUID, Type=="admin" stores ID as the operator username (passed to actor_username), and Type=="system" requires an empty ID. See backend/internal/user/store.go (actorToColumnArgs/actorFromColumns) for the SQL boundary.
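
A sketch of that mapping, assuming github.com/google/uuid for parsing. The method name below is illustrative; the real SQL boundary is actorToColumnArgs/actorFromColumns in backend/internal/user/store.go.

```go
package user

import (
	"fmt"

	"github.com/google/uuid"
)

// ActorRef is the in-process view of "who did this": a user, an admin operator,
// or the system itself.
type ActorRef struct {
	Type string // "user", "admin", or "system"
	ID   string // user UUID, admin operator username, or empty for system
}

// columnArgs maps an ActorRef onto the (actor_user_id, actor_username) pair
// written into the audit-bearing tables.
func (a ActorRef) columnArgs() (actorUserID *uuid.UUID, actorUsername *string, err error) {
	switch a.Type {
	case "user":
		id, parseErr := uuid.Parse(a.ID)
		if parseErr != nil {
			return nil, nil, fmt.Errorf("user actor id must be a uuid: %w", parseErr)
		}
		return &id, nil, nil
	case "admin":
		if a.ID == "" {
			return nil, nil, fmt.Errorf("admin actor requires an operator username")
		}
		return nil, &a.ID, nil // operator username lands in actor_username
	case "system":
		if a.ID != "" {
			return nil, nil, fmt.Errorf("system actor must not carry an id")
		}
		return nil, nil, nil
	default:
		return nil, nil, fmt.Errorf("unknown actor type %q", a.Type)
	}
}
```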

6. In-Memory Cache

Postgres is the cold store. In-memory caches in backend serve hot reads and are warmed at process start.

| Cache | Population | Update path |
| --- | --- | --- |
| Active device sessions | Full table read at startup. | Write-through on create/revoke. |
| User entitlement snapshots | Latest snapshot per active user at startup. | Write-through on entitlement change. |
| Engine version registry | Full table read at startup. | Write-through on admin update. |
| Active runtime records | Full table read at startup. | Write-through on container ops. |
| Active games + memberships | Full table read at startup. | Write-through inside lobby commands. |
| Race Name Directory canonicals | Full table read at startup. | Write-through inside lobby commands. |
| Admin accounts | Full table read at startup. | Write-through on admin CRUD. |

Every cache is bounded to MVP-scale data sets that comfortably fit in process memory (10K accounts, 1000 active games, 100K device sessions, a few thousand directory entries — all together well under 100 MB). If a specific cache is observed to grow beyond a process budget at scale, moving that cache to Redis must be discussed and approved before implementation; the architecture leaves backend Redis-free by default.

Cache writes happen after the matching Postgres mutation commits. A commit failure leaves the cache in sync with the prior database state. Each cache exposes a Ready flag flipped to true after the warm-up read finishes; the /readyz probe waits on every cache being ready before reporting ready, so the listener never serves a request that would spuriously miss because of a cold cache.

gateway carries a separate, smaller cache: the in-memory session cache fronting every authenticated request. It is a bounded LRU (default 50 000 entries) with a safety-net TTL (default 10 minutes). Misses trigger a single synchronous REST call to backend's /api/v1/internal/sessions/{id} lookup; hits answer the hot path directly. The cache is kept consistent through the session_invalidation push events backend emits over Push.SubscribePush: each event flips the cached entry to revoked so subsequent authenticated requests bound to that session are rejected at the edge without another backend round-trip. The TTL covers the case of a missed event (cursor aged out, gateway restart) by forcing a refresh at most once per window.
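
A reduced sketch of those cache semantics (resolve on miss, flip to revoked on a session_invalidation event). The type names are illustrative and LRU eviction is elided; only the TTL safety net and the invalidation path follow the description above.

```go
package gateway

import (
	"sync"
	"time"
)

type sessionEntry struct {
	UserID    string
	Revoked   bool
	fetchedAt time.Time
}

type sessionCache struct {
	mu      sync.Mutex
	ttl     time.Duration // safety-net TTL, default 10 minutes
	entries map[string]sessionEntry
	lookup  func(deviceSessionID string) (sessionEntry, error) // REST call to /api/v1/internal/sessions/{id}
}

// Resolve answers from the cache while the entry is fresh, otherwise performs
// one synchronous backend lookup and stores the result.
func (c *sessionCache) Resolve(deviceSessionID string) (sessionEntry, error) {
	c.mu.Lock()
	if e, ok := c.entries[deviceSessionID]; ok && time.Since(e.fetchedAt) < c.ttl {
		c.mu.Unlock()
		return e, nil
	}
	c.mu.Unlock()

	e, err := c.lookup(deviceSessionID) // cache miss: single REST round-trip to backend
	if err != nil {
		return sessionEntry{}, err
	}
	e.fetchedAt = time.Now()

	c.mu.Lock()
	c.entries[deviceSessionID] = e
	c.mu.Unlock()
	return e, nil
}

// Invalidate handles a session_invalidation push event: the entry is flipped to
// revoked so later requests bound to the session are rejected at the edge.
func (c *sessionCache) Invalidate(deviceSessionID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[deviceSessionID]; ok {
		e.Revoked = true
		e.fetchedAt = time.Now()
		c.entries[deviceSessionID] = e
	}
}
```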

7. In-Process Async Patterns

Async work is implemented with goroutines and channels. There is no Redis pub/sub, no Redis Stream, and no message broker between domain modules.

The following table records how previously inter-service streams are realised in process. The semantics — when each event fires, how many times, in which order — are preserved; the transport changes from a durable stream to an in-process function call or buffered channel.

| Previous external stream | In-process realisation |
| --- | --- |
| User lifecycle (block / soft delete) → Lobby cascade | lobby.OnUserBlocked(user_id) and lobby.OnUserDeleted(user_id) invoked synchronously after user commits. |
| Runtime snapshot updates → Lobby denormalisation | lobby.OnRuntimeSnapshot(snapshot) invoked from runtime after each engine status read. |
| Game finished → Lobby promotion / cleanup | lobby.OnGameFinished(game_id). |
| Lobby start/stop jobs → Runtime container lifecycle | runtime.StartGame(game_id) / runtime.StopGame(game_id). Long-running pull/start drained on a per-game worker goroutine, serialised by per-game mutex. |
| Runtime job results → Lobby | Direct return value from runtime.StartGame, plus optional lobby.OnRuntimeJobResult callback for asynchronous progression. |
| Runtime health events | runtime publishes onto an in-process channel; lobby and admin observers consume. |
| Notification intents | Direct call notification.Submit(intent) by producers (lobby, runtime, geo). |
| Mail delivery commands | Direct insert into mail_deliveries by producers; mail worker drains the table. |
| Auth → Mail (login codes) | Direct call mail.EnqueueLoginCode(...) from auth.confirmEmailCode. |
| Gateway client-events stream | Backend push server emits client_event on the gRPC stream consumed by gateway. |
| Gateway session-events stream | Backend push server emits session_invalidation on the same gRPC stream. |

Workers drain outstanding work on graceful shutdown in a deterministic order: stop accepting new HTTP/gRPC traffic → finish in-flight requests → flush mail outbox writes that already started → flush push events to gateway buffer → close the Docker client → close the database pool.
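
A sketch of that ordering. The component handles and their Drain/Flush/Close methods are assumptions standing in for the real wiring; only the sequence itself comes from the paragraph above.

```go
package server

import (
	"context"
	"net/http"

	"google.golang.org/grpc"
)

type drainer interface{ Drain(context.Context) }
type flusher interface{ Flush(context.Context) }
type closer interface{ Close() error }

// App groups the handles the shutdown sequence needs.
type App struct {
	HTTP       *http.Server
	GRPC       *grpc.Server
	MailWorker drainer
	Push       flusher
	Docker     closer
	Pool       closer
}

// Shutdown drains outstanding work in the deterministic order listed above.
func (a *App) Shutdown(ctx context.Context) {
	_ = a.HTTP.Shutdown(ctx) // stop accepting HTTP traffic, wait for in-flight requests
	a.GRPC.GracefulStop()    // same for the gRPC push listener
	a.MailWorker.Drain(ctx)  // finish mail outbox writes that already started
	a.Push.Flush(ctx)        // flush push events into the gateway buffer
	_ = a.Docker.Close()     // close the Docker client
	_ = a.Pool.Close()       // close the database pool last
}
```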

The lobby state machine is the only domain whose transitions cross several producers and consumers. The closed transitions are draft → enrollment_open → ready_to_start → starting → running ↔ paused → finished, with cancelled reachable from every pre-finished state and start_failed → ready_to_start for retry. Owner-driven endpoints (or admin overrides for public games) trigger transitions; the runtime callback OnRuntimeJobResult is the only path that flips starting → running or starting → start_failed. lobby.OnGameFinished is invoked when the engine reports the game finished, after which the runtime container is torn down and Race Name Directory promotions run.
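
The same transition set written as a lookup table, as a sketch only: which states may reach finished is a reading of the chain above (both running and paused are included here), and the authoritative set lives in the lobby module.

```go
package lobby

import "fmt"

var allowedTransitions = map[string][]string{
	"draft":           {"enrollment_open", "cancelled"},
	"enrollment_open": {"ready_to_start", "cancelled"},
	"ready_to_start":  {"starting", "cancelled"},
	"starting":        {"running", "start_failed", "cancelled"}, // only OnRuntimeJobResult flips starting
	"start_failed":    {"ready_to_start", "cancelled"},          // retry path
	"running":         {"paused", "finished", "cancelled"},
	"paused":          {"running", "finished", "cancelled"},
	"finished":        {},
	"cancelled":       {},
}

func transition(from, to string) error {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("illegal lobby transition %s -> %s", from, to)
}
```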

8. Backend ↔ Gateway Communication

There are two channels between gateway and backend.

Sync REST (gateway → backend). Every authenticated user request and every public auth request goes over plain HTTP/JSON. The gateway sends X-User-ID (when authenticated) and forwards the verified payload. The backend never re-derives user identity from the body. The session lookup hits backend's /api/v1/internal/sessions/{id} only on a cache miss in the gateway-side LRU described in §6; backend updates device_sessions.last_seen_at on every successful lookup so admin operators can observe when each session was last resolved at the edge.

gRPC stream (gateway ⇄ backend). Backend exposes a single RPC SubscribePush(GatewaySubscribeRequest) returns (stream PushEvent). The gateway opens this stream once at start and keeps it open. Each PushEvent carries a oneof:

  • client_event — opaque payload addressed to (user_id [, device_session_id]), which gateway signs and delivers to active client subscriptions.
  • session_invalidation — instructs gateway to immediately close any active streams for (device_session_id) or for all sessions of user_id, and to reject in-flight requests bound to those sessions.

Backend keeps a small in-memory ring buffer of recent events keyed by cursor with TTL equal to the gateway freshness window. On reconnect, gateway sends its last consumed cursor; backend resumes from the next event or from a fresh cursor if the requested point has expired.
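
A sketch of the resume semantics. The buffer type, its bound, and the method names are illustrative; the grounded part is the cursor-keyed buffer with a TTL equal to the freshness window and the resume-or-start-fresh decision.

```go
package push

import (
	"sync"
	"time"
)

type bufferedEvent struct {
	Cursor  uint64
	At      time.Time
	Payload []byte
}

type ringBuffer struct {
	mu     sync.Mutex
	ttl    time.Duration // equal to the gateway freshness window
	next   uint64
	events []bufferedEvent // bounded; oldest entries dropped on append
	max    int
}

func (b *ringBuffer) Append(payload []byte) uint64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.next++
	b.events = append(b.events, bufferedEvent{Cursor: b.next, At: time.Now(), Payload: payload})
	if len(b.events) > b.max {
		b.events = b.events[len(b.events)-b.max:]
	}
	return b.next
}

// Resume returns every buffered event after lastCursor. ok reports whether the
// requested point is still covered; when it is not (aged out, or never consumed
// before) the caller restarts the gateway from a fresh cursor instead.
func (b *ringBuffer) Resume(lastCursor uint64) (events []bufferedEvent, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	cutoff := time.Now().Add(-b.ttl)
	for _, e := range b.events {
		if e.At.Before(cutoff) {
			continue // expired with the freshness window
		}
		if e.Cursor == lastCursor {
			ok = true // the resume point itself is still buffered and fresh
		}
		if e.Cursor > lastCursor {
			events = append(events, e)
		}
	}
	return events, ok
}
```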

gateway keeps using Redis for anti-replay request_id reservations. No other gateway↔backend interaction uses Redis.

Edge enforcement

gateway is responsible for stopping every check it can answer locally so that backend processes only well-shaped, fresh, authentic traffic:

  • TLS termination and pinning where applicable.
  • Request envelope parsing, payload hash verification, Ed25519 signature verification, freshness window enforcement, anti-replay reservation.
  • Public-facing rate limiting and basic policy.
  • Closing of streams marked invalid via session_invalidation.

Backend assumes those checks have happened. It runs business validation, authorisation, and state transitions on top of that assumption.

9. Backend ↔ Game Engine Communication

Backend is the only platform participant that talks to galaxy-game-* containers. The contract is the engine OpenAPI document; backend uses the existing typed DTOs in pkg/model/{order,report,rest} and a hand-written net/http client in backend/internal/engineclient.

Authenticated client traffic for in-game operations crosses three serialisation boundaries: signed-gRPC FlatBuffers (client ↔ gateway), JSON over REST (gateway ↔ backend), and JSON over REST again (backend ↔ engine). Gateway owns the FB ↔ JSON transcoding for the four message types user.games.command, user.games.order, user.games.order.get, user.games.report (FB schemas in pkg/schema/fbs/{order,report}, encoders in pkg/transcoder). user.games.order.get reads back the player's stored order for a given turn — paired with the POST user.games.order so the client can hydrate its local draft after a cache loss without re-deriving from the report. Backend never touches FlatBuffers and never re-interprets the JSON beyond rebinding the actor field from the runtime player mapping (clients never carry a trusted actor).
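
A sketch of the actor-rebinding step on the backend ↔ engine hop. The engine path, the actor field name, and the lookup callback are assumptions (the real client uses the pkg/model DTOs and the engine OpenAPI contract); the grounded part is that the engine player UUID comes from backend's per-game mapping, never from the client.

```go
package engineclient

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// forwardOrder rebinds the actor from the runtime player mapping and posts the
// order to the per-game engine endpoint.
func forwardOrder(ctx context.Context, gameID, userID string, orderJSON []byte,
	lookupEnginePlayer func(gameID, userID string) (string, error)) error {

	enginePlayerUUID, err := lookupEnginePlayer(gameID, userID)
	if err != nil {
		return fmt.Errorf("resolve engine player for user %s in game %s: %w", userID, gameID, err)
	}

	var order map[string]any
	if err := json.Unmarshal(orderJSON, &order); err != nil {
		return fmt.Errorf("decode order: %w", err)
	}
	order["actor"] = enginePlayerUUID // never trust a client-supplied actor

	body, err := json.Marshal(order)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("http://galaxy-game-%s:8080/orders", gameID) // engine path is illustrative
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("engine rejected order: %s", resp.Status)
	}
	return nil
}
```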

Container state is owned by backend/internal/runtime:

  • runtime_records is the persistent map from game_id to current container state.
  • engine_versions is the registry of allowed engine images and serves as the source for image_ref arbitration. Producers do not pick image references on their own.
  • Patch is semver-patch-only inside the same major/minor line; any major/minor change requires an explicit stop and start.
  • Reconciliation runs at startup and periodically: every container with the galaxy.backend label is matched against runtime_records; unrecorded containers with the label are adopted, missing recorded containers are marked removed and an internal event is emitted.
  • Container naming is fixed: galaxy-game-{game_id}; engine endpoint is always http://galaxy-game-{game_id}:8080.
  • Engine probes (/healthz) feed runtime health observations and turn generation status.

10. Geo Profile (reduced)

The geo concern is intentionally minimal.

  • At registration (/api/v1/public/auth/confirm-email-code), backend looks up the source IP against the GeoLite2 country database via pkg/geoip and stores the resulting ISO country code in accounts.declared_country. This value is never updated afterwards; there is no version history.
  • On every authenticated user-facing request, a fire-and-forget goroutine performs the same lookup against the request IP and increments user_country_counters by (user_id, country, count bigint). The request itself does not block on this update.
  • There is no aggregation, no automatic flagging, no review recommendations, no admin notifications, and no detection of account takeover. Counter data is only available to operators via the admin surface for manual inspection.
  • Geo work is fail-open: any geoip error is logged but never blocks the user request.
  • Source IP for both flows is read from the leftmost X-Forwarded-For entry, falling back to RemoteAddr when the header is absent. Backend trusts the value because the network segment between gateway and backend is the trust boundary (§15, §16); duplicating the edge rate-limit / spoof checks here would be double work.
  • Email addresses are never written to logs verbatim. Backend modules emit a per-process HMAC-SHA256-truncated email_hash instead, so operators can correlate log lines within a single process lifetime without persisting PII.
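
A sketch of that hash, assuming a random per-process key, an 8-byte truncation, and hex output; the truncation length and encoding are illustrative choices.

```go
package telemetry

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// emailHashKey is regenerated on every start, so hashes only correlate within
// one process lifetime and nothing derived from the address is persisted.
var emailHashKey = func() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err) // a process that cannot read entropy should not start
	}
	return k
}()

// EmailHash returns the log-safe stand-in for an email address.
func EmailHash(email string) string {
	mac := hmac.New(sha256.New, emailHashKey)
	mac.Write([]byte(email))
	return hex.EncodeToString(mac.Sum(nil)[:8])
}
```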

11. Mail Outbox

Email is delivered through a Postgres-backed outbox.

  • Producers (auth login codes, notification routes) write into mail_deliveries with a unique (template_id, idempotency_key) and the rendered payload bytes in mail_payloads.
  • A worker goroutine selects work from mail_deliveries with SELECT ... FOR UPDATE SKIP LOCKED, attempts SMTP delivery via wneessen/go-mail, records the attempt in mail_attempts, and either marks the delivery sent or schedules next_attempt_at for retry with exponential backoff and jitter.
  • After the configured maximum retry budget the delivery moves to mail_dead_letters. The mail.dead_lettered notification kind is reserved in the catalog but has no producer wired up yet, so no admin notification is emitted today — operator visibility comes from a log line and the /api/v1/admin/mail/dead-letters listing.
  • On startup the worker drains everything pending. There is no separate recovery procedure: starting backend is sufficient.
  • Operators can re-enqueue from mail_dead_letters through the admin surface.

The auth path returns success as soon as the delivery row is durably committed; SMTP completion is asynchronous to the auth request.
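
A sketch of the worker pickup under those rules. The column names, the status value, and the batch shape are assumptions; the FOR UPDATE SKIP LOCKED pickup ordered by next_attempt_at is the documented pattern.

```go
package mail

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type pendingDelivery struct {
	DeliveryID string
	TemplateID string
}

// claimBatch opens a transaction and claims up to limit due deliveries. SKIP
// LOCKED keeps concurrent workers from double-delivering the same row.
func claimBatch(ctx context.Context, pool *pgxpool.Pool, limit int) ([]pendingDelivery, pgx.Tx, error) {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return nil, nil, err
	}
	rows, err := tx.Query(ctx, `
		SELECT delivery_id::text, template_id
		FROM mail_deliveries
		WHERE status = 'pending' AND next_attempt_at <= $1
		ORDER BY next_attempt_at
		LIMIT $2
		FOR UPDATE SKIP LOCKED`, time.Now().UTC(), limit)
	if err != nil {
		_ = tx.Rollback(ctx)
		return nil, nil, err
	}

	var batch []pendingDelivery
	var scanErr error
	for rows.Next() {
		var d pendingDelivery
		if scanErr = rows.Scan(&d.DeliveryID, &d.TemplateID); scanErr != nil {
			break
		}
		batch = append(batch, d)
	}
	rows.Close()
	if scanErr == nil {
		scanErr = rows.Err()
	}
	if scanErr != nil {
		_ = tx.Rollback(ctx)
		return nil, nil, scanErr
	}
	// The caller attempts SMTP delivery, records mail_attempts, and either marks
	// each row sent or pushes next_attempt_at forward before committing tx.
	return batch, tx, nil
}
```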

12. Notification Pipeline

Notifications are an in-process pipeline. The closed catalog is defined in backend/internal/notification/catalog.go and currently covers 13 kinds: 10 lobby kinds (invite received/revoked, application submitted/approved/rejected, membership removed/blocked, race name registered/pending/expired) and 3 admin-recipient runtime kinds (image pull failed, container start failed, start config invalid). Per-kind delivery channels (push, email, or both) and the admin-vs-per-user recipient routing live in the same file.
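
The shape of such a catalog entry, as a sketch only: the kind identifiers and the channel assignments below are invented placeholders, and the real closed set lives in backend/internal/notification/catalog.go.

```go
package notification

type recipient int

const (
	recipientUser  recipient = iota // resolved per affected user
	recipientAdmin                  // routed to platform administrators
)

type policy struct {
	Push      bool
	Email     bool
	Recipient recipient
}

// catalog is the closed kind -> policy table; entries below are illustrative.
var catalog = map[string]policy{
	// lobby kinds (10 in total), delivered to the affected user
	"lobby.invite.received":      {Push: true, Email: true, Recipient: recipientUser},
	"lobby.application.approved": {Push: true, Email: false, Recipient: recipientUser},
	// runtime kinds (3 in total), delivered to administrators
	"runtime.image.pull_failed": {Push: false, Email: true, Recipient: recipientAdmin},
}
```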

For every intent, notification.Submit performs:

  1. Idempotency check (UNIQUE on (intent_kind, idempotency_key)).
  2. Recipient resolution against user.
  3. Per-recipient route materialisation in notification_routes — push, email, or both — based on the type-specific policy table.
  4. Push routes are emitted onto the gRPC client_event channel for the recipient. The dispatcher passes the producer's payload map through notification.buildClientPushEvent(kind, payload), which maps the kind to the matching FlatBuffers schema in pkg/schema/fbs/notification.fbs (one table per catalog kind, 1:1 with the camel-case form of the kind plus the Event suffix) and returns a typed push.Event. push.Service invokes Marshal and places the bytes into pushv1.ClientEvent.Payload. An unknown kind falls back to push.JSONEvent so a misconfigured producer does not silently drop frames; new kinds must ship with a typed FB schema and a matching buildClientPushEvent case rather than relying on the fallback.
  5. Email routes are inserted into mail_deliveries with the matching template id.
  6. Malformed intents go to notification_malformed_intents and never block the producer.

Notification persistence is the auditable record of "we tried to tell this user about this thing"; clients still derive their actual game state through normal user-facing reads.

13. Container Lifecycle (in-process)

backend/internal/runtime owns the lifecycle of game-engine containers and is the only component permitted to issue Docker calls.

  • All Docker calls go through dockerclient, which is a thin wrapper over github.com/docker/docker configured against BACKEND_DOCKER_HOST.
  • Per-game container operations are serialised through a per-game mutex (held in memory) so that concurrent start/stop/patch attempts cannot race. runtime_operation_log records every operation for audit.
  • Long-running pulls and starts execute on worker goroutines; the calling path returns as soon as the operation is queued, then receives completion through a callback or a follow-up status read.
  • The turn scheduler uses pkg/cronutil (a wrapper over robfig/cron/v3) and schedules a tick per running game according to games.turn_schedule. Force-next-turn sets a skip-flag that advances the next scheduled tick by one cron step.
  • Snapshots are read from the engine on a schedule, after every successful command, and on health probe transitions; each read publishes a runtime_snapshot_update to lobby in process.

Containers managed by backend carry the Docker label galaxy.backend=1. Reconciliation matches that label against runtime_records so a redeploy of backend re-attaches to running games rather than orphaning them.
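
A sketch of that reconciliation pass against an assumed dockerclient wrapper interface; the method names, the store interface, and the emitted event name are illustrative.

```go
package runtime

import "context"

type labelledContainer struct {
	GameID      string // derived from the galaxy-game-{game_id} name
	ContainerID string
}

type dockerClient interface {
	// ListByLabel returns every container carrying the given label.
	ListByLabel(ctx context.Context, label string) ([]labelledContainer, error)
}

type store interface {
	ActiveRuntimeRecords(ctx context.Context) (map[string]string, error) // game_id -> container_id
	AdoptContainer(ctx context.Context, gameID, containerID string) error
	MarkRemoved(ctx context.Context, gameID string) error
}

func reconcile(ctx context.Context, docker dockerClient, db store, emit func(event, gameID string)) error {
	running, err := docker.ListByLabel(ctx, "galaxy.backend=1")
	if err != nil {
		return err
	}
	recorded, err := db.ActiveRuntimeRecords(ctx)
	if err != nil {
		return err
	}

	seen := make(map[string]bool, len(running))
	for _, c := range running {
		seen[c.GameID] = true
		if _, ok := recorded[c.GameID]; !ok {
			// Labelled but unrecorded: adopt it so a backend redeploy re-attaches
			// to running games instead of orphaning them.
			if err := db.AdoptContainer(ctx, c.GameID, c.ContainerID); err != nil {
				return err
			}
		}
	}
	for gameID := range recorded {
		if !seen[gameID] {
			// Recorded but no longer running: mark removed and tell observers.
			if err := db.MarkRemoved(ctx, gameID); err != nil {
				return err
			}
			emit("runtime.container.missing", gameID)
		}
	}
	return nil
}
```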

Future improvement (not in MVP): introduce a docker-socket-proxy sidecar (for example tecnativa/docker-socket-proxy) and connect dockerclient through it over TCP. Until then backend mounts /var/run/docker.sock directly.

14. Admin Surface

  • Admin authentication is HTTP Basic Auth.
  • Credentials live in the Postgres table admin_accounts with username, password_hash (bcrypt cost 12), created_at, last_used_at, disabled_at.
  • Bootstrap: at startup backend reads BACKEND_ADMIN_BOOTSTRAP_USER and BACKEND_ADMIN_BOOTSTRAP_PASSWORD; if no admin_accounts record with that username exists, it is inserted with the bcrypt hash. The insert is idempotent so restarts are safe.
  • Existing admins can manage other admins through the same /api/v1/admin/admin-accounts endpoints.
  • All other admin endpoints (/api/v1/admin/users/*, /api/v1/admin/games/*, /api/v1/admin/runtimes/*, /api/v1/admin/mail/*, /api/v1/admin/notifications/*) reuse the per-domain logic of the module they target.
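
A sketch of the bootstrap step described above, assuming a UNIQUE constraint on admin_accounts.username carries the idempotency; the handling of missing credentials shown here is a guess.

```go
package admin

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
	"golang.org/x/crypto/bcrypt"
)

func bootstrap(ctx context.Context, pool *pgxpool.Pool) error {
	user := os.Getenv("BACKEND_ADMIN_BOOTSTRAP_USER")
	pass := os.Getenv("BACKEND_ADMIN_BOOTSTRAP_PASSWORD")
	if user == "" || pass == "" {
		return nil // nothing to bootstrap without credentials (behaviour here is a guess)
	}
	hash, err := bcrypt.GenerateFromPassword([]byte(pass), 12) // bcrypt cost 12
	if err != nil {
		return err
	}
	// Idempotent: an existing username makes this a no-op, so restarts are safe.
	_, err = pool.Exec(ctx, `
		INSERT INTO admin_accounts (username, password_hash, created_at)
		VALUES ($1, $2, now())
		ON CONFLICT (username) DO NOTHING`, user, string(hash))
	return err
}
```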

15. Transport Security Model (gateway boundary)

This section describes the secure exchange model between client and gateway. It applies at the public boundary and does not rely on backend behaviour for any of its guarantees.

The authenticated edge listener is built on connectrpc.com/connect and natively serves the Connect, gRPC, and gRPC-Web protocols on a single HTTP/2 cleartext (h2c) port. Browser clients use Connect via @connectrpc/connect-web; native iOS / Android / desktop clients can use either Connect or raw gRPC framing against the same listener. Envelope, signature, freshness, and anti-replay rules below are protocol-agnostic — they apply identically to every supported wire framing.

Principles

  • No browser cookies.
  • Authentication is device-session based.
  • Each device session is unique and independently revocable.
  • No short-lived access tokens or refresh-token flows.
  • Requests are authenticated by client signatures.
  • Responses and push events are authenticated by server signatures.
  • Transport integrity and freshness are verified before any payload is processed.

Device session model

After a successful email-code login:

  1. The client generates an Ed25519 key pair.
  2. The private key remains on the client.
  3. The client public key is registered with backend as the standard base64-encoded raw 32-byte Ed25519 key.
  4. backend creates a persistent device session.
  5. The client persists device_session_id and the private key.

backend stores at least device_session_id, user_id, the base64-encoded raw 32-byte Ed25519 client public key, session status, and revoke metadata.

Key storage

  • Native clients use platform secure storage; private keys never leave the device.
  • Browser/WASM clients use WebCrypto with non-exportable storage where available. Loss of browser storage is acceptable and is recovered by re-login. The concrete browser baseline, IndexedDB schema, and keystore lifecycle live in ui/docs/storage.md.

Request envelope

Each authenticated request carries payload_bytes, a request_envelope, and a signature. The envelope contains:

  • protocol_version (v1)
  • device_session_id
  • message_type
  • timestamp_ms
  • request_id
  • payload_hash (raw 32-byte SHA-256 of payload_bytes)

The client signs canonical bytes built from:

"galaxy-request-v1" || protocol_version || device_session_id ||
message_type || timestamp_ms || request_id || payload_hash

with this binary encoding:

  • each string and bytes field is encoded as uvarint(len(field_bytes)) followed by raw bytes;
  • timestamp_ms is encoded as an 8-byte big-endian unsigned integer;
  • fields are appended in the exact order listed.

The signature scheme is Ed25519. The signature carries the raw 64-byte signature.
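
A client-side sketch of the canonical bytes and signature, following the field order and encoding rules above. The envelope struct is illustrative, and the domain prefix is assumed to be length-prefixed like any other string field.

```go
package envelope

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
)

type RequestEnvelope struct {
	ProtocolVersion string // "v1"
	DeviceSessionID string
	MessageType     string
	TimestampMS     uint64
	RequestID       string
	PayloadHash     [32]byte // raw SHA-256 of payload_bytes
}

// appendField encodes a string or bytes field as uvarint(len) followed by raw bytes.
func appendField(dst, field []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(field)))
	return append(dst, field...)
}

func canonicalRequestBytes(e RequestEnvelope) []byte {
	var buf []byte
	buf = appendField(buf, []byte("galaxy-request-v1"))
	buf = appendField(buf, []byte(e.ProtocolVersion))
	buf = appendField(buf, []byte(e.DeviceSessionID))
	buf = appendField(buf, []byte(e.MessageType))

	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], e.TimestampMS) // 8-byte big-endian, not length-prefixed
	buf = append(buf, ts[:]...)

	buf = appendField(buf, []byte(e.RequestID))
	buf = appendField(buf, e.PayloadHash[:])
	return buf
}

// SignRequest hashes the payload, fills payload_hash, and produces the raw
// 64-byte Ed25519 signature the gateway verifies with the stored public key.
func SignRequest(priv ed25519.PrivateKey, e RequestEnvelope, payload []byte) ([]byte, RequestEnvelope) {
	e.PayloadHash = sha256.Sum256(payload)
	return ed25519.Sign(priv, canonicalRequestBytes(e)), e
}
```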

Response envelope

Each server response carries payload_bytes, a response_envelope, and a signature. The envelope contains:

  • protocol_version
  • request_id
  • timestamp_ms
  • result_code
  • payload_hash

Canonical bytes:

"galaxy-response-v1" || protocol_version || request_id ||
timestamp_ms || result_code || payload_hash

The gateway signs with a PKCS#8 PEM-encoded Ed25519 private key. Clients verify with a trusted server public key.

Push events

Each server push event carries payload_bytes, an event_envelope, and a signature. Required envelope fields: event_type, event_id, timestamp_ms, payload_hash. Optional: request_id, trace_id.

Canonical bytes:

"galaxy-event-v1" || event_type || event_id || timestamp_ms ||
request_id || trace_id || payload_hash

Gateway signs each event at delivery time using the same Ed25519 key as for responses. The bootstrap event delivered when a SubscribeEvents stream opens is event_type = gateway.server_time, reusing the opening request_id as event_id and carrying server_time_ms so clients can calibrate offset without a separate time request.

Verification order at gateway

Before any payload is forwarded to backend, gateway must:

  1. Verify the transport envelope is present and supported.
  2. Resolve device_session_id (against backend, sync REST).
  3. Reject unknown or revoked sessions.
  4. Verify the client signature using the stored public key.
  5. Verify payload_hash.
  6. Verify timestamp freshness (symmetric ±5 minutes around server time).
  7. Verify anti-replay: reserve (device_session_id, request_id) until timestamp_ms + freshness_window.
  8. Apply edge rate limits and basic policy.
  9. Forward to backend with X-User-ID set.

Verification order at client

Before accepting a response payload, the client must verify the response signature, that request_id matches the corresponding request, the payload_hash, and where applicable the timestamp freshness.

Before accepting a push payload, the client must verify the event signature, the payload_hash, the request_id when correlated, and where applicable the timestamp freshness.

Anti-replay

Anti-replay uses (timestamp_ms, request_id). Recently seen request_id values are tracked per session in Redis until timestamp_ms + freshness_window. This protects transport freshness only; business idempotency is a separate concern enforced by backend domain tables.
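
A gateway-side sketch of that reservation with go-redis; the key layout and error values are illustrative, while the reserve-until-timestamp_ms-plus-window rule is the documented one.

```go
package antireplay

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

var ErrReplay = errors.New("request_id already seen inside the freshness window")

// Reserve returns ErrReplay when (device_session_id, request_id) was already
// reserved. The TTL runs until timestamp_ms + freshnessWindow, so the entry
// expires exactly when the timestamp itself falls out of the window.
func Reserve(ctx context.Context, rdb *redis.Client, deviceSessionID, requestID string,
	timestampMS int64, freshnessWindow time.Duration) error {

	key := fmt.Sprintf("antireplay:%s:%s", deviceSessionID, requestID)
	expiresAt := time.UnixMilli(timestampMS).Add(freshnessWindow)
	ttl := time.Until(expiresAt)
	if ttl <= 0 {
		return errors.New("timestamp outside the freshness window") // stale before reservation
	}
	ok, err := rdb.SetNX(ctx, key, 1, ttl).Result()
	if err != nil {
		return err
	}
	if !ok {
		return ErrReplay
	}
	return nil
}
```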

TLS and MITM

Native clients should use TLS pinning (SPKI-based) in addition to the signed exchange. Browser clients rely on browser-managed TLS and the signed exchange.

Threat model boundaries

The transport model protects against tampering in transit, replay inside the freshness window, use of unknown or revoked sessions, forged server responses without the gateway signing key, and forged client requests without the client signing key. It does not prevent a legitimate user from generating their own valid requests; that is handled by backend business validation and authorisation.

16. Security Boundaries Summary

| Concern | Enforced by | Notes |
| --- | --- | --- |
| Public TLS termination, pinning | gateway | Native clients pin SPKI. |
| Request signature, payload hash, freshness, anti-replay | gateway | See §15. |
| Session lookup | backend (sync REST) + gateway in-memory LRU | gateway-side LRU with TTL safety net (§6) hits backend's /api/v1/internal/sessions/{id} only on miss; no Redis projection. |
| Session revocation propagation | backend → gateway | session_invalidation over the gRPC push stream flips the gateway-side cache entry to revoked and closes any active push stream. |
| Authorisation, ownership, state transitions | backend | X-User-ID is the sole identity input on the user surface. |
| Edge rate limiting | gateway | Backend has no rate-limit responsibility in MVP. |
| Admin authentication | backend | Basic Auth against admin_accounts. |
| Engine API authentication | network | Engine listens only on the trusted network; backend is the only caller. |

Backend ↔ Gateway trust

The MVP does not require an additional authenticator between gateway and backend. Backend trusts X-User-ID from gateway and accepts gateway gRPC subscribers without authentication. The trust boundary is the network: deployment must ensure that only gateway can reach backend's HTTP and gRPC listeners.

This is an explicit, accepted risk. Compromise of the trusted network between gateway and backend would let any party impersonate any user or admin against backend. The risk is mitigated only by network isolation of the deploy. Adding mutual authentication (a pre-shared bearer token or mTLS between gateway and backend) is a future hardening step; backend is structured so that adding such a check is a single middleware addition.

17. Observability

  • Tracing and metrics flow through OpenTelemetry. The default exporter is OTLP (gRPC or HTTP/protobuf, configurable). Metrics may also be exposed via a Prometheus pull endpoint when configured.
  • Logging uses go.uber.org/zap in JSON mode. Trace and span ids are injected into every log entry written inside a request scope.
  • Every backend module emits the metrics relevant to its concern: HTTP request count and duration per route group, gRPC subscription count and push event throughput, mail outbox depth and per-attempt outcomes, notification fan-out counts, container operation counts and durations, Postgres pool stats, geo lookup count and error rate.
  • Health probes are unauthenticated GET /healthz (process liveness) and GET /readyz (Postgres reachable, migrations applied, caches warmed, gRPC listener bound). Probes are excluded from anti-replay and rate limiting.

18. Deployment Topology (informational)

  • MVP runs three executables: one gateway instance, one backend instance, and N galaxy-game-{game_id} containers managed by backend.
  • One Postgres database is shared by backend only.
  • One Redis instance is reachable from gateway only (anti-replay).
  • One SMTP relay is reachable from backend.
  • The Docker daemon socket is mounted into backend.
  • The GeoLite2 country database file is mounted at the path given by BACKEND_GEOIP_DB_PATH.

Future scale-out hooks (not in MVP):

  • Distributed backend requires reintroducing Redis for shared session cache and runtime job leasing, plus leader election for the turn scheduler.
  • mTLS between gateway and backend.
  • Docker-socket-proxy sidecar fronting Docker daemon access.

19. Glossary

  • device_session_id — opaque identifier of an authenticated client device; primary key of the device session record.
  • race_name — in-game player display name. Three tiers in the Race Name Directory: registered (platform-unique), reservation (per-game), pending_registration (post-capable-finish).
  • canonical key — lowercased and confusable-folded form of a race name used for uniqueness checks, computed via disciplinedware/go-confusables.
  • capable finish — a finished game in which the player reached max_planets > initial AND max_population > initial. Only capable finishes promote a reservation to pending_registration.
  • runtime snapshot — engine-status read materialised into the lobby's denormalised view: current_turn, runtime_status, engine_health_summary, player_turn_stats.
  • turn cutoff — the running → generation_in_progress runtime-status flip performed by backend/internal/runtime/scheduler.go before each engine /admin/turn call. Commands and orders arriving while the flag is set are rejected by the user-games handlers with HTTP 409 turn_already_closed. The matching reopening flip (generation_in_progress → running) happens on a successful tick; a failing tick instead drives the lobby to paused and fans out game.paused (FUNCTIONAL.md §6.3, §6.5).
  • auto-pause — the lobby reaction to a failed runtime snapshot (engine_unreachable / generation_failed): the game flips running → paused, the order handlers refuse new submits with HTTP 409 game_paused, and lobby.publishGamePaused fans out the push event. Only an admin /resume followed by a successful tick recovers the game; the UI relies on the next game.turn.ready to clear the paused banner.
  • outbox — the durable queue of pending mail rows in mail_deliveries, drained by the mail worker.
  • freshness window — the symmetric ±5-minute interval around server time inside which a request timestamp_ms is accepted.
  • trust boundary — the network segment between gateway and backend. Compromise of this segment defeats backend authentication; deployment must isolate it.